How to Do Data Discovery
Data discovery is a departure from traditional business intelligence in that it emphasizes interactive, visual analytics rather than static reporting.
The goal of data discovery is to enable people to work with data and use their intuition to find meaningful and important information in it.
This process usually consists of asking questions about the data, seeing the results visually, and refining the questions. Contrast this with the traditional approach, in which information consumers ask questions, reports are developed and fed back to the consumers, and those reports generate more questions, which in turn generate more reports.
Data Discovery Approaches
Progressive companies consider data to be a strategic asset and understand its importance to drive innovation, differentiation, and growth.
But leveraging data and transforming it into real business value requires a holistic approach to business intelligence and analytics.
This is dramatically different from the business intelligence (BI) platforms of years past. It means going beyond the scope of most data visualization tools.
The continuing evolution of data discovery in the enterprise and the cloud is being driven by these trends:
- Big data: On big data projects, data discovery is both more important and more challenging. Not only is the volume of data that must be efficiently processed for discovery larger, but the diversity of sources and formats presents challenges that cause many traditional methods of data discovery to fail. When big data initiatives also involve rapid profiling of high-velocity data, profiling becomes harder and less feasible with existing toolsets.
- Real-time analytics: The ongoing shift toward (nearly) real-time analytics has created a new class of use cases for data discovery. These use cases are valuable but require data discovery tools that are faster, more automated, and more adaptive.
- Agile analytics and agile business intelligence: Data scientists and business intelligence teams are adopting more agile, iterative methods of turning data into business value. They perform data discovery processes more often and in more diverse ways, for example, when profiling new data sets for integration, seeking answers to new questions emerging this week based on last week's analysis, or finding alerts about emerging trends that may warrant new analysis workstreams.
Different Data Discovery Techniques
Data discovery tools differ by technique and data matching abilities.
Assume you want to find credit card numbers. Data discovery tools for databases use several methods to find and then identify information.
Most use special login credentials to scan internal database structures, itemize tables and columns, and then analyze what they find. Three basic analysis methods are employed:
Metadata
This is data that describes data. All relational databases store metadata that describes tables and column attributes. In the credit card example, you would examine column attributes to determine whether the name, size, or data type of a column resembles a credit card number.
If a column holds 16-digit numbers, or its name is something like CreditCard or CC#, then there's a high likelihood of a match.
Of course, the effectiveness of each product will vary depending on how well the analysis rules are implemented.
This remains the most common analysis technique.
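The metadata technique above can be sketched in a few lines. This is a minimal, hypothetical rule set (the column names, regular expression, and thresholds are illustrative assumptions, not any particular product's logic), applied to column descriptors of the kind you could read from a database's metadata catalog:

```python
import re

# Hypothetical name-based hints for credit card columns (illustrative only).
NAME_HINTS = re.compile(r"(credit.?card|card.?num|cc[_#]?num|\bcc\b)", re.IGNORECASE)

def looks_like_card_column(name, data_type, length):
    """Return True if the column metadata suggests a credit card field."""
    if NAME_HINTS.search(name):
        return True
    # A 16-character numeric or char column is a weaker, size-based hint.
    return data_type in ("CHAR", "VARCHAR", "NUMERIC") and length == 16

# Column metadata as you might read it from a metadata catalog
# such as information_schema.columns (sample rows are made up).
columns = [
    ("customer_name", "VARCHAR", 120),
    ("cc_number",     "CHAR",     16),
    ("order_total",   "NUMERIC",  10),
]

flagged = [name for name, dtype, length in columns
           if looks_like_card_column(name, dtype, length)]
print(flagged)  # → ['cc_number']
```

As the text notes, the effectiveness of a real product hinges on how well such rules are implemented; name hints alone miss renamed columns, and size hints alone produce false positives.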
Labels
Here, data elements are grouped with a tag that describes the data. Tags can be applied when the data is created, or added over time to provide additional information and references that describe the data.
In many ways, labels are just like metadata, only slightly less formal. Some relational database platforms provide mechanisms to create data labels, but this method is more commonly used with flat files, and it is becoming increasingly useful as more firms move to Indexed Sequential Access Method (ISAM) or quasi-relational data storage, such as Amazon's SimpleDB, to handle fast-growing data sets.
This form of discovery is similar to a Google search: the greater the number of matching labels, the greater the likelihood of a match. Effectiveness depends on the consistent use of labels.
ISAM is a file management system developed at IBM that allows records to be accessed either sequentially (in the order they were entered) or randomly (with an index).
Each index defines a different ordering of the records.
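The search-like label matching described above can be sketched as a simple overlap score. This is an illustrative toy (the file names and tag vocabulary are invented), not a real discovery tool's ranking algorithm:

```python
# Labels we are searching for (hypothetical tag vocabulary).
search_labels = {"payment", "credit-card", "pii"}

# Tagged data stores; in practice tags would come from the files
# or storage platform itself. These sample records are made up.
records = {
    "orders.dat":    {"payment", "credit-card", "pii"},
    "inventory.dat": {"warehouse", "stock"},
    "refunds.dat":   {"payment", "pii"},
}

# Rank records by label overlap: more matching labels means a
# greater likelihood that the record holds the data we want.
ranked = sorted(records, key=lambda r: len(records[r] & search_labels),
                reverse=True)
print(ranked)  # → ['orders.dat', 'refunds.dat', 'inventory.dat']
```

Note how an untagged or sparsely tagged store (`inventory.dat`) sinks to the bottom regardless of its contents, which is exactly why effectiveness depends on the use of labels.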
Content analysis
In this form of analysis, the data itself is examined by employing pattern matching, hashing, statistical, lexical, or other forms of probability analysis.
In the credit card example, when you find a number that resembles a credit card number, a common method is to perform a Luhn check on it. The Luhn formula, also known as the modulus 10 or mod 10 algorithm, is a simple numeric checksum that credit card companies use to generate card numbers and verify whether a number is valid. If the number you discover passes the Luhn check, the probability is high that you have discovered a credit card number.
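A minimal implementation of the Luhn check looks like this. The function name is arbitrary, but the algorithm is the standard mod 10 procedure: working from the rightmost digit, double every second digit, subtract 9 from any doubled result above 9, and require the total to be divisible by 10:

```python
def luhn_check(number: str) -> bool:
    """Validate a numeric string with the Luhn (mod 10) checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    if not digits or len(digits) != len(number):
        return False  # reject empty input and non-digit characters
    total = 0
    # Walk right to left; double every second digit and subtract 9
    # when doubling produces a two-digit result.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_check("4111111111111111"))  # Luhn-valid test number → True
print(luhn_check("4111111111111112"))  # last digit changed → False
```

In a discovery tool this check would run only on candidates that already match a card-number pattern (for example, a 16-digit string), since the Luhn check filters out random digit runs but says nothing about where to look.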
Content analysis is a growing trend and one that's being used successfully in data loss prevention (DLP) and web content analysis products.
Data Discovery Issues
You need to be aware of the following issues relating to data discovery:
Poor data quality
Data visualization tools are only as good as the information fed into them. If organizations lack an enterprise-wide data governance policy, they might be relying on inaccurate or incomplete information to create their charts and dashboards.
Having an enterprise-wide data governance policy helps to mitigate the risk of a data breach.
This includes defining rules and processes related to dashboard creation, ownership, distribution, and usage; creating restrictions on who can access what data; and ensuring that employees follow their organizations’ data usage policies.
Dashboards
With every dashboard, you have to wonder: Is the data accurate? Is the analytical method correct? Most importantly, can critical business decisions be based on this information? Users modify data and change fields with no audit trail and no way to tell who changed what.
This disconnect can lead to inconsistent insight and flawed decisions, drive up administration costs, and inevitably create multiple versions of the truth.
Security
Security also poses a problem with data discovery tools. Information technology (IT) staff typically have little or no control over these types of solutions, which means they cannot protect sensitive information.
This can result in unencrypted data being cached locally and viewed by or shared with unauthorized users.
Hidden costs
A common data discovery technique is to put all the data into server RAM to take advantage of the inherent input/output rate improvements over disk.
This has been successful and spawned a trend of using in-memory analytics for increased BI performance.
Here’s the catch, though: in-memory analytic solutions can struggle to maintain performance as the size of the data goes beyond the fixed amount of server RAM.
For in-memory solutions, companies need to hire someone with the right technical skills and background or purchase prebuilt appliances; both are unforeseen added costs.
An integrated approach as part of an existing business intelligence platform delivers a self-managing environment that is a more cost-effective option.
This is of interest especially for companies that are experiencing lagging query responses due to large data volumes or a high volume of ad hoc queries.
Challenges with Data Discovery in the Cloud
The challenges with data discovery in the cloud are threefold. They include identifying where your data is, accessing the data, and performing preservation and maintenance.
Identifying where your data is
The ability to have data available on-demand, across almost any platform and access mechanism, is an incredible advancement in end-user productivity and collaboration.
However, at the same time, the security implications of this level of access confound both the enterprise and the cloud service provider (CSP), challenging both to find ways to secure the data that users are accessing in real time, from multiple locations, across multiple platforms.
Not knowing with assurance where data is, where it is going, and where it will be at any given moment presents significant security concerns for enterprise data and for the availability, integrity, and confidentiality (AIC) that the CSP is required to provide.
Accessing the data
Not all data stored in the cloud can be accessed easily. Sometimes customers do not have the necessary administrative rights to access their data on demand, or long-term data can be visible to the customer but not accessible to download in acceptable formats for use offline.
The lack of data access might require special configurations for the data discovery process, which in turn might result in additional time and expense for the organization.
Data access requirements and capabilities can also change during the data lifecycle.
Archiving, disaster recovery (DR), and backup sets tend to offer less control and flexibility to the end user. In addition, metadata such as indexes and labels might not be accessible.
When planning data discovery architectures, you should make sure you will have access to the data in a usable way and that metadata is accessible and in place.
The required conditions for access to the data should be documented in the CSP SLA.
The following issues need to be agreed upon ahead of time:
- Limits on the volume of data that will be accessible
- The ability to collect and examine large amounts of data
- Whether any related metadata will be preserved
Other areas to examine and agree about ahead of time include storage costs, networking capabilities and bandwidth limitations, scalability during peak periods of usage, and any additional administrative issues for which the CSP would need to bear responsibility versus the customer.
Performing preservation and maintenance
Who must preserve the data? It is up to you to make sure preservation requirements are documented for, and supported by, the CSP as part of the SLA.
If the time required for preservation exceeds what has been documented in the provider SLA, the data may be lost. Long-term preservation of data is possible and can be managed via an SLA with a provider.
However, the issues of data granularity, access, and visibility need to be considered when planning for data discovery against long-term stored data sets.