Data quality issues in big data
To understand data quality issues in Big Data, you need to look at its main features.
Data quality issues are as old as databases themselves, but Big Data has added a new dimension to the area: with many more Big Data applications coming online, bad data quality can be far more extensive and catastrophic in its effects.
If Big Data is to be useful, organisations need to ensure the information they collect meets a high standard. To understand the problems, we need to look at them in terms of the defining characteristics of Big Data itself.
Velocity
The speed at which data is generated can make it difficult to gauge data quality within a finite amount of time and resources; by the time a quality assessment concludes, its output may already be obsolete and useless.
One way to cope is sampling, but sampling introduces bias, as a sample rarely gives a truthful picture of the entire dataset.
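One standard way to sample a high-velocity stream without storing all of it is reservoir sampling, which keeps a fixed-size uniform sample however long the stream runs. A minimal sketch, where the record structure and the `customer_id` completeness check are purely illustrative assumptions:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing item with probability k / (i + 1)
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

def estimated_error_rate(sample, field="customer_id"):
    """Hypothetical quality check: share of sampled records missing a field."""
    bad = sum(1 for rec in sample if not rec.get(field))
    return bad / len(sample)

# Simulated stream in which roughly 10% of records lack a customer_id
stream = ({"customer_id": i if i % 10 else None} for i in range(100_000))
sample = reservoir_sample(stream, 1_000)
print(f"Estimated error rate: {estimated_error_rate(sample):.2%}")
```

The sample's error rate only approximates the stream's true rate, which is exactly the bias-versus-speed trade-off described above.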
Variety
Big Data comes in all shapes and sizes, and this affects data quality. A single metric rarely suits all the data collected: evaluating and improving the quality of unstructured data is vastly more complex than for structured data, so multiple metrics are needed.
Data from different sources can also carry different semantics, and this can undermine quality assessments. Fields with identical names, but from different parts of the business, may have different meanings.
To make sense of such data, reliable metadata is needed (e.g. sales data should come with timestamps, items bought and so on). That metadata can be hard to obtain when data comes from external sources.
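The same-name problem can be sketched in a few lines. All field names and the metadata catalogue below are hypothetical, but they show how recorded semantics let identically named fields be renamed to unambiguous canonical names:

```python
# Two departments use "date" to mean different things
sales_record = {"date": "2022-03-01", "amount": 120.0}     # order date
support_record = {"date": "2022-03-04", "ticket": "T-17"}  # ticket-closed date

# Hypothetical metadata catalogue recording each source's semantics
metadata = {
    "sales":   {"date": "order_date"},
    "support": {"date": "ticket_closed_date"},
}

def normalise(record, source):
    """Rename fields to unambiguous canonical names using the catalogue."""
    mapping = metadata[source]
    return {mapping.get(key, key): value for key, value in record.items()}

print(normalise(sales_record, "sales"))
# {'order_date': '2022-03-01', 'amount': 120.0}
```

Without the catalogue, joining the two sources on "date" would silently mix order dates with ticket-closure dates.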
Volume
The massive size and scale of Big Data projects makes it nigh-on impossible to undertake a wide-ranging data quality assessment. At best, data quality measurements are imprecise: they are probabilities rather than absolute values.
Data quality metrics therefore have to be redefined around the particular attributes of each Big Data project, so that they have a clear meaning, can be measured, and can be used to evaluate alternative strategies for improving data quality.
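The point that quality measurements at scale are probabilities rather than absolutes can be made concrete: inspect a random sample and report a confidence interval instead of a single figure. A sketch using a normal approximation (the sample counts are invented for illustration):

```python
import math

def error_rate_interval(bad, n, z=1.96):
    """Approximate 95% confidence interval for the true error rate,
    given `bad` failing records out of `n` sampled (normal approximation)."""
    p = bad / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

low, high = error_rate_interval(bad=37, n=500)
print(f"Estimated error rate: between {low:.1%} and {high:.1%}")
```

Reporting the interval rather than the point estimate makes the imprecision explicit, which is exactly what comparing improvement strategies requires.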
Value
The value of data lies in how useful it is for its end purpose. Organisations use Big Data for many business goals, and those goals drive how data quality is expressed, calculated and enhanced.
Data quality is relative: it depends on what the business plans to do with the data. Incomplete or inconsistent data may not affect how useful the data is for achieving a particular goal, in which case the quality may be good enough to leave as is.
This also has a bearing on the cost versus benefit of improving data quality: is it worth doing, and which issues should take priority?
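One way to operationalise "good enough for the goal" is to set per-goal completeness thresholds, so the same dataset can pass for one purpose and fail for another. A minimal sketch with hypothetical goals, fields and thresholds:

```python
# Hypothetical per-goal requirements: field -> minimum completeness
requirements = {
    "revenue_report":    {"amount": 0.99, "date": 0.99},
    "customer_outreach": {"email": 0.95},
}

def completeness(records, field):
    """Fraction of records with a non-empty value for `field`."""
    return sum(1 for r in records if r.get(field)) / len(records)

def fit_for_purpose(records, goal):
    """True if every field required by `goal` meets its threshold."""
    return all(completeness(records, field) >= threshold
               for field, threshold in requirements[goal].items())

records = [{"amount": 10, "date": "2022-01-01"},
           {"amount": 5, "date": "2022-01-02", "email": "a@b.com"}]
print(fit_for_purpose(records, "revenue_report"))     # True
print(fit_for_purpose(records, "customer_outreach"))  # False
```

Here the same two records are fine for a revenue report but unusable for an email campaign, so cleaning the `email` field is only worth paying for if outreach is actually a priority.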
Veracity
Veracity is directly tied to data quality: it covers the imprecision of data along with its biases, consistency, trustworthiness and noise, all of which affect data accountability and integrity.
Variability
Across organisations, and even across parts of the same business, data users have diverse objectives and working processes, which leads to differing ideas about what constitutes data quality.