
Data quality issues in big data

To understand data quality issues in Big Data, you need to look at its main features


Data quality issues are as old as databases themselves. But Big Data has added a new dimension to the area: with many more Big Data applications coming online, the consequences of poor data quality can be far more extensive and catastrophic.

If Big Data is to be used, organisations need to make sure that the data they collect meets a high standard. To understand the problems, we need to look at them in terms of the defining characteristics of Big Data itself.

Velocity

The speed at which data is generated can make it difficult to gauge data quality with finite time and resources. By the time a quality assessment is concluded, its output could be obsolete and useless.

One way to overcome this is through sampling, but this risks introducing bias, as a sample rarely gives a fully accurate picture of the entire dataset.
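As a rough illustration of sampling a fast-moving feed, the sketch below uses reservoir sampling, one common technique for keeping a fixed-size uniform sample from a stream of unknown length. The function names and the `is_valid` check are hypothetical and stand in for whatever quality rule an organisation actually applies.

```python
import random
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Keep a uniform random sample of up to k items from a stream."""
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i survives with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


def estimated_error_rate(sample: List[T], is_valid: Callable[[T], bool]) -> float:
    """Share of sampled records failing the (hypothetical) validity check."""
    if not sample:
        return 0.0
    return sum(1 for record in sample if not is_valid(record)) / len(sample)
```

The result is only an estimate: a uniform sample avoids selection bias in the sampling step itself, but a small sample can still miss rare problems.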

Variety

Data comes in all shapes and sizes in Big Data and this affects data quality. One data metric may not suit all the data collected. Multiple metrics are needed as evaluating and improving data quality of unstructured data is vastly more complex than for structured data.

Data from different sources can have different semantics, which complicates combining and comparing it. Fields with identical names, but from different parts of the business, may have different meanings.

To make sense of this data, reliable metadata is needed (e.g. sales data should come with time stamps, items bought, etc.). Such metadata can be hard to obtain if data is from external sources.
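A minimal sketch of such a metadata check might look like the following. The field names `timestamp` and `items` are assumptions chosen to mirror the sales example above, not a standard schema.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical required metadata for a sales record.
REQUIRED_SALES_FIELDS = ("timestamp", "items")


def missing_metadata(
    record: Dict[str, Any],
    required: Iterable[str] = REQUIRED_SALES_FIELDS,
) -> List[str]:
    """Return the required fields that are absent or empty in the record."""
    return [
        field
        for field in required
        if record.get(field) in (None, "", [], {})
    ]
```

Records arriving from external sources could be passed through a check like this on ingestion, with incomplete ones flagged rather than silently accepted.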

Volume 

The massive size and scale of Big Data projects make it nigh-on impossible to undertake a wide-ranging data quality assessment. At best, data quality measurements are imprecise: they are probabilistic estimates rather than absolute values.
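To show what a probabilistic measurement can look like in practice, the sketch below reports a defect rate estimated from a sample as a range rather than a single number. It assumes a simple random sample and uses the normal approximation for a proportion, which is one of several possible interval methods; the function name is hypothetical.

```python
import math
from typing import Tuple


def defect_rate_interval(defects: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Approximate confidence interval for a defect rate.

    defects: number of bad records found in the sample
    n:       sample size
    z:       critical value (1.96 corresponds to roughly 95% confidence)
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = defects / n
    margin = z * math.sqrt(p * (1 - p) / n)
    # Clamp to the valid range for a proportion.
    return max(0.0, p - margin), min(1.0, p + margin)
```

For example, 50 bad records in a sample of 1,000 gives an interval of roughly 3.6% to 6.4%, not a flat 5%.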

Data quality metrics have to be redefined based on particular attributes of the Big Data project, in order that metrics have a clear meaning, which can be measured and used for evaluating the alternative strategies for data quality improvement.

Value

The value of data is all about how useful it is in its end purpose. Organisations use Big Data for many business goals and these drive how data quality is expressed, calculated and enhanced.

Data quality is dependent on what your business plans to do with the data; it's all relative. Incomplete or inconsistent data may not affect how useful the data is in achieving a business goal. The data quality may be good enough that improving it is not worthwhile.

This also has a bearing on the cost versus benefit of improving data quality: is it worth doing, and which issues need to take priority?

Veracity 

Veracity is directly tied to quality issues in data. It relates to the imprecision of data along with its biases, consistency, trustworthiness and noise. All of these affect data accountability and integrity.

In different organisations and even different parts of the business, data users have diverse objectives and working processes. This leads to different ideas about what constitutes data quality. 
