Improve Data Quality on your Big Data Projects

26 April 2017 / InfoReady

In recent years, we have seen a number of examples of poor data quality causing embarrassing situations for businesses. Among them are Bank of America sending a customer mail with a very rude address line, and Pinterest congratulating single women on their upcoming marriages.

Facebook has also faced criticism for using profiling to automatically show advertisements to a broad cross-section of users, without human interaction or review. Facebook’s system, which builds user profiles to display relevant advertisements based on a range of profile, browsing and mined information, has at times resulted in the presentation of inappropriate content to young users, including dating and weapons products.

So why have these things happened?

In these situations, the problem lies in the quality of the input data, ranging from inaccurate entry of a name, through to an over-reliance on certain (potentially imprecise) indicators to make a decision about the customer.

Enter the fourth ‘V’ of big data: veracity (or quality), which joins velocity, variety and volume as key elements of successful big data projects.  By veracity, we mean the ‘fitness for purpose’ of the data, which can include measures of completeness, consistency, accuracy, and timeliness – all benchmarked against what the data will be used for.
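As a rough illustration of how two of these measures might be computed, here is a minimal Python sketch. The record layout, the field names (name, address, updated) and the one-year freshness window are hypothetical assumptions chosen for the example, not a prescribed standard; in practice the thresholds would be set by what the data will be used for.

```python
from datetime import datetime, timedelta

def completeness(records, required_fields):
    """Share of records in which every required field is present and non-empty."""
    if not records:
        return 0.0
    complete = sum(
        all(rec.get(field) not in (None, "") for field in required_fields)
        for rec in records
    )
    return complete / len(records)

def timeliness(records, timestamp_field, now, max_age):
    """Share of records whose timestamp is no older than max_age."""
    if not records:
        return 0.0
    fresh = sum(
        rec.get(timestamp_field) is not None
        and now - rec[timestamp_field] <= max_age
        for rec in records
    )
    return fresh / len(records)

# Hypothetical customer records, invented for illustration only.
customers = [
    {"name": "A. Smith", "address": "1 High St", "updated": datetime(2017, 4, 1)},
    {"name": "B. Jones", "address": "", "updated": datetime(2017, 4, 20)},
    {"name": "C. Lee", "address": "9 Low Rd", "updated": datetime(2015, 1, 5)},
    {"name": None, "address": "3 Mid Ave", "updated": datetime(2017, 3, 15)},
]

now = datetime(2017, 4, 26)
completeness_score = completeness(customers, ["name", "address"])
timeliness_score = timeliness(customers, "updated", now, timedelta(days=365))
print(completeness_score, timeliness_score)
```

A mailing campaign might demand a high completeness score on address fields, while an exploratory study could tolerate far less; the same scores mean different things for different purposes.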

This brings us to a significant question: how do we manage data quality in this complex, fast-moving, and, at times, intimidating world of big data?

Why is Big Data different?

Traditional approaches to data quality may not work in Big Data scenarios. Traditionally, we would:

  1. Define ‘fit for purpose’
  2. Profile all the data
  3. Develop rules to match, link, and standardise the data
  4. Press ‘go’ and iterate the process to refine the quality to an acceptable standard

These initiatives unfortunately don’t fit the big data world as:

  • Traditional tools are built for well-defined, structured data, not varied, unstructured data;
  • Data quality projects are often tactical solutions affecting a small portion of organisational data;
  • Big data approaches are often focussed on getting data into a data lake ASAP, whereas traditional approaches require time and structure, which may slow down rapidly growing data lakes; and
  • Profiling and standardising all data would be like boiling the ocean!


So what resources should we put into the problem?

As with traditional quality techniques, the investment should match the associated impact.  Unfortunately, the pursuit of perfection delivers diminishing returns on the investment made.

A minimum level of quality will have certain costs associated with failure over the product or project’s lifetime.  As investments are made in ‘prevention’ and the level of quality increases, failures will be less severe and less frequent, resulting in a reduced failure cost.  However, after a certain sweet spot – the ‘optimum quality level’ – the increasing investments outpace the benefits of a better quality product, giving a rising total cost.  As professionals in business, we need to find this point to best serve our stakeholders.
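The trade-off described above can be sketched numerically. The two cost functions below are purely hypothetical shapes (prevention cost rising steeply as quality approaches 100%, failure cost falling as quality improves); real curves would need to be estimated for each application. The sketch simply searches for the quality level with the lowest total cost.

```python
def prevention_cost(q):
    """Hypothetical: investment grows steeply as quality q approaches 1.0."""
    return 10.0 * q / (1.0 - q)

def failure_cost(q):
    """Hypothetical: cost of failures falls linearly as quality improves."""
    return 1000.0 * (1.0 - q)

def total_cost(q):
    """Total cost of quality is prevention plus failure cost."""
    return prevention_cost(q) + failure_cost(q)

# Candidate quality levels from 0% to 99% in 1% steps.
levels = [i / 100 for i in range(100)]

# The 'optimum quality level' is the level with the lowest total cost.
optimum = min(levels, key=total_cost)
print(optimum, total_cost(optimum))
```

With these assumed curves the minimum falls at a quality level of 0.9; both lower and higher levels cost more in total, which is the 'sweet spot' behaviour described above.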

To bring this into our context, we show a number of hypothetical applications on the graph below.  Here, Exploratory Analytics has relatively low risk and correspondingly low need to invest in quality, whereas the Financial Markets & Trading application may have the highest risk, and therefore the highest need to invest.

[Figure: Investment Impact]

Data Quality in Practice

As stated earlier, there are many reasons why a purist approach to data quality won’t work in a big data project. So what should organisations that work on big data projects do?

Our answer is to consider the risk and consequence associated with each application, so we can make smart investment decisions in Data Quality.  Here are a few examples.

Exploratory Analytics

Exploratory analytics usually involves a team of people using data in new ways to try and solve particular challenges in an ad-hoc or exploratory way. Examples include the NSW Data Analytics Centre and Walmart’s Data Cafe. The problems that these teams work on might include market research or policy research that can be used to trigger further investigation or examine policy settings.

Often exploratory analytics will result in further work to prove the conclusions that have been drawn, which provides an opportunity to ‘double-check’ the findings.  Consequently, the impact of data errors may be as simple as a little extra time and cost.

The goal in these applications is to discover new things, and an overbearing quality framework may slow down this process.  Worse still, because the level of data quality may itself tell us something useful, a ‘perfect’ dataset could mask discoveries.  Consequently, we should not constrain imagination with onerous quality requirements – our goal should be to be aware of quality but not create roadblocks.  Finally, given the low consequence, it may well be acceptable to say ‘good enough is good enough’!

Client Engagement

Usually the domain of consumer-facing organisations such as retailers, media, and retail service providers, this type of application uses big data to better understand and engage customers or clients.  Often, the input data will include social media feeds, clickstream data, and purchasing history.  The engagement itself might take the form of recommendation engines, chatbots, and advertising presentation.

Here, the consequences can range from insignificant, where a user has an irrelevant product advert presented to them, through to reputationally damaging, where an interaction may be perceived as insensitive or inappropriate.

Engagement platforms are a new way of delivering ‘mass personalisation’ and improving the customer experience.  However, in this application there are some dangers and we need to understand the potential impacts of a particular application (e.g. Ms McIntire’s letter from Bank of America).  Where appropriate, we can apply strong traditional quality approaches to data that might be later used to take an action (e.g. the client’s address).  To help avoid bad decisions, we can also give source datasets a quality and reliability score that can be built into algorithms.  As a final quality check, we can also link generated insights back to known facts that can be relied upon.  For example, we may run a limited survey to confirm the modelled propensity to purchase a particular product, or compare against historical empirical research.
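One possible way to perform the survey check mentioned above is sketched below in Python. It assumes a simple random survey and uses a normal-approximation 95% confidence interval, which is only reasonable for moderately large samples; the function name and the example figures are hypothetical.

```python
import math

def survey_supports_model(modelled_propensity, purchases, respondents, z=1.96):
    """Return True if the modelled propensity lies inside the survey's
    approximate 95% confidence interval (normal approximation)."""
    if respondents <= 0:
        raise ValueError("survey must have at least one respondent")
    observed = purchases / respondents
    margin = z * math.sqrt(observed * (1.0 - observed) / respondents)
    return observed - margin <= modelled_propensity <= observed + margin

# Hypothetical survey: 50 of 200 respondents say they would buy the product.
consistent = survey_supports_model(0.28, purchases=50, respondents=200)
suspicious = survey_supports_model(0.45, purchases=50, respondents=200)
print(consistent, suspicious)
```

A modelled propensity that falls well outside the surveyed range is a prompt to revisit the input data or the model before acting on the insight.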

Financial Markets and Trading

Big Data analysis is being used more and more frequently in the world of finance and trading, with brokers using analytics to inform investment decisions.  Coupled with the rise of algorithmic trading and corresponding reductions in the amount of human intervention, data insights will have an increasing effect on global markets.

For example, the 2010 Flash Crash (a 36-minute period of extreme instability on the NYSE that wiped out US$1 trillion in market value) and the Knight Capital algorithm fiasco (where high frequency trades lost US$440 million in 30 minutes) have been blamed in part on Automated Trading Systems.  These systems are a combination of Machine Learning/AI programs that consume vast amounts of data to respond to market conditions.

Unfortunately, these events don’t just affect the unlucky trader, but could wipe billions from investment and superannuation funds used by hundreds of thousands of people.  It is not inconceivable that such events could trigger the collapse of a large institution, with consequences similar to the Enron and Lehman Brothers collapses of the 2000s.  This assessment is reinforced by the US Securities and Exchange Commission actively moving to regulate algorithmic trading.

Despite these risks, the potential for big data to improve the operation of complex financial processes is still huge. Those who can quickly interpret and act on massive amounts of incoming data will gain first-mover advantage in volatile markets.  Additionally, banking and insurance companies have the opportunity and technology to proactively mine big data to detect fraud, money laundering, and other financial crimes.

However, due to organisational complexity and tight regulation, bankers and investors need to understand their intricate enterprise architectures and the quality of data in each source.  Once again, it is important to assess the veracity of incoming information before a decision is made.  It is also necessary to understand regulatory requirements for affected business processes and put in place effective controls.  Finally, it is important to ensure that the accuracy of incoming data is appropriate for the application, or put in place secondary processes to ‘double-check’ outcomes.

Operational Monitoring

A number of companies are using big data technologies to monitor operational performance.  For example, Woodside Petroleum is using a combination of IBM Watson and over 200,000 IoT devices to predict faults at one of its plants. The insights gained from this operation allow Woodside to address issues in planned maintenance windows, therefore reducing the incidence of costly unplanned maintenance.

Incorrect conclusions might result in a technician unnecessarily inspecting/replacing equipment based on a predicted failure report.  Alternatively, if the organisation builds a reliance on the tool, those failures that are not predicted could have larger consequences.

In these high-consequence applications, we need to better invest in quality techniques.  It is important for an organisation to understand the veracity of all its data on a record-by-record basis.  In this case each record can have a ‘score’ attached based on reliability.  These scores can then be used during processing to ensure that ‘unreliable’ data can be eliminated or only given a low weighting in the final decision.  
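A minimal Python sketch of this idea follows. It assumes each record already carries a reliability score between 0 and 1 (how that score is derived is outside the sketch), drops records below a hypothetical threshold, and weights the remainder by their score. The sensor readings are invented for illustration.

```python
def weighted_estimate(readings, min_score=0.2):
    """Reliability-weighted mean of (value, score) pairs.

    Readings scoring below min_score are eliminated; the rest contribute
    in proportion to their reliability score.
    """
    kept = [(value, score) for value, score in readings if score >= min_score]
    total_weight = sum(score for _, score in kept)
    if total_weight == 0:
        raise ValueError("no sufficiently reliable readings")
    return sum(value * score for value, score in kept) / total_weight

# Hypothetical temperature readings as (value, reliability score) pairs.
readings = [(80.0, 1.0), (82.0, 1.0), (120.0, 0.5), (300.0, 0.1)]

estimate = weighted_estimate(readings)
naive_mean = sum(value for value, _ in readings) / len(readings)
print(estimate, naive_mean)
```

Here the least reliable reading is eliminated and the doubtful one is down-weighted, so the estimate stays close to the trusted readings rather than being dragged upward as the plain average is.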

To avoid ‘boiling the ocean’, we can also apply the concept of ‘quality on read’, similar to the Big Data technology concept of ‘schema on read’, to avoid organising every piece of data into a structured format.  Finally, nothing beats a final error check before you act on insights!
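As a hedged illustration of what ‘quality on read’ could look like, the Python sketch below leaves raw data untouched in the lake and applies only the checks a given consumer needs at the moment the data is read. The raw lines, field names and plausibility range are hypothetical assumptions for the example.

```python
import json

# Hypothetical raw lines, landed in the lake exactly as received.
RAW_LAKE = [
    '{"sensor": "pump-1", "temp_c": 71.5}',
    '{"sensor": "pump-2", "temp_c": null}',
    'not valid json',
    '{"sensor": "pump-3", "temp_c": 640.0}',
    '{"sensor": "pump-4", "temp_c": 68.2}',
]

def read_with_quality(raw_lines, checks):
    """Yield parsed records that pass every check; raw data is never modified."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unparseable lines are skipped at read time
        if not isinstance(record, dict):
            continue
        if all(check(record) for check in checks):
            yield record

def has_temperature(record):
    return isinstance(record.get("temp_c"), (int, float))

def plausible_temperature(record):
    # Assumed plausible operating range, for illustration only.
    return -50.0 <= record["temp_c"] <= 200.0

# A high-consequence consumer applies strict checks on read...
monitoring_view = list(
    read_with_quality(RAW_LAKE, [has_temperature, plausible_temperature])
)
# ...while an exploratory consumer reads everything that parses.
exploration_view = list(read_with_quality(RAW_LAKE, []))
print(len(monitoring_view), len(exploration_view))
```

The design choice is that quality effort is spent only on the data an application actually reads, and each application can bring checks proportionate to its own risk.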

In Summary…

Big Data is inherently variable in quality, with data that originates from a broad range of sources (from highly accurate corporate systems to imprecise unstructured data mining tools) and data integration that was never imagined by original system designers.  We have discussed the applicability of traditional techniques to the brave new world of Big Data, while adhering to the principle of finding an optimum level of investment in quality for each particular application and risk profile.  We have also walked through a number of hypothetical scenarios and identified guidelines for each that will help reach an optimal solution.

Finally, as with all IT Systems, no amount of technology will change the old maxim of ‘garbage in, garbage out’. However, with the right approach, organisations can make a success of big data.

The issue of quality in big data solutions will certainly remain relevant, and our team of expert data scientists, data engineers and data quality professionals is keen to work with organisations that are ready to realise the potential of big data.  Speak to one of our data experts today to find out how your organisation can use big data to deliver business value.

Contact details for our team can be found here.

James Hartley and Suhel Khan
