The importance of improving data quality at source

SA Mathieson

Published: 29 Dec 2022

The UK can blame its bad immigration data on Hungary, one of the eight central and eastern European countries that joined the European Union in 2004. Unlike most existing EU members, the UK allowed citizens of the new member states to move to and work in the country without restrictions, expecting 5,000 to 13,000 people to arrive each year. But this was a massive underestimate, leading to accusations that immigration was out of control and arguably contributing to Britain’s exit from the EU.

Based on the results of the 2021 Census, the country which sent the most people to the UK was Poland, followed by Romania. But Hungary is the home of budget airline Wizz Air, which as part of keeping down costs tends to use smaller airports such as Luton, Birmingham and Sheffield Doncaster.

Also to keep down costs, the International Passenger Survey run by the Office for National Statistics (ONS) at the time focused on Heathrow, Gatwick and Manchester. As a result, it didn’t notice increasing numbers of eastern Europeans using budget flights run by Wizz Air and others.

Georgina Sturge, a statistician for the House of Commons Library research service, highlights the episode in her new book, Bad data, as an example of how data collection can go awry. The passenger survey had been set up in the 1960s, when far fewer people travelled internationally, more left the UK permanently than arrived, and most people required visas.

“People didn’t tend to travel in large droves from Poznań to Doncaster in the past,” says Sturge. “Unfortunately for the statisticians, who hadn’t even stationed anyone there to do the survey at the time, that was exactly what people started to do.”

Sturge says the UK has excellent official data in some areas, including health, traffic accident statistics and much of the ONS’s output. The Office for Statistics Regulation maintains a list of approved national statistics which she describes as the gold standard.

“But ultimately, if we’re asked a question or we need to produce some briefing material on something and there is any data out there which seems remotely reliable, we will pretty much end up using it,” she says of her work for MPs and their staff. “From our perspective, it’s about explaining the caveats.” This means thinking about where data comes from, how it is collected and for what purpose, considering the human processes involved rather than just the technical matter of getting hold of it.

Replication crisis

Parliamentarians are not alone in being hungry for data, and not too picky about what they consume. Recent years have seen several scientific fields threatened by a replication crisis, where the results of research published in peer-reviewed journals cannot be reproduced by others repeating the work, in some cases because the data has errors or is faked.

Researchers who rely on such research data may find their work is undermined, but the risk can be lessened by using services that carry out reliability checks on papers. Healthcare journalist and academic Ivan Oransky co-founded Retraction Watch, a database of scientific papers that have been withdrawn. Its data is used by publishers and companies to check references through bibliographic management software including EndNote, Papers and Zotero, as well as digital library service Third Iron. “We would be happy to work with more, and to have our database integrated into the manuscript management systems that publishers use,” he says.

However, he adds, the bigger problem lies in inaccurate papers and data that have not been retracted, making it worth using post-publication review services such as PubPeer, of which he is a volunteer director. More generally, he adds that researchers are well-advised to follow the Russian proverb, “trust, but verify”, adopted by former US president Ronald Reagan in nuclear disarmament talks with the Soviet Union.

Researchers should aim to obtain and analyse the original data before relying on it for a project or further research. “That may seem inefficient, but it’s far better than being caught unaware when a project is much further along,” says Oransky.

Another approach is to improve the classification of scientific data, particularly that held in text. Neal Dunkinson, vice-president of solutions and professional services for semantic analytics company SciBite, says the word “hedgehog” in a genetics paper may refer to the sonic hedgehog gene that helps control how bodies develop from embryos, named after the video game character, or it may refer to the small, spiny mammal in general.

Cambridge-based SciBite, which was bought by Dutch scientific publisher Elsevier in 2020, has developed a service that automates the tagging of mentions of 40,000 genes to standard identities, making searches of papers, slides and electronic lab notebooks more precise. To do so, it has built lists of acronyms, alternative names and spellings, and common misspellings. As well as applying it to existing material, it offers a real-time option that prompts researchers to add tags through drop-down lists or the equivalent of a spellchecker.
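The general idea of mapping alternative names and misspellings to one standard identity can be sketched in a few lines of Python. This is only an illustration of the technique, not SciBite's implementation: the synonym table, the identifier format and the function name tag_mentions are all assumptions made for the example.

```python
import re

# Hypothetical synonym table: each name, acronym or misspelling maps to one
# standard identifier. The "GENE:" identifiers are invented for illustration.
SYNONYMS = {
    "sonic hedgehog": "GENE:SHH",
    "sonic hedghog": "GENE:SHH",  # common misspelling
    "shh": "GENE:SHH",
    "tp53": "GENE:TP53",
    "p53": "GENE:TP53",
}

# Longest terms first, so a full name is preferred over a shorter overlap.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(t) for t in sorted(SYNONYMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def tag_mentions(text):
    """Return (matched text, standard identifier) pairs in order of appearance."""
    return [
        (m.group(0), SYNONYMS[m.group(0).lower()])
        for m in _PATTERN.finditer(text)
    ]


print(tag_mentions("Loss of Sonic Hedgehog (SHH) signalling"))
print(tag_mentions("A hedgehog crossed the road"))  # no gene is tagged here
```

Because the bare word "hedgehog" is not in the table, the animal is left untagged while the gene's full name and acronym resolve to the same identity. A real system would also need context to handle ambiguous short forms.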

Dunkinson says that good-quality data in life sciences should be “FAIR” – findable, accessible, interoperable and reusable. “We don’t at the moment critique the quality of the information written down – that’s about repeatability in the experimental process – but how usable is that information, is it tagged properly, is it stored correctly, do people know where it is, is it in the right formats,” he says.

Dependency chain in financial auditing

Financial auditing, like much scientific research, relies on other people’s data. Organisations are responsible for their accounts, but auditors have to extract data so they can check its accuracy and integrity. London-based audit technology company Engine B has worked with the Institute of Chartered Accountants in England and Wales and audit firms to build a common data model that can extract material from common enterprise resource planning suite packages.

The company’s head of audit and ethics, Franki Hackett, says the system uses knowledge of common software and practices to present what it thinks will correctly transform a file so it can be loaded into this common model, but it remains wise to include human checks. “You can take the human out of the loop, but when you do, you quite often see errors in fidelity, or mistranslation of data or inappropriate transformation and loading,” she says. “Keeping a good balance between the machine and the human being is a critical part of that stage of data quality.”

If it has processed a previous version, Engine B’s system flags any changes in the data’s structure, such as new fields. Hackett says organisations tend to be weak at reviewing data processes after they have been set up, meaning that such changes get missed. “An ‘if it ain’t broke, don’t fix it’ mentality can miss that creeping brokenness,” she says.
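Flagging structural changes between two versions of a data extract can be as simple as comparing field names. The sketch below is a generic illustration, not Engine B's system; the function name schema_changes and the example field names are hypothetical.

```python
def schema_changes(previous_fields, current_fields):
    """Compare two lists of field names and report what was added or removed."""
    prev, curr = set(previous_fields), set(current_fields)
    return {
        "added": sorted(curr - prev),
        "removed": sorted(prev - curr),
    }


# Hypothetical column headers from last year's and this year's ledger export.
last_year = ["txn_id", "posting_date", "account", "amount"]
this_year = ["txn_id", "posting_date", "account", "amount", "currency"]

changes = schema_changes(last_year, this_year)
if changes["added"] or changes["removed"]:
    print("Structure changed, needs human review:", changes)
```

The check only reports the difference. Deciding whether a new field matters is left to a person, in line with the human checks Hackett recommends.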

Auditors working to decide if they can verify the accuracy and completeness of an organisation’s financial records often compare two sets of data recording the same things, such as the general ledger with details of all transactions and the trial balance which summarises debits and credits. These should match up, but it’s common to find discrepancies such as different dates for transactions, which can indicate poor controls. Hackett says she has seen senior financial professionals sticking their usernames and passwords on their monitors for others to use, risking outright fraud but also making mistakes more likely – and different dates in the two data sets can indicate attempts to fix such mistakes.
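A comparison of this kind can be sketched as follows, assuming a deliberately simplified layout: ledger entries as dictionaries with account, debit and credit values, and a trial balance that holds one net figure (debits minus credits) per account. These structures, and the function name reconcile, are assumptions for illustration. Real audit data is richer, and this sketch does not cover the date discrepancies mentioned above.

```python
from collections import defaultdict
from decimal import Decimal


def reconcile(ledger_entries, trial_balance):
    """Return accounts whose ledger net total differs from the trial balance.

    Amounts are passed as strings and handled as Decimal to avoid
    floating-point rounding errors.
    """
    totals = defaultdict(lambda: Decimal("0"))
    for entry in ledger_entries:
        totals[entry["account"]] += Decimal(entry["debit"]) - Decimal(entry["credit"])

    mismatches = {}
    for account in sorted(set(totals) | set(trial_balance)):
        ledger_net = totals.get(account, Decimal("0"))
        balance_net = Decimal(trial_balance.get(account, "0"))
        if ledger_net != balance_net:
            mismatches[account] = (ledger_net, balance_net)
    return mismatches


# Invented figures: the sales account does not tie up.
ledger = [
    {"account": "cash", "debit": "100.00", "credit": "0"},
    {"account": "cash", "debit": "0", "credit": "30.00"},
    {"account": "sales", "debit": "0", "credit": "100.00"},
]
balance = {"cash": "70.00", "sales": "-90.00"}

print(reconcile(ledger, balance))
```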

In similar fashion, through academic research on tax transparency, Hackett has found that country-level data that a European directive requires some large companies to publish often doesn’t tally with global figures. The parameters of the required national data are badly defined, she says: “They can produce something which is fundamentally kind-of unusable, a nonsense that’s a public relations exercise a lot of the time.” It demonstrates the need to know exactly what questions data collection is trying to answer.

Waseem Ali, chief executive of diversity-focused consultancy and training business Rockborne, previously worked as chief data officer for insurance market Lloyd’s of London and head of analytics for healthcare provider Virgin Care. For insurers, bad data can mean wrongly priced premiums, but in healthcare, it can mean failing to provide potentially life-saving advice.

“There is a high likelihood that I will have some sort of heart disease, based on my family history and my ethnicity,” says Ali. “Having the right quality data about me allows healthcare providers to intervene sooner, so they can ensure that someone like Waseem goes to the gym regularly and eats properly.” As well as being in the interest of patients, data-driven predictive work could cut healthcare system costs by reducing the number of major interventions later.

Ali says organisations can seek to improve data quality by understanding its end-to-end journey and focusing on the most business-critical material. Improvements can be made through simple changes such as standardising how teams calculate the likes of profit margins and customers’ experience so these can be properly compared. “I’ve been in organisations where the same statistic is reported with different numbers due to the way it is being interpreted,” he says.

Anthony Scriffignano, chief data scientist of Dun & Bradstreet, a Florida-based company that has published data on businesses for two centuries, sees four types of data quality: accuracy, completeness, timeliness and veracity. Completeness and timeliness are relatively easy to check, although a blank field can mean the data doesn’t exist rather than that it has been missed (for example, because a business does not have a telephone number), and data collected today may have been created or updated some time previously.

Checking accuracy is harder. In some cases, Dun & Bradstreet can draw on official documents, but if there is no authoritative source, “it becomes a little bit of an art”, says Scriffignano. It can consider the reliability of the organisation providing information and whether numerical data is within likely ranges, although the latter needs to be applied with care. It may sound unlikely that a removal and storage provider is more than five centuries old, but as it says on its lorries, Aberdeen’s Shore Porters Society was founded in 1498. The key is to have rigorous checking processes. “You can’t just wing it,” he says.
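One way to handle unusual values is for automated checks to flag records for review rather than reject them. The sketch below illustrates that idea only. It is not Dun & Bradstreet's process, and the field names, thresholds and flag labels are all hypothetical.

```python
from datetime import date


def review_flags(record, today, earliest_usual_year=1800, max_age_days=365):
    """Return reasons a business record deserves a human look.

    Nothing is rejected outright: a blank phone field may be genuine, and an
    unusually early founding year may be true. Thresholds are illustrative.
    """
    flags = []
    if not record.get("phone"):
        flags.append("phone_blank")  # missing, or the business has no phone
    if record["founded_year"] < earliest_usual_year:
        flags.append("founded_year_outside_usual_range")
    if (today - record["last_updated"]).days > max_age_days:
        flags.append("possibly_stale")
    return flags


# Example record using the founding year given in the article; the other
# field values are invented.
record = {
    "name": "Shore Porters Society",
    "phone": "",
    "founded_year": 1498,
    "last_updated": date(2022, 12, 1),
}

print(review_flags(record, today=date(2022, 12, 29)))
```

Here the 1498 founding year is flagged for verification against a source, not discarded as an error.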

The hardest of the four is veracity. Scriffignano points out that “the truth, the whole truth and nothing but the truth” can be three different things, with the first broken by lying, the second broken by omission and the third only fulfilled through being entirely truthful.

There are ways to check that a set of data satisfies all three, such as statistical analysis of its distribution. If a graph of a set of data would normally look like a bell curve with a high point in the middle and tapering sides, but instead only includes the high middle, it indicates that some data is being excluded – the truth but not the whole truth. Dun & Bradstreet saw data on bankruptcies being warped during the Covid-19 pandemic, as smaller ones were missed or not reported.
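A rough version of this distribution check, on synthetic data, is sketched below. It assumes the data would normally be bell-shaped and simply measures how much of it lies far from the mean. The function name tail_share and the two-standard-deviation cut-off are choices made for this example, not a method described by Scriffignano.

```python
import random
from statistics import mean, pstdev


def tail_share(values, k=2.0):
    """Share of values lying more than k standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return sum(abs(v - mu) > k * sigma for v in values) / len(values)


# Synthetic data only: a full bell curve, and the same data with both tails
# removed to imitate records that were never reported.
rng = random.Random(0)
full = [rng.gauss(0, 1) for _ in range(10_000)]
middle_only = [v for v in full if abs(v) < 1]

# A normal distribution puts about 4.6% of values beyond two standard
# deviations; a sample with its tails cut off shows far fewer.
print(f"full sample: {tail_share(full):.3f}")
print(f"middle only: {tail_share(middle_only):.3f}")
```

A low tail share does not prove that data has been withheld, because some data is not bell-shaped to begin with, but it is a cheap prompt to ask what might be missing.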

Despite all the ways that data can be tested, Scriffignano says the biggest problems are caused by organisations unintentionally ingesting data that has unknown issues. “As a consumer of data, depending on what you’re doing with it, you probably should think about where you’re getting it from and how you know that you trust it,” he says.