A Guide To High Quality Data

Introduction

Over the last two to three years I’ve been doing a lot of data warehousing work, and it strikes me how many companies have fundamental data problems.

Not big data

There is a lot of talk about ‘big data’, but it isn’t even applicable to the vast majority of businesses, which aren’t generating terabytes or petabytes of data. The fundamental ‘small data’ problems need fixing first.

Fundamentals

For data to be valuable within a company it needs the following attributes:

  • Trusted
  • Easily consumable
  • Able to answer the right questions

The main question for consumers of data within a business is: ‘Can I trust this?’

Unfortunately, in most cases the answer is no.

Applying good engineering practices

This is because a lot of data science is done in a way that ignores good software engineering practices.

The generally accepted way to create high quality code is as follows:

  • Peer review of code
  • Automated tests for that code (see the sketch after this list)
  • Adherence to accepted best practices and software design patterns
  • Encouragement of code reuse
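
To make that concrete for data work, here is a minimal sketch of an automated test for a data transformation, written in Python with pandas and pytest. The function name, column names, and figures are hypothetical, chosen only to illustrate the idea of putting data logic under test rather than any particular codebase:

    # test_revenue.py -- a minimal, illustrative data test (all names are hypothetical)
    import pandas as pd

    def monthly_revenue(orders: pd.DataFrame) -> pd.DataFrame:
        """Sum order totals per calendar month."""
        orders = orders.copy()
        orders["month"] = orders["order_date"].dt.to_period("M")
        return orders.groupby("month", as_index=False)["total"].sum()

    def test_monthly_revenue_sums_per_month():
        # Three orders across two months; totals should roll up per month.
        orders = pd.DataFrame({
            "order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-02"]),
            "total": [100.0, 50.0, 75.0],
        })
        result = monthly_revenue(orders)
        assert list(result["total"]) == [150.0, 75.0]

Run with pytest, this either passes or fails in plain sight, which is exactly the kind of check a formula hidden in a spreadsheet cell never gets.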

Let’s compare that to a lot of data science as we see it today:

  • Done in an ad hoc way, with many standalone scripts.
  • Excel is still king, with formulas buried in the user interface rather than a clear separation of concerns (think MVC in web development); see the sketch after this list.
  • Knowledge concentrated in a few ‘data barons’ who know how everything hangs together, with no real transparency for the rest of the company.
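
By way of contrast, here is a minimal sketch of what separating the concerns can look like: the calculation that might otherwise be buried in a spreadsheet cell is pulled out into a plain, named function that can be peer reviewed, tested, and reused. The function and figures are hypothetical, used only to illustrate the pattern:

    # margin.py -- hypothetical example of business logic lifted out of a spreadsheet
    def gross_margin(revenue: float, cost_of_goods: float) -> float:
        """Gross margin as a fraction of revenue, e.g. 0.4 for 40%."""
        if revenue == 0:
            raise ValueError("revenue must be non-zero")
        return (revenue - cost_of_goods) / revenue

    # The same calculation that might sit in a cell as =(B2-C2)/B2 is now named,
    # documented, versioned, and callable from any report or pipeline.
    print(gross_margin(1000.0, 600.0))  # prints 0.4

The point is not the language; it is that the logic lives in one reviewed, tested place instead of being copied from workbook to workbook.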

Conclusions

Data science as an industry can learn a lot from software engineering, and has to be approached with the same rigour and the same focus on code quality. Data is the lifeblood of most modern enterprises, and building reliable, transparent data infrastructure is essential for earning users’ trust.