To Go AI, IT Must Stop Deleting Your Data
Consulting in the data science space, I often come across prospects who say "We're sold on the need for data scientists, but our data is messy. We'll have our IT guys make the data clean for you. Come back in 3 months."
3-6 months later, when the IT project is done, I finally get the keys to the new database and find that much of the history has been purged. Millions of dollars' worth of IP assets have been destroyed.
The predictive models the client dreamed of aren't going to work without relevant data.
How could this happen? These are the (hopefully) trusty IT guys who keep the systems up and know them intimately (sometimes to the point of what seems like pedantry). They're the guys who built the database and migrated to the new system. They're in house, and they've taken all the support calls. Surely they're the ones you can entrust such a project to?
Why you can't assume IT will be good data stewards
As a Computer Science grad with 20 years' subsequent experience coding and managing dev teams, I know that the very nature of the IT industry makes it hard for IT to be good data stewards.
The challenges of IT are very different to those of data science. IT end users typically don't know exactly what they want and can ask for extra things that didn't seem to be agreed from the outset. Engineering processes are complex: the only sure way to predict how long a project will take is to actually do the project. If you could predict it exactly, you're probably doing routine work that smarter programmers would automate anyway.
Also, computer geeks have limited curiosity about the commercial aspects of the systems they're building. They learn a few assumptions, then go back as quickly as possible to the technology where they feel at home. It's similar to how a business person's attention span for technology is limited: they'll soon want to get back to talking business.
So the solution is to agree on exactly what is needed. The more IT can pare down the requirements and keep the code small, the more likely your systems will be delivered on time and stay up. Every good IT manager knows that as the feature set grows, the interdependencies, and thus the project complexity, grow exponentially.
An IT project that isn't well spec'd, that hasn't nailed down the critical user flows, and that is subject to scope creep is headed towards failure. You can't simply throw more money at it if there is no underlying focus to the project. New coders can't handle the mess left by the old coders who quit out of exasperation.
Why Data Science is so different to IT
Data scientists, on the other hand, aren't primarily judged on whether systems stay up (if you're thinking about dashboards, that's Business Intelligence - you may be overpaying or understimulating your "data scientists" in that case, but that's a story for another day).
Data scientists should be judged on their ability to find insights and predict outcomes. No amount of bleeding-edge machine learning techniques can compensate for a lack of data. Data-driven insights, and machine learning in particular, need a lot of data to work with. Computers aren't generalised learners that can apply knowledge from one domain to another. They're fundamentally just brilliant counting machines: they can take into account far more information than any human can, but that information has to be handed to them within a restricted domain.
Data scientists need lots of rows - if there aren't enough examples to learn from, you can't rely on the observations. They also need lots of columns - if the machine doesn't have enough context, it won't be able to work its magic: finding complex, undiscovered relationships between entities to get real insights and predict behaviour. It doesn't matter if your data is text, images or spread across dozens of tables. Good data scientists will bring it all together for the machine learning to use.
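To make that concrete, here's a minimal sketch in Python of pulling several sources into one wide table for the machine to learn from. The table and column names (customers, orders, support notes) are invented for illustration; your systems will look different.

```python
import pandas as pd

# Hypothetical sources: one row per customer, plus many orders and
# free-text support notes per customer.
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")
notes = pd.read_csv("support_notes.csv")

# Aggregate the many-row sources up to one row per customer.
order_features = orders.groupby("customer_id").agg(
    order_count=("order_id", "count"),
    total_spend=("amount", "sum"),
)

# Even free text can contribute simple columns.
notes["mentions_cancel"] = notes["text"].str.contains("cancel", case=False)
note_features = notes.groupby("customer_id").agg(
    note_count=("text", "count"),
    cancel_mentions=("mentions_cancel", "sum"),
)

# Join everything into one wide feature matrix - more columns,
# more context for the machine to find relationships in.
features = (
    customers.set_index("customer_id")
    .join(order_features)
    .join(note_features)
    .fillna(0)  # customers with no orders or notes get zeros
)
```

The point isn't the specific libraries; it's that every extra column you kept is extra context the model can use.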
In Computer Science, the critical flow is a sequence of steps performed by a user. Developers implement idealised worlds based on limited assumptions spelled out in interfaces. But in Data Science, only the measurable outcome matters: the business objective, be it more sales, less labour input per product or a higher Net Promoter Score.
It is the constant experience of a good data scientist that domain expertise is worth paying careful attention to, but reality is complex, and humans simply can't encounter as many examples as a machine can, let alone remember and weigh them in an unbiased fashion.
So in practice, what went wrong?
The IT guys didn't want to build any more code than was necessary. They think in terms of migration - that is, only what's necessary to run the new system makes the move, and the old system should be turned off as quickly as possible.
They didn't want a call at 3 in the morning that the hard drives were full either.
IT may point out, correctly, that the data is imperfect. But I can tell you: no client ever has a perfect database, and no client has data that is unworkable (provided they save it). Beyond fulfilling transactions and reporting on the dashboard, organisations should not strive for perfect formatting, merely perfect memory. Just save everything.
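If the worry is disk space or clutter in the live system, an archive is a cheap compromise. Here's a minimal sketch of the pattern, with made-up table names: copy rows aside before any purge instead of destroying them.

```python
import sqlite3

conn = sqlite3.connect("crm.db")
with conn:
    # Create an archive table with the same shape as the live one
    # (WHERE 0 copies the schema but no rows).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders_archive AS "
        "SELECT * FROM orders WHERE 0"
    )
    # Copy the rows the migration would have dropped...
    conn.execute(
        "INSERT INTO orders_archive "
        "SELECT * FROM orders WHERE order_date < '2015-01-01'"
    )
    # ...and only then remove them from the live system.
    conn.execute("DELETE FROM orders WHERE order_date < '2015-01-01'")
```

Storage is cheap; the history you purge is not.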
The thing is, you don't know what quality means until you know what problem you're trying to solve. Perhaps as much as 90% of your rows lack a field critical to answering a decision. But is the remaining 10% representative enough of the rest? If so, you may not need the rest to get an answer. After all, you listen to your market research agency, and they rarely do a study on more than a few hundred people. Or perhaps the other columns for that 90% are pretty good, and you could actually use machine learning to fill the empty values.
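Here's a minimal sketch of both ideas on made-up data: first check whether the complete 10% looks like the rest, then let the other columns fill the gaps. Every name here is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up table: 'channel' is the critical field, missing for ~90% of rows.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "total_spend": rng.gamma(2.0, 100.0, n),
    "order_count": rng.poisson(3, n),
    "tenure_days": rng.integers(30, 2000, n),
})
df["channel"] = np.where(df["total_spend"] > 200, "online", "retail")
df.loc[rng.random(n) < 0.9, "channel"] = None

complete = df[df["channel"].notna()]
missing = df[df["channel"].isna()]

# 1. Is the surviving 10% representative? Compare summary statistics
#    of the other columns across the two groups before trusting it.
print(complete["total_spend"].describe())
print(missing["total_spend"].describe())

# 2. If the remaining columns are sound, learn to fill the gaps:
#    train on the rows that have the value, predict the rest.
predictors = ["total_spend", "order_count", "tenure_days"]
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(complete[predictors], complete["channel"])
df.loc[df["channel"].isna(), "channel"] = model.predict(missing[predictors])
```

Neither step requires the data to be perfectly formatted; both require it to still exist.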
And that old dataset from the legacy software probably has insights you can't get from the new one - not least because it covers a different time period, but also because it probably encapsulates different workflows that can test different assumptions.
So becoming data driven - or to use the buzzword, going AI - is not about adding interfaces that make decisions you can't fathom, like adding a chat bot. It's certainly not about how many data scientists you've hired. It's about whether they have all the raw materials necessary to allow machines to see things from every angle.