Stuck in the Muck: Big Data means Big Problems
Image from you.com

Imagine that your organization is a sleek thing of beauty, like a very fast, very expensive, highly polished Ferrari. But it can't reach top speed – or its true potential – because it's stuck in the mud all the way to the top of its wheels. 

And your data is the key problem because data is the new mud. Big Data, especially Big Text Data, is slowing you down and holding you back – no matter how hard your team pushes, no matter how fast your Ferrari can go. Regardless of what datacenter and database vendors tell you, your data is the problem, not the solution. And by the way, generative AI won't magically save you from the data swamps that your organization has created – it's built on top of the same quivering muck that you need to be rid of.

Decades ago, the Big Data movement created a paradise for statistics- and math-minded workers as organizations everywhere accumulated oceans of numeric data. But it also opened the floodgates for a vast mudslide of text data: labels, words, queries, sentences, reviews, documents, emails, comments, suggestions, complaints, text messages, and more. Data that no one was prepared for. Data that databases don't work well with. Billions upon billions of texts! I've even seen estimates that 80% of all data today is text data.

Take LinkedIn as one example. When I worked there, member profiles (those mini-résumés that everyone posts to promote themselves) mentioned more than 150 million deduped job titles, in 23 languages. (Now there are many, many more!) And there were millions more job titles in job postings (usually worded differently from the titles in profiles). Text, text, and more text!

A treasure trove, right? Wrong!

It's more like a humongous, web-scale headache! Search, ads, insights, and recommendations – all key drivers of revenue and engagement – are constantly marred by wildly irrelevant, incorrectly labeled "related" results. For one simple example, the vast majority of jobs for which I am supposedly a "top candidate" (the very best matches!) are positions that I have absolutely no training, experience, or qualifications for (much less any interest in). Today I got a new "top pick": Senior Archeologist! I also, apparently, use many of the same keywords that a very highly qualified Distinguished Software Engineer uses – but why should I care? Granted, my background looks like a dog's breakfast, but their simplistic string-matching approach yields results that are most often useless. This painfully naïve approach also means that I'll see the wrong ads and retrieve the wrong search results, and that stats about or insights into job trends will count me (and millions of others) in the wrong places. Clearly, LinkedIn will also get more complaints, so they'll have higher support costs and more missed or delayed sales, as well.

All because they have too much text data and too few reliable engineering methods for processing it.

Each and every mismatch like this is a tear in LinkedIn's reputation and a dip in revenue. For web-scale operations, that ends up being very real money left behind. Instead of being the best, most reliable service, they're simply better than most others – a rather low bar as it turns out. My team there ended up developing robust methods and resources to address these problems, so I know that solutions are available, even if they are not widely known, widely understood, or widely adopted.

LinkedIn is just a single example. Almost all other organizations have similar problems.  If customers are searching for products or content and can't find them, then you won't sell. If clients depend on your data but it's not reliable, then they won't renew their subscriptions.  If you match ads to viewers incorrectly, then your clients won't sell as much.  If you build gen AI on top of all this mucky data, you get hallucinations. 

In sum, the state of the art for Text Data is this:

  • A huge swath of revenue generation depends directly on text data and how we process it. There's a huge amount of business value that is untapped and virtually unexplored.
  • Organizations have accumulated vast amounts of text data – Big Data – without understanding which parts of it are reliable, valuable, or interoperable. 
  • Huge expenses are associated with storing, protecting, and processing all this data – with unclear returns.
  • Engineering methods for processing text – as seen in lousy search results and generative AI's hallucinations – are simply ineffectual in real life. These methods relate words based on their order or spelling rather than on their meaning. 
  • Software engineers' training pays precious little attention to text and string data. To the point where they call text "unstructured data" – essentially, random stuff. That's not exactly a rich conceptual framework for them to start from, especially given the scale and impact of such an important problem. The software engineers that we rely on so heavily for processing numeric data are clearly at a loss for how to effectively face this massive onslaught of words. 
  • Effective but little-known methods are available for addressing the complexities of text data.
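
To make the spelling-versus-meaning point concrete, here is a minimal sketch in Python. Everything in it is a toy assumption for illustration: the three job titles, the concept IDs, and the hand-built lookup table are hypothetical, not LinkedIn data and not the methods my team built. It only shows why word overlap and shared meaning can disagree.

```python
# Toy illustration only: the titles, concept IDs, and lookup table below are
# hypothetical assumptions, not real LinkedIn data or a production method.

def tokens(title: str) -> set[str]:
    """Split a job title into lowercase words."""
    return set(title.lower().split())

def spelling_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the words in two titles (spelling only)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

# Hand-built mapping from surface title to a shared concept ID (assumed).
TITLE_TO_CONCEPT = {
    "software engineer": "C_SOFTWARE_DEV",
    "programmer": "C_SOFTWARE_DEV",
    "sales engineer": "C_TECH_SALES",
}

def same_meaning(a: str, b: str) -> bool:
    """True only when both titles map to the same known concept."""
    ca = TITLE_TO_CONCEPT.get(a.lower())
    cb = TITLE_TO_CONCEPT.get(b.lower())
    return ca is not None and ca == cb

if __name__ == "__main__":
    pairs = [
        ("software engineer", "sales engineer"),
        ("software engineer", "programmer"),
    ]
    for a, b in pairs:
        print(a, "|", b, "|", spelling_similarity(a, b), "|", same_meaning(a, b))
```

In this toy setup, word overlap rates "software engineer" closer to "sales engineer" than to "programmer", while the concept lookup says the opposite. A real system would need a far richer resource than a three-row table, but the contrast is the same one described in the list above.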

More data is definitely NOT better data, no matter what statistically-minded advisors say. The key lessons learned from decades of Big Data are that high-quality data is far more valuable than high-volume data and that text data is very hard to process reliably with engineering methods. 

Slogging through LinkedIn-scale text data shows very clearly that what's really needed to advance both business goals and the state of the art is substantially more investment in meaning-centric methods. To dig out from this vast collection of messy, muddy data, we need to transform it into meaningful bricks that we can build with. These bricks we can shape into powerfully structured knowledge graphs that have already proven invaluable when we need to aggregate, integrate, and evaluate data or tame unruly language models. 

Your Ferrari will go much faster on a track paved with bricks than on one knee-deep in mud.
Putcha Narasimham

Founder Proprietor at Knowledge Enabler Systems

4mo

Text is "inappropriately" dubbed and "blamed" as "unstructured" but most text we deal with is certainly well-structured according to the language grammar and conventions humans use in their communications. It may not be perfect but NOT unstructured. Text in print or speech is the ultimate form for agreements among humans. The only thing is that text is NOT readily machine processible yet. So, there is no scope for "semantic computation" equivalent to "numerical computation". With "knowledge hypergraphs" the problem of machine compatibility of text is well settled at least conceptually. Now, it is a short leap to interconvert "text to knowledge hypergraphs" and vice versa, with human assistance. Progressively that can be automatic with flagging for necessary human validation. Then we can rely on machines to aid humans to make sense and validate understanding of most text. Some text will still need human analysis, arguments and negotiations but that should be a minuscule fraction of the volume of text now in use. Soon enough bulk text will not be in much use. X tweets and messages will dominate human communications and interworking. Let us develop on this putchavn@yahoo.com

Gideon Kory, CFA 🎗️

Artificially Intelligent. Bringing together people, ideas, and data. I am because we are.

4mo

How to get every knowledge worker engaged in #dataintelligence and own their domain #datagovernance ? How to bring business context into data-centric decision making?

Gordon Hamilton

Data Quality Evangelist for 20 years, steadily improving my ability to communicate the importance of DQ for Cost Reduction & Data Monetization.

5mo

Loved "data is the new mud".


Clearly, briefly and well stated, Mike. The photo is a bit out of sync with your otherwise well-expressed message. Great to see Knowledge Architecture supported as system informatics. Thank you for your posting, Mike.

More articles by Mike Dillinger, PhD
