5th grade data science (NLP: Computers & Text)
I'm thinking of doing a series where I attempt to explain complex topics at a 5th grade level, hoping that some of the data science magic can become less magical. As a disclaimer for my literal readers: I don't think all 5th graders could grasp these topics yet; it's more a figure of speech. I do, however, think that any professional adult, regardless of their math/stats background, can understand these concepts at a high level with the right explanation.
5th grade data science, here we go:
Do you recognize this man? He's a hero of mine... and all of you are drawing a blank right now. Look at him! Can't you see the goodness just pouring out of his soul? Ok, fine, you don't know him, so I'm going to introduce you.
Meet Scott Fogler. Fogler? Yes, Scott Fogler is a hero because he was one of the first to take a topic like graduate-level chemical kinetics for chemical engineers and write a book that many said had a 5th grade reading level. I would tell people: give nearly anyone his book, and assuming they read it cover to cover, they will feel enabled to learn from it.
So what? So what!?! It's freakin chemical kinetics!
Fogler should have rewritten every single chemical engineering topic out there. Why stop there? Fogler should have rewritten every technical math book on the planet. Unfortunately for the human race he is a limited resource, and we must continue to slog uphill when learning other topics for now.
Data science is no different; if anything, it is more approachable than kinetics. When you think about it, even the more advanced methods we use on a regular basis are simple at a high level. I think so much of data science magic can be explained to a general audience. However, this is easier said than done.
First Topic: Natural Language Processing (NLP)
So my first topic will be natural language processing (NLP). This topic amuses me because it can be daunting at first, but once understood it tends to underwhelm (i.e. "That's it!?! I could have come up with that!").
Consider a simple numeric dataset. If it is easier to imagine it as an Excel file, then do that. I have three numeric columns (grit, motivation, engagement) that I know about the person in each row, and I have a final column (performance) that I really want to predict. Most individuals with minor math or stats under their belt know what to do here. For the ones that don't, there are high-level tools in Excel and other packages that can build a linear model with minimal training.
The linear model will essentially assign a weight to each of the inputs based on how much it helps or hurts when predicting performance. To summarize, you take the columns you've collected (grit, motivation, engagement), assign weights to prioritize their value (large positive numbers would be good, large negative numbers would be bad), and add them together. Easy-peasy, as long as you keep yourself away from the underlying theory of building these amazing models (i.e. matrix calculus).
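If you want to see this without any theory, here is a minimal sketch, assuming scikit-learn is installed and using completely made-up numbers for the columns above:

```python
# A minimal sketch of the linear model idea with invented data.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row is a person: [grit, motivation, engagement]
X = np.array([
    [7, 8, 6],
    [4, 5, 9],
    [9, 9, 8],
    [3, 2, 4],
])
performance = np.array([80, 65, 95, 40])  # the column we want to predict

model = LinearRegression().fit(X, performance)
print(model.coef_)                  # one weight per input column
print(model.intercept_)             # baseline added to every prediction
print(model.predict([[6, 7, 7]]))   # score a new, unseen person
```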
This next part is the part that amuses me about NLP. Now take the same problem, and you say, "I've got great news! I've also collected a text response from each of these people and I would like to include it in the prediction." I would respond, "Fine, send it my way." Then when I open your email, instead of tidy numeric columns, I would see a column of free-form text.
This is amusing because a very simple change in the data has now left most of the people who understood the first problem stranded. They are stuck. So if you are feeling directionless about how to solve this problem, no worries, you are feeling the exact same way the rest of us felt when we first saw it.
Well, you really have two options here, and both come back to the core thing to remember with any computer problem: no matter how smart or fast your computer is, it can't do anything useful unless it is working with numbers.
So to solve ANY text (NLP) problem, the computer needs to replace all text with numbers. The first way this can be done is with what I like to call mapping. The classic example might be:
bad, good, better, best
becoming:
bad=0, good=1, better=2, best=3
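In code, that mapping is nothing more than a lookup table. A tiny sketch:

```python
# The "mapping" idea: each word gets a number that preserves the rank.
rank = {"bad": 0, "good": 1, "better": 2, "best": 3}

responses = ["good", "best", "bad"]
numeric = [rank[word] for word in responses]
print(numeric)  # [1, 3, 0]
```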
Ok, fine. That seems easy enough. Now consider the example of gender. I have the words "male" and "female" in the same column. The example before had an obvious rank; if gender has a rank in your mind, you have bigger problems. The truth is gender can't have a rank (or you're sexist), so how do you deal with text that is not rank-able? Why not map both words to the same value? Give each of them a 1. Well... I can't give each of them a 1 because that would essentially nuke my text feature completely. To get around this, I create a new column for each word so each can have its own unique contribution.
This is called one-hot encoding (chopping raw text into individual words first is called tokenizing). We have taken these two words and mapped them into two new columns that a computer can understand. Now, for bigger text problems, we actually need to do this for ALL words.
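Here is what those two gender columns look like in code, a minimal sketch using pandas with made-up rows:

```python
# One-hot encoding: each category becomes its own column, so neither
# value outranks the other. The data here is invented.
import pandas as pd

df = pd.DataFrame({"gender": ["male", "female", "female", "male"]})
print(pd.get_dummies(df["gender"]))
# Two new columns, "female" and "male", with a 1/0 (or True/False)
# flag per row, and the original word column disappears.
```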
Hopefully your heart skipped a beat when I said "we will just make a column for every word." I know what you are thinking, and the answer is "Yes." Yes, you will get a ridiculous number of columns doing this. Depending on the diversity of your text and the size of the dataset, you could easily exceed 10,000 columns, 100,000 columns, or even 1,000,000 columns. For reference, the max number of columns Excel 2013 can handle is 16,384.
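To make that concrete, here is a sketch of the "column for every word" step using scikit-learn's bag-of-words vectorizer on two made-up survey responses:

```python
# Bag-of-words: every distinct word in the corpus becomes a column.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "I love this job",
    "this job is stressful but I love the team",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())  # one entry per word column
print(X.shape)                      # (2 rows, N word columns)
```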
Quick Geek Moment:
I'm going to forget keeping things at a high level for a second and geek out. Did you know that in most cases your fancy laptop can't even load this dataset in memory now? You have way too many elements; you will run out of memory. So the programs and programming languages that can handle this type of dataset use what is called a sparse matrix to store it. All that means is instead of storing every single element, they just store the locations and values of your non-zero data.
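A quick sketch of the sparse idea using scipy; the shape and values here are invented for illustration:

```python
# Instead of materializing a million-column table of mostly zeros,
# store just (row, column, value) triples for the non-zero entries.
import numpy as np
from scipy.sparse import csr_matrix

rows = np.array([0, 3])
cols = np.array([5, 42])
vals = np.array([1, 2])
X = csr_matrix((vals, (rows, cols)), shape=(1000, 1_000_000))

print(X.shape)        # looks like a 1,000 x 1,000,000 table...
print(X.data.nbytes)  # ...but only the 2 stored values cost memory
```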
Also, in practice you can store numbers other than 1s in your text columns. Some will use a word count to capture how many times the word was mentioned in the tweet or essay. Others will later normalize this by word frequency and scale everything between 0 and 1.
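A small sketch comparing the two choices, using scikit-learn's count and tf-idf vectorizers on made-up sentences:

```python
# Raw counts vs. normalized frequencies in each word column.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the dog ate the food", "the cat ignored the dog"]

counts = CountVectorizer().fit_transform(docs)
print(counts.toarray())  # integer counts, e.g. "the" appears twice per doc

tfidf = TfidfVectorizer().fit_transform(docs)
print(tfidf.toarray())   # frequency-weighted values between 0 and 1
```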
There are probably other alarms going off in your mind, like the realization that word order no longer matters. That is right: if I map "bob has a dog" into my table, it becomes identical to "dog has a bob", "has bob a dog", etc. This is why this method is called bag-of-words. I like to think of it as dropping your sentence or essay on the ground ("What a mess!") and picking it back up in random order. Despite your criticisms of all the problems with doing this, this method is surprisingly predictive on real data problems.
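You can verify the "bag" behavior yourself; a minimal sketch:

```python
# Bag-of-words throws away order: these two "sentences" become the
# exact same row of numbers.
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(["bob has a dog", "dog has a bob"])
print(X.toarray())  # both rows identical -- the model can't tell them apart
```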
If you still think this is a really, really bad idea, let me talk about some things you can do to put a band-aid on your concerns.
Meet n-grams:
So how do we address this mess of losing word order in our text? One thing we can do, instead of mapping a column for every single word, is to also map a column for every word pair. This is called a bigram. So when you hear the term n-gram, it refers to single-word, double-word, or triple-word groupings, where n is whatever you want it to be. Yes, your column count will explode even more, but no worries, because we know how to handle that. In practice, many problems will see an improvement in accuracy when you go to higher word pairings.
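The same vectorizer sketch from before, now asked for single words and word pairs (ngram_range is scikit-learn's knob for this):

```python
# Unigrams AND bigrams: ngram_range=(1, 2) keeps every single word
# plus every adjacent word pair as its own column.
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))
vec.fit(["bob has a dog", "dog has a bob"])
print(vec.get_feature_names_out())
# e.g. 'bob', 'bob has', 'dog', 'dog has', 'has', 'has bob', ...
# Word order leaks back in, because "bob has" != "has bob".
```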
Dealing with sparse issues (stemming, bucketing)
One of our main challenges with text is how sparse our observations are. Going to something like bigrams makes that an even bigger problem. You say "run", I say "running", and the computer treats those as two distinct, unrelated mentions. To fix this you can use stemming functions that will automatically crop all words to their root: "runner" > "run", "running" > "run", etc. The other problem that makes our datasets sparse is different words that mean the same thing. I say "joy", you say "happy", and again the computer treats those as two distinct things. To address this we use a dictionary that maps these words into the same word bucket. You can download one or purchase one. You can also make one automagically from a large body of text, but that is a deeper topic outside the scope of this 30,000-foot tutorial.
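A minimal sketch of both fixes: NLTK's Porter stemmer for cropping words to their root, and a tiny hand-made dictionary standing in for a real synonym bucket map (the synonym entries here are hypothetical):

```python
# Stemming with NLTK, plus a toy synonym-bucket lookup.
from nltk.stem import PorterStemmer  # pip install nltk

stem = PorterStemmer().stem
print(stem("running"), stem("runs"))  # both become "run"

# A real bucket map would be downloaded, purchased, or learned.
synonyms = {"joy": "happy", "joyful": "happy"}
word = "joy"
print(synonyms.get(word, word))  # "happy"
```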
Too Many Columns
Earlier I said too many columns didn't matter since we had ways to handle them effectively in memory. However, when it comes to actually training a model, sometimes less is more, and we have an incentive to throw away garbage words. Many of these methods already handle this with the idea of stopwords: extremely common words like "the", "is", and "at" that add almost no predictive value. Stopword lists can be downloaded for popular languages. Or you can make your own stopwords list, but that is knocking on the data science book of black magic that is beyond this post. Credit to Tyler Byers for reminding me about missing stopwords.
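A one-line sketch, using the built-in English stopword list that ships with scikit-learn:

```python
# Stopword removal: common low-signal words never get a column.
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(stop_words="english")
vec.fit(["the dog is at the park"])
print(vec.get_feature_names_out())  # ['dog', 'park'] -- the/is/at are gone
```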
Conclusion
So to conclude: for those who have not been exposed to text problems, hopefully you can see the baseline approach is pretty simple at a high level. Also, despite all the problems with this type of approach, many practitioners realize over 80% of their value from a simple method like this.
Comments Please:
Please let me know if there are critical NLP mentions you think a nontechnical person would want to know after their first introduction. Of course there are LSTM deep nets, word2vec, Parsey McParseface, and other deeper methods, but I think they are beyond the scope of this, so think basic. Also, if there are complex topics like boosting, bagging, logistic regression, ensemble methods, deep learning, etc. that you would like to see a similar high-level tutorial on, comment below.