Pipeline for Semi-Supervised Learning in Topic Modelling
A useful technique when you have a lot of text and no labels
Introduction
Suppose you are part of a team that just launched a new product, or you are tasked with monitoring service feedback. How do you quickly understand what your customers like or dislike about your product or service amidst the mountains of reviews? While traditional topic modelling techniques such as Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) can help, they offer limited scope for fine-tuning. This article describes a pipeline that combines various Natural Language Processing (NLP) techniques to perform text classification. The advantages of this method include:
- Suitable for multi-label classification
- Minimal manual labelling
- Many options for experimenting and fine-tuning
Topic modelling is a field within NLP that seeks to find broad topics occurring within a corpus of text, allowing a large amount of text to be quickly categorised. Typical techniques such as LDA and NMF employ statistical methods to determine which words are closely associated. They require the user to specify, in advance, the number of topics expected to be present, and they output the probability that each document belongs to each topic. While these models do not inherently understand the topics, they can output the top words associated with each topic, allowing the user to judge what each topic is about. An example could look like this:
The user would then classify the first text under Topic 1 and the second text under Topic 2. While this is great for a quick and broad overview, there is very limited scope for fine-tuning for better accuracy. From my initial attempts, LDA and NMF also do not perform well in multi-label classification.
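For illustration, here is a minimal scikit-learn NMF sketch of how such top words are surfaced; the corpus and parameters are placeholders rather than the actual data used in this article:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# Placeholder corpus standing in for real reviews
docs = [
    "The staff were friendly and solved my problem quickly",
    "Good service, very happy with the support",
    "The website layout is simple and easy to use",
    "Easy to navigate the site, clean interface",
]

# Vectorise the text and fit NMF with a pre-specified number of topics
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
nmf = NMF(n_components=2, random_state=42)
doc_topic = nmf.fit_transform(X)  # each row: the document's weight for each topic

# Show the top words associated with each topic
words = vectorizer.get_feature_names_out()
for i, component in enumerate(nmf.components_):
    top_words = [words[j] for j in component.argsort()[-5:][::-1]]
    print(f"Topic {i}: {top_words}")
```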
Data
The pipeline can be applied to any corpus of text; in my case, I used data obtained from a private enterprise I was assisting. Sample documents:
Pipeline Overview
The crux of the pipeline lies in the second stage, where a small subset of hand-labelled data guides the downstream modelling.
1. Identify topics with BERTopic
BERTopic is an unsupervised topic modelling technique that leverages transformers and c-TF-IDF to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
Unlike LDA and NMF, which rely on statistical methods, BERTopic uses BERT (or any other language model) to capture the “meaning” of the text via sentence embeddings. Pre-specifying the number of topics is also optional. However, due to its stochastic nature, the results are generally not reproducible (though the variability is low). It also does not do very well in multi-label classification.
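A minimal sketch of this step, assuming the reviews are already in a Python list called docs (the exact settings used in this project are not shown):

```python
from bertopic import BERTopic

# docs is assumed to be a list of review strings
topic_model = BERTopic(language="english")  # the number of topics is left for the model to decide
topics, probs = topic_model.fit_transform(docs)

# Inspect the discovered topics and their top words
print(topic_model.get_topic_info())  # overview: topic id, size and auto-generated name
print(topic_model.get_topic(0))      # top words (with c-TF-IDF scores) for topic 0
```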
Output:
Topic -1
['customers', 'product', 'technician', 'professional', 'questions', 'solutions', 'contact', 'maintenance', 'customer service', 'store']
Topic 0
['solved problem', 'problem solved', 'zoomskype problem', 'friendly helpers', 'friendly helps', 'friendly helping', 'friendly helpful', 'friendly helped', 'friendly help', 'friendly maintenance']
Topic 1
['service good', 'good service', 'service happy', 'friendly helpers', 'friendly information', 'friendly importantly', 'friendly engaging', 'friendly human', 'friendly hospitable', 'friendly highly']
Topic 2
['simple', 'friendly helping', 'friendly helpful', 'friendly helpers', 'friendly helped', 'friendly help', 'friendly guide', 'friendly helps', 'friendly highly', 'friendly layout']
Topic 3
['service good', 'good service', 'servive', 'good servive', 'servive good', 'friendly hospitable', 'friendly highly', 'friendly helpful', 'friendly helpers', 'friendly great']
Here, BERTopic produced 145 topics (Topic -1 represents outliers), and we can see the top words associated with each topic. Unfortunately, some manual effort is required to condense the number of topics; in this case, I condensed them down to 7.
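The condensing itself can be as simple as a manual mapping from BERTopic's raw topic ids to broader labels. The assignments below are purely illustrative, with the labels mirroring the seven topics used later in the article:

```python
# Illustrative mapping from raw BERTopic topic ids to condensed labels
topic_mapping = {
    -1: "others",          # outliers
    0: "communication",
    1: "communication",
    2: "user interface",
    3: "communication",
    # ... remaining ids assigned after manually inspecting their top words
}
condensed_labels = [topic_mapping.get(t, "others") for t in topics]
```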
2. Manually label a small subset of the corpus
Next, a small subset of the corpus has to be manually labelled. I ensured that each topic (derived from the output of BERTopic) had at least 10 documents; in total, I labelled about 200 records. This one-time effort will be used for downstream model training, hence it needs to be prepared with care. It cannot be avoided, especially if you want a sense of how well your model is performing.
3. Generate synthetic training data with backtranslation
With the manually labelled data, we are now ready to create new data. The process is analogous to data augmentation in image classification, where transformations are applied to existing images to generate new ones. The nlpaug library provides a diverse range of techniques for altering text to create new samples, and can even simulate spelling errors.
Backtranslation is one such technique: it involves translating a given text into another language and then back to English, which changes the choice of words along the way (see this hilarious example of Frozen’s Let It Go). It was selected after testing a variety of techniques because it is a “soft” way of altering text; some other methods create entirely new sentences with wildly different meanings, which may not be ideal. The downside of backtranslation is that the alterations can be quite subtle, which inevitably “leaks” information during model training. To increase the degree of alteration, the intermediate languages should be as different from English as possible, such as Japanese, Mandarin and German.
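A minimal backtranslation sketch with nlpaug; the English-German translation models named here are one possible choice, not necessarily the ones used for this project:

```python
import nlpaug.augmenter.word as naw

# English -> German -> English round trip; other language pairs can be swapped in
back_translation = naw.BackTranslationAug(
    from_model_name="facebook/wmt19-en-de",
    to_model_name="facebook/wmt19-de-en",
)

text = "The language used by the partner is very friendly, polite and easy to understand."
augmented = back_translation.augment(text)  # recent nlpaug versions return a list of strings
print(augmented)
```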
Output:
['The language your partner uses is very straightforward, polite, and difficult to grasp. The connection is smooth.',
'The language used between my partner is very friendly, polite yet easy to understand, and communication is smooth.',
'The language used by the partner is very friendly, polite and easy to understand, and communication is smooth.']
After augmentation (this process may take several hours depending on the volume), we have about 3,000 rows of training data.
Baseline Model
As the field of NLP has advanced over the years, new tools have emerged to help with various NLP tasks. One such tool is Zero-Shot Classification (ZSC) from HuggingFace, where one can input the target text and a list of topics and obtain the probability that the text belongs to each topic.
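A minimal ZSC sketch using the transformers pipeline with the bart-large-mnli checkpoint and the seven topics shown in the output below; multi_label=True scores each topic independently:

```python
from transformers import pipeline

# Zero-shot classification with bart-large-mnli
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "The language used by partners is very friendly and polite and easy to understand, the connection is also smooth"
topics = ["communication", "information", "facilities", "user interface",
          "location", "price", "waiting time"]

# Each topic is scored independently of the others
result = classifier(text, candidate_labels=topics, multi_label=True)
print(result["labels"])
print(result["scores"])
```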
Output:
{'sequence': '\nThe language used by partners is very friendly and polite and easy to understand, the connection is also smooth\n',
'labels': ['communication',
'information',
'facilities',
'user interface',
'location',
'price',
'waiting time'],
'scores': [0.9803360104560852,
0.7030088901519775,
0.6719058156013489,
0.6212607622146606,
0.3871709108352661,
0.33242109417915344,
0.13848033547401428]}
In ZSC, the class names have to be chosen carefully, as a word can mean very different things to a human versus the language model. Unfortunately, as powerful and simple as the technique is, the weighted F1-score is only 54% when evaluated on the test set.
4. Train a model with the synthetic data
At this point, the task is simply a supervised multi-label classification problem. Various classifiers were tested, but the Support Vector Classifier (SVC) consistently outperformed the rest, hence it was adopted as the default classifier.
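As a minimal sketch, a one-vs-rest linear SVC on TF-IDF features could look like the following. The actual experiments also used other feature representations (e.g. Word2Vec and BERT embeddings), which are not shown here, and the tiny dataset below is a placeholder:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Placeholder training data: augmented texts and their (possibly multiple) topic labels
train_texts = [
    "Service was good and the staff replied quickly",
    "The website layout is simple and easy to navigate",
    "Replies were prompt but the office was hard to find",
]
train_labels = [["communication"], ["user interface"], ["communication", "location"]]

# Binarise the label sets for multi-label classification
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(train_labels)

# One linear SVC per topic, trained on TF-IDF features
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(train_texts, y)

preds = model.predict(["Friendly staff and quick, clear replies"])
print(mlb.inverse_transform(preds))
```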
We can see that Expts 1, 3 and 5 produce very similar results.
Further Evaluation
As the test set is relatively small, and information would have leaked from the training set to the test set during the data augmentation process, further evaluation of the models has to be done. One way is to label additional data as a holdout set that has not been augmented in any way.
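Continuing the training sketch above, a holdout evaluation could look like this; holdout_texts and holdout_labels are hypothetical names for the newly labelled, unaugmented data, and mlb and model come from the earlier sketch:

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical holdout data that never passed through augmentation
holdout_texts = ["Replies were prompt and clear", "The site was simple to use"]
holdout_labels = [["communication"], ["user interface"]]

y_true = mlb.transform(holdout_labels)   # mlb was fitted during training
y_pred = model.predict(holdout_texts)    # model was trained on the augmented data

# Weighted F1 is the metric quoted throughout this article
print(f1_score(y_true, y_pred, average="weighted", zero_division=0))
print(classification_report(y_true, y_pred, target_names=mlb.classes_, zero_division=0))
```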
We see that model performance dips significantly on the holdout set, although Word2Vec still remains the best. Some topics such as “location” and “facilities” are also harder for the model to pick up, possibly due to the vague nature of the texts. This way of evaluating may not be fully reflective, as a small holdout set will see large swings in percentages should a few predictions be off. Furthermore, language is subjective in nature, and even two people can differ in how a given text should be labelled. It is thus imperative to devise your own way of validating the model’s performance.
In my context, I decided to inspect the category “Others”, which consists of texts that the models deemed to fit none of the topics. While there are legitimate texts that belong in “Others”, most texts should fall under at least one topic, especially long ones. The quality of predictions in this category therefore gives us an intuition of whether the model is able to interpret the text. I thus inspected the five longest texts classified as “Others” from each of Expts 1, 3 and 5 (green highlights indicate what I felt should be the correct classification; I was also ready to accept reasonable variations, e.g. secondary topics that were not the main point, which I would accept even if the model had not picked them up).
We can observe that the tuned BERT model is not doing as well as originally thought, even though its F1 score on the test set was over 90%. This is ironic given that the language model was tuned to the context, but it could also have been over-tuned. Together with further random validation checks, I am convinced that the Word2Vec model is the best and has satisfactorily classified my texts.
Sentiment Analysis
While it’s important to know what customers like and dislike about our service or product, higher emphasis is usually placed on negative reviews. To further distinguish between positive and negative reviews, I used HuggingFace’s transformer models to label my documents.
Though bart-large-mnli is typically used for text classification, it performs surprisingly well for sentiment analysis tasks.
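A minimal sketch of this, assuming the same zero-shot pipeline with “positive” and “negative” as the candidate labels (consistent with the output shown below):

```python
from transformers import pipeline

# Re-use the zero-shot classifier, but with sentiments as the candidate labels
sentiment = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
print(sentiment("The price is $50", candidate_labels=["positive", "negative"]))
```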
Output:
{'sequence': 'The price is $50',
'labels': ['positive', 'negative'],
'scores': [0.5607846975326538, 0.4392153024673462]}
Here, I chose bart-large-mnli over distilbert-base-uncased because I wanted the set of reviews flagged as negative to be clean (i.e. high precision on the negative class), at the expense of missing out on some negative reviews.
Final Deliverable
To help stakeholders quickly understand the data as well as provide flexibility to obtain different cuts, I created a simple interactive Tableau dashboard to aid data exploration.
You can also play around with a simple app I deployed on HuggingFace!
Conclusion
Even with the advance of NLP techniques, it is notoriously difficult to get a good sense of a topic model’s accuracy without sufficient labelled data. With this pipeline, despite some of its inherent flaws (e.g. data leakage), I could train a decent classification model with only a small amount of labelled data. In other words, small efforts for disproportionately large gains. The pipeline also provides plenty of room to experiment and fine-tune to the domain problem in the data augmentation and model training phases, which is typically not possible in unsupervised learning problems.
Future Work
I hope to test my pipeline on another labelled, multi-label dataset for further validation, as I was not able to do so given the tight project timeline.
Do let me know if this has helped you!