Thinking fast and distributions.

"Thinking, Fast and Slow," written by Nobel Prize winner Daniel Kahneman, is the gold standard for understanding how we make fast decisions.

And knowing how to make quick decisions is an important skill in any role.

As data analysts, we usually look for different distributions in our datasets.

We love to see when datasets (at least mostly) fit or reflect a certain distribution.

This is because we can develop "heuristics" on how different KPIs or metrics behave.

This can help you make decisions faster as well. It also helps a lot in knowing what to expect. This is important for developing good intuition, and having good intuition can lead to savings, avoiding mistakes, and attracting more luck on your side when you make the next call.

Thus, let's get you to know the distributions of 2 metrics (engagement rate & CPC), so you can also start "thinking fast and distributions".


The Power Law.

I was wondering what can be learned about the engagement that happens on LinkedIn.

And I asked myself: "What is the distribution that the engagement rate on LinkedIn follows?"

It might not seem like a big deal to some (unless you're a SoMe manager).

So later in this article, I'll also analyze 1 more metric that is more cost-oriented if that's what you are interested in.

Nonetheless, there are many benefits to knowing your engagement rate distribution, including:

  • Calculating the most probable engagement rate of a campaign.
  • Estimating what your next engagement rate most likely will be.
  • What type of content falls in the upper part of the distribution with a higher engagement rate?
  • What content type is not performing great and we should make less of it?
  • Which campaign is not engaging enough and sticking as a negative outlier?
  • How likely is it that we are going to see a 10% engagement rate on this post?
  • What engagement rate range can we observe with 68-95-99% probability?

To find the answer to my question, I "ChatGPT-ed it" using the following prompt:

Context: (You are a data analyst specializing in SoMe analytics.)
Details: (Write down an answer that is no longer than 300 characters.)
Targeted social media: (LinkedIn.)
If you were to choose, what distribution would the engagement rate on LinkedIn posts follow?        

And ChatGPT-4 came up with this answer:

For LinkedIn posts, the engagement rate typically follows a right-skewed distribution, with most posts receiving low engagement and a few posts achieving exceptionally high engagement.

A highly right-skewed distribution is also called the "Power Law". It's a distribution with very few observations on the right side, where you observe the high values of your variable of interest, while the majority of observations are concentrated on the left side, where you observe the lowest values.

It looks something like this:

Simulated right-skewed "Power Law" distribution (source: own production in Python)
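As a quick numerical sketch of what such a distribution looks like (the Pareto shape parameter below is an arbitrary illustrative choice, not fitted to any real data):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative "Power Law" sample: a Pareto draw shifted so the minimum
# is 1. The shape parameter a=2.0 is an arbitrary choice for the sketch.
samples = rng.pareto(a=2.0, size=10_000) + 1

# Heavy right skew shows up as the mean sitting well above the median:
# the long right tail pulls the mean up.
print(f"median: {np.median(samples):.2f}")
print(f"mean:   {samples.mean():.2f}")
print(f"max:    {samples.max():.2f}")
```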

When I asked it to double-check whether it meant the so-called "Power Law", a highly right-skewed distribution, ChatGPT-4 confirmed that it did:

Conversation with ChatGPT about what distribution the engagement rate on LinkedIn follows (source: own production)

Initially, it seemed accurate.

You would expect to have A LOT of posts on LinkedIn that get a low engagement rate.

And then only a few would be sticking out with a higher engagement rate, creating a long tail.

But, when I thought about it more, it got me thinking...

Wouldn't it actually make more sense that the distribution would be more......normal?

Taking into account that you have a SoMe manager who creates "average" content...

...shouldn't there be a few posts at both ends? A few "losers" receiving low and a few "winners" receiving high engagement rates, with most posts falling in the middle?

Did this very small note: "ChatGPT can make mistakes. Consider checking important information," just become true?

Did ChatGPT just make a mistake?

Is your engagement rate on LinkedIn following the "Power Law" distribution?

To confirm or deny what ChatGPT told me, I decided to put it to the test.

I got my hands on a dataset including a sample of engagement rates from 3 different anonymized LinkedIn pages.

I then started to simulate the distributions of different, randomly selected samples:

  • n=5 LinkedIn posts
  • n=13 LinkedIn posts
  • n=20 LinkedIn posts
  • n=289 LinkedIn posts

...and I saw some surprising results.
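The sampling procedure can be sketched as follows; since the real dataset is anonymized and private, a simulated population with the parameters estimated later in the article (mean 5.11%, standard deviation 2.39%) stands in for it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for the private dataset: engagement rates drawn
# from a normal distribution with the article's estimated parameters,
# clipped at 0 since a rate cannot be negative.
population = np.clip(rng.normal(0.05112, 0.02387, size=5_000), 0, None)

# Draw the same sample sizes as in the article and bin each sample.
for n in (5, 13, 20, 289):
    sample = rng.choice(population, size=n, replace=False)
    counts, _ = np.histogram(sample, bins=10)
    print(f"n={n:>3}: bin counts {counts.tolist()}")
```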

Histogram (n=5 posts)

Distribution of 5 randomly picked engagement rates on LinkedIn (source: own production in Python)

All right!

This is looking promising!

It looks like there could be a hint of the mentioned "Power Law".

Let's keep going and increase the sample size.

Histogram (n=13 posts)

Distribution of 13 randomly picked engagement rates on LinkedIn (source: own production in Python)

Ok.

Even with as few as 13 posts, this is starting to look a little different.

Let's increase the sample size again.

Histogram (n=20 posts)

Distribution of 20 randomly picked engagement rates on LinkedIn (source: own production in Python)

Not looking good for ChatGPT at this point...

Let's see what happens if we increase the sample size to 1 year's worth of posts on LinkedIn.

Histogram (n=289 posts)

Distribution of 1 year's worth of posts, including 289 engagement rates on LinkedIn (source: own production in Python)

In the end, the intuition was right.

The engagement rate on LinkedIn seems to be approximately normally distributed around the mean of 0.05112 (5.11%), with a fairly high standard deviation of 0.02387 (+-2.39%).

This would be quite in line with this analysis that found that the average engagement rate on LinkedIn for an Italian property management company was 0.06 (6%).

Perhaps it's a little right-skewed, as there are 151 values to the right of the mean and 138 to the left.

But, I think it's fair to say that this resembles a bell-shaped normal distribution (especially if you "smooth it out" with KDE), which you can see in the simulation I prepared below.

Simulation of different sample sizes for the engagement rate on LinkedIn using KDE; Kernel Density Estimation (source: own production in Python)

By placing all 4 distributions on one visual with the calculated Kernel Density Estimation (KDE), you can observe how the engagement rate distribution changes with a bigger sample size.

The Kernel Density Estimation (KDE) will help you understand the distribution of the engagement rate on LinkedIn by estimating the probability density function.

This function can provide insights into the shape, spread, and modality of the engagement rate on LinkedIn.

It simply allows for a smoother and more interpretable representation of the distribution compared to a histogram.
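A minimal KDE sketch using `scipy.stats.gaussian_kde`, again with simulated data standing in for the 289 real engagement rates:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)

# Simulated engagement rates with the article's estimated parameters,
# clipped at 0 since a rate cannot be negative.
rates = np.clip(rng.normal(0.05112, 0.02387, size=289), 0, None)

# Fit a Gaussian KDE and evaluate the smoothed density on a grid.
kde = gaussian_kde(rates)
grid = np.linspace(0.0, 0.15, 200)
density = kde(grid)

# For a bell-shaped sample, the density should peak near the sample mean.
peak = grid[np.argmax(density)]
print(f"density peaks near {peak:.3f} (sample mean {rates.mean():.3f})")
```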

The first, smallest sample (blue) hinted in favor of the "Power Law" distribution, with a low engagement rate (1.36%) having the highest probability.

But, as the sample increases and becomes more representative of the population, the distribution starts to flatten on the left side and morphs into what resembles a normal distribution.

We can confirm this by performing the Shapiro-Wilk test, which tests the null hypothesis that your data is normally distributed.

Results of The Shapiro-Wilk test; with a significance level of 0.05, we fail to reject that the data is normally distributed (source: own production in Python)
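A sketch of the Shapiro-Wilk test with `scipy.stats.shapiro` (on simulated data, since the real sample is private):

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(2)

# Simulated stand-in for the 289 engagement rates.
rates = rng.normal(0.05112, 0.02387, size=289)

# Null hypothesis: the data comes from a normal distribution.
stat, p_value = shapiro(rates)
print(f"W = {stat:.4f}, p = {p_value:.4f}")

# With alpha = 0.05, a p-value above 0.05 means we fail to reject
# normality -- the article's conclusion for the real data.
if p_value > 0.05:
    print("fail to reject normality at alpha = 0.05")
```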

What actually drives the engagement rate on LinkedIn?

When you find the distribution of your variable of interest, you can start using it as your heuristic.

However, you can also start building your case for your SoMe manager by forming different hypotheses (and then either accepting or rejecting them).

H1: Posts with videos on LinkedIn have higher engagement rates.

To investigate, we can first plot the two distributions on a single chart.

Combined distribution plot of the engagement rate; all posts (blue); posts where the content type = video (orange); Y-axis shows the KDE values, not the count (source: own production in Python)

This allows us to see whether it makes sense to perform a t-test to check for a difference between the two means.

But, as can be observed, the posts with videos seem to form a similar normal distribution, suggesting that H1 should be rejected.

To make sure, we can still perform the t-test and find out whether the test can confirm what we've observed.

The results of the t-test confirm that the two means are NOT significantly different at the significance level of 0.05 (source: own production in Python)
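A sketch of the t-test with `scipy.stats.ttest_ind`; the group sizes here are hypothetical, and both groups are drawn from the same distribution to mirror the "no difference" outcome:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)

# Hypothetical groups drawn from the same normal distribution,
# mirroring the article's "no significant difference" outcome.
all_posts = rng.normal(0.0511, 0.0239, size=250)
video_posts = rng.normal(0.0511, 0.0239, size=39)

# Welch's t-test (equal_var=False) is the safer default when group
# sizes and variances may differ.
t_stat, p_value = ttest_ind(all_posts, video_posts, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")

if p_value >= 0.05:
    print("means are NOT significantly different at alpha = 0.05")
```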

Ok, so what else?

I hypothesized about "reposts".

H2: Posts with >=10 reposts on LinkedIn have higher engagement rates.

Combined distribution plot of the engagement rate; all posts (blue); posts with reposts >= 10 (orange); Y-axis shows the KDE values, not the count (source: own production in Python)

Bingo!

This is a very nice example of finding what is driving your engagement rate on LinkedIn using distributions.

To confirm this, we can once again perform a t-test to compare these two means.

And by doing so, we find a very low p-value.

The results of the t-test confirm that the two means ARE significantly different at the significance level of 0.05 (source: own production in Python)

The results from the t-test lead us to confirm what we have observed with our eyes on the chart.

The posts that have gotten >= 10 reposts show a statistically significantly higher engagement rate than those that have not.

I can then go to our amazing Brand Lead & SoMe Manager and let her know about these results.

If she wants to try to increase her overall engagement rate, she can then get creative and focus on content that people will want to repost on their profiles.

Other than that, based on this dataset from 3 anonymized LinkedIn pages, I can also let her know about some additional conclusions, so she can have some benchmarks in mind:

  • Using the empirical (68-95-99.7) rule, we can say that, with 68% probability, the engagement rate on LinkedIn posts should fall between 2.72% and 7.50% (i.e., within one standard deviation of the mean).
  • Calculating the Z-scores, there's a 50% likelihood that the posts will have an engagement rate higher than the average of 5.11%.
  • There's an 11.33% likelihood that the posts will achieve an engagement rate that is higher than 8%.
  • And, there is only a 2.03% likelihood that the post will exhibit an engagement rate higher than 10%.
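Assuming the normal fit above (mean 5.11%, standard deviation 2.39%), these benchmarks can be reproduced with `scipy.stats.norm`; a sketch, not the author's original code:

```python
from scipy.stats import norm

# Parameters estimated from the 289-post sample in the article.
mean, sd = 0.05112, 0.02387

# ~68% of values fall within one standard deviation of the mean
# (the empirical 68-95-99.7 rule for a normal distribution).
low, high = mean - sd, mean + sd
print(f"68% band: {low:.2%} to {high:.2%}")  # roughly 2.72% to 7.50%

# Z-scores turn thresholds into tail probabilities via the survival
# function (1 - CDF).
p_above_8 = norm.sf(0.08, loc=mean, scale=sd)
p_above_10 = norm.sf(0.10, loc=mean, scale=sd)
print(f"P(rate > 8%):  {p_above_8:.2%}")   # roughly 11.3%
print(f"P(rate > 10%): {p_above_10:.2%}")  # roughly 2.0%
```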

From now on, she will be able to use these distributions as heuristics in the future and start "thinking fast and distributions".

Is your cost-per-click (CPC) on LinkedIn following the "Power Law" distribution?

We didn't find evidence that the engagement rate on LinkedIn follows the highly right-skewed distribution.

However, I wasn't ready to give up on the "Power Law" just yet.

And I asked myself: "Would cost-per-click (CPC) on LinkedIn follow the Power Law distribution?"

In a new conversation, I wrote down a similar prompt and gave ChatGPT-4 a second attempt to redeem itself:

ChatGPT conversation about what distribution the CPC on LinkedIn follows (source: own production)

As seen in the screenshot above, ChatGPT once again answered yes:

On LinkedIn, the Cost Per Click (CPC) distribution typically follows a right-skewed distribution due to a range of bids and competition levels across different industries.

This time, the answer made sense.

Ideally, you would want your CPC to follow the "Power Law" if you or your advertising agency is doing a good job.

Most of your ads on LinkedIn should show low CPC values, and as you move toward higher CPC values, these should become increasingly rare.

Thus, I decided to create a hypothesis:

H1: The CPC on LinkedIn follows the right-skewed "Power Law" distribution.

To test this, I got my hands on a dataset containing 1,144 CPC values (in $) in the IT sector.

CPC LinkedIn distribution; histogram (blue); KDE (blue line); Y-axis shows the KDE values, not the count (source: own production in Python)

Bingo!

It's fair to say that this time, the CPC on LinkedIn does indeed seem to follow the right-skewed "Power Law".

To simulate the "Power Law", randomized samples of 10, 50, 500, and 1,144 CPCs on LinkedIn were created and visualized below.

Simulation of different sample sizes for the CPC on LinkedIn using KDE; Kernel Density Estimation (source: own production in Python)

By placing all 4 randomized distributions on one visual with the calculated Kernel Density Estimation (KDE), you can observe how the CPC distribution changes with a bigger sample size.

As the sample increases and becomes more representative of the population, the distribution consistently develops a longer tail, rises on the left side, and morphs into what resembles the "Power Law".

I can then go to our great CMO and Director of Revenue Marketing, and let them know about these results.

They can then anticipate that most clicks on LinkedIn should come at a lower cost, while a few clicks will be significantly more expensive: the highly right-skewed "Power Law" suggests that spikes in CPC can occur due to less frequent but highly competitive targeting and ad placements, represented in the long tail.

Other than that, based on this sample dataset containing 1,144 CPCs, I can also let them know about some probability conclusions, so they can have some benchmarks in mind:

  • There is a 64.77% probability for the CPC to be above $1.
  • There is a 51.66% probability for the CPC to be above $5 (=50/50 breakpoint).
  • There is an 18.01% probability for the CPC to be above $25.
  • There is a 7.87% probability for the CPC to be above $50.
  • There is a 1.14% probability for the CPC to be above $100.
  • There is a 0.09% probability for the CPC to be above $150.
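Since the 1,144 real CPC values aren't public, this sketch uses a lognormal draw (parameters chosen only to produce a similar heavy right tail) and computes empirical survival probabilities, which need no distributional assumption:

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated stand-in for the 1,144 CPC values (in $): a lognormal
# draw whose parameters are illustrative, not fitted to real data.
cpc = rng.lognormal(mean=1.6, sigma=1.2, size=1_144)

# Empirical survival probabilities: the share of observed CPCs above
# each threshold.
for threshold in (1, 5, 25, 50, 100, 150):
    p = (cpc > threshold).mean()
    print(f"P(CPC > ${threshold:>3}) = {p:.2%}")
```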

From now on, they will also be able to use this "Power Law" in the CPC context as a heuristic in the future and start "thinking fast and distributions".


Here are 3 other quick examples of distributions observed in marketing.

Customer Acquisition Cost (CAC). In real life, this metric will often follow a normal distribution (especially in larger datasets where the central limit theorem applies).

Customer Lifetime Value (CLV). CLV can often follow a right-skewed distribution similar to the "Power Law" (here is also where you can apply the "Pareto Principle" which, in fact, is the "Power Law" in itself; 80% of your revenue will most likely come from only 20% of your customers).

Website Traffic. The number of visitors to a website usually follows a Poisson distribution (if you are measuring the count of visitors arriving in fixed intervals of time).
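A quick sketch of the last point; the hourly rate of 12 visitors is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(5)

# Visitors arriving per hour at an average rate of 12/hour, over one
# month of hourly intervals, modeled as Poisson counts.
visits_per_hour = rng.poisson(lam=12, size=24 * 30)

# A Poisson distribution has mean equal to variance -- a quick check
# for whether count data is plausibly Poisson.
print(f"mean:     {visits_per_hour.mean():.2f}")
print(f"variance: {visits_per_hour.var():.2f}")
```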


What's in it for you?

I hope you now grasp the importance of recognizing distributions in real life...

...and how a) spotting, b) remembering, and c) recalling them as your heuristics can lead to faster decision-making, savings, and attracting more luck on your side when you make the next call.

And whatever your next call is, I hope the luck will be on your side :)
