3) A Pivotal Moment

In this week’s edition, we dig into a scenario you’ve probably run across when working in Excel or other software such as Minitab. At least I have, many times.

Say you have a complete dataset. The data has been collected, and you’re now getting ready to plot it or run some sort of analysis on it. It should be plug and play, but it often isn’t: the data is not formatted in the right way, and you can’t run your analysis (it happens pretty frequently, if you ask me).

The data “not being in the right format” can mean many different things, and we won’t go over all of them, at least not today. One example, though, is a dataset where one of the variables you’re interested in was captured in groups, spread over multiple columns. This may have made sense while the data were being collected, but for analysis purposes it often presents a challenge. Let’s take the example below, where data were collected during a manufacturing process characterization study. The table shows that data on an important product characteristic, the tensile strength of a bond, were captured from three consecutive machine cycles on each of five manufacturing lines, over three different time periods, and the results from each manufacturing line were recorded in a separate column.

[Table: bond strength data in wide format, with a Time column plus one column per manufacturing line (Line 1 through Line 5)]


This makes perfect sense from the perspective of the engineer jotting things down during the study, but for analysis purposes you want all of the observations in one column, with another column saying which line each observation belongs to (just like the Time column).

Now, you could fix this in Excel with some copy-and-paste jockeying or by trying to transpose the data. If the dataset is relatively small, this can be done pretty quickly, but for larger datasets you’re looking at a lot of wasted time.

In R, you can do this in a couple of relatively simple steps. In one simple step, actually, but we’ll do more than just the bare minimum.

You can download the script here if you want to follow along.

Let’s start by loading the packages we’ll be using: the tidyverse and sherlock. The tidyverse is a collection of packages such as dplyr, readr, stringr, ggplot2, tidyr etc., each of which has its own set of functionality. Sometimes you’ll want to load just one of them, say dplyr, but a lot of the time simply loading the tidyverse will do the trick.
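Here is what that looks like (everything loaded here gets used later in this walkthrough):

```r
library(tidyverse)  # dplyr, tidyr, stringr, ggplot2, readr, and friends
library(sherlock)   # load_file(), theme_sherlock(), draw_categorical_scatterplot()
```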

We’ll then use sherlock’s load_file() function, as described last week, to read in the dataset from my GitHub repository. Make sure to change the filetype argument to “.csv” as this is a .csv file.

We’ll save the dataset into memory as bond_strength_wide (referring to the current wide format of the data)—we’ll transform it into the format we need in no time though!
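A sketch of the import step is below. The GitHub URL shown is a placeholder rather than the actual repository path, and the path argument name is my assumption; the filetype argument is the one called out above.

```r
# Read the .csv file from GitHub; the URL below is a placeholder
bond_strength_wide <- load_file(
  path     = "https://raw.githubusercontent.com/user/repo/main/bond_strength.csv",
  filetype = ".csv"
)
```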

Here are the steps (a code sketch follows the list):

  • Use dplyr’s mutate() function to create a column called Cycle, using base R’s rep() function to generate the repeating sequence 1:3, and convert it into a factor variable.
  • Use tidyr’s pivot_longer() function to convert the data frame into a long format.
  • Use dplyr’s mutate() function to update the Line column. The update we are going to make is removing the string “Line ” from each observation to make the labels more concise.
  • Then use dplyr’s arrange() function to arrange by the columns Time and Line. This is not really necessary for plotting purposes but provides a way to verify that you did everything right.
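Here is a sketch of those steps, assuming the wide dataset’s line columns are named “Line 1” through “Line 5” and the measurement column ends up being called “Bond Strength” (both names are assumptions on my part):

```r
bond_strength_long <- bond_strength_wide %>%
  # 1. number the three consecutive cycles within each time period
  mutate(Cycle = as_factor(rep(1:3, length.out = n()))) %>%
  # 2. pivot the line columns into Line / Bond Strength pairs
  pivot_longer(
    cols      = starts_with("Line"),
    names_to  = "Line",
    values_to = "Bond Strength"
  ) %>%
  # 3. remove the "Line " prefix so the labels read 1 through 5
  mutate(Line = str_remove(Line, "Line ")) %>%
  # 4. arrange by Time and Line to make verification easier
  arrange(Time, Line)
```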

Let’s run bond_strength_long by placing the cursor on it and hitting Ctrl + Enter.

[Console output: bond_strength_long printed as a tibble, with the columns Time, Cycle, Line, and Bond Strength]

It looks like everything checks out, and we are ready to plot the data.

As a first step, we are going to plot the data using a technique called stratification, where the data are grouped and plotted by a specific variable. We do this to separate the data by that variable and, ultimately, to see what kind of differences exist between the groups.

We are going to unleash the power of ggplot2 to do this.
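Here is a sketch of that code (the title text is a placeholder of mine):

```r
bond_strength_long %>%
  ggplot(aes(x = Line, y = `Bond Strength`)) +  # blank canvas: x and y mappings only
  geom_point() +                                # scatterplot layer
  theme_sherlock() +                            # minimalistic theme from sherlock
  labs(
    title = "Bond Strength by Manufacturing Line",  # placeholder title
    y     = "Bond Strength"                         # custom y axis label
  )
```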

Let me briefly explain the above code.

First, we take the bond_strength_long dataset and pipe it (using %>%) into the ggplot() function and specify what we want plotted on the x and y axes. This time we want to plot Bond Strength on the y axis and the Line variable on the x axis. This essentially creates a blank “canvas” for the plot — nothing has been plotted just yet.

After that, we use the + operator to add layers to the base canvas. First, we add the function for the type of visual we want to create, which in this case is a scatterplot (the geom_point() function).

We then add a custom theme, which dictates the appearance of the plot. There are many out-of-the-box themes one can use, for example theme_minimal(), theme_bw() etc.; I tend to use theme_sherlock() from the sherlock package for a minimalistic look.

And finally, to top things off, we add a title and a custom label for the y axis using the labs() function.

This is what the plot looks like:

[Plot: scatterplot of Bond Strength by manufacturing Line]

Not too bad for a first try, right?

Now, we are going to further simplify what we just did by using a ready-made plotting function called draw_categorical_scatterplot(), which achieves the same thing while adding additional functionality.
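A sketch of the call is below. The function comes from the sherlock package, but the argument names shown here are my best guesses and may differ from the package’s actual interface:

```r
# One-liner equivalent of the ggplot2 code above (argument names assumed)
bond_strength_long %>%
  draw_categorical_scatterplot(
    y_var          = `Bond Strength`,
    grouping_var_1 = Line
  )
```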

[Plot: the same scatterplot, drawn with draw_categorical_scatterplot()]

With this function you can also:

  • Group (stratify) by up to three variables in a nested fashion
  • Plot the means of each group or connect them with a line
  • Set whether each group is displayed in a separate color
  • Set the size and transparency of the data points
  • Add jitter (a little bit of noise along the x axis) to deal with overplotting

To recap, in this week’s edition we went over how to do a basic data pivoting transformation and plotted the data both using basic ggplot2 functions and a ready-made sherlock function, draw_categorical_scatterplot().

That’s it for this week—we will continue exploring this dataset next week.


Thanks for reading this newsletter! Reach out with any questions you may have.

Download this week’s script here.

