3) A Pivotal Moment

In this week’s edition, we dig into a scenario you’ve probably run across when working in Excel or other software such as Minitab. At least I have, many times.

Say you have a complete dataset. The data has been collected, and you’re now getting ready to plot it or run some sort of analysis on it. It should be plug and play, but it often isn’t: the data is not formatted in the right way, and you can’t run your analysis (it happens pretty frequently, if you ask me).

The data “not being in the right format” can mean many different things, and we won’t go over all of them, at least not today. One example, though, is a dataset where one of the variables you’re interested in was captured in groups, spread over multiple columns. This may have made sense while the data were being collected, but for analysis purposes it often presents a challenge. Let’s take the example below, where data were collected during a manufacturing process characterization study. The table shows that data on an important product characteristic, the tensile strength of a bond, were captured from three consecutive machine cycles on each of five manufacturing lines, over three different time periods, and the results from each manufacturing line were recorded in a separate column.

[Table: bond strength data in wide format, with a Time column plus one column per manufacturing line (Line 1 through Line 5)]


This makes perfect sense from the perspective of the engineer jotting things down during the study, but for analysis purposes you want all of the observations in one column, with another column saying which line each observation belongs to (just like the Time column).

Now, you could fix this in Excel with some copy-and-paste jockeying or by trying to transpose the data. If the dataset is relatively small, this can be done pretty quickly, but for larger datasets you’re looking at a lot of wasted time.

In R, you can do this in a couple of relatively simple steps. In one simple step, actually, but we’ll do more than just the bare minimum.

You can download the script here if you want to follow along.

Let’s start by loading the packages we’ll be using: the tidyverse and sherlock. The tidyverse is a collection of packages such as dplyr, readr, stringr, ggplot2, tidyr etc., each of which has its own set of functionality. Sometimes you’ll want to load just one of them, say dplyr, but a lot of the time simply loading the tidyverse will do the trick.
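Here is what that looks like (everything loaded here gets used later in this walkthrough):

```r
library(tidyverse)  # dplyr, tidyr, stringr, ggplot2, readr, and friends
library(sherlock)   # load_file(), theme_sherlock(), draw_categorical_scatterplot()
```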

We’ll then use sherlock’s load_file() function, as described last week, to read in the dataset from my GitHub repository. Make sure to change the filetype argument to “.csv” as this is a .csv file.

We’ll save the dataset into memory as bond_strength_wide (referring to the current wide format of the data)—we’ll transform it into the format we need in no time though!
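A sketch of the import step is below. The GitHub URL shown is a placeholder rather than the actual repository path, and the path argument name is my assumption; the filetype argument is the one called out above.

```r
# Read the .csv file from GitHub; the URL below is a placeholder
bond_strength_wide <- load_file(
  path     = "https://raw.githubusercontent.com/user/repo/main/bond_strength.csv",
  filetype = ".csv"
)
```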

Here are the steps (a code sketch follows the list):

  • Use dplyr’s mutate() function to create a column called Cycle, using base R’s rep() function to generate the repeating sequence 1:3, and convert it into a factor variable.
  • Use tidyr’s pivot_longer() function to convert the data frame into a long format.
  • Use dplyr’s mutate() function to update the Line column. The update we are going to make is removing the string “Line ” from each observation to make the labels more concise.
  • Then use dplyr’s arrange() function to arrange by the columns Time and Line. This is not really necessary for plotting purposes but provides a way to verify that you did everything right.
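Here is a sketch of those steps, assuming the wide dataset’s line columns are named “Line 1” through “Line 5” and the measurement column ends up being called “Bond Strength” (both names are assumptions on my part):

```r
bond_strength_long <- bond_strength_wide %>%
  # 1. number the three consecutive cycles within each time period
  mutate(Cycle = as_factor(rep(1:3, length.out = n()))) %>%
  # 2. pivot the line columns into Line / Bond Strength pairs
  pivot_longer(
    cols      = starts_with("Line"),
    names_to  = "Line",
    values_to = "Bond Strength"
  ) %>%
  # 3. remove the "Line " prefix so the labels read 1 through 5
  mutate(Line = str_remove(Line, "Line ")) %>%
  # 4. arrange by Time and Line to make verification easier
  arrange(Time, Line)
```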

Let’s run bond_strength_long by placing the cursor on it and hitting Ctrl + Enter.

[Console output: bond_strength_long printed as a tibble, with the columns Time, Cycle, Line, and Bond Strength]

It looks like everything checks out, and we are ready to plot the data.

As a first step, we are going to plot the data using a technique called stratification, where the data are grouped and plotted by a specific variable. We do this to separate the data by that variable and, ultimately, to see what kind of differences exist between the groups.

We are going to unleash the power of ggplot2 to do this.
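Here is a sketch of that code (the title text is a placeholder of mine):

```r
bond_strength_long %>%
  ggplot(aes(x = Line, y = `Bond Strength`)) +  # blank canvas: x and y mappings only
  geom_point() +                                # scatterplot layer
  theme_sherlock() +                            # minimalistic theme from sherlock
  labs(
    title = "Bond Strength by Manufacturing Line",  # placeholder title
    y     = "Bond Strength"                         # custom y axis label
  )
```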

Let me briefly explain the above code.

First, we take the bond_strength_long dataset and pipe it (using %>%) into the ggplot() function and specify what we want plotted on the x and y axes. This time we want to plot Bond Strength on the y axis and the Line variable on the x axis. This essentially creates a blank “canvas” for the plot — nothing has been plotted just yet.

After that, we use the + operator to add layers to the base canvas. First, we add the function for the type of visual we want to create, which in this case is a scatterplot (the geom_point() function).

We then add a custom theme, which dictates the appearance of the plot. There are many out-of-the-box themes one can use, for example theme_minimal(), theme_bw() etc.; I tend to use theme_sherlock() from the sherlock package for a minimalistic look.

And finally, to top things off, we add a title and a custom label for the y axis using the labs() function.

This is what the plot looks like:

[Plot: scatterplot of Bond Strength by manufacturing Line]

Not too bad for a first try, right?

Now, we are going to further simplify what we just did by using a ready-made plotting function called draw_categorical_scatterplot(), which achieves the same thing while adding additional functionality.
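A sketch of the call is below. The function comes from the sherlock package, but the argument names shown here are my best guesses and may differ from the package’s actual interface:

```r
# One-liner equivalent of the ggplot2 code above (argument names assumed)
bond_strength_long %>%
  draw_categorical_scatterplot(
    y_var          = `Bond Strength`,
    grouping_var_1 = Line
  )
```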

[Plot: the same scatterplot, drawn with draw_categorical_scatterplot()]

With this function you can also:

  • Group (stratify) by up to three variables in a nested fashion
  • Plot the means of each group or connect them with a line
  • Set whether each group is displayed in a separate color
  • Set the size and transparency of the data points
  • Add jitter (a little bit of noise along the x axis) to deal with overplotting

To recap, in this week’s edition we went over how to do a basic data pivoting transformation and plotted the data both using basic ggplot2 functions and a ready-made sherlock function, draw_categorical_scatterplot().

That’s it for this week—we will continue exploring this dataset next week.


Thanks for reading this newsletter! Reach out with any questions you may have.

Download this week’s script here.

