Iron Ore Incident Investigation: An Analysis of Manufacturing Processes
Created using Canva by Megan Smith

Introduction

The United States produced about 46 million metric tons of iron ore in 2022. This only accounted for about 2% of the total iron ore mined globally. For the past 50 years in the United States, more than 90% of iron ore production has been mined from Lake Superior deposits. Being from Michigan, the Great Lakes State, I found learning more about this mining process especially intriguing. Iron ore is a raw material used for building bridges, trains and rails, buildings, bikes, and much more. It’s everywhere, including smartphones, and even paint and some cosmetics. You can learn more about where iron ore is present in various products here.

The Mining Process

Iron ore is not found in pure form. It is mixed in with dirt, rock, and other minerals. Flotation plants are commonly used to separate valuable iron ore from impurities. Here is a basic rundown of the process:

  1. Crush and Grind: Crush the rocks down to create a fine powder.
  2. Mix with Water: Mix the fine powder with water to create a thick mixture called slurry, or pulp.
  3. Add Chemicals: Three types of chemicals are added to the slurry: collectors, modifiers, and depressants. Collectors, like amines, are designed to attach to the surfaces of iron ore particles so they can be easily separated from impurities later. Modifiers control the pH level of the slurry; maintaining the correct pH affects how well the collectors attach to the iron ore particles. Depressants, like starch, prevent unwanted minerals such as silica from adhering to the bubbles where the iron ore particles attach in the next phase of the mining process.
  4. Create Bubbles: Air is blown into the slurry to create a frothy layer. With the added chemicals, the iron ore particles attach to the bubbles.
  5. Collect the Froth: The iron ore bubble froth is skimmed off the top of the slurry.
  6. Separate: The collected froth may still contain other impurities but is rich in iron ore. Then, it is processed further.
  7. Dry and Prepare for Shipping: The concentrated iron ore is dried and prepared for shipping to be used in iron and steel factories.

The data for this project focuses on steps three through six of this process.

Objective

For this project, I’m assuming a data analyst role for a mining company to analyze data from a flotation plant. I have been tasked with the following main objectives:

  1. Identify correlations within the dataset.
  2. Investigate if something significant happened on June 1st.
  3. Determine how the critical columns changed throughout the day on the first of June.

Key Findings

  1. The pH levels of the ore pulp are always between 8.8 and 10.8, which is within the target range.
  2. Strong linear relationships exist between the percent silica and iron feeds and the percent silica and iron concentrations.
  3. Fluctuations in critical columns on June 1st were within the normal range, indicating there was not an incident on the first of June.

The Data

The data for this project is from March 2017 to September 2017 and can be found on Kaggle, a community platform where data analysts and data scientists can discover and publish datasets. This is real-world data from a flotation mining plant in South America. The collected data is from one portion of the iron ore refining process.

This dataset contains over 730,000 samples or changes in the flotation process. The timestamp column reveals that some samples were taken every 20 seconds, while others were taken hourly. The columns containing the percent iron feed and percent silica feed are quality measures of the iron ore before it is fed into the flotation plant. The columns that state the starch flow, Amina flow, ore pulp flow, ore pulp pH, and ore pulp density are the most critical variables that impact the ore quality at the end of the process. The airflow and the slurry froth level columns also affect ore quality. Lastly, the percent iron concentrate and percent silica concentrate are measured again at the end of the flotation process.

The data was downloaded as a comma-separated values (CSV) file and then imported into Python, a programming language commonly used for data analysis, through Deepnote. Like Jupyter, Deepnote is an integrated development environment (IDE) that allows you to write and run Python code and visualize data, but it is a cloud-based platform.

Python Libraries Used in This Project:

  • Pandas
  • Seaborn
  • Matplotlib  
  • NumPy

Analysis: Data Overview

To begin this analysis, I imported the Python libraries Pandas, Matplotlib, Seaborn, and NumPy, as shown in the code below:

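A minimal sketch of that import cell, using the standard aliases:

```python
# Standard aliases for the four libraries used throughout this analysis.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```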

Then, I used the Pandas library to import the dataset into Python. To shorten the name of the CSV file, I called it “df” for the data frame, as shown below:


To get a snapshot of the data, I utilized the method “head()” on the data frame object to show the first five rows of the dataset. Below is the code and a portion of the corresponding output:


Notice how elements within the columns have a comma instead of a decimal point. This problem would make any analysis inaccurate. As a result, I wrote the Python code below to change each comma within the dataset to a period and save over the original data frame. This way, each time I call “df” in the code, it will use the modified table with decimal points in the appropriate places. Below is the code and snapshot of the corrected data frame.

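The import, preview, and comma-to-decimal cleanup can be sketched together as follows. A small inline sample stands in for the full Kaggle CSV (on the real file, the same `pd.read_csv` call is used with the downloaded filename), and the column names are assumed to match the Kaggle dataset:

```python
import io

import pandas as pd

# Inline sample standing in for the downloaded Kaggle CSV, which uses
# commas as decimal separators (e.g. "55,2" instead of 55.2).
sample = io.StringIO(
    "date,% Iron Feed,% Silica Feed\n"
    '2017-03-10 01:00:00,"55,2","16,98"\n'
    '2017-03-10 01:00:00,"55,2","16,98"\n'
)
df = pd.read_csv(sample)

print(df.head())  # the first rows still show commas inside the numbers

# Replace every comma with a decimal point, convert the numeric columns
# to floats, and save over the original data frame.
df = df.replace(",", ".", regex=True)
for col in df.columns.drop("date"):
    df[col] = df[col].astype(float)

print(df.head())  # the numbers now have proper decimal points
```

As an aside, pandas can also handle this at load time via `pd.read_csv(..., decimal=",")`.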

Next, I wanted to know the size of the dataset. So, I used the “shape” attribute within Pandas to retrieve the dataset's number of rows and columns. Below is the code that I utilized for this overview:


The output for this code was:

  • (737453, 24)

This means there were 737,453 rows, which represent samples or various measurements taken during the flotation process, and 24 columns that represent the percent iron and silica entering the plant, chemical measures, airflow adjustments, froth levels, and final percentages of iron and silica concentrates.

Next, I wanted a quick rundown of each column's name so I could easily refer back to them throughout my analysis. Below is the code and corresponding output:

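The size check and the column rundown can both be sketched on a small stand-in frame (on the full dataset, `df.shape` returns `(737453, 24)`); the column names here are assumed from the Kaggle file:

```python
import pandas as pd

# Tiny stand-in frame with a few of the dataset's 24 columns.
df = pd.DataFrame(
    {
        "date": ["2017-03-10 01:00:00"],
        "% Iron Feed": [55.2],
        "% Silica Feed": [16.98],
        "Ore Pulp pH": [9.77],
    }
)

print(df.shape)          # (rows, columns); (737453, 24) on the full dataset
print(list(df.columns))  # quick rundown of every column name
```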

In many datasets, dates are stored as strings or other data types, which can be problematic when performing a time-based analysis. I used the following code to determine how dates were classified within the dataset:


The output for this code was:

  • <class 'str'>

'str' means the data within this column is categorized as text. Since the dates must be treated as timestamps for any time-based analysis, I wrote the following code to make this adjustment:

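Both the type check and the conversion can be sketched as follows (a two-row stand-in replaces the full "date" column):

```python
import pandas as pd

df = pd.DataFrame({"date": ["2017-03-10 01:00:00", "2017-03-10 01:00:20"]})

# The raw CSV stores the dates as plain text.
print(type(df["date"].iloc[0]))  # <class 'str'>

# Convert the column to genuine timestamps.
df["date"] = pd.to_datetime(df["date"])
print(type(df["date"].iloc[0]))  # <class 'pandas._libs.tslibs.timestamps.Timestamp'>
```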

The output for this code was:

  • <class 'pandas._libs.tslibs.timestamps.Timestamp'>

This means I successfully changed the “Date” column from being classified as text to a date or timestamp.

Now, I wanted to determine the timeframe for the dataset. So, I used the following code to determine the starting and ending dates:

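A sketch of the date-range check, with two timestamps standing in for the full column:

```python
import pandas as pd

# Two timestamps standing in for the full "date" column.
df = pd.DataFrame(
    {"date": pd.to_datetime(["2017-03-10 01:00:00", "2017-09-09 23:00:00"])}
)

print("The min date is", df["date"].min())
print("The max date is", df["date"].max())
```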

The output for this code was:

  • The min date is 2017-03-10 01:00:00
  • The max date is 2017-09-09 23:00:00

This means the dataset contains mining data from the flotation plant from March 10, 2017, through September 9, 2017.

Analysis: Column Aggregations

Next, I wanted to get some basic aggregations on each column. Python makes this very easy using the “describe” method, as shown below:

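A sketch of the describe() call on a stand-in pH column:

```python
import pandas as pd

# Five pH readings standing in for the full Ore Pulp pH column.
df = pd.DataFrame({"Ore Pulp pH": [8.8, 9.5, 9.8, 10.1, 10.8]})

# describe() returns count, mean, std, min, the quartiles, and max
# for every numeric column in one call.
stats = df.describe()
print(stats)
```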

The descriptive statistics in the table above show:

  • Count: The number of non-missing values in each column
  • Mean: The average
  • Standard Deviation (std): A measure of dispersion that shows how spread out the data is from the mean
  • Minimum (min): The smallest value in the column
  • 25th Percentile (25%): The first quartile, the value below which 25% of the data falls
  • 50th Percentile (50%): The median, which separates the data into two equal halves
  • 75th Percentile (75%): The third quartile, the value below which 75% of the data falls
  • Maximum (max): The largest value in the column

The target range of pH levels of the ore pulp, or slurry, should always stay between seven and eleven. Looking at the last column displayed above, Ore Pulp pH, the pH level is always between 8.8 and 10.8, so this is right on track. This means the pH levels were very stable when this data was collected, which is a good sign that the flotation plant is running well.

The percent iron concentrate, measured at the end of the flotation process, has a mean of 65%, nearly 10 percentage points higher than the average iron content entering the plant. Additionally, the percent iron concentrate has a range of 5.96% and a standard deviation of 1.12%. Since these measures of spread are relatively small, the flotation plant successfully purifies the iron ore.

Analysis: Identifying Linear Relationships

Then, I created a correlation coefficient heatmap to determine if there were any strong correlations between variables in the dataset. A correlation coefficient, or r-value, is always between –1 and 1 and tells you how closely two variables follow a line of best fit, or linear regression line. Values where the absolute value of r is greater than or equal to 0.8 are considered relatively strong correlations, with 1 or –1 representing the strongest correlations, where every data point lies precisely on the linear regression line. r-values close to zero represent variables with relatively no correlation. Below is the Python code and corresponding output for this visualization.

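A sketch of the heatmap on synthetic data with one strong negative pair; the exact colors depend on the colormap passed to Seaborn, and `vlag` is used here as an assumption:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data: the iron and silica feeds move in opposite directions,
# while pH is roughly independent of both.
rng = np.random.default_rng(0)
iron = rng.normal(56, 5, 200)
df = pd.DataFrame(
    {
        "% Iron Feed": iron,
        "% Silica Feed": 72 - iron + rng.normal(0, 1, 200),
        "Ore Pulp pH": rng.normal(9.8, 0.3, 200),
    }
)

corr = df.corr()  # pairwise r-values, each between -1 and 1
plt.figure(figsize=(6, 5))
sns.heatmap(corr, vmin=-1, vmax=1, cmap="vlag", annot=True)
plt.tight_layout()
plt.savefig("correlation_heatmap.png")
```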

As previously discussed, r-values greater than or equal to 0.8 or less than or equal to -0.8 are desirable when identifying linear relationships between variables. So, in the heatmap above, we are looking for squares, or cells, that are shaded either navy blue or white. Navy blue cells represent positive correlations between variables, while white cells represent negative linear relationships. In the visualization above, most variables show relatively no linear correlation. However, a few navy blue cells exist between consecutive flotation air flow columns. It is not surprising that a linear relationship may exist as the airflow adjusts from one cell to the next as the pulp or slurry moves through the flotation plant.

Two cells are nearly white and are circled in red in the correlation heatmap above. One is a negative linear relationship between the percent silica feed and the percent iron feed, with a correlation coefficient of –0.97, and the other is a negative linear relationship between the percent silica concentration and the percent iron concentration, with a correlation coefficient of –0.80. These strong linear relationships show that as the percent silica feed increases, the percent iron feed tends to decrease, and vice versa; the same relationship exists between the percent silica and iron concentrations. Since the ore entering the plant is composed largely of iron minerals and silica, a higher percentage of one naturally means a lower percentage of the other, and a flotation process that removes silica effectively should preserve that inverse relationship in the concentrates. This indicates that the flotation plant was functioning correctly when this data was collected.

Analysis: Pair Plot of Critical Columns

Management at the flotation plant indicated that an incident on June 1st may have occurred and needs to be investigated. Based on criteria provided by management, critical columns of the data were identified. Below is the Python code and the first few rows of data for the first of June. 

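The June 1st filter can be sketched like this (a three-row stand-in replaces the full dataset, and the column names are assumed from the Kaggle file):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2017-05-31 23:00:00", "2017-06-01 00:00:00", "2017-06-01 13:00:00"]
        ),
        "% Iron Concentrate": [65.1, 64.8, 63.9],
    }
)

# Keep only the samples taken on the first of June.
june_first = df[df["date"].dt.date == pd.Timestamp("2017-06-01").date()]
print(june_first.head())
```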

Filtering the data for June 1st still resulted in over 4,000 rows of data. While Python gives a quick overview of the data in each column, it is not enough to conclude something significant occurred. As a result, to determine relationships between variables within the dataset, I created a pair plot using Seaborn. This visualization summarizes a large amount of data into a single figure, making it easy to identify correlations between variables if they exist. Below is the Python code for the pair plot and corresponding output.


In the pair plot above, joint and marginal distributions are plotted simultaneously. The grey histograms along the diagonal estimate the marginal distributions of each numerical feature of the dataset. The linear regression lines on each scatterplot above emphasize that there is relatively no correlation between any of the given variables. Comparing this to the average day at the flotation plant may reveal if something significant occurred on June 1st. As observed earlier, with the correlation heat map of the whole dataset, there was a strong negative linear correlation between the percent silica concentration and the percent iron concentration. However, as outlined in red in the above pair plot, the data does not closely follow the line of best fit, indicating that there is relatively no linear correlation between these two variables on June 1st. 

To confirm the lack of correlation between these two variables on June 1st, I created another correlation heat map in Python with these same critical columns, as shown below.


Since the r-value is –0.27, there is relatively no linear correlation between the percent silica concentration and the percent iron concentration on June 1st. However, the correlation coefficient for these two variables across the whole dataset is –0.80. This contrast in r-values seems atypical and may indicate that an incident occurred on June 1st, but further analysis is required to be sure.

Analysis: Change in Linear Correlation over Time

Since the relationship between the percent silica and the percent iron concentrations drastically changed on the first of June, I wondered how often this occurs. If it happens regularly, it may be essential to identify a pattern to the loss of this linear relationship to understand how best to optimize this flotation plant. So, I wrote Python code that compares the daily correlation coefficients between the percent silica concentration and the percent iron concentration over the span of the provided dataset to see what relationships, if any, exist. Below is the code and corresponding visualization.

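The daily-correlation comparison can be sketched on a few days of synthetic hourly data with a built-in negative relationship between the two concentrates:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Five days of synthetic hourly samples standing in for the full dataset.
rng = np.random.default_rng(2)
dates = pd.date_range("2017-05-30", "2017-06-03 23:00:00", freq="h")
iron = rng.normal(65, 1, len(dates))
df = pd.DataFrame(
    {
        "date": dates,
        "% Iron Concentrate": iron,
        "% Silica Concentrate": 67 - iron + rng.normal(0, 0.3, len(dates)),
    }
)

# One correlation coefficient per calendar day.
daily_r = df.groupby(df["date"].dt.date).apply(
    lambda g: g["% Iron Concentrate"].corr(g["% Silica Concentrate"])
)

plt.figure(figsize=(8, 4))
plt.plot(daily_r.index, daily_r.values, marker="o")
plt.axhline(0.8, color="red", linestyle="--")   # strong positive threshold
plt.axhline(-0.8, color="red", linestyle="--")  # strong negative threshold
plt.ylabel("Daily r: silica vs. iron concentrate")
plt.tight_layout()
plt.savefig("daily_correlation.png")
```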

The dashed red horizontal lines represent correlation coefficients of 0.8 and –0.8. Points above 0.8 or below –0.8 indicate strong linear correlations for that day. The first of June is circled in red in the above visualization. While the correlation between these two variables across the whole dataset tends to be strong (r = –0.80), it is not unusual for some days to have relatively no correlation between the percent silica concentration and the percent iron concentration. It should be noted that when there is relatively no linear relationship between these two variables, it occurs for a couple of days at most. Then, the strong linear relationship is present again. As a result, there is still no indication that an incident occurred on June 1st.

Analysis: Hourly Changes on June 1st

Investigating further, I created Python code to visualize hourly changes on June 1st within the critical columns management was concerned about. Below is the code utilizing a for-loop to show all graphs and the corresponding visualizations.

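The hourly for-loop can be sketched on a synthetic day of data; the critical column names are assumptions based on the Kaggle file:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# 24 synthetic hourly samples standing in for the June 1st slice.
rng = np.random.default_rng(3)
hours = pd.date_range("2017-06-01 00:00:00", periods=24, freq="h")
june_first = pd.DataFrame(
    {
        "date": hours,
        "% Iron Concentrate": rng.normal(65, 1, 24),
        "% Silica Concentrate": rng.normal(2.3, 0.5, 24),
        "Ore Pulp pH": rng.normal(9.8, 0.2, 24),
        "Flotation Column 05 Level": rng.normal(480, 30, 24),
    }
)

critical_columns = [
    "% Iron Concentrate",
    "% Silica Concentrate",
    "Ore Pulp pH",
    "Flotation Column 05 Level",
]

# One line chart per critical column, each showing the hourly trend.
for i, col in enumerate(critical_columns):
    plt.figure(figsize=(8, 3))
    plt.plot(june_first["date"].dt.hour, june_first[col], marker="o")
    plt.title(f"{col} on June 1st")
    plt.xlabel("Hour of day")
    plt.ylabel(col)
    plt.tight_layout()
    plt.savefig(f"june1_{i}.png")
    plt.close()
```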

The above visualization shows that the percent iron concentration dips at 1:00 p.m. (13:00) on the first of June and again at the end of the day. Although these dips occur, they represent less than a 4% change and are within normal limits.


Like the percent iron concentration, the percent silica concentration fluctuated throughout the day on the first of June but stayed within a 4% range, which is within normal limits.


Ore Pulp pH also fluctuates but is still within a stable range, indicating that the flotation plant is running well.


The flotation level in cell five takes a significant dip at 6:00 p.m. (18:00) on June 1st. This column shows froth levels within the fifth cell and indicates the thickness of the froth. The lower the level, the higher the concentration grade of the ore being processed. This drastic change could mean one of several things, such as:

  1. A change in the feed material entering this cell, such as ore composition, particle size, or feed rate.
  2. Reduced air supply, leading to poor froth formation and decreased flotation.
  3. Changes in chemicals such as collectors, modifiers, and depressants.
  4. Froth removal from this cell.

While this may cause some concern, this dip in flotation level also occurred in other flotation cells on days the plant ran flawlessly. So, although there were some variations in the samples collected on June 1st, they were still within the acceptable range. This means the flotation plant was running smoothly on June 1st, and there is no cause for concern.

Flotation Plant Findings and Next Steps

  1. The pH levels of the ore pulp are always between 8.8 and 10.8, which is within the target range even on the first of June, the day of a possible incident.
  2. There are few strong linear relationships across the dataset except for the percent silica and iron feeds and the percent silica and iron concentrations. Both of those relationships have very strong negative linear correlations, which means that as one variable increases, the other decreases. This flotation plant successfully produces iron ore concentrations with a small standard deviation, which means low variability. Both of these are signs that the manufacturing process in this plant is functioning well.
  3. While some fluctuation exists in values within critical columns, they are still within the normal range for a successful flotation plant. This includes fluctuations observed on the first of June, meaning no incident occurred on June 1st.
  4. The next steps would include creating predictive models for the percent of iron and silica concentrations. Since these are measures taken at the end of the flotation process, creating these models would help the engineers and scientists tweak the process to avoid pitfalls before they occur and maximize the percent iron ore concentration while minimizing the percent silica concentration.

Feedback

As a math educator and data analyst, applying my strong background in mathematics to this project was exciting. I am constantly learning and growing my data skills. I value constructive feedback, so please feel free to message me to share any insights or suggestions. I’m also actively pursuing a data analyst role. If you know of any opportunities, I’d love to learn more. Connect with me on LinkedIn, check out my portfolio, and stay tuned for more of my data projects in the future.
