Iron Ore Incident Investigation: An Analysis of Manufacturing Processes
Introduction
The United States produced about 46 million metric tons of iron ore in 2022, which accounted for only about 2% of the iron ore mined globally. For the past 50 years, more than 90% of U.S. iron ore production has been mined from Lake Superior deposits. As someone from Michigan, the Great Lakes State, I found learning more about this mining process especially intriguing. Iron ore is a raw material used for building bridges, trains and rails, buildings, bikes, and much more. It's everywhere, including in smartphones, paint, and even some cosmetics.
The Mining Process
Iron ore is not found in pure form. It is mixed in with dirt, rock, and other minerals. Flotation plants are commonly used to separate valuable iron ore from impurities. Here is a basic rundown of the process:
The data for this project focuses on steps three through six of this process.
Objective
For this project, I’m assuming a data analyst role for a mining company to analyze data from a flotation plant. I have been tasked with the following main objectives:
Key Findings
The Data
The data for this project spans March 2017 to September 2017 and can be found on Kaggle, a community platform where data analysts and data scientists can discover and publish datasets. It is real-world data from an iron ore flotation plant in South America, covering one portion of the iron ore refining process.
This dataset contains over 730,000 samples, each a timestamped set of measurements from the flotation process. The timestamp column reveals that some variables were sampled every 20 seconds, while others were recorded hourly. The percent iron feed and percent silica feed columns are quality measures of the ore before it enters the flotation plant. The starch flow, amina flow, ore pulp flow, ore pulp pH, and ore pulp density columns are the most critical variables affecting ore quality at the end of the process. The air flow and froth level columns for each flotation cell also affect ore quality. Lastly, the percent iron concentrate and percent silica concentrate are measured again at the end of the flotation process.
The data was downloaded as a Comma Separated Values (CSV) file and then imported into Python, a computer programming language used to perform data analysis, through Deepnote. Like Jupyter, Deepnote is an interactive development environment (IDE) that allows you to write and run Python code and visualize data, but it is a cloud-based platform.
Python Libraries Used in This Project: Pandas (data manipulation), NumPy (numerical computing), Matplotlib (plotting), and Seaborn (statistical visualization).
Analysis: Data Overview
To begin this analysis, I imported the Python libraries Pandas, Matplotlib, Seaborn, and NumPy, as shown in the code below:
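A minimal sketch of those imports:

```python
# Core libraries for data handling, numerical work, and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```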
Then, I used the Pandas library to read the CSV file into Python. For convenience, I assigned the resulting table to a variable called "df", short for data frame, as shown below:
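A sketch of that step (the CSV file name here is assumed from the Kaggle download and may differ from the file used in the original project):

```python
# Read the flotation plant CSV into a Pandas DataFrame named "df"
df = pd.read_csv("MiningProcess_Flotation_Plant_Database.csv")
```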
To get a snapshot of the data, I utilized the method “head()” on the data frame object to show the first five rows of the dataset. Below is the code and a portion of the corresponding output:
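The call itself is simply:

```python
# Preview the first five rows of the dataset
df.head()
```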
Notice how elements within the columns have a comma instead of a decimal point. This problem would make any analysis inaccurate. As a result, I wrote the Python code below to change each comma within the dataset to a period and save over the original data frame. This way, each time I call “df” in the code, it will use the modified table with decimal points in the appropriate places. Below is the code and snapshot of the corrected data frame.
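A sketch of that cleanup step, assuming the timestamp column is named "date" and every other column holds numbers stored as text with decimal commas:

```python
# Swap decimal commas for periods, then convert the numeric columns back to floats
df = df.replace(",", ".", regex=True)
numeric_cols = [col for col in df.columns if col != "date"]
df[numeric_cols] = df[numeric_cols].astype(float)
df.head()
```

(The same result can also be achieved in one step with pd.read_csv(..., decimal=","), but the replace-and-convert approach mirrors the description above.)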
Next, I wanted to know the size of the dataset. So, I used the “shape” attribute within Pandas to retrieve the dataset's number of rows and columns. Below is the code that I utilized for this overview:
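The attribute in question:

```python
# Return the number of rows and columns as a (rows, columns) tuple
df.shape
```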
The output for this code was (737453, 24). This means there were 737,453 rows, which represent samples or various measurements taken during the flotation process, and 24 columns, which represent the percent iron and silica entering the plant, chemical measures, airflow adjustments, froth levels, and final percentages of iron and silica concentrates.
Next, I wanted a quick rundown of each column's name so I could easily refer back to them throughout my analysis. Below is the code and corresponding output:
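One way to print them, as a sketch:

```python
# List every column name for quick reference
print(df.columns.tolist())
```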
In many datasets, dates are commonly stored as strings or other data types, which can be problematic when performing an analysis. I used the following code to determine how dates were classified within this dataset:
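A sketch of that check, again assuming the timestamp column is named "date":

```python
# Inspect how the first entry of the date column is stored
print(type(df["date"].iloc[0]))
```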
The output for this code was 'str', which means the data within this column is stored as text. Since dates should be stored as timestamps rather than text, I wrote the following code to make the conversion:
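A sketch of the conversion:

```python
# Convert the date column from text to proper timestamps, then re-check the type
df["date"] = pd.to_datetime(df["date"])
print(type(df["date"].iloc[0]))
```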
The output for this code confirmed that I successfully changed the date column from being classified as text to a timestamp.
Now, I wanted to determine the timeframe for the dataset. So, I used the following code to determine the starting and ending dates:
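A sketch of that check:

```python
# Earliest and latest timestamps in the dataset
print(df["date"].min())
print(df["date"].max())
```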
The output for this code showed a starting date of March 10, 2017, and an ending date of September 9, 2017, meaning the dataset contains roughly six months of mining data from the flotation plant.
Analysis: Column Aggregations
Next, I wanted to get some basic aggregations on each column. Python makes this very easy using the “describe” method, as shown below:
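The call in question:

```python
# Summary statistics (count, mean, std, min, quartiles, max) for each numeric column
df.describe()
```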
The descriptive statistics in the table above show:
The target pH range for the ore pulp, or slurry, is seven to eleven. Looking at the last column displayed above, Ore Pulp pH, the pH level always stays between 8.8 and 10.8, so this is right on track. The pH levels were very stable when this data was collected, which is a good sign that the flotation plant was running well.
The percent iron concentrate, measured at the end of the flotation process, has a mean of 65%. That is more than a 10% increase over the iron content entering the flotation plant. Additionally, the percent iron concentrate has a range of 5.96% and a standard deviation of 1.12%. Since these measures of spread are relatively small, the flotation plant is consistently and successfully purifying the iron ore.
Analysis: Identifying Linear Relationships
Then, I created a correlation coefficient heatmap to determine whether there were any strong correlations between variables in the dataset. A correlation coefficient, or r-value, is always between –1 and 1 and tells you how closely two variables follow a line of best fit, or linear regression line. Values of r with an absolute value of at least 0.8 are considered relatively strong correlations, with 1 or –1 representing the strongest correlations, where every data point falls precisely on the linear regression line. r-values close to zero indicate variables with relatively no linear correlation. Below is the Python code and corresponding output for this visualization.
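A sketch of the kind of code that produces such a heatmap (the figure size and the Blues colormap, which shades strongly positive values navy and strongly negative values white, are assumptions chosen to match the description below):

```python
# Compute pairwise correlation coefficients and draw them as a heatmap
plt.figure(figsize=(14, 10))
corr = df.corr(numeric_only=True)
sns.heatmap(corr, cmap="Blues")
plt.title("Correlation Coefficient Heatmap")
plt.show()
```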
As previously discussed, r-values greater than or equal to 0.8 or less than or equal to -0.8 are desirable when identifying linear relationships between variables. So, in the heatmap above, we are looking for squares, or cells, that are shaded either navy blue or white. Navy blue cells represent positive correlations between variables, while white cells represent negative linear relationships. In the visualization above, most variables show relatively no linear correlation. However, a few navy blue cells exist between consecutive flotation air flow columns. It is not surprising that a linear relationship may exist as the airflow adjusts from one cell to the next as the pulp or slurry moves through the flotation plant.
Two cells are nearly white and are circled in red in the correlation heatmap above. One is a negative linear relationship between the percent silica feed and the percent iron feed, with a correlation coefficient of –0.97, and the other is a negative linear relationship between the percent silica concentrate and the percent iron concentrate, with a correlation coefficient of –0.80. These strong linear relationships show that as the percent silica feed increases, the percent iron feed tends to decrease, and vice versa; the same relationship holds between the silica and iron concentrates. Since silica is the main impurity that flotation removes from the ore, a higher share of silica necessarily means a lower share of iron, so the negative relationship in the feed is expected. The equally strong negative relationship in the concentrates indicates that the flotation plant was separating silica from iron as intended when this data was collected.
Analysis: Pair Plot of Critical Columns
Management at the flotation plant indicated that an incident may have occurred on June 1st and needed to be investigated. Based on criteria provided by management, critical columns of the data were identified. Below is the Python code and the first few rows of data for the first of June.
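A sketch of that filtering step. The column names are taken from the Kaggle dataset, and the exact set of critical columns is an assumption based on the variables discussed in the rest of this section:

```python
# Columns treated as critical for the June 1st investigation (assumed list)
critical_cols = [
    "% Iron Concentrate",
    "% Silica Concentrate",
    "Ore Pulp pH",
    "Flotation Column 05 Level",
]

# Keep only the samples recorded on June 1, 2017
june1 = df[(df["date"] >= "2017-06-01") & (df["date"] < "2017-06-02")]
june1[critical_cols].head()
```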
Filtering the data for June 1st still resulted in over 4,000 rows of data. While a quick numerical summary of each column is easy to produce, it is not enough to conclude that something significant occurred. So, to examine the relationships between variables within this subset, I created a pair plot using Seaborn. This visualization summarizes a large amount of data in a single figure, making it easy to identify correlations between variables if they exist. Below is the Python code for the pair plot and the corresponding output.
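A sketch of the pair plot, reusing the june1 subset and critical_cols list assumed above:

```python
# Pair plot: histograms on the diagonal, scatterplots with regression lines elsewhere
sns.pairplot(
    june1[critical_cols],
    kind="reg",
    diag_kind="hist",
    plot_kws={"line_kws": {"color": "red"}},
    diag_kws={"color": "grey"},
)
plt.show()
```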
In the pair plot above, joint and marginal distributions are plotted simultaneously. The grey histograms along the diagonal estimate the marginal distributions of each numerical feature of the dataset. The linear regression lines on each scatterplot above emphasize that there is relatively no correlation between any of the given variables. Comparing this to the average day at the flotation plant may reveal if something significant occurred on June 1st. As observed earlier, with the correlation heat map of the whole dataset, there was a strong negative linear correlation between the percent silica concentration and the percent iron concentration. However, as outlined in red in the above pair plot, the data does not closely follow the line of best fit, indicating that there is relatively no linear correlation between these two variables on June 1st.
To confirm the lack of correlation between these two variables on June 1st, I created another correlation heat map in Python with these same critical columns, as shown below.
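A sketch of that heatmap:

```python
# Correlation heatmap restricted to the critical columns on June 1st
plt.figure(figsize=(6, 5))
sns.heatmap(june1[critical_cols].corr(), cmap="Blues", annot=True, fmt=".2f")
plt.title("Correlation Coefficients on June 1st")
plt.show()
```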
Since the r-value is –0.27, there is relatively no linear correlation between the percent silica concentration and the percent iron concentration on June 1st. However, the correlation coefficient for these two variables across the whole dataset is –0.80. This contrast in r-values seems atypical and may indicate that an incident occurred on June 1st, but further analysis is required to be sure.
Analysis: Change in Linear Correlation over Time
Since the relationship between the percent silica and percent iron concentrations changed so drastically on the first of June, I wondered how often this occurs. If it happens regularly, it may be essential to identify a pattern in the loss of this linear relationship in order to understand how best to optimize the flotation plant. So, I wrote Python code that computes the daily correlation coefficient between the percent silica concentration and the percent iron concentration across the span of the provided dataset to see what patterns, if any, exist. Below is the code and corresponding visualization.
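A sketch of that computation, again assuming the concentrate column names from the Kaggle dataset:

```python
# Correlation between % silica and % iron concentrate for each calendar day
daily_corr = df.groupby(df["date"].dt.date).apply(
    lambda day: day["% Silica Concentrate"].corr(day["% Iron Concentrate"])
)

plt.figure(figsize=(14, 5))
plt.plot(daily_corr.index, daily_corr.values, marker="o")
plt.axhline(0.8, color="red", linestyle="--")
plt.axhline(-0.8, color="red", linestyle="--")
plt.title("Daily Correlation: % Silica Concentrate vs. % Iron Concentrate")
plt.xlabel("Date")
plt.ylabel("Correlation coefficient (r)")
plt.show()
```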
The dashed red horizontal lines represent correlation coefficients of 0.8 and –0.8. Points above 0.8 or below –0.8 indicate strong linear correlations for that day. The first of June is circled in red in the above visualization. While the correlation between these two variables across the whole dataset tends to be strong (r = –0.80), it is not unusual for some days to have relatively no correlation between the percent silica concentration and the percent iron concentration. It should be noted that when there is relatively no linear relationship between these two variables, it occurs for a couple of days at most. Then, the strong linear relationship is present again. As a result, there is still no indication that an incident occurred on June 1st.
Analysis: Hourly Changes on June 1st
Investigating further, I created Python code to visualize hourly changes on June 1st within the critical columns management was concerned about. Below is the code utilizing a for-loop to show all graphs and the corresponding visualizations.
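A sketch of that loop, averaging each critical column by hour of day on June 1st:

```python
# Average each critical column within each hour of June 1st and plot the result
hourly = june1.groupby(june1["date"].dt.hour)[critical_cols].mean()

for col in critical_cols:
    plt.figure(figsize=(10, 3))
    plt.plot(hourly.index, hourly[col], marker="o")
    plt.title(f"{col} on June 1st")
    plt.xlabel("Hour of day (0-23)")
    plt.ylabel(col)
    plt.show()
```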
The above visualization shows that the percent iron concentration dips at 1:00 p.m. (13:00) on the first of June and again at the end of the day. Although these dips occur, they represent less than a 4% change and are within normal limits.
Similarly, the percent silica concentration fluctuated throughout the day on the first of June but remained within a 4% range, which is also within normal limits.
Ore Pulp pH also fluctuates but is still within a stable range, indicating that the flotation plant is running well.
The flotation level in cell five takes a significant dip at 6:00 p.m. (18:00) on June 1st. This column shows the froth level within the fifth cell and reflects the thickness of the froth; the lower the level, the higher the concentration grade of the ore being processed. A drastic change like this could have one of several causes.
While this may cause some concern, similar dips in flotation level occurred in other flotation cells on days when the plant ran flawlessly. So, although there were some variations in the samples collected on June 1st, they were still within an acceptable range. This means the flotation plant was running smoothly on June 1st, and there is no cause for concern.
Flotation Plant Findings and Next Steps
Feedback
As a math educator and data analyst, applying my strong background in mathematics to this project was exciting. I am constantly learning and growing my data skills. I value constructive feedback, so please feel free to message me to share any insights or suggestions. I’m also actively pursuing a data analyst role. If you know of any opportunities, I’d love to learn more. Connect with me on LinkedIn, check out my portfolio, and stay tuned for more of my data projects in the future.