Water utility pipes are crucial infrastructure: they deliver safe, clean potable water to citizens. But these buried assets deteriorate severely over time due to a variety of factors, which can lead to failures and significant system-wide disruptions. Such service disruptions can last for weeks, sometimes months, at a stretch, leaving citizens without any water supply. It is therefore important to understand the health of these assets and predict their risk of failure, so that maintenance can be strategized and replacements planned ahead of any potential failure, ensuring an uninterrupted supply of safe drinking water.
Water utilities have already started making use of data to predict pipe health and forecast risk and criticality. Advanced statistical models, drawing on age, material, diameter, location, likelihood and consequence of failure, and other environmental features, can provide actionable insights into pipe condition and help water utilities prioritize their maintenance and repair efforts.
In this article, I will walk through a typical data science workflow for assessing the condition of water pipes and forecasting their risk. Please note that this is not an elaborate, production-grade data science pipeline; however, it should give readers an idea of how data can be turned into insights.
At the start of our data science classes at Carnegie Mellon University, our professor would make an important declaration: using data science should not be a pretext for disregarding common sense. LOL!
A data science workflow for condition assessment and risk prediction of water pipes:
- Data Collection: Data is at the center of everything, especially now that we are talking about using data science! Step one is to collect data from utility records on the various parameters that affect the condition of water mains, such as year of installation, material, pipe class, diameter, depth below ground, groundwater level, soil pH, maintenance history, historical break reports, customer complaints, and proximity to important places like airports and schools.
- Data Preprocessing: As we can all guess, the data collected from the utility may (I want to use the word 'will') contain errors, missing or absurd values, and outliers. This data must be cleaned and preprocessed, which includes removing duplicates, handling missing values, and transforming fields into formats suitable for analysis (a minimal preprocessing sketch appears after this list).
- Exploratory Data Analysis (EDA): Once the data is cleaned, exploration is important for a preliminary statistical understanding. EDA gives us a better feel for the data and helps identify skewness, patterns, or trends, and it relies heavily on data visualization and statistical summaries (see the short EDA sketch after this list). I took an amazing class at the Human-Computer Interaction Institute (HCII) at CMU called 'Interactive Data Science', which was all about the grammar of data visualization and interactivity. If you want to dive deeper into data visualization, look up the foundational book 'The Grammar of Graphics' by Leland Wilkinson. Highly recommended!
- Feature Engineering: This step is about selecting and transforming data so that it can be fed to a machine learning model. Statistics, and ultimately data science, is all about numbers, yet our dataset will also contain non-numerical data: material, pipe pressure rating, installation date (datetime format), and so on. In this step, data scientists (and engineers) massage the dataset to turn everything into numbers. All non-numeric values must be encoded; for example, an installation year (1989) can be extracted from an installation date (1-1-1989). New features that provide additional insight into pipe condition are also created, such as calculating the age of a pipe from its installation year and the current year (see the feature engineering sketch after this list). Feature engineering is a massive field of study, and this is just the tip of the iceberg, meant as a quick and easy introduction.
- Model Selection: Choose an appropriate machine learning model to predict pipe risk based on the available data. This may be a regression, classification, or clustering model, depending on factors I will try to cover in a separate article.
- Very often, the utility dataset will not have a 'risk' feature in the data model, meaning the utility does not maintain asset-wise risk indices or indicators. If the dataset does not include a pre-defined risk or criticality feature for pipes, we can create one from a combination of other variables that are likely to influence pipe criticality. How the risk feature is constructed will depend on domain knowledge and the factors relevant to the particular water supply network under consideration. Here are some steps I have used in my previous experiments and studies:
- Identify variables that are likely to influence pipeline risk and consider consequence of failure factors.
- Assign a weight to each variable based on its relative importance in determining pipe risk. For example, historical break reports or age may be assigned a higher weight than proximity to schools.
- Combine the weighted variables to create a new feature that represents risk. One simple way to do this is to sum the weighted variables for each pipe. For example, if age is assigned a weight of 0.2, material a weight of 0.3, and historical break reports a weight of 0.5, the risk feature can be calculated as: risk = 0.2 * age + 0.3 * material + 0.5 * historical_break_reports (a worked sketch of this weighted combination appears after this list).
- Normalize the risk feature if necessary so that its values fall within a desired range, for example scaling the risk values to fall between 0 and 1 for ease of interpretation. Then use the newly created risk feature as the target variable in the workflow to train a model that can predict risk.
- Model Training: The dataset is split into training and testing subsets, generally in an 80:20 or 90:10 ratio, and the selected ML model is trained on the training set (see the training and evaluation sketch after this list).
- Model Evaluation: The trained model is then tested on the testing set using metrics appropriate to the selected ML model, such as a confusion matrix, accuracy, precision, recall, and F1 score, to evaluate predictive performance. More on this in a separate article.
- Model Optimization: Machine learning models can very quickly become computationally expensive, so performance optimization becomes very important before the model is deployed.
- Model Deployment: Once the model is optimized for performance, it can be deployed in a production environment to predict the risk of water pipes.
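Below are a few rough Python sketches for some of the steps above. They are illustrative only: the file names and column names (e.g., pipe_inventory.csv, diameter_mm, historical_breaks) are hypothetical placeholders for whatever fields your utility's records actually contain. First, a minimal preprocessing sketch with pandas: drop duplicates, parse dates, handle missing values, and remove implausible entries.

```python
import pandas as pd

# Load the utility's pipe inventory (hypothetical file and column names).
pipes = pd.read_csv("pipe_inventory.csv")

# Remove exact duplicate records.
pipes = pipes.drop_duplicates()

# Parse installation dates; unparseable values become NaT so they can be handled explicitly.
pipes["installation_date"] = pd.to_datetime(pipes["installation_date"], errors="coerce")

# Fill missing depth with the median, and drop rows missing critical attributes.
pipes["depth_m"] = pipes["depth_m"].fillna(pipes["depth_m"].median())
pipes = pipes.dropna(subset=["material", "diameter_mm", "installation_date"])

# Remove physically implausible values (e.g., non-positive or absurdly large diameters).
pipes = pipes[(pipes["diameter_mm"] > 0) & (pipes["diameter_mm"] < 3000)]

pipes.to_csv("pipe_inventory_clean.csv", index=False)
```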
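Next, a quick EDA sketch: summary statistics, the age profile of the network, and average break counts by material. Again, columns like historical_breaks are assumed names.

```python
import pandas as pd
import matplotlib.pyplot as plt

pipes = pd.read_csv("pipe_inventory_clean.csv", parse_dates=["installation_date"])

# Summary statistics for numeric columns: a quick way to spot skewness and outliers.
print(pipes.describe())

# Age profile of the network: older cohorts often dominate break counts.
pipes["installation_date"].dt.year.hist(bins=30)
plt.xlabel("Installation year")
plt.ylabel("Number of pipes")
plt.title("Age profile of the network")
plt.show()

# Average historical breaks by material: a first look at failure patterns.
print(pipes.groupby("material")["historical_breaks"].mean().sort_values(ascending=False))
```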
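A feature engineering sketch that derives pipe age from the installation date, one-hot encodes categorical attributes, and adds a simple derived feature. The pressure_class column is an assumed example.

```python
import pandas as pd

pipes = pd.read_csv("pipe_inventory_clean.csv", parse_dates=["installation_date"])

# Derive pipe age (in years) from the installation date.
current_year = pd.Timestamp.today().year
pipes["age_years"] = current_year - pipes["installation_date"].dt.year

# One-hot encode categorical attributes such as material and pressure class.
pipes = pd.get_dummies(pipes, columns=["material", "pressure_class"], drop_first=True)

# A simple derived feature: breaks per decade of service (treat anything under a decade as one).
pipes["breaks_per_decade"] = pipes["historical_breaks"] / (pipes["age_years"] / 10).clip(lower=1)

pipes.to_csv("pipe_features.csv", index=False)
```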
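Where the utility does not maintain a risk index, the weighted-sum approach above can be sketched as follows. The material susceptibility scores and the 0.2 / 0.3 / 0.5 weights (borrowed from the example above) are purely illustrative and should come from domain knowledge of the specific network.

```python
import pandas as pd

pipes = pd.read_csv("pipe_inventory_clean.csv", parse_dates=["installation_date"])
pipes["age_years"] = pd.Timestamp.today().year - pipes["installation_date"].dt.year

# Map material to an assumed failure-susceptibility score between 0 and 1 (domain judgment).
material_score = {"cast iron": 1.0, "asbestos cement": 0.8, "ductile iron": 0.5, "PVC": 0.3}
pipes["material_score"] = pipes["material"].map(material_score).fillna(0.5)

def min_max(series):
    """Scale a column to the 0-1 range so the weighted terms are comparable."""
    return (series - series.min()) / (series.max() - series.min())

# Weighted combination mirroring the example weights above.
pipes["risk"] = (
    0.2 * min_max(pipes["age_years"])
    + 0.3 * pipes["material_score"]
    + 0.5 * min_max(pipes["historical_breaks"])
)

print(pipes[["age_years", "material", "historical_breaks", "risk"]].head())
```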
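Finally, a training and evaluation sketch. Here I bin the engineered risk score into three classes and fit a random forest classifier (one reasonable choice among many), using an 80:20 split and the classification metrics mentioned above. The input file pipe_features_with_risk.csv is an assumed, fully numeric feature table that also carries the engineered risk column.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumed input: numeric features plus the engineered risk score.
pipes = pd.read_csv("pipe_features_with_risk.csv")

# Bin the continuous risk score into three classes for a classification setup.
pipes["risk_class"] = pd.qcut(pipes["risk"], q=3, labels=["low", "medium", "high"])

X = pipes.drop(columns=["risk", "risk_class"])
y = pipes["risk_class"]

# 80:20 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test set.
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```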
In conclusion, water pipes require careful maintenance and repair efforts to ensure the uninterrupted supply of safe drinking water. Data science can provide valuable insights into the condition of these pipelines and help prioritize maintenance and repair efforts by predicting their risk and criticality.
This article may feel overwhelming to some and very basic to others, but it sets the stage for my next article, in which I intend to walk through, in detail, Python code I wrote to analyze real-life data from a city's water network.
I hope you found this article useful. If so, I'm glad! Please consider sharing it in your network!
About the author: Tanay Kulkarni is a data scientist with a deep passion for water, wastewater, and stormwater infrastructure systems. He holds a Master of Science in Civil and Environmental Engineering from Carnegie Mellon University and a Master of Engineering in Civil Engineering from Pune University. Tanay works as a Data Scientist and Infrastructure Management Consultant at Freese and Nichols; he has previously worked at companies such as SewerAI and Bentley Systems, and was a co-founder and consulting engineer at DTK Hydronet Solutions. His skills span business development, product management, predictive analytics, machine learning, web application development, and data analytics, and he has experience with programming languages such as Python, SQL, and R. Tanay has worked on various projects related to water infrastructure, including creating predictive models for inspection prioritization and pipe breaks and developing a database management system for water infrastructure assets.