Optimising Data Modelling: Leveraging Power BI
Transforming Raw Data into Strategic Insights: A Comprehensive Case Study
Summary
In today's competitive business landscape, effective data modelling is crucial. This article explores how Power BI can transform data analysis in the upstream sector of the oil and gas industry through a real-world case study. Learn about the complexities of data modelling and how a well-structured model can simplify analysis, leading to smarter data-supported decisions in production activities.
Introduction
In the business world, clear data insights are key to staying competitive. But how can companies achieve this? The answer lies in business intelligence (BI) tools.
Data modelling is at the heart of the BI process. It involves structuring and organising data in a way that makes it accessible and useful for analysis. Through a well-designed data model, it is possible to simplify the analysis of complex data, facilitating informed decision-making.
This article explores data modelling in Power BI using a real-world case study from the oil and gas industry. It highlights the complexities and challenges of data modelling and demonstrates how a well-structured data model can transform data analysis, leading to more informed and strategic decisions.
The Challenge of Data Modelling
If your role involves data analysis, you’ve likely spent considerable time using Excel pivot tables and formulas. You might also be familiar with the engines in Power BI and Excel Data Model. These tools combine the full power of a database with the DAX (Data Analysis Expressions) language. However, power must be used wisely and responsibly!
I have often observed individuals struggling to calculate the figures they need to address specific situations or problems. Formulas can sometimes become overly complex, making them difficult to manage, or worse, leading to incorrect results. For example, on one occasion, a complex formula issue was traced back to an incorrect relationship in the data model. Once the model was corrected, the formula not only worked correctly but also became much simpler to write.
Is data modelling simple? Not necessarily. It can be complex and challenging, requiring significant effort to master and the ability to visualise the model across different scenarios. Think of it like navigating a ship through stormy seas —every decision must be carefully considered, with a deep understanding of the environment and potential obstacles.
In the oil and gas industry, where our case study is based, data modelling proved crucial for analysing vast amounts of production and injection data. By structuring the data model correctly, we were able to transform raw data into actionable insights, transforming chaos into conceptual order and aiding in more informed decision-making.
Data modelling can often be complex, challenging, and intellectually stimulating. However, one thing is certain: it is never boring! Mastering data modelling is essential if you want to enhance your analytical capabilities. It enables you to focus on making the right decisions rather than getting bogged down in complex formulas, ultimately leading to more efficient and accurate analyses. Wouldn’t you agree?
What is a Data Model?
In simple terms, a data model is a set of tables linked by relationships based on common keys between the tables. Even a single-table model is a data model, but relationships make the model much more powerful and interesting to analyse.
In this context, we’re referring to a relational or tabular data model, the type that Power BI works with. There are different types of data models, but this article will focus strictly on the relational type.
Case Study: Analysis of a Development Plan
Context of the Project
To understand the complexity and impact of this case study, it's important to first set the stage with some context. As the newly appointed Engineer Manager, I joined the Company at a pivotal moment, when the process of acquiring the exploitation rights of 11 onshore fields from Petrobras in Brazil, was in its final phase. This acquisition was a strategic move, as Petrobras had decided to shift its focus towards offshore developments, leaving behind mature yet still lucrative fields.
Disclaimer: the production and injection figures presented in this article have been deliberately altered to protect proprietary company information. These numbers are purely illustrative and do not represent the actual data. This ensures that the focus remains on the methods and techniques discussed, rather than on specific operational details.
Evaluating the Development Plan
Evaluating a development and production plan in the industry is an inherently iterative and highly technical process. It involves transdisciplinary collaboration among engineers, geologists, economists, and various other experts. Given the volatile nature of market conditions and fluctuating oil prices, our long-term strategy needed to be highly adaptable, requiring a flexible approach and a readiness to revise plans as circumstances evolved.
Key Factors
The Source Data Set
Understanding the source data set is crucial for the success of our development plan. The starting point consisted of production (O: oil, G: gas, W: water) and injection (WI: water and SI: steam) fluid forecasts provided by the Reservoir Development team, with the figures being monthly averages. The dataset included five original tables:
The first three tables, corresponding to the producing wells, differed only in the metrics, meaning each table contained the metric associated with a single production fluid (O, G, or W). These metrics, which are the numbers to be aggregated and processed, are located at each intersection of a row corresponding to a Well TAG and a column corresponding to a particular month. The data points associated with production are substantial, with 2,032,277 raw production data points spread across three tables, and 307,469 raw injection data points in two tables. This extensive amount of data provides a comprehensive view of production and injection activities, essential for thorough analysis and decision-making.
The configuration of these tables (not being in a tabular format), combined with the absence of relationships, makes it extremely challenging to develop accurate formulas and obtain meaningful insights. This complexity can hinder our ability to analyse and understand the development plan's impact on various departments of the Company. By implementing a well-structured tabular data model, we can simplify this process, ensuring that our analysis is both efficient and precise.
Clarification: each time "flow rate" is mentioned, it must be interpreted as forecasted values provided by the Reservoir Development team.
Constructing the Initial Data Model
Building an effective data model begins with a clear understanding of the different perspectives from which the information will be analysed. This step is crucial as it ensures that the data model is tailored to meet the diverse needs of all stakeholders, thereby providing meaningful and actionable insights. This foundational step helps in structuring the data model appropriately, making the subsequent analysis more efficient and effective.
The following visual map illustrates the diverse perspectives and analyses that I mapped out when constructing the initial data model for fluid production. This map highlights the key areas of focus, including executive overview, wells, fields, and facilities, ensuring that all critical aspects of production are captured and analysed effectively.
For Injection (Water and Steam), the visual maps were analogous to those for production, considering the particularities of each.
Having a full grasp of these perspectives was necessary because it influenced how we organised the data tables and relationships. With this clear understanding, I could ensure that the data model would support all necessary analyses, from high-level executive summaries to detailed well-specific evaluations.
The importance of comprehending the different perspectives in four points:
By focusing on these key perspectives, we built an initial data model that was robust, flexible, and capable of addressing the diverse analytical needs of the project. This approach not only simplified the development process but also ensured that the final model was highly effective in delivering the necessary insights.
Separation of Assets and Facts
The distinction between assets and facts leads to a data-modelling technique known as the star schema, where tables are classified into two categories:
Why star schemas? They are easy to comprehend and use. Dimensions are used to slice and dice the data, whereas numbers are aggregated from fact tables.
For this particular case, the metrics depend on the type of well:
This is why I decided to have three FACT tables, one for each well category group.
Initial data model tables
Fact Tables with granularity at the month and well levels:
Dimension Tables:
In a tabular data model, fact tables store quantitative data for analysis and are related to dimension tables, but dimensions must not have relationships among them that introduce ambiguity to the model. An ambiguous model presents not only a technical problem but also a logical one.
Establishing relationships between dimension (DIM) and fact (FACT) tables using keys like DateKey and Well TAG enabled the model to unleash its full potential.
Other observations:
Particularities of Snapshot Fact Tables
Computing values on top of snapshot tables hides some complexity because snapshots do not generate additive measures. Additive measures are those that can be summed across all dimensions in the data model. These measures are straightforward to aggregate and are often used in various analyses.
Semi-additive measures can be summed across some dimensions but not all. These measures pose challenges because they cannot be simply added up across certain dimensions like time. Non-additive measures are those that cannot be summed across any dimension. These measures require special calculations or aggregations that do not involve summing up values.
Mixing changes in granularity with semi-additivity can be problematic. The formulas tend to be more difficult to author. Moreover, if you don’t pay attention to the details, performance will suffer. It’s always good to double-check all the numbers before considering them correct.
As an example, the metrics in the Production Fluid Snapshot Table consist of monthly production rates, which are averages rather than quantities. Consequently, these metrics cannot be aggregated over time. Detailed definitions of some base measures, along with explanations, are provided in the Annex 1.
Expanding the Data Model to Cover Downstream Infrastructure
With this version of the data model ensemble, I was able to cover a wide spectrum of critical knowledge for various essential departments, such as executive management, production engineering, cost control, and oil trade, among others. Using Power BI, I developed dashboards and visuals to answer questions clearly and concisely, facilitating better understanding and analysis of the data.
Key analyses performed include:
However, the model does not map the wells to the surface facilities, nor does it map the connections between the different facilities. Consequently, there is no way to model how the production and injection fluids are distributed and flow downstream of the wells. This limitation prevents us from addressing issues related to downstream infrastructure and logistics from the wells.
Concerns that needed to be resolved:
To properly respond to these inquiries and gain comprehensive insights, it is necessary to expand the model to explore all concerns related to the transport, separation, and processing demands of the production and injection fluids.
Expanding the Data Model
The first step was to model the configuration of the existing fluid collection, processing facilities, and transport network. This included determining:
The expanded model allowed for a holistic study of the assets. Three new dimension tables were developed and incorporated: Prod Facility, WI Facility, and SI Facility, each having a direct relationship with its respective well table.
Recommended by LinkedIn
In case it hasn’t been noticed, it is no longer a star schema but a snowflake schema. A snowflake is a variation of a star schema where a dimension is not linked directly to the fact table. Rather, it is linked through another dimension. I could have avoided snowflakes by denormalising the columns from the farthest tables into the nearer to the fact table. However, in this case, a snowflake schema was fit for purpose.
Advanced Techniques in Data Modelling to Gain Deeper Insights
To unlock the full potential of data analysis, it is crucial to go beyond basic data modelling methodologies and explore advanced techniques. Advanced data modelling techniques can provide deeper insights, enabling more precise and strategic decision-making. In this section, we will delve into some of the sophisticated approaches used to enhance our data model, covering topics such as dynamic segmentation, ABC classification, calculation groups, and advanced visualisations. These techniques not only improve the robustness and flexibility of the data model but also facilitate a comprehensive understanding of complex data relationships and trends. By implementing these advanced strategies, we can extract maximum value from our data and drive more impactful business outcomes.
1. Dynamic Well Segmentation
When analysing well metrics, it is relatively straightforward to compute metrics such as productive time, total production, or average flow rate for each well. However, a more insightful approach involves clustering producing wells based on forecasted oil flow rates or other relevant metrics. Instead of partitioning flow rates by every unique value (which can be overwhelming due to the large number of different values), grouping flow rates into ranges provides meaningful insights.
This technique, known as segmentation, allows for stratifying data based on a configuration table. In this case, the challenge was to cluster wells dynamically based on the average oil production flow rate, taking into account the current filter context. This means that the segmentation adjusts dynamically according to the period, or filters applied in the report. For instance, a well might belong to one segment for one period and another segment for a different period.
To achieve this dynamic segmentation, a virtual relationship was established using DAX, bypassing the need for physical relationships in the data model. The first step was defining configuration tables (oil and gas), which defined four segments: silver, gold, platinum, and diamond. Using percentiles, median, and average flow rates as reference, the ranges for each segment were determined. A DAX calculation group was then created to dynamically compute the number of wells in each segment, or any other selected measure, considering all current filters.
2. ABC Classification | Pareto Analysis
The ABC/Pareto analysis is another impactful technique that introduces new ways of classifying entities based on values, grouping the entities together that contribute to a certain percentage of the total. For instance, this method helps identify the most impactful wells on the overall business by categorising them into three classes (A, B, or C), ensuring:
These class limits can be modified as needed, provided they sum to 100% without overlapping. Similar to well segmentation, the ABC classification was implemented dynamically, allowing for adjustments based on the filter context.
3. Calculation Groups
Adopting calculation groups in Power BI provided numerous advantages, enhancing the efficiency and flexibility of data analysis. Calculation groups centralise and simplify the management of recurring calculations, reducing the number of measures and code redundancy, thus improving model maintainability. For example, rather than creating multiple measures for different time related computations (year-to-date, period-to-date growth, moving averages, etc.), calculation groups allow these transformations to be defined once and applied dynamically.
In this model, calculation groups were defined for various purposes, including:
However, it is crucial to understand the complexities of calculation groups before implementation, as they can be challenging to manage. Investing significant time to master these powerful tools is essential for achieving their full potential.
4. Sankey Diagram
The Sankey diagram is a visualisation technique used to represent the flow of values from one set to another. In this context, it was implemented to illustrate the flow of oil, gas, and water from gathering stations (facilities receiving production from wells) through various surface facilities to the final endpoint. Although this information could be presented as a matrix, the Sankey diagram offers a clearer and more visually appealing representation of data flows, capturing interrelationships and magnitudes more effectively.
To use this custom visual from Microsoft AppSource, the data source required specific formatting, which necessitated adjustments to the data model. The aim was to dynamically represent the flow of fluid production based on any changes in the filter context.
5. Transition Matrix
Wells' classifications can change over time based on their decline rates and production levels. A transition matrix is a powerful modelling technique that uses snapshots to analyse these changes. By creating a calculated table that classifies each production well monthly and establishing the necessary relationships, the transition matrix helps answer questions about the evolution of well classifications over time.
For example, it can track wells classified as "gold" in one period and show how their classification changes in subsequent periods.
6. Like-for-like Comparison
Like-for-like comparison is an adjusted metric that compares two time periods, restricting the comparison to wells with consistent production statuses. This technique ensures that only wells open and producing in all considered periods are evaluated, preventing skewed results from wells that were closed during some periods. Although this comparison was computed dynamically, it could also be done statically if needed.
By implementing these advanced data modelling techniques, the model becomes more robust and capable of delivering deeper insights, ultimately supporting more informed and strategic decision-making. Moreover, the combined use of different techniques significantly enhances the ability to explore the dataset. For instance, Figure 5 features a Sankey Diagram that represents the oil production rate from the Gathering Stations (ECOs) to the Export Terminal (TERMAP).
The analysis is further enriched by the accompanying matrix, which presents additional statistical metrics, such as minimum, median, and maximum for the selected fluid measure (oil prod rate, in this case) within the analysed time period. The matrix also includes the specific dates on which these extremes are anticipated, providing a comprehensive view of the production dynamics over time.
Conclusion
This case study demonstrates the transformative power of effective data modelling in the oil and gas industry using Power BI. Through meticulous structuring and organisation of data, we were able to convert raw data into actionable insights, significantly enhancing our decision-making capabilities.
The initial challenge of working with complex data sets was addressed by understanding the key analytical perspectives and constructing a robust data model. This foundational work enabled us to create dynamic and interactive dashboards that provided clear and concise answers to critical business questions.
By expanding the data model to include downstream infrastructure, we tackled the limitations of our initial setup and were able to analyse the entire lifecycle of production fluids from the well to the delivery point. This holistic view is essential for making informed decisions about infrastructure investments and operational improvements.
Moreover, the incorporation of advanced data modelling techniques such as dynamic segmentation, ABC classification, and calculation groups further enriched our analytical capabilities. These sophisticated approaches allowed us to gain deeper insights, identify trends, and make more precise and strategic decisions. The use of visualisation tools like the Sankey diagram and transition matrix provided a clearer understanding of data flows and changes over time.
Through this journey, it became evident that mastering data modelling is not merely about handling data but about harnessing its potential to drive business success. The capacity to structure data in meaningful ways, apply advanced analytical techniques, and visualise complex relationships is crucial for any organisation striving to remain competitive in today's data-driven world.
In summary, our experience underscores the importance of investing in data modelling expertise and leveraging powerful BI tools like Power BI. By doing so, organisations can transform their data into a strategic asset, enabling smarter decisions and fostering sustainable growth in the ever-evolving landscape of the oil and gas industry.
Giving Credit Where It's Due: Recommended Reads and Resources
For those looking to explore the concepts and techniques discussed in this article further, the following books and resources have been instrumental in my own professional growth and are highly recommended.
Books:
Online Resources:
Annex 1: Special Consideration for Production Metrics
In data modelling, particularly when working with snapshot tables, it is crucial to understand the distinction between additive, semi-additive, and non-additive measures. Additive measures, such as total production, can be summed across dimensions without any issues. However, semi-additive measures, like oil production rate, require special handling because they cannot be aggregated simply by summing values across time periods.
This annex explains these concepts using two DAX measures: "Oil Prod" and "Oil Prod Rate." Each measure is broken down to explain the purpose and function of each part of the code, with specific considerations for handling semi-additive measures in snapshot tables.
Oil Prod Measure
This measure calculates the total oil production, in barrels, within the filter context in place. The Oil Prod measure is additive across all dimensions. However, since the fact table 'Fluid Production' stores the oil production rate for each well on a given month, it’s necessary to account for the number of days in those months.
Oil Prod = -- total volume of oil produced in barrels within the given filter context
SUMX(
'Fluid Production',
'Fluid Production'[O [bpd]]] * RELATED( 'Date'[DaysInMonth] )
)
Breakdown of the Measure.
SUMX: This iterator function processes each visible row in the 'Fluid Production' table (the fact table), based on the current filter context.
Considerations: the "Oil Prod" measure is additive, meaning it can be summed across different wells or months without issue. This is because the measure calculates total production quantities, which are straightforward to aggregate.
Oil Prod Rate Measure
The "Oil Prod Rate" measure computes the average daily oil production rate, making it a semi-additive measure. This type of measure requires careful handling because it cannot be aggregated across time without recalculating for the specific context.
Oil Prod Rate =
/* The average daily oil production rate, in barrels per day,
within the specified filter context */
VAR _OilProd = [Oil Prod]
VAR _NumberOfDays =
CALCULATE(
SUMX(
'Date',
'Date'[DaysInMonth]
),
VALUES( 'Fluid Production'[DateKey] )
)
VAR _Result =
DIVIDE(
_OilProd,
_NumberOfDays
)
RETURN
_Result
Breakdown of the Measure.
VAR _OilProd: this variable stores the total oil production value calculated by the previously defined "Oil Prod" measure, representing the total production for the filter context.
VAR _NumberOfDays: this variable calculates the total number of days in the period being analysed.
How VALUES Works in This Context:
o Filtering the Date Table: the VALUES('Fluid Production'[DateKey]) function extracts the unique dates from the 'Fluid Production' table based on the current filter context (such as a specific well, month, or year selected in the report).
o This ensures that the calculation focuses only on those dates relevant to the data in the 'Fluid Production' table, instead of applying the calculation across the entire 'Date' table.
Applying the Filter: The CALCULATE function applies this filter to the 'Date' table, meaning that the SUMX function sums the DaysInMonth only for those dates matching the filtered DateKey values from the 'Fluid Production' table.
Result of the Calculation: the result is a sum of the DaysInMonth values, but only for those months corresponding to the DateKey values found in the filtered 'Fluid Production' table.
This precise calculation ensures that the "Oil Prod Rate" measure accurately reflects the average daily production rate based on actual periods of production data, avoiding an incorrect average that might include days from unrelated time periods.
VAR Result: this variable compute the average daily production rate by dividing the total oil production (`OilProd`) by the total number of days (`_NumberOfDays`).
Considerations:
Importance of Handling Semi-Additive Measures
In snapshot tables, additive measures allow for straightforward aggregation, but semi-additive measures like "Oil Prod Rate" require more nuanced handling. Summing semi-additive measures over time would lead to incorrect results, as it would not account for the varying number of days in each period or the differences in production volumes. By carefully structuring DAX calculations, as demonstrated with "Oil Prod Rate," we ensure that our data model provides accurate and actionable insights, avoiding common pitfalls associated with improper aggregation of semi-additive measures.
Conclusion: understanding the behaviour of semi-additive measures in snapshot tables is essential for creating accurate and reliable data models. By breaking down the DAX measures "Oil Prod" and "Oil Prod Rate," we can see the importance of recalculating measures for each time period rather than simply summing values. This careful approach ensures that the insights derived from the data model are both accurate and meaningful, providing a solid foundation for decision-making in the context of oil production analysis.
#DataModelling #DataAnalysis #DataStrategy #PowerBI #DataDriven #BusinessIntelligence
Consultora Business Intelligence | Autora en ANAYA | Linkedin Learning Trainer | Microsoft Data Platform MVP | Microsoft Partner Power BI | Ganadora Globant Award Techfluencer España 2022
3moExcelente artículo, Leandro, teoría, práctica y todo bien detallado. Gracias por compartirlo, Saludos, Ana
Electronics Technician
3moExcelente Articulo Lean
Gerente Ingeniería
3moHoy la info esta. El problema es que hacer con ella! Sin duda hay que sumar estas herramientas
Consultor de Gestión de Proyectos
3moMuy interesante!!! Un fuerte abrazo!!!