Modern Data Stack - using Google AppSheet, Airflow, DBT, Google BigQuery, and Looker Studio
The Data Lakehouse architecture lets data professionals get the best of both the Data Lake and the Data Warehouse. The Modern Data Stack implements the Lakehouse architecture by separating data into clearly defined stacks, making the data easier to create and to access.
The foundations for implementing it are a Cloud Lakehouse, modern ingestion and transformation tools, and Self-Service Analytics. For this Proof of Concept (POC), I used Google BigQuery as the Cloud Lakehouse, DBT as the ingestion and transformation tool, and Google Looker Studio for Self-Service Analytics. My goal is to prove that, by taking a few simple steps, it is possible to have a modern approach using the leading technologies available. And even if you do not have these exact tools, you can implement the same process with a few adaptations.
Upgrading the Data Approach with End Users
Before we start, we must know exactly what the end users expect. This requirements phase is crucial to any Analytics project.
Because we will implement a Data Product with clearly defined data stacks, we must change the mindset behind our analytics questions: from "what do you want?" to "what data do you work with?". Usually, when we ask users what they want, we receive a report, an Excel spreadsheet, a PowerPoint deck, or some other layout. My understanding of the Modern Data Stack as a way to implement Data Products is that we are moving from being report developers to being subject data experts. On one side, we will never know as much as the end users about the business. On the other side, they will perhaps never know how to build a Data Model and turn it into a Data Stack/Product that answers their current and future questions.
For our POC, we will help investors track their current stock portfolio. They want to check their balance quickly, see how their investments are concentrated across markets and industries, and follow how their stocks perform daily.
We know there is a free service from Yahoo Finance that you can query with Python to retrieve daily stock prices. We know we must track the buy, sell, and dividend movements to compute the stock account balance, and it is simple to retrieve each stock's market and industry from Yahoo Finance. Because we have some domain knowledge, we can build a Data Model that supports the users' actual needs, not just what they say they want.
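The actual retrieval code lives in the GitHub repository linked at the end. As a minimal sketch, the community yfinance package (an assumption on my part, since the article does not name the library, as is the AAPL ticker) can fetch both the daily prices and the market/industry metadata:

```python
# Minimal sketch: fetch daily prices and sector/industry metadata for one
# stock. The yfinance package and the AAPL ticker are illustrative choices.
import yfinance as yf

ticker = yf.Ticker("AAPL")

# Daily price moves: a DataFrame with Open, High, Low, Close, Volume
daily_prices = ticker.history(period="5d")

# Market/industry metadata used to classify the stock in the Data Model
sector = ticker.info.get("sector")      # e.g. "Technology"
industry = ticker.info.get("industry")  # e.g. "Consumer Electronics"

print(daily_prices.tail(1))
print(sector, industry)
```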
How to design a good Data Stack
Before building the Data Model, it is mandatory to understand what we are designing. If you develop a report based on a given layout, you lose the opportunity to enable Self-Service Analytics for your users. You will most likely build a Star or Flattened Schema. That is good, but it is far from the optimal solution.
As subject data experts, we want a long-term solution. We want to avoid creating a new Star or Flattened Schema for every new report we develop. We want to enable Self-Service Analytics, empowering end users to find answers to the questions that arise after we deploy the BI application.
Here we face another concept of the Modern Data Stack and the Lakehouse architecture: the Bronze, Silver, and Gold layers, which split the data into different aggregation levels. Having these layers defined enables both Self-Service Analytics and Data Science development. On the Data Engineering side, it becomes easy to fix and improve the data model, since the layers record the data from the "as is" landing zone, through the "business-oriented" analysis, to the final "report-oriented" stage.
I tried to find a unanimous opinion regarding these three layers but had no success. The Bronze layer usually resides in the Data Lake and represents the data as it is in the source (SAP, Salesforce, Workday, etc.). No transformation is expected in this layer; a Change Data Capture (CDC) application such as IBM InfoSphere, Oracle GoldenGate, or Qlik Data Streaming is usually responsible for this task. The Gold layer is known as the reporting zone; Star and Flattened Schemas are good design candidates for it. The Silver layer is the most controversial one: some literature suggests a third normal form (3NF), a Data Vault, or a Snowflake schema.
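In BigQuery, a simple way to materialize this separation is one dataset per layer. Here is a minimal sketch with the google-cloud-bigquery client; the project id is a hypothetical placeholder, not the one used in the POC:

```python
# Minimal sketch: one BigQuery dataset per Lakehouse layer.
# "my-poc-project" is a hypothetical project id.
from google.cloud import bigquery

client = bigquery.Client(project="my-poc-project")

for layer in ("bronze", "silver", "gold"):
    client.create_dataset(layer, exists_ok=True)  # no-op if it already exists
```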
POC Application Architecture
Defining the proper architecture for our BI application is crucial. Having a big picture of the main components is vital to avoid pitfalls that can easily arise during development. The standard Lakehouse architecture is well known and is the starting point for the application. You can see the proposed application architecture below.
As you can see above, Google AppSheet plays the role of the transactional systems, like SAP, Workday, and Salesforce. I created a simple app to record daily stock transactions. A Google AppSheet Bot saves each new transaction (buy, sell, or dividend) to a folder in my Google Drive. It works like a CDC process and is good enough for testing purposes. An Airflow DAG runs two Python scripts daily to collect all the AppSheet transactions and the Yahoo Finance stock data and feed the Google BigQuery bronze layer: Extract and Load only, no Transformation here. DBT, as you will see below, reads the data from the bronze layer and transforms it, feeding the silver and gold layers in BigQuery. Google Looker Studio presents the dashboard for analyzing my stock account. It is a straightforward architecture, but it precisely mimics the main aspects of the Modern Data Stack on a Lakehouse infrastructure.
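The DAG itself is in the repository; the sketch below only shows its shape, assuming Airflow 2.x, with hypothetical task ids and function names:

```python
# Sketch of the daily Extract-and-Load DAG described above (Airflow 2.x
# assumed; task ids and callables are illustrative, not the repo's code).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_appsheet_transactions():
    """Read the AppSheet transaction files from the Google Drive folder
    and append them, untransformed, to the BigQuery bronze layer."""
    ...


def load_yahoo_finance_quotes():
    """Fetch the daily Yahoo Finance quotes for the portfolio's tickers
    and append them, untransformed, to the BigQuery bronze layer."""
    ...


with DAG(
    dag_id="stock_portfolio_bronze",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="load_appsheet_transactions",
                   python_callable=load_appsheet_transactions)
    PythonOperator(task_id="load_yahoo_finance_quotes",
                   python_callable=load_yahoo_finance_quotes)
```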
Application Data Model
Because our goal is to model an Analytics solution and the reporting layer will work with a Star or Flattened Schema, I prefer a Snowflake schema for the silver layer. In reality, there are more important decisions to make. Either Data Model enables the critical features that make this layer reusable for new reporting developments: facts at the lowest granularity (aggregations happen in the reporting layer, as the sketch below illustrates) and exclusive rows in the dimensions (if you need to join dimensions, you do it in the reporting layer).
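To make the first point concrete, here is a small illustration, in pandas for brevity and with hypothetical column names, of why facts are stored at the lowest granularity: the same transaction-grain fact table serves two different reports, each applying its own aggregation at report time.

```python
# Hypothetical transaction-grain fact table: one row per buy/sell/dividend.
import pandas as pd

fact_transactions = pd.DataFrame({
    "ticker": ["AAPL", "AAPL", "MSFT"],
    "type":   ["buy", "dividend", "buy"],
    "date":   pd.to_datetime(["2023-01-02", "2023-02-10", "2023-01-05"]),
    "amount": [1500.0, 12.5, 2000.0],
})

# Report 1: balance per ticker (aggregation happens at report time)
balance = fact_transactions.groupby("ticker")["amount"].sum()

# Report 2: monthly dividends -- same fact table, different aggregation
div = fact_transactions[fact_transactions["type"] == "dividend"]
dividends = div.groupby(div["date"].dt.to_period("M"))["amount"].sum()
```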
Considering the concepts above, this is the Data Model to support the POC.
Data Ingestion & Transformation
As mentioned above, DBT will get the raw data from the bronze layer (left), transform it to build the silver layer (middle), and prepare the gold layer (right).
I scheduled DBT to run a daily job to feed the tables in Google BigQuery.
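The POC relies on DBT's own scheduler for the daily run. As an alternative sketch for readers without that option, the same job can be triggered from Python (or from the Airflow DAG shown earlier) by shelling out to the dbt CLI; the project path below is a hypothetical placeholder:

```python
# Alternative sketch: trigger the daily DBT job from Python instead of
# DBT's scheduler. The project path is a hypothetical placeholder.
import subprocess

subprocess.run(
    ["dbt", "run", "--project-dir", "/opt/dbt/stock_portfolio"],
    check=True,  # raise if any silver or gold model fails
)
```

dbt resolves the model dependency graph itself, so a single `dbt run` rebuilds the silver models and then the gold models in the right order.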
Self-Service Analytics
Finally, I used Google Looker Studio to implement a simple but efficient dashboard, as follows.
As you can see, the dashboard is not limited to showing the balance of each stock. There are simple analyses by industry and market, and investors can track daily stock prices and volumes. Many other studies can be developed on top of the data layers.
On my GitHub, you can find the DBT code, the sample spreadsheet that served as the "data source", and the Python scripts that get data from Yahoo Finance and from the Google AppSheet Bot. I also added some tips on how to install Airflow in a Windows environment.
I hope you enjoyed this article. Feel free to reach out if you have any questions, and thanks for reading!