Vol. V - Leveraging Cloud Storage, Automation for Scala, Spark SQL, Py Notebook/Databricks
Abstract - Algorithms prove out relationships in data sets, which in some cases helps observers think more carefully about the importance of systematic use of data. Growing businesses want to consider how big the business will be in ten months, two years and beyond, and what costs would be involved in maintaining the course of purchases, maintenance, capex (capital expenditures) and R&D, for instance. Consultants and planners would deem it of great significance to determine a critical path or an efficient means of reaching a certain output level while considering the learning curve and cost (Cost as an Independent Variable, for instance). If my aim is to increase capital expenditures at a particular rate, where will I be in five years is a feasible concern; conversely, one might suggest that the growth rate could be increased with a tweak here or there in planning (ERP and supply chain software, CRM (Customer Relationship Management) programs, and physical or software security can all be considered) strategically.
While purchase orders, inventory and accounting records are kept for various purposes, the management and maintenance of these "transactions" can be left entirely to regular reporting and auditing. Alternatively, the use of efficient data management processes, storage, indexing or arrangement of data by type (by organization or price, for example) and ready access to it becomes a tenet of the strategic planning workload. While supporting operations teams I have often been asked how much output was created on average per group last year and how much can be expected this year.
I work alongside developers and business analysts to understand how and when data is collected and how accurate it is, then load the information into a warehouse or database. Software developers write front ends and persist data (gather inputs from end users) that automatically feeds the database and warehouse. When Joe finishes designing a kitchen cabinet and Phil finishes engineering and assembling it, I have a system that tabulates such things as the quantity of raw materials used (manufacturing engineering systems tabulate that), the time in work hours used and the labor employed (handler, assembler, specialty engineering). But analytical processes are limited in the database and require the data to be extracted, transformed and loaded (ETL) into a data model that can compare aggregates of records along a dimension such as year over year, month over month and day over day.
I might find that 20% of the time is used to produce 80% of the finished product (the Pareto Principle), which aligns with what is known to be true about manufacturing and input/output analyses and which can impact schedule and delivery decisions. In other words, rates of production can vary during a seemingly well planned timeline, putting on-time delivery at risk and complicating event forecasting. Why would I think that my delivery will be ready in four weeks after 80% of it was done in the first two weeks? In other words, I know that the last twenty percent takes twice as long to finish as the first portion did to start, so six to eight weeks is a better final delivery estimate.
While the Pareto Principle is employed in business routinely, as an algorithm it is of great import. Beyond scheduling, algorithms can be used to validate estimation and productivity assumptions, including how long it will take to clear inventory for a product or department. Again, if 80% of my inventory is gone two weeks after a coupon went out, it is safe to say that the remaining inventory may take a while longer to deplete if it is not depleting at a flat rate, or if the sales are related to the coupon.
At the very least an algorithm takes inputs such as the coupon rate or discount rate, average sales per product and activity during specific shifts (weekends, evenings or days). A fitting algorithm would compare sales of items on weekends against how they sold on weekdays. In other words, a transaction table could be modified to suit an algorithm. Tabular data might be labeled as sale record, weekend sale record (y/n), weekday sale record (y/n), time of sale (evening? y/n, datestamp) or off peak (day, y/n). Thus one might use a developer to code a record once it is received at a Point of Service (POS). The data that was saved would be added to using simple logic: add a boolean (true/false) column named "peak hours?", another named "weekend sales day" and a third named "was coupon used?". Since this systematic use of data isn't recorded at the POS automatically, a front end could be employed to select which columns and boolean data types (true/false, or binary one or zero) should be added to the POS data. A minimal sketch of this enrichment step appears below.
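As a rough illustration only (the column names sale_time and coupon_code, the file path and the evening peak window are hypothetical assumptions, not a real POS schema), the boolean flags described above could be derived in a Spark/Scala notebook like this:

// A minimal sketch, assuming a POS extract with sale_id, sale_time and coupon_code columns
import org.apache.spark.sql.functions.{col, dayofweek, hour}

val pos = spark.read
  .option("header", "true")
  .csv("/mnt/pos/transactions.csv")                                   // hypothetical placeholder path
  .withColumn("sale_time", col("sale_time").cast("timestamp"))        // parse the timestamp column

val enriched = pos
  .withColumn("weekend_sale", dayofweek(col("sale_time")).isin(1, 7)) // Sunday = 1, Saturday = 7
  .withColumn("peak_hours", hour(col("sale_time")).between(17, 21))   // assumed evening peak window
  .withColumn("coupon_used", col("coupon_code").isNotNull)            // flag set when a coupon code exists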
After the data is cleaned, transformed and loaded (it should be kept in its own model in a separate repository), one could make visualizations and draw comparisons, such as whether the items with coupons sold better than items without, and so forth. We customarily call this process of using data to glean intelligence (business intelligence) an ETL process and data visualization system. The system is often labeled or marketed as a Windows SSRS/SSIS product, or a Databricks, Tableau or PowerPivot/Power BI solution. In all cases SQL is generally the way transactions are recorded. However, data can be stored in XML, Excel and other formats (like JSON from APIs found on the web, or Parquet and other SQL-like data types). This mishmash of data types gets consolidated in the data warehouse or repository for the data model, a process handled more efficiently or uniquely depending on the system (Tableau versus Power BI, for example); a sketch of that consolidation follows.
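As a hedged sketch of what that consolidation can look like in Spark (the file names and paths are hypothetical), the different source formats can each be read into a DataFrame and written into one Parquet-backed warehouse area:

// A minimal sketch, assuming three hypothetical source files in different formats
val salesCsv   = spark.read.option("header", "true").csv("/mnt/raw/sales.csv")
val ordersJson = spark.read.json("/mnt/raw/orders.json")
val refParquet = spark.read.parquet("/mnt/raw/reference.parquet")

// Consolidate each source into a common warehouse location for the data model
salesCsv.write.mode("overwrite").parquet("/mnt/warehouse/sales")
ordersJson.write.mode("overwrite").parquet("/mnt/warehouse/orders")
refParquet.write.mode("overwrite").parquet("/mnt/warehouse/reference")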
Lately customers have been sold on the theme of using front ends for notebooks, which are simply GUI (Graphical User Interface) blank web pages that allow users to write narrative descriptions about workloads, such as "extract data from the POS database and analyze it here." They then write a query for it in the notebook, which eliminates the need to write the code in SQL at the database level rather than the GUI level. Moreover, notebooks allow a user who does not know SQL to write a natural-language-like query using Python or Spark. Notebooks require horsepower beyond normal IDEs, as they will often query big data sets or bulk sets of terabyte-size repositories (repos which hold years' worth of detailed records). A small example of such a notebook query follows.
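For instance (a hedged sketch; the view name, sample rows and filter are my own illustrative assumptions), a notebook cell can register a DataFrame as a temporary view and query it in SQL-like fashion without touching the source database:

// Build a tiny in-notebook DataFrame, expose it as a view, and query it with Spark SQL
import spark.implicits._

val sample = Seq(("widget", 9.99, true), ("gadget", 19.99, false))
  .toDF("product", "price", "weekend_sale")

sample.createOrReplaceTempView("pos_sales")

spark.sql(
  "SELECT product, AVG(price) AS avg_price FROM pos_sales WHERE weekend_sale GROUP BY product"
).show()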
In this volume we build on a previous article (called "Algorithms and Data Sets Analysis using Power Apps, Power Pivot and PowerBI") to show how larger data sets do more with less on the cloud using Spark (Apache) and cloud services (Microsoft Azure), while adhering to the principles or models of the sample data set and modeling activity started on premises in Excel.
Previously we discussed how one can pull data from disparate workbooks using Power Apps in Excel. A spreadsheet "A" holding sales records for September would be combined with spreadsheet "B" holding data for August and spreadsheet "C" holding data for July. Without cutting and pasting, one simply used the Data tab and imported or "get[s]" data from other workbooks without opening them manually. Then the three worksheets were appended and a Power Pivot table or straight-up pivot table was used to look at average sales volume per product. (Note that when using the Power Platform there are nuances about whether one needs Power Pivot tables or not; in other words, building the model on the fly is easier with the Power Platform, while building the singular data model with the Append query puts all data into one table, requiring just the pivot table.)
We saw that the data had to be cleaned. In cases where column names had to be edited, or where blank values had to be replaced, Excel and Power Pivot let one clean the data in flight using various functions prebuilt into the menu bar. In fact, I checked for duplicates and converted data types when numbers were read as strings (note that data retrieved over the internet arrives in a protocol that puts everything into strings) and needed to be sorted (transforming the data into integers was necessary in this case).
This cleaning and transformation (ETL) yielded a resultant "append" table that showed the new results as a singular model. The model, containing price volume for a week's worth of data, was then pivoted to aggregate the average price volume. This metric would be used for comparison with daily values moving forward. It was then combined or merged with a later day's worth of data for the example, and we determined whether that day was an average, below-average or above-average day for price volume.
With larger data sets we can use a different tool to implement the method used above. Azure serves as our central locale. In other words, it is easier to upload a group of files into one storage account than to manually click on several files on the Excel server as we did above.
The container above is used by Databricks much as a front end uses a LAMP stack: L for Linux (the file manager used to create folders for index.php (the front end), images, aboutus.php and other web pages), A for Apache (the web server that hosts the website front end), M for MySQL (the structured database tables) and P for Perl/Python/PHP (the scripting layer that serves HTML pages and connection scripts, which use SQL data objects to establish connections and retrieve views from the SQL back end). The LAMP stack, such as that seen in control panels like HostGator, iPage, WordPress, Drupal and Joomla, presents a GUI just like this. The control plane, however, is stored in the Databricks GUI in a separate window. The control plane is manifested as a notebook (akin to a Jupyter/Anaconda-distribution notebook for Python code, plus SQL and Scala) hosted in a workspace. In the workspace our notebook will call the Azure storage account and the file under the container shown above.
Much like Power Apps (Power Pivot, Data tab), we are automating the extraction of data from CSV files, its transformation into the data types we are analyzing, and the loading of this data into a model; here the automation is done in script, not an Excel ribbon tab. The code creates variables that contain the Azure account, container name and access key (redacted here), which allows the notebook to read the data from the Azure account, creating an infrastructure-as-code (IaC) data pipeline (stage one). At this point we have a pipeline to feed data from an inexpensive data repository, sketched below.
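A minimal sketch of that stage-one configuration in a Scala cell (the account name, container name and path are hypothetical placeholders, and the real key is redacted as in the article):

// Hypothetical storage details; substitute your own account, container and key
val storageAccount = "mystorageaccount"   // Azure storage account name (assumption)
val containerName  = "salesdata"          // blob container holding the CSV files (assumption)
val accessKey      = "<redacted>"         // account access key, redacted

// Register the key with the Spark session so the notebook can reach Blob Storage
spark.conf.set(
  s"fs.azure.account.key.$storageAccount.blob.core.windows.net",
  accessKey
)

// Stage one of the pipeline: point the notebook at the container
val sourcePath = s"wasbs://$containerName@$storageAccount.blob.core.windows.net/"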
The notebook continues with a short code section using a magic command to enable a non-Scala line of code (the magic, or %, character is followed by a choice of available languages: dbfs, the Databricks file system commands (akin to Linux/Unix or PowerShell command line interface commands), Python, Spark SQL or Scala, which is the present default language).
The ls (list) command, common to most CLIs such as Linux and PowerShell (the Windows CLI), was called to show how the pipeline has pulled in the files stored online at Azure.
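In the notebook, that listing can be done either with the %fs magic (a "%fs ls" cell pointed at the container path) or with the Scala file-system utility sketched below; the path reuses the hypothetical placeholders from the earlier configuration sketch:

// List the files the pipeline can now see in the (hypothetical) Azure container
display(dbutils.fs.ls("wasbs://salesdata@mystorageaccount.blob.core.windows.net/"))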
The ETL process continues. Scala notebooks allow information from stored functions and libraries to be imported (import org.apache.spark.sql.types._), which supports the transform step of ETL. Scala transformations allow us to create column names, edit names and change names (in my case the CSV files included spaces and periods that were removed), and a data type is selected per column name, e.g., "syid" has a DoubleType (an integer-like Scala type), along with names and desired types for the rest of the data, ultimately called a data frame. A sketch of such a schema follows.
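A hedged sketch of that schema step (the column names beyond "syid" and "appsym" are illustrative assumptions drawn from the fields discussed later, not the exact file layout):

import org.apache.spark.sql.types._

// Declared schema: one StructField per cleaned column name, with the desired Spark type
val priceVolumeSchema = StructType(Seq(
  StructField("syid",        DoubleType, nullable = true),  // numeric id, read as DoubleType as in the article
  StructField("appsym",      StringType, nullable = true),  // company/ticker symbol (e.g., AAPL)
  StructField("price",       DoubleType, nullable = true),  // price for the record (assumed column)
  StructField("volume",      DoubleType, nullable = true),  // volume for the record (assumed column)
  StructField("pricevolume", DoubleType, nullable = true)   // price volume per quarter hour (assumed column)
))

// Apply the schema while reading the CSVs from the (hypothetical) container path used above
val df = spark.read
  .option("header", "true")
  .schema(priceVolumeSchema)
  .csv("wasbs://salesdata@mystorageaccount.blob.core.windows.net/")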
At this stage we can view the column names and the data (per row), which includes the average price volume, the value of price volume per quarter hour, the company name (AAPL, for example), the price for the new record and the volume for the new record.
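In the notebook that inspection is simply a display of the frame and its schema (a sketch using the df assembled in the schema step above):

// Inspect the declared column names and types, then render the rows in the notebook
df.printSchema()
display(df)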
In order to fully integrate the business intelligence work we leverage the data frame shown above using natural-language-like commands (or SQL). Scala (along with other languages) uses a function for partitioning the data (the SQL equivalent would be select * from dataframe where appsym = 'aapl', for example) using natural-language-like commands, in this case "window" (see the code sketch following the description of the stored function below).
SQL code could be used to look at data by company, say Apple, Cisco and others. What we want to know is which company has the highest score for volume and price (price volume, in other words, is volume at a specific product or price; e.g., the AAPL watch version 3 has a volume at a price of 'x' and a separate price volume from that other Apple product, the AAPL watch version 4, whereas a Cisco product has a price and volume for its own products). In this case I am wondering which product has the highest volume by company. So the Scala code chosen for this notebook leverages natural-language-like queries, not complex SQL queries. The code uses import statements to start the code section (in the notebook shown) for the window library of functions and creates a function that is stored as:
val "windowfunctCoun1" (which groups by company ("appsym") and orders by the product price-volume. )
..with this "val" or function which partitions data by company then within those windows or groups of rows (by company) it orders the rows by price volume.
Moreover, the script uses these values to do a bit more, adding spicy icing to the entree:
The withColumn function adds a new column and uses the stored val from above (the windowFunctCoun1 function) to apply the partitioning defined earlier together with an additional function, so that the data is not only ordered by price volume but also shows the maximum value via the max function: .withColumn([new column name], max([column]).over(windowFunctCoun1)). The .over attribute here tells Spark to compute the max function over the partition previously created.
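A sketch of that step, building on the window and df sketches above (the new column name is an assumption, and the frame clause is my addition so that max spans the whole partition rather than the running frame Spark uses by default when an orderBy is present):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max}

// Widen the frame so max() is taken over each company's entire partition
val fullPartition = windowFunctCoun1
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

// New column holding the maximum price volume per company, alongside the ordered rows
val dfWithMax = df.withColumn("maxpricevolume", max(col("pricevolume")).over(fullPartition))
display(dfWithMax)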
At this point we have learned how to partition data outside SQL or Power Pivot and Excel, done very precisely using a Scala notebook. Business intelligence involves both creating the pipeline and using a Scala notebook to visualize the data. Please note the results are quite astounding, as we witness below:
The data proves out an important algorithm and result. For the maximum price volume (roughly 75M), the Apple product had a price of 192 USD (row one), and 192 was the lowest price among the nine values selected. For the next two highest price-volume rows, the prices (228 and 229, roughly 17% higher than 192) were considerably higher than the established minimum value. For the fourth- and fifth-highest price-volume rows, the prices (224 and 234, roughly 20% and 30% higher than the minimum price) were again well above the minimum. At this point we note that the mean value for price volume is 50M and the majority of items are below that mean, showing a skewed (right-skewed) distribution in which most price-volume figures sit below the mean while a few large values pull the mean upward. Skewness indicates an abnormal distribution of values, including situations where there are more highly priced values or more lower-priced ones, not a situation where the majority of values cluster around the mean.
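Those summary statistics can be produced directly from the data frame (a sketch building on the df above; avg, min, max and skewness are standard Spark SQL aggregate functions):

import org.apache.spark.sql.functions.{avg, max, min, skewness}

// Per-company summary: mean, extremes and skewness of price volume
val summary = df.groupBy("appsym").agg(
  avg("pricevolume").alias("mean_pricevolume"),
  min("pricevolume").alias("min_pricevolume"),
  max("pricevolume").alias("max_pricevolume"),
  skewness("pricevolume").alias("skew_pricevolume")
)
display(summary)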
The Apple situation is worth considering because, regardless of price, the volume varies. In other words, the relevance of price is trivial when compared with normal distributions. In normal situations one might see an immediate decline in demand or price because demand is considered weaker when the price is higher (when price has to go up due to a supply shortage or inflation). Because Keynes (a contrary opinion) sees supply and demand as less influential, he posits that demand may persist with higher, or stickier, prices. In other words, demand seems to go up when an item is popular, which reduces the burden of input costs such that inflation is overcome by participation, allowing the producer to produce more goods even as inflation rises because demand continues. In theory that can happen when there are barriers to entry or a new market has emerged and pricing power has been established by the producer. In this case Apple may exhibit Keynesian stickiness because Apple products are both popular and do not seem to have a competitor who can influence Apple prices. From an investor's point of view Apple may seem to be a safe haven in the short to medium term. When investments in Apple are a safe haven, the company can continue to keep prices high, as it can afford inflation and a drop in demand. Therefore we see that Apple prices went up as volume went down, but after some time the demand and price volume returned because consumers saw no alternatives to Apple products.
Again, business intelligence shows that traditional demand and supply considerations escape certain market participants, such as Apple in this case. Apple may have pricing power and demand that allow it to popularize its products by innovating and giving consumers more bells and whistles. By doing so it can justify its higher prices and knows consumers will eventually buy the products despite a drop in demand in the short term. Pricing in the model we see shows no resistance to higher prices, or "stickiness," as Keynesian models point out. Eventually, we are taught, regulators will step in to ensure that no barriers to entry are eliminating competition and creating a quasi-monopoly or oligopoly in the industry (Apple phones versus Windows or Android). We have heard recently that Europe is charging Apple with such things as monopoly pricing, as it has other firms, and while that may change the dynamics, it is not clear when we as consumers will stop having to put up with the inflation and prices from the strong cell phone and tech firm Apple.
Looking at another firm in the same period, or partition two, let us observe the same statistics.
The AES company is observed next, and while its highest price-volume point does not correlate with its cheapest price point, we can again point out that price has little bearing on price volume.
AES resembles the Apple analysis in that there is little correlation between price and price volume. In other words, one can dismiss supply and demand theory in favor of Keynesian theory and stickiness. However, the twist is that demand at higher prices seems to track growing demand more directly than with Apple. Unlike Apple, AES did not have its highest price volume at the lowest price but rather at the highest. For whatever reason, demand seems to be growing with AES to the extent that the lowest price point exhibits considerably less price volume than the 11M it experiences at its highest price. The lower the prices, the lower the demand and price volume. What we notice is that there is growing popularity in AES, whereas Apple began the day with constant or high popularity.
Summary - Apple Data Analysis, Supply & Demand Theory and Keynesian Economic Models
Both Keynesian and supply and demand theory are worth considering: according to Keynes, stickiness reflects the inability of price to return to the mean after it increases, or its tendency to stay high (something attributable to the nominal increase in wages we experience over time, which lessens the impact of inflation and increases the ability of consumers to support higher prices).
Skewness measures how the outliers relate to the mean. In other words, if there are more items purchased at cheaper prices, the story is clear: cheaply valued prices are correlated with sales and higher volume. In the opposite (inverse) case, when prices are increasing as price volume increases, there is also a clear story: the demand for the product is rising, so as prices go up and volume stays the same or goes up, the product continues to sell, or sells better.
Therefore, what astounds this writer, clearly and simply, is that there is no clear correlation between price and popularity. The theory that a lower price correlates with a more popular product is not reflected, or was simply not clear. At its cheapest price an Apple product sold more. At a price within ten percent of the cheaply valued item, the item sells, but moderately, not nearly as well as when it is priced optimally. When it exceeds that ten percent from the optimal price, the volume decreases precipitously. However, the price never returns to optimal, or it is sticky, indicating lower prices generally go away (higher lows). Economic uncertainty may temporarily pull prices down by leaps and bounds, but the idea that the price returns to a mean does not hold; rather, the price, once expanding, will continue to do so even if volume wanes, and price ebbs and flows in the positive direction.