😵💫 Ever wrestled with SQL that needs to adapt across data warehouses? At Numbers Station, SQLGlot is our secret weapon, handling the heavy lifting of parsing, transforming, and unifying SQL across dialects. Dive in as Maureen Daum breaks down how this powerhouse library keeps our SQL analysis sharp and our codebase streamlined! https://lnkd.in/gkeuJpd8
Numbers Station AI’s Post
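For a concrete feel of what the post describes, here is a tiny, hedged sketch of cross-dialect transpilation with SQLGlot's public API (the queries are made-up illustrations, not Numbers Station's actual code):

import sqlglot
from sqlglot import exp

# Transpile a DuckDB query into Hive SQL; SQLGlot rewrites dialect-specific
# functions like EPOCH_MS into the target dialect's equivalent.
sql = "SELECT EPOCH_MS(1618088028295) AS ts"
print(sqlglot.transpile(sql, read="duckdb", write="hive")[0])

# Parsing also yields an AST that can be inspected or rewritten programmatically,
# which is the basis for the kind of SQL analysis the post mentions.
ast = sqlglot.parse_one("SELECT a, b FROM t WHERE a > 1")
for column in ast.find_all(exp.Column):
    print(column.sql())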
More Relevant Posts
-
Benn Stancil’s latest post on dbt hits close to heart. 🥲 In part because I remember ‘winging it’ myself whilst building an entire warehouse’s worth of dbt models: I had some medallions to follow and ideas for encapsulating logic, but other than that it was wild-west SQL slinging. But also because of our work on reconfigured v1, where we tried to enforce an auto-generated-yet-exposed-to-the-eyes framework for building core data models. It solved the spaghetti issue for sure, but it really didn’t address the underlying problems. Recommend the read. :-) https://lnkd.in/dmBcPhyA
The rise of the analytics pretendgineer
benn.substack.com
-
Great post again from Benn! What I've seen is that many times Data Engineers need to clean up the mess that applications/services create -- which is obviously wrong. You get crappy data in, and you need to somehow clean it up, organise it into tables, and eventually, through some magic graph of thousands of lines of SQL, turn it into clear, understandable, and simple charts and reports. Nobody dares to say, "hey, pls fix this JSON where it is produced". "Programming" on the SQL layer is way harder, and a lot more unclear and cumbersome, than in a real programming language. It's the same thing again: companies need an end-to-end approach for data. Otherwise they sink into the complexity.
The rise of the analytics pretendgineer
benn.substack.com
-
𝐖𝐞𝐥𝐜𝐨𝐦𝐞 𝐭𝐨 𝐃𝐚𝐲 3 𝐨𝐟 𝐭𝐡𝐞 𝐏𝐲𝐒𝐩𝐚𝐫𝐤 𝐜𝐨𝐮𝐫𝐬𝐞! Today, we’ll explore common operations on DataFrames, such as filtering, selecting, and sorting data. These operations are fundamental for data manipulation and will help you work effectively with large datasets.

𝐈𝐧𝐭𝐫𝐨𝐝𝐮𝐜𝐭𝐢𝐨𝐧 𝐭𝐨 𝐃𝐚𝐭𝐚𝐅𝐫𝐚𝐦𝐞 𝐎𝐩𝐞𝐫𝐚𝐭𝐢𝐨𝐧𝐬: PySpark DataFrames come with a range of powerful methods that allow us to manipulate data efficiently. The key operations include:
◾Filtering rows based on conditions.
◾Selecting specific columns.
◾Sorting the data.
◾Aggregating values.

1. 𝐒𝐞𝐥𝐞𝐜𝐭𝐢𝐧𝐠 𝐂𝐨𝐥𝐮𝐦𝐧𝐬
To select specific columns from a DataFrame, use the select() method.
𝐄𝐱𝐩𝐥𝐚𝐧𝐚𝐭𝐢𝐨𝐧:
◾select(): Extracts the specified columns.
◾show(): Displays the content of the selected columns.

2. 𝐅𝐢𝐥𝐭𝐞𝐫𝐢𝐧𝐠 𝐑𝐨𝐰𝐬
To filter rows based on a specific condition, use the filter() method.
𝐄𝐱𝐩𝐥𝐚𝐧𝐚𝐭𝐢𝐨𝐧:
◾filter(): Keeps only the rows that match the condition.
◾Use expressions like startswith() to filter based on string values.

3. 𝐒𝐨𝐫𝐭𝐢𝐧𝐠 𝐃𝐚𝐭𝐚
You can sort your data using the orderBy() method.
𝐄𝐱𝐩𝐥𝐚𝐧𝐚𝐭𝐢𝐨𝐧:
◾orderBy(): Sorts the DataFrame by one or more columns.
◾Use desc() to sort the values in descending order.

4. 𝐀𝐝𝐝𝐢𝐧𝐠 𝐍𝐞𝐰 𝐂𝐨𝐥𝐮𝐦𝐧𝐬
To add new columns, use the withColumn() method.
𝐄𝐱𝐩𝐥𝐚𝐧𝐚𝐭𝐢𝐨𝐧:
◾concat(): Concatenates multiple columns or expressions.
◾lit(" "): Adds a space between the first and last names as a literal string.
◾Note that concat() returns null if any of its inputs is null; if you want nulls to be skipped so one missing column doesn’t null out the whole result, use concat_ws() instead.

5. 𝐆𝐫𝐨𝐮𝐩𝐢𝐧𝐠 𝐚𝐧𝐝 𝐀𝐠𝐠𝐫𝐞𝐠𝐚𝐭𝐢𝐧𝐠 𝐃𝐚𝐭𝐚
Use groupBy() to group data based on a column and aggregate it.
𝐄𝐱𝐩𝐥𝐚𝐧𝐚𝐭𝐢𝐨𝐧:
◾groupBy(): Groups rows by the specified column.
◾count(): Aggregates the data by counting occurrences within each group.

𝐊𝐞𝐲 𝐓𝐚𝐤𝐞𝐚𝐰𝐚𝐲𝐬:
◾PySpark offers powerful methods like select(), filter(), and orderBy() for manipulating data.
◾Adding new columns and performing groupBy() operations help you transform and analyze your data.
◾With these foundational operations, you’re ready to perform more advanced data manipulation in PySpark.

𝐂𝐡𝐞𝐜𝐤 𝐨𝐮𝐭 𝐨𝐮𝐫 𝐆𝐨𝐨𝐠𝐥𝐞 𝐂𝐨𝐥𝐚𝐛 𝐥𝐢𝐧𝐤 𝐚𝐧𝐝 𝐭𝐫𝐲 𝐢𝐭 𝐟𝐨𝐫 𝐲𝐨𝐮𝐫𝐬𝐞𝐥𝐟: https://lnkd.in/gQffTudY
𝐘𝐨𝐮 𝐜𝐚𝐧 𝐟𝐨𝐥𝐥𝐨𝐰 𝐭𝐡𝐞 𝐞𝐧𝐭𝐢𝐫𝐞 𝐥𝐞𝐚𝐫𝐧𝐢𝐧𝐠 𝐣𝐨𝐮𝐫𝐧𝐞𝐲 𝐢𝐧 𝐨𝐧𝐞 𝐩𝐥𝐚𝐜𝐞: https://lnkd.in/gQNJUCfA

Join us tomorrow for Day 4, where we’ll cover DataFrame Joins and working with multiple DataFrames! Share your PySpark challenges and suggestions in the comments! Let's learn and grow together.

#PySpark #BigData #DataEngineering #GoogleColab #Day3
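For reference, here is a minimal runnable sketch of the five operations above; the sample data and column names are made up for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, concat_ws, desc

spark = SparkSession.builder.appName("day3-dataframe-ops").getOrCreate()

df = spark.createDataFrame(
    [("John", "Doe", "Sales", 3000), ("Alice", "Lee", "Sales", 4200), ("Bob", "Ng", "HR", 3500)],
    ["first_name", "last_name", "department", "salary"],
)

# 1. Selecting columns
df.select("first_name", "salary").show()

# 2. Filtering rows on a string condition
df.filter(col("first_name").startswith("A")).show()

# 3. Sorting data in descending order
df.orderBy(desc("salary")).show()

# 4. Adding a new column; concat_ws skips nulls, unlike concat
df.withColumn("full_name", concat_ws(" ", col("first_name"), col("last_name"))).show()

# 5. Grouping and aggregating
df.groupBy("department").count().show()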
-
🌟 Data-Driven Revolution: Polars Leads with Lightning Speed in Latest TPC-H-Style Benchmarks 🌟

The recent update to Polars' PDS-H benchmark results (its TPC-H-derived benchmark suite) sets a new standard in the realm of data processing. Benchmarks of this kind, built for decision support workloads, evaluate the performance of database management and data processing systems by simulating complex query execution. Polars has emerged as a frontrunner, showcasing significant optimizations and performance enhancements.

🔍 Why It Matters: In a data-centric world, the ability to process large datasets efficiently is paramount. The TPC-H-style queries exercise joins, filters, and group-by operations, which are foundational for analytics workloads. Polars' performance in these benchmarks is a testament to its capability to handle complex data operations swiftly.

🚀 Key Insights:
🔹 Optimized Performance: Polars demonstrates superior performance to other dataframe libraries, such as pandas and PySpark, handling large datasets with more speed and less resource consumption.
🔹 Benchmarking as a Tool for Improvement: Continually updating these benchmarks pushes the envelope on data processing technology, driving innovations that trickle down to improved user experiences and application efficiency.
🔹 Implications for Data Professionals: For data scientists and engineers, understanding the strengths and limitations of various processing tools through these benchmarks leads to more informed decisions about the right tools for their data operations.

💼 Implications for Professionals: Data professionals should weigh these benchmark results in their projects, particularly those involving large-scale data analytics. The insights gained from such benchmarks can greatly influence the architecture of data solutions, ensuring that they are not only robust but also cost-effective and scalable.

👉 https://lnkd.in/dkUPCkRN

👥 Join the Conversation:
🔹 How do you see the evolution of data processing tools affecting your industry?
🔹 Are there specific challenges in your data workflows that could benefit from the kind of performance Polars is promising?

#DataScience #Analytics #BigData #TPCH #Polars #Benchmarking
Updated PDS-H benchmark results
pola.rs
-
Ever encounter a situation where your query failed to recognize top records due to tied scores? This blog post dives deep into this common issue and uses the power of window functions in PySpark to ensure accuracy in reporting. Discover how a simple tweak in your SQL query can make a world of difference in accurately identifying top records, even when multiple records achieve the same highest value (a small PySpark sketch of the idea follows the link below). https://lnkd.in/eEwqp7Ez
Cracking the Code: Tied Scores, a Window Functions Perspective
blog.det.life
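A hedged sketch of the tie-handling idea (the scores data is hypothetical, and the article's own solution may differ in detail): instead of ORDER BY ... LIMIT 1, rank the rows with a window function so every record that shares the top value is kept.

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, dense_rank

spark = SparkSession.builder.appName("tied-scores").getOrCreate()

scores = spark.createDataFrame(
    [("Ana", 95), ("Ben", 95), ("Cara", 90), ("Dev", 88)],
    ["student", "score"],
)

# dense_rank() gives tied rows the same rank, so both 95s survive the filter.
# (A global window like this collapses to a single partition; add partitionBy()
# when ranking within groups.)
w = Window.orderBy(col("score").desc())
scores.withColumn("rnk", dense_rank().over(w)).filter(col("rnk") == 1).show()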
-
𝗪𝗵𝗮𝘁 𝗶𝘀 𝗮 𝗣𝗮𝗿𝗾𝘂𝗲𝘁 𝗳𝗶𝗹𝗲 𝗶𝗻 𝗦𝗽𝗮𝗿𝗸?

A Parquet file is a columnar file format commonly used with Spark, a popular big data processing framework. It's designed to store and work with structured data efficiently. Parquet files are particularly useful in big data applications because they can be read and written in parallel, making them fast and scalable.

Imagine you're running a website where you collect user data such as name, age, email, and the date each user joined. You want to store this data in a file so that you can analyze it later using Spark.

𝗛𝗼𝘄 𝘆𝗼𝘂 𝗺𝗶𝗴𝗵𝘁 𝘀𝘁𝗼𝗿𝗲 𝘁𝗵𝗶𝘀 𝗱𝗮𝘁𝗮 𝗶𝗻 𝗮 𝗣𝗮𝗿𝗾𝘂𝗲𝘁 𝗳𝗶𝗹𝗲:

name    age   email               join_date
--------------------------------------------
John    25    john@example.com    2024-05-01
Alice   30    alice@example.com   2024-05-02
Bob     28    bob@example.com     2024-05-03

𝗕𝗲𝗻𝗲𝗳𝗶𝘁𝘀 𝗮𝗻𝗱 𝘂𝘀𝗮𝗴𝗲 𝗼𝗳 𝗣𝗮𝗿𝗾𝘂𝗲𝘁 𝗳𝗶𝗹𝗲𝘀:

𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆: Parquet files are highly compressed and columnar, meaning each column's values are stored together. This makes reading faster because Spark can skip over the columns that aren't needed for a particular operation.

𝗦𝗰𝗵𝗲𝗺𝗮 𝗣𝗿𝗲𝘀𝗲𝗿𝘃𝗮𝘁𝗶𝗼𝗻: Parquet files store metadata about the schema of the data, including data types and column names. This makes it easy to read the data back into Spark without having to specify the schema explicitly.

𝗣𝗮𝗿𝘁𝗶𝘁𝗶𝗼𝗻𝗶𝗻𝗴: Parquet data can be partitioned based on one or more columns, which can greatly speed up queries that filter on those columns. For example, you could partition the data by the join_date column, allowing Spark to read only the partitions relevant to a particular query.

𝗡𝗼𝘄, 𝗳𝗼𝗿 𝗮𝗻 𝗶𝗻𝘁𝗲𝗿𝘃𝗶𝗲𝘄 𝗾𝘂𝗲𝘀𝘁𝗶𝗼𝗻: How would you read a Parquet file into a Spark DataFrame using PySpark? (A sketch of one answer is below.)

Follow Shivani Bakhade
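A minimal sketch of one answer; the file path is a made-up example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-parquet").getOrCreate()

# Read a Parquet file (or a partitioned directory of Parquet files) into a DataFrame.
users = spark.read.parquet("/data/users.parquet")

# The schema travels with the file, so it does not need to be specified explicitly.
users.printSchema()

# If the data is partitioned by join_date, this filter lets Spark prune
# irrelevant partitions instead of scanning everything.
users.filter(users.join_date == "2024-05-01").show()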
-
Efficient data linkage is key to successful data operations. Discover how PySpark's left semi join, left anti join, and union can be combined for targeted data manipulation to achieve the same result as a left join (a small sketch of the idea follows the link below). #pyspark #sql #dataengineering
Granular Look at Left, Semi, and Anti Joins in PySpark
blog.det.life
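A hedged sketch of the idea with hypothetical orders/payments tables (the article's own example will differ): semi and anti joins split the left side into matched and unmatched rows, and unioning the inner-joined matches with the null-padded anti-join rows reproduces a left join.

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("join-flavors").getOrCreate()

orders = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")], ["order_id", "customer"])
payments = spark.createDataFrame([(1, 100.0), (3, 50.0)], ["order_id", "amount"])

# Left semi join: orders that have a matching payment (right-hand columns are dropped).
matched = orders.join(payments, "order_id", "left_semi")
matched.show()

# Left anti join: orders with no matching payment.
unmatched = orders.join(payments, "order_id", "left_anti")
unmatched.show()

# Left-join equivalent: inner join for the matched rows, plus the anti-join rows
# padded with a null amount, unioned back together by column name.
left_join_equiv = (
    orders.join(payments, "order_id", "inner")
    .unionByName(unmatched.withColumn("amount", lit(None).cast("double")))
)
left_join_equiv.show()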
-
#AI #ML #Tech CTE Vs. Subqueries In SQL — 3 Practical Tips To Make A Right Choice: Master the key differences between CTEs and subqueries to know exactly when and where to use each! Continue reading on Towards Data Science » #MachineLearning #ArtificialIntelligence #DataScience
CTE Vs. Subqueries In SQL — 3 Practical Tips To Make A Right Choice
towardsdatascience.com
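As a quick, hypothetical illustration of the two forms the article compares (expressed through spark.sql to stay in PySpark; the orders table is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cte-vs-subquery").getOrCreate()

spark.createDataFrame(
    [("A", 100), ("A", 250), ("B", 80)], ["customer", "amount"]
).createOrReplaceTempView("orders")

# Subquery form: the aggregation is inlined where it is used.
spark.sql("""
    SELECT customer, total
    FROM (SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer) t
    WHERE total > 100
""").show()

# CTE form: the same aggregation gets a name, which is easier to read and to
# reuse across multiple references in a larger query.
spark.sql("""
    WITH totals AS (
        SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer
    )
    SELECT customer, total FROM totals WHERE total > 100
""").show()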