With the latest version, Nussknacker has become a powerful tool for those working with Apache Iceberg-based Data Lakehouses. It can handle both 𝗯𝗮𝘁𝗰𝗵 𝗮𝗻𝗱 𝗻𝗲𝗮𝗿 𝗿𝗲𝗮𝗹-𝘁𝗶𝗺𝗲 workloads. By integrating Nussknacker with Apache Iceberg, you can do the following:
🟢 Ingest data: Load data into your Data Lakehouse.
🟠 Transform data: Clean, filter, and restructure your data.
🔴 Aggregate data: Summarize and group data.
🔵 Enrich data: Use ML inference, joins, etc. to add context to your data.
🟣 Apply 𝗯𝘂𝘀𝗶𝗻𝗲𝘀𝘀 𝗹𝗼𝗴𝗶𝗰 to data.
In this blog post, Arkadiusz Burdach will show you how to use Nussknacker to build a data pipeline in this setup 👉 https://lnkd.in/dQg48qSi
#Iceberg #DataLakehouse
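As a rough, hedged sketch of what such a pipeline can rest on underneath (not taken from the linked tutorial), the snippet below registers an Iceberg catalog in a Flink table environment and appends a few rows to a table. The catalog type, warehouse path, and table names are assumptions made for illustration only:

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

# Batch table environment; Nussknacker's table components sit on top of the same
# Flink SQL machinery, so this only illustrates the underlying Iceberg setup.
t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

# Register an Iceberg catalog (catalog type and warehouse path are placeholders;
# the Iceberg Flink runtime JAR must be on the classpath).
t_env.execute_sql("""
    CREATE CATALOG lakehouse WITH (
        'type' = 'iceberg',
        'catalog-type' = 'hadoop',
        'warehouse' = 'file:///tmp/iceberg-warehouse'
    )
""")

t_env.execute_sql("CREATE DATABASE IF NOT EXISTS lakehouse.demo")

t_env.execute_sql("""
    CREATE TABLE IF NOT EXISTS lakehouse.demo.orders (
        order_id BIGINT,
        amount   DOUBLE
    )
""")

# Ingest a couple of rows into the Data Lakehouse table.
t_env.execute_sql(
    "INSERT INTO lakehouse.demo.orders VALUES (1, 19.99), (2, 5.50)"
).wait()
```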
-
Recently, I wrote a blog post about Nussknacker integration with Flink catalogs. I prepared a step-by-step tutorial on configuring a setup that combines Nu with Apache Iceberg and on using them to implement an example business use case. My first impression of Apache Iceberg is that it has a clean design, and many things have been rethought compared to old-school data lakes. Let me know what you think about this idea.
-
🚀 My recent experience working with Apache Iceberg and its Iceberg Catalog! This open table format truly transforms data management for large-scale analytics. Here are a few highlights of my journey:
📊 Data Versioning: Seamless support for time travel and rollback, making data operations smoother.
🚀 Partition Evolution: Effortless handling of evolving partition strategies without rewriting data.
🛠️ Scalability: Optimized performance for large datasets, ensuring efficient queries across millions of rows.
💼 Iceberg Catalog: Simplified tracking and management of tables across different environments.
Apache Iceberg is a game-changer for building reliable, scalable data lakes!
#ApacheIceberg #DataManagement #BigData #Scalability #DataLakes #IcebergCatalog #DataEngineering #Analytics #DataVersioning #DataPartitioning #ETL #DataOps #OpenSource #CloudData #TableFormats #DataGovernance #DistributedData #DataAnalytics
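A minimal PySpark sketch of the time travel and partition evolution features mentioned above; it assumes a Spark session with the Iceberg runtime and SQL extensions configured, and the catalog name demo and table demo.db.events are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime JAR is on the classpath and a catalog named
# "demo" with a table demo.db.events already exists (placeholder names).
spark = (
    SparkSession.builder
    .appName("iceberg-feature-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

# Data versioning: list snapshots, then time travel to an older one.
spark.sql("SELECT snapshot_id, committed_at FROM demo.db.events.snapshots").show()
old_snapshot_id = 1234567890  # placeholder snapshot id taken from the query above
spark.read.option("snapshot-id", old_snapshot_id).table("demo.db.events").show()

# Partition evolution: change the partition spec without rewriting existing data files.
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD days(event_ts)")
```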
-
Open Source Data Summit (OSDS) is happening on 2nd October 🎉 It's all about Open Source in Data. OSDS explores the landscape of open source data tools and storage and their pivotal role in modern data ecosystems. OSDS is virtual & free to join (link in comments). Join me & Ashvin A. for our talk on Apache XTable (Incubating) on 2nd Oct. We will start with the nuances of lakehouse architecture, the need for interoperability between Apache Hudi, Apache Iceberg & Delta Lake, and some critical learnings. Some other talks that I am interested in:
🌟 Mixed model arts - The convergence of data modeling across apps, analytics, and AI by Joe Reis 🤓
🌟 Enhancing interoperability of open table formats with Apache XTable by Stephen Said, Matthias Rudolph
🌟 The new normal: Unbundling your data platform with an open data lakehouse by Vinoth Chandar
🌟 Optimizing data lake infrastructure for sub-second query latency by Emil Emilov
Registration link in comments.
#dataengineering #softwareengineering
-
What is Delta Lake UniForm, and how can you use it to make your data lake interoperable? Delta Lake UniForm is a universal format that aims to streamline interoperability across data lake environments like Delta Lake, Apache Iceberg, and Apache Hudi. It is designed to unify and simplify data access across various open data lake formats.
Why Use Delta Lake UniForm?
1. Enhanced Portability: Ensures data can be easily moved and accessed across different data lake formats without compatibility issues.
2. Improved Reliability: Maintains ACID transactions, audit history, and other critical features across platforms.
3. Optimized Performance: Delivers efficient query performance by leveraging uniform metadata handling and optimized data layouts.
Benefits:
- Flexibility: Quickly adapt to changing data and query patterns without extensive reconfiguration.
- Scalability: Efficiently scale data lakes from terabytes to petabytes while maintaining high performance and reliability.
- Consistent Metadata: Uses a standardized approach to handle metadata, ensuring consistent performance and reliability across platforms.
#databricks #dataengineering #WhatsTheData
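As a rough illustration (not from the post itself), here is a minimal PySpark sketch of enabling UniForm on a Delta table so Iceberg clients can read its metadata. It assumes a Databricks or Delta Lake 3.x environment; the table name sales.orders is a placeholder, and the property names should be verified against your runtime's UniForm documentation:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("uniform-sketch").getOrCreate()

# Create a Delta table with UniForm enabled so its metadata is also exposed
# in Iceberg format (table and column names are placeholders).
spark.sql("""
    CREATE TABLE sales.orders (
        order_id   BIGINT,
        amount     DOUBLE,
        order_date DATE
    )
    USING DELTA
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")

# Existing tables can be upgraded the same way.
spark.sql("""
    ALTER TABLE sales.orders SET TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")
```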
-
Breaking down data lineage in #ApacheFlink. If you're working with Apache Flink, understanding data lineage is more than just a nice-to-have; it's essential. Knowing how your data moves and transforms through your pipeline can make debugging easier, help with compliance, and boost performance. Colten Pilgreen does a great job of breaking down these concepts in a practical way. From understanding the #JobGraph to #StateManagement, this article provides clear steps that can help you get a better handle on your data flow in Flink. You can find the full article here: https://lnkd.in/dARtGQg2 #apacheflink #datalineage #dataobservability #datorios
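As a starting point for the JobGraph side of lineage (my own illustration, not from the article), here is a minimal PyFlink sketch that builds a tiny pipeline and dumps Flink's execution plan as JSON; the operator name and data are placeholders:

```python
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# A tiny placeholder pipeline: source -> map -> sink.
events = env.from_collection([("user_1", 3), ("user_2", 7), ("user_1", 5)])
scored = events.map(lambda e: (e[0], e[1] * 10)).name("score-events")
scored.print()

# The execution plan is a JSON description of the job graph (nodes, edges,
# parallelism), which is useful raw input for lineage tooling.
print(env.get_execution_plan())
```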
-
Struggling to track data flow in your Apache #Flink pipelines? 🛠️ Colten Pilgreen simplifies #datalineage concepts, giving you practical insights into #JobGraph and #StateManagement. These tips can seriously enhance your debugging, compliance, and performance efforts. 👉 Read more here: https://lnkd.in/dARtGQg2 Looking for tailored guidance? Let’s set up a quick discovery call! 💡 ➡ https://lnkd.in/gq2JbHap #ApacheFlink #DataLineage #TechTalks #Datorios
-
Why are Data Engineers talking about Apache Iceberg? Iceberg is changing the way we work with data lakes. It makes managing large datasets easier and provides ACID properties. You can time travel to old versions of your data, which is helpful for tracking changes and fixing errors. For those not using the Databricks ecosystem, Apache Iceberg can be a good alternative thanks to its flexibility and compatibility with the big players in the Data Engineering field, without requiring you to change your existing setup. We at Datazip are hosting a webinar this Thursday at 8:30 PM IST to go over the details of Iceberg and why it's gaining traction. Click the link below to book your exclusive seat. ⤵️ If you haven't tried Iceberg yet, shoot us a DM - let's talk about how it can help you. 🙌 Varun B. | Rohan Khameshra #apacheiceberg #aws #datalake #datawarehouse #datazip #dataengineers #webinar
-
Apache Druid is a game-changer in the world of data analytics! This open-source, distributed data store is designed for real-time analytics and offers lightning-fast query performance, thanks to its scalable architecture. If you're interested in enhancing your data analytics strategies, Apache Druid can help you achieve your goals! Here are some of its key features:
1️⃣ Lightning-fast query performance
2️⃣ Handles data from diverse sources, including logs, events, and databases
3️⃣ Organizes data into immutable segments for efficient querying
4️⃣ Utilizes indexing for rapid data retrieval
5️⃣ Components include the ingestion system, query engine, indexing service, and coordinator nodes
6️⃣ Enables real-time or near-real-time analytics on large datasets
Let's connect and discuss how Apache Druid can take your data analytics to the next level!
#ApacheDruid #RealTimeAnalytics #datascience
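To make the query side concrete, here is a small Python sketch (my own, not from the post) that sends a SQL query to Druid's HTTP SQL endpoint; the router URL, the wikipedia datasource, and the column names are assumptions for illustration:

```python
import requests

# Druid exposes a SQL API over HTTP; the router commonly listens on port 8888.
# The URL, datasource ("wikipedia") and columns below are placeholders.
DRUID_SQL_URL = "http://localhost:8888/druid/v2/sql"

query = """
    SELECT channel, COUNT(*) AS edits
    FROM wikipedia
    WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
    GROUP BY channel
    ORDER BY edits DESC
    LIMIT 10
"""

response = requests.post(DRUID_SQL_URL, json={"query": query}, timeout=30)
response.raise_for_status()

# Druid returns a JSON array of row objects by default.
for row in response.json():
    print(row["channel"], row["edits"])
```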
-
Catch Andrew Lamb, Staff Engineer at InfluxData and chair of the Apache Arrow Project Management Committee, at the Open Data Science Conference (ODSC) next month. He'll cover the basics of Apache Arrow and Apache Parquet, how to load data to and from pyarrow arrays, CSV files, and Parquet files, and how to use pyarrow to quickly perform analytic operations. https://bit.ly/4aja5zL #InfluxDB #ODSCEast #ApacheArrow #ApacheParquet
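In the same spirit as the talk, here is a minimal pyarrow sketch (my own illustration, with placeholder file and column names) that loads a CSV, round-trips it through Parquet, and runs a quick analytic operation:

```python
import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.parquet as pq
import pyarrow.compute as pc

# Load a CSV file into an Arrow table (file name and columns are placeholders).
table = pacsv.read_csv("measurements.csv")

# Persist it as Parquet and read it back.
pq.write_table(table, "measurements.parquet")
table = pq.read_table("measurements.parquet")

# A quick analytic operation on a column, e.g. the mean of a "value" column.
print(pc.mean(table["value"]).as_py())

# Columns are Arrow chunked arrays, the in-memory format the talk covers.
values: pa.ChunkedArray = table["value"]
print(values.type, len(values))
```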