StarTree Data Infra Team - Crafting the world's finest Pinot

Neha Pawar

Head of Data Infra at StarTree | We're hiring!

Published Jun 11, 2024

This year, I completed four wonderful years at StarTree. In this time, I've witnessed our transformation from a small startup of just five people to a flourishing Series B company with over 100 employees worldwide. We grew to a 40-person engineering team across the US and India, with a strong, talented, and energetic Data Infra team of 16 who tackle challenging database and distributed systems problems daily.

Data Team Mountain View's end of quarter celebration lunch

As I look to actively grow the Data team further, I've had the pleasure of speaking to many wonderful people, sharing insights about our company, our Data team, our projects, and the exciting roles we offer. I felt it was time to articulate the essence of the Data team in a blog post, to truly capture the work and opportunities that lie ahead. Here goes!

Creating a category: User-facing real-time analytics

Before diving into the Data team, I want to address a question I am asked frequently: What’s so special about Apache Pinot when there are so many other systems in the analytics landscape? Why can’t I just use Snowflake? Isn’t Presto already fast enough? I already have real-time systems like key-value stores and Elasticsearch, so why can’t I just use those?

Complementary technologies and predecessors such as Spark, Presto, Trino, and data warehouses like Snowflake and BigQuery made data processing much faster than it was a decade ago. However, these technologies focused on improving batch analytics, primarily catering to internal audiences with less stringent requirements for latency, concurrency, and data freshness.

Pinot, purpose-built for the demands of external, real-time analytics, created a niche of its own. It stands out for its ability to support millions of concurrent users, maintain sub-second latency across thousands of queries per second, and ensure data freshness within seconds, setting new standards for performance and flexibility (read more about how that’s possible in the blog post What makes Pinot fast). This is a stark contrast to the batch-oriented, internal analytics capabilities of traditional systems.

In the quest for real-time analytics, many turn to key-value (KV) stores and document stores for their real-time data access capabilities. Despite their strengths in specific domains, these systems are not inherently suited for complex analytical tasks. Attempting to repurpose them for analytics often leads to scalability and performance issues due to their fundamental design constraints. Apache Pinot, by contrast, is designed from the ground up with analytics in mind. Pinot has specialized indexes that accommodate a wide range of analytical queries, far beyond the capabilities of KV stores and document stores. Additionally, Pinot's ability to efficiently handle appending and upserting of events, with real-time indexing and pre-aggregation, addresses the limitations of other systems that require frequent rebuilding when encountering new data. These attributes highlight Apache Pinot’s unique position as a solution purpose-built for the complexities and demands of real-time analytics. You can learn more about this quadrant and comparison in this lightboard video by Tim Berglund.

Data Team’s Mission: Building a Robust and Scalable Data Platform

The Data team’s mission at StarTree isn’t just limited to making Pinot the fastest real-time database. Our mission encompasses scalability, resiliency, robustness, cost-efficiency, ease of use, and extensibility in everything we do.

Here’s a 1,000-foot view of how we think about the Data Platform at StarTree. We work on a variety of projects, ranging from core database internals such as the query engine, storage, and indexing to ingestion frameworks, data-in connectors, proxy layers for data-out connectors, performance benchmarking, and enterprise readiness. This allows us to explore the breadth of real-time data analytics problems and immerse ourselves in the depth of specific aspects.

Our Focus areas for 2024

We give great thought to what our themes for every year and every quarter should be. While customer requests are generally our top priority, we also leave ample bandwidth for strategic bets and developer productivity. Here are some of our major focus areas and investments for the year:

Multi-stage query engine on-by-default

The Data team at StarTree is leading efforts to transition Apache Pinot from a single-stage scatter-gather query execution engine to a multi-stage query engine, which enables complex operations like joins.
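To make the difference between the two execution models concrete, here is a toy, in-memory sketch in Python. It is not Pinot code: the data, the function names, and the hash-partitioned shuffle are all illustrative assumptions. It shows why a single scatter-gather pass is enough for a simple aggregation, while a join benefits from an intermediate stage that brings matching keys together.

```python
from collections import defaultdict

# Hypothetical toy data: each inner list stands in for the rows held by one server.
ORDERS_BY_SERVER = [
    [{"user_id": 1, "amount": 10}, {"user_id": 2, "amount": 5}],
    [{"user_id": 1, "amount": 7}, {"user_id": 3, "amount": 2}],
]
USERS_BY_SERVER = [
    [{"user_id": 1, "country": "US"}, {"user_id": 3, "country": "IN"}],
    [{"user_id": 2, "country": "IN"}],
]


def scatter_gather_sum(servers, key, value):
    """Single stage: every server aggregates locally, then a broker merges."""
    merged = defaultdict(int)
    for rows in servers:                  # scatter: one local aggregation per server
        partial = defaultdict(int)
        for row in rows:
            partial[row[key]] += row[value]
        for k, v in partial.items():      # gather: merge the partial results
            merged[k] += v
    return dict(merged)


def shuffle(servers, key, num_partitions):
    """Intermediate stage: repartition rows so equal keys land on the same worker."""
    partitions = [[] for _ in range(num_partitions)]
    for rows in servers:
        for row in rows:
            partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def multi_stage_join_sum(orders, users, num_partitions=2):
    """Shuffle both tables by join key, join and aggregate per partition, then merge."""
    order_parts = shuffle(orders, "user_id", num_partitions)
    user_parts = shuffle(users, "user_id", num_partitions)
    totals = defaultdict(int)
    for order_rows, user_rows in zip(order_parts, user_parts):
        country_of = {u["user_id"]: u["country"] for u in user_rows}
        for o in order_rows:
            if o["user_id"] in country_of:    # inner join on user_id
                totals[country_of[o["user_id"]]] += o["amount"]
    return dict(totals)
```

In this sketch, summing order amounts per user needs only local work plus a merge, whereas summing order amounts per country requires the shuffle step, because an order and its matching user row may start out on different servers.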

The multi-stage query engine was made generally available in 2023, and this year we’re focusing on bringing it fully on par with the v1 query engine in terms of functionality and operational ease (see all commits made by the team here). As we work towards making our multi-stage query engine the default, we are also steadily marching towards full PostgreSQL compliance.

Pinot as a backend for Observability

Observability data is crucial for every organization, regardless of size or product offering. In recent years, there has been increased scrutiny on infrastructure investments and a directive to do more with less. Alongside this, the rise of specialized systems for various parts of the observability stack has led consumers to gravitate towards a disaggregated observability stack.

We are already seeing high demand from our customers for an observability solution that leverages Pinot for storage and querying. Companies like Cisco and Uber are using Pinot in production for observability, replacing other technologies and achieving better performance and significant cost savings.

We are embarking on a journey to make observability a first-class product offering from StarTree Cloud. This will involve projects across various areas, including increasing coverage of native ingestion formats, encodings, data types, and indexes for different kinds of observability data, enhancing cloud-based tiered storage, building a native Grafana Pinot datasource plugin, and much more. If this intrigues you, I encourage you to watch this session from our recent Real-time Analytics Summit on Building an Observability Backend with StarTree Cloud to learn more about our work.

Ingestion at TB scale

We have the most extensive connector ecosystem of any real-time analytics database. This year, we are focusing on making it easier to add new connector plugins and building intuitive and comprehensive observability for our real-time and batch ingestion workflows. We're also continuously pushing the boundaries of scale and efficiency in ingestion. Our robust Minion task framework, complete with autoscaling and specialized tasks for ingesting and altering data, now enables us to handle hundreds of terabytes of data daily with ease.

Additionally, we are working on making our segment processor framework more flexible and performant. This involves introducing smarter intermediate storage file formats, optimal external sort algorithms, and batching mechanisms to optimize metadata updates, ensuring we achieve the best throughput and meet ingestion SLAs.
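For readers unfamiliar with external sorting, here is a minimal Python sketch of the general idea: sort bounded-size chunks in memory, spill each sorted run to intermediate storage, then merge the runs lazily. It is a textbook illustration under simple assumptions (integer records, plain-text temporary files, hypothetical function names), not the algorithm or file format used by Pinot's segment processor framework.

```python
import heapq
import os
import tempfile


def _spill(sorted_chunk):
    """Write one sorted run to a temporary file, one integer per line."""
    fd, path = tempfile.mkstemp(suffix=".run")
    with os.fdopen(fd, "w") as f:
        f.writelines(f"{value}\n" for value in sorted_chunk)
    return path


def _read_run(path):
    """Stream a sorted run back from disk one record at a time."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(values, chunk_size):
    """Yield integers in sorted order, buffering at most chunk_size of them
    in memory while the sorted runs are being built."""
    run_paths, chunk = [], []
    for value in values:
        chunk.append(value)
        if len(chunk) >= chunk_size:
            run_paths.append(_spill(sorted(chunk)))
            chunk = []
    if chunk:
        run_paths.append(_spill(sorted(chunk)))
    try:
        # heapq.merge performs a lazy k-way merge over the sorted runs.
        yield from heapq.merge(*(_read_run(p) for p in run_paths))
    finally:
        for p in run_paths:
            os.remove(p)
```

For example, `list(external_sort([5, 3, 9, 1, 4, 8, 2], chunk_size=3))` builds three runs on disk and merges them into a single sorted stream. Choices such as the run file format and the chunk size are exactly the kind of knobs that affect throughput in a real system.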

The people

In the Data team, I’ve met not only the most brilliant minds and best collaborators, but also made friends for life! Working with the Data team is like being in a think tank of fearless innovators. You’ll have the privilege of working with the creators of Apache Pinot, the PMC, the committers and many active contributors, who are deeply committed to making Pinot the fastest real-time analytics engine on the planet.

Open Source Community

One of the most fulfilling aspects of working with an open source project is the chance to interact with so many community members!

Starting from just 100 members four years ago, the Apache Pinot community has flourished, now boasting over 5,000 members, with hundreds of companies adopting it and thousands of fascinating use cases across sectors like retail, IoT, observability, fintech, social media, and more. Hear more about some of these Pinot adoption stories here: Stripe at Kafka Summit, Uber blog, Citi at RTAS, and Cisco meetup.

You’ll be exposed to the many different ways the community is adopting Pinot, get involved in interesting brainstorming discussions on the community Slack and GitHub, hear perspectives on your designs from people at different companies, and get ample opportunities to speak at meetups if that interests you.

Empowerment and Growth

Another thing I cherish about StarTree is the leadership's willingness to take bets on their people. You’ll be entrusted with responsibilities that might seem very daunting at first, but you’ll also get the support you need to learn, unlearn, relearn, make mistakes and still succeed. My own journey is a perfect example of this: I transitioned from a shy IC who didn’t care about anything other than my projects, to a Pinot PMC member, an active and vocal Pinot advocate in the database and analytics community, leading several complex technical projects and cross-functional initiatives, and now heading the Data Infra team at StarTree. I’ve seen similar transformations for several of my colleagues here, and I’m positive you can have a transformative and fulfilling journey like this if you set your mind to it.

Join Us!

This could be your workplace, your team, your project. If you revel in a culture that encourages tackling big responsibilities and owning complex problems, and you thrive in a fast-paced, hyper-growth environment, then StarTree is the place for you. We're not just witnessing the evolution of real-time analytics; we're leading it.

Please feel free to reach out to me if you think StarTree could be the right fit for you. I look forward to hearing from you and am excited about the possibility of welcoming you to the StarTree family!

Thank you Xiaobing L., Manish S., Gonzalo Ortiz Jaureguizar, Jitender Aswani, and Bhavani A. for your contributions to this post!

