StarTree Data Infra Team - Crafting the world's finest Pinot

This year, I completed four wonderful years at StarTree. In this time, I've witnessed our transformation from a small startup of just five people to a flourishing Series B company with over 100 employees worldwide. We grew to a 40-person engineering team across the US and India, with a strong, talented, and energetic Data Infra team of 16 who tackle challenging database and distributed systems problems daily.

Data Team Mountain View's end of quarter celebration lunch

As I look to actively grow the Data team further, I've had the pleasure of speaking to many wonderful people, sharing insights about our company, our Data team, our projects, and the exciting roles we offer. I felt it was time to articulate the essence of the Data team in a blog post, to truly capture the work and opportunities that lie ahead. Here goes!


Creating a category: User-facing real-time analytics

Before diving into the Data team, I want to address a question I get asked frequently: What’s so special about Apache Pinot when there are so many other systems in the analytics landscape? Why can’t I just use Snowflake? Isn’t Presto already fast enough? I already have real-time systems like key-value stores and Elasticsearch, so why can’t I just use those?

Complementary technologies and predecessors such as Spark, Presto, Trino, and data warehouses like Snowflake and BigQuery made data processing much faster than it was a decade ago. However, they focused on improving batch analytics, primarily for internal audiences with less stringent requirements for latency, concurrency, and data freshness.

Apache Pinot on the analytics quadrant

Pinot was purpose-built for external-facing, real-time analytics, and in doing so created a niche of its own. It stands out for its ability to support millions of concurrent users, maintain sub-second latency across thousands of queries per second, and ensure data freshness within seconds, setting new standards for performance and flexibility (read more about how that’s possible in the blog post What makes Pinot fast). This is a stark contrast to the batch-oriented, internal analytics capabilities of traditional systems.

In the quest for real-time analytics, many turn to key-value (KV) stores and document stores for their real-time data access capabilities. Despite their strengths in specific domains, these systems are not inherently suited for complex analytical tasks. Attempting to repurpose them for analytics often leads to scalability and performance issues due to their fundamental design constraints. Apache Pinot, by contrast, is designed from the ground up with analytics in mind. It has specialized indexes that accommodate a wide range of analytical queries, far beyond the capabilities of KV stores and document stores. Additionally, Pinot's ability to efficiently append and upsert events, with real-time indexing and pre-aggregation, addresses the limitations of systems that require frequent rebuilding when new data arrives. These attributes highlight Apache Pinot’s unique position as a solution purpose-built for the complexities and demands of real-time analytics. You can learn more about this quadrant and comparison in this lightboard video by Tim Berglund.

Data Team’s Mission: Building a Robust and Scalable Data Platform

The Data team’s mission at StarTree isn’t just limited to making Pinot the fastest real-time database. Our mission encompasses scalability, resiliency, robustness, cost-efficiency, ease of use, and extensibility in everything we do.

Data Team's charter

Here’s a 1,000-foot view of how we think about the Data Platform at StarTree. We work on a variety of projects, ranging from core database internals (query engine, storage, and indexing) to ingestion frameworks, data-in connectors, proxy layers for data-out connectors, performance benchmarking, and enterprise readiness. This allows us to explore the breadth of real-time data analytics problems and immerse ourselves in the depth of specific aspects.

Our Focus areas for 2024 

We give great thought to what our themes for every year and every quarter should be. While customer requests are generally our top priority, we also leave ample bandwidth for strategic bets and developer productivity. Here are some of our major focus areas and investments for the year:

Multi-stage query engine on-by-default

The Data team at StarTree is leading efforts to transition Apache Pinot from a single-stage scatter-gather query execution engine to a multi-stage query engine, which enables complex operations like joins.
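To make the difference concrete, here is a toy Python sketch. It is not Pinot's actual implementation; the data, function names, and worker count are all hypothetical. In the single-stage case each server aggregates its own rows and a broker merges the partial results. A join needs an extra exchange step that routes rows from both tables to the same worker by join key before the next stage can run.

```python
from collections import defaultdict

# Hypothetical data: each inner list holds the rows stored on one server.
orders_by_server = [
    [("u1", 10), ("u2", 5)],   # (user_id, amount)
    [("u1", 7), ("u3", 1)],
]
users_by_server = [
    [("u1", "US"), ("u3", "IN")],  # (user_id, country)
    [("u2", "US")],
]


def scatter_gather_sum(servers):
    """Single stage: every server aggregates locally, the broker merges."""
    partials = [sum(amount for _, amount in rows) for rows in servers]
    return sum(partials)


def shuffle(servers, num_workers):
    """Exchange step: route each row to a worker by hash of its join key."""
    workers = [[] for _ in range(num_workers)]
    for rows in servers:
        for row in rows:
            workers[hash(row[0]) % num_workers].append(row)
    return workers


def multi_stage_join_sum(orders, users, num_workers=2):
    """Stage 1 shuffles both tables by user_id; stage 2 joins and aggregates."""
    totals = defaultdict(int)
    order_parts = shuffle(orders, num_workers)
    user_parts = shuffle(users, num_workers)
    for order_part, user_part in zip(order_parts, user_parts):
        country_of = dict(user_part)  # build side of a simple hash join
        for user_id, amount in order_part:
            totals[country_of[user_id]] += amount
    return dict(totals)


total_amount = scatter_gather_sum(orders_by_server)
amount_by_country = multi_stage_join_sum(orders_by_server, users_by_server)
```

Because both tables are shuffled with the same hash function, matching keys always land on the same worker, which is what makes the per-worker join correct in this sketch.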

The multi-stage query engine was made generally available in 2023, and this year we’re focusing on bringing it fully on par with the v1 query engine in terms of functionality and operational ease (see all commits made by the team here). As we work towards making the multi-stage query engine the default, we are also steadily marching towards full PostgreSQL compliance.

Pinot as a backend for Observability

Observability data is crucial for every organization, regardless of size or product offering. In recent years, there has been increased scrutiny on infrastructure investments and a directive to do more with less. Alongside this, the rise of specialized systems for various parts of the observability stack has led consumers to gravitate towards a disaggregated observability stack.

We are already seeing high demand from our customers for an observability solution that leverages Pinot for storage and querying. Companies like Cisco and Uber are using Pinot in production for observability, replacing other technologies and achieving better performance and significant cost savings.

We are embarking on a journey to make observability a first-class product offering from StarTree Cloud. This will involve projects across various areas, including increasing coverage of native ingestion formats, encoding, data types, and indexes for different kinds of observability data, enhancing cloud based tiered storage, building a native Grafana Pinot datasource plugin, and much more. If this intrigues you, I encourage you to watch this session from our recent Real-time Analytics Summit on Building an Observability Backend with StarTree Cloud to learn more about our work.

Ingestion at TB scale

We have the most extensive connector ecosystem of any real-time analytics database. This year, we are focusing on making it easier to add new connector plugins and on building intuitive, comprehensive observability for our real-time and batch ingestion workflows. We're also continuously pushing the boundaries of scale and efficiency in ingestion. Our robust Minion task framework, complete with autoscaling and specialized tasks for ingesting and altering data, now enables us to handle hundreds of terabytes of data daily with ease.

Additionally, we are working on making our segment processor framework more flexible and performant. This involves introducing smarter intermediate storage file formats, optimal external sort algorithms, and batching mechanisms to optimize metadata updates, ensuring we achieve the best throughput and meet ingestion SLAs.

Upserts 

One very special project we work on, which sits at the confluence of the database and the ingestion engine, is upserts. At StarTree, we built a highly scalable upsert mechanism for Apache Pinot that uses less memory, scales to a larger number of primary keys, and performs operations more efficiently. We now comfortably support customers with 1 billion upsert keys on a single server.

StarTree Upserts
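For readers new to the term, here is a minimal Python sketch of upsert semantics: for each primary key, only the most recent version of a row is kept. This illustrates the concept only and is not how StarTree or Apache Pinot implements it. The `Event` fields, the choice of `order_id` as primary key, and `updated_at` as the comparison column are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Event:
    order_id: str    # hypothetical primary key
    status: str
    updated_at: int  # hypothetical comparison column (event time)


def apply_upserts(events: Iterable[Event]) -> Dict[str, Event]:
    """Keep only the newest event per primary key."""
    latest: Dict[str, Event] = {}
    for event in events:
        current = latest.get(event.order_id)
        # Replace the stored row only if the new event is at least as recent.
        if current is None or event.updated_at >= current.updated_at:
            latest[event.order_id] = event
    return latest


events = [
    Event("o1", "placed", 1),
    Event("o2", "placed", 2),
    Event("o1", "shipped", 3),
    Event("o1", "packed", 2),  # late-arriving event, older than "shipped"
]
table = apply_upserts(events)
```

The in-memory map from primary key to latest row is the part that grows with the number of keys, which is why memory use per key matters so much at the scale of a billion keys per server.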

This year, we're focusing on two key objectives. First, we're aiming to make advanced operations on upsert tables—such as cold start, backfill, rebalance, and merge-rollup—extremely easy and robust using our minion task framework. Second, we'll be adding support to make pre-aggregation techniques, like the StarTree index, compatible with upserts. Yes, that's right—the powerhouse features of the StarTree index and upserts will join forces, and the engineering team is super excited to tackle this challenge head-on!

StarTree Serverless

Earlier this year, we launched StarTree Serverless, allowing users to sign up and gain free access to a workspace on StarTree Cloud. Here’s a tech talk and demo on StarTree Serverless, presented at the RTA Summit. 

Logical isolation on shared physical cluster using workspaces in Apache Pinot

For the Data team, this meant providing a large, multi-tenant Pinot cluster that can be confidently shared across multiple signups while ensuring logical isolation for workspaces. Moving forward, the Data team will focus on adding more safeguards and improving resiliency around isolation, quotas, and throttling as we continue to grow our Serverless customer base. 

The people

In the Data team, I’ve met not only the most brilliant minds and best collaborators, but also made friends for life! Working with the Data team is like being in a think tank of fearless innovators. You’ll have the privilege of working with the creators of Apache Pinot, the PMC, the committers and many active contributors, who are deeply committed to making Pinot the fastest real-time analytics engine on the planet. 

Open Source Community

One of the most fulfilling aspects of working with an open source project is the chance to interact with so many community members!

Starting from just 100 members four years ago, the Apache Pinot community has flourished, now boasting over 5,000 members, with hundreds of companies adopting it and thousands of fascinating use cases across sectors like retail, IoT, observability, fintech, social media, and more. Hear more about some of these Pinot adoption stories here: Stripe at Kafka Summit, the Uber blog, Citi at RTAS, and the Cisco meetup.

StarTree All-Stars happy hours!


You’ll be exposed to the many different ways the community is adopting Pinot, get involved in interesting brainstorming discussions on the community Slack and GitHub, hear perspectives on your designs from people at different companies, and get ample opportunities to speak at meetups if that’s something that interests you.

Empowerment and Growth 

Another thing I cherish about StarTree is the leadership's willingness to take bets on their people. You’ll be entrusted with responsibilities that might seem very daunting at first, but you’ll also get the support you need to learn, unlearn, relearn, make mistakes, and still succeed. My own journey is a perfect example of this.

Presenting my work at QCon London 2023

I transitioned from a shy IC who didn’t care about anything other than my projects to a Pinot PMC member, an active and vocal Pinot advocate in the database and analytics community, a lead on several complex technical projects and cross-functional initiatives, and now the head of the Data Infra team at StarTree. I’ve seen similar transformations for several of my colleagues here, and I’m positive you can have a transformative and fulfilling journey like this if you set your mind to it.

Join Us!

This could be your workplace, your team, your project. If you revel in a culture that encourages tackling big responsibilities and owning complex problems, and you thrive in a fast-paced, hyper-growth environment, then StarTree is the place for you. We're not just witnessing the evolution of real-time analytics — we're leading it.

Please feel free to reach out to me if you think StarTree could be the right fit for you. I look forward to hearing from you and am excited about the possibility of welcoming you to the StarTree family!

Thank you Xiaobing L., Manish S., Gonzalo Ortiz Jaureguizar, Jitender Aswani, and Bhavani A. for your contributions to this post!
