StarTree Data Infra Team - Crafting the world's finest Pinot
This year, I completed four wonderful years at StarTree. In this time, I've witnessed our transformation from a small startup of just five people to a flourishing Series B company with over 100 employees worldwide. We grew to a 40-person engineering team across the US and India, with a strong, talented, and energetic Data Infra team of 16 who tackle challenging database and distributed systems problems daily.
As I look to actively grow the Data team further, I've had the pleasure of speaking to many wonderful people, sharing insights about our company, our Data team, our projects, and the exciting roles we offer. I felt it was time to articulate the essence of the Data team in a blog post, to truly capture the work and opportunities that lie ahead. Here goes!
Creating a category: User-facing real-time analytics
Before diving into the Data team, I want to address a question I get asked frequently: What’s so special about Apache Pinot when there are so many other systems in the analytics landscape? Why can’t I just use Snowflake? Isn’t Presto already fast enough? I already have real-time systems like key-value stores and Elasticsearch, so why can’t I just use those?
Complementary technologies and predecessors such as Spark, Presto, Trino, and data warehouses like Snowflake and BigQuery made data processing much faster than it was a decade ago. However, they focused on improving batch analytics, primarily for internal audiences with less stringent requirements for latency, concurrency, and data freshness.
Pinot, purpose-built to cater to the demands of external and real-time analytics, created a niche of its own. It stands out for its ability to support millions of concurrent users, maintain sub-second latency across thousands of queries per second, and ensure data freshness within seconds, setting new standards for performance and flexibility (read more about how that’s possible in the blog post "What makes Pinot fast"). This is a stark contrast to the batch-oriented, internal analytics capabilities of traditional systems.
In the quest for real-time analytics, many turn to key-value (KV) stores and document stores for their real-time data access capabilities. Despite their strengths in specific domains, these systems are not inherently suited for complex analytical tasks. Attempting to repurpose them for analytics often leads to scalability and performance issues due to their fundamental design constraints. Apache Pinot, by contrast, is designed from the ground up with analytics in mind. Pinot has specialized indexes that accommodate a wide range of analytical queries, far beyond the capabilities of KV stores and document stores. Additionally, Pinot's ability to efficiently handle appending and upserting of events, with real-time indexing and pre-aggregation, addresses the limitations of other systems that require frequent rebuilding when encountering new data. These attributes highlight Apache Pinot’s unique position as a solution purpose-built for the complexities and demands of real-time analytics. You can learn more about this quadrant and comparison in this lightboard video by Tim Berglund.
Data Team’s Mission: Building a Robust and Scalable Data Platform
The Data team’s mission at StarTree isn’t just limited to making Pinot the fastest real-time database. Our mission encompasses scalability, resiliency, robustness, cost-efficiency, ease of use, and extensibility in everything we do.
Here’s a 1,000-foot view of how we think about the Data Platform at StarTree. We work on a variety of projects, ranging from core database internals like the query engine, storage, and indexing to ingestion frameworks, data-in connectors, proxy layers for data-out connectors, performance benchmarking, and enterprise readiness. This allows us to explore the breadth of real-time data analytics problems and immerse ourselves in the depth of specific aspects.
Our Focus areas for 2024
We give great thought to what our themes for every year and every quarter should be. While customer requests are generally our top priority, we also leave ample bandwidth for strategic bets and developer productivity. Here are some of our major focus areas and investments for the year:
Multi-stage query engine on-by-default
The Data team at StarTree is leading efforts to transition Apache Pinot from a single-stage scatter-gather query execution engine to a multi-stage query engine, which enables complex operations like joins.
The multi-stage query engine was made generally available in 2023, and this year we’re focusing on bringing it fully on par with the v1 query engine in terms of functionality and operational ease (see all commits made by the team here). As we work towards making our multi-stage query engine the default, we are also steadily marching towards full PostgreSQL compliance.
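To make the difference between the two execution models concrete, here is a minimal toy sketch in Python. It is not Pinot's implementation: the data, the function names, and the hash-based shuffle are all assumptions chosen for illustration. The single-stage path aggregates on each server and merges at the broker, while the multi-stage path first exchanges rows by join key so that matching rows from both tables meet on the same worker.

```python
from collections import defaultdict

# Hypothetical toy data: each inner list is the slice of a table held by one server.
ORDERS_BY_SERVER = [
    [{"user_id": 1, "amount": 10}, {"user_id": 2, "amount": 5}],
    [{"user_id": 1, "amount": 7}, {"user_id": 3, "amount": 4}],
]
USERS_BY_SERVER = [
    [{"user_id": 1, "country": "US"}, {"user_id": 3, "country": "IN"}],
    [{"user_id": 2, "country": "US"}],
]


def scatter_gather_sum(servers):
    """Single stage: every server aggregates locally, the broker merges the partials."""
    partials = [sum(row["amount"] for row in rows) for rows in servers]  # scatter
    return sum(partials)  # gather


def shuffle(servers, key, num_workers):
    """Exchange step: route each row to a worker chosen by hashing the join key."""
    workers = [[] for _ in range(num_workers)]
    for rows in servers:
        for row in rows:
            workers[hash(row[key]) % num_workers].append(row)
    return workers


def multi_stage_join_sum(orders, users, num_workers=2):
    """Multi stage: shuffle both tables by key, hash-join per worker, then aggregate."""
    order_parts = shuffle(orders, "user_id", num_workers)
    user_parts = shuffle(users, "user_id", num_workers)
    totals = defaultdict(int)
    for order_rows, user_rows in zip(order_parts, user_parts):
        country_of = {u["user_id"]: u["country"] for u in user_rows}  # build side
        for order in order_rows:  # probe side
            if order["user_id"] in country_of:
                totals[country_of[order["user_id"]]] += order["amount"]
    return dict(totals)


total_amount = scatter_gather_sum(ORDERS_BY_SERVER)
amount_by_country = multi_stage_join_sum(ORDERS_BY_SERVER, USERS_BY_SERVER)
```

The join cannot be answered by the single-stage path alone, because an order and its matching user may live on different servers. The extra exchange stage is what brings them together.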
Pinot as a backend for Observability
Observability data is crucial for every organization, regardless of size or product offering. In recent years, there has been increased scrutiny on infrastructure investments and a directive to do more with less. Alongside this, the rise of specialized systems for various parts of the observability stack has led consumers to gravitate towards a disaggregated observability stack.
We are already seeing high demand from our customers for an observability solution that leverages Pinot for storage and querying. Companies like Cisco and Uber are using Pinot in production for observability, replacing other technologies and achieving better performance and significant cost savings.
We are embarking on a journey to make observability a first-class product offering from StarTree Cloud. This will involve projects across various areas, including increasing coverage of native ingestion formats, encodings, data types, and indexes for different kinds of observability data, enhancing cloud-based tiered storage, building a native Grafana Pinot datasource plugin, and much more. If this intrigues you, I encourage you to watch the session "Building an Observability Backend with StarTree Cloud" from our recent Real-time Analytics Summit to learn more about our work.
Ingestion at TB scale
We have the most extensive connector ecosystem of any real-time analytics database. This year, we are focusing on making it easier to add new connector plugins and building intuitive and comprehensive observability for our real-time and batch ingestion workflows. We're also continuously pushing the boundaries of scale and efficiency in ingestion. Our robust Minion task framework, complete with autoscaling and specialized tasks for ingesting and altering data, now enables us to handle hundreds of terabytes of data daily with ease.
Additionally, we are working on making our segment processor framework more flexible and performant. This involves introducing smarter intermediate storage file formats, optimal external sort algorithms, and batching mechanisms to optimize metadata updates, ensuring we achieve the best throughput and meet ingestion SLAs.
Upserts
One very special project we work on, which sits at the confluence of the database and the ingestion engine, is upserts. At StarTree, we built a highly scalable upsert mechanism for Apache Pinot that uses less memory, scales to a larger number of primary keys, and performs operations more efficiently. We now comfortably support customers with 1 billion upsert keys on a single server.
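For readers new to the idea, here is a minimal in-memory sketch in Python of upserts over append-only storage. It is a toy under stated assumptions, not StarTree's or Pinot's actual implementation: the class names, the segment names, and the use of a timestamp as the comparison value are all hypothetical. Rows are only ever appended, and a map from primary key to record location decides which row is visible to queries.

```python
from dataclasses import dataclass


@dataclass
class RecordLocation:
    segment: str    # hypothetical segment name
    doc_id: int     # position of the row inside that segment
    timestamp: int  # comparison value used to pick the latest row


class UpsertTable:
    """Toy upsert table: append-only segments plus a primary-key map."""

    def __init__(self):
        self.segments = {}  # segment name -> list of rows
        self.latest = {}    # primary key -> RecordLocation of the visible row

    def ingest(self, segment, key, timestamp, row):
        rows = self.segments.setdefault(segment, [])
        rows.append(row)  # storage is append-only; nothing is rewritten
        current = self.latest.get(key)
        # Only a row at least as new as the current one becomes visible.
        if current is None or timestamp >= current.timestamp:
            self.latest[key] = RecordLocation(segment, len(rows) - 1, timestamp)

    def scan(self):
        """Return only the latest row for each primary key."""
        return {
            key: self.segments[loc.segment][loc.doc_id]
            for key, loc in self.latest.items()
        }


table = UpsertTable()
table.ingest("seg0", "order-1", 100, {"status": "placed"})
table.ingest("seg0", "order-2", 101, {"status": "placed"})
table.ingest("seg1", "order-1", 105, {"status": "shipped"})  # newer, replaces
table.ingest("seg1", "order-2", 90, {"status": "stale"})     # older, ignored
visible = table.scan()
```

In a sketch like this, the key map grows with the number of primary keys, which is why memory use and key-count scalability are the hard parts of any real upsert design.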
This year, we're focusing on two key objectives. First, we're aiming to make advanced operations on upsert tables—such as cold start, backfill, rebalance, and merge-rollup—extremely easy and robust using our minion task framework. Second, we'll be adding support to make pre-aggregation techniques, like the StarTree index, compatible with upserts. Yes, that's right—the powerhouse features of the StarTree index and upserts will join forces, and the engineering team is super excited to tackle this challenge head-on!
StarTree Serverless
Earlier this year, we launched StarTree Serverless, allowing users to sign up and gain free access to a workspace on StarTree Cloud. Here’s a tech talk and demo on StarTree Serverless, presented at the RTA Summit.
For the Data team, this meant providing a large, multi-tenant Pinot cluster that can be confidently shared across multiple signups while ensuring logical isolation for workspaces. Moving forward, the Data team will focus on adding more safeguards and improving resiliency around isolation, quotas, and throttling as we continue to grow our Serverless customer base.
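As a rough illustration of what per-workspace quotas and throttling can look like, here is a small token-bucket sketch in Python. It is not how StarTree Serverless is implemented: the class names, rates, and workspace identifiers are assumptions made for the example. Each workspace gets its own bucket, so a noisy tenant exhausts only its own budget.

```python
class TokenBucket:
    """Classic token bucket: refills at a fixed rate up to a burst capacity."""

    def __init__(self, rate_per_sec, burst, now):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now):
        # Refill based on the time elapsed since the last call.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class WorkspaceThrottle:
    """Keeps one bucket per workspace so tenants are throttled independently."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.burst = burst
        self.buckets = {}

    def allow(self, workspace, now):
        if workspace not in self.buckets:
            self.buckets[workspace] = TokenBucket(self.rate, self.burst, now)
        return self.buckets[workspace].allow(now)


# Timestamps are passed in explicitly (in seconds) to keep the example deterministic.
throttle = WorkspaceThrottle(rate_per_sec=1, burst=2)
decisions = [
    throttle.allow("workspace-a", 0.0),  # within burst
    throttle.allow("workspace-a", 0.0),  # within burst
    throttle.allow("workspace-a", 0.0),  # burst exhausted
    throttle.allow("workspace-b", 0.0),  # a different tenant is unaffected
    throttle.allow("workspace-a", 1.0),  # one token refilled after a second
]
```

A production system would also need quotas on storage and ingestion, but the same isolation principle applies: limits are tracked per tenant, never globally.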
The people
In the Data team, I’ve met not only the most brilliant minds and best collaborators, but also made friends for life! Working with the Data team is like being in a think tank of fearless innovators. You’ll have the privilege of working with the creators of Apache Pinot, the PMC, the committers and many active contributors, who are deeply committed to making Pinot the fastest real-time analytics engine on the planet.
Open Source Community
One of the most fulfilling aspects of working with an open source project is the chance to interact with so many community members!
Starting from just 100 members four years ago, the Apache Pinot community has flourished, now boasting over 5,000 members, with hundreds of companies adopting it and thousands of fascinating use cases across sectors like retail, IoT, observability, fintech, social media, and more. Hear more about some of these Pinot adoption stories here: Stripe at Kafka Summit, Uber blog, Citi at RTAS, Cisco meetup.
You’ll be exposed to the many different ways the community is adopting Pinot, get involved in interesting brainstorming discussions on the community Slack and GitHub, hear perspectives on your designs from people at different companies, and get ample opportunities to speak at meetups if that’s something that interests you.
Empowerment and Growth
Another thing I cherish about StarTree is the leadership's willingness to take bets on their people. You’ll be entrusted with responsibilities that might seem very daunting at first, but you’ll also get the support you need to learn, unlearn, relearn, make mistakes and still succeed. My own journey is a perfect example of this: I transitioned from a shy IC who didn’t care about anything other than my projects, to a Pinot PMC member, an active and vocal Pinot advocate in the database and analytics community, leading several complex technical projects and cross-functional initiatives, and now heading the Data Infra team at StarTree. I’ve seen similar transformations for several of my colleagues here, and I’m positive you can have a transformative and fulfilling journey like this if you set your mind to it.
Join Us!
This could be your workplace, your team, your project. If you revel in a culture that encourages tackling big responsibilities and owning complex problems, and you thrive in a fast-paced, hyper-growth environment, then StarTree is the place for you. We're not just witnessing the evolution of real-time analytics — we're leading it.
Please feel free to reach out to me if you think StarTree could be the right fit for you. I look forward to hearing from you and am excited about the possibility of welcoming you to the StarTree family!
Thank you Xiaobing L., Manish S., Gonzalo Ortiz Jaureguizar, Jitender Aswani, and Bhavani A. for your contributions to this post!