Building your own Event Tracking System
When we build a website or app, we often reach for an analytics tool to track its usage -- the most famous ones being Google Analytics, Mixpanel, etc.
Developers often have a hard time integrating these properly and tracking every analytics event -- button clicks, URL redirections, scroll events, time spent on a feature, etc. -- in short, the whole user behaviour.
Have you ever thought about how you could build your own small tracking feature and a mini dashboard super quickly? Let's see how! 🚀
Capturing Events
The first goal is to generate events and capture them in our system. Let's see what we need in order to do this -
Defining Event Types
Now before creating the solution pieces, let's focus on what event types we would want to capture from our system. Ideally we would want to track -
Page view events - whenever a user visits a page or URL.
Click events - whenever a user clicks a button or any other component.
While we can have more event types in our system, the above two are the most used ones. Along with this, we would need an eventId that specifically says which page was visited or which component was clicked. This will help us track specific items and analyse them later.
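As a rough sketch (purely for illustration; the exact names are my own assumptions), these event types could be represented as simple constants in code:

from enum import Enum

class EventType(str, Enum):
    PAGE_VIEW = "PAGE_VIEW"      # a page / URL was visited, e.g. eventId "/home"
    CLICK_EVENT = "CLICK_EVENT"  # a component was clicked, e.g. eventId "contact_button_click"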
Building our Tracking System
The Tracking API
This API would be responsible for registering events and sending them to our backend system. Do you think one API would be enough for all types of events, or would you want multiple?
For me, I'll be happy to keep just one -- Single Responsibility Principle -- and then validate the request body based on the event type. An example request body would look like -
{
  "type": "CLICK_EVENT", // or PAGE_VIEW, SCROLL_EVENT, etc.
  "eventId": "contact_button_click",
  "data": {
    // any additional metadata for the event
  }
}
Data modelling is one of the most important aspects of building a system. As you can see above, we use a single request body for all our events, with two main items -- type and eventId.
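To make this concrete, here is a minimal sketch of such a tracking endpoint, using FastAPI and Pydantic purely for illustration -- the framework, route name and validation rules are my assumptions, not a prescribed implementation:

from enum import Enum
from fastapi import FastAPI
from pydantic import BaseModel, Field

class EventType(str, Enum):
    PAGE_VIEW = "PAGE_VIEW"
    CLICK_EVENT = "CLICK_EVENT"
    SCROLL_EVENT = "SCROLL_EVENT"

class TrackingEvent(BaseModel):
    type: EventType                           # which kind of event this is
    eventId: str                              # e.g. "contact_button_click" or "/home"
    data: dict = Field(default_factory=dict)  # any additional metadata

app = FastAPI()

@app.post("/events")
async def register_event(event: TrackingEvent):
    # The request body is validated against the model above;
    # from here the event would be handed off to storage.
    return {"status": "accepted", "eventId": event.eventId}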
Storing the Tracking Events
Now that we have our API defined, we need to store our events. You might argue that we should just use PostgreSQL or MongoDB and insert one record per event, such as -
id, type, eventId, timestamp, data
1, "PAGE_VIEW", "/home", 12345678, {...}
2, "CLICK_EVENT", "submit_button", 12345678, {...}
Now, a production-ready application would have a huge number of events being fired -- in the millions, if not billions. Do we really need to consume so much space in our database and pay for that scale, or can we store the events in a much better way?
You can probably already guess why we don't want to save each event as an independent row / record.
Let's talk about the bucketing pattern, where we store counts in a single record (per eventId) rather than one record per event. Should we store all the event counts in one single record, forever? Nope -- then we would never be able to filter on daily counts, monthly counts, and so on.
We can bucket on eventId as well as a timeframe -- specifically the lowest granularity you would want to analyse the data at, say days. A sample record would look like -
{
  "type": "CLICK_EVENT",
  "eventId": "contact_button_click",
  "date": "2024-08-09",
  "count": 492
}
Now, this makes it super easy: for every API call that comes in, we just need to do a count++, i.e. upsert the record with the given eventId and today's date. Why upsert? Because as soon as the date changes, the upsert will create a new record for the new date and start its counter at 1.
Single record / document updates in MongoDB are always atomic, even when not used inside a transaction. 💚
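Putting the two together, a sketch of the daily-bucket upsert with pymongo could look like this (again, the connection string and collection names are assumptions):

from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed connection string
daily_counts = client["analytics"]["daily_counts"]  # assumed db / collection names

def track_event(event_type: str, event_id: str) -> None:
    """Increment today's counter for this eventId, creating the bucket if needed."""
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    daily_counts.update_one(
        {"type": event_type, "eventId": event_id, "date": today},
        {"$inc": {"count": 1}},  # atomic per-document increment
        upsert=True,             # first event of the day creates the bucket with count = 1
    )

track_event("CLICK_EVENT", "contact_button_click")

Because the $inc runs as a single-document update, concurrent API calls hitting the same bucket never lose counts.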
The best part is that if I need to analyse the weekly performance, the query will read only 7 records from the database, even if we have a total of millions of events! 💜
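For example, a week's worth of clicks is just a small range query over at most 7 bucket documents -- a pymongo sketch under the same assumed collection:

from datetime import date, timedelta
from pymongo import MongoClient

daily_counts = MongoClient("mongodb://localhost:27017")["analytics"]["daily_counts"]  # assumed names

start = (date.today() - timedelta(days=6)).isoformat()  # 7 days including today

weekly = list(
    daily_counts.find(
        {"eventId": "contact_button_click", "date": {"$gte": start}},
        {"_id": 0, "date": 1, "count": 1},  # project only what the dashboard needs
    ).sort("date", 1)
)
total_clicks = sum(doc["count"] for doc in weekly)
print(weekly, total_clicks)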
Scaling it up
Now a lot of you would ask - why haven't we used a Time Series database? I would ask you to first check whether you really need one: new infrastructure, the time to set up a Time Series database, the added complexity of storing every raw event, and so on.
Yes, you absolutely can use one - for example, when your granularity is at the minute level, or you want to query from one specific minute to another, time series databases really help. But if your use cases are simple, you don't have many eventIds, and you don't need minute-level granularity, you can probably use your primary general-purpose database as well.
I would prefer MongoDB any day, primarily because data is stored as documents, I can run complex Aggregation Queries, and whenever I want to scale my collection (table), I can very quickly shard it based on time (or days) and the queries become super performant without extra work. Sharding an SQL database is much tougher, especially if you are managing your own infra, but that's just my personal choice.
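As an illustration of the kind of aggregation this model enables, here is a sketch that rolls the daily buckets up into monthly counts (the pipeline and collection names are my assumptions):

from pymongo import MongoClient

daily_counts = MongoClient("mongodb://localhost:27017")["analytics"]["daily_counts"]  # assumed names

monthly = daily_counts.aggregate([
    {"$match": {"eventId": "contact_button_click"}},
    # "date" is stored as "YYYY-MM-DD", so the first 7 characters give us the month
    {"$group": {
        "_id": {"$substrCP": ["$date", 0, 7]},
        "count": {"$sum": "$count"},
    }},
    {"$sort": {"_id": 1}},
])
for bucket in monthly:
    print(bucket["_id"], bucket["count"])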
Future
Now that we have the basic concepts and knowledge in place, the next steps -- such as building the mini dashboard on top of these events -- are coming up in the next few articles, so stay tuned :)
If you liked this article, please do ❤️ Like and share your feedback in 💬 Comments.
Cosmocloud Low-Code Hackathon
Cosmocloud is coming up with a Low-Code Hackathon where Developers and Engineers can build applications super quickly and win up to INR 12 Lacs worth of prizes. Sponsored by Google for Developers & MongoDB 🚀
Special prizes for Ideas Submitted, Top 50 teams, Best Use and many more exclusives! Register here now! 🚀