Email Notification Logs: Story of building a scalable email delivery tracking system
Introduction
This blog discusses the journey of building Email Notification Logs functionality in Jira Service Management with the goal of extending it to all Jira family products (like JSW and JWM). This functionality provides a comprehensive view (or logs) of email delivery to the customer’s mailbox and detailed reasoning for any failures. This capability plays a crucial role in bringing transparency to the system, ultimately leading to faster issue identification and resolution.
As the tech lead on this project, I am responsible for the success of every stage of the software development lifecycle. This starts with product ideation and shaping the feature's look and functionality, moves on to technical solution design, and ends with execution, with quality as a priority throughout.
Brief about Email Notification logs
Jira Service Management by Atlassian ensures that end users stay updated through email notifications on account activity, such as customer invites, and on Jira request activity, such as new request creation, new comments, and status changes. These notifications are essential for keeping end users and Jira agents aligned, which in turn is required for business continuity. It is therefore imperative to be able to track failed email notifications. Previously, email tracking functionality was missing, and customers had to reach out to Atlassian support to get problems resolved, which took significant time and created a heavy support load. Email Notification Logs bridges this key gap.
Now that we understand the feature, we are ready to dive deep into the design of the email log system.
Architecture Design
The architecture is built on an event-based design following the CQRS pattern. It stitches together three important parts:
Let's walk through the above architecture and learn how we were able to build and deliver it quickly by leveraging platform components and adhering to best design practices.
1. Data Collection
Steps 1-3 in the Architecture diagram
To build the notification tracking functionality, data is collected from multiple sources and published to the event bus platform, which is designed for decoupled, service-to-service communication through events. These sources are
2. Data Processing & Storage
Steps 4-5 in the Architecture diagram
Events from the event bus are consumed by the new Kotlin-based Logs service, which triages events, enriches them, and stores them in the database.
The data is stored in a platform datastore built on top of Amazon DynamoDB, which offers excellent features such as data residency, reliability, and scalability. This ensures that developers only need to focus on building incredible features and can safely leave all the heavy lifting to the platform.
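The triage, enrich, and store flow described above can be sketched roughly as follows. This is a minimal illustration in Python rather than the Kotlin used by the actual Logs service, and every name in it (the event types, `LogRecord`, `InMemoryStore`, the `project_names` lookup) is a hypothetical stand-in, not the real schema or datastore API.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical event types; the article does not list the real ones.
RELEVANT_TYPES = {"email.delivered", "email.failed"}


@dataclass
class LogRecord:
    notification_id: str
    status: str
    reason: Optional[str]
    project_name: str


class InMemoryStore:
    """Stand-in for the platform datastore (DynamoDB-backed in the article)."""

    def __init__(self):
        self.records = {}

    def put(self, record: LogRecord) -> None:
        self.records[record.notification_id] = record


def triage(event: dict) -> bool:
    """Keep only events the logs feature cares about."""
    return event.get("type") in RELEVANT_TYPES and "notification_id" in event


def enrich(event: dict, project_names: dict) -> LogRecord:
    """Attach display data at write time so reads need no extra lookups."""
    return LogRecord(
        notification_id=event["notification_id"],
        status=event["type"].split(".", 1)[1],
        reason=event.get("reason"),
        project_name=project_names.get(event.get("project_id"), "unknown"),
    )


def handle_event(event: dict, store: InMemoryStore, project_names: dict) -> bool:
    """Triage, enrich, and store one event; returns False if it was dropped."""
    if not triage(event):
        return False
    store.put(enrich(event, project_names))
    return True
```

Doing the enrichment on the write path is what lets the read path stay a plain lookup.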
3. Serving Read Queries
Steps 6-7 in the above diagram
All read queries come through the GraphQL gateway (another platform component) to the Logs service, where, after request validation and the appropriate permission checks, data is fetched from the database and served back to the user. Because enrichment is performed asynchronously during the write operation, reads have low latency, which leads to a better user experience. The front end is built in React and supports infinite scrolling: the next page loads automatically as the user reaches the end of the current page, eliminating the need to click a "next" button.
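Infinite scrolling is commonly backed by cursor-based pagination on the server. The sketch below shows one way the read path (validation, permission check, fetch) could look under that assumption. It is written in Python for brevity, and the function name, parameters, page-size limit, and response shape are all hypothetical, not the actual GraphQL schema.

```python
from typing import Optional

MAX_PAGE_SIZE = 50  # assumed limit, not taken from the article


def fetch_page(logs: list, user_projects: set, project_id: str,
               first: int = 20, after: Optional[str] = None) -> dict:
    """Return one page of logs for a project.

    `logs` is a list of dicts, newest first, each with 'id' and 'project_id'.
    `after` is the cursor (the id of the last item the client already has).
    """
    # 1. Request validation
    if not 1 <= first <= MAX_PAGE_SIZE:
        raise ValueError("first must be between 1 and MAX_PAGE_SIZE")
    # 2. Permission check
    if project_id not in user_projects:
        raise PermissionError("user cannot view logs for this project")
    # 3. Fetch the rows that follow the cursor
    rows = [r for r in logs if r["project_id"] == project_id]
    start = 0
    if after is not None:
        ids = [r["id"] for r in rows]
        if after not in ids:
            raise ValueError("unknown cursor")
        start = ids.index(after) + 1
    page = rows[start:start + first]
    return {
        "items": page,
        "end_cursor": page[-1]["id"] if page else None,
        "has_next_page": start + first < len(rows),
    }
```

When the user nears the end of the list, the client would call this again with `after` set to the previous `end_cursor`, and stop once `has_next_page` is false.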
Architecture Scalability & Extensibility
Why worry about Scalability & Extensibility?
Notification logs are a requirement not just for JSM but also for other Jira family products like Jira Software, so it is crucial to design the system with scalability and extensibility in mind.
To provide a quick snapshot of our scale, we manage and send over 8 million notifications daily. Of that volume, around 4% (roughly 320,000 notifications a day) fail at email service providers (ESPs) for reasons such as the mailbox not being available or the email address being on a suppression list.
For scalability, the architecture follows the CQRS design pattern, which makes it easy to scale reads and writes separately. We use the web-worker model here. Consumer nodes process incoming write events using the async worker model and store them in the platform datastore. Web nodes, on the other hand, handle incoming read requests through an API layer written in GraphQL.
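As a rough illustration of this read/write split, the Python sketch below models consumer nodes as worker threads draining a queue of write events, and the web node as a separate read function over the same store. The class and function names are hypothetical, and an in-process queue and dictionary stand in for the event bus and platform datastore.

```python
import queue
import threading


class LogStore:
    """Shared store; stands in for the platform datastore."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_project = {}

    def append(self, project_id, entry):
        with self._lock:
            self._by_project.setdefault(project_id, []).append(entry)

    def list_for_project(self, project_id):
        with self._lock:
            return list(self._by_project.get(project_id, []))


def consumer_worker(events: queue.Queue, store: LogStore) -> None:
    """Write (command) side: drain events until a None sentinel arrives."""
    while True:
        event = events.get()
        if event is None:
            break
        store.append(event["project_id"], event)


def handle_read(store: LogStore, project_id: str) -> list:
    """Read (query) side: no write logic, just a lookup."""
    return store.list_for_project(project_id)


def run_consumers(events: queue.Queue, store: LogStore, workers: int = 2) -> None:
    """Start `workers` consumers, then wait for them to finish."""
    threads = [threading.Thread(target=consumer_worker, args=(events, store))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for _ in threads:
        events.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
```

Because the two sides share only the store, the number of consumers can be raised to absorb write bursts without touching the read handlers, and vice versa.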
For extensibility, the low-level design keeps a generic schema across both the read and write layers: both the GraphQL and DB schemas contain fields like Project ID, Issue ID, Notification Type, and Notification Subtype. These fields are common to all Jira family products, which makes the schema product-agnostic and, therefore, easily extensible.
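A product-agnostic record along these lines might look like the Python sketch below. The four fields named in the text are taken from the article. The `product`, `delivery_status`, and `failure_reason` fields, along with the example values, are assumptions added only to make the example complete.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class NotificationLog:
    # Fields named in the article, common to all Jira family products
    project_id: str
    issue_id: str
    notification_type: str
    notification_subtype: str
    # Assumed fields, not taken from the article
    product: str
    delivery_status: str
    failure_reason: Optional[str] = None


# The same shape serves two different products (values are illustrative).
jsm_log = NotificationLog("p1", "i1", "request", "comment_added",
                          product="jsm", delivery_status="failed",
                          failure_reason="mailbox unavailable")
jsw_log = NotificationLog("p2", "i9", "issue", "status_changed",
                          product="jsw", delivery_status="delivered")
```

Since no field is specific to one product, onboarding another product means emitting records of the same shape rather than changing the schema.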
Summary
In summary, this article has explored how, at Atlassian, we developed a highly scalable email delivery tracking system in JSM that can be leveraged by other products in the Jira family. This was achieved by following industry-standard processes and techniques, which, in a nutshell, include the CQRS pattern and a generic schema for easy extensibility. Finally, a NoSQL database is a good fit here because it handles storing and reading data well irrespective of its size.
Lastly, I hope you enjoyed reading this article. Stay tuned for more insightful tech articles.