How I caused my first Production Incident
I was once a budding SRE (~5 months into the job) 👶
I was working at a public transport startup where my team owned the entire cloud infrastructure, the backbone of all the software bringing in revenue.
We were in the process of adopting a new monitoring solution across our infrastructure, and I was tasked with integrating it everywhere (application servers, databases, the cloud account) to get end-to-end visibility.
Our main production database cluster was self-managed MySQL: one writer and a couple of readers kept in sync via binlog replication.
It held hundreds of GBs of data for our most crucial, user-facing microservices, serving 50K+ read/write operations every day to support customer rides.
If you’ve ever set up monitoring, you know you have to install agents on your servers that export metrics, events & logs to the central monitoring backend.
To get deep observability into MySQL, I had to create a new database user with the appropriate permissions for the monitoring agent.
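For context, the query in question looks something like this. It's a minimal sketch assuming a generic monitoring agent; the user name, host pattern, password, and exact grants are illustrative, and each vendor documents its own required privileges:

```sql
-- Create a dedicated monitoring user and give it read-only visibility
-- into server status, replication state, and performance_schema.
CREATE USER 'monitor'@'%' IDENTIFIED BY 'a-strong-password';
GRANT PROCESS, REPLICATION CLIENT ON *.* TO 'monitor'@'%';
GRANT SELECT ON performance_schema.* TO 'monitor'@'%';
```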
So I ran the user creation query... on all the MySQL nodes, including the readers.
Yup! I was only supposed to create the new user on the writer node (the primary), which would then propagate it to all the readers.
But I didn't realise that creating a user is a write operation that gets replicated to the readers through the binlog. So I created it everywhere! 😰
Obviously, this froze all data replication: when my CREATE USER statement arrived at each reader through the binlog, it collided with the identical user I had already created there, and the replication thread halted. Due to our strong consistency policy, the database was no longer accepting writes.
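On any of the readers, the replication status at that moment would have looked roughly like this. This is a reconstruction, not the actual output we saw that day:

```sql
SHOW SLAVE STATUS\G  -- SHOW REPLICA STATUS on MySQL >= 8.0.22

-- Relevant fields in the output:
--   Slave_IO_Running:  Yes
--   Slave_SQL_Running: No
--   Last_SQL_Errno:    1396
--   Last_SQL_Error:    Error 'Operation CREATE USER failed for
--                      'monitor'@'%'' on query ...
```

The SQL applier thread stops dead on the failed statement, and the reader drifts further behind the writer with every passing minute.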
Our main database was now inconsistent, no customer bookings were being accepted during peak traffic, and no money was being made!
The business came to a standstill.
If this wasn't bad enough, I also didn't realise what I had done. I just went off for my coffee break until my manager & the database lead got alerted and started to unpack my adventure.
I was called in. We discussed what happened. Then they took full responsibility for the issue, locked down the database, rolled back my changes, brought the state back to consistency, and got the whole system back online.
They kept everyone informed about the timelines, conducted a full post-mortem, put processes in place to avoid this mistake in the future (one such guardrail is sketched below), and laid the whole thing to rest. Those people made sure I now had the right knowledge to operate the systems.
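A common guardrail here (I can't say this is exactly what my team implemented) is to make the readers reject writes outright, so a stray statement run on a replica fails immediately instead of silently diverging its state:

```sql
-- On each reader: refuse all client writes, even from users with
-- SUPER privileges. Available since MySQL 5.7; turning this ON
-- implicitly enables read_only as well. Replication applier threads
-- are still allowed to write.
SET GLOBAL super_read_only = ON;
```

To make it survive restarts, the same setting also belongs in the [mysqld] section of the server configuration.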
Not a single person blamed me. Nobody even asked who screwed up. I got promoted in the next cycle.
That one incident taught me more about leadership than the 23 years of life that came before it.
Comments

Engineering at Databricks: Were you a super user? That could explain the replicas accepting write commands. There is a super_read_only option to prevent such mishaps. Otherwise, setting read_only is top priority.

CTO: Haha, guess who still blames you

Airmeet.com | Full "House" Developer | Manager | SRE | AWS | Machine Learning: Man, those coffee breaks!! They always come at crucial times. 😅 Been there, done that!! 😄

SRE: Been there, done that. 😅 Btw, glad to see the blameless postmortem being conducted.