How to Achieve SQL Server Reliability – Part 1
My earlier article Why SQL Server Stability is a BAD Thing discussed why that isn’t the route to making your SQL Server reliable. The obvious question is – “if it’s not about stability then what areas need to be looked at to achieve reliability”?
I believe there’s four technical areas that affect reliability. You must be on top of the first three as they all have a direct influence. The fourth one is important to help achieve long-term reliability, but it won’t give you it on a day-to-day basis. In this age of Cybersecurity, some might consider that a controversial statement to make and one I’ll expand on when I cover it.
· Availability – is the SQL Server service available when it’s needed?
· Consistency – are the results returned consistent and correct?
· Performance – do all the operations take place at an acceptable speed?
· Security – is my data and my installation secure against a cyber-attack?
But before we go into these, there’s something you need to clarify…
What does the business need and what can it afford?
In my experience, these two aspects are often at odds with each other. i.e. “We want 100% uptime and we’ve got no extra budget to achieve it”.
Before you dive into achieving technical reliability, I’d highly recommend you sit with the Business and ensure everyone’s agreed on what’s needed (with a hard dose of financial reality-check thrown in).
A significant part of that discussion needs to be around RPO/RTO (Recovery Point Objective and Recovery Time Objective) yet it’s a concept I find both IT and the Business are often unaware of. Before we dive into the technical aspects of achieving SQL Server reliability, we need to understand how to get agreement between the Business and IT as to what that the goal is. That’s not a conversation you want to be having, for the first time, in the middle of a disaster.
Firstly, 100% uptime, 365 days a year, requires heavy investment in both equipment and people. Very few businesses need it. Even those that operate 24x7x365 accept there’s going to be some downtime due to technical problems.
Downtime around patching/maintenance can be avoided but needs a more complex system (using clustering). You’ll likely need a DBA to manage that rather than the “accidental DBA” who ends up inheriting SQL Server due to their other engineering roles such as Systems or Development. SQL Server clusters are good, but they need some TLC.
Deciding What the Business Needs?
The place to start this discussion is with RPO/RTO, so we better look at each of these in turn and then how they relate to each other. They come into play when a disaster strikes. RPO looks back in time and RTO looks forward in time, from the moment of the incident.
RPO
RPO is the Recovery Point Objective. This is how far back in time you need to go to recover your data. With a good backup system this can be a few minutes. If you’re still doing, old-school, overnight backups, that can be 24 hours. You also then need to think about those units of time – if your most recent backup doesn’t restore and you revert to the next oldest backup, how much more data did you just lose – a whole day?
Equally important, how do you reinstate any lost data and from what source?
Recommended by LinkedIn
If you’d just sent out a pile of customer invoices, they might be retrievable from the email system and rekeyed. But to keep the invoice numbers correct, you likely can’t issue anymore invoices until you finish getting up to date. Plus, you need to key them 100% accurately as they’re legal documents. Frustrating, but doable.
Take the other extreme, where you’ve just lost 24 hours of factory production-line data where you’re tracking product measurements, finished goods movement into your warehouse and stocked goods movements out onto trucks for delivery (perhaps along with a pile of paperwork). The questions that then get asked are:
· Where do you get that information to recreate it?
· How do you get it back into the system?
· How long do you have to delay starting production because you’ve got historical data to reinstate?
· What if we can’t recover – how do you know what was sent to customers so they can be billed?
· Can you go into the warehouse and find the goods that the ERP system no longer has any knowledge of?
RTO
RTO is the Recovery Time Objective. This is how far forward in time you’re going to have to wait until you’re back and operational. Again, with a good backup system this can be a few minutes (albeit you might be running with reduced performance). If you’re doing overnight backups that take all night, they’re going to take a similar amount of time (or sometimes longer) to restore. Yes, if your backup takes 10 hours to run, it’s going to take 10+ hours to recover, from the point you’re ready to kick that off, which could be a few hours of discussions/preparation.
If these are completely new terms to you, we’ve written about them in our blog RPO & RTO Are The Backbone of Backup & Disaster Recovery. We’ve also got a free RPO/RTO worksheet to help IT and the Business engage in this discussion. If you’re struggling with this, it’s an exercise we love doing with customers. We can be the most miserable, pessimistic IT people imaginable as we prod and poke around different disaster scenarios which can lead to some fascinating discussions and insights that hadn’t previously been considered.
RPO & RTO Intertwined
These two timescales are tightly bound when disaster strikes. There’s also two related factors RPA/RTA with “A” being “Actual”. Most often, those timeframes are only discovered once the disaster is over. It’s far better to discover these either during Disaster Recovery (DR) testing, or a DR Simulation (which is better than nothing). That tells you if your plan is achievable and it’s far better to discover that during a relaxed exercise than with a whole load of people standing behind your back, looking at their watches and constantly asking “are we back yet”?
In the next article, I’ll start exploring the technical aspects of SQL Server Reliability. In the meantime, whether you’re from IT or from the Business, you need to sit down with your counterparts and figure out what’s needed before you figure out how to do it.
About me/Exmos
I’ve worked with Microsoft SQL Server since 1993 when version 4.2 was installed from three floppy disks and the manuals came in a large box. As a developer, a DBA, an infrastructure engineer and having written SQL monitoring tools, I’ve worked across most SQL Server disciplines. When my writing suggests that “only an idiot would do that”, it usually means I’ve been the idiot in the past.
At Exmos, we have a team of experts dealing with all aspects of SQL Server across 150+ servers. If you want chat (even informal) about anything relating to SQL Server, feel free to drop me a DM on Linked-in or email me: gordonc@exmos.com and I’m happy to talk… unless it’s about licensing where I’ll pass you over to our team who handle that, less exciting, aspect of SQL Server. J
If you’d like to be notified when I post the next blog article, you can subscribe here.