Choosing the right database for your microservices is a critical decision. If you realize down the line that you’ve made the wrong choice, migrating to another database is a costly and risky procedure, because each database technology (and type) has its own advantages and disadvantages.
However, there is a convergence happening in some general-purpose databases, such as MySQL and PostgreSQL, which support several models at once (relational, document, and key-value) and claim to make your life easier.
Before considering a specific database, take some time to ask a few important questions about your project:
How much data do you expect to store (GB/TB/PB, etc.) when the application is mature?
What type of data are you going to store (simple or complex data structures, etc.)?
How many users do you expect to handle simultaneously at peak load?
What availability (e.g. 99.999%), scalability, latency, and data consistency does your application need?
How often will your database schemas change?
What is the average response time needed for the queries?
Will you require a real-time consistent database or is some lag permissible?
What is the geographic distribution of your user population?
What is the natural “shape” of your data?
Does your application need online transaction processing (OLTP), analytic queries (OLAP), or both?
What ratio of reads to writes do you expect in production?
What are your preferred programming languages?
Do you have a budget? If so, will it cover licenses and support contracts?
How strict are you with invalid data being sent to your database? (Ideally, you are very strict and do server-side data validation before persisting it to your database)
For read and write operations, these are the typical numbers for operations per second:
Very high — Greater than one million
High — Between 500,000 and one million
Moderate — Between 10,000 and 500,000
Low — Less than 10,000
For latency, the typical numbers are:
Low — Less than one millisecond
Moderate — one to 10 milliseconds
High — Greater than 10 milliseconds
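As a quick sanity check when gathering requirements, the tiers above can be encoded in a small helper. This is a hypothetical sketch; the thresholds simply mirror the lists above:

```python
def throughput_tier(ops_per_second: float) -> str:
    """Classify read/write throughput using the tiers listed above."""
    if ops_per_second > 1_000_000:
        return "very high"
    if ops_per_second >= 500_000:
        return "high"
    if ops_per_second >= 10_000:
        return "moderate"
    return "low"


def latency_tier(millis: float) -> str:
    """Classify query latency using the tiers listed above."""
    if millis < 1:
        return "low"
    if millis <= 10:
        return "moderate"
    return "high"


print(throughput_tier(250_000))  # moderate
print(latency_tier(5))           # moderate
```

Running the numbers from your capacity planning through such a helper keeps the whole team talking about the same tier names.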
Nature of Data:
Not all microservices process or manage data at the same stage in their lifecycle. For some microservices, the database could be the source of truth, but for others, it may just be a temporary store. To understand the data needs of your microservices better, you can broadly classify the data in the following categories based on how it is processed:
Transient data: Data ingest microservices typically process information like events, logs, messages, and signals before passing it to the appropriate destination. Such microservices require data stores that can hold the data temporarily while supporting high-speed writes. Since transient data is not stored anywhere else, high availability of the data store used by your microservice is critical — this data cannot be lost.
Ephemeral data: Microservices that deliver instant user experiences often rely on a high-speed cache to store the most accessed data. One example is a cache server — a temporary data store whose sole purpose is to improve the user experience by serving information in real time.
While a data store for ephemeral data does not store the master copy of the data, it must be architected to be highly available, as failures could cause user experience issues and lost revenue.
Operational data: Information gathered from user sessions, such as user activity, shopping cart contents, clicks, and likes, is considered operational data. This type of data powers instant, real-time analytics and is typically used by microservices that interface directly with users. For this type of data, durability, consistency, and availability requirements are high.
Transactional data: Data from processes such as payment processing and order processing must be stored as a permanent record in a database. The data stores used must provide durable, cost-effective storage, even as transaction volumes grow.
Classification of Databases
Databases can be placed on a two-dimensional matrix.
The X-axis represents application scenarios: OLTP vs. OLAP.
The Y-axis represents database interfaces: SQL vs. NoSQL.
This matrix has these characteristics:
In the upper-left corner, databases support OLTP workloads and the SQL language. For example, MySQL supports different transaction isolation levels, high QPS (queries per second), and low latency. We mainly use it to store transaction information and critical data, such as orders and VIP information.
In the lower-left corner, we use NoSQL databases to optimize special scenarios. Generally, these databases have simple schemas or they are schemaless with high throughput and low latency. We mainly use them as caches or key-value (KV) databases.
On the right are OLAP big-data analytical systems, such as ClickHouse and Impala. Generally, they support the SQL language but don’t support transactions. They scale well but have longer response latency: we can add machines to enlarge data storage capacity, but queries take longer to return.
How complex are your query patterns? Do you just need retrieval by key, or also by various other parameters? Do you also need a fuzzy search on the data?
How you are going to fetch your data is one of the main ways to find the best database for your use case. If you are going to fetch data by key, then all you need is a key-value store (e.g. DynamoDB, Redis, etcd, S3).
In case you mostly fetch by key, but sometimes also need to fetch by one or two other fields, then wide-column DBs (e.g. DynamoDB, Cassandra) may be right for you.
On the other hand, if you will need to query many different fields, you can choose either a relational DB (e.g. MySQL, PostgreSQL) or a document DB (e.g. MongoDB, CouchDB, or the document features of MySQL and PostgreSQL). Note that document DBs don’t handle queries that join data from multiple documents well.
Finally, in case you are looking for fuzzy search query capabilities (free text search), then search engines like Elasticsearch and Solr are the best fit.
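To make these access patterns concrete, here is a minimal sketch using Python’s built-in SQLite (table and field names are invented for illustration). The three query shapes shown (by key, by secondary fields, and fuzzy text) map onto the database families above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT, city TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?)",
                 [("u1", "Alice", "Paris", 31), ("u2", "Bob", "Lyon", 45)])

# Key-value access: a single lookup by primary key -- what a KV store optimizes.
row = conn.execute("SELECT name FROM users WHERE id = ?", ("u1",)).fetchone()

# Multi-field query: needs secondary indexes -- relational/document territory.
conn.execute("CREATE INDEX idx_city_age ON users (city, age)")
rows = conn.execute("SELECT name FROM users WHERE city = ? AND age > ?",
                    ("Lyon", 40)).fetchall()

# Fuzzy text search: LIKE works at small scale; a search engine handles it at scale.
fuzzy = conn.execute("SELECT name FROM users WHERE name LIKE ?", ("%li%",)).fetchall()

print(row, rows, fuzzy)  # ('Alice',) [('Bob',)] [('Alice',)]
```

The more of these shapes you need on the same data set, the more a relational or document database (rather than a pure KV store) starts to pay off.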
Consistency
Is strong consistency required (read after write, especially when you switch writes to a different data center), or is eventual consistency OK?
In case you need to read your data right after you write it (i.e. strong consistency), a relational database (e.g. MySQL, PostgreSQL) is usually better suited than a document database (e.g. MongoDB, CouchDB), especially in multi-data-center scenarios.
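The read-after-write problem can be illustrated with a toy model (plain Python, no real database): writes land on the primary immediately, while a replica applies them with a lag, so a read routed to the replica right after the write may see stale data.

```python
import collections


class EventuallyConsistentStore:
    """Toy model: writes go to the primary; a replica applies them later."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = collections.deque()  # writes not yet replicated

    def write(self, key, value):
        self.primary[key] = value
        self.pending.append((key, value))

    def replicate_one(self):
        """Apply the oldest pending write to the replica (simulates lag)."""
        if self.pending:
            key, value = self.pending.popleft()
            self.replica[key] = value

    def read_from_replica(self, key):
        return self.replica.get(key)


store = EventuallyConsistentStore()
store.write("user:1", "alice")
print(store.read_from_replica("user:1"))  # None -- stale read-after-write
store.replicate_one()
print(store.read_from_replica("user:1"))  # alice -- eventually consistent
```

A strongly consistent system would route that first read to the primary (or block until replication catches up), which is exactly the guarantee you pay for in latency and availability.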
Storage Capacity
How much storage capacity is needed?
Most database systems are limited by the amount of space on disk (e.g. MySQL) or struggle with performance as the number of nodes and shards grows into the hundreds (e.g. Elasticsearch).
When virtually unlimited storage is needed, this is where cloud solutions shine. Object storage services like S3 and GCS allow you to store as much data as you like, with the handy option of multiple storage tiers, so you pay less for data that is rarely retrieved.
Performance
What is the needed throughput and latency?
Every database’s performance degrades as read/write throughput increases. This is when optimizations such as re-indexing and re-sharding your data come in handy.
In case you have very high traffic and require very low latency, cloud-provider solutions like Amazon’s DynamoDB and Google’s Bigtable could be just what you need. As long as your service is deployed in the same data center as the database, you can enjoy latencies under 10 ms. The downside is, of course, the cost.
Maturity and Stability
If you choose a self-hosted deployment, how much experience does your DBA team have with this technology, and how mature is the technology?
Choosing the most trendy, powerful, and fully featured database to self-host may be tempting, but as long as your organization doesn’t have experience with this database, you may end up regretting it.
Setup, configuration, and fine-tuning of databases is a lengthy and risky ordeal. Sometimes choosing your organization’s “old” self-hosted workhorse pays bigger dividends in the long term when it comes to production stability.
Cost
If you choose a managed cloud solution, what are the costs? What are its limitations?
The payment model for managed cloud solutions is usually proportional to the read/write traffic. Make sure to read the fine print for each managed solution and verify that it is cost-effective for your specific read/write usage patterns.
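As a back-of-the-envelope sketch of request-proportional pricing (the unit prices below are made up for illustration; real providers publish their own price sheets), you can compare usage patterns like this:

```python
# Hypothetical unit prices, illustrative only -- check your provider's price sheet.
PRICE_PER_MILLION_READS = 0.25   # USD
PRICE_PER_MILLION_WRITES = 1.25  # USD


def monthly_request_cost(reads_per_sec: float, writes_per_sec: float) -> float:
    """Estimate the monthly bill from average request rates."""
    seconds_per_month = 30 * 24 * 3600
    reads_m = reads_per_sec * seconds_per_month / 1e6   # millions of reads
    writes_m = writes_per_sec * seconds_per_month / 1e6  # millions of writes
    return reads_m * PRICE_PER_MILLION_READS + writes_m * PRICE_PER_MILLION_WRITES


# A read-heavy service vs. a write-heavy one at the same total traffic:
print(round(monthly_request_cost(900, 100), 2))  # 907.2
print(round(monthly_request_cost(100, 900), 2))  # 2980.8
```

Note how the same total traffic costs over three times as much when it is write-heavy, which is why the read/write ratio question earlier in this article matters so much for managed solutions.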
SQL databases are easy to use and expose the SQL language, which integrates easily into applications. SQL is also a common language to learn, and you can easily find SQL resources. SQL databases require you to have a clear schema and data types that do not change frequently.
NoSQL is good for objects, documents, and other non-relational data types. NoSQL is also useful when the data requires high throughput.
Relational (SQL) database selection and some relational database comparisons
Cache - By definition, it is stored in memory. Any data stored in memory (RAM) for faster access is called a cache. Examples: Ehcache, Memcached. Typically you put an object in the cache with a String as the key and access the cache using that key. It is very straightforward. It is up to the application when to access the cache vs. the database, and no complex processing happens in the cache. If the cache spans multiple machines, it is called a distributed cache. For example, Netflix uses EVCache, which is built on top of Memcached, to store the user’s movie recommendations that you see on the home screen.
In-Memory Database - It has all the features of a cache plus some processing/querying capabilities. Redis falls under this category. Redis supports multiple data structures, and you can query the data in Redis (for example, getting the last 10 accessed items or the most used item). It can span multiple machines, is usually very performant, and also supports persistence to disk if needed. For example, Twitter uses Redis to store timeline information.
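A minimal time-to-live (TTL) cache along these lines can be sketched in a few lines of Python. This is a toy illustration of what Memcached-style stores do (expiry, lazy eviction), not production code:

```python
import time


class TTLCache:
    """Minimal in-memory cache with per-entry expiry (toy Memcached-style store)."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value):
        self._data[key] = (value, time.monotonic() + self.ttl)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self._data[key]  # lazily evict the expired entry
            return default
        return value


cache = TTLCache(ttl_seconds=0.05)
cache.set("recs:user:1", ["movie-a", "movie-b"])
print(cache.get("recs:user:1"))  # cache hit
time.sleep(0.06)
print(cache.get("recs:user:1"))  # None -- entry expired
```

Real distributed caches add eviction policies (LRU), memory limits, and consistent hashing across machines on top of this basic get/set-with-expiry contract.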
Comparison of MySQL/PostgreSQL/MongoDB and Cassandra databases
When to use MySQL or PostgreSQL
The key difference between PostgreSQL and MySQL
PostgreSQL is an Object-Relational Database Management System (ORDBMS), whereas MySQL is a community-driven relational DBMS.
PostgreSQL supports modern application features like JSON, XML, etc., while MySQL supports only JSON.
Comparing PostgreSQL vs. MySQL performance, PostgreSQL performs well when executing complex queries, whereas MySQL performs well in OLTP systems with simpler, read-heavy workloads.
PostgreSQL is fully ACID compliant, while MySQL is ACID compliant only when used with the InnoDB and NDB storage engines.
PostgreSQL supports materialized views, whereas MySQL doesn’t.
Why use MySQL?
Here are some important reasons for using MySQL:
Supports features like master-slave replication and scale-out
It supports offload reporting, geographic data distribution, etc.
Very low overhead with the MyISAM storage engine when used for read-mostly applications
Support for the Memory storage engine for frequently used tables
Query cache for repeatedly used statements
You can easily learn and troubleshoot MySQL from different sources like blogs, white papers, and books
Why use PostgreSQL?
Here are some important reasons for using PostgreSQL:
Offers useful features like table partitioning, point-in-time recovery, transactional DDL, etc.
Ability to utilize third-party key stores in a full PKI infrastructure
Developers can modify the open-source code, as it is licensed under a BSD-style license, without needing to contribute back enhancements
Independent software vendors can redistribute it without fear of being “infected” by a copyleft open-source license
Users and roles can be assigned object-level privileges
Supports AES, 3DES, and other data encryption algorithms
Since MongoDB and PostgreSQL are two of the most important and common database solutions on the market today, it is essential that you know exactly what your company needs and how to use your chosen database to its full potential.