Reliability of a Data-Intensive Application

During the last decade, we have seen various technological developments that have enabled companies to build platforms, such as social networks and search engines, that generate and manage unprecedented volumes of data. These massive amounts of data have made it imperative for businesses to focus on agility and short development cycles, along with hypothesis testing, to allow a quick response to emerging market trends and insights.

Each piece of software is unique and must be treated as such, but it is also true that most software systems share common foundations. These foundations can be reduced to Reliability, Scalability, and Maintainability.

In a data-intensive application, datasets are divided into smaller fragments and distributed over different geographical locations. For this kind of application to thrive, it must serve massive numbers of users continuously, while still allowing the team to improve and fix the running application, even in an emergency.


A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:

  • Store data so that they, or another application, can find it again later (databases)
  • Remember the result of an expensive operation, to speed up reads (caches; see the small sketch after this list)
  • Allow users to search data by keyword or filter it in various ways (search indexes)
  • Send a message to another process, to be handled asynchronously (stream processing)
  • Periodically crunch a large amount of accumulated data (batch processing)
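To make the caching building block concrete, here is a minimal cache-aside sketch in Python. The database and cache dictionaries and the get_user function are hypothetical stand-ins for a real data store and cache, not any particular product:

    # Minimal cache-aside sketch: check the cache first, fall back to the
    # (slow) database, and remember the result for later reads.
    import time

    database = {"user:1": {"name": "Alice"}}   # hypothetical slow data store
    cache = {}                                 # hypothetical in-memory cache

    def read_from_database(key):
        time.sleep(0.1)                        # simulate an expensive lookup
        return database.get(key)

    def get_user(key):
        if key in cache:                       # fast path: cache hit
            return cache[key]
        value = read_from_database(key)        # slow path: cache miss
        if value is not None:
            cache[key] = value                 # populate the cache for next time
        return value

    print(get_user("user:1"))   # first read is slow (cache miss)
    print(get_user("user:1"))   # second read is fast (cache hit)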


Let's take a deeper look at what Reliability means:


Reliability simply means “continuing to work correctly, even when things go wrong.”

Undoubtedly there will be times when our webpage, application, or software system will fail. Even the most experienced programmer is prone to errors, as it is human nature to be imperfect. Other sources of fault can be hardware or even software. Regardless of the failure source, a system should continue to perform at the desired level even in adversity.

Faults and failures are two different things. A fault is a component of the system deviating from its specification or expectation, whereas a failure is when a system component is completely down and has stopped working.

  1. Hardware faults: e.g. a disk failure (mitigated by hardware redundancy) or a cloud instance going down (mitigated by software-level fault tolerance, such as running multiple application instances that exchange heartbeats; see the sketch after this list).
  2. Software errors: systematic errors, best addressed with thorough testing, monitoring, and constant self-checking.
  3. Human errors: human operation can be unreliable, so the system needs to provide clear interfaces, sandbox environments, and easy recovery from mistakes.
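As a rough illustration of the heartbeat exchange mentioned under hardware faults, the sketch below marks an instance as failed when no heartbeat has arrived within a timeout. The Monitor class and the timing values are illustrative assumptions, not a real library API:

    # Toy heartbeat monitor: each instance periodically reports that it is alive;
    # the monitor declares an instance failed if its last heartbeat is too old.
    import time

    HEARTBEAT_TIMEOUT = 0.5   # seconds without a heartbeat before declaring failure

    class Monitor:
        def __init__(self):
            self.last_seen = {}               # instance id -> time of last heartbeat

        def heartbeat(self, instance_id):
            self.last_seen[instance_id] = time.monotonic()

        def failed_instances(self):
            now = time.monotonic()
            return [i for i, t in self.last_seen.items()
                    if now - t > HEARTBEAT_TIMEOUT]

    monitor = Monitor()
    monitor.heartbeat("app-1")
    monitor.heartbeat("app-2")
    time.sleep(0.6)                           # app-2 stops sending heartbeats...
    monitor.heartbeat("app-1")                # ...while app-1 keeps reporting in
    print(monitor.failed_instances())         # ['app-2']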


There are situations in which we may choose to sacrifice reliability in order to reduce development cost (e.g., when developing a prototype product for an unproven market) or operational cost (e.g., for a service with a very narrow profit margin) — but we should be very conscious of when we are cutting corners.

Fault Tolerance:

Fault tolerance is what keeps interconnected systems working together and sustains reliability and availability in a distributed system. Hardware and software redundancy are the best-known fault-tolerance techniques.

Fault Tolerance Mechanisms in Distributed Systems:

Replication-based fault tolerance is one of the most popular techniques. It replicates the data onto several other systems, so a request can be served by any one replica among the others. If one or more nodes stop functioning, the failure does not bring the whole system down. In short, replication adds redundancy to a system.
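A minimal sketch of this idea, using hypothetical in-memory replicas: every write is copied to all replicas, and a read falls back to another replica when the first one is down.

    # Toy replication: writes go to every replica; a read tries replicas in turn
    # so a single failed node does not stop the whole system.
    class Replica:
        def __init__(self, name):
            self.name = name
            self.data = {}
            self.alive = True

        def write(self, key, value):
            if self.alive:
                self.data[key] = value

        def read(self, key):
            if not self.alive:
                raise ConnectionError(f"{self.name} is down")
            return self.data.get(key)

    replicas = [Replica("r1"), Replica("r2"), Replica("r3")]

    def replicated_write(key, value):
        for r in replicas:                    # redundancy: same data on every node
            r.write(key, value)

    def replicated_read(key):
        for r in replicas:                    # try replicas until one answers
            try:
                return r.read(key)
            except ConnectionError:
                continue
        raise RuntimeError("all replicas are down")

    replicated_write("order:42", "shipped")
    replicas[0].alive = False                 # one node fails...
    print(replicated_read("order:42"))        # ...but the read still returns "shipped"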

Major issues with this technique:

Consistency: This is a vital issue in any replication technique. Several copies of the same entity create a consistency problem, because an update can be made by any user against any replica. Consistency of the data is ensured by criteria such as linearizability, sequential consistency, and causal consistency. Linearizability and sequential consistency are strong consistency models, whereas causal consistency defines a weaker criterion. For example, primary-backup replication guarantees consistency through linearizability, as does active replication.
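To illustrate the primary-backup approach mentioned above, here is a heavily simplified sketch: the primary decides the order of writes and propagates each one to every backup before acknowledging it, so all replicas converge on the same latest value. It deliberately ignores failover, concurrency, and network failures, so it is an illustration of the idea rather than a full linearizable protocol:

    # Simplified primary-backup replication: the primary orders all writes and
    # pushes them to every backup before acknowledging the client.
    class Node:
        def __init__(self, name):
            self.name = name
            self.store = {}

        def apply(self, key, value):
            self.store[key] = value

    primary = Node("primary")
    backups = [Node("backup-1"), Node("backup-2")]

    def write(key, value):
        primary.apply(key, value)             # the primary fixes the write order
        for b in backups:
            b.apply(key, value)               # backups apply the same writes in that order
        return "ack"                          # acknowledged only after all replicas applied

    def read(node, key):
        return node.store.get(key)

    write("balance", 100)
    write("balance", 80)
    # Every replica returns the same, most recent value.
    print([read(n, "balance") for n in [primary] + backups])   # [80, 80, 80]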

Degree (number) of replicas: Replication techniques rely on protocols for replicating data or objects, such as primary-backup replication, voting, and primary-per-partition replication. To attain a high level of consistency, a large number of replicas is needed; too few replicas hurts scalability, performance, and the ability to tolerate multiple faults. To address the problem of having too few replicas, the adaptive replica creation algorithm (ARC) was proposed.
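The voting protocol mentioned above is often realised as read/write quorums: with N replicas, a write must reach W of them and a read must consult R of them, and choosing R + W > N makes the two sets overlap. The sketch below is a bare-bones illustration of that rule using hypothetical version-stamped values:

    # Bare-bones quorum voting: a write succeeds on W of the N replicas; a read
    # consults R replicas and keeps the value with the highest version number.
    N, W, R = 3, 2, 2                         # R + W > N guarantees an overlap

    replicas = [dict() for _ in range(N)]     # each replica maps key -> (version, value)
    version_counter = 0

    def quorum_write(key, value):
        global version_counter
        version_counter += 1
        for store in replicas[:W]:            # write to the first W replicas
            store[key] = (version_counter, value)

    def quorum_read(key):
        answers = [store[key] for store in replicas[N - R:] if key in store]
        return max(answers)[1] if answers else None   # newest version wins

    quorum_write("config", "v1")
    quorum_write("config", "v2")
    # The read set (last R replicas) overlaps the write set (first W replicas),
    # so at least one consulted replica holds the latest version.
    print(quorum_read("config"))              # "v2"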

The main aim of ARC (the adaptive replica creation algorithm) is to maintain a rational number of replicas: enough to satisfy the availability users expect, improve access efficiency, and balance load, while also reducing bandwidth requirements, keeping the system stable, and providing users with satisfactory quality of service (QoS).
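The details of ARC are beyond the scope of this article, but the general idea of keeping a rational replica number can be sketched roughly: grow the replica count for heavily accessed objects and shrink it for cold ones, within fixed bounds. The heuristic and the numbers below are purely illustrative assumptions, not the published ARC algorithm:

    # Illustrative heuristic (not ARC itself): scale an object's replica count
    # with its request rate, bounded so availability and cost stay reasonable.
    MIN_REPLICAS = 2          # never drop below the availability users expect
    MAX_REPLICAS = 8          # cap the bandwidth and storage overhead

    def target_replicas(requests_per_second, capacity_per_replica=100):
        needed = -(-requests_per_second // capacity_per_replica)   # ceiling division
        return max(MIN_REPLICAS, min(MAX_REPLICAS, needed))

    for rps in (10, 250, 5000):
        print(rps, "req/s ->", target_replicas(rps), "replicas")
    # 10 req/s -> 2 replicas, 250 req/s -> 3 replicas, 5000 req/s -> 8 replicas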

[Image: A simple fault-tolerant setup in GCP]






