Designing Data-Intensive Applications
A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. For example, many applications need to:
- Store data so that they, or another application, can find it again later (databases).
- Remember the result of an expensive operation, to speed up reads (caches)
- Allow users to search data by keyword or filter it in various ways (search indexes)
- Send a message to another process, to be handled asynchronously (stream processing)
- Periodically crunch a large amount of accumulated data (batch processing)
Data Systems
We typically think of databases, queues, caches, etc. as being very different categories of tools. Although a database and a message queue have some superficial similarity -- both store data for some time -- they have very different access patterns, which means different performance characteristics, and thus very different implementations.
At the same time, there are data stores that are also used as message queues (Redis), and there are message queues with database-like durability guarantees (Kafka, NATS). The boundaries between the categories are becoming blurred.
Secondly, increasingly many applications now have such demanding or wide-ranging requirements that a single tool can no longer meet all of their data processing and storage needs. Instead, the work is broken down into tasks that can be performed efficiently by a single tool, and those different tools are stitched together using application code.
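To make "stitching tools together with application code" concrete, here is a minimal sketch of one common composition: a cache-aside read path in front of a primary datastore. The Cache and Database classes are hypothetical in-memory stand-ins, not real Redis or SQL clients.

```python
# A minimal cache-aside sketch: the application code, not the tools themselves,
# decides when to read from the cache and when to fall back to the database.
class Cache:
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


class Database:
    def __init__(self, rows):
        self._rows = rows  # pretend this is the system of record

    def query(self, key):
        return self._rows.get(key)


def read_user(user_id, cache, db):
    """Check the cache first; on a miss, read from the database and
    populate the cache so the next read is fast."""
    user = cache.get(user_id)
    if user is None:
        user = db.query(user_id)
        if user is not None:
            cache.set(user_id, user)
    return user


db = Database({42: {"name": "alice"}})
cache = Cache()
print(read_user(42, cache, db))  # miss: served from the database
print(read_user(42, cache, db))  # hit: served from the cache
```

The application code now effectively provides a new, composite data system, and it is responsible for keeping the cache consistent with the database -- a guarantee neither tool provides on its own.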
There are many factors that may influence the design of a data system, including the skills and experience of the people involved, legacy system dependencies, the timescale for delivery, your organization's tolerance of different kinds of risk, regulatory constraints, etc. Those factors depend very much on the situation.
There are three main concerns:
Reliability
The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).
Typical expectations include:
- The application performs the function that the user expected.
- It can tolerate the user making mistakes or using the software in unexpected ways.
- Its performance is good enough for the required use case, under the expected load and data volume.
- The system prevents any unauthorized access and abuse.
If all four of the above things together mean "working correctly", then we can understand reliability as meaning, roughly, "continuing to work correctly, even when things go wrong."
The things that can go wrong are called faults, and systems that anticipate faults and can cope with them are called fault-tolerant or resilient.
A fault is not the same as a failure. A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user.
Counterintuitively, in such fault-tolerant systems, it can make sense to increase the rate of faults by triggering them deliberately -- for example, by randomly killing individual processes without warning. The Netflix Chaos Monkey is an example of this approach.
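A toy illustration of this idea (not Netflix's actual Chaos Monkey, which kills VM instances in production) is a script that periodically picks a random worker process and terminates it, so the system's recovery paths are exercised continuously. The worker loop here is just a placeholder.

```python
# Deliberate fault injection sketch: randomly kill one of several workers.
import multiprocessing
import random
import time


def worker(n):
    while True:          # stand-in for a long-running service process
        time.sleep(1)


if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()

    for _ in range(3):                     # a few rounds of chaos
        time.sleep(2)
        victim = random.choice([p for p in procs if p.is_alive()])
        print(f"killing worker pid={victim.pid}")
        victim.terminate()                 # in a real system: kill a VM or container

    for p in procs:                        # clean up the remaining workers
        p.terminate()
        p.join()
```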
Hardware Faults
Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years. Thus, on a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
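A quick back-of-envelope check of that claim, using only the numbers above:

```python
# Expected disk failures per day for a 10,000-disk cluster at various MTTFs.
disks = 10_000
for mttf_years in (10, 30, 50):
    failures_per_year = disks / mttf_years
    failures_per_day = failures_per_year / 365
    print(f"MTTF {mttf_years:>2} years -> ~{failures_per_day:.1f} disk failures/day")
# MTTF 10 years -> ~2.7/day; 30 years -> ~0.9/day; 50 years -> ~0.5/day
```

So "roughly one disk per day" falls out of the middle of the 10-50 year MTTF range.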
Until recently, redundancy of hardware components was sufficient for most applications, since it makes total failure of a single machine fairly rare. As long as you can restore a backup onto a new machine fairly quickly, the downtime in case of failure is not catastrophic for most applications.
However, as data volumes and applications' computing demands have increased, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults. Moreover, on some cloud platforms such as AWS it is fairly common for VM instances to become unavailable without warning ("AWS: The Good, the Bad and the Ugly", blog.awe.sm).
Software Errors
Another class of fault is a systematic error within the system. Such faults are harder to anticipate, and because they are correlated across nodes, they tend to cause more system failures than uncorrelated hardware faults. Examples include:
- A software bug that causes every instance of an application server to crash when given a particular bad input. For example, consider the leap second on June 30, 2012, that caused many applications to hang simultaneously due to a bug in the Linux kernel.
- A runaway process that uses up some shared resource -- CPU time, memory, disk space, or network bandwidth.
- A service that the system depends on that slows down, becomes unresponsive, or starts returning corrupted responses.
- Cascading failures, where a small fault in one component triggers a fault in another component, which in turn triggers further faults.
Human Errors
One study of large internet services found that configuration errors by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10-25% of outages.
The best systems combine several approaches:
- Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do "the right thing" and discourage "the wrong thing".
- Decouple the places where people make the most mistakes from the places where they can cause failures. In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.
- Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests. Automated testing is widely used, well understood, and especially valuable for covering corner cases that rarely arise in normal operation.
- Allow quick and easy recovery from human error, to minimize the impact in the case of a failure.
- Set up detailed and clear monitoring, such as performance metrics and error rates. This is referred to as telemetry (a minimal sketch follows this list).
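As a minimal telemetry sketch, assuming only the Python standard library: count requests and errors, and derive an error rate that a dashboard or alert could watch. A real system would export these counters to a metrics backend such as Prometheus or StatsD; that integration is omitted here, and the request handler is just a placeholder.

```python
# Toy telemetry: request/error counters plus a derived error rate.
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO)
metrics = Counter()


def handle_request(payload):
    metrics["requests_total"] += 1
    try:
        return 1 / payload          # stand-in for real request handling
    except ZeroDivisionError:
        metrics["errors_total"] += 1
        logging.exception("request failed")
        return None


for value in (2, 0, 5, 0):
    handle_request(value)

error_rate = metrics["errors_total"] / metrics["requests_total"]
print(f"error rate: {error_rate:.0%}")   # 50% in this toy run
```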
Scalability
As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
Take Twitter as an example, using data published in November 2012. Two of Twitter's main operations are:
Post Tweet
A user can publish a new message to their followers (4.6k requests/sec on average, over 12k requests/sec at peak).
Home Timeline
A user can view tweets posted by the people they follow (300k requests/sec).
Twitter's scaling challenge is not primarily due to tweet volume, but due to fan-out -- each user follows many people, and each user is followed by many people. There are broadly two ways of implementing these two operations (a toy sketch of both follows the list):
- Posting a tweet simply inserts the new tweet into a global collection of tweets. When a user requests their home timeline, look up all the people they follow, find all the tweets for each of those users, and merge them (sorted by time).
- Maintain a cache for each user's home timeline -- like a mailbox of tweets for each recipient user. When a user posts a tweet, look up all the people who follow that user, and insert the new tweet into each of their home timeline caches. The request to read the home timeline is then cheap, because its result has been computed ahead of time.
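The sketch below illustrates both approaches with plain in-memory data structures (not Twitter's real system); `follows` and `followers` are hypothetical toy mappings.

```python
# Approach 1: query on read. Approach 2: fan out on write.
from collections import defaultdict

tweets = []                                # approach 1: one global collection
timelines = defaultdict(list)              # approach 2: per-user home timeline cache
follows = {"alice": {"bob"}, "bob": set()}     # who each user follows
followers = {"bob": {"alice"}, "alice": set()} # reverse mapping


# Approach 1: cheap writes, expensive reads.
def post_tweet_v1(sender, text, ts):
    tweets.append({"sender": sender, "text": text, "ts": ts})


def home_timeline_v1(user):
    followees = follows[user]
    return sorted((t for t in tweets if t["sender"] in followees),
                  key=lambda t: t["ts"], reverse=True)


# Approach 2: do the work at write time, so reads are a simple lookup.
def post_tweet_v2(sender, text, ts):
    tweet = {"sender": sender, "text": text, "ts": ts}
    for follower in followers[sender]:
        timelines[follower].append(tweet)


def home_timeline_v2(user):
    return sorted(timelines[user], key=lambda t: t["ts"], reverse=True)


post_tweet_v1("bob", "hello", ts=1)
post_tweet_v2("bob", "hello", ts=1)
print(home_timeline_v1("alice"))   # scan + merge at read time
print(home_timeline_v2("alice"))   # precomputed at write time
```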
The first version of Twitter used approach 1, but the system struggled to keep up with the load of home timeline queries, so the company switched to approach 2. This works better because the average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads (4.6k writes/sec vs. 300k reads/sec, roughly a 65x difference), so in this case it is preferable to do more work at write time and less at read time.
Twitter is moving to a hybrid of both approaches. Most users' tweets continue to be fanned out to home timelines at the time when they are posted, but a small number of users with a very large number of followers are excepted from this fan-out.
Maintainability
Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.