Distributed Systems
Table of contents
Topics
- A Thorough Introduction to Distributed Systems
- Data Intensive Applications - We call an application data-intensive if data is its primary challenge—the quantity of data, the complexity of data, or the speed at which it is changing—as opposed to compute-intensive, where CPU cycles are the bottleneck.
Distributed System Approaches
Master Slave
- All write requests go to the master first, which takes a lock.
- Transaction management is simpler because writes are serialized through the master.
- Limited scalability: the single master becomes a bottleneck.
Masterless
- All nodes act as masters; any node can accept requests.
- Each node prepares an intermediate result.
- Intermediate results are combined to produce the final result.
- Network latency becomes the main problem here (a minimal sketch of both approaches follows below).
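A minimal sketch of the two approaches (class and method names are made up for illustration): in the master-slave setup every write funnels through the single master under a lock, while in the masterless setup any node accepts a write and a read combines intermediate results from all nodes.

```python
import threading

class MasterSlaveCluster:
    """All writes go through the single master; slaves hold read copies."""

    def __init__(self, slave_count=2):
        self.lock = threading.Lock()          # master takes a lock per write
        self.slaves = [{} for _ in range(slave_count)]

    def write(self, key, value):
        with self.lock:                       # writes are serialized -> easy transactions,
            for slave in self.slaves:         # but the master limits scalability
                slave[key] = value

    def read(self, key):
        return self.slaves[0].get(key)        # any slave can serve the read


class MasterlessCluster:
    """Every node is a master; a read combines intermediate results."""

    def __init__(self, node_count=3):
        self.nodes = [{} for _ in range(node_count)]

    def write(self, key, value, node_id=0):
        self.nodes[node_id][key] = value      # any node accepts the write

    def read(self, key):
        # gather intermediate results from all nodes and combine them;
        # the extra round trips are where the network-latency cost shows up
        intermediate = [node.get(key) for node in self.nodes]
        return next((v for v in intermediate if v is not None), None)
```

In a real system these dictionaries would be separate machines, and the "combine" step would be a quorum read over the network.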
Big Data Problems
- Network Latency
- Distribute the data in small blocks and process them in parallel (MapReduce); see the sketch below.
- Denormalization
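A single-process toy version of the MapReduce idea mentioned above: the data is split into small blocks, each block is mapped independently (in parallel on a real cluster), and the intermediate counts are reduced into the final result. Function and variable names are illustrative.

```python
from collections import Counter
from functools import reduce

def map_block(block):
    # map phase: process one small block of the data independently
    return Counter(block.split())

def reduce_counts(acc, partial):
    # reduce phase: merge one intermediate result into the accumulator
    acc.update(partial)
    return acc

blocks = ["to be or not to be", "be quick", "not so quick"]   # the "distributed" blocks
intermediate = [map_block(b) for b in blocks]                  # runs node-local on a cluster
final = reduce(reduce_counts, intermediate, Counter())
print(final.most_common(3))   # e.g. [('be', 3), ('to', 2), ('not', 2)]
```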
Issues to be Figured out
Replication
- Replication refers to frequently copying the data across multiple machines.
- Post replication, multiple copies of the data exist across machines.
- This might help in case one or more of the machines die due to some failure (see the sketch below).
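A minimal sketch of replication as described above (all names are hypothetical): every write is copied to each live replica, so the data stays readable even if one machine dies.

```python
class ReplicatedStore:
    """Keeps a full copy of the data on every replica (machine)."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.alive = [True] * replica_count

    def write(self, key, value):
        # copy the data to every live replica
        for i, replica in enumerate(self.replicas):
            if self.alive[i]:
                replica[key] = value

    def read(self, key):
        # any live replica can answer, so losing one machine is not fatal
        for i, replica in enumerate(self.replicas):
            if self.alive[i]:
                return replica.get(key)
        raise RuntimeError("all replicas are down")


store = ReplicatedStore()
store.write("user:1", "Alice")
store.alive[0] = False          # simulate one machine failing
print(store.read("user:1"))     # still readable from another copy
```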
Scalability
Types of Scaling
- horizontal scaling (scale out) - adding more servers.
- vertical scaling (scale up) - to increase the resources of the server (RAM, CPU, storage, etc.).
- Example: let's say you own a restaurant that is now exceeding its seating capacity. One way of accommodating more people (scaling) would be to keep adding more chairs (scaling vertically). However, since the space is limited, you won't be able to add more chairs once the space is full. Another way of scaling would be to open new branches of the restaurant (horizontal scaling); a code sketch of this idea follows below.
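In code terms, horizontal scaling usually means putting a load balancer in front of a growing pool of servers; vertical scaling would mean replacing one of those servers with a bigger machine. Below is a tiny round-robin sketch (server names are made up).

```python
import itertools

class RoundRobinBalancer:
    """Horizontal scaling: add servers and spread requests across them."""

    def __init__(self, servers):
        self.servers = list(servers)
        self._cycle = itertools.cycle(self.servers)

    def add_server(self, server):
        # "opening a new branch of the restaurant"
        self.servers.append(server)
        self._cycle = itertools.cycle(self.servers)

    def route(self, request):
        return f"{next(self._cycle)} handles {request}"


lb = RoundRobinBalancer(["app-1", "app-2"])
lb.add_server("app-3")                      # scale out
for r in ["req-1", "req-2", "req-3", "req-4"]:
    print(lb.route(r))
```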
Storage Scalability
- No design is inherently correct or wrong; there are just good designs and bad designs, which depend heavily on the use case.
- Hence, it is extremely important to clarify the requirements for the problem asked.
Prerequisites
- Have some experience working with a relational DB (like MySQL).
- Have a basic idea about NoSQL DBs.
- Understand the basics of the following :
- Concurrency : Do you understand threads, deadlock, and starvation? What happens when multiple processes / threads try to modify the same data? A basic understanding of read and write locks (a small race-condition demo follows after this list).
- Networking : Do you roughly understand basic networking protocols like TCP and UDP? Do you understand the role of switches and routers?
- File systems : You should understand the systems you’re building upon. Do you know roughly how an OS, file system, and database work? Do you know about the various levels of caching in a modern OS?
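As a quick illustration of the concurrency point above, the sketch below has several threads doing a read-modify-write on the same counter: without the lock some increments are typically lost, while with the lock the result is exact.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(times, use_lock):
    global counter
    for _ in range(times):
        if use_lock:
            with lock:          # only one thread may modify the counter at a time
                counter += 1
        else:
            counter += 1        # unprotected read-modify-write: updates can be lost

for use_lock in (False, True):
    counter = 0
    threads = [threading.Thread(target=increment, args=(100_000, use_lock))
               for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    label = "with lock" if use_lock else "without lock"
    print(label, counter)       # expected 400000; often lower without the lock
```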
CAP theorem
- CAP Theorem in plain English
- The CAP theorem states that a distributed system can guarantee at most 2 of the following 3 properties at the same time:
- Consistency
- Availability
- Partition Tolerance
- This applies to systems whose nodes (e.g. data centres) are spread across multiple locations.
Consistency
- Assuming you have a storage system with more than one machine, consistency implies that the data is the same across the cluster, so you can read or write to/from any node and get the same data.
- Strong consistency: every read sees the most recent write.
- Eventual consistency
- In a cluster, if multiple machines store the same data, an eventually consistent model implies that all machines will have the same data eventually.
- It's possible that at a given instant those machines have different versions of the same data (temporarily inconsistent), but they will eventually reach a state where they have the same data; a toy sketch follows below.
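A toy last-write-wins sketch of eventual consistency (one possible convergence rule, not how any specific database implements it): the two replicas disagree right after the write and converge once they sync.

```python
import time

class Replica:
    """Stores (value, timestamp) pairs; the newer timestamp wins on sync."""

    def __init__(self):
        self.data = {}   # key -> (value, timestamp)

    def write(self, key, value):
        self.data[key] = (value, time.time())

    def read(self, key):
        value, _ = self.data.get(key, (None, 0))
        return value

    def sync_from(self, other):
        # anti-entropy: adopt the other replica's entry if it is newer
        for key, (value, ts) in other.data.items():
            if ts > self.data.get(key, (None, 0))[1]:
                self.data[key] = (value, ts)


a, b = Replica(), Replica()
a.write("x", 1)                  # only replica a has seen the write...
print(a.read("x"), b.read("x"))  # ...so reads temporarily disagree: 1 None
b.sync_from(a)                   # eventually the replicas exchange state
print(a.read("x"), b.read("x"))  # and converge: 1 1
```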
Availability
- In the context of a database cluster, availability refers to the ability to always respond to queries (read or write) irrespective of nodes going down.
Partition Tolerance - Reliability
- In the context of a database cluster, partition tolerance means the cluster continues to function even if there is a "partition" (communication break) between two nodes (both nodes are up, but can't communicate).
Efficiency
Maintainability
Data Stores
Relational Database - RDBMS
Why use RDBMS at all?
- Provides ACID Properties
ACID Properties
- Atomicity - a transaction either completes fully or not at all (transaction management); see the sqlite3 sketch after this list.
- Consistency - a transaction moves the database from one valid state to another.
- Isolation - concurrent transactions do not interfere with each other.
- Durability - once committed, data survives crashes and restarts.
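A small sqlite3 sketch of these properties in action (table name and business rule are made up): the transfer either commits as a whole or is rolled back, leaving the balances untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    # Atomicity: both updates succeed together or not at all
    conn.execute("UPDATE accounts SET balance = balance - 150 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 150 WHERE name = 'bob'")
    # Consistency: enforce the business rule "no negative balances"
    (negative,) = conn.execute(
        "SELECT COUNT(*) FROM accounts WHERE balance < 0").fetchone()
    if negative:
        raise ValueError("insufficient funds")
    conn.commit()       # Durability: a commit makes the changes permanent (on an on-disk DB)
except Exception:
    conn.rollback()     # the failed transfer leaves no partial update behind

print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 100), ('bob', 0)] -> unchanged, the transfer was rolled back
```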
NoSQL Data Stores
- MongoDB -> Replica Set
- Transaction management works within a single replica set, not across multiple replica sets (see the pymongo sketch after this list).
- Cassandra -> Partition
- Transaction management is possible only within the same partition.
- If redundancy is not a problem, use NoSQL; otherwise use an RDBMS.
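A sketch of the "within a single replica set" point using pymongo (MongoDB 4.0+ running as a replica set is required; the connection string, database and collection names are placeholders):

```python
from pymongo import MongoClient

# placeholder URI; multi-document transactions need a replica set
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
accounts = client.bank.accounts

with client.start_session() as session:
    with session.start_transaction():
        # both updates commit or abort together, but only because both
        # documents live in the same replica set -- there is no equivalent
        # spanning independent clusters
        accounts.update_one({"_id": "alice"}, {"$inc": {"balance": -100}},
                            session=session)
        accounts.update_one({"_id": "bob"}, {"$inc": {"balance": 100}},
                            session=session)
```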
Sharding
- With most huge systems, data does not fit on a single machine. In such cases, sharding refers to splitting the very large database into smaller, faster and more manageable parts called data shards.
- Logical partitioning of the data across servers.
Why can’t we use Sharding for big data problems
- Sharding is logical partitioning of the data across servers so that each part stays small.
- Example: if the country column is used as the shard key, the partitions end up with very unequal sizes.
- Individual partitions can still be very large, which is why tools like Kafka are popular here.
- A relational DB can handle big-data workloads if joins are removed, but that sacrifices normalization, so it is usually not a feasible option.
- If we shard data by country, servers holding high-population countries receive far more load than the other partitions for the same operations (a hot, skewed partition); the sketch below illustrates this.
- SQL query - executed as a chain of SQL operations (bottom-up), represented as a DAG (Directed Acyclic Graph).
- Kafka - provides on-demand scale up / scale down.
- Reading by offset in Kafka is comparable to an index scan, whereas the equivalent query on an RDBMS may need a full table scan.
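A small sketch of the skew problem described above: hashing the shard key into a fixed number of shards spreads rows evenly only if the key has enough distinct values. With country as the key, one shard absorbs almost all rows; with a per-user key the rows spread out. Names and numbers are illustrative.

```python
import hashlib
from collections import Counter

SHARDS = 4

def shard_for(key):
    # hash any shard key into one of SHARDS buckets
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % SHARDS

# skewed data set: one populous country, one small one
rows = [("IN", f"user{i}") for i in range(8000)] + \
       [("IS", f"user{i}") for i in range(8000, 8080)]

print("shard sizes with country as key:", Counter(shard_for(c) for c, _ in rows))
print("shard sizes with user id as key:", Counter(shard_for(u) for _, u in rows))
```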
Tools
- Stream Processor
- Batch Processor
- Message Brokers
Buzz Words
- Big Data
- Web-scale
- Sharding
- Cloud services
- MapReduce
- Real-time
- Elasticsearch - search over unstructured data
- single-node versus distributed systems
- online/interactive versus offline/batch processing systems
- Real life example of scaling using MySQL
- Paxos Consensus