Requirement Gathering
Clarification questions:
Scope:
- What's the purpose of the system?
- What functions should be included?
- Do we want further analytics? If so, we will need to store more data, such as number of clicks, location data, timestamps, etc.
Scale of Server:
- How much traffic? How often do users come? What transaction rate do we need to support?
- What's the percentage of reads vs. writes?
- May need to estimate by ourselves (see the back-of-envelope sketch below).
- Usually massive scale.
- Horizontal partitioning
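A quick back-of-envelope sketch of how such an estimate might go; every number below is a hypothetical assumption, not from the original notes:

```python
# Back-of-envelope capacity estimate; all inputs are hypothetical assumptions.
daily_active_users = 10_000_000
requests_per_user_per_day = 10
read_write_ratio = 100                        # assume 100 reads per write

total_requests_per_day = daily_active_users * requests_per_user_per_day
avg_qps = total_requests_per_day / (24 * 3600)
peak_qps = avg_qps * 2                        # rough rule of thumb: peak ~2x average

write_qps = avg_qps / (1 + read_write_ratio)
read_qps = avg_qps - write_qps

print(f"avg QPS ~{avg_qps:,.0f}, peak ~{peak_qps:,.0f}")
print(f"reads ~{read_qps:,.0f}/s, writes ~{write_qps:,.0f}/s")
```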
Scale of Data:
- What's the scale of the data to store?
- NoSQL databases scale well (SQL can scale vertically, but NoSQL is usually the better solution for horizontal scale); see the sharding sketch below.
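A minimal sketch of hash-based horizontal partitioning (sharding); the shard count and key format are hypothetical:

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count

def shard_for(key: str) -> int:
    """Deterministically map a record key to a shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("user:12345"))  # every node computes the same shard for a given key
```

Note that plain modulo hashing reshuffles most keys when the shard count changes; consistent hashing is the usual fix for that.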
Data Retention Policy
- How long do we need to keep the data?
Performance Requirements
Requirement | Explanation | Solution |
---|---|---|
Latency | How fast | 1. Caching 2. CDN |
Availability | System uptime | 1. Deploying on servers across geographically distributed locations 2. Load balancer rerouting to healthy servers |
Durability | Continued existence of data | 1. Regular backups 2. Storing resources in different geographical locations 3. Performing checksums on data and repairing corrupted data from backups |
Reliability | System works properly as designed | Thorough system test procedures |
Resiliency | Self-heal after damage | 1. Geo routing 2. Identify faulty software, automate repair/restart or take it out of the working system 3. Active failover sites / active replication |
Fault tolerance | Zero downtime | Redundant parallel operating environment |
Latency
- How fast does it need to be?
- In SLA language, "3 nines latency is 100ms" means 99.9% of requests come back within 100ms.
Solution:
- Caching
- CDN (Content Delivery Network) is geographically distributed for content around the world
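A small sketch of checking a "3 nines under 100ms" target against measured latencies; the sample data is randomly generated, standing in for real request logs:

```python
import random

# Hypothetical latency samples in milliseconds (stand-in for real request logs).
latencies_ms = [random.gauss(40, 15) for _ in range(100_000)]

def percentile(samples, pct):
    """Return the value below which pct% of the samples fall."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[idx]

p999 = percentile(latencies_ms, 99.9)
print(f"p99.9 = {p999:.1f}ms -> SLA {'met' if p999 <= 100 else 'violated'}")
```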
Availability
- The percentage of time the service is able to respond to requests, i.e., system uptime.
- Do we allow certain downtime of the system?
- 99% availability means 3.65 days of downtime in a year
Solution:
- Deploying applications on servers across geographically distributed locations to reduce latency and withstand disasters
- Use proper load balancing techniques to reroute requests to healthy servers
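The downtime numbers follow directly from the availability percentage; a quick sketch of the arithmetic:

```python
# Allowed downtime per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60

for label, availability in [("2 nines", 0.99), ("3 nines", 0.999),
                            ("4 nines", 0.9999), ("5 nines", 0.99999)]:
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label}: ~{downtime_min:,.1f} min/year of allowed downtime")
```

For 99% this works out to about 5,256 minutes per year, i.e., the 3.65 days quoted above.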
Durability
- The ongoing existence of the object or resource.
- Note that it does not mean you can access it, only that it continues to exist.
- 99% durability means 1% chance of losing data
Solution:
- By taking regular backups
- Storing resources in different geographical locations to sustain disasters
- Performing checksums on data and repairing the corrupted data from backups.
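A minimal sketch of checksum-based corruption detection and repair; the file paths are hypothetical, and real systems usually do this per block or per object:

```python
import hashlib
import shutil

def sha256(path: str) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_and_repair(primary: str, backup: str, expected: str) -> None:
    """Compare against the recorded checksum; restore from backup on mismatch."""
    if sha256(primary) != expected:
        shutil.copyfile(backup, primary)  # repair the corrupted copy from backup

# `expected` would have been recorded when the object was originally written.
# verify_and_repair("data/object.bin", "backup/object.bin", expected)
```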
Reliability
- Closely related to availability; however, a system can be ‘available’ but not working properly. Reliability is the probability that a system will work as designed.
Solution:
- System test procedures that cover production-load scenarios and edge cases help verify the correctness of the system.
- Any new feature should be thoroughly tested across all scenarios before being introduced, as most correctness issues surface when applying patches or system upgrades.
Resiliency
- The ability of a system to self-heal after damage, failure, load, or attack.
- Note that this does not mean the system stays available throughout the event, only that it will recover on its own.
Solution:
- Design the system to identify faulty software or hardware and automate its repair/restart, or, if it's beyond repair, take it out of the working system (see the watchdog sketch after this list).
- Active failover sites for applications and active replication to restore corrupted data.
- Geo Routing
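A toy watchdog sketch of the automate-the-restart idea; `worker.py` is a hypothetical service script:

```python
import subprocess
import time

def run_forever(cmd, check_interval=5):
    """Keep a worker process running; restart it whenever it dies."""
    proc = subprocess.Popen(cmd)
    while True:
        time.sleep(check_interval)
        if proc.poll() is not None:       # process exited (crashed or finished)
            print(f"worker exited with {proc.returncode}, restarting")
            proc = subprocess.Popen(cmd)  # automated restart = self-healing

run_forever(["python", "worker.py"])  # worker.py is hypothetical
```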
Fault tolerance
- The ability of a system to continue operating properly when some of its components fail.
- Similar to high availability, except that a highly available system may still have some expected downtime, while a fault-tolerant system requires zero downtime.
- For example, a two-engine airplane is fault-tolerant: if one engine fails, it can still fly on the other.
- Fault tolerance is costly to achieve, as it involves maintaining a redundant parallel operating environment (software and hardware) that can actively take over processing in the event of failure.
Budget / cost constraints
- Do I have infinite money and infinite servers?
Tips:
- Work backwards from the customer experience to define your requirements.
- Who are the customers? What are their use cases? Which use cases do we need to consider?
- Consider separating services by function so that they can be scaled up independently; changes are then easier to isolate and handle.
Architecture
Process: User -> Internet -> (CDN) -> (Geo Routing) -> (Load Balancer) -> Server (HTTP server / web server) -> Caching layer -> Database
Load Balancer: nginx
Load Balancer vs Queue
Load balancing is definitely something that can be achieved using queue technologies like Kafka.
Load balancing with Kafka is straightforward and is handled by the Kafka producers by default. While it isn't traditional load balancing, it does spread the message load between partitions while preserving per-key message ordering.
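A minimal sketch with the kafka-python client; the broker address and topic name are assumptions. Keyed messages hash to a fixed partition, so each key's events stay ordered while different keys spread across partitions:

```python
from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker and topic. The default partitioner hashes the key,
# so all events for one user land on the same partition (ordering preserved)
# while different users spread across partitions (load spread).
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for user_id in ["u1", "u2", "u3"]:
    producer.send("click-events", key=user_id.encode(), value=b"clicked")

producer.flush()  # block until all buffered messages are actually sent
```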
Message queuing services focus on asynchronous communication between disparate application parts,
while load balancing services focus on synchronous communication between clients and one or more servers from a back-end pool.
The difference is in the direction of communication:
A load balancer pushes tasks (mostly HTTP requests) to a pool of workers ("upstream" HTTP servers).
With a message queue like RabbitMQ, the queue is just a store of tasks/messages (a "specialized database"), and workers actively ask the queue to give them work.
Load balancer:
- client sends request to load balancer
- load balancer sends request to backend
- backend replies with response to load balancer
- load balancer replies with response to the client
Message queue:
- client sends task to message queue
- message queue replies to the client "ok, got it"
- backend worker asks the message queue whether it has any tasks
- message queue replies with tasks to the backend
- optional: the backend worker tells the message queue that it successfully finished processing the task so the queue can discard it; otherwise the queue may assume the worker failed and send the task to another worker
- if a response is needed, the backend worker places it on a "callback queue" from which the client reads it
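A minimal worker-side sketch using pika (RabbitMQ's Python client); the queue name and host are assumptions. The explicit ack at the end is the "successfully finished" signal described above:

```python
import pika  # pip install pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="tasks")  # hypothetical queue name

def handle(ch, method, properties, body):
    print(f"processing {body!r}")
    # ... do the actual work here ...
    ch.basic_ack(delivery_tag=method.delivery_tag)  # tell the queue it's done

# The worker pulls work: it subscribes, and the queue feeds it tasks as they arrive.
channel.basic_consume(queue="tasks", on_message_callback=handle)
channel.start_consuming()
```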
API Calls
HTTP Methods and Response
HTTP Method | CRUD | HTTP response code |
---|---|---|
GET | Read | 200: OK; 400: BAD REQUEST; 404: NOT FOUND |
POST | Create | 200: OK; 201: Created; 204: No Content |
PUT | Create/Update/Replace | 200: OK; 201: Created; 204: No Content |
PATCH | Partial Update/Modify | 200: OK; 204: No Content; 404: NOT FOUND |
DELETE | Delete | 200: OK; 202: Accepted; 204: No Content; 404: NOT FOUND |
https://restfulapi.net/http-s...
https://stackoverflow.com/que...
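A short client-side sketch of handling these status codes with the requests library; the URL is a placeholder:

```python
import requests  # pip install requests

resp = requests.get("https://example.com/v1/messages_api")  # placeholder URL
if resp.status_code == 200:        # OK: resource read successfully
    print(resp.json())
elif resp.status_code == 404:      # NOT FOUND: resource does not exist
    print("no such resource")
else:
    resp.raise_for_status()        # raise for any other 4xx/5xx
```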
HTTP Requests and Response
HTTP request
must have:
- An HTTP method (like GET)
- A host URL (like https://api.spotify.com/)
- An endpoint path (like v1/artists/{id}/related-artists)
optionally have:
- Body
- Headers
- Query strings
- HTTP version
HTTP response
must have:
- Protocol version (like HTTP/1.1)
- Status code (like 200)
- Status text (OK)
- Headers
optionally have:
- Body
Example:
PATH: /v1/messages_api
POST
REQ {
message: str
}
RESP 200 OK {}
GET
REQ {}
RESP 200 OK {
messages: List[str]
}
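A minimal sketch of this messages API using Flask; the framework choice and the in-memory store are assumptions, not part of the original notes:

```python
from flask import Flask, jsonify, request  # pip install flask

app = Flask(__name__)
messages = []  # in-memory store; a real system would use a database

@app.route("/v1/messages_api", methods=["POST"])
def create_message():
    body = request.get_json()
    messages.append(body["message"])
    return jsonify({}), 200  # matches "RESP 200 OK {}" above

@app.route("/v1/messages_api", methods=["GET"])
def list_messages():
    return jsonify({"messages": messages}), 200

if __name__ == "__main__":
    app.run()
```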