Requirement Gathering
Clarification questions:
Scope:
- What's the purpose of the system?
- What functions should be included?
- Do we want further analytics? If so, we will need to store more data, such as number of clicks, location data, timestamps, etc.
Scale of Server:
- How much traffic? How often do users come? What transaction rate do we need to support?
- What's the percentage of reads vs. writes?
- May need to estimate by ourselves (see the back-of-envelope sketch below).
- Usually massive scale.
- Horizontal partitioning
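A quick back-of-envelope sketch of how such an estimate might go; every number below is a hypothetical assumption, not from the original notes:

```python
# Back-of-envelope capacity estimate; all inputs are hypothetical assumptions.
daily_active_users = 10_000_000
requests_per_user_per_day = 10
read_write_ratio = 100                        # assume 100 reads per write

total_requests_per_day = daily_active_users * requests_per_user_per_day
avg_qps = total_requests_per_day / (24 * 3600)
peak_qps = avg_qps * 2                        # rough rule of thumb: peak ~2x average

write_qps = avg_qps / (1 + read_write_ratio)
read_qps = avg_qps - write_qps

print(f"avg QPS ~{avg_qps:,.0f}, peak ~{peak_qps:,.0f}")
print(f"reads ~{read_qps:,.0f}/s, writes ~{write_qps:,.0f}/s")
```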
Scale of Data:
- What's the scale of the data to store?
- NoSQL databases scale well (SQL can scale vertically, but NoSQL is usually the better solution for horizontal scale); see the sharding sketch below.
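A minimal sketch of hash-based horizontal partitioning (sharding); the shard count and key format are hypothetical:

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count

def shard_for(key: str) -> int:
    """Deterministically map a record key to a shard."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("user:12345"))  # every node computes the same shard for a given key
```

Note that plain modulo hashing reshuffles most keys when the shard count changes; consistent hashing is the usual fix for that.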
Data Retention Policy
- How long do we need to keep the data?
Performance Requirements
Requirement | Explanation | Solution |
---|---|---|
Latency | How fast | 1. Caching 2. CDN |
Availability | System uptime | 1. Deploying on servers across geographically distributed locations 2. Load balancer rerouting to healthy servers |
Durability | Continued existence of data | 1. Regular backups 2. Storing resources in different geographical locations 3. Performing checksums on data and repairing corrupted data from backups |
Reliability | System works properly as designed | Thorough system test procedures |
Resiliency | Self-heal after damage | 1. Geo routing 2. Identify faulty software, automate repair/restart or take it out of the working system 3. Active failover sites / active replication |
Fault tolerance | Zero downtime | Redundant parallel operating environment |
Latency
- How fast does it need to be?
- In SLA language, "3 nines latency is 100ms" means 99.9% of requests come back within 100ms.
Solution:
- Caching
- CDN (Content Delivery Network) is geographically distributed for content around the world
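A small sketch of checking a "3 nines under 100ms" target against measured latencies; the sample data is randomly generated, standing in for real request logs:

```python
import random

# Hypothetical latency samples in milliseconds (stand-in for real request logs).
latencies_ms = [random.gauss(40, 15) for _ in range(100_000)]

def percentile(samples, pct):
    """Return the value below which pct% of the samples fall."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[idx]

p999 = percentile(latencies_ms, 99.9)
print(f"p99.9 = {p999:.1f}ms -> SLA {'met' if p999 <= 100 else 'violated'}")
```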
Availability
- The percentage of time the service is able to respond to requests, i.e., system uptime.
- Do we allow certain downtime of the system?
- 99% availability means 3.65 days of downtime in a year
Solution:
- Deploying applications on servers across geographically distributed locations to reduce latency and withstand disasters
- Use proper load balancing techniques to reroute requests to healthy servers
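The downtime numbers follow directly from the availability percentage; a quick sketch of the arithmetic:

```python
# Allowed downtime per year for a given availability target.
MINUTES_PER_YEAR = 365 * 24 * 60

for label, availability in [("2 nines", 0.99), ("3 nines", 0.999),
                            ("4 nines", 0.9999), ("5 nines", 0.99999)]:
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label}: ~{downtime_min:,.1f} min/year of allowed downtime")
```

For 99% this works out to about 5,256 minutes per year, i.e., the 3.65 days quoted above.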
Durability
- The ongoing existence of the object or resource.
- Note that it does not mean you can access it, only that it continues to exist.
- 99% durability means 1% chance of losing data
Solution:
- By taking regular backups
- Storing resources in different geographical locations to sustain disasters
- Performing checksums on data and repairing the corrupted data from backups.
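A minimal sketch of checksum-based corruption detection and repair; the file paths are hypothetical, and real systems usually do this per block or per object:

```python
import hashlib
import shutil

def sha256(path: str) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_and_repair(primary: str, backup: str, expected: str) -> None:
    """Compare against the recorded checksum; restore from backup on mismatch."""
    if sha256(primary) != expected:
        shutil.copyfile(backup, primary)  # repair the corrupted copy from backup

# `expected` would have been recorded when the object was originally written.
# verify_and_repair("data/object.bin", "backup/object.bin", expected)
```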
Reliability
- Closely related to availability; however, a system can be ‘available’ but not working properly. Reliability is the probability that a system will work as designed.
Solution:
- System test procedures that cover production-load scenarios and edge cases help verify the correctness of the system.
- Any new feature should be thoroughly tested across all scenarios before being introduced, as most correctness issues surface when applying patches or system upgrades.
Resiliency
- The ability of a system to self-heal after damage, failure, load, or attack.
- Note that this does not mean the system stays available throughout the event, only that it will recover on its own.
Solution:
- Design the system to identify faulty software or hardware and automate its repair/restart, or, if it's beyond repair, take it out of the working system (see the watchdog sketch after this list).
- Active failover sites for applications and active replication to restore corrupted data.
- Geo Routing
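A toy watchdog sketch of the automate-the-restart idea; `worker.py` is a hypothetical service script:

```python
import subprocess
import time

def run_forever(cmd, check_interval=5):
    """Keep a worker process running; restart it whenever it dies."""
    proc = subprocess.Popen(cmd)
    while True:
        time.sleep(check_interval)
        if proc.poll() is not None:       # process exited (crashed or finished)
            print(f"worker exited with {proc.returncode}, restarting")
            proc = subprocess.Popen(cmd)  # automated restart = self-healing

run_forever(["python", "worker.py"])  # worker.py is hypothetical
```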
Fault tolerance
- The ability of a system to continue operating properly when some of its components fail.
- Similar to high availability, except that a highly available system may still have some expected downtime, while a fault-tolerant system requires zero downtime.
- For example, a two-engine airplane is fault-tolerant: if one engine fails, it can still fly on the other.
- Fault tolerance is costly to achieve, as it involves maintaining a redundant parallel operating environment (software and hardware) that can actively take over processing in the event of failure.
Budget / cost constraints
- Do I have infinite money and infinite servers?
Tips:
- Work backwards from the customer experience to define your requirements.
- Who are the customers? What are their use cases? Which use cases do we need to consider?
- Consider separating services by function so that they can be scaled up independently; changes are then easier to isolate and handle.
Architecture
Process: User -> Internet -> (CDN) -> (Geo Routing) -> (Load Balancer) -> Server (HTTP server / web server) -> Caching layer -> Database
Load Balancer: nginx
Load Balancer vs Queue
Load balancing is definitely something that can be achieved using queue technologies like Kafka.
Load balancing with Kafka is straightforward and is handled by the Kafka producers by default. While it isn't traditional load balancing, it does spread the message load between partitions while preserving per-key message ordering.
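A minimal sketch with the kafka-python client; the broker address and topic name are assumptions. Keyed messages hash to a fixed partition, so each key's events stay ordered while different keys spread across partitions:

```python
from kafka import KafkaProducer  # pip install kafka-python

# Hypothetical broker and topic. The default partitioner hashes the key,
# so all events for one user land on the same partition (ordering preserved)
# while different users spread across partitions (load spread).
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for user_id in ["u1", "u2", "u3"]:
    producer.send("click-events", key=user_id.encode(), value=b"clicked")

producer.flush()  # block until all buffered messages are actually sent
```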
Message queuing services focus on asynchronous communication between disparate application parts,
while load balancing services focus on synchronous communication between clients and one or more servers from a back-end pool.
The difference is in the direction of communication:
A load balancer pushes tasks (mostly HTTP requests) to a pool of workers ("upstream" HTTP servers).
With a message queue like RabbitMQ, the queue is just a store of tasks/messages (a "specialized database"), and workers actively ask the queue to give them work.
Load balancer:
- client sends request to load balancer
- load balancer sends request to backend
- backend replies with response to load balancer
- load balancer replies with response to the client
Message queue:
- client sends task to message queue
- message queue replies to the client "ok, got it"
- backend worker asks the message queue whether it has any tasks
- message queue replies with tasks to the backend
- optional: the backend worker tells the message queue that it successfully finished processing the task so the queue can discard it; otherwise the queue may assume the worker failed and send the task to another worker
- if a response is needed, the backend worker places it on a "callback queue" from which the client reads it
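A minimal worker-side sketch using pika (RabbitMQ's Python client); the queue name and host are assumptions. The explicit ack at the end is the "successfully finished" signal described above:

```python
import pika  # pip install pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="tasks")  # hypothetical queue name

def handle(ch, method, properties, body):
    print(f"processing {body!r}")
    # ... do the actual work here ...
    ch.basic_ack(delivery_tag=method.delivery_tag)  # tell the queue it's done

# The worker pulls work: it subscribes, and the queue feeds it tasks as they arrive.
channel.basic_consume(queue="tasks", on_message_callback=handle)
channel.start_consuming()
```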
API Calls
HTTP Methods and Response
HTTP Method | CRUD | HTTP response code |
---|---|---|
GET | Read | 200: OK; 400: BAD REQUEST; 404: NOT FOUND |
POST | Create | 200: OK; 201: Created; 204: No Content |
PUT | Create/Update/Replace | 200: OK; 201: Created; 204: No Content |
PATCH | Partial Update/Modify | 200: OK; 204: No Content; 404: NOT FOUND |
DELETE | Delete | 200: OK; 202: Accepted; 204: No Content; 404: NOT FOUND |
https://restfulapi.net/http-s...
https://stackoverflow.com/que...
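A short client-side sketch of handling these status codes with the requests library; the URL is a placeholder:

```python
import requests  # pip install requests

resp = requests.get("https://example.com/v1/messages_api")  # placeholder URL
if resp.status_code == 200:        # OK: resource read successfully
    print(resp.json())
elif resp.status_code == 404:      # NOT FOUND: resource does not exist
    print("no such resource")
else:
    resp.raise_for_status()        # raise for any other 4xx/5xx
```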
HTTP Requests and Response
HTTP request
must have:
- An HTTP method (like GET)
- A host URL (like https://api.spotify.com/)
- An endpoint path (like v1/artists/{id}/related-artists)
optionally have:
- Body
- Headers
- Query strings
- HTTP version
HTTP response
must have:
- Protocol version (like HTTP/1.1)
- Status code (like 200)
- Status text (OK)
- Headers
optionally have:
- Body
Example:
PATH: /v1/messages_api
POST
REQ {
message: str
}
RESP 200 OK {}
GET
REQ {}
RESP 200 OK {
messages: List[str]
}
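A minimal sketch of this messages API using Flask; the framework choice and the in-memory store are assumptions, not part of the original notes:

```python
from flask import Flask, jsonify, request  # pip install flask

app = Flask(__name__)
messages = []  # in-memory store; a real system would use a database

@app.route("/v1/messages_api", methods=["POST"])
def create_message():
    body = request.get_json()
    messages.append(body["message"])
    return jsonify({}), 200  # matches "RESP 200 OK {}" above

@app.route("/v1/messages_api", methods=["GET"])
def list_messages():
    return jsonify({"messages": messages}), 200

if __name__ == "__main__":
    app.run()
```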