
In the not-so-distant past of IT, there was a joke about what would happen if the databases got together for a "meeting".

Oracle: We need an enterprise database.

MySQL: Oracle is not open source.

PostgreSQL: MySQL doesn't have enough features.

SQLite: You can embed me anywhere. With that, four kinds of databases should be enough for everyone.

MongoDB: Why do we use joins and schemas?

Bigtable: MongoDB doesn't scale well for the web.

HBase: Bigtable is not open source. ...

(From: "Foreign Journal IT Review")

This "dialogue" obviously has a witty element, but it also reflects an reality - the era of passed. As the saying goes, "If you want to do a good job, you must first sharpen your tools." So, what kind of data architecture do we need? How to choose a database? In the first issue of Amazon Cloud Technology's Build On "Thinking and Practice of Modern Data Architecture - Interpretation and Architecture Construction of NoSQL Past and Present", database product experts Lv Lin and Li Jun shared the topic of modern data architecture and led everyone to complete it on the spot. Two hands-on labs related to non-relational databases.

01 A single database cannot meet the demand

1970 was a huge turning point in the history of database technology: that year, Edgar Codd published "A Relational Model of Data for Large Shared Data Banks". Since then, relational databases have occupied the top position in the database ecosystem, and Codd himself received a Turing Award for this achievement. It is worth noting that the field of relational databases alone has produced four Turing Award winners.

With the development of modern applications, developers have higher requirements for performance, scale, and availability: user counts above a million, data volumes growing from TB to PB, and latency requirements of milliseconds or even microseconds. At the same time, developers want to avoid heavy, repetitive operations and deployment work so they can put more energy into developing the business itself. The single-database model can no longer meet the needs of enterprises.

In 2004, Amazon's e-commerce business suffered a serious outage that left users unable to complete transactions for hours on end. At that time, Amazon's e-commerce ran on Oracle relational databases, but relational databases are naturally weak at reading and writing massive amounts of data efficiently. Despite running tens of thousands of Oracle database instances and sharding the data across databases and tables, the system still crashed when business volume surged. Amazon had already hit the scaling bottleneck of relational databases.

After that major incident, Amazon began to rethink, rebuild its applications, and re-select its databases. In fact, as one of Oracle's largest customers in the world at the time, Amazon received extremely favorable license discounts. However, the explosive growth it expected in the future made it aware of the shortcomings of its data architecture. After careful research and design, Amazon decided to abandon the single-database model and split the workload across multiple types of databases, such as Amazon Redshift, Amazon DynamoDB, Amazon Aurora, and PostgreSQL. This approach avoids the performance degradation that a relational database alone suffers as the data set grows: under massive data sets it sustains high concurrent requests with consistently low response latency, and it has almost no upper limit on scaling. Today, Amazon's e-commerce system can handle more than 80 million calls per second on Prime Day (an event similar to Double 11), which would be almost impossible with relational databases alone.

Amazon is not alone: many giants in the Internet and financial industries use multiple databases at the same time. For example, Airbnb, the website that connects travelers with hosts offering rentals worldwide, uses MySQL and Amazon RDS for relational workloads, DynamoDB as its non-relational database, and Amazon ElastiCache with Redis for front-end data caching. Capital One, a financial services company, makes heavy use of the non-relational database DynamoDB, while Amazon Redshift is used for data analysis.

Each database has its own historical background: it is the product of specific scenario requirements under the time and technical conditions of its day, and each has its own strengths and limitations. Different types of services, and even different scenarios within the same service chain, can therefore be matched to different databases as needed. Moreover, as microservices are split ever more finely, the databases behind them are naturally split as well, and more and more enterprises find it both practical and appealing to choose a dedicated database for each demand scenario.

So the first and most important concept of modern data architecture that we bring to you today is this: purpose-built databases - a dedicated database for each workload.

02 How to choose a suitable database?

Nothing is more dazzling than the "big family" of database services. How do you choose among the different databases for your own application scenarios, so that each scenario gets the best possible performance, availability, and scalability? In her talk, Lu Lin walked through the application scenarios of the different types of purpose-built databases.

She started with the relational databases that developers know best. The most commonly used relational databases are PostgreSQL, MySQL, MariaDB, Oracle Database, SQL Server, and so on, and Amazon Cloud Technology's RDS provides five commonly used database engines. Why, then, did Amazon Cloud Technology create Amazon Aurora? Lu Lin said: "This actually comes from the needs of customers."

According to customer feedback, Oracle and SQL Server are powerful and provide enterprise-grade support, but they come with strict licensing mechanisms, a strong tendency toward vendor lock-in, and high prices. Open source databases such as MySQL and PostgreSQL, on the other hand, are free but fall short in functionality, performance, high availability, and enterprise-grade support.

How can we have the best of both worlds? In 2014, Amazon Cloud Technology launched Amazon Aurora, the first relational database built for the cloud. Amazon Aurora is fully compatible with MySQL and PostgreSQL, delivers up to five times the performance of standard MySQL and three times that of standard PostgreSQL, and charges only for what you use. Amazon Aurora is a good example of combining performance with high availability.

However, as a relational database, Amazon Aurora still cannot escape a design problem shared by all relational databases: as the amount of data grows, indexing performance inevitably declines. To handle massive amounts of data while still preserving indexing performance, companies usually take one of two approaches.

The first is to shard the relational database across multiple databases and tables. Sharding can improve performance and availability, but it also creates plenty of trouble for developers. How do you handle transactions that span shards? What about cross-shard queries? How do you distribute hot and cold data evenly across the shards? All of these take time for developers to think through.

The second approach brings us to non-relational databases. Non-relational databases have flexible storage formats, are fast and highly scalable, and are relatively low-cost. They perform strongly in many specific scenarios, such as massive writes, precise reads, highly concurrent updates, and workloads with relaxed consistency requirements. Amazon Cloud Technology's most typical non-relational database is DynamoDB. Its scaling has almost no upper limit, it avoids the performance degradation caused by growing data sets, and it maintains millisecond or even microsecond response times over massive data sets. On top of that, DynamoDB uses a serverless architecture that scales up or down automatically with no hardware provisioning, software patching, or upgrades, and it backs up data continuously.

Beyond the common relational and non-relational databases, there are other types as well, such as in-memory databases, document databases, graph databases, and time series databases, each with its own suitable application scenarios. Lu Lin introduced them one by one.

In-memory databases: for example Amazon ElastiCache or Amazon MemoryDB. Can this type of database guarantee that data is never lost? Generally speaking, Redis replication is asynchronous, so some data may be lost; with the in-memory database Amazon MemoryDB, however, there is no data loss.

Document databases: for example MongoDB and Amazon DocumentDB. MongoDB is widely accepted in China and is well suited to storing JSON data directly, so industries such as gaming and live streaming naturally tend to adopt it. However, the free version of MongoDB makes high availability hard to achieve, and the paid version is very expensive. By comparison, Amazon DocumentDB provides stronger high availability and scalability.

Graph databases: for example Amazon Neptune. Graph databases are relatively new and are mainly used to record the relationships between things. They are commonly used in social networks, knowledge graphs, life sciences, and similar scenarios, and they also play an important role behind fraud detection and epidemic prevention and control.

Time series databases: for example Amazon Timestream. Time series databases mainly handle data with time stamps and are used in industries such as insurance, power, and chemicals for real-time detection, monitoring, and analysis. Amazon Timestream is a fast, scalable, fully managed service that is up to 1,000 times faster than relational databases at as little as 1/10 the cost.

As the saying goes, even a foot can fall short and an inch can prove long: every kind of database delivers its greatest value only in the scenarios that suit it.

03 Highly available, scalable NoSQL databases for modern applications

Although there are many types of databases, they can generally be divided into two camps: traditional SQL and the relatively newer NoSQL. Lu Lin emphasized that the two are not mutually replaceable. If you need lots of joins or flexible ad hoc queries, SQL is the obvious choice. If you need massive scale, low and predictable latency, and a flexible schema, NoSQL is the better choice.


For non-relational databases, Lu Lin focused on the basics and best practices of DynamoDB, and the subsequent hands-on labs were also built around this database.

In 2007, Amazon published the Dynamo paper, which inspired a series of later NoSQL theories and products and laid the theoretical groundwork; many NoSQL products reference the Dynamo design. In 2012, DynamoDB was officially born: a fully managed, serverless NoSQL database designed to solve the core problems of database management, performance, scalability, and reliability. With high scalability, availability, and robustness, it suits applications that store large amounts of data and require low latency at the same time. DynamoDB is fully managed and so easy to operate that developers joke, "With DynamoDB you don't have to worry about anything - just remember to pay the bill." Many top companies use DynamoDB, such as Netflix, and Huami and Suirui in China.

The core components of DynamoDB are tables, items, and attributes. A table is a collection of items, and an item is a collection of attributes. DynamoDB uses primary keys to uniquely identify items in a table. The partition key feeds an unsorted hash index, which allows the table to be partitioned to meet scalability requirements. Within the hash partition determined by the partition key, items are ordered by the sort key, and there is no upper limit on the amount of data per partition key value unless the table has a local secondary index.
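To make the table / item / attribute terminology concrete, here is a minimal sketch in Python with boto3; the "photos" table, its attribute names, and the sample values are hypothetical, not from the talk:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Hypothetical table whose primary key is username (partition key) + uploaded_at (sort key).
table = boto3.resource("dynamodb").Table("photos")

# An item is just a set of attributes; only the two key attributes are mandatory.
table.put_item(Item={
    "username": "alice",                    # partition key: hashed to choose a partition
    "uploaded_at": "2022-06-01T10:00:00Z",  # sort key: orders items within the partition
    "caption": "sunset over the bay",       # any further attributes are schemaless
})

# All of alice's items come back ordered by the sort key.
resp = table.query(KeyConditionExpression=Key("username").eq("alice"))
for item in resp["Items"]:
    print(item["uploaded_at"], item.get("caption"))
```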


A local secondary index (LSI) can use a different sort key from the table's, with one index partition per table partition. With an LSI, each partition key value can store at most 10 GB of data, counting both the table partition and the index partition.


Besides the local secondary index, the other index type is the global secondary index (GSI). A GSI can use a different partition key and sort key from the table's, and each index partition maps to all table partitions.


How do you choose between a GSI and an LSI? A GSI has no upper limit on index size, its read and write capacity is independent of the table's, and it supports only eventual consistency. An LSI is stored inside the table's partitions, each partition key value is limited to 10 GB of storage, and it consumes the table's RCUs and WCUs.
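The difference also shows up in how the two index types are declared. Below is a hedged sketch using boto3; the "game_scores" table and its attribute names are invented for illustration. The LSI keeps the table's partition key and changes only the sort key, while the GSI defines its own keys and (in provisioned mode) its own capacity:

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="game_scores",
    AttributeDefinitions=[
        {"AttributeName": "player_id", "AttributeType": "S"},
        {"AttributeName": "game_id", "AttributeType": "S"},
        {"AttributeName": "game_date", "AttributeType": "S"},
        {"AttributeName": "score", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "player_id", "KeyType": "HASH"},   # table partition key
        {"AttributeName": "game_id", "KeyType": "RANGE"},    # table sort key
    ],
    LocalSecondaryIndexes=[{
        "IndexName": "by_date",
        "KeySchema": [
            {"AttributeName": "player_id", "KeyType": "HASH"},   # same partition key as the table
            {"AttributeName": "game_date", "KeyType": "RANGE"},  # different sort key
        ],
        "Projection": {"ProjectionType": "ALL"},
    }],
    GlobalSecondaryIndexes=[{
        "IndexName": "by_game_score",
        "KeySchema": [
            {"AttributeName": "game_id", "KeyType": "HASH"},  # its own partition key
            {"AttributeName": "score", "KeyType": "RANGE"},   # its own sort key
        ],
        "Projection": {"ProjectionType": "ALL"},
        "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    }],
    ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 10},
)
```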

When using DynamoDB, besides specifying the primary key (the partition key and, optionally, the sort key), users only need to estimate the number of accesses, and the system provisions capacity according to that estimate. On top of that, DynamoDB has a token bucket algorithm that banks unused RCUs to deal with sudden traffic spikes.

A fairly common problem for NoSQL is unbalanced access. DynamoDB's adaptive capacity feature raises the throughput of overheated partitions and isolates overheated items. In addition, DynamoDB offers auto scaling for provisioned capacity as well as on-demand capacity, minimizing enterprise cost while still guaranteeing capacity.
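As a rough sketch of what these capacity options look like in practice (the table name and numbers are hypothetical, and in reality you would pick one option per table), a table can either be switched to on-demand billing or keep provisioned capacity with an auto scaling policy attached:

```python
import boto3

dynamodb = boto3.client("dynamodb")
autoscaling = boto3.client("application-autoscaling")

# Option 1: on-demand capacity - no RCU/WCU to manage, pay per request.
dynamodb.update_table(TableName="photos", BillingMode="PAY_PER_REQUEST")

# Option 2: provisioned capacity with auto scaling on write capacity.
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/photos",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    MinCapacity=5,
    MaxCapacity=500,
)
autoscaling.put_scaling_policy(
    PolicyName="photos-wcu-target-tracking",
    ServiceNamespace="dynamodb",
    ResourceId="table/photos",
    ScalableDimension="dynamodb:table:WriteCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,  # keep consumed/provisioned WCU around 70%
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBWriteCapacityUtilization"
        },
    },
)
```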

At the end of her talk, Lu Lin introduced four DynamoDB design best practices (a brief sketch illustrating the first and third follows the list), namely:

● Carefully choose the hash key to achieve unlimited scaling

● How to store large items

● How to handle hot items

● Use time-series tables to store time series data
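As an illustration of choosing the hash key carefully and handling hot items, a common pattern is write sharding: append a random suffix to a very hot partition key so writes spread across several partitions, and fan reads out over the suffixes. This is a generic sketch under assumed names (the "reactions" table and its pk/sk layout are invented), not the exact technique shown in the talk:

```python
import random
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("reactions")  # hypothetical table
SHARDS = 10  # number of suffixes a hot key is spread over

def write_reaction(photo_id: str, user: str, emoji: str) -> None:
    shard = random.randint(0, SHARDS - 1)
    table.put_item(Item={
        "pk": f"{photo_id}#{shard}",   # sharded partition key, e.g. "PHOTO42#3"
        "sk": f"REACTION#{user}",
        "emoji": emoji,
    })

def read_reactions(photo_id: str) -> list:
    items = []
    for shard in range(SHARDS):        # reads fan out across all shards
        resp = table.query(KeyConditionExpression=Key("pk").eq(f"{photo_id}#{shard}"))
        items.extend(resp["Items"])
    return items
```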

04 Hands-on lab session

"It's only a matter of fact on paper, and I never know that this matter needs to be implemented." Teacher Lu Lin's sharing gave the developers on the scene a preliminary understanding of the modern data architecture. Then, under the leadership of Mr. Li Jun, the developers started the hands-on experiment. There are two hands-on experiments in this Build On.

Hands-on Lab 1: Designing a Database for a Mobile Application Using Amazon DynamoDB

Hands-on Lab 1 assumes a developer is building a mobile application for uploading photos. Users upload photos through the app, and their friends can view them. Since this is a social app, users can also find and follow friends; after following a friend, a user is notified when the friend posts new photos and can send messages to that friend. The developer also needs to let users react to photos with four emojis: heart, smiley, thumbs up, and sunglasses.

Through this lab, developers learned how to model DynamoDB tables to handle all of an application's access patterns, and how to use the new transaction capabilities to work with DynamoDB quickly and efficiently.
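Those transaction capabilities boil down to a single API call that either commits all writes or none of them. The sketch below only illustrates that idea; the table name, key layout, and attribute names are hypothetical, not the lab's actual schema. It records a user's reaction and bumps the photo's reaction counter atomically:

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.transact_write_items(TransactItems=[
    {
        "Put": {
            "TableName": "photos",
            "Item": {
                "PK": {"S": "PHOTO#alice#2022-06-01T10:00:00Z"},
                "SK": {"S": "REACTION#bob"},
                "reactionType": {"S": "heart"},
            },
            # fail the whole transaction if bob already reacted to this photo
            "ConditionExpression": "attribute_not_exists(SK)",
        }
    },
    {
        "Update": {
            "TableName": "photos",
            "Key": {
                "PK": {"S": "PHOTO#alice#2022-06-01T10:00:00Z"},
                "SK": {"S": "METADATA"},
            },
            "UpdateExpression": "ADD reactionCount :one",
            "ExpressionAttributeValues": {":one": {"N": "1"}},
        }
    },
])
```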

Hands-on Lab 2: Modeling Game Player Data Using Amazon DynamoDB

Besides social scenarios, DynamoDB is also a popular database service for games. Hands-on Lab 2 assumes the developer is building a battle royale game with 50 players online in each match. A game typically lasts about 30 minutes, and during a game the developer must update a particular player's record to reflect how long that player has played, the number of kills, or a win. The design also has to satisfy users who want to see the games they have played, find out who won a game, or watch replays of every game action.
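Updates like "this player just finished a match" map naturally onto a single update_item call. The following is a hedged sketch under assumed names (the "battle-royale" table, its PK/SK layout, and the attribute names are invented for illustration):

```python
import boto3

table = boto3.resource("dynamodb").Table("battle-royale")  # hypothetical table

# After a match, record the result and increment the player's counters in place.
table.update_item(
    Key={"PK": "PLAYER#carol", "SK": "GAME#1234"},
    UpdateExpression="SET game_result = :result ADD kill_count :kills, minutes_played :minutes",
    ExpressionAttributeValues={
        ":result": "WIN",
        ":kills": 7,
        ":minutes": 28,
    },
)
```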

Through this lab, developers learned more about core data modeling strategies and how to use DynamoDB to build a modern data architecture in games and similar scenarios.

Do you want to try it too? Developers who could not attend the event in person but are interested in DynamoDB can scan the QR code below to register an account, receive a gift package, and work through the two hands-on labs above.

[QR code]

Click "Read the original text" to download this issue's Build On handbook and review the course content on your own~

[QR code]

Scan the QR code above to register now

