图数据库 - Technology Sharing | Using a Graph Database to Reduce the Delay of MySQL Processing Multi-layer Relationships (1) - 个人文章

Author: Yang Taotao
Senior database expert, specializing in MySQL research for more than ten years. Good at MySQL, PostgreSQL, MongoDB and other open source databases related to backup and recovery, SQL tuning, monitoring operation and maintenance, high-availability architecture design, etc. Currently working in Aikesheng, providing MySQL-related technical support and MySQL-related course training for major operators and banking and financial enterprises.
Source of this article: original contribution
*The original content is produced by the open source community of Aikesheng, and the original content shall not be used without authorization. For reprinting, please contact the editor and indicate the source.

The MySQL database can be used to handle most online business scenarios, and it can handle it very well. It has an advantage in terms of single-node performance or overall throughput after multi-machine expansion. However, nothing is absolute, and MySQL's performance does not meet expectations in some scenarios. For example, in the processing of various complicated relationships, MySQL is a bit difficult to handle. In such scenarios, NoSQL is more suitable than relational databases. In this article, we use the graph database Neo4J (a type of NoSQL) to illustrate how to make up for the shortcomings of MySQL in this scenario.

Here is an example of a simple character relationship.

For simple character relationships, for example, the relationship chain between Xiao Song, Xiao Li, Xiao Yang, Xiao Ai, Xiao Xu, Xiao Na, and Xiao Qiao may be as follows:

Xiao Song "knows" Xiao Li.
Xiao Li "knows" Xiao Yang.
Xiao Yang "knows" Xiao Ai.
Xiao Ai "knows" Xiao Xu.
Xiao Xu "knows" Xiaona.
Cortana "knows" Joe.

"Knowledge" refers to the relationship between several people. There are many kinds of such relationships, such as "knowing", "meeting", "friends", "colleagues", "secret love", "lover" and so on. In this article, we first look at the basic relationship: "knowledge". That is to say, the relationship is only "knowing" and nothing else.

assumes the following three requirements:

Find the total number of users.
Find people who "know" Xiao Yang and who Xiao Yang "knows".
Find out who Xiao Yang "knows" "knows" "knows" "knows".

For these requirements, we first design two tables based on MySQL: (If only the last two requirements are implemented, only table 2 is needed.)

Table 1: User table, which stores user records; Table 2: User relationship table, which stores the relationship between users. For simplicity, the two tables have no primary key and are allocated internally by MySQL.

mysql:ytt>create table user1 (name varchar(64));
Query OK, 0 rows affected (0.09 sec)

mysql:ytt>insert user1 values ("小宋"),("小李"),("小杨"),("小爱"),("小徐"),("小娜"),("小乔");
Query OK, 7 rows affected (0.01 sec)
Records: 7  Duplicates: 0  Warnings: 0


mysql:ytt>create table relation1 (user_name varchar(64),friend_name varchar(64));
Query OK, 0 rows affected (0.07 sec)

mysql:ytt>insert relation1 values ("小宋","小李"),("小李","小杨"),("小杨","小爱"),("小爱","小徐"),("小徐","小娜"),("小娜","小乔");
Query OK, 6 rows affected (0.01 sec)
Records: 6  Duplicates: 0  Warnings: 0

Then let's implement the above three requirements:

Find the total number of users: Very simple, just ask count(*) directly.

   mysql:ytt>select count(*) as total_number from user1;
   +--------------+
   | total_number |
   +--------------+
   |            7 |
   +--------------+
   1 row in set (0.00 sec)

Find people who "know" Xiao Yang and who Xiao Yang "knows": Due to the small number of records, directly filter the entire table.

   mysql:ytt>select * from (select (case when friend_name = "小杨" then user_name when user_name = "小杨" then friend_name end) as user_name from relation1) as sub_table where user_name is not null;
   +-----------+
   | user_name |
   +-----------+
   | 小李      |
   | 小爱      |
   +-----------+
   2 rows in set (0.00 sec)

Find out who Xiao Yang "knows" "knows" "knows" "knows": that is, finds the end user name of the four-layer relationship network starting from Xiao Yang.

   mysql:ytt>select d.friend_name from relation1 b 
       -> inner join relation1 a on  b.user_name = "小杨" and b.friend_name = a.user_name
       -> inner join relation1 c on a.friend_name = c.user_name
       -> inner join relation1 d on c.friend_name = d.user_name;
   +-------------+
   | friend_name |
   +-------------+
   | 小乔        |
   +-------------+
   1 row in set (0.00 sec)

The above three requirements, especially the third. Find out the relationship network starting from Xiao Yang. The more layers there are, the more tables need to be joined. In MySQL, the higher the number of table associations, the worse the performance. I will continue to explore this issue in the "SQL Optimization" topic later.

We use the graph database Neo4J to solve the same need.

Create user nodes and relationships between users,

neo4j@ytt> create (x1:user {name:"小宋"}),
(x2:user {name:"小李"}),
(x3:user {name:"小杨"}),
(x4:user {name:"小爱"}),
(x5:user {name:"小徐"}),
(x6:user {name:"小娜"}),
(x7:user {name:"小乔"})
with x1,x2,x3,x4,x5,x6,x7
create (x1)-[:认识]->(x2),
(x2)-[:认识]->(x3),
(x3)-[:认识]->(x4),
(x4)-[:认识]->(x5),
(x5)-[:认识]->(x6),
(x6)-[:认识]->(x7);
0 rows
ready to start consuming query after 269 ms, results consumed after another 0 ms
Added 7 nodes, Created 6 relationships, Set 7 properties, Added 7 labels

The corresponding relationship diagram is shown as follows:

above 161f0c03fd7f20 are implemented in Neo4J as follows:

Requirement one:

   neo4j@ytt> match(x:user) return count(*) as total_number;
   +--------------+
   | total_number |
   +--------------+
   | 7            |
   +--------------+
   
   1 row
   ready to start consuming query after 21 ms, results consumed after another 1 ms

Requirement two:

   neo4j@ytt> match (y1:user)-[:认识]->(x:user {name:"小杨"})-[:认识]->(y2:user) return y1.name,y2.name;
   +-------------------+
   | y1.name | y2.name |
   +-------------------+
   | "小李"    | "小爱"    |
   +-------------------+
   
   1 row
   ready to start consuming query after 95 ms, results consumed after another 2 ms

Requirement three:

   neo4j@ytt> match(x:user {name:"小杨"})-[:认识*4]->(y) return y.name;
   +--------+
   | y.name |
   +--------+
   | "小乔"   |
   +--------+
   
   1 row
   ready to start consuming query after 398 ms, results consumed after another 174 ms

From the query results of the three requirements alone, the performance of MySQL and Neo4J are similar, and the difference is only in the way of writing. However, if the amount of data is enlarged, especially the processing of demand three, MySQL will be a bit difficult.

Let's enlarge the data several times, increase the number of records in the user table to 1,000, and increase the number of records in the relationship table to 100,000.

Create point data for user table and relation table respectively: (user1.csv contains 1100 users, relation1.csv contains 1W records, each user "knows" about 100 people, and adds an index to the relation table.)

mysql:ytt>truncate table user1;
Query OK, 0 rows affected (0.19 sec)

mysql:ytt>load data infile '/var/lib/mysql-files/user1.csv' into table user1 fields terminated by ',' enclosed by '"';
Query OK, 1100 rows affected (0.10 sec)
Records: 1100  Deleted: 0  Skipped: 0  Warnings: 0


mysql:ytt>truncate table relation1;
Query OK, 0 rows affected (0.11 sec)

mysql:ytt>load data infile '/var/lib/mysql-files/relation1.csv' into table relation1 fields terminated by ',' enclosed by '"';
Query OK, 100000 rows affected (1.60 sec)
Records: 100000  Deleted: 0  Skipped: 0  Warnings: 0

mysql:ytt>alter table relation1 add key idx_user_name (user_name), add key idx_friend_name(friend_name);
Query OK, 0 rows affected (4.09 sec)
Records: 0  Duplicates: 0  Warnings: 0

The data in the table is as follows:

mysql:ytt>table user1 limit 2;
+-------+
| name  |
+-------+
| user1 |
| user2 |
+-------+
2 rows in set (0.00 sec)

mysql:ytt>table relation1 limit 2;
+-----------+-------------+
| user_name | friend_name |
+-----------+-------------+
| user1     | user101     |
| user2     | user101     |
+-----------+-------------+
2 rows in set (0.00 sec)

Next, implement the third requirement:

Here, the user "Xiao Yang" is replaced with "user1". Since there are many results, we only need to ask for the total number. It takes more than 4 minutes, which is unacceptable for the user side.

mysql:ytt>select count(distinct d.friend_name) as cnt from relation1 b 
    -> inner join relation1 a on  b.user_name = "user1" and b.friend_name = a.user_name
    -> inner join relation1 c on a.friend_name = c.user_name
    -> inner join relation1 d on c.friend_name = d.user_name;


+-----+
| cnt |
+-----+
| 100 |
+-----+
1 row in set (4 min 15.47 sec)

Next, import MySQL data into Neo4J to continue to achieve the same requirements:

Import node:

neo4j@ytt> load csv from  "file:///user1.csv" as x create (a:user {name:x[0]});
0 rows
ready to start consuming query after 523 ms, results consumed after another 0 ms
Added 1100 nodes, Set 1100 properties, Added 1100 labels

Import relationship:

neo4j@ytt> load csv from "file:///relation1.csv" as x with x match (a:user {name:x[0]}), (b:user {name:x[1]}) merge (a)-[:认识]-(b);

Created 100000 relationships, completed after 94636 ms.

Add an index to a node:

neo4j@ytt> create index on :user(name);
0 rows
ready to start consuming query after 31 ms, results consumed after another 0 ms
Added 1 indexes

Neo4J implements the third requirement: the execution time is less than 1 second, which is many times faster than MySQL.

neo4j@ytt> match(x:user {name:"user1"})-[*4]->(y) 
return count(distinct y.name) as total_friend;
+--------------+
| total_friend |
+--------------+
| 100          |
+--------------+

1 row
ready to start consuming query after 44 ms,
results consumed after another 692 ms

This article makes a brief introduction based on the fact that graph databases are superior to relational databases in handling character relationships. For more relationship processing, please continue to read the subsequent chapters.

Technology Sharing | Using a Graph Database to Reduce the Delay of MySQL Processing Multi-layer Relationships (1)

assumes the following three requirements:

Find the total number of users.

Find people who "know" Xiao Yang and who Xiao Yang "knows".

Find out who Xiao Yang "knows" "knows" "knows" "knows".

For these requirements, we first design two tables based on MySQL: (If only the last two requirements are implemented, only table 2 is needed.)

We use the graph database Neo4J to solve the same need.

爱可生开源社区

引用和评论

如何巧妙解决 Too many connections 报错？

图查询语言GQL(Graph Query Language)语法概览