Only the sky is the limit: we love the process of exploration and immerse ourselves in it | Interview with the TiMatch Team

The Hackathon itself brought us a whole new exploration. If it were just a group of people with a clear goal spending a few days writing code, it would actually be quite boring. Exploring showed us that the more we explore, the more we find newer, better, and more elegant solutions. We love the process of exploration and are immersed in it... ——TiMatch Team

One of the team members, Bai Jiachen, is a programmer who loves drawing and motorcycles and likes to read "Utopia".

At TiDB Hackathon 2020, the TiGraph project implemented a new Key-Value encoding in TiDB to introduce a graph mode, targeting graph data analysis scenarios that traditional relational databases struggle to cover. It greatly improved TiDB's computing performance on four-degree networks and won the second prize.

At the just-concluded TiDB Hackathon 2021, TiMatch evolved and upgraded last year's project: it built a distributed graph database with complete syntax on top of TiDB and TiKV, and explored a path toward a mature, easy-to-use graph database. The project was unanimously recognized by the on-site judges and won the second prize of the competition.

TiMatch: a distributed graph database with complete syntax. TiGraph amazed everyone last year, and this year TiMatch is even more exciting. Ease of use is better this time, and an existing cluster can be upgraded and use it directly, because TiMatch only builds a set of graph indexes internally and then updates them together with the data of the original relational tables through TiDB's distributed transaction mechanism. The syntax is based on Oracle's graph syntax, so it is relatively complete, but I think the challenge lies in performance. I hope the follow-up work can show us the relevant data.

——Judge Tang Liu

Very impressive. With the scalability of TiKV plus the analytical capability of TiFlash, there is a lot of room for imagination, and I hope it reaches GA as soon as possible!

——Judge Feng Guangpu

Why did you choose graph databases as your direction?

Graph technology has become a foundation of modern data and analytics capabilities, enabling the discovery of relationships between people, places, things, and events across disparate data assets. With graph technology, organizations can quickly answer complex business questions that require contextual awareness and an understanding of the connections and their strengths across multiple entities.
Graph databases have always been a hot topic in the industry. A graph database uses graph structures for semantic queries and represents and stores data as vertices, edges, and properties. It supports simple, fast retrieval of complex hierarchical structures that are difficult to model in relational database systems, and it elegantly addresses the frequent performance problems or outright failures that traditional relational databases run into when querying complex relationships or joining many tables. Graph applications are all around us: Google's PageRank algorithm; LinkedIn uses graphs to manage social relationships and recommend friends; Amazon uses graphs for real-time product recommendation; banks use graphs for risk control, anti-fraud, and anti-money laundering. According to DB-Engines statistics, the popularity of graph databases has been rising rapidly worldwide since 2014, outpacing other types of databases. Gartner predicts that by 2025 graph technologies will be used in 80% of data and analytics innovations, up from 10% in 2021, facilitating rapid decision-making across the enterprise.

From last year's TiGraph to TiMatch, what improvements has the project made?

Last year's TiGraph project left some regrets, such as an incomplete graph query language and no integration with the TiKV storage engine. TiMatch's task in this Hackathon was to design a prototype of a distributed graph database with complete syntax based on TiDB and TiKV, and to keep exploring the path toward a mature, easy-to-use graph database on top of TiDB. TiMatch achieved three major improvements.

Improved syntax completeness: the query language adopts the Oracle PGQL syntax

Last year, graph traversal was performed by introducing a TRAVERSE clause after the WHERE clause. The main purpose was to verify whether graph queries could be integrated effectively with SQL, whether existing components could be reused seamlessly, whether the syntax was simple enough, and whether graph queries and subqueries could be nested within each other cleanly. That verification goal was achieved, but because the data source had to rely on the underlying SELECTION operator, multiple edges could not be matched during graph computation, and it was difficult to express queries for other graph algorithm scenarios such as subgraph matching and shortest path. In essence, this was a problem with the completeness of the syntax.
Fortunately, in 2021 Oracle's PGQL was also exploring how to combine a graph query language with SQL in a compatible way, and released the 1.0 specification. We follow the PGQL syntax directly and implemented a complete PGQL parser in TiDB to improve query completeness. The syntax supports graph algorithms such as graph traversal, subgraph matching, TopK, and shortest path. As for the algorithm implementations, in addition to graph traversal, shortest path queries were also implemented this year.
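
To give a rough sense of what a shortest path query has to compute, here is a minimal Python sketch using breadth-first search over an in-memory adjacency map. It is only an illustration under simplifying assumptions (unweighted, directed edges, a single machine); it is not TiMatch's executor, which the interview does not describe, and the function name `shortest_path` and the sample `follows` graph are hypothetical.

```python
from collections import deque


def shortest_path(adj, src, dst):
    """Return one path with the fewest edges from src to dst, or None.

    adj maps a vertex to the list of vertices its outgoing edges point to.
    This is a plain breadth-first search, not TiMatch's implementation.
    """
    if src == dst:
        return [src]
    parent = {src: None}  # also serves as the visited set
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt in parent:
                continue
            parent[nxt] = node
            if nxt == dst:
                # Walk the parent links back to src, then reverse.
                path = [dst]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None


# Hypothetical sample data: who follows whom.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
}
print(shortest_path(follows, "alice", "erin"))
```

In a distributed setting the frontier expansion would be pushed down to the storage layer instead of walking a local dictionary, but the search logic is the same idea.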

Reduced system complexity and learning cost

Simplification and completeness seem intuitively contradictory. This year's solution treats the read path and the write path differently. The read path improves query completeness through the PGQL graph query language, which requires introducing new grammar rules. On the write path, we explored a new approach. Last year's solution introduced many new syntaxes, such as:

CREATE EDGE/TAG
SHOW EDGES/TAGS
SHOW CREATE TAG/EDGE
DROP TAG/EDGE
ALTER TAG/EDGE
...

These syntaxes have a relatively large compatibility impact on existing applications and the ORM ecosystem, requiring many application changes and ecosystem adaptations.
This year, through continuous discussion and experimentation, we found a brand-new approach: intrude on the user-facing interface as little as possible and handle compatibility entirely inside the database, turning an external compatibility problem into a compatibility problem of the database's internal graph mode. The workload is larger, but it is closer to the globally optimal solution.

The DDL syntax for vertices has not changed at all.
The only syntax change is the addition of the SOURCE KEY and DESTINATION KEY column options. They are written the same way as FOREIGN KEY, so developers and DBAs have nothing new to learn. MySQL has nearly 20 column options, so this adds only 2 on top of roughly 20, and the scope of a column option is very small. Compared with the dozens of statement-level syntaxes added last year, this year's change is quite lightweight.

Is it possible to eliminate the new syntax completely? Yes, for example by using comments as follows:

CREATE TABLE (
  a bigint /*T! SOURCE KEY REFERENCES students */,
  b bigint /*T! DESTINATION KEY REFERENCES students */
)

Using comments completely eliminates the impact on upstream and downstream systems, but errors inside comments are hard for the parser to catch early in the parsing stage, and many people ignore comments, so we do not pursue 100% compatibility here.

Automatically build graph topology

After introducing SOURCE/DESTINATION KEY, the database has enough information to build the graph topology automatically. From the perspective of a graph database, there are three main parts of work:

Graph data storage
Maintenance of graph topology
Graph computation

Storing graph data (vertices) does not need to differ much from traditional data storage; most of the differences are in internal storage details and Key-Value design. How to maintain the graph topology alongside existing data, and how to add a graph topology to it, is a new problem. In this Hackathon we proposed a new solution that automatically builds a graph topology over existing data using an ALTER statement and the SOURCE/DESTINATION KEY information, eliminating any data migration or application changes.
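
The idea of deriving a graph topology from rows that already exist can be sketched in a few lines of Python. This is a simplified illustration, not TiMatch's code: it assumes the rows of an edge table are available as dictionaries and that two columns have been marked as the source and destination keys. The function name `build_topology`, the `friends` rows, and the column names `a` and `b` are all hypothetical, and a real system would persist the index in the storage engine and update it transactionally with the table data.

```python
from collections import defaultdict


def build_topology(rows, source_col, destination_col):
    """Build forward and reverse adjacency indexes from existing rows.

    rows: iterable of dicts, one per row of the (hypothetical) edge table.
    source_col / destination_col: the columns marked as SOURCE KEY and
    DESTINATION KEY. No row is modified or migrated; the indexes are
    derived purely from data that is already there.
    """
    out_edges = defaultdict(list)  # vertex -> vertices it points to
    in_edges = defaultdict(list)   # vertex -> vertices pointing to it
    for row in rows:
        src, dst = row[source_col], row[destination_col]
        out_edges[src].append(dst)
        in_edges[dst].append(src)
    return dict(out_edges), dict(in_edges)


# Hypothetical existing rows of a relational table with columns a and b.
friends = [{"a": 1, "b": 2}, {"a": 1, "b": 3}, {"a": 2, "b": 3}]
out_edges, in_edges = build_topology(friends, "a", "b")
print(out_edges)
print(in_edges)
```

Keeping both directions makes it cheap to follow an edge forward or backward, at the cost of writing two index entries per row.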

From the standalone unistore storage to TiKV + TiFlash, increasing the supported data set scale

Due to time constraints, Hackathon 2020 used unistore, the storage used for running unit tests. Introducing the TiKV storage engine this year greatly improved computing performance. As the demo shows, a single-source six-degree personal network query over 1 million vertices takes only 200 milliseconds, compared with 6 seconds over 100,000 records on unistore last year, so the performance improvement is very clear. In the future, TiFlash can also be introduced so that its MPP capability can be used for large-scale graph computing.
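
For readers unfamiliar with the term, a single-source N-degree network query asks for every vertex reachable from one starting vertex within N hops. The Python sketch below shows that computation level by level on an in-memory adjacency map. It is an assumption-laden illustration only: the function name `k_degree_network` and the `chain` data are hypothetical, and it says nothing about how TiMatch executes the query or about the timings quoted above.

```python
def k_degree_network(adj, source, max_degree):
    """Return all vertices reachable from source within max_degree hops.

    adj maps a vertex to the list of vertices it points to. The search
    expands one hop per loop iteration and stops early once no new
    vertices are discovered.
    """
    seen = {source}
    frontier = [source]
    for _ in range(max_degree):
        next_frontier = []
        for node in frontier:
            for nxt in adj.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    next_frontier.append(nxt)
        if not next_frontier:
            break
        frontier = next_frontier
    seen.discard(source)  # the source itself is not part of its network
    return seen


# Hypothetical chain of acquaintances: 1 -> 2 -> 3 -> 4 -> 5.
chain = {1: [2], 2: [3], 3: [4], 4: [5]}
print(k_degree_network(chain, 1, 2))
print(k_degree_network(chain, 1, 6))
```

Each extra degree can multiply the frontier size, which is why the storage engine and pushdown matter so much for this kind of query.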

Through two Hackathons, what kind of path has TiMatch explored for implementing a graph database on TiDB?

First, stay compatible with TiDB's existing features, existing data, and existing ecosystem (DM, CDC, Lightning, ORMs, etc.)
Make the graph DDL adapt automatically to existing data, so that the graph topology is built automatically
Implement a complete parser and support the TiKV storage engine
Improve graph queries: traversal, path filtering, shortest path, predicate pushdown, Coprocessor pushdown, etc.
Support transactions, indexes, nested subqueries, etc.

What is the main reason for the good result of winning the second prize at the Hackathon?

The main reason is that this project stands on the shoulders of "three levels" of giants, which is what allowed it to be presented to everyone:
Level 1: The thinking continues from TiGraph; the implementation and syntax are a step forward, and the syntax is now complete.
Level 2: It is implemented on the existing TiDB and TiKV frameworks, adding graph database capabilities and expanding the scenarios TiDB covers.
Level 3: A new syntax, extended from Oracle PGQL, is introduced for graph database queries. The original language is rather verbose and not very elegant in some places, so we modified it to be more elegant.
The second reason is the clear division of labor within the team, along with mutual trust and seamless collaboration among team members.

Long Heng, Xia Yujie, Liu Dongpo: focused on the TiDB side, including syntax analysis, the AST, the parser, and the write path
Bai Jiachen: responsible for operator pushdown on the TiKV side, and also for team promotion (producing the promotional video, the RFC PPT, etc.)

As a North American contestant, how was participating in the TiDB Hackathon for the first time?

Bai Jiachen

GitHub ID: JeepYiheihou

Graduated from Harvard with a master's degree and now works at AWS in Vancouver on core development of the ElastiCache caching database; took part in this Hackathon while on vacation. The whole Hackathon was a happy experience for me. The team was fully committed, and the competition was fierce. The format gave us enough time to think and be creative. Many projects reached a high level of completion, and many practical projects emerged. The judges asked a lot of sharp questions; clearly they not only saw the highlights of the current projects but also explored their future direction and value from a long-term perspective. I think the judges enjoyed the process very much too.

Why become a full-stack programmer?

Because of love. The team's video and the RFC PPT both came from Bai Jiachen's hands. I studied architecture for eight years before, where making beautiful PPTs is a required skill, and I taught myself video production. The team's promotional video this time draws on the theme of The Matrix. Some frames use Processing for programmatic visualization and use Python to render graphs, grids, and networks. Even before switching careers to become a programmer, I was very interested in programming, taught myself to code, and learned to use visualization tools to make videos, which happened to come in handy this time.

Whatever you did before a career change or do after it, when making a career decision, make sure you really love it. If you love it, you will be absorbed in it and put your energy into it. Everyone knows how to prepare for a switch to programming. What matters is not how you take the first step into the industry, but how well you do each step after it.

Outside of work, Bai Jiachen has a wide range of hobbies: he likes to draw and has done a series of oil pastel paintings; he likes motorcycles; he likes working out at the gym. His biggest hobbies are still writing code and reading papers: "That is why I switched careers to become a programmer. I want to try challenging things, solve problems, and enjoy the process of solving them and the feedback I get."

PingCAP
