This article was first published on the Nebula Graph Community public account
"index not found"? Can't find the index? Why do I need to create a Nebula Graph index? When should Nebula Graph native indexes be used? In response to these common questions from the community, this article aims to explain the use of indexes in one place.
The index in Nebula Graph is very similar to the index in a traditional relational database, but there are some confusing differences. Newcomers to Nebula often wonder:
- What exactly does "index" mean in the Nebula Graph database?
- When should Nebula Graph indexes be used?
- Do Nebula Graph indexes affect write performance? How big is the impact?
In this article, we will answer these questions one by one, so that everyone can use Nebula Graph better.
What exactly is a Nebula Graph index?
In short, a Nebula Graph index is used, and only used, for queries that start from pure attribute conditions. It has the following characteristics:
- It is not required for attribute-condition filtering in graph walk queries;
- A query based on pure attribute conditions (note: the non-sampling case) requires an index to be created.
Queries that start from pure attribute conditions
We know that in traditional relational databases, an index is one or more copies of table data re-sorted by specific columns. It speeds up read queries that filter on those columns, at the cost of additional data writes. In short, indexes speed queries up, but a query does not have to use an index.
In the Nebula Graph database, an index is a re-sorted copy of specific attribute data of vertices and edges, and it is what makes it possible to start a query from pure attribute conditions.
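To make the "re-sorted copy" idea concrete, here is a minimal Python sketch. It is only a toy model, not Nebula Graph's storage code: the `vertices` data, the `name_index` list, and the `lookup_by_name` helper are all hypothetical names introduced for illustration.

```python
import bisect

# Hypothetical toy data: VID -> properties of vertices tagged "player".
vertices = {
    "player100": {"name": "Tim Duncan", "age": 42},
    "player101": {"name": "Tony Parker", "age": 36},
    "player102": {"name": "LaMarcus Aldridge", "age": 33},
}

# The "index": a copy of (name, VID) pairs, re-sorted by the property value.
name_index = sorted((props["name"], vid) for vid, props in vertices.items())


def lookup_by_name(name):
    """Return the VIDs whose name equals `name` by seeking into the sorted copy."""
    # ("name", "") sorts before every (name, vid) pair with the same name.
    start = bisect.bisect_left(name_index, (name, ""))
    result = []
    for key, vid in name_index[start:]:
        if key != name:
            break
        result.append(vid)
    return result
```

Without `name_index`, the only way to answer "which vertex is named Tim Duncan?" in this model would be to walk every entry of `vertices`, which is the full scan that the article says Nebula Graph deliberately avoids.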
Take the following queries as examples. These statements obtain graph data starting from specified vertex or edge attribute conditions instead of from a vertex ID:
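#### Queries that require a Nebula Graph index to exist
# query 0: a query starting from pure attribute conditions
LOOKUP ON tag1 WHERE tag1.col1 > 1 AND tag1.col2 == "foo" \
YIELD tag1.col1 as col1, tag1.col3 as col3;
# query 1: a query starting from pure attribute conditions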
MATCH (v:player { name: 'Tim Duncan' })-->(v2:player) \
RETURN v2.player.name AS Name;
The two queries above literally "fetch the vertices or edges themselves according to the specified attribute conditions". The counterexample is a query in which the vertex ID is given. Refer to the following examples:
#### Queries that do not rely on an index
# query 2: a walk query from a given vertex, VID: "player100"
GO FROM "player100" OVER follow REVERSELY \
YIELD src(edge) AS id | \
GO FROM $-.id OVER serve \
WHERE properties($^).age > 20 \
YIELD properties($^).name AS FriendOf, properties($$).name AS Team;
# query 3: a walk query from given vertices, VID: "player101" or "player102"
MATCH (v:player { name: 'Tim Duncan' })--(v2) \
WHERE id(v2) IN ["player101", "player102"] \
RETURN v2.player.name AS Name;
Let's take a closer look at query 1 and query 3. Although both statements contain the filter condition player { name: 'Tim Duncan' } on the tag, one needs to rely on an index and the other does not. The specific reasons are as follows:
query 3 does not need an index because:
- It bypasses the starting point (v:player { name: 'Tim Duncan' }), for which no VID is given, and instead expands outward from v2, whose VIDs ["player101", "player102"] are given. It then uses GetNeighbors() to get the vertex at the other end of the edge, uses GetVertices() to get the next hop v, and filters out unneeded data according to v.player.name;
query 1 is different because it has no given starting VID:
- It can only start from the attribute condition { name: 'Tim Duncan' }, first finding the matching vertices in the index data sorted by name: IndexScan() gets v;
- It then does GetNeighbors() from v to get the other end of the edge, v2, and uses GetVertices() to get the data of the next hop v2;
In fact, the key here is whether the query has a given vertex ID (VID). The execution plans of the two queries below show the difference more clearly:
Legend: The execution plan of query 1 (requires an index);
Legend: The execution plan of query 3 (no index is required);
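The difference between the two plans can also be sketched in a few lines of Python. This is a rough model under stated assumptions, not the real executor: the toy `names` and `edges` data are made up, edges are treated as undirected for simplicity, and the functions `index_scan`, `get_neighbors`, `plan_with_index`, and `plan_without_index` are hypothetical stand-ins named after the operators mentioned above.

```python
# Toy data, all hypothetical: VID -> name, plus edges treated as undirected.
names = {
    "player100": "Tim Duncan",
    "player101": "Tony Parker",
    "player102": "LaMarcus Aldridge",
}
edges = [
    ("player101", "player100"),
    ("player102", "player100"),
    ("player102", "player101"),
]

# Index data: a copy of the names, sorted, each pointing back to its VID.
name_index = sorted((name, vid) for vid, name in names.items())


def index_scan(name):
    """Find VIDs by name in the index copy (linear here for brevity)."""
    return [vid for key, vid in name_index if key == name]


def get_neighbors(vid):
    """Return the vertices at the other end of every edge touching `vid`."""
    return [b if a == vid else a for a, b in edges if vid in (a, b)]


def plan_with_index(name):
    """query 1 style: no VID is given, so the plan must start from the index."""
    result = set()
    for v in index_scan(name):
        result.update(get_neighbors(v))
    return sorted(result)


def plan_without_index(start_vids, name):
    """query 3 style: start from the given VIDs, expand, then filter on the property."""
    result = set()
    for v2 in start_vids:
        if any(names[v] == name for v in get_neighbors(v2)):
            result.add(v2)
    return sorted(result)
```

In this model both plans return the same neighbors of Tim Duncan, but only `plan_with_index` ever touches `name_index`; `plan_without_index` reads nothing except the data reachable from the VIDs it was handed.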
Why must an index be used in a query based on pure attribute conditions?
Because Nebula Graph's storage structure is designed for distribution and for relationships between data, a full-scan conditional search without an index, like the one in a table-structured database, is actually even more expensive here, so it is intentionally prohibited by design.
Note: if you do not need all of the data and only want to sample part of it, versions 3.0 and later support LIMIT <n> without requiring an index. The following query (with LIMIT) does not need an index:
MATCH (v:player { name: 'Tim Duncan' })-->(v2:player) \
RETURN v2.player.name AS Name LIMIT 3;
Why do only queries based on pure attribute conditions need an index?
Here, we compare how ordinary graph queries (graph-queries) and queries based on pure attribute conditions (pure-prop-condition queries) are implemented:
- graph-queries, such as query 2 and query 3, walk along edges to find specific path conditions;
- pure-prop-condition queries, such as query 0 and query 1, find vertices and edges that only satisfy certain property conditions (or no conditions at all);
In Nebula Graph, the raw graph data that a graph-query expands over is already sorted by VID (both vertices and edges); in other words, the data itself is already indexed. This sorting brings contiguous storage (physical adjacency), so the expanding walk itself is optimized to return results quickly.
Summary: What is an index and what is it not?
What is an index?
- A Nebula Graph index is a sorted copy of a portion of attribute data, used to find vertices and edges starting from given attribute conditions. It makes this read pattern possible at the cost of extra writes.
What is an index not?
- Nebula Graph indexes are not used to speed up general graph queries: queries that expand outward from a given vertex (even with attribute-condition filtering) do not rely on native indexes, because Nebula's data storage is itself sorted and optimized for this kind of query.
Some design details of Nebula Graph indexes
In order to better understand the limitations, costs, and capabilities of the index, let's explain more details about it.
- Nebula Graph indexes are stored locally (not separately or centrally) and are sharded together with the vertex data.
Indexes only support left matching
- Because the underlying layer is a RocksDB Prefix Scan;
Performance cost:
- Write path: not only is one more copy of the data written, but expensive read operations are also needed to ensure consistency;
- Read path: rule-based optimization selects the index, and the request fans out to all StorageD instances;
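A small Python sketch may help show why a prefix scan implies left matching. It is only an illustration under assumptions: the key layout `"<isRisky>|<age>|<VID>"` is invented for this example and is not Nebula Graph's real key encoding, and `prefix_scan` is a hypothetical helper that mimics seeking in a sorted key space.

```python
import bisect

# Hypothetical encoded index keys: "<isRisky>|<age, zero-padded>|<VID>", kept sorted.
keys = sorted([
    "0|045|u3",
    "1|017|u1",
    "1|022|u4",
    "1|030|u2",
])


def prefix_scan(prefix):
    """Seek to `prefix` in the sorted keys and read until the prefix stops matching."""
    start = bisect.bisect_left(keys, prefix)
    matched = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break
        matched.append(key)
    return matched
```

A condition on the first field, such as `prefix_scan("1|")`, lands on one contiguous run of keys. A condition only on the second field (age) has no prefix to seek to, because matching keys are scattered across the whole sorted range, which is the left-matching restriction in miniature.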
These design details are illustrated in the hand drawings and videos on my personal website (link: https://www.siwei.io/sketch-notes/ ); refer to the following picture:
Because of the left-matching design, native indexes cannot serve complex pure-attribute-condition queries, such as those involving wildcards or REGEXP. For these cases, Nebula Graph provides full-text indexing, which uses a Raft Listener to asynchronously write data to an external Elasticsearch cluster and queries ES at lookup time. For how to use the full-text index, see the documentation: https://docs.nebula-graph.com.cn/3.0.0/4.deployment-and-installation/6.deploy-text-based-index/2.deploy-es/ .
In this hand drawing, we can also see the following:
Write path
- Writing index data is a synchronous operation;
Read path
- This part shows an example of RBO (rule-based optimization). The rule in the example assumes that an equality match on col2 placed leftmost performs better than the comparison on col1, so the second index is selected;
- After the index is selected, the index scan request is fanned out to the storage nodes, and some filter conditions, as well as operations such as TopN, can be pushed down;
In conclusion:
- Because of the write cost, use an index only when it is really required. If a sampling query can meet the read requirement, LIMIT <n> can be used without creating an index.
Indexes have the left-matching restriction
- The field order of the index needs to be carefully designed to match the queries;
- Sometimes it is necessary to use a full-text index.
Use of indexes
For details, please refer to the official Nebula index documentation: https://docs.nebula-graph.io/3.0.0/3.ngql-guide/14.native-index-statements/ . Some key points are:
First, create an index on a Tag or EdgeType for the vertex or edge attributes that you want to filter on, using the CREATE INDEX statement;
Second, once the index is created, index data for newly written data is written synchronously. However, for vertex and edge data that already existed before the index was created, the index must be built explicitly. This is an asynchronous job, triggered by executing the REBUILD INDEX statement;
Third, after triggering the asynchronous REBUILD INDEX, you can check its status with the SHOW INDEX STATUS statement;
Fourth, a query that uses the index can be a LOOKUP, which is often extended with the pipe character to expand the query on top of it. Refer to the following example:
LOOKUP ON player \
WHERE player.name == "Kobe Bryant"\
YIELD id(vertex) AS VertexID, properties(vertex).name AS name |\
GO FROM $-.VertexID OVER serve \
YIELD $-.name, properties(edge).start_year, properties(edge).end_year, properties($$).name;
It can also be a MATCH, where v is obtained through the index and v2 is obtained by expanding the query in the data (non-index) part:
MATCH (v:player{name:"Tim Duncan"})--(v2:player) \
RETURN v2.player.name AS Name;
Fifth, the capabilities and limitations of composite indexes. Knowing that native indexes match from the left helps us understand what an index over more than one attribute, a composite index, can and cannot do. Here are a few conclusions:
Composite indexes created on multiple properties are order-dependent
- For example, suppose we create two two-attribute composite indexes, index_a: (isRisky: bool, age: int) and index_b: (age: int, isRisky: bool). When querying with the filter condition WHERE n.user.isRisky == true AND n.user.age > 18, index_a is clearly more efficient, because its leftmost match is on the short field.
Only filter conditions on a leftmost-matching subset of the composite index's properties can be served
- For example, given index_a: (isRisky: bool, age: int) and index_b: (age: int, isRisky: bool), the query WHERE n.user.age > 18 can only be served by index_b, whose leftmost field matches the condition.
- When a query starts from attributes to find vertices and edges but needs full-text-search style matching, the native index cannot serve it. In that case, consider the Nebula full-text index, which is out-of-the-box support for an external Elasticsearch provided by the Nebula community. Once configured, data covered by a full-text index is asynchronously written to the Elasticsearch cluster through the Raft listener. The query entry point of the full-text index is also LOOKUP; please refer to the documentation for details: https://docs.nebula-graph.com.cn/3.0.1/4.deployment-and-installation/6.deploy-text-based-index/2.deploy-es/ .
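The leftmost-match rule for composite indexes described in this section can be expressed as a short Python check. It is a simplified model, not the optimizer's actual rule set: `usable_prefix_len` and `can_serve` are hypothetical helpers, and the model only looks at which fields a filter mentions, ignoring whether each condition is an equality or a range.

```python
def usable_prefix_len(index_fields, filter_fields):
    """Count how many leading index fields are covered by the filter."""
    count = 0
    for field in index_fields:
        if field not in filter_fields:
            break
        count += 1
    return count


def can_serve(index_fields, filter_fields):
    """Leftmost match: the first field of the index must appear in the filter."""
    return usable_prefix_len(index_fields, filter_fields) > 0


# The two composite indexes from the examples above.
index_a = ("isRisky", "age")
index_b = ("age", "isRisky")
```

With a filter on `age` alone, `can_serve(index_a, {"age"})` is false while `can_serve(index_b, {"age"})` is true, matching the second conclusion above. A filter on both fields is covered by either index in this model; choosing between them is where the field order and the kind of condition start to matter.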
Review
- Nebula Graph indexes use sorted copies of attributes to scan out vertices and edges when only attribute conditions are provided;
- Nebula Graph indexes are not used for graph expansion queries;
- Nebula Graph indexes use left matching and are not meant for fuzzy full-text search;
- Nebula Graph indexes have a performance cost when writing;
- Remember to rebuild the index if the corresponding vertices and edges already had data before the Nebula Graph index was created;
Happy Graphing!