Abstract: This article briefly introduces how knowledge graphs are stored and retrieved.

This article is shared from the Huawei Cloud Community post "Knowledge Graph Storage and Retrieval"; original author: JuTzungKuei.

1 Overview

Background: With the development and popularization of the Internet, an interconnected world is taking shape. At the same time, data is showing explosive exponential growth, and we are in a new era of surging digital torrents.

How much data do we generate every day? According to statistics, every day:

  • 500 million tweets are sent;
  • 294 billion emails are sent;
  • 5 billion online searches are made worldwide;
  • A single connected car generates 4 TB of data;
  • Facebook generates 4 PB of data, including 350 million photos and 100 million hours of video.

The amount of knowledge keeps growing as well, and today's common knowledge graphs are all made up of triples:

  • DBpedia has nearly 80 million triples;
  • YAGO has more than 120 million triples;
  • Wikidata has nearly 410 million triples;
  • Freebase has more than 3 billion triples;
  • Chinese Encyclopedia has about 140 million triples.


So, how do we efficiently store and retrieve graph data at this scale?

A knowledge graph is a directed graph that describes the entities, events, or concepts that exist in the real world and the relationships between them. The nodes of the directed graph represent entities, events, or concepts, and the edges represent the relationships between adjacent nodes.

[Figure: partial knowledge graph of Andy Lau]

The figure shows part of a knowledge graph centered on Andy Lau. In the figure, red text marks concepts, rectangles mark entities, blue text marks attributes, ellipses mark attribute values, and orange text marks relationships.

  • Concepts: person, country, movie, etc.
  • Entities: Andy Lau, Zhu Liqian, China, A World Without Thieves, etc.
  • Attributes: height, weight, gender, capital, abbreviation, release time, Douban score, etc.
  • Relationships: wife, daughter, nationality, starring, etc.

2 Storage of the knowledge graph

The knowledge in a knowledge graph is represented in RDF (Resource Description Framework), whose basic building blocks are facts.

Each fact is a triple: <subject S, predicate P, object O>, where:

  • Subject S: an entity, event, or concept
  • Predicate P: a relationship or attribute
  • Object O: an entity, event, concept, or ordinary value

The following list shows knowledge from the graph expressed as triples.

<S, P, O>

<Andy Lau, birthday, September 27, 1961>

<Andy Lau, blood type, AB type>

<Andy Lau, wife, Zhu Liqian>

<Andy Lau, daughter, Liu Xianghui>

<Andy Lau, Nationality, China>

<China, Capital, Beijing>

... ...
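As a minimal sketch, the facts listed above can be held in memory as plain three-field records. The `Triple` class and the `facts` variable are illustrative names, not part of any RDF library.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact: <subject, predicate, object>."""
    subject: str
    predicate: str
    object: str


# The facts from the list above, one Triple per line.
facts = [
    Triple("Andy Lau", "birthday", "September 27, 1961"),
    Triple("Andy Lau", "blood type", "AB type"),
    Triple("Andy Lau", "wife", "Zhu Liqian"),
    Triple("Andy Lau", "daughter", "Liu Xianghui"),
    Triple("Andy Lau", "Nationality", "China"),
    Triple("China", "Capital", "Beijing"),
]

# Collect every predicate attached to the subject "Andy Lau".
andy_predicates = [t.predicate for t in facts if t.subject == "Andy Lau"]
print(andy_predicates)
```

Every storage scheme discussed below is a different way of laying out records like these on a storage medium.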

To query and manage knowledge graph data efficiently, the data must be organized sensibly on the storage medium. Common storage approaches fall into two groups: storage based on table structure and storage based on graph structure.

2.1 Storage based on table structure

Storage based on table structure uses two-dimensional data tables to hold the data in the knowledge graph. Depending on the design principle, the tables can take different forms. Currently, they fall into five categories: triple table, attribute table, horizontal table, vertical table, and full index.

2.1.1 Triple table

The facts in the knowledge graph are triples. The simplest and most direct storage method is to create a single three-column table in a relational database that holds all the facts. The schema of the table is: <subject, predicate, object>. Each triple in the knowledge graph is stored as one row of the triple table.

This method is simple and easy to understand, but storing the entire knowledge graph in one table makes that table very large. Complex queries require many self-joins on it, and inserts, deletes, and updates also become expensive.
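The following is a minimal sketch of a triple table, using Python's built-in `sqlite3` module with an in-memory database. The table name `triples` and the lowercase predicate names are assumptions made for the example. It shows why a question that spans two facts forces the table to be joined with itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("Andy Lau", "wife", "Zhu Liqian"),
        ("Andy Lau", "nationality", "China"),
        ("China", "capital", "Beijing"),
    ],
)

# "What is the capital of Andy Lau's country?" touches two triples,
# so the single table has to be joined with itself (t1 and t2).
rows = conn.execute(
    """
    SELECT t2.object
    FROM triples AS t1
    JOIN triples AS t2 ON t1.object = t2.subject
    WHERE t1.subject = 'Andy Lau'
      AND t1.predicate = 'nationality'
      AND t2.predicate = 'capital'
    """
).fetchall()
print(rows)
```

Each additional hop in a query adds another self-join over the same large table, which is where the overhead comes from.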

Representative systems: the RDF database systems 3store and Virtuoso


2.1.2 Attribute table

The attribute table, also known as the type table, builds one table per type, and instances of the same type are placed in the same table. Each column of the table represents an attribute of that type of entity, and each row stores one instance of that type.

Although this method overcomes the shortcomings of the triple table, it introduces new problems. Many data fields are repeated across tables, and instances that lack a value for some attribute leave null cells behind, both of which waste storage.
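A minimal sketch of the attribute table layout, again with the built-in `sqlite3` module. The table and column names (`person`, `country`, `blood_type`, and so on) are assumptions for illustration. The second person row has no known attribute values, which shows how null cells appear.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per type; one column per attribute of that type.
conn.execute(
    "CREATE TABLE person (name TEXT PRIMARY KEY, birthday TEXT, "
    "blood_type TEXT, nationality TEXT)"
)
conn.execute("CREATE TABLE country (name TEXT PRIMARY KEY, capital TEXT)")

conn.execute(
    "INSERT INTO person VALUES ('Andy Lau', 'September 27, 1961', 'AB type', 'China')"
)
# Only the name is known here, so every other column is stored as NULL.
conn.execute("INSERT INTO person (name) VALUES ('Zhu Liqian')")
conn.execute("INSERT INTO country VALUES ('China', 'Beijing')")

# Rows that carry at least one empty attribute cell.
sparse_rows = conn.execute(
    "SELECT name FROM person WHERE birthday IS NULL OR blood_type IS NULL"
).fetchall()
print(sparse_rows)
```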

Representative system: the RDF triple store Jena

[Figures: example attribute tables, including the country and film types]

2.1.3 Horizontal table

Each row of the horizontal table stores all the predicates and objects of one subject in the knowledge graph. In effect, the horizontal table is the adjacency list of the knowledge graph. The number of columns equals the number of distinct predicates in the knowledge graph, and the number of rows equals the number of distinct subjects.

In a real knowledge graph, the number of distinct predicates may reach the tens of thousands, which can exceed the maximum number of columns a relational database allows in one table. The table also contains a large number of null values.
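To make the sparsity concrete, here is a minimal sketch that builds the one-row-per-subject, one-column-per-predicate layout described above from a few triples and counts the empty cells. It uses plain dictionaries rather than a database, and it assumes each subject has at most one object per predicate, which real data does not guarantee.

```python
triples = [
    ("Andy Lau", "wife", "Zhu Liqian"),
    ("Andy Lau", "nationality", "China"),
    ("China", "capital", "Beijing"),
]

# One column per distinct predicate, one row per distinct subject.
predicates = sorted({p for _, p, _ in triples})
subjects = sorted({s for s, _, _ in triples})

# Start with every cell empty; None plays the role of a database NULL.
table = {s: {p: None for p in predicates} for s in subjects}
for s, p, o in triples:
    table[s][p] = o

null_cells = sum(value is None for row in table.values() for value in row.values())
total_cells = len(subjects) * len(predicates)
print(f"{null_cells} of {total_cells} cells are empty")
```

Even with three facts, half of the cells are empty; with tens of thousands of predicates the proportion of nulls grows much larger.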

Representative system: the early RDF database system DLDB

2.1.4 Vertical table

The vertical table partitions the data by the predicate of each triple. The RDF knowledge graph is split into a number of two-column tables that contain only subject and object, and the total number of tables equals the number of distinct predicates in the knowledge graph. In other words, one table is created per predicate, and it stores the subject and object values that the predicate connects.

This method replaces self-joins with joins between different tables. However, it does not support queries well when the predicate itself is a variable, because every table then has to be consulted.
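A minimal sketch of predicate-based partitioning, using a dictionary of two-column lists in place of real database tables (an assumption for brevity). It shows a join across two predicate tables, and then a query whose predicate is unknown, which has to scan every table.

```python
from collections import defaultdict

triples = [
    ("A World Without Thieves", "starring", "Andy Lau"),
    ("Andy Lau", "nationality", "China"),
    ("China", "capital", "Beijing"),
]

# One (subject, object) table per predicate.
tables = defaultdict(list)
for s, p, o in triples:
    tables[p].append((s, o))

# Join the "nationality" table with the "capital" table:
# for each person, find the capital of that person's country.
person_city = [
    (person, city)
    for person, country in tables["nationality"]
    for capital_of, city in tables["capital"]
    if capital_of == country
]

# Predicate as a variable: which predicate links Andy Lau to China?
# Every table must be checked, which is the weak spot of this layout.
linking = [p for p, rows in tables.items() if ("Andy Lau", "China") in rows]
print(person_city, linking)
```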

Representative system: SW-Store

[Figures: example vertical tables for the predicates gender, starring, and capital]

2.1.5 Full Index

Full index, also known as six-fold index, is an optimization technique designed around the characteristics of knowledge graph data and queries. It enumerates all permutations of subject, predicate, and object in a triple, six in total, and builds an index for each one. Together these indexes cover every triple pattern with variables that can appear in a knowledge graph query, which makes this a typical "space-for-time" strategy.

This method alleviates the single-table self-join problem of the triple table and speeds up graph queries, but it also increases the cost of updates and maintenance.

Representative systems: RDF-3X, Hexastore

Six tables, one per ordering: SPO, SOP, PSO, POS, OSP, OPS
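The idea can be sketched with nested Python dictionaries, one per ordering. This is only an in-memory illustration under that assumption; it does not reflect how RDF-3X or Hexastore implement their indexes. The `indexes` dictionary and its lowercase keys are hypothetical names.

```python
from collections import defaultdict
from itertools import permutations

triples = [
    ("Andy Lau", "wife", "Zhu Liqian"),
    ("Andy Lau", "nationality", "China"),
    ("China", "capital", "Beijing"),
]

# Build one two-level index per ordering: spo, sop, pso, pos, osp, ops.
indexes = {}
for order in permutations("spo"):
    index = defaultdict(lambda: defaultdict(set))
    for s, p, o in triples:
        parts = {"s": s, "p": p, "o": o}
        first, second, third = (parts[k] for k in order)
        index[first][second].add(third)
    indexes["".join(order)] = index

# Pattern <?x, nationality, China>: predicate and object are bound,
# so the "pos" index answers it with two dictionary lookups.
who = indexes["pos"]["nationality"]["China"]

# Pattern <China, capital, ?y>: subject and predicate are bound, so use "spo".
capital = indexes["spo"]["China"]["capital"]
print(who, capital)
```

Every triple is stored six times, which is the space cost paid for constant-time access to any pattern shape.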

2.2 Storage based on graph structure

Storage based on graph structure stores the data of the knowledge graph as a graph. With entities as nodes and relationships as labeled edges, knowledge graph data fits the graph model naturally, so this kind of storage reflects the internal structure of the knowledge graph directly and accurately. At present, there are two main graph storage modes: adjacency list and adjacency matrix. The corresponding database is a graph database, and its data model is the property graph (attribute graph).

2.2.1 Adjacency list

In the adjacency list approach, each node (entity) in the knowledge graph has its own list, and the information related to that entity is stored in the list. When graph structures are used to manage knowledge graph data, a key issue is how to exploit the graph structure to prune the candidate space of the index during query processing.
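A minimal sketch of an adjacency list for a labeled directed graph, assuming a dictionary that maps each node to its outgoing (predicate, neighbor) pairs. Following edges becomes a matter of list lookups rather than table joins.

```python
from collections import defaultdict

triples = [
    ("Andy Lau", "wife", "Zhu Liqian"),
    ("Andy Lau", "nationality", "China"),
    ("China", "capital", "Beijing"),
]

# node -> list of (predicate, neighbor) pairs for its outgoing edges
adjacency = defaultdict(list)
for s, p, o in triples:
    adjacency[s].append((p, o))

# Two-hop paths starting at "Andy Lau": follow one edge, then another.
two_hop = [
    (p1, middle, p2, end)
    for p1, middle in adjacency["Andy Lau"]
    for p2, end in adjacency.get(middle, [])
]
print(two_hop)
```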

2.2.2 Adjacency matrix

The adjacency matrix approach maintains multiple n × n matrices, where n is the number of nodes in the knowledge graph. Each matrix corresponds to one predicate, and each row or column corresponds to one node. If the entry in row i and column j of the matrix for predicate p is 1, the knowledge graph contains an edge labeled p from the i-th node to the j-th node.

Alternatively, a single three-dimensional matrix M of size |S| × |P| × |O| can be used, where |S|, |P|, and |O| are the numbers of subjects, predicates, and objects. If the triple <s, p, o> exists in the knowledge graph, with s, p, and o at indices i, j, and k, then M[i][j][k] = 1; otherwise the entry is 0.
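A minimal sketch of the per-predicate matrices, using nested Python lists (a numeric library would normally be used; plain lists are an assumption to keep the example dependency-free). Node indices are assigned in sorted order, which is an arbitrary choice.

```python
triples = [
    ("Andy Lau", "wife", "Zhu Liqian"),
    ("Andy Lau", "nationality", "China"),
    ("China", "capital", "Beijing"),
]

# Assign every node (subject or object) a row/column index.
nodes = sorted({x for s, _, o in triples for x in (s, o)})
index = {node: i for i, node in enumerate(nodes)}
n = len(nodes)

# One n x n matrix of zeros per predicate; set 1 where an edge exists.
matrices = {}
for s, p, o in triples:
    matrix = matrices.setdefault(p, [[0] * n for _ in range(n)])
    matrix[index[s]][index[o]] = 1

# Is there a "capital" edge from China to Beijing?
has_edge = matrices["capital"][index["China"]][index["Beijing"]] == 1

# Count how many cells are actually used across all matrices.
ones = sum(cell for m in matrices.values() for row in m for cell in row)
cells = len(matrices) * n * n
print(has_edge, ones, cells)
```

Edge lookups are a single array access, but almost every cell is zero, so practical systems need a sparse representation.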

2.2.3 Graph database

The theoretical basis of graph database is graph theory, which represents and stores data through nodes, edges and attributes. Specifically, the graph database is based on a directed graph, in which nodes, edges, and attributes are the core concepts of the graph database.

Node: Represents objects such as entities and events.

Edge: Refers to the directed lines connecting nodes in the graph, used to represent the relationship between different nodes.

Attributes: describe the characteristics of nodes or edges.
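The three concepts can be sketched as two small Python dataclasses. The class names, field names, and node identifiers (`n1`, `n2`) are hypothetical and do not correspond to the API of any particular graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    label: str                                       # type of the node, e.g. "person"
    properties: dict = field(default_factory=dict)   # attributes of the node


@dataclass
class Edge:
    source: str                                      # node_id the edge starts from
    target: str                                      # node_id the edge points to
    label: str                                       # relationship name
    properties: dict = field(default_factory=dict)   # attributes of the edge


nodes = {
    "n1": Node("n1", "person", {"name": "Andy Lau", "birthday": "September 27, 1961"}),
    "n2": Node("n2", "country", {"name": "China", "capital": "Beijing"}),
}
edges = [Edge("n1", "n2", "nationality")]

# Follow the "nationality" edge from n1 and read an attribute of the target node.
targets = [
    nodes[e.target].properties["name"]
    for e in edges
    if e.source == "n1" and e.label == "nationality"
]
print(targets)
```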

Common graph databases: Neo4j, JanusGraph, OrientDB, etc.

3 Retrieval of the knowledge graph

The knowledge in a knowledge graph is ultimately stored in a database system, and most database systems give users access to the data through a formal query language.

3.1 SQL

SQL (Structured Query Language) is used to manage relational databases.

Four operations

  • Insert: INSERT INTO table_name (column1, column2, ...) VALUES (value1, value2, ...)
  • Delete: DELETE FROM table_name WHERE condition
  • Update: UPDATE table_name SET column1 = value1 WHERE condition
  • Query: SELECT column1, column2, ... FROM table_name WHERE condition
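The four operations can be tried out with Python's built-in `sqlite3` module and an in-memory database. The `person` table and its columns are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, birthday TEXT)")

# Insert
conn.execute(
    "INSERT INTO person (name, birthday) VALUES (?, ?)", ("Andy Lau", "unknown")
)

# Update
conn.execute(
    "UPDATE person SET birthday = ? WHERE name = ?",
    ("September 27, 1961", "Andy Lau"),
)

# Query
after_update = conn.execute(
    "SELECT name, birthday FROM person WHERE name = ?", ("Andy Lau",)
).fetchall()

# Delete
conn.execute("DELETE FROM person WHERE name = ?", ("Andy Lau",))
after_delete = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]

print(after_update, after_delete)
```

The `?` placeholders let the driver bind values safely instead of building SQL strings by hand.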

3.2 SPARQL

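SPARQL (SPARQL Protocol and RDF Query Language) is the query language and data access protocol developed by the W3C for RDF data. It is widely supported by RDF stores and by a number of graph databases.

Three operations:

  • Insert: INSERT DATA { triples }
  • Delete: DELETE DATA { triples }
  • Update: no dedicated operation; combine a delete with an insert
  • Query: SELECT ?var1 ?var2 ... WHERE { graph pattern }

For example, the following query finds everyone who stars in both A World Without Thieves and Infernal Affairs, together with their birthday (the namespace is a placeholder):

prefix : <http://example.org/>
select ?x ?y
where {
    :A_World_Without_Thieves :starring ?x .
    :Infernal_Affairs :starring ?x .
    ?x :birthday ?y .
}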
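To show what evaluating such a graph pattern involves, here is a minimal pure-Python sketch that matches the three triple patterns of the query above against a small in-memory list of triples. It is not a SPARQL engine; `match` and `query` are hypothetical helper names, and "Actor B" is a made-up placeholder added only to show a candidate being filtered out.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None on conflict."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable: bind it, or check it against its earlier binding.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            # A constant that does not match this triple.
            return None
    return extended


def query(patterns, triples):
    """Evaluate the patterns one by one, carrying variable bindings forward."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match(pattern, triple, binding)) is not None
        ]
    return bindings


triples = [
    ("A World Without Thieves", "starring", "Andy Lau"),
    ("A World Without Thieves", "starring", "Actor B"),  # hypothetical placeholder
    ("Infernal Affairs", "starring", "Andy Lau"),
    ("Andy Lau", "birthday", "September 27, 1961"),
]
patterns = [
    ("A World Without Thieves", "starring", "?x"),
    ("Infernal Affairs", "starring", "?x"),
    ("?x", "birthday", "?y"),
]
results = query(patterns, triples)
print(results)
```

The shared variable `?x` acts as the join condition: a candidate survives only if it satisfies every pattern in which the variable appears.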

3.3 Gremlin

Gremlin is the graph traversal language of the Apache TinkerPop framework. It can be used to query graph data, modify graphs, perform local traversals, and filter on attributes.

Three operations

  • Add: g.addV('person').property(id, '007').property('birthday', 'June 22, 1962'); g.addE('husband').property('xxx', 'yyy').from(g.V('001')).to(g.V('002'))
  • Delete: g.V('007').drop()
  • Query: g.V().hasLabel('person'); g.E().label(); g.V().valueMap()

3.4 Cypher

Cypher is a declarative graph query language that allows expressive and efficient queries against graph storage without writing graph traversal code. It is one of the most widely used graph database query languages.

Four operations

  • Add: create (n:people {name: 'Zhou Xingchi', birthday: 'June 22, 1962'}) return n;
  • Delete: match (s:Student {id: 1}) detach delete s;
  • Update: match (n) where id(n) = 7 set n.name = 'neo' return n;
  • Query: match (n {name: "Andy Lau"}) return n; match (a:people {name: "Andy Lau"})-[b:Relation {name: "Nationality"}]->(c) return c;
