头图

In scenarios related to maps or geographic information, the retrieval of address keywords is particularly important. For example, when opening Baidu Maps and wanting to query the information of a certain location "China Agricultural University, No. 17 Qinghua East Road, Haidian District, Beijing", we often enter the keyword "China Agricultural University" instead of the detailed address information accurate to the street. Behind the address keyword search, what is needed is a database that can support full-text search and fuzzy query to match it, so as to quickly improve the efficiency of address search.

图片

PostgreSQL is known as "the most advanced open source database available in the world". It has strong text search capabilities. It not only supports full-text search, but also supports fuzzy queries and regular queries. In addition, PostgreSQL also has built-in expression indexing and Gin indexing functions, and with the rich plug-in ecology, it has a relatively large advantage in the direction of address keyword retrieval.

This paper introduces a method for retrieving logistics address keywords based on PostgreSQL, to illustrate how to use PostgreSQL to improve the retrieval efficiency of logistics address keywords.

1. Application background

In the scenario where address retrieval is required, after the user enters the address text, the address needs to be segmented, and then matched with the address corpus database through full-text indexing technology to obtain normalized address information, and address positioning is performed on this basis.

Usually, the address query sentence will be divided into several paragraphs of keywords after the address segmentation process, and the keywords are matched to the historical address corpus database, and then the query sentence is returned to obtain the query result. Usually, it takes a few seconds to dozens of seconds from the time when the user enters a keyword query to get the returned result due to different keyword segmentation and matching methods.

Retrieving entries in a database is a very basic and common function, and there are many ways to implement them, including:

1. Implemented based on professional independent search engines such as Elasticsearch or Lucene
2. Based on the retrieval function of the database

Although systems such as Elasticsearch can achieve relatively flexible retrieval functions, the development and operation and maintenance costs will also greatly increase. How to use the built-in functions of PostgresSQL to quickly and efficiently implement most Chinese retrieval scenarios is the technical solution we want to discuss.

2. Technical solutions

GIN (Generalized Inverted Index, Generalized Inverted Index) is an index structure that stores a set of pairs (key, posting list), where key is a key value, and posting list is a set of positions where the key has appeared. For example, in ('hello', '14:2 23:4'), it means that hello has appeared in the two positions of 14:2 and 23:4. In PostgreSQL, these positions are actually the tid of the tuple. Each attribute in the table may be parsed into multiple key values when indexing, so the tid of the same tuple may appear in the posting list of multiple keys. Through this index structure, the tuple containing the specified key can be quickly found.

pg_trgm is an extension plug-in of PostgreSQL based on N-gram model word segmentation. Its basic idea is to perform a sliding window operation of size N on the content of the text according to bytes, forming a sequence of byte fragments of length N, pg_trgm is a ternary For the 3-Gram, each consecutive 3 characters is a TOKEN, and then the GIN inverted index is established on the TOKEN, and efficient and accurate fuzzy query can be performed.

pgbigm is similar to pg_trgm, and it is also an extension plug-in for PostgreSQL based on N-gram model segmentation. The difference is that pgbigm is a binary 2-gram.

Combined with the characteristics of PostgreSQL index and word segmentation model, we constructed about 100 million rows of address data in the Beijing area for performance testing, and compared and analyzed that PostgreSQL has significantly improved efficiency in logistics keyword retrieval scenarios. The test results are as follows:

图片

It can be seen from the above results that the performance of both pg_trgm+gin and pgbigm+gin is much better than that of the commonly used Btree when performing fuzzy queries. At the same time, because the TOKEN generated by pg_trgm is three characters, the corresponding TOKEN can only be matched if there are more than three characters. When it is less than 3 characters, one or two characters need to be searched before and after fuzzy search, so the retrieval performance is obviously degraded. , in comparison, pgbigm (based on binary Tri-Gram) is more efficient in processing fuzzy query of single-word and double-word characters. Since the keywords of logistics are all more than three characters, the pg_trgm+gin scheme is used for keyword search and query, so as to ensure the response time at the millisecond level.

In addition, for text address data, it often has the characteristics of natural language. Jieba stuttering word segmentation is a powerful word segmentation database. Word segmentation is more suitable for business attributes. The main functions include: supporting different modes of word segmentation, custom dictionary, keyword extraction , part-of-speech tagging. pg_jieba uses the jieba word segmentation algorithm to build a PostgreSQL Chinese word segmentation plugin, and the word segmentation effect also has a good performance.

3. Summary

In summary, PostgreSQL supports rich indexes, powerful full-text retrieval capabilities and a variety of plug-in ecosystems, and supports text queries in different scenarios. Users do not need to synchronize data to search engines and then query. Using PostgreSQL can greatly simplify The user's architecture, development costs, while ensuring absolute real-time data query.

Based on the open source PostgreSQL, JD Cloud is a powerful relational database cloud database PostgreSQL, which supports rich data types and geographic information expansion, and has powerful parallel computing capabilities. It supports a complete set of solutions such as backup, monitoring, and migration.


京东云开发者
3.4k 声望5.4k 粉丝

京东云开发者(Developer of JD Technology)是京东云旗下为AI、云计算、IoT等相关领域开发者提供技术分享交流的平台。