vivo Internet Server Team - Shuai Guangying

This article traces how Elasticsearch has iterated and optimized its numeric index implementation. From 2015 to the present, the numeric indexing scheme has gone through several generations: the implementation has evolved from the initial string simulation to a KD-tree, the technology has grown more sophisticated, the capabilities have become stronger, and the application scenarios have become richer. From modeling geographic location information to multi-dimensional coordinates, and from data retrieval to analytical insight, Elasticsearch can be found everywhere.

1. Business Background

LBS services are an important part of today's Internet, covering catering, entertainment, ride-hailing, retail, and other scenarios. All of these scenarios depend on one fundamental capability: searching for nearby POIs, for example nearby restaurants, nearby cinemas, nearby cars for hire, or nearby stores. A typical query takes a coordinate point as the center and returns the POIs within a 1 km radius, as shown in the following figure:

[Image]

Elasticsearch can respond within milliseconds for geographic information retrieval, and millisecond response is critical to user experience. With Elasticsearch, the problem above only requires a geo_distance query. The query looks like this:

 GET /my_locations/_search
{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "geo_distance": {
          "distance": "1km",
          "pin.location": {
            "lat": 40,
            "lon": 116
          }
        }
      }
    }
  }
}

Using a tool is easy; what is more interesting is the thinking behind how the tool solves the problem. Once you understand that thinking, you can step back from the tool itself and apply the ideas elsewhere. Starting from the question of how to search nearby POIs within milliseconds over massive data, this article discusses Elasticsearch's implementation and the evolution of its geographic indexing technology.

2. Background knowledge

Before introducing the Elasticsearch solution, we first need to introduce some background knowledge, mainly three questions.

  1. How to pinpoint an address?

A geographic coordinate system consisting of longitude, latitude, and relative altitude can unambiguously identify any location on Earth. Longitude ranges over [-180, 180] and latitude over [-90, 90], with the prime meridian (longitude 0) and the equator (latitude 0) as the dividing lines. For most business scenarios, the two-dimensional coordinates formed by longitude and latitude are enough to handle the problem, although a mountainous city such as Chongqing may be an exception.

  2. How to calculate the distance between two addresses?

In a plane coordinate system, the distance between two points is easily computed with the Pythagorean theorem. But the Earth is an imperfect sphere, and different locations sit at different altitudes, so computing the distance between two locations precisely is a very complicated problem. If altitude is ignored, the distance between two-dimensional coordinates is usually computed with the Haversine formula.

The formula is quite simple, requiring only arcsin and cos, two functions from high-school mathematics. Here φ and λ denote the latitude and longitude of the two points in radians, and d is the distance between them. The corresponding formula is as follows (refer to Wikipedia):
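For reference, the standard form of the formula (as it is commonly written, with r the Earth's radius and subscripts 1 and 2 denoting the two points) is:

d = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2-\varphi_1}{2}\right) + \cos\varphi_1\,\cos\varphi_2\,\sin^2\left(\frac{\lambda_2-\lambda_1}{2}\right)}\right)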

[Image: the Haversine formula]

Programmers prefer reading code, and a formula is easier to understand alongside its implementation. The corresponding code is as follows:

 // Code excerpted from lucene-core-8.2.0, the SloppyMath utility class
 
 /**
  * Returns the Haversine distance in meters between two points
  * given the previous result from {@link #haversinSortKey(double, double, double, double)}
  * @return distance in meters.
  */
 public static double haversinMeters(double sortKey) {
   return TO_METERS * 2 * asin(Math.min(1, Math.sqrt(sortKey * 0.5)));
 }
 
 /**
  * Returns a sort key for distance. This is less expensive to compute than
  * {@link #haversinMeters(double, double, double, double)}, but it always compares the same.
  * This can be converted into an actual distance with {@link #haversinMeters(double)}, which
  * effectively does the second half of the computation.
  */
 public static double haversinSortKey(double lat1, double lon1, double lat2, double lon2) {
   double x1 = lat1 * TO_RADIANS;
   double x2 = lat2 * TO_RADIANS;
   double h1 = 1 - cos(x1 - x2);
   double h2 = 1 - cos((lon1 - lon2) * TO_RADIANS);
   double h = h1 + cos(x1) * cos(x2) * h2;
   // clobber crazy precision so subsequent rounding does not create ties.
   return Double.longBitsToDouble(Double.doubleToRawLongBits(h) & 0xFFFFFFFFFFFFFFF8L);
 }
 // haversin
 // TODO: remove these for java 9, they fixed Math.toDegrees()/toRadians() to work just like this.
 public static final double TO_RADIANS = Math.PI / 180D;
 public static final double TO_DEGREES = 180D / Math.PI;
 
 // Earth's mean radius, in meters and kilometers; see http://earth-info.nga.mil/GandG/publications/tr8350.2/wgs84fin.pdf
 private static final double TO_METERS = 6_371_008.7714D; // equatorial radius
 private static final double TO_KILOMETERS = 6_371.0087714D; // equatorial radius
 
/**
  * Returns the Haversine distance in meters between two points
  * specified in decimal degrees (latitude/longitude).  This works correctly
  * even if the dateline is between the two points.
  * <p>
  * Error is at most 4E-1 (40cm) from the actual haversine distance, but is typically
  * much smaller for reasonable distances: around 1E-5 (0.01mm) for distances less than
  * 1000km.
  *
  * @param lat1 Latitude of the first point.
  * @param lon1 Longitude of the first point.
  * @param lat2 Latitude of the second point.
  * @param lon2 Longitude of the second point.
  * @return distance in meters.
  */
 public static double haversinMeters(double lat1, double lon1, double lat2, double lon2) {
   return haversinMeters(haversinSortKey(lat1, lon1, lat2, lon2));
 }
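As a quick usage sketch (not part of the Lucene excerpt above; the coordinates are approximate city centers chosen purely for illustration), the distance between two points can be computed directly with SloppyMath.haversinMeters:

import org.apache.lucene.util.SloppyMath;

public class HaversineDemo {
    public static void main(String[] args) {
        // Roughly Beijing (39.9042, 116.4074) to roughly Shanghai (31.2304, 121.4737)
        double meters = SloppyMath.haversinMeters(39.9042, 116.4074, 31.2304, 121.4737);
        System.out.printf("distance = %.1f km%n", meters / 1000); // on the order of 1,000 km
    }
}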

  3. How to easily share latitude and longitude coordinates on the Internet?

Geohash is an encoding scheme published by Gustavo Niemeyer on his personal blog on 2008-02-26. Its original purpose was to provide short URLs that identify map locations by encoding latitude and longitude, making them convenient to use in emails, forums, and websites.

In fact, Geohash's value goes beyond short URLs; its greater value lies in:

  1. Geohash gives every coordinate on the map a unique ID, effectively an identity card for each geographic location. Unique IDs have very rich application scenarios in databases.
  2. It provides another way to store coordinate points in a database, converting a two-dimensional point into a one-dimensional string. One-dimensional data can be indexed with structures such as B-trees to speed up queries.
  3. Geohash is a prefix encoding: points that are close together share a common prefix. Prefix matching therefore enables high-performance nearby-POI queries, which are the core capability of LBS services.

The encoding rules of Geohash will not be expanded here. The most important points are:

Geohash is a prefix encoding, and nearby coordinate points share a common prefix. Geohash codes of different lengths cover areas of different sizes.

With this background in place, the simplest solution for finding all coordinates within a given radius of a point emerges.

  • Brute force algorithm
Compute the distance from the center point to every coordinate point in the set in turn, and keep the points that fall within the radius.

Everyone is familiar with this approach; it is the classic Brute Force algorithm. With massive data it cannot meet millisecond-level response times, so it is mostly used for offline computation. For business demands that require millisecond responses, the algorithm can be improved with geohash.

  • Secondary screening
  1. Compute the geohash of the center point, and determine the geohash prefix length from the radius.
  2. Use the geohash prefix to pre-filter roughly matching coordinate points (the 8 geohash blocks surrounding the center point's block must also be included in this coarse filter).
  3. Apply the Haversine formula to the coarse results for a second, exact round of filtering, as sketched below.
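A minimal sketch of this two-phase filter, assuming a hypothetical GeohashUtils helper with encode (lat/lon to a geohash string of a given length) and neighbors (the 8 surrounding blocks); this is illustrative only and not the Elasticsearch implementation:

import java.util.ArrayList;
import java.util.List;

public class NearbyPoiFilter {

    record Poi(String name, double lat, double lon, String geohash) {}

    // GeohashUtils.encode / GeohashUtils.neighbors are hypothetical helpers, not a real library API.
    static List<Poi> nearby(List<Poi> pois, double centerLat, double centerLon,
                            double radiusMeters, int prefixLength) {
        String center = GeohashUtils.encode(centerLat, centerLon, prefixLength);
        List<String> prefixes = new ArrayList<>(GeohashUtils.neighbors(center)); // 8 surrounding blocks
        prefixes.add(center);                                                    // plus the center block

        List<Poi> result = new ArrayList<>();
        for (Poi poi : pois) {
            // Coarse filter: keep only points whose geohash starts with one of the candidate prefixes
            if (prefixes.stream().noneMatch(poi.geohash()::startsWith)) continue;
            // Fine filter: exact Haversine distance check
            if (haversineMeters(centerLat, centerLon, poi.lat(), poi.lon()) <= radiusMeters) {
                result.add(poi);
            }
        }
        return result;
    }

    static double haversineMeters(double lat1, double lon1, double lat2, double lon2) {
        double r = 6_371_008.7714; // mean Earth radius in meters
        double dLat = Math.toRadians(lat2 - lat1), dLon = Math.toRadians(lon2 - lon1);
        double h = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * r * Math.asin(Math.sqrt(h));
    }
}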

Beyond these solutions, what other ideas does Elasticsearch bring to geographic information processing?

3. Solution evolution

Elasticsearch has supported the geo_distance query since version 2.0, and had reached version 7.14 at the time of writing.

In the six years from 2015 to now, it has built up the following capabilities:

[Image]

Technical iteration can be roughly divided into 3 stages:

[Image]

The development has achieved remarkable results, as a glimpse at the performance test results shows:

[Image]

[Image]

[Image]

In general, both search and write efficiency have improved greatly while resource consumption has fallen. The following sections detail Elasticsearch's approach to geographic information indexing.

3.1 Prehistoric times

Elasticsearch is a search engine built on Lucene. Lucene started out as a full-text search toolbox that supported string search without considering numeric types. Its core idea is simple: after a document is tokenized, a mapping of term => array[docIds] is built for each term.

With this, a user only needs three steps to turn keywords into results:

Step 1: Find the corresponding posting list by keyword. This step is essentially a dictionary lookup. For example, TermQuery.TermWeight fetches the term's posting list and reads the docId + freq information.

Step 2: Score the documents using the docIds and term-frequency information from the posting list, and return the top-N results with the highest scores to the user. For example, TopScoreDocCollector's collect() method keeps the top-N highest-scoring documents in a min-heap.

Step 3: Query the forward table based on docId to obtain the document field details.

These three steps look simple, but they are a perfect battlefield for data structures: disk, memory, IO, and time complexity all have to be balanced, which is very challenging.

For example, dictionary lookup can be implemented with many data structures, such as skip lists, balanced trees, or a HashMap. Lucene's core engineer Mike McCandless implemented an FST that reputedly only he can understand, a combination of finite automata and prefix tree that balances query complexity against storage space: it is slower than a HashMap but consumes far less space. Document scoring typically uses a min-heap to keep the N highest-scoring results; whenever a new document scores higher than the heap top, it simply replaces the top element.
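A minimal sketch of that top-N min-heap idea (illustrative only, not the TopScoreDocCollector code):

import java.util.PriorityQueue;

public class TopNDemo {
    public static void main(String[] args) {
        int n = 3;
        float[] scores = {1.2f, 7.5f, 0.3f, 4.4f, 9.1f, 2.8f};

        // Natural ordering keeps the smallest retained score at the heap top
        PriorityQueue<Float> topN = new PriorityQueue<>();
        for (float score : scores) {
            if (topN.size() < n) {
                topN.offer(score);
            } else if (score > topN.peek()) {
                topN.poll();          // evict the current smallest of the top-N
                topN.offer(score);
            }
        }
        System.out.println(topN);     // the 3 highest scores, smallest first
    }
}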

Problem: real business scenarios need more than string-match queries. Strings and numbers are the two most widely used data types, so what happens when an interval (range) query is needed? That is a very basic capability for any database product.

Lucene provided an adaptation: RangeQuery, which simulates a numeric query by enumeration. Simply put, RangeQuery = BooleanQuery + TermQuery, so the range is limited to enumerable terms and can cover at most 1024 distinct values (BooleanQuery's default maximum clause count). This implementation was of very limited use; it was not until Lucene 2.9.0 that numeric queries were properly supported.

LUCENE-1470,LUCENE-1582,LUCENE-1602,LUCENE-1673,LUCENE-1701,LUCENE-1712

Added NumericRangeQuery and NumericRangeFilter, a fast alternative to RangeQuery/RangeFilter for numeric searches. They depend on a specific structure of terms in the index that can be created by indexing using the new NumericField or NumericTokenStream classes. NumericField can only be used for indexing and optionally stores the values as string representation in the doc store. Documents returned from IndexReader/IndexSearcher will return only the String value using the standard Fieldable interface. NumericFields can be sorted on and loaded into the FieldCache. (Uwe Schindler, Yonik Seeley, Mike McCandless)

This implementation is very powerful: it supports int/long/float/double/short/byte and does not limit the query range. Its core idea is to encode numeric values as ordered byte terms and then use prefixes to manage intervals hierarchically.

As shown below:

[Image]

In essence it is still RangeQuery = BooleanQuery + TermQuery, but with a conversion layer in front: a prefix (trie) node manages an entire interval, which reduces the number of terms to match, and that reduction is very effective. This introduces an expert parameter, precisionStep, which controls how many terms are generated for each numeric field value at indexing time. The more terms generated, the finer the interval granularity, the larger the disk footprint, and the more efficient the query.
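A minimal sketch using the old Lucene 5.x API (IntField plus NumericRangeQuery, both removed in later versions); the field name and values are made up for illustration:

import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class NumericRangeDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new WhitespaceAnalyzer()));
        for (int price = 0; price < 1000; price++) {
            Document doc = new Document();
            // IntField indexes the value as trie-encoded terms using the default precisionStep
            doc.add(new IntField("price", price, Field.Store.NO));
            writer.addDocument(doc);
        }
        writer.close();

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        // The range is decomposed internally into as few prefix terms as possible
        Query query = NumericRangeQuery.newIntRange("price", 0, 511, true, true);
        System.out.println(searcher.count(query)); // expect 512 matches
    }
}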

For example, with precisionStep=8, a node one level above the leaves of the prefix tree manages 256 leaf values. A query over the range 1 to 511 then crosses two adjacent non-leaf nodes and degenerates into enumerating roughly 256 terms (all the individual leaves of 1-255 plus one prefix term for 256-511), whereas a query over 0 to 511 needs only 2 prefix terms. Query cost therefore swings wildly depending on how the range aligns with prefix boundaries, which is why this implementation feels like a roller coaster.

To sum up, the Lucene inverted index at the core of Elasticsearch rests on a classic invariant: whether for strings or for numbers, the core of the index is a posting list. Understanding this is critical for understanding how geographic data is stored and queried later. Next, we use the implementation of geo_distance as the main thread and explore how each ES version implements it.

3.2 Elasticsearch version 2.0

The implementation of the geo_distance query in this version is very simple and is based on NumericRangeQuery. Its geo_point field is actually a composite field, or a struct: under the hood, two independent numeric field indexes are used to avoid brute-force scanning. In other words, Elasticsearch's geo_point field is implemented as separate lat and lon fields plus an encoded geohash, which together provide retrieval and aggregation.

The field definitions are as follows:

 public static final class GeoPointFieldType extends MappedFieldType {

    private MappedFieldType geohashFieldType;
    private int geohashPrecision;
    private boolean geohashPrefixEnabled;

    private MappedFieldType latFieldType;
    private MappedFieldType lonFieldType;

    public GeoPointFieldType() {}
}
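For context, a geo_point mapping from the 2.x era could explicitly enable these auxiliary sub-fields; the index and type names below are made up, and the lat_lon and geohash options were removed in later versions:

PUT /my_locations
{
  "mappings": {
    "location": {
      "properties": {
        "pin": {
          "properties": {
            "location": {
              "type": "geo_point",
              "lat_lon": true,
              "geohash": true
            }
          }
        }
      }
    }
  }
}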

The execution of the algorithm is divided into three phases:

Step 1: From the center point and radius, compute a rectangle that roughly bounds the result area; then use the rectangle's minimum and maximum longitude to build one numeric range query, and its minimum and maximum latitude to build another.

The core code is as follows:

 // Compute the rectangular area from the center coordinate plus the distance
// GeoDistance class
public static DistanceBoundingCheck distanceBoundingCheck(double sourceLatitude, double sourceLongitude, double distance, DistanceUnit unit) {
     // angular distance in radians on a great circle
     // assume worst-case: use the minor axis
     double radDist = unit.toMeters(distance) / GeoUtils.EARTH_SEMI_MINOR_AXIS;
 
     double radLat = Math.toRadians(sourceLatitude);
     double radLon = Math.toRadians(sourceLongitude);
 
     double minLat = radLat - radDist;
     double maxLat = radLat + radDist;
 
     double minLon, maxLon;
     if (minLat > MIN_LAT && maxLat < MAX_LAT) {
         double deltaLon = Math.asin(Math.sin(radDist) / Math.cos(radLat));
         minLon = radLon - deltaLon;
         if (minLon < MIN_LON) minLon += 2d * Math.PI;
         maxLon = radLon + deltaLon;
         if (maxLon > MAX_LON) maxLon -= 2d * Math.PI;
     } else {
         // a pole is within the distance
         minLat = Math.max(minLat, MIN_LAT);
         maxLat = Math.min(maxLat, MAX_LAT);
         minLon = MIN_LON;
         maxLon = MAX_LON;
     }
 
     GeoPoint topLeft = new GeoPoint(Math.toDegrees(maxLat), Math.toDegrees(minLon));
     GeoPoint bottomRight = new GeoPoint(Math.toDegrees(minLat), Math.toDegrees(maxLon));
     if (minLon > maxLon) {
         return new Meridian180DistanceBoundingCheck(topLeft, bottomRight);
     }
     return new SimpleDistanceBoundingCheck(topLeft, bottomRight);
 }

Step 2: Combine the two range queries with a BooleanQuery that takes their intersection, which pre-filters the docId set down to the rectangle described by the latitude and longitude bounds.

The core code is as follows:

 public class IndexedGeoBoundingBoxQuery {

public static Query create(GeoPoint topLeft, GeoPoint bottomRight, GeoPointFieldMapper.GeoPointFieldType fieldType) {
    if (!fieldType.isLatLonEnabled()) {
        throw new IllegalArgumentException("lat/lon is not enabled (indexed) for field [" + fieldType.names().fullName() + "], can't use indexed filter on it");
    }
    //checks to see if bounding box crosses 180 degrees
    if (topLeft.lon() > bottomRight.lon()) {
        return westGeoBoundingBoxFilter(topLeft, bottomRight, fieldType);
    } else {
        return eastGeoBoundingBoxFilter(topLeft, bottomRight, fieldType);
    }
}

private static Query westGeoBoundingBoxFilter(GeoPoint topLeft, GeoPoint bottomRight, GeoPointFieldMapper.GeoPointFieldType fieldType) {
    BooleanQuery.Builder filter = new BooleanQuery.Builder();
    filter.setMinimumNumberShouldMatch(1);
    filter.add(fieldType.lonFieldType().rangeQuery(null, bottomRight.lon(), true, true), Occur.SHOULD);
    filter.add(fieldType.lonFieldType().rangeQuery(topLeft.lon(), null, true, true), Occur.SHOULD);
    filter.add(fieldType.latFieldType().rangeQuery(bottomRight.lat(), topLeft.lat(), true, true), Occur.MUST);
    return new ConstantScoreQuery(filter.build());
}

private static Query eastGeoBoundingBoxFilter(GeoPoint topLeft, GeoPoint bottomRight, GeoPointFieldMapper.GeoPointFieldType fieldType) {
    BooleanQuery.Builder filter = new BooleanQuery.Builder();
    filter.add(fieldType.lonFieldType().rangeQuery(topLeft.lon(), bottomRight.lon(), true, true), Occur.MUST);
    filter.add(fieldType.latFieldType().rangeQuery(bottomRight.lat(), topLeft.lat(), true, true), Occur.MUST);
    return new ConstantScoreQuery(filter.build());
}
}

Step 3: Using the FieldData cache (forward information), look up the latitude and longitude of each candidate point by docId, compute the distance to the center point with the Haversine formula described earlier, and keep only the documents that truly fall within the radius.

The core code is as follows:

 // Implementation in the GeoDistanceRangeQuery class
 @Override
 public Weight createWeight(IndexSearcher searcher, boolean needsScores) throws IOException {
     final Weight boundingBoxWeight;
     if (boundingBoxFilter != null) {
         boundingBoxWeight = searcher.createNormalizedWeight(boundingBoxFilter, false);
     } else {
         boundingBoxWeight = null;
     }
     return new ConstantScoreWeight(this) {
         @Override
         public Scorer scorer(LeafReaderContext context) throws IOException {
             final DocIdSetIterator approximation;
             if (boundingBoxWeight != null) {
                 approximation = boundingBoxWeight.scorer(context);
             } else {
                 approximation = DocIdSetIterator.all(context.reader().maxDoc());
             }
             if (approximation == null) {
                 // if the approximation does not match anything, we're done
                 return null;
             }
             final MultiGeoPointValues values = indexFieldData.load(context).getGeoPointValues();
             final TwoPhaseIterator twoPhaseIterator = new TwoPhaseIterator(approximation) {
                 @Override
                 public boolean matches() throws IOException {
                     final int doc = approximation.docID();
                     values.setDocument(doc);
                     final int length = values.count();
                     for (int i = 0; i < length; i++) {
                         GeoPoint point = values.valueAt(i);
                         if (distanceBoundingCheck.isWithin(point.lat(), point.lon())) {
                             double d = fixedSourceDistance.calculate(point.lat(), point.lon());
                             if (d >= inclusiveLowerPoint && d <= inclusiveUpperPoint) {
                                 return true;
                             }
                         }
                     }
                     return false;
                 }
             };
             return new ConstantScoreScorer(this, score(), twoPhaseIterator);
         }
     };
 }

This is a very simple, intuitive idea that delivers the ability to search POIs within a specified radius of a center point.

To briefly summarize the main points:

  1. Use the coordinates of the center point and the radius to determine the boundaries of the rectangular area.
  2. Use the Bool query to synthesize two NumericRangeQuery queries to realize the initial screening of the rectangular area.
  3. Use the Haversine formula to calculate the distance between the center point and each coordinate point in the rectangular area, and perform the second-stage filtering operation to filter out the final set of docIds that meet the conditions.

Although the scheme is simple, it does deliver geo_distance. It is perfectly usable, so what is wrong with it?

3.3 Elasticsearch version 2.2

The ES 2.0 implementation has a problem: it does not handle the filtering of a two-dimensional combined condition well. It first obtains the documents satisfying the latitude range and the documents satisfying the longitude range and then intersects them, so the coarse filter selects far too many irrelevant documents.

Its processing idea is represented by a diagram as follows:

[Image]

In other words, a great many records are selected, yet in the end only the red area where the latitude and longitude ranges intersect is the real coarse-filter range.

To address this, ES 2.2 introduced a new feature: Quadtree-based geographic queries (implemented in Lucene 5.3).

The Quadtree is not a complex or advanced data structure; compared with a binary tree, each node simply has two more children.

As a basic data structure, Quadtree has a wide range of application scenarios, and it can be seen in image processing, spatial indexing, collision detection, life game simulation, fractal image analysis and other fields.

In Elasticsearch's geospatial indexing problem, the Quadtree is used to represent regions and can be regarded as a kind of prefix tree.

  • Region quadtree
The region quadtree represents a partition of space in two dimensions by decomposing the region into four equal quadrants, subquadrants, and so on with each leaf node containing data corresponding to a specific subregion. Each node in the tree either has exactly four children, or has no children (a leaf node). The height of quadtrees that follow this decomposition strategy (i.e. subdividing subquadrants as long as there is interesting data in the subquadrant for which more refinement is desired) is sensitive to and dependent on the spatial distribution of interesting areas in the space being decomposed. The region quadtree is a type of trie.

In terms of region division, the Quadtree is somewhat similar to geohash. In a one-dimensional world, bisection can be iterated indefinitely; likewise, in a two-dimensional world, quartering can be iterated indefinitely. The following figure vividly shows the Quadtree's division process.

[Image]

How does ES 2.2 use Quadtree to implement geo_distance query?

Usually we pick a data structure, store data in it, and then query it. ES's use of the Quadtree here is quite clever: the Quadtree is not used at storage time; only its way of subdividing space is used at query time.

Morton encoding: Before looking at ES's approach, one more concept is needed: Morton encoding. Like geohash, it is a grid encoding that interleaves the bits of two-dimensional data into a one-dimensional value, and its usage and properties are similar to geohash. With a 64-bit Morton code, latitude/longitude can be located to centimeter-level precision, which is extremely accurate for geographic scenarios.
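A minimal sketch of the bit interleaving behind a Morton code (illustrative only; Lucene's own implementation uses bit-twiddling tricks rather than a loop):

public class MortonDemo {

    // Interleave the bits of x (even positions) and y (odd positions) into one 64-bit value,
    // so that points that are close in 2D tend to share a common prefix in 1D.
    static long interleave(int x, int y) {
        long morton = 0L;
        for (int i = 0; i < 32; i++) {
            morton |= ((long) ((x >>> i) & 1)) << (2 * i);
            morton |= ((long) ((y >>> i) & 1)) << (2 * i + 1);
        }
        return morton;
    }

    public static void main(String[] args) {
        // Two nearby cells differ only in the low bits of their Morton codes
        System.out.println(Long.toBinaryString(interleave(5, 9)));
        System.out.println(Long.toBinaryString(interleave(5, 8)));
    }
}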

Data storage: Before ES 2.2, a latitude/longitude coordinate needed three fields: lat, lon, and geohash. With the Quadtree approach, a single field suffices. The idea is as follows: map lat and lon so that their value ranges go from [-180,180]/[-90,90] to [0, 2147483520] (integers are easier to handle), and then interleave them into a one-dimensional mortonHash value. The handling of this numeric field then returns to the prefix (trie) idea and to the familiar expert parameter precisionStep. How should the prefix be understood here? For one-dimensional data, each prefix manages an interval; for two-dimensional data, each prefix manages a two-dimensional grid cell. For example, for one coordinate point divided with precisionStep=9, the visualized rectangular areas of its prefixes are as follows:

(take shift=27,36)

[Image]

(take shift=36,45)

[Image]

Data query: At query time, the center point and radius are first converted into a rectangle. This continues the approach of ES 2.0, so it should look familiar.

For example: for a point with coordinates (116.433322,39.900255) and a radius of 1km, the resulting rectangle looks like this:

 double centerLon = 116.433322;
double centerLat = 39.900255;
double radiusMeters = 1000.0;
GeoRect geoRect = GeoUtils.circleToBBox(centerLon, centerLat, radiusMeters);
System.out.println( geoRect );

The corresponding visualization graphs generated by the AutoNavi API are as follows:

[Image]

With this rectangle in hand, the subsequent approach differs from ES 2.0. The idea in ES 2.2 is to use the Quadtree to grid the entire world map. The process is as follows:

  • Quadtree processing flow

Step 1: Take latitude/longitude (0, 0) as the starting center point and divide the whole world into 4 blocks, then determine which block the rectangle generated from the query parameters falls in.

Step 2: Skip the blocks that the rectangle does not touch; for the block containing the rectangle, continue quartering it into 4 sub-blocks.

Step 3: When either of the following conditions is met, collect the related document set as part of the first, coarse round of screening.

  • Condition 1: the current subdivision level lines up with one of the prefix levels defined by precisionStep, and the quad cell lies entirely inside the rectangle.
  • Condition 2: the finest level (level = 13) has been reached, and the quad cell intersects the rectangular area.

Step 4: Use Lucene's doc_values cache to look up the latitude and longitude of each candidate docId and apply the distance formula to decide whether it is within the radius, producing the final result. (This refinement step is the same as before; a simplified sketch of the quad-cell descent follows the figure below.)

[Image]
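A toy sketch of this quad-cell descent (illustrative only, not the actual GeoPointDistanceQuery/GeoPointRadiusTermsEnum code; the Rect type and the coordinates are made up):

import java.util.ArrayList;
import java.util.List;

public class QuadtreeSketch {
    static final int MAX_LEVEL = 13;

    record Rect(double minX, double minY, double maxX, double maxY) {
        boolean disjoint(Rect o) { return o.maxX < minX || o.minX > maxX || o.maxY < minY || o.minY > maxY; }
        boolean contains(Rect o) { return o.minX >= minX && o.maxX <= maxX && o.minY >= minY && o.maxY <= maxY; }
    }

    // Collect cells that can serve as coarse-filter prefixes for the query rectangle.
    static void relate(Rect cell, Rect query, int level, List<Rect> matched) {
        if (query.disjoint(cell)) return;                  // outside the query box: skip the subtree
        if (query.contains(cell) || level == MAX_LEVEL) {  // fully inside, or finest level: collect
            matched.add(cell);
            return;
        }
        double midX = (cell.minX() + cell.maxX()) / 2, midY = (cell.minY() + cell.maxY()) / 2;
        relate(new Rect(cell.minX(), cell.minY(), midX, midY), query, level + 1, matched);
        relate(new Rect(midX, cell.minY(), cell.maxX(), midY), query, level + 1, matched);
        relate(new Rect(cell.minX(), midY, midX, cell.maxY()), query, level + 1, matched);
        relate(new Rect(midX, midY, cell.maxX(), cell.maxY()), query, level + 1, matched);
    }

    public static void main(String[] args) {
        List<Rect> matched = new ArrayList<>();
        relate(new Rect(-180, -90, 180, 90), new Rect(116.42, 39.89, 116.45, 39.91), 0, matched);
        System.out.println(matched.size() + " candidate cells");
    }
}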

In addition, ES maintains version compatibility during processing.

For example, the key point of the ES 2.2 geo_distance implementation is to check whether the index was created after version V_2_2_0: if so, Lucene's GeoPointDistanceQuery is used directly; otherwise, the ES 2.0 GeoDistanceRangeQuery is used.

 IndexGeoPointFieldData indexFieldData = parseContext.getForField(fieldType);
final Query query;
if (parseContext.indexVersionCreated().before(Version.V_2_2_0)) {
    query = new GeoDistanceRangeQuery(point, null, distance, true, false, geoDistance, geoFieldType, indexFieldData, optimizeBbox);
} else {
    distance = GeoUtils.maxRadialDistance(point, distance);
    query = new GeoPointDistanceQuery(indexFieldData.getFieldNames().indexName(), point.lon(), point.lat(), distance);
}
 
if (queryName != null) {
    parseContext.addNamedQuery(queryName, query);
}
Core code reference: GeoPointDistanceQuery, GeoPointRadiusTermsEnum

3.4 Elasticsearch version 5.0

There is no end to optimization. Inspired by the paper "Bkd-Tree: A Dynamic Scalable kd-Tree", Lucene's core engineer Michael McCandless upgraded the index modeling and querying of geolocation data to be based on the BKD tree.

This data structure is not only used for geographic queries; it is a general solution for indexing numeric data. It handles one-dimensional values from byte to BigDecimal, IPv6 addresses, and so on, and it also handles two-dimensional and even N-dimensional data retrieval.

  • LUCENE-6825

This can be used for very fast 1D range filtering for numerics, removing the 8 byte (long/double) limit we have today, so eg we could efficiently support BigInteger, BigDecimal, IPv6 addresses, etc.

It can also be used for > 1D use cases, like 2D (lat/lon) and 3D (x/y/z with geo3d) geo shape intersection searches.

...

It should give sizable performance gains (smaller index, faster searching) over what we have today, and even over what auto-prefix with efficient numeric terms would do.

In previous versions, numeric range queries were essentially term matching, with a prefix term managing an interval to reduce the number of terms a range query has to traverse. Starting with ES 5.0, numeric queries (from one dimension to N dimensions) were thoroughly reworked; the underlying structure is the standalone BKD-tree index implemented in Lucene 6.0. This implementation reduces memory overhead while improving both retrieval and indexing speed.

The general idea behind the bkd-tree is as follows. For numeric range queries, the optimization happens on two levels:

[Optimizing in-memory queries]: BST (binary search tree) > self-balancing BST > kd-tree.

[Optimizing external-memory (disk) queries]: B-tree > KDB-tree > BKD tree.

A kd-tree is essentially a multi-dimensional BST. For example:

[Image]

[Data storage]: The core idea of the BKD tree is simple: recursively split the rectangular space (southWest, northEast) enclosing the N-dimensional point set into smaller rectangles. Unlike an ordinary kd-tree, the split stops when the number of points in a cell drops below a threshold (for example 1024).

For example:

[Image]

By dividing regions this way, each region ends up holding roughly the same number of POIs.

[Data query]: When searching, it no longer starts from the whole world the way the Quadtree did; it searches within the space actually covered by the indexed point set. Take the geo_distance query as an example.

The process is as follows:

Step 1: Generate a rectangle (the shape boundary) from the center point coordinates and radius. This is the same routine operation as in previous versions.

Step 2: Run an intersect operation between this rectangle and the rectangles (cells) formed by the BKD tree's nodes. The intersect operation determines the positional relationship between the two rectangles: crossing, contained, or disjoint. The query area and a bkd-tree cell can therefore be in one of three relationships.

[Image]

For CELL_CROSSES_QUERY, if the cell is a leaf node, each POI in the cell must be checked against the query condition; otherwise, its sub-cells are examined. For CELL_OUTSIDE_QUERY, the cell is skipped entirely. For CELL_INSIDE_QUERY, every POI in the cell satisfies the query.

Core code: LatLonPoint/LatLonPointDistanceQuery
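As a usage sketch against the Lucene 6.x/7.x API (the field name and coordinates are illustrative, and RAMDirectory has since been deprecated), indexing a point and running a BKD-backed distance query looks roughly like this:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LatLonPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

public class LatLonPointDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        Document doc = new Document();
        doc.add(new LatLonPoint("pin.location", 39.900255, 116.433322)); // lat, lon indexed in the BKD tree
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        // Distance query backed by the BKD tree: radius in meters around the center point
        Query query = LatLonPoint.newDistanceQuery("pin.location", 40.0, 116.0, 100_000);
        System.out.println(searcher.count(query) + " hit(s)"); // the point is a few dozen km away, so 1 hit
    }
}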

3.5 Follow-up development

The iteration of the Geo query capability is really the upgrade and optimization of Elasticsearch's numeric query capability as a database, expanding the scenarios the product can serve and breaking the prejudice that Elasticsearch can only do full-text retrieval. Elasticsearch still has a long way to go on the road from a full-text search engine to an analytical database.

In Michael McCandless's vision, multidimensional data can currently only be indexed as points, but some scenarios need shapes to be indexed as a dimension. That requirement calls for a more general kd-tree variant, namely the R-Tree.

There is still a long way to go. Six years have passed since ES 2.0 first supported geo-spatial queries; it has come a long way, but there are still many fields and scenarios worth exploring.
