
1. Introduction to Lucene

1.1 What is Lucene?

  • Lucene is a sub-project of the Apache Jakarta project group of the Apache Software Foundation;
  • Lucene is an open-source full-text search toolkit that provides a complete query engine, an indexing engine, and partial text-analysis engines for some languages;
  • Lucene is not a complete full-text search engine; it only provides the architecture of one, but it can be combined with all kinds of plug-ins and embedded in a project as a toolkit to provide high-performance full-text search;
  • Commonly used full-text search engines such as Elasticsearch and Solr are built on top of Lucene.

1.2 Usage scenarios of Lucene

It is suitable for scenarios where only a small amount of data needs to be indexed. When the index volume becomes too large, a full-text search server such as Elasticsearch or Solr is needed to implement the search function.

1.3 What can you learn from this article?

  • How is Lucene's complex index generated and written, and what role does each file in the index play?
  • How does Lucene's full-text index perform efficient searches?
  • How does Lucene rank the search results so that users can find the content they want from their keywords?

This article shares our experience of reading the Lucene source code and developing features with it. The Lucene version used is 7.3.1.

2. Lucene Basic Workflow

The workflow is divided into two phases:

1. Creation phase:

  • In the document-adding phase, IndexWriter's addDocument method is called to generate the forward index files;
  • After the documents have been added, the inverted index files are generated by flush or merge operations.

2. Search phase:

  • The user sends a query request to Lucene in the form of a query statement;
  • IndexSearcher reads the contents of the index library through IndexReader to obtain the document indexes;
  • After the search results are obtained, they are scored and sorted by the ranking algorithm and returned.

The index creation and search process is shown in the following figure:

3. Lucene Index Composition

3.1 Forward Index

Lucene's basic hierarchy consists of five parts: index, segment, document, field, and term. Generating the forward index is the process of processing documents, breaking them down into fields, and storing terms according to this hierarchy (a minimal code sketch follows the list below).

The hierarchical relationship of index files is shown in Figure 1:

  • index: a Lucene index library contains all of the searchable text; it can be stored as files or file streams in different directories or storage backends.
  • segment: an index contains multiple segments, and the segments are independent of each other. When searching for a keyword, Lucene has to load the index segments for the search, so the more segments there are, the larger the I/O overhead and the slower the retrieval. For this reason, a segment merging policy is applied at write time to merge the different segments.
  • document: Lucene writes documents into segments; one segment contains multiple documents.
  • field: a document contains several different kinds of content, and different kinds of content are stored in different fields.
  • term: Lucene splits the strings in a field into terms through tokenization (lexical analysis and language processing); these terms are what Lucene uses for full-text search.
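
To make the hierarchy concrete, here is a minimal sketch of the creation phase, assuming a RAMDirectory and made-up field names and values: a Document with two fields is handed to an IndexWriter, and the commit (flush) turns the buffered documents into a segment of the index.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class HierarchySketch {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();   // index: the whole index library
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

    Document doc = new Document();        // document
    doc.add(new StringField("id", "1", Field.Store.YES));                          // field, not tokenized
    doc.add(new TextField("body", "Lucene is a search toolkit", Field.Store.YES)); // field, split into terms

    writer.addDocument(doc);  // the forward index is built in memory
    writer.commit();          // flush: the buffered documents become a segment
    writer.close();
  }
}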

3.2 Inverted Index

The core of Lucene's full-text search is a fast lookup mechanism based on the inverted index.

The principle of the inverted index is shown in Figure 2. Put simply, the analyzer segments the text content into terms and then records which documents each term appears in, so that the search terms entered by the user can be mapped to the documents that contain them.

Problem: the inverted index above has to be loaded into memory for every search. When there are many documents and the documents are long, the index terms may occupy a lot of storage space, and loading them into memory is correspondingly expensive.

Solution: starting from Lucene 4, Lucene uses an FST to reduce the space consumed by the index terms.

FST (Finite State Transducer) is a finite-state transducer. Its main characteristics are the following four points (a minimal construction sketch follows this list):

  • The time complexity of looking up a term is O(len(str));
  • By sharing stored prefixes and suffixes, the space required for storing terms is reduced;
  • When loading, only the prefixes are put into the in-memory index, while the suffix blocks are kept on disk, which reduces the memory footprint of the index;
  • The FST structure is very efficient for queries such as PrefixQuery, FuzzyQuery, and RegexpQuery.
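
The sketch below builds and queries a small FST with Lucene's org.apache.lucene.util.fst API (as available in Lucene 7.x). The terms aab, abd and bdc come from the figure; the long outputs are arbitrary stand-ins for term-dictionary offsets, so this only illustrates the shared-prefix idea, not the real term dictionary that the postings format builds internally.

import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;

public class FstSketch {
  public static void main(String[] args) throws Exception {
    PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
    Builder<Long> builder = new Builder<>(FST.INPUT_TYPE.BYTE1, outputs);
    IntsRefBuilder scratch = new IntsRefBuilder();

    // Terms must be added in sorted order; the long output plays the role of
    // an offset into the on-disk term dictionary (.tim)
    builder.add(Util.toIntsRef(new BytesRef("aab"), scratch), 1L);
    builder.add(Util.toIntsRef(new BytesRef("abd"), scratch), 2L);
    builder.add(Util.toIntsRef(new BytesRef("bdc"), scratch), 3L);

    FST<Long> fst = builder.finish();
    // Lookup runs in O(len(term)); the shared prefix "a" is stored only once
    System.out.println(Util.get(fst, new BytesRef("abd"))); // prints 2
  }
}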

The specific storage method is shown in Figure 3:

The files related to the inverted index include the .tip, .tim and .doc files, where:

  • .tip: stores the prefixes of the inverted-index terms, used to quickly locate the position in the .tim file of the terms belonging to this field, i.e. aab, abd, and bdc in the figure above.
  • .tim: stores the terms (suffix blocks) corresponding to each prefix together with the corresponding posting-list metadata. The posting list can be traversed quickly through skip lists, and by skipping elements the intersection, union and difference of multiple conditions can also be computed efficiently, which improves performance.
  • .doc: contains the document numbers and term-frequency information; based on the contents of the posting list, the text stored in the files is then returned.

3.3 Index query and document search process

Lucene uses the inverted index to locate the numbers of the documents that match the query; after finding the documents by number, it sorts them using term weights and other information and returns them.

  • Load the .tip file into memory and, through the FST, find the position of the matching suffix block in the .tim file;
  • From the located suffix block, read the suffix and the related posting-list metadata;
  • Using the posting-list metadata read from .tim, locate the document numbers and term frequencies in the .doc file, completing the search;
  • After the documents are located, Lucene uses the forward index: the .fdx stored-fields index is used to find the target documents in the .fdt stored-fields data file.

The file format is shown in Figure 4:

The above explains how Lucene works; the following sections walk through the Java code with which Lucene performs indexing, querying and other operations.

4. Adding, Deleting and Updating Documents in Lucene

Operations such as analyzing and storing text in a Lucene project are all performed through the IndexWriter class. An IndexWriter is built mainly from two classes, Directory and IndexWriterConfig, where:

Directory: specifies the type of directory in which the index files are stored. Since the text content is to be searched, the text content and index information naturally have to be written into a directory first. Directory is an abstract class that allows many different implementations of index storage. Common storage methods include the local file system (FSDirectory) and memory (RAMDirectory).

IndexWriterConfig: specifies the configuration IndexWriter uses when writing file content, including the OpenMode (index construction mode), the Similarity (relevance algorithm), and so on.
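
As a minimal sketch of putting the two together (the index path is made up for illustration, and the option values simply restate the defaults):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class WriterSetup {
  public static void main(String[] args) throws Exception {
    // Directory: where the index files are stored (local disk here)
    Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index")); // hypothetical path

    // IndexWriterConfig: analyzer, open mode, similarity, buffer sizes, ...
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    config.setOpenMode(OpenMode.CREATE_OR_APPEND);  // index construction mode
    config.setSimilarity(new BM25Similarity());     // relevance algorithm (BM25 is the default)

    try (IndexWriter writer = new IndexWriter(dir, config)) {
      // writer.addDocument(...) / updateDocument(...) / deleteDocuments(...)
    }
  }
}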

How exactly does IndexWriter operate on the index? Let us briefly analyze the relevant source code of IndexWriter's index operations.

4.1 Adding documents

a. Lucene creates a ThreadState object for each document; the object holds a DocumentsWriterPerThread (DWPT) that performs the add, delete and update operations;

ThreadState getAndLock(Thread requestingThread, DocumentsWriter documentsWriter) {
  ThreadState threadState = null;
  synchronized (this) {
    if (freeList.isEmpty()) {
      // If there is no free ThreadState available, create a new one
      return newThreadState();
    } else {
      // freeList is last-in-first-out, so only a limited number of ThreadStates are used to operate on the index
      threadState = freeList.remove(freeList.size()-1);

      // Prefer a ThreadState whose DocumentsWriterPerThread has already been initialized:
      // swap it with the current ThreadState and move it to the tail so it is used first
      if (threadState.dwpt == null) {
        for(int i=0;i<freeList.size();i++) {
          ThreadState ts = freeList.get(i);
          if (ts.dwpt != null) {
            freeList.set(i, threadState);
            threadState = ts;
            break;
          }
        }
      }
    }
  }
  threadState.lock();
  
  return threadState;
}

b. Inserting into the index file: DocumentsWriterPerThread calls processField under DefaultIndexChain to process each field in the document. The processField method is the core execution logic of the index chain; it performs the corresponding indexing, tokenization and storage operations on each field according to the FieldType the user has set. The most important property of FieldType is indexOptions (a caller-side sketch follows the source excerpt below):

  • NONE: the field is not written into the inverted index and cannot be searched;
  • DOCS: the documents are written into the inverted index, but because term frequencies are not recorded, a term that appears multiple times is only processed once;
  • DOCS_AND_FREQS: the documents and term frequencies are written into the inverted index;
  • DOCS_AND_FREQS_AND_POSITIONS: the documents, term frequencies and positions are written into the inverted index;
  • DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS: the documents, term frequencies, positions and offsets are written into the inverted index.
// Build the inverted index

if (fieldType.indexOptions() != IndexOptions.NONE) {
    fp = getOrAddField(fieldName, fieldType, true);
    boolean first = fp.fieldGen != fieldGen;
    // Concrete indexing and tokenization of the field
    fp.invert(field, first);

    if (first) {
      fields[fieldCount++] = fp;
      fp.fieldGen = fieldGen;
    }
} else {
  verifyUnIndexedFieldType(fieldName, fieldType);
}

// Store this field's stored value (stored field)
if (fieldType.stored()) {
  if (fp == null) {
    fp = getOrAddField(fieldName, fieldType, false);
  }
  if (fieldType.stored()) {
    String value = field.stringValue();
    if (value != null && value.length() > IndexWriter.MAX_STORED_STRING_LENGTH) {
      throw new IllegalArgumentException("stored field \"" + field.name() + "\" is too large (" + value.length() + " characters) to store");
    }
    try {
      storedFieldsConsumer.writeField(fp.fieldInfo, field);
    } catch (Throwable th) {
      throw AbortingException.wrap(th);
    }
  }
}

// Build DocValues (look up which terms a document contains, starting from the document)
DocValuesType dvType = fieldType.docValuesType();
if (dvType == null) {
  throw new NullPointerException("docValuesType must not be null (field: \"" + fieldName + "\")");
}
if (dvType != DocValuesType.NONE) {
  if (fp == null) {
    fp = getOrAddField(fieldName, fieldType, false);
  }
  indexDocValue(fp, dvType, field);
}
if (fieldType.pointDimensionCount() != 0) {
  if (fp == null) {
    fp = getOrAddField(fieldName, fieldType, false);
  }
  indexPoint(fp, field);
}
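
From the caller's side these options are set on a FieldType (or implicitly by choosing field classes such as StringField or TextField). A minimal sketch, with made-up field names and values:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;

public class FieldTypeSketch {
  public static void main(String[] args) {
    // A custom FieldType: indexed with docs + freqs + positions, tokenized and stored
    FieldType bodyType = new FieldType();
    bodyType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
    bodyType.setTokenized(true);
    bodyType.setStored(true);
    bodyType.freeze(); // make the type immutable before it is used

    Document doc = new Document();
    doc.add(new Field("body", "lucene index options example", bodyType));
    // A field with IndexOptions.NONE would only be stored (if setStored(true)) and not searchable
  }
}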

c. To parse a field, a TokenStream must first be constructed to generate and transform the token stream. TokenStream has two important subclasses, Tokenizer and TokenFilter. A Tokenizer reads characters through the java.io.Reader class to produce a token stream, which can then be processed by any number of TokenFilters. The relevant source code is as follows (a standalone usage sketch follows the excerpt):

// invert: to tokenize a Field, it must first be converted into a TokenStream
try (TokenStream stream = tokenStream = field.tokenStream(docState.analyzer, tokenStream))
// The TokenStream implementation differs per analyzer; return the TokenStream produced by the configured analyzer
if (tokenStream != null) {
  return tokenStream;
} else if (readerValue() != null) {
  return analyzer.tokenStream(name(), readerValue());
} else if (stringValue() != null) {
  return analyzer.tokenStream(name(), stringValue());
}

public final TokenStream tokenStream(final String fieldName, final Reader reader) {
  // Reuse strategy: if TokenStreamComponents already exist for this field, reuse them.
  TokenStreamComponents components = reuseStrategy.getReusableComponents(this, fieldName);
  final Reader r = initReader(fieldName, reader);
  // If no components exist, create them for the configured analyzer.
  if (components == null) {
    components = createComponents(fieldName);
    reuseStrategy.setReusableComponents(this, fieldName, components);
  }
  // Pass the java.io.Reader input stream into the components.
  components.setReader(r);
  return components.getTokenStream();
}
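
Outside the index chain the same API can be consumed directly. The following minimal sketch (field name and text are made up) prints the terms an analyzer produces, using the same reset/incrementToken loop that the index chain uses:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenStreamSketch {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new StandardAnalyzer();
    try (TokenStream stream = analyzer.tokenStream("body", "Lucene Tokenizes TEXT Quickly")) {
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();                        // must be called before incrementToken()
      while (stream.incrementToken()) {      // the same loop the index chain runs
        System.out.println(term.toString()); // lower-cased tokens produced by the analyzer
      }
      stream.end();
    }
  }
}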

d. According to the analyzer configured in IndexWriterConfig, the corresponding tokenizer components are returned through the strategy pattern. There are many analyzer implementations for different languages and different tokenization requirements:

  • StopAnalyzer: a stop-word analyzer, used to filter out specific strings or words from the text.
  • StandardAnalyzer: the standard analyzer, which tokenizes on numbers, letters and so on, supports stop-word filtering (covering the functionality of StopAnalyzer), and supports simple Chinese tokenization.
  • CJKAnalyzer: provides better support for Chinese tokenization, following Chinese language habits.

Take StandardAnalyzer (the standard analyzer) as an example:

// How the standard analyzer creates its components, covering three functions: standard tokenization, lower-casing of terms, and stop-word filtering
protected TokenStreamComponents createComponents(final String fieldName) {
  final StandardTokenizer src = new StandardTokenizer();
  src.setMaxTokenLength(maxTokenLength);
  TokenStream tok = new StandardFilter(src);
  tok = new LowerCaseFilter(tok);
  tok = new StopFilter(tok, stopwords);
  return new TokenStreamComponents(src, tok) {
    @Override
    protected void setReader(final Reader reader) {
      src.setMaxTokenLength(StandardAnalyzer.this.maxTokenLength);
      super.setReader(reader);
    }
  };
}

e. After the TokenStream is obtained, its attributes are analyzed and read through the incrementToken method, and the add method of TermsHashPerField then builds the inverted index. The field's data is finally stored in the postings arrays freqProxPostingsArray (of type FreqProxPostingsArray) and termVectorsPostingsArray (of type TermVectorsPostingsArray), which together constitute the inverted index;

// Taking LowerCaseFilter as an example, its incrementToken converts the characters of the token to lower case
public final boolean incrementToken() throws IOException {
  if (input.incrementToken()) {
    CharacterUtils.toLowerCase(termAtt.buffer(), 0, termAtt.length());
    return true;
  } else
    return false;
}
  try (TokenStream stream = tokenStream = field.tokenStream(docState.analyzer, tokenStream)) {
    // reset TokenStream
    stream.reset();
    invertState.setAttributeSource(stream);
    termsHashPerField.start(field, first);
    // Analyze the stream and read the token attributes
    while (stream.incrementToken()) {
      ……
      try {
        // Build the inverted index
        termsHashPerField.add();
      } catch (MaxBytesLengthExceededException e) {
        ……
      } catch (Throwable th) {
        throw AbortingException.wrap(th);
      }
    }
    ……
}

4.2 Deleting documents

a. To delete a document in Lucene, the Term or Query to be deleted is first added to the delete queue;

synchronized long deleteTerms(final Term... terms) throws IOException {
  // TODO why is this synchronized?
  final DocumentsWriterDeleteQueue deleteQueue = this.deleteQueue;
  // Deleting a document adds the deleted term information to the delete queue; the actual deletion is performed according to the flush policy
  long seqNo = deleteQueue.addDelete(terms);
  flushControl.doOnDelete();
  lastSeqNo = Math.max(lastSeqNo, seqNo);
  if (applyAllDeletes(deleteQueue)) {
    seqNo = -seqNo;
  }
  return seqNo;
}

b. The delete operation is then triggered according to the flush policy;

private boolean applyAllDeletes(DocumentsWriterDeleteQueue deleteQueue) throws IOException {
  // Check whether the delete condition is met --> onDelete
  if (flushControl.getAndResetApplyAllDeletes()) {
    if (deleteQueue != null) {
      ticketQueue.addDeletes(deleteQueue);
    }
    // Specify the event that performs the delete operation
    putEvent(ApplyDeletesEvent.INSTANCE); // apply deletes event forces a purge
    return true;
  }
  return false;
}
public void onDelete(DocumentsWriterFlushControl control, ThreadState state) {
  // Check and set whether the delete condition is met
  if ((flushOnRAM() && control.getDeleteBytesUsed() > 1024*1024*indexWriterConfig.getRAMBufferSizeMB())) {
    control.setApplyAllDeletes();
    if (infoStream.isEnabled("FP")) {
      infoStream.message("FP", "force apply deletes bytesUsed=" + control.getDeleteBytesUsed() + " vs ramBufferMB=" + indexWriterConfig.getRAMBufferSizeMB());
    }
  }
}

4.3 Updating documents

Updating a document is a process of deleting first and then inserting, so this article will not go into further detail.
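
From the caller's perspective, both deleting and updating go through IndexWriter. A minimal sketch (the field name and values are made up), assuming writer is an open IndexWriter:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class DeleteUpdateSketch {
  static void deleteAndUpdate(IndexWriter writer) throws IOException {
    // Delete: the term is enqueued in the delete queue; the actual removal happens on flush/merge
    writer.deleteDocuments(new Term("id", "1"));

    // Update: delete every document matching the term, then add the new document
    Document newDoc = new Document();
    newDoc.add(new StringField("id", "2", Field.Store.YES));
    newDoc.add(new TextField("body", "updated content", Field.Store.YES));
    writer.updateDocument(new Term("id", "2"), newDoc);

    writer.commit();
  }
}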

4.4 Index Flush

After a certain number of documents have been written, a thread triggers IndexWriter's flush operation, which generates a segment and writes the in-memory document information to disk. There is currently only one flush policy, FlushByRamOrCountsPolicy, which triggers the flush operation automatically based on two criteria:

  • maxBufferedDocs: a flush is triggered once a certain number of documents have been buffered;
  • ramBufferSizeMB: a flush is triggered once the buffered document content reaches the memory limit.

Here activeBytes() is the amount of memory occupied by the indexes buffered by the DWPTs, and deleteByteUsed is the amount of memory occupied by deletions. Both thresholds are configured through IndexWriterConfig (see the sketch at the end of this subsection).

@Override
public void onInsert(DocumentsWriterFlushControl control, ThreadState state) {
  // Flush by document count
  if (flushOnDocCount()
      && state.dwpt.getNumDocsInRAM() >= indexWriterConfig
          .getMaxBufferedDocs()) {
    // Flush this state by num docs
    control.setFlushPending(state);
  // Flush by memory usage
  } else if (flushOnRAM()) {// flush by RAM
    final long limit = (long) (indexWriterConfig.getRAMBufferSizeMB() * 1024.d * 1024.d);
    final long totalRam = control.activeBytes() + control.getDeleteBytesUsed();
    if (totalRam >= limit) {
      if (infoStream.isEnabled("FP")) {
        infoStream.message("FP", "trigger flush: activeBytes=" + control.activeBytes() + " deleteBytes=" + control.getDeleteBytesUsed() + " vs limit=" + limit);
      }
      markLargestWriterPending(control, state, totalRam);
    }
  }
}

The in-memory information is then written into the index library.

Index flushes are divided into active flushes and automatic flushes. A flush triggered by the policy is an automatic flush. The execution of an active flush is quite different from that of an automatic flush; this article does not go into the details of active flushes, and readers who need them can follow the link.
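
The two automatic triggers are set on IndexWriterConfig; calling commit() also forces the buffered documents to be flushed. A minimal configuration sketch (the threshold values are arbitrary):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class FlushConfigSketch {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    config.setMaxBufferedDocs(1000);   // automatic flush after 1000 buffered documents
    config.setRAMBufferSizeMB(64.0);   // or when the in-memory buffer exceeds 64 MB

    Directory dir = new RAMDirectory();
    try (IndexWriter writer = new IndexWriter(dir, config)) {
      // ... addDocument / deleteDocuments calls ...
      writer.commit();  // flushes the buffered documents into a segment and makes them visible
    }
  }
}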

4.5 Index segment Merge

Each DWPT generates a separate segment when the index is flushed. When there are too many segments, a full-text search may span multiple segments, which leads to multiple loads; therefore segments that have become too numerous need to be merged.

Segment merging is managed by a MergeScheduler. MergeScheduler has several implementations, including NoMergeScheduler, SerialMergeScheduler and ConcurrentMergeScheduler.

1) The merge operation first queries the segments to be merged through the updatePendingMerges method, according to the segment merge policy. There are many kinds of merge policies; this article only introduces the two that Lucene uses by default, TieredMergePolicy and LogMergePolicy (a caller-side configuration sketch is given at the end of this subsection).

  • TieredMergePolicy: first sorts the segment set provided by IndexWriter using the OneMerge scoring mechanism, then selects some (possibly non-contiguous) segments from the sorted set as the set of segments to be merged; in other words, it can merge non-adjacent segments.
  • LogMergePolicy: a fixed-length merging approach that divides contiguous segments into different levels through the maxLevel, LEVEL_LOG_SPAN and levelBottom parameters, and then uses mergeFactor to select segments from each level to merge.
private synchronized boolean updatePendingMerges(MergePolicy mergePolicy, MergeTrigger trigger, int maxNumSegments)
  throws IOException {

  final MergePolicy.MergeSpecification spec;
  // Find the segments that need to be merged
  if (maxNumSegments != UNBOUNDED_MAX_MERGE_SEGMENTS) {
    assert trigger == MergeTrigger.EXPLICIT || trigger == MergeTrigger.MERGE_FINISHED :
    "Expected EXPLICT or MERGE_FINISHED as trigger even with maxNumSegments set but was: " + trigger.name();

    spec = mergePolicy.findForcedMerges(segmentInfos, maxNumSegments, Collections.unmodifiableMap(segmentsToMerge), this);
    newMergesFound = spec != null;
    if (newMergesFound) {
      final int numMerges = spec.merges.size();
      for(int i=0;i<numMerges;i++) {
        final MergePolicy.OneMerge merge = spec.merges.get(i);
        merge.maxNumSegments = maxNumSegments;
      }
    }
  } else {
    spec = mergePolicy.findMerges(trigger, segmentInfos, this);
  }
  // Register all segments that need to be merged
  newMergesFound = spec != null;
  if (newMergesFound) {
    final int numMerges = spec.merges.size();
    for(int i=0;i<numMerges;i++) {
      registerMerge(spec.merges.get(i));
    }
  }
  return newMergesFound;
}

2) The merge method of the ConcurrentMergeScheduler class creates and starts the MergeThread used for merging.

@Override
public synchronized void merge(IndexWriter writer, MergeTrigger trigger, boolean newMergesFound) throws IOException {
  ……
  while (true) {
    ……
    // Take out a registered candidate merge
    OneMerge merge = writer.getNextMerge();
    boolean success = false;
    try {
      // Build the MergeThread used for merging
      final MergeThread newMergeThread = getMergeThread(writer, merge);
      mergeThreads.add(newMergeThread);

      updateIOThrottle(newMergeThread.merge, newMergeThread.rateLimiter);

      if (verbose()) {
        message("    launch new thread [" + newMergeThread.getName() + "]");
      }
      // Start the thread
      newMergeThread.start();
      updateMergeThreads();

      success = true;
    } finally {
      if (!success) {
        writer.mergeFinish(merge);
      }
    }
  }
}

3) The doMerge method performs the merge operation;

public void merge(MergePolicy.OneMerge merge) throws IOException {
  ……
      try {
        // Handle the buffered tasks before the merge and generate the information for the new segment
        mergeInit(merge);
        // Perform the merge between segments
        mergeMiddle(merge, mergePolicy);
        mergeSuccess(merge);
        success = true;
      } catch (Throwable t) {
        handleMergeException(t, merge);
      } finally {
        // Clean-up work after the merge completes
        mergeFinish(merge);
      }
……
}
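
A caller-side configuration sketch of the merge policy and scheduler (the parameter values are arbitrary); forceMerge explicitly pushes the index through the merge path described above:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class MergeConfigSketch {
  public static void main(String[] args) throws Exception {
    TieredMergePolicy mergePolicy = new TieredMergePolicy();
    mergePolicy.setMaxMergeAtOnce(10);      // maximum number of segments merged at once
    mergePolicy.setSegmentsPerTier(10.0);   // allowed segments per tier before merging kicks in

    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    config.setMergePolicy(mergePolicy);
    config.setMergeScheduler(new ConcurrentMergeScheduler()); // merges run on background MergeThreads

    Directory dir = new RAMDirectory();
    try (IndexWriter writer = new IndexWriter(dir, config)) {
      // ... add documents ...
      writer.forceMerge(1); // explicitly ask the policy/scheduler to merge down to one segment
    }
  }
}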

5. Lucene Search Implementation

5.1 Load index library

To perform a search, Lucene first needs to load the index segments into memory. Because loading the index library is very time-consuming, the index library only needs to be reloaded when it has changed (a reopen sketch follows the lists below).

Loading the index library is divided into two parts, loading the segment information and loading the document information:

1) Loading segment information:

  • Obtain the largest generation of the segments through the segments.gen file and obtain the overall segment information;
  • Read the .si files, construct SegmentInfo objects, and finally aggregate them into a SegmentInfos object.

2) Loading document information:

  • Read the segment information and obtain the corresponding FieldInfo from the .fnm file to construct FieldInfos;
  • Open the posting-list files and the term dictionary files;
  • Read the index statistics and the related norms information;
  • Read the stored document files.
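
A minimal sketch of loading the index library through DirectoryReader and reopening it only when the index has changed (the index path is made up):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ReaderReloadSketch {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(Paths.get("/tmp/lucene-index")); // hypothetical path
    DirectoryReader reader = DirectoryReader.open(dir);  // loads the segment and document info

    // Later, before searching again: reopen only if the index has changed
    DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
    if (newReader != null) {   // null means nothing changed, keep using the old reader
      reader.close();
      reader = newReader;
    }
    reader.close();
    dir.close();
  }
}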

5.2 Encapsulation

After the index library is loaded, the IndexReader needs to be wrapped into an IndexSearcher. IndexSearcher returns the results the user needs according to the Query built by the user and the specified Similarity text-similarity algorithm (BM25 by default). The search function is implemented through the IndexSearcher.search method.

Search: Query has multiple implementations, including BooleanQuery, PhraseQuery, TermQuery, PrefixQuery and other query types; users can build query statements according to the requirements of their project.

Sorting: besides ranking documents by the relevance score computed through Similarity, IndexSearcher also provides BoostQuery, which lets users boost the score of particular keywords and customize the ranking. The Similarity relevance algorithm also has many different scoring implementations, which will not be covered here; readers can look them up if needed.
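
Putting the pieces together, a minimal search sketch (the field names, query terms and index path are made up; the BoostQuery only illustrates the custom boosting mentioned above, and setting BM25Similarity restates the default):

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.store.FSDirectory;

public class SearchSketch {
  public static void main(String[] args) throws Exception {
    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/lucene-index")));
    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(new BM25Similarity()); // BM25 is already the default

    // body must contain "lucene"; documents whose title contains "search" get a 2x boost
    Query query = new BooleanQuery.Builder()
        .add(new TermQuery(new Term("body", "lucene")), BooleanClause.Occur.MUST)
        .add(new BoostQuery(new TermQuery(new Term("title", "search")), 2.0f), BooleanClause.Occur.SHOULD)
        .build();

    TopDocs topDocs = searcher.search(query, 10);  // top 10 documents by relevance score
    for (ScoreDoc sd : topDocs.scoreDocs) {
      System.out.println(sd.score + "  " + searcher.doc(sd.doc).get("title"));
    }
    reader.close();
  }
}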

6. Summary

As a full-text indexing toolkit, Lucene provides strong full-text search support for small and medium-sized projects, but there are a number of problems when using it:

  • Since Lucene needs to read the index information through IndexReader and load it into memory to provide its retrieval capability, when the index volume is too large it consumes too much memory on the machine where the service is deployed.
  • Implementing search is relatively complicated: the indexing, tokenization and storage options of every field have to be set one by one, which makes it cumbersome to use.
  • Lucene does not support clustering.

There are many restrictions when using Lucene directly, and it is not particularly convenient. When the amount of data grows, it is better to choose a distributed search server such as Elasticsearch as the implementation of the search function.

Author: vivo Internet server team-Qian Yulun
