Phrase Matching in Marginalia Search

Marginalia Search now supports phrase matching. The feature took about 4 months to implement, and the work was funded by nlnet.

Old design: Term positions were stored approximately, using 56 bits per word at sentence-level granularity. Quoted search queries were handled by guessing which word n-grams were relevant. This failed in some cases: the query "the vietnam of computer science", for example, only found Vietnamese computer scientists.
Representing position lists: The textbook solution is to store positions as compressed lists of positive integers. Elias gamma coding was used initially but turned out to be too slow. Varints replaced it, as they are more CPU-friendly and about twice as fast as the gamma code implementation. Faster vectorized schemes such as Stream VByte exist, but they did not perform well in Java.
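
To make the varint idea concrete, here is a minimal sketch of a position list coded as byte-oriented varints. It is an illustration under stated assumptions, not the engine's actual format: the delta (gap) step, the little-endian 7-bit layout, and the function names are all assumptions made for this example.

```python
def encode_positions(positions):
    """Varint-code a sorted list of positive positions.

    Assumption: positions are first turned into gaps (delta coding),
    so that most values fit in a single byte.
    """
    out = bytearray()
    prev = 0
    for pos in positions:
        gap = pos - prev
        prev = pos
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            gap >>= 7
        out.append(gap)                      # final byte, continuation bit clear
    return bytes(out)


def decode_positions(data):
    """Inverse of encode_positions: rebuild absolute positions from gaps."""
    positions = []
    pos = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                       # more bytes follow for this gap
        else:
            pos += gap
            positions.append(pos)
            gap = 0
            shift = 0
    return positions
```

Decoding needs only a mask, a shift, and a branch per byte, with no bit-level bookkeeping, which is one plausible reason a varint decoder can outrun a gamma decoder on a general-purpose CPU.
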
Size and shape of coded data: Switching to varint did not significantly change the size of the position data. The position data does add to disk usage, though: about 85 GB per index partition and 700 GB in total. The priority index shrank from 175 GB to 10 GB after metadata was removed and a custom compression scheme was introduced.
Position spans: A new approach was introduced for coding information about individual word occurrences. It stores coded lists of span start and end positions in separate files, which is more CPU-friendly. This lets the search engine index more documents and assign lower scores to matches in less useful areas.
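
As a rough sketch of how start and end lists can be queried, the example below keeps the span boundaries in one flat sorted list and uses a binary search to test whether a word position falls inside any span. The flat layout, the half-open `[start, end)` convention, and the function names are assumptions for illustration only.

```python
from bisect import bisect_right


def in_span(span_bounds, pos):
    """Return True if pos lies inside one of the spans.

    span_bounds is a flat sorted list [start0, end0, start1, end1, ...]
    of non-overlapping half-open spans (an assumed layout). An odd number
    of boundaries at or below pos means a span has opened but not closed.
    """
    return bisect_right(span_bounds, pos) % 2 == 1


def count_in_spans(span_bounds, positions):
    """Count how many of a term's positions fall inside the given spans."""
    return sum(1 for p in positions if in_span(span_bounds, p))
```

A ranking function could use a count like this to weigh matches differently depending on the kind of span they fall in, which is the sort of per-area scoring described above.
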
Phrase matching and ranking factors: Stop words need to be removed for phrase matching. The ranking algorithm needs extensive modification and tuning to make use of new ranking factors, such as whether the search query appears as a phrase, the minimum distance between keywords, and how far into the document one must read before all keywords have been seen. All of these factors can be explored with the qdebug utility.
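
Two of these factors can be sketched directly on top of per-term position lists. The code below is an illustration, not the engine's implementation: it assumes that a removed stop word keeps its slot in the phrase (represented here as `None`) so the remaining terms retain their relative offsets, and the function names are hypothetical.

```python
def phrase_starts(position_lists):
    """Find positions where the query occurs as a phrase.

    position_lists[i] holds the sorted positions of the i-th query term,
    or None for a removed stop word (assumed to match any word).
    Each term's positions are shifted back by its offset in the query;
    a start position common to all terms is a phrase match.
    """
    candidates = None
    for offset, positions in enumerate(position_lists):
        if positions is None:
            continue
        shifted = {p - offset for p in positions}
        candidates = shifted if candidates is None else candidates & shifted
    return sorted(candidates or [])


def min_distance(a, b):
    """Smallest distance between any position in a and any in b.

    Both lists must be sorted. Advances whichever pointer is behind,
    so the scan is linear. Returns None if either list is empty.
    """
    i = j = 0
    best = None
    while i < len(a) and j < len(b):
        d = abs(a[i] - b[j])
        if best is None or d < best:
            best = d
        if a[i] < b[j]:
            i += 1
        else:
            j += 1
    return best
```

For the query "the vietnam of computer science", the stop words "the" and "of" would be passed as `None`, leaving "vietnam", "computer", and "science" to be matched at offsets 1, 3, and 4.
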
Conclusion: This was a large change, with over 200 commits and a delta of nearly 20,000 lines of code. It has been a success, with some queries performing better. The feedback cycle in web search engine development is long, in part because of the long rebuild time. The approach of looking at queries and making changes to improve search result quality is solid.