Elasticsearch Study Notes (1): Basic Concepts and Basic Usage

When I introduce a framework, message queue, or other piece of middleware, I usually explain what it is and what it can do for us, then show basic usage, then advanced features. This time I want to change the style and weave in a few small stories. Writing this reminded me of the first project I joined. It had a paginated query that joined seven tables: the main table held 2.7 million rows, the second table more than one million, and the remaining five about 400,000 to 500,000 rows each. The query took about ten seconds, which was far too slow. We asked the DBA to optimize it and got it down to seven seconds, which users could barely accept. I asked my senior colleague at the time what we would do if the data kept growing and the query got slower again. He said we could consider Elasticsearch, which is reputed to query billions of records at lightning speed.

Origin

Many years ago, a newly married, unemployed developer named Shay Banon followed his wife to London, where she was studying cooking. While looking for gainful employment, he started using an early version of Lucene to build a recipe search engine for his wife.
Using Lucene directly was difficult, so Shay set out to make an abstraction layer that Java developers could use to easily add search functionality to their programs. He released his first open source project, Compass.
Shay later got a job working on high performance, in-memory data grids in distributed environments. The need for a high-performance, real-time, distributed search engine was particularly prominent, and he decided to rewrite Compass into an independent service called Elasticsearch.
The first public version was released in February 2010, and since then Elasticsearch has become one of the most active projects on GitHub, with over 300 contributors (currently 736 contributors). A company has formed around Elasticsearch to provide commercial services and develop new features, but Elasticsearch will always be open source and available to everyone.
Shay's wife is said to be still waiting for her recipe search engine... (From Elasticsearch: The Definitive Guide, which is based on version 2.x.)

So what is Lucene? Lucene is a full-text search engine library from Apache; its full name is Apache Lucene. Lucene is arguably the most advanced, high-performance, and full-featured search engine library available, but it is only a library and is complicated to use. Shay Banon built an abstraction layer to shield developers from the complex details, so that Java developers could easily add search capabilities to their programs. That was Compass, which eventually evolved into Elasticsearch.

Why a search engine at all? Can't a traditional database do this? The reasons given in Elasticsearch: The Definitive Guide are:

Unfortunately, most databases are surprisingly inefficient at extracting usable knowledge from your data. Sure, you can filter by timestamp or exact value, but are they capable of full-text search, handling synonyms, scoring documents by relevance? Can they generate analytics and aggregate data from the same data? Most importantly, can they do this in real-time, without going through a large batch of tasks?

The key point we can take from this passage is that most relational databases are not capable of full-text search, handling synonyms, or scoring documents by relevance.

What is full text retrieval (search)?

Full Text Searching (or just text search ) provides the capability to identify natural-language documents that satisfy a query , and optionally to sort them by relevance to the query. The most common type of search is to find all documents containing given query terms and return them in order of their similarity to the query. Notions of query and similarity are very flexible and depend on the specific application. The simplest search considers query as a set of words and similarity as the frequency of query words in the document. [1]
In other words, full-text search (or text search) identifies natural-language documents that satisfy a query and can rank the results by relevance. The most common form finds all documents containing the query terms and sorts them by similarity to the query. The notions of query and similarity are flexible and application specific. The simplest full-text search treats the query as a set of words and similarity as the frequency of the query words in the document.

So what is Document?

A document is the unit of searching in a full text search system; for example, a magazine article or email message. The text search engine must be able to parse documents and store associations of lexemes (key words) with their parent document. Later, these associations are used to search for documents that contain query words.
In other words, a document is the basic unit of a full-text search system; a magazine article or an email can each be regarded as a document. A full-text search engine must be able to parse documents and store the associations between keywords and the documents that contain them. These associations are later used to find the documents that contain the query words.
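The idea of storing keyword-to-document associations and ranking by query-word frequency can be sketched in a few lines of Python. This is only a minimal illustration, not how Lucene or Elasticsearch is implemented: the function names (`tokenize`, `build_index`, `search`) and the three sample documents are made up for this example, and tokenization is a plain whitespace split.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; real analyzers do far more."""
    return text.lower().split()


def build_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(tokenize(text)).items():
            index[term][doc_id] = freq
    return index


def search(index, query):
    """Rank documents by the summed frequency of the query terms."""
    scores = Counter()
    for term in tokenize(query):
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] += freq
    return [doc_id for doc_id, _ in scores.most_common()]


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "zookeeper coordinates distributed services",
    2: "the zookeeper feeds the animals and the zookeeper cleans",
    3: "elasticsearch is a distributed search engine",
}
index = build_index(docs)
print(search(index, "zookeeper"))           # [2, 1]
print(search(index, "distributed search"))  # [3, 1]
```

Document 2 ranks first for "zookeeper" because the word appears in it twice. Real engines use more refined relevance formulas than raw frequency, but the lookup structure is the same in spirit.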

Doesn't this sound like the search engines we use every day? We type keywords into the search box, the search engine searches the entire Web, and it ranks the pages by relevance. For example, I searched for Zookeeper on Baidu, and these are the results:

[Figure: Baidu search results]

The same search on Bing:

[Figure: Bing search results]

Search engines rank these pages by relevance. Baidu also handled synonyms: "Zookeeper" can also mean a person who keeps a zoo, so results for zoo keepers were returned as well. Likewise, two differently worded queries for Apple's official website are treated as synonyms, and for both of them Baidu returns Apple's official site as the first result. All of this is difficult to achieve in a traditional relational database, but Elasticsearch provides it for us. There is also a word segmentation problem here. For example, when I search for "ES word segmentation", the search engine receives a single input, but when matching documents it actually splits the input into two terms, "ES" and "word segmentation", and computes relevance from the documents that contain those terms. Since Elasticsearch can do so much for us, let's learn it.

The above functions are very similar to the functions of search engines, so ElasticSearch positions itself as:

Elasticsearch is a distributed, RESTful-style search and data analysis engine that addresses a wide variety of emerging use cases for all types of data including text, numeric, geospatial, structured and unstructured data, and more.

At the time of writing, the latest version of ES is 8.3. We will not use the latest version; these notes are based on version 7.17, running on a CentOS 7 Linux server.

Basic concepts

[Figure: ES concepts]

Since Elasticsearch can act as a search engine, where does its data come from? Elasticsearch has its own built-in data store, so in a sense it can be understood as another kind of database.

Index

When I saw this word, I subconsciously thought of the MySQL index, but in ES, the index has two meanings:

Noun: An index is equivalent to a table in a relational database (before 6.x, an index could be considered a database)
Verb: To save a document in an index, this process can also be called indexing.

Since an index is equivalent to a table in a relational database, does Elasticsearch have something like a CREATE TABLE statement? It does: the Mapping.

Mapping

A mapping defines which fields the documents in an index contain and what type each field has, similar to a table schema in a database. A mapping has two functions:

Defines the name and corresponding type of each field in the index.
Defines the relevant settings of each field and inverted index, such as what tokenizer to use, etc.

Note that once a mapping is defined, the type of an existing field cannot be changed.

type

Before 6.x, an index could be thought of as a database in a relational database, and a type as a table in that database. Types let us store several kinds of data in one index and filter by type when querying. Types can reduce the number of indices to some extent, but they have the following limitations:

Fields with the same name in different types must be consistent. For example, if two types in the same index each have a field with the same name, the field's data type (string, date, etc.) and configuration must be identical in both.
A field that exists only in a certain type will also consume resources in other types that do not have this field.
Scores are determined by statistics across the whole index. That is, documents in one type affect the scores of documents in another type.

Because of these limitations, types should only be used when the types within an index have similar mappings. Otherwise, using multiple types may consume more resources than using multiple indices.

Starting with 6.0, types were gradually phased out, and an index could no longer contain multiple types.

Starting with 7.0, support for multiple types under a single index was officially removed.

Document

Elasticsearch is a distributed document store. Instead of storing information as rows of columnar data, Elasticsearch stores complex data structures that have been serialized as JSON documents.
In other words, ES stores documents in a distributed fashion, and rather than storing data as rows and columns, it stores each document serialized as JSON.

Some basic concepts have been discussed above. Let's install ElasticSearch and introduce other core concepts in practice.

Installation

Install ES

 wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.17.0-linux-x86_64.tar.gz
mkdir es
mv elasticsearch-7.17.0-linux-x86_64.tar.gz es
cd es
# Since version 7, ES bundles its own JDK. While extracting, you will notice that the
# bundled JDK has modules, so it is no longer JDK 8: it is already JDK 17
tar xvf elasticsearch-7.17.0-linux-x86_64.tar.gz
# In preparation for building a cluster later
mv elasticsearch-7.17.0 es_node1

Introduction to the ES directory

Directory	Description
bin	Executable scripts, including the ES startup script and installation scripts
config	Cluster configuration file (elasticsearch.yml), JVM configuration (jvm.options), and user-related configuration
jdk	The bundled JDK (Java runtime), shipped with ES since 7.0
lib	Java class libraries
plugins	The directory where plugins are installed
modules	Contains all ES modules

Change the configuration

Although ES claims to work out of the box, we still have to change some settings. The ES configuration files live mainly in the config directory. We focus on two files:

elasticsearch.yml

elasticsearch.yml is used to configure various parameters of ES services

jvm.options

jvm.options mainly saves JVM related configuration.

 echo -e '\n' >> config/elasticsearch.yml
# Append settings to the yml file
echo 'cluster.name: my_app' >> config/elasticsearch.yml

echo 'node.name: my_node_1' >> config/elasticsearch.yml

echo 'path.data: ./data' >> config/elasticsearch.yml

echo 'path.logs: ./logs' >> config/elasticsearch.yml

echo 'http.port: 9211' >> config/elasticsearch.yml
# Allow connections from any IP; do not set this to 0.0.0.0 in a real environment
echo 'network.host: 0.0.0.0' >> config/elasticsearch.yml

echo 'discovery.seed_hosts: ["localhost"]' >> config/elasticsearch.yml

echo 'cluster.initial_master_nodes: ["my_node_1"]' >> config/elasticsearch.yml

echo -e '\n' >> config/jvm.options

# Set the minimum heap size

echo '-Xms1g' >> config/jvm.options

# Set the maximum heap size

echo '-Xmx1g' >> config/jvm.options

sudo su

echo -e '\nvm.max_map_count=262144' >> /etc/sysctl.conf

sysctl -p

exit;

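Run ES

 # Note: ES must not be run as root! Create a dedicated user here (run these commands as root)
adduser es_study
# I set the password to my_studyes001 here
passwd es_study
# Give ownership of the es_node1 directory to es_study
chown -R es_study es_node1/
# Edit limits.conf and append:
#   * soft nofile 65536
#   * hard nofile 65536
vim /etc/security/limits.conf
# Log in again; if the following command prints 65536, the change took effect
ulimit -H -n
# Switch user
su es_study
# Go to the bin directory and start ES
cd es_node1/bin
./elasticsearch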

Then open http://<host-ip>:9211, where <host-ip> is the address of the machine running ES. The following response appears:

 {
  "name" : "my_node_1",
  "cluster_name" : "my_app",
  "cluster_uuid" : "G6WNl_15TFmJ6VhDJIXfCw",
  "version" : {
    "number" : "7.17.0",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "bee86328705acaa9a6daede7140defd4d9ec56bd",
    "build_date" : "2022-01-28T08:36:04.875279988Z",
    "build_snapshot" : false,
    "lucene_version" : "8.11.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

This means the installation succeeded. Because I had already configured Java environment variables on this machine, ES was still using JDK 8, but this time I want ES to use its bundled JDK 17.

 # Add the following line to /etc/profile:
#   export ES_JAVA_HOME=/usr/es/es_node1/jdk
vim /etc/profile
# Reload the profile so the variable takes effect
source /etc/profile

Install Kibana

Kibana is the official data analysis and visualization platform, but for now we only need it as a debugging tool for ES queries. Kibana and ES versions must match, so download the same version of Kibana as ES.

 # Note: Kibana cannot run as root either, so we start it as the user created above after granting ownership
wget https://artifacts.elastic.co/downloads/kibana/kibana-7.17.0-linux-x86_64.tar.gz
tar xvf kibana-7.17.0-linux-x86_64.tar.gz
mv kibana-7.17.0-linux-x86_64 kibana
cd kibana
# Never bind to 0.0.0.0 in production; it is very dangerous!
echo -e '\nserver.host: "0.0.0.0"' >> config/kibana.yml
echo -e '\nelasticsearch.hosts: ["http://localhost:9211"]' >> config/kibana.yml
# We are already inside the kibana directory, so chown the current directory
chown -R es_study .
./bin/kibana >> run.log 2>&1 &

Then open http://<host-ip>:5601, where <host-ip> is the address of the machine running Kibana, and you will see the following:

[Figure: select Dev Tools]

Using it

It took some effort, but everything is finally installed, and now we can use ES from Kibana. As mentioned above, ES is a distributed, RESTful search and data analysis engine. Since some readers may not be familiar with the RESTful style, here is a brief introduction. HTTP has several request methods, and they map onto CRUD operations as follows:

POST: create a resource
PUT: update a resource
DELETE: delete a resource
GET: retrieve a resource
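As a rough sketch of how this mapping plays out for the document requests used later in this article, the helper below returns the HTTP method and path for each operation. The function name `doc_request` and the operation labels are hypothetical, chosen only for this illustration. Note that Elasticsearch's partial update goes through `POST <index>/_update/<id>` rather than `PUT`.

```python
def doc_request(operation, index, doc_id=None):
    """Return the (HTTP method, path) pair for a document operation."""
    if operation == "create":
        # No ID supplied: Elasticsearch generates one.
        return ("POST", f"/{index}/_doc")
    if doc_id is None:
        raise ValueError(f"{operation} needs a document ID")
    routes = {
        "index": ("PUT", f"/{index}/_doc/{doc_id}"),       # create or replace
        "read": ("GET", f"/{index}/_doc/{doc_id}"),
        "update": ("POST", f"/{index}/_update/{doc_id}"),  # partial update
        "delete": ("DELETE", f"/{index}/_doc/{doc_id}"),
    }
    return routes[operation]


for op in ("create", "index", "read", "update", "delete"):
    print(op, *doc_request(op, "books", "1"))
```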

Manage indexes - Mapping

create index

 PUT books
{
  "mappings": {
    "properties": {
      "book_id": {
        "type": "keyword"
      },
      "name": {
        "type": "text"
      },
      "price": {
        "type": "scaled_float",
        "scaling_factor": 100
      },
      "author": {
        "properties": {
          "first": { "type": "text" },
          "last":  { "type": "text" }
        }
      }
    }
  }
}

The request above creates an index and defines its structure through an Elasticsearch mapping. The type property of each field specifies its data type. The common data types in ES are:

keyword

The keyword type is more suitable for storing short, structured strings. For example, product ID, product name, etc.

text

Fields of type text are suitable for storing full-text data, such as SMS content and email content.

Numeric types (long, integer, short, byte, double, float, half_float, scaled_float...)
- For integer types (byte, short, integer, long), choose the smallest type that satisfies the requirements.
- For floating-point types, it is often more efficient to store the data as integers using a scaling factor, which is what the scaled_float type used above does. If the input is 23.456 and the scaling factor is 100, ES multiplies 23.456 by 100 to get 2345.6 and rounds it to the nearest integer, 2346. The advantage of a scaling factor is that integers compress better than floating-point numbers, which saves disk space.
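The scaled_float arithmetic described above can be checked with a small Python sketch. The helper names `to_scaled` and `from_scaled` are made up for this illustration, and Python's `round` is used as a stand-in for the rounding ES performs; the two may differ on exact .5 ties.

```python
def to_scaled(value, scaling_factor):
    """Multiply by the scaling factor and round to the nearest integer."""
    return round(value * scaling_factor)


def from_scaled(stored, scaling_factor):
    """Recover the approximate original value from the stored integer."""
    return stored / scaling_factor


stored = to_scaled(23.456, 100)
print(stored)                    # 2346
print(from_scaled(stored, 100))  # 23.46
```

The third decimal place is lost: with a scaling factor of 100, a price keeps only two decimal places.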
object type

Simply put, the value of the field is still a json object.

We open Kibana's Dev Tools and run the mapping request above:

[Figure: index created successfully]

Delete the index

 DELETE books

Addition, deletion and modification of documents

create
- Create with the Index API
  This approach specifies the document ID explicitly
```
 PUT books/_doc/1 
{
  "book_id": "4ee82462",
  "name": "母猪的产后护理(一)",
  "price":"23.56",  
  "author": {"first":"wang","last":"小明"}
}
```
If you run this request repeatedly in Kibana, there is no error and no duplicate document; only the version number changes.
When you index a document whose ID already exists, ES deletes the old document, writes the content of the new document, and increments the document's version number.
- Create with an auto-generated ID
```
 POST books/_doc
{
  "book_id": "4ee824621",
  "name": "母猪的产后护理(二)",
  "price":"23.56",  
  "author": {"first":"wang","last":"小明"}
}
```
  Use the POST method to create a document and let ES generate the ID. If you run the above request repeatedly, the version number does not change and a different ID is returned each time, which shows that a new document is added on every run.
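The difference between the two creation styles can be modeled with a toy in-memory index. This is only a sketch of the observable behavior described above, not of how ES works internally: the class `ToyIndex` and its methods are hypothetical, and `uuid4` merely stands in for ES's own ID generation.

```python
import uuid


class ToyIndex:
    """In-memory stand-in for an index; maps doc_id -> (version, source)."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, source):
        """Like PUT <index>/_doc/<id>: replace the document, bump the version."""
        old_version = self.docs.get(doc_id, (0, None))[0]
        self.docs[doc_id] = (old_version + 1, dict(source))
        return old_version + 1

    def create(self, source):
        """Like POST <index>/_doc: store under a new generated ID at version 1."""
        doc_id = uuid.uuid4().hex  # assumption: any unique ID will do here
        self.docs[doc_id] = (1, dict(source))
        return doc_id, 1


books = ToyIndex()
print(books.index("1", {"book_id": "4ee82462"}))  # 1
print(books.index("1", {"book_id": "4ee82462"}))  # 2, same document replaced
id_a, version_a = books.create({"book_id": "4ee824621"})
id_b, version_b = books.create({"book_id": "4ee824621"})
print(version_a, version_b, id_a != id_b)         # 1 1 True
print(len(books.docs))                            # 3
```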

Query

Query the document with ID 1

 GET books/_doc/1

The GET API provides multiple parameters. For more information, please refer to the official documentation.

Batch query

 # Specify the index in the request body
GET /_mget
{
  "docs": [
    { "_index": "books", "_id": "1" },
    { "_index": "books", "_id": "00IDB4IBeGTeMZoVN5XN" }
  ]
}
# Specify the index in the URL
GET /books/_mget
{
  "docs": [
    { "_id": "1" },
    { "_id": "00IDB4IBeGTeMZoVN5XN" }
  ]
}
# Shorthand
GET /books/_mget
{
  "ids" : ["1", "00IDB4IBeGTeMZoVN5XN"]
}

update

 POST books/_update/1
{
  "doc": {
    "name":"伟大的傅里叶变换"
  }
}

Indexing a document, as shown earlier, can also be used to update it, so what is the difference between the two approaches? Indexing deletes the old document first and then writes the new one. If we only want to change one field and supply only that field, all the other fields are erased. The _update API, by contrast, merges the supplied fields into the existing document and leaves the other fields untouched.
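A minimal Python sketch of the two behaviors, using plain dictionaries in place of stored documents. The function names are hypothetical, and the merge here is shallow, so nested objects such as `author` are not merged field by field in this illustration.

```python
def index_doc(existing, new_source):
    """Index API behavior: the new source replaces the old document entirely."""
    return dict(new_source)


def update_doc(existing, partial):
    """Update API with "doc": merge the given fields into the old document."""
    merged = dict(existing)
    merged.update(partial)
    return merged


book = {"book_id": "4ee82462", "name": "old title", "price": "23.56"}
change = {"name": "new title"}

print(index_doc(book, change))
# {'name': 'new title'}
print(update_doc(book, change))
# {'book_id': '4ee82462', 'name': 'new title', 'price': '23.56'}
```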

delete

Just specify the ID in the URL.

 DELETE books/_doc/1

So what if we want to operate on many documents in one request? Elasticsearch provides the Bulk API for this. The Bulk API supports four types of operations in the same request:

Index
Create
Update
Delete

Syntax example:

 # Each action line names the operation type, the target index, and the document ID
POST _bulk
{ "index" : { "_index" : "books", "_id" : "3" } }
{ "book_id": "4ee82462", "name": "天生吾战" }
{ "delete": { "_index": "books", "_id": "2" } }
{ "update": { "_index": "books", "_id": "3" } }
{ "doc": { "name": "母猪的产后护理(四)" } }
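The Bulk API body is newline-delimited JSON: one action line, optionally followed by a source line, with a newline after the last line. If you build such a body from a program rather than typing it into Kibana, a sketch like the following may help. The function name `bulk_body` and the pair-based input format are assumptions made for this example, and sending the body to ES is left out.

```python
import json


def bulk_body(actions):
    """Serialize (action, optional source) pairs into an NDJSON bulk body."""
    lines = []
    for action, source in actions:
        lines.append(json.dumps(action, ensure_ascii=False))
        if source is not None:  # delete actions carry no source line
            lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


body = bulk_body([
    ({"index": {"_index": "books", "_id": "3"}},
     {"book_id": "4ee82462", "name": "天生吾战"}),
    ({"delete": {"_index": "books", "_id": "2"}}, None),
    ({"update": {"_index": "books", "_id": "3"}},
     {"doc": {"name": "母猪的产后护理(四)"}}),
])
print(body, end="")
```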

Tutorials

I usually go to the official Elasticsearch website to read the documentation. ES has a Chinese official site, but it is only partially translated. The site hosts "Elasticsearch: The Definitive Guide", but unfortunately it is based on version 2.x:

https://www.elastic.co/guide/cn/elasticsearch/guide/current/index.html

Newer versions have user guides:

[Figure: select Docs]

[Figure: user guide]

The installation steps in this article largely follow the Juejin booklet "Elasticsearch from Getting Started to Practice".

Reference documentation

Full Text Search https://docs.microsoft.com/en-us/sql/relational-databases/search/full-text-search?view=sql-server-ver16
Elasticsearch (015): Numeric type (numeric) of common field mapping types in es https://blog.csdn.net/weixin_39723544/article/details/104331885
ElasticSearch GET documentation https://www.elastic.co/guide/en/elasticsearch/reference/7.13/docs-get.html
"Elasticsearch from Getting Started to Practice", Juejin booklet https://juejin.cn/book/7054754754529853475/section/7064921135250407437