Basic Concepts of Elastic Stack

Cluster

An Elasticsearch cluster consists of one or more nodes that share the same cluster name. By default, a running Elasticsearch instance forms a cluster named "elasticsearch". We can customize the cluster name in config/elasticsearch.yml:

[root@cb71f81b72b7 config]# cat elasticsearch.yml 
cluster.name: "docker-cluster"
network.host: 0.0.0.0
[root@cb71f81b72b7 config]#

An Elasticsearch cluster looks like the following layout:

The architecture with an Nginx proxy and a load balancer is as follows:

We can send:

GET _cluster/state

to get the state of the entire cluster. Only the master node can change this state. The above request returns:

{
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "Lmv7APZ4QemXB88AGZZfcA",
  "version" : 163,
  "state_uuid" : "7nh2_aWaT9aJHiVaOo5UjQ",
  "master_node" : "bhsEXqpaT7K2PEmXSYlbsg",
  "blocks" : { },
  "nodes" : {...},
  "metadata" : {...},
  "routing_table" : {...},
  "routing_nodes" : {...}
}
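As a rough illustration of how such a response might be consumed, the sketch below picks a few top-level fields out of an already parsed cluster state. The function name summarize_cluster_state and the "nodes" entry in the sample are made up for this example; only the field names come from the response above.

```python
def summarize_cluster_state(state):
    """Pick a few top-level fields out of a parsed `GET _cluster/state` response."""
    master_id = state.get("master_node")
    nodes = state.get("nodes", {})
    return {
        "cluster_name": state.get("cluster_name"),
        "state_version": state.get("version"),
        "master_node_id": master_id,
        # The "nodes" object is keyed by node id, so the master should appear in it.
        "master_is_known_node": master_id in nodes,
        # An empty "blocks" object means no cluster-level blocks are active.
        "has_blocks": bool(state.get("blocks")),
    }


# Values copied from the response above; the "nodes" entry is a hypothetical placeholder.
sample = {
    "cluster_name": "docker-cluster",
    "cluster_uuid": "Lmv7APZ4QemXB88AGZZfcA",
    "version": 163,
    "master_node": "bhsEXqpaT7K2PEmXSYlbsg",
    "blocks": {},
    "nodes": {"bhsEXqpaT7K2PEmXSYlbsg": {"name": "node-1"}},
}
summary = summarize_cluster_state(sample)
print(summary)
```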

Node

A node is a single Elasticsearch instance. In most environments, each node runs on a separate box or virtual machine. A cluster consists of one or more nodes. In a test environment, several nodes can run on one server. In production, each node should in most cases run on its own server.

Node classification:

master-eligible: can be elected as the master node. Once it becomes the master node, it manages the settings and changes of the entire cluster: creating, updating, and deleting indexes; adding or removing nodes; assigning shards to nodes
data: stores data and executes data-related operations such as search
ingest: pre-processes documents before they are indexed (for example, through an ingest pipeline)
machine learning (Gold/Platinum License)

Generally speaking, a node can have one or more of the above roles. We can define them on the command line or in the Elasticsearch configuration file (elasticsearch.yml):

Node type	Configuration parameter	Default
master-eligible	node.master	true
data	node.data	true
ingest	node.ingest	true
machine learning	node.ml	true (except OSS releases)

You can also dedicate a node to a single role. If all of the above parameters are set to false, the node acts as a coordinating-only node. In that case, it accepts external requests and forwards them to the appropriate nodes for processing.

For a dedicated master node, we sometimes also need to set cluster.remote.connect: false, which disables cross-cluster connections on that node.

In practice, we send requests to the data nodes rather than to the master node. The role of a node in the cluster is defined in the config/elasticsearch.yml file.

In some cases, we can set node.voting_only to true on a node whose node.master is also true. Such a node takes part in master elections but is never elected master itself. It acts as a tie-breaker, which helps avoid split-brain conditions, and it can usually run on a machine with lower CPU performance.
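To make the role combinations concrete, here is a small sketch that builds the role flags for a node and renders them as elasticsearch.yml lines. The helper names (node_settings, to_yaml_lines, is_coordinating_only) are hypothetical, and the sketch assumes the legacy node.* parameters from the table above, with a coordinating-only node being one where all four role flags are false.

```python
ROLE_KEYS = ("node.master", "node.data", "node.ingest", "node.ml")


def node_settings(master=True, data=True, ingest=True, ml=True, voting_only=False):
    """Return the role flags for one node as a dict (defaults match the table above)."""
    if voting_only and not master:
        # A voting-only node must also be master-eligible.
        raise ValueError("node.voting_only requires node.master: true")
    return {
        "node.master": master,
        "node.data": data,
        "node.ingest": ingest,
        "node.ml": ml,
        "node.voting_only": voting_only,
    }


def to_yaml_lines(settings):
    """Render the flags as lines that could be pasted into elasticsearch.yml."""
    return [f"{key}: {str(value).lower()}" for key, value in settings.items()]


def is_coordinating_only(settings):
    """True when the node has none of the four roles enabled."""
    return not any(settings[key] for key in ROLE_KEYS)


# A voting-only tie-breaker: master-eligible, but holds no data and runs no pipelines.
tie_breaker = node_settings(data=False, ingest=False, ml=False, voting_only=True)
print("\n".join(to_yaml_lines(tie_breaker)))

coordinator = node_settings(master=False, data=False, ingest=False, ml=False)
print(is_coordinating_only(coordinator))
```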

We can use the following command to list all master-eligible nodes that can currently vote:

GET /_cluster/state?filter_path=metadata.cluster_coordination.last_committed_config

A result similar to the following list is obtained:

{
  "metadata" : {
    "cluster_coordination" : {
      "last_committed_config" : [
        "bhsEXqpaT7K2PEmXSYlbsg",
        "chsEXqpaT7K2PEmXSYlbsg",
        "dhsEXqpaT7K2PEmXSYlbsg"
      ]
    }
  }
}
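A master election needs the votes of more than half of the nodes in this voting configuration. The sketch below works through that arithmetic for the three node ids listed above; the function names are made up for illustration and do not correspond to any Elasticsearch API.

```python
def election_quorum(voting_config):
    """Smallest number of votes that is more than half of the voting configuration."""
    return len(voting_config) // 2 + 1


def can_elect_master(voting_config, reachable_nodes):
    """True if enough voting nodes are reachable to form a majority."""
    votes = len(set(voting_config) & set(reachable_nodes))
    return votes >= election_quorum(voting_config)


voting_config = [
    "bhsEXqpaT7K2PEmXSYlbsg",
    "chsEXqpaT7K2PEmXSYlbsg",
    "dhsEXqpaT7K2PEmXSYlbsg",
]

print(election_quorum(voting_config))
# Losing one of three voting nodes still leaves a majority; losing two does not.
print(can_elect_master(voting_config, voting_config[:2]))
print(can_elect_master(voting_config, voting_config[:1]))
```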

In the entire Elastic architecture, the relationship between Data Node and Cluster is expressed as follows:

Document

Elasticsearch is document-oriented, which means that the smallest unit of data you index or search is a document.

Documents have some important properties in Elasticsearch:

Self-contained: a document contains fields (names) and their values.
Hierarchical: think of it as documents within documents. A field can hold another document, which in turn contains its own fields and values.
Flexible structure: documents do not depend on a predefined schema.

Compared with a relational database, a document corresponds to a row (record) in a table.
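To illustrate the hierarchical property, the sketch below represents a document as a nested Python dict and flattens it into dotted field names, which is how fields of inner objects are addressed in Elasticsearch (for example location.name). The sample document and the flatten helper are invented for this example.

```python
def flatten(document, prefix=""):
    """Flatten nested objects into a single level of dotted field names."""
    fields = {}
    for name, value in document.items():
        path = f"{prefix}{name}"
        if isinstance(value, dict):
            # An inner object contributes its own fields under "<outer>.<inner>".
            fields.update(flatten(value, prefix=path + "."))
        else:
            fields[path] = value
    return fields


# Hypothetical document: two plain fields and one inner object.
doc = {
    "name": "Elasticsearch Denver",
    "organizer": "Lee",
    "location": {
        "name": "Denver, Colorado, USA",
        "geolocation": "39.7392, -104.9847",
    },
}

for field, value in flatten(doc).items():
    print(field, "=", value)
```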

Type

A type is a logical container for a document, similar to how a table is a container for rows.

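You put documents with different structures (schemas) in different types. For example, you can use one type for get-together groups and another type for the events where people gather. Note that mapping types were deprecated in Elasticsearch 7.x and removed in 8.0, so recent versions keep a single document structure per index.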

In Elasticsearch, we do not have to define a mapping up front; we can write documents directly to the index we specify. The mapping of that index is then generated dynamically (this behavior can also be disabled). The data type of each field is detected automatically, for example dates and strings, although some types, such as geo_point and other geographic data, still have to be mapped manually.
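The following sketch mimics, in a heavily simplified way, how dynamic mapping picks a type from a JSON value. It is not Elasticsearch's implementation: date detection is approximated with Python's datetime.fromisoformat, arrays and nulls are left out, and the function names are hypothetical. It does show why a "lat,lon" string ends up as text unless geo_point is mapped by hand.

```python
from datetime import datetime


def guess_field_type(value):
    """Guess a field type from a JSON value, loosely following dynamic mapping."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)  # stand-in for Elasticsearch's date detection
            return "date"
        except ValueError:
            return "text"
    return "unknown"  # arrays and nulls are not handled in this sketch


def guess_mapping(document):
    """Apply the guess to every top-level field of a document."""
    return {name: guess_field_type(value) for name, value in document.items()}


# Hypothetical document; "location" looks like a geo point but is only a string.
doc = {
    "user": "kimchy",
    "age": 30,
    "active": True,
    "score": 1.5,
    "created": "2015-01-01T12:10:30",
    "location": "41.12,-71.34",
    "address": {"city": "Denver"},
}
print(guess_mapping(doc))
```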

Index

In Elasticsearch, an index is a collection of documents.

Each index consists of one or more documents, and these documents can be distributed among different shards.

Many people think of an index as the equivalent of a database in a relational database system. There is some truth to this, but the two are not quite the same. One important reason is that documents in Elasticsearch can have object and nested structures. An index is a logical namespace that maps to one or more primary shards and can have zero or more replica shards.

Whenever a document comes in, a hash is computed from its routing value (by default the id of the document), and the document is stored in the shard that the hash selects. This spreads documents evenly across the shards, so that no single shard becomes too large or too busy.

shard_num = hash(_routing) % num_primary_shards

The formula also shows why the number of primary shards cannot be changed dynamically: with a different divisor, the formula would point to a different shard, and existing documents could no longer be found. The number of replicas, on the other hand, can be changed dynamically.
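The effect of the formula can be sketched in a few lines. Elasticsearch uses its own hash function internally; this example swaps in zlib.crc32 purely as a deterministic stand-in, and the document ids are invented. The point is only that the same routing value maps to a different shard once the divisor changes.

```python
import zlib


def shard_for(routing, num_primary_shards):
    """shard_num = hash(_routing) % num_primary_shards, with crc32 as a stand-in hash."""
    digest = zlib.crc32(routing.encode("utf-8"))
    return digest % num_primary_shards


# Hypothetical document ids, used as routing values (the default behavior).
doc_ids = [f"doc-{i}" for i in range(100)]

before = {doc_id: shard_for(doc_id, 5) for doc_id in doc_ids}
after = {doc_id: shard_for(doc_id, 6) for doc_id in doc_ids}

# Documents whose computed shard differs after changing 5 primary shards to 6.
moved = [doc_id for doc_id in doc_ids if before[doc_id] != after[doc_id]]
print(f"{len(moved)} of {len(doc_ids)} documents would be looked up on a different shard")
```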

Shard

Elasticsearch is a distributed search engine, so indexes are often split into elements called shards that are distributed across multiple nodes. Elasticsearch automatically manages the arrangement of these shards. It also rebalances shards as needed, so users don't need to worry about the details.

Thanks to sharding, an index can store more data than the hardware of a single node allows. For example, an index of 2 billion documents may take up 2 TB of disk space while no single node has that much disk, or a single node may be too slow to serve search requests on its own. To solve this problem, Elasticsearch can divide the index into multiple parts, which are called shards. When you create an index, you can specify the number of shards you want. Each shard is itself a fully functional and independent "index" that can be placed on any node in the cluster.

Sharding:

Allows you to horizontally split/scale your content volume.
Allows you to perform distributed, parallel operations across shards (potentially on multiple nodes), thereby increasing performance/throughput.

Shard types:

Primary shard
Each document is stored in a primary shard. When a document is indexed, it is first indexed on the primary shard and then on all replicas of that shard. An index can contain one or more primary shards. This number determines how far the index can scale as its data grows. After the index is created, the number of primary shards cannot be changed.
Replica shard
Each primary shard can have zero or more replicas. A replica is a copy of the primary shard and serves two purposes:
Failover: if the primary shard fails, a replica shard can be promoted to primary.
Performance: get and search requests can be handled by either the primary or the replica shards.

By default, each primary shard has one replica, but the number of replicas can be changed dynamically on an existing index. A replica shard is never started on the same node as its primary shard.
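The rule that a replica never shares a node with its primary can be illustrated with a toy placement routine. This is a simple round-robin sketch, not Elasticsearch's real allocator, and the node names and the place_shards function are hypothetical. Copies that cannot be placed on a distinct node are reported as unassigned.

```python
def place_shards(num_primaries, num_replicas, nodes):
    """Toy round-robin placement: copies of one shard always land on distinct nodes."""
    placement = {node: [] for node in nodes}
    unassigned = []
    for shard in range(num_primaries):
        for copy in range(num_replicas + 1):
            label = f"P{shard}" if copy == 0 else f"R{shard}"
            if copy >= len(nodes):
                # No node left that does not already hold a copy of this shard.
                unassigned.append(label)
                continue
            node = nodes[(shard + copy) % len(nodes)]
            placement[node].append(label)
    return placement, unassigned


# 5 primary shards with 1 replica each, spread over three hypothetical nodes.
placement, unassigned = place_shards(5, 1, ["node-1", "node-2", "node-3"])
for node, shards in placement.items():
    print(node, shards)
print("unassigned:", unassigned)
```

With a single node, the same call leaves every replica unassigned, since the only node already holds each primary.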

For example, an index with 5 primary shards and 1 replica looks as follows:

These shards are distributed on different physical machines:

The number of shards and replicas can be set for each index:

curl -XPUT 'http://localhost:9200/wechat?pretty' -H 'Content-Type: application/json' -d '
{
  "settings": { "index.number_of_shards": 2, "index.number_of_replicas": 1 }
}'

The request above creates the index wechat with 2 primary shards and 1 replica of each. Once the number of primary shards is set, it cannot be changed, because Elasticsearch assigns each document to a shard based on the document's id and the number of primary shards. If that number were changed later, a document might no longer be found on the shard it was originally routed to.

We can view the settings in our index through the following interface:

curl -XGET http://localhost:9200/wechat/_settings?pretty
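The same requests can be prepared from Python with the standard library. The sketch below only builds the request objects and does not send them, so it runs without a cluster. The function names are made up; the host, port, and index name are taken from the curl examples, and the second function uses the index settings endpoint to change the replica count on an existing index.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # same host and port as the curl examples


def _put_json(url, body):
    """Build (but do not send) a PUT request with a JSON body."""
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def create_index_request(index, shards, replicas):
    """Request that creates an index with the given shard and replica counts."""
    body = {
        "settings": {
            "index.number_of_shards": shards,
            "index.number_of_replicas": replicas,
        }
    }
    return _put_json(f"{BASE_URL}/{index}", body)


def update_replicas_request(index, replicas):
    """Request that changes the replica count of an existing index."""
    return _put_json(f"{BASE_URL}/{index}/_settings", {"index": {"number_of_replicas": replicas}})


create = create_index_request("wechat", 2, 1)
update = update_replicas_request("wechat", 2)
print(create.get_method(), create.full_url, create.data.decode())
print(update.get_method(), update.full_url, update.data.decode())

# To actually send a request (needs a running cluster):
#   with urllib.request.urlopen(create) as response:
#       print(response.read().decode())
```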

Replica

By default, Elasticsearch creates one primary shard and one replica for each index. This means that each index will contain one primary shard and each shard will have one replica.

Allocating multiple shards and replicas is the essence of the design of the distributed search function, providing high availability and fast access to documents in the index. The main difference between primary and replica shards is that only primary shards can accept indexing requests. Both replica and primary shards can serve query requests.

Note: the number_of_shards value applies to an index, not to the cluster as a whole. It specifies the number of primary shards per index (not the total number of primary shards in the cluster).

Get the health of an index through the following interface:

 http://localhost:9200/_cat/indices/twitter

Querying further, we can see:

If an index is shown in red, at least one primary shard of the index has not been allocated, so that shard and its replicas cannot be accessed. If it is green, every primary shard and every replica of the index has been allocated. If one node then fails, the replica on another node takes over, so no data is lost.

Shard health

Red: At least one primary shard is not allocated in the cluster.
Yellow: All primary shards are assigned, but at least one replica is not assigned.
Green: All shards are allocated.
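The three colors follow directly from which shard copies are assigned. A minimal sketch of that rule, using a made-up index_health function and a list of (kind, assigned) pairs as input:

```python
def index_health(shards):
    """Derive a health color from (kind, assigned) pairs, kind being "primary" or "replica"."""
    if any(kind == "primary" and not assigned for kind, assigned in shards):
        return "red"  # at least one primary shard is unassigned
    if any(kind == "replica" and not assigned for kind, assigned in shards):
        return "yellow"  # all primaries are assigned, but some replica is not
    return "green"  # every shard copy is assigned


# One primary with one replica, as on a single-node cluster where the replica has no home.
print(index_health([("primary", True), ("replica", False)]))
print(index_health([("primary", True), ("replica", True)]))
print(index_health([("primary", False), ("replica", False)]))
```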