[TOC]

Movie Search Service

Requirements Analysis and Architecture Design

https://www.elastic.co/blog/e...

Importing Movie Data into Elasticsearch

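The original screenshots for this step are not recoverable. As a rough, illustrative sketch of the idea only (the movies index name and its fields below are assumptions, not the exact dataset used here), movie documents can be bulk-indexed into ES:

// Illustrative only: create a small movies index and bulk-index two documents
PUT movies
{
  "mappings": {
    "properties": {
      "title":  { "type": "text" },
      "genres": { "type": "keyword" },
      "year":   { "type": "integer" }
    }
  }
}

POST movies/_bulk
{ "index": { "_id": "1" } }
{ "title": "The Matrix", "genres": ["Action", "Sci-Fi"], "year": 1999 }
{ "index": { "_id": "2" } }
{ "title": "Toy Story", "genres": ["Animation", "Comedy"], "year": 1995 }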

Building Your Movie Search Service

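The screenshots here are also gone. Assuming the illustrative movies index above, a minimal sketch of the kind of full-text query such a search service would send:

// Illustrative only: full-text search on the movie title
GET movies/_search
{
  "query": {
    "match": {
      "title": "matrix"
    }
  }
}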

Stack Overflow User Survey Analysis

Requirements Analysis and Architecture Design

https://insights.stackoverflo...

Data Extract & Enrichment

  1. The data to analyze: the Stack Overflow 2020 Developer Survey
  2. Import the data into ES with Logstash
  3. Apply the related configuration in ES

    1. Use an Ingest pipeline to split fields and convert field types
    2. Use a dynamic template to set the field types of the imported data
  4. After the operations below have been completed, create an Index Pattern

    In Kibana: under Management → Index Patterns, create an Index Pattern named "final-stackoverflow-survey" that matches only the final-stackoverflow-survey index.

Logstash configuration: stackoverflow-surver.conf

input {
    file {
        # The path here must be absolute, not relative
        path => ["D:/Programming/logstash-7.9.3/survey_results_public.csv"]
        start_position => "beginning"
        # sincedb_path => "/dev/null"
        # There is no /dev/null on Windows; use "nul" instead.
        sincedb_path => "nul"
    }
}

filter {
    csv {
        autogenerate_column_names => false
        skip_empty_columns => true
        separator => ","
        columns => [
            "Respondent",
            "MainBranch",
            "Hobbyist",
            "Age",
            "Age1stCode",
            "CompFreq",
            "CompTotal",
            "ConvertedComp",
            "Country",
            "CurrencyDesc",
            "CurrencySymbol",
            "DatabaseDesireNextYear",
            "DatabaseWorkedWith",
            "DevType",
            "EdLevel",
            "Employment",
            "Ethnicity",
            "Gender",
            "JobFactors",
            "JobSat",
            "JobSeek",
            "LanguageDesireNextYear",
            "LanguageWorkedWith",
            "MiscTechDesireNextYear",
            "MiscTechWorkedWith",
            "NEWCollabToolsDesireNextYear",
            "NEWCollabToolsWorkedWith",
            "NEWDevOps",
            "NEWDevOpsImpt",
            "NEWEdImpt",
            "NEWJobHunt",
            "NEWJobHuntResearch",
            "NEWLearn",
            "NEWOffTopic",
            "NEWOnboardGood",
            "NEWOtherComms",
            "NEWOvertime",
            "NEWPurchaseResearch",
            "NEWPurpleLink",
            "NEWSOSites",
            "NEWStuck",
            "OpSys",
            "OrgSize",
            "PlatformDesireNextYear",
            "PlatformWorkedWith",
            "PurchaseWhat",
            "Sexuality",
            "SOAccount",
            "SOComm",
            "SOPartFreq",
            "SOVisitFreq",
            "SurveyEase",
            "SurveyLength",
            "Trans",
            "UndergradMajor",
            "WebframeDesireNextYear",
            "WebframeWorkedWith",
            "WelcomeChange",
            "WorkWeekHrs",
            "YearsCode",
            "YearsCodePro"
        ]
    }

    # Drop the CSV header row (where every column value equals its own column name)
    if [Respondent] == "Respondent" {
        drop { }
    }

    # Remove the extra fields added by Logstash that we do not need
    mutate {
        remove_field => ["message", "@version", "@timestamp", "host"]
    }
}

output {
    # Print a dot per event so import progress is visible
    stdout {
        codec => "dots"
    }
    
    # Write the events to the stackoverflow-survey-raw index in ES
    elasticsearch {
        hosts => ["http://localhost:9200"]
        index => "stackoverflow-survey-raw"
        # document_type is deprecated with ES 7.x and can be omitted
        document_type => "_doc"
    }
}
  • Input Plugin

    • File Input
  • Filter Plugin

    • CSV Filter
    • Mutate Filter
  • Output Plugin

    • ES Output

Example of running Logstash on Windows to import the data

.\bin\logstash.bat -f .\stackoverflow-surver.conf

Execute the following in Kibana

// (Optional) delete the raw index first when re-running the import
DELETE stackoverflow-survey-raw

// Inspect the field types of the imported data: they are all string
// We do not need full-text search on these fields, but we do need aggregations, so they should be mapped as keyword
GET stackoverflow-survey-raw


// Create the target index and set up its dynamic mapping
PUT final-stackoverflow-survey
{
  "mappings": {
    // Map all string fields as keyword
    "dynamic_templates": [
      {
        "string_as_keyword": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  },   
  "settings": {
    // Set the number of replica shards to 0
    "number_of_replicas": 0
  }
}

GET stackoverflow-survey-raw/_search
  • Use a Dynamic Template to handle the mapping of text fields
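
To confirm the dynamic template behaves as intended before reindexing everything, one option is to index a single test document and inspect the resulting mapping (the test document below is illustrative):

// Illustrative check: dynamic_templates are applied when a document with new fields is indexed
PUT final-stackoverflow-survey/_doc/test-1
{
  "Country": "China",
  "MainBranch": "I am a developer by profession"
}

// Country and MainBranch should now be mapped as keyword
GET final-stackoverflow-survey/_mapping

// Remove the test document before the real reindex
DELETE final-stackoverflow-survey/_doc/test-1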

Execute the following in Kibana

// Create an Ingest Pipeline that splits some fields and converts field types
PUT _ingest/pipeline/stackoverflow_pipeline
{
  "description": "Pipeline for stackoverflow survey",
  "processors": [
    {
      "split": {
        "field": "NEWPurchaseResearch",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "NEWSOSites",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "NEWStuck",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "DevType",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "NEWJobHunt",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "NEWJobHuntResearch",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "DatabaseDesireNextYear",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "DatabaseWorkedWith",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "LanguageWorkedWith",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "LanguageDesireNextYear",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "MiscTechDesireNextYear",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "MiscTechWorkedWith",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "PlatformDesireNextYear",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "PlatformWorkedWith",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "WebframeWorkedWith",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "WebframeDesireNextYear",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "NEWCollabToolsDesireNextYear",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "NEWCollabToolsWorkedWith",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "JobFactors",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "Ethnicity",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "Sexuality",
        "separator": ";"
      }
    },
    {
      "convert": {
        "field": "YearsCode",
        "type": "integer",
        "on_failure": [
          {
            "set": {
              "field": "YearsCode",
              "value": 0
            }
          }
        ]
      }
    },
    {
      "convert": {
        "field": "WorkWeekHrs",
        "type": "integer",
        "on_failure": [
          {
            "set": {
              "field": "WorkWeekHrs",
              "value": 0
            }
          }
        ]
      }
    },
    {
      "convert": {
        "field": "Age",
        "type": "integer",
        "on_failure": [
          {
            "set": {
              "field": "Age",
              "value": 0
            }
          }
        ]
      }
    },
    {
      "convert": {
        "field": "Age1stCode",
        "type": "integer",
        "on_failure": [
          {
            "set": {
              "field": "Age1stCode",
              "value": 0
            }
          }
        ]
      }
    },
    {
      "convert": {
        "field": "YearsCodePro",
        "type": "integer",
        "on_failure": [
          {
            "set": {
              "field": "YearsCodePro",
              "value": 0
            }
          }
        ]
      }
    }
  ]
}

// Reindex the data imported by Logstash into the final-stackoverflow-survey index, applying the ingest pipeline created above
POST _reindex
{
  "source": {
    "index": "stackoverflow-survey-raw"
  },
  "dest": {
    "index": "final-stackoverflow-survey",
    "pipeline": "stackoverflow_pipeline"
  }
}

GET final-stackoverflow-survey

GET final-stackoverflow-survey/_search
  • Create an Ingest Pipeline

    • Split some string fields into arrays
    • Convert some fields to integers
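
The pipeline can also be dry-run with the Simulate Pipeline API before the full reindex. The sample document below is illustrative; it includes every field the pipeline touches, because the split processors report an error when their target field is missing:

// Dry-run the pipeline on one sample document (values are illustrative)
POST _ingest/pipeline/stackoverflow_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "NEWPurchaseResearch": "Start a free trial;Ask developers I know",
        "NEWSOSites": "Stack Overflow;Stack Exchange",
        "NEWStuck": "Visit Stack Overflow;Google it",
        "DevType": "Developer, back-end;Developer, full-stack",
        "NEWJobHunt": "Better compensation;Curious about other opportunities",
        "NEWJobHuntResearch": "Read company reviews;Asked friends",
        "DatabaseDesireNextYear": "PostgreSQL;Redis",
        "DatabaseWorkedWith": "MySQL;Redis",
        "LanguageWorkedWith": "C#;JavaScript;PHP",
        "LanguageDesireNextYear": "Go;Rust",
        "MiscTechDesireNextYear": "Docker;Kubernetes",
        "MiscTechWorkedWith": "Docker",
        "PlatformDesireNextYear": "Linux;Docker",
        "PlatformWorkedWith": "Linux;Windows",
        "WebframeWorkedWith": "Laravel;Vue.js",
        "WebframeDesireNextYear": "Vue.js",
        "NEWCollabToolsDesireNextYear": "Slack;Jira",
        "NEWCollabToolsWorkedWith": "Slack",
        "JobFactors": "Languages, frameworks;Remote work options",
        "Ethnicity": "East Asian",
        "Sexuality": "Straight / Heterosexual",
        "YearsCode": "7",
        "YearsCodePro": "4",
        "WorkWeekHrs": "40",
        "Age": "28",
        "Age1stCode": "18"
      }
    }
  ]
}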

Building the Insights Dashboard

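The dashboard screenshots are not recoverable. As an illustration of the kind of aggregation the Kibana visualizations run under the hood (the exact panels differ), a terms aggregation on the languages field with a sub-aggregation on age:

// Illustrative: top 10 languages respondents worked with, plus the average age per language
GET final-stackoverflow-survey/_search
{
  "size": 0,
  "aggs": {
    "top_languages": {
      "terms": {
        "field": "LanguageWorkedWith",
        "size": 10
      },
      "aggs": {
        "avg_age": {
          "avg": { "field": "Age" }
        }
      }
    }
  }
}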

Elastic Certification

Certification

... (details omitted)

Exam Syllabus Summary

  • Installation and configuration

    • Deploy and configure a cluster according to requirements
    • Configure the nodes of a cluster
    • Secure a cluster
    • Configure RBAC for a cluster based on X-Pack
  • Indexing data

    • Define an index according to requirements
    • Perform indexing and CRUD operations on an index
    • Define and use an Index Alias
    • Define and use an Index Template
    • Define and use a Dynamic Template
    • Reindex documents with the Reindex API & Update By Query
    • Define an Ingest Pipeline (including the use of Painless scripts)
  • Queries

    • Query one or more fields with term or phrase queries
    • Use the Bool query
    • Highlight query results
    • Sort query results
    • Paginate query results
    • Use the Scroll API
    • Use fuzzy queries
    • Use Search Templates

      You may rarely use them in day-to-day work, but they cleanly separate the definition of a search from its use (see the sketch after this list)
    • Cross-cluster search
  • Aggregations

    • Metric & Bucket aggregations
    • Sub-aggregations
    • Pipeline aggregations
  • Mappings and text analysis

    • Define index mappings as required
    • Define custom analyzers as required
    • Define multi-fields for a field (sub-fields with different types and analyzers)
    • Define and query nested documents
    • Define and query parent/child documents
  • Cluster administration

    • Allocate an index's shards to specific nodes as required
    • Configure shard allocation awareness & forced awareness for indices
    • Diagnose shard issues and restore the cluster to a healthy state
    • Back up & restore a cluster or specific indices
    • Configure a cluster with a hot & warm architecture
    • Configure cross-cluster search
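
A minimal sketch of the Search Template item above, runnable against the survey index from the previous section; the template id survey_by_country and its country parameter are made up for illustration:

// Register a mustache search template (the id is illustrative)
PUT _scripts/survey_by_country
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "term": {
          "Country": "{{country}}"
        }
      }
    }
  }
}

// Use the template: callers only supply parameters, not the query definition
GET final-stackoverflow-survey/_search/template
{
  "id": "survey_by_country",
  "params": {
    "country": "China"
  }
}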

Mock Exam

Cluster Backup and Restore

  • Here, my_fs_backup is the snapshot repository that was created

  • A snapshot can be created for specific indices only

  • Before restoring, the index must first be deleted (or closed); otherwise an error is returned.

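The screenshots with the exact commands are gone, so here is a sketch of the typical sequence, assuming the my_fs_backup repository mentioned above; the repository location and the snapshot name are illustrative:

// Register a shared filesystem repository (its location must be listed under path.repo in elasticsearch.yml)
PUT _snapshot/my_fs_backup
{
  "type": "fs",
  "settings": {
    "location": "my_fs_backup_location"
  }
}

// Snapshot only the survey index
PUT _snapshot/my_fs_backup/snapshot_1?wait_for_completion=true
{
  "indices": "final-stackoverflow-survey"
}

// Delete the existing index before restoring, otherwise the restore fails
DELETE final-stackoverflow-survey

POST _snapshot/my_fs_backup/snapshot_1/_restore
{
  "indices": "final-stackoverflow-survey"
}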

