[TOC]

Movie Search Service

Requirements Analysis and Architecture Design

https://www.elastic.co/blog/e...

Importing Movie Data into Elasticsearch

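The original screenshots for this step are not recoverable. As a rough, illustrative sketch of the idea only (the movies index name and its fields below are assumptions, not the exact dataset used here), movie documents can be bulk-indexed into ES:

// Illustrative only: create a small movies index and bulk-index two documents
PUT movies
{
  "mappings": {
    "properties": {
      "title":  { "type": "text" },
      "genres": { "type": "keyword" },
      "year":   { "type": "integer" }
    }
  }
}

POST movies/_bulk
{ "index": { "_id": "1" } }
{ "title": "The Matrix", "genres": ["Action", "Sci-Fi"], "year": 1999 }
{ "index": { "_id": "2" } }
{ "title": "Toy Story", "genres": ["Animation", "Comedy"], "year": 1995 }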

Building Your Movie Search Service

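The screenshots here are also gone. Assuming the illustrative movies index above, a minimal sketch of the kind of full-text query such a search service would send:

// Illustrative only: full-text search on the movie title
GET movies/_search
{
  "query": {
    "match": {
      "title": "matrix"
    }
  }
}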

Stack Overflow User Survey Analysis

Requirements Analysis and Architecture Design

https://insights.stackoverflo...

Data Extract & Enrichment

  1. The data to analyze: the Stack Overflow 2020 Developer Survey
  2. Import the data into ES with Logstash
  3. Apply the related configuration in ES

    1. Use an Ingest pipeline to split fields and convert field types
    2. Use a dynamic template to set the field types of the imported data
  4. After the operations below have been completed, create an Index Pattern

    In Kibana: under Management → Index Patterns, create an Index Pattern named "final-stackoverflow-survey" that matches only the final-stackoverflow-survey index.

Logstash configuration: stackoverflow-surver.conf

input {
    file {
        # The path here must be absolute, not relative
        path => ["D:/Programming/logstash-7.9.3/survey_results_public.csv"]
        start_position => "beginning"
        # sincedb_path => "/dev/null"
        # There is no /dev/null on Windows; use "nul" instead.
        sincedb_path => "nul"
    }
}

filter {
    csv {
        autogenerate_column_names => false
        skip_empty_columns => true
        separator => ","
        columns => [
            "Respondent",
            "MainBranch",
            "Hobbyist",
            "Age",
            "Age1stCode",
            "CompFreq",
            "CompTotal",
            "ConvertedComp",
            "Country",
            "CurrencyDesc",
            "CurrencySymbol",
            "DatabaseDesireNextYear",
            "DatabaseWorkedWith",
            "DevType",
            "EdLevel",
            "Employment",
            "Ethnicity",
            "Gender",
            "JobFactors",
            "JobSat",
            "JobSeek",
            "LanguageDesireNextYear",
            "LanguageWorkedWith",
            "MiscTechDesireNextYear",
            "MiscTechWorkedWith",
            "NEWCollabToolsDesireNextYear",
            "NEWCollabToolsWorkedWith",
            "NEWDevOps",
            "NEWDevOpsImpt",
            "NEWEdImpt",
            "NEWJobHunt",
            "NEWJobHuntResearch",
            "NEWLearn",
            "NEWOffTopic",
            "NEWOnboardGood",
            "NEWOtherComms",
            "NEWOvertime",
            "NEWPurchaseResearch",
            "NEWPurpleLink",
            "NEWSOSites",
            "NEWStuck",
            "OpSys",
            "OrgSize",
            "PlatformDesireNextYear",
            "PlatformWorkedWith",
            "PurchaseWhat",
            "Sexuality",
            "SOAccount",
            "SOComm",
            "SOPartFreq",
            "SOVisitFreq",
            "SurveyEase",
            "SurveyLength",
            "Trans",
            "UndergradMajor",
            "WebframeDesireNextYear",
            "WebframeWorkedWith",
            "WelcomeChange",
            "WorkWeekHrs",
            "YearsCode",
            "YearsCodePro"
        ]
    }

    # Drop the CSV header row (where every column value equals its own column name)
    if [Respondent] == "Respondent" {
        drop { }
    }

    # Remove the extra fields added by Logstash that we do not need
    mutate {
        remove_field => ["message", "@version", "@timestamp", "host"]
    }
}

output {
    # Print a dot per event so import progress is visible
    stdout {
        codec => "dots"
    }
    
    # Write the events to the stackoverflow-survey-raw index in ES
    elasticsearch {
        hosts => ["http://localhost:9200"]
        index => "stackoverflow-survey-raw"
        # document_type is deprecated with ES 7.x and can be omitted
        document_type => "_doc"
    }
}
  • Input Plugin

    • File Input
  • Filter Plugin

    • CSV Filter
    • Mutate Filter
  • Output Plugin

    • ES Output

Example of running Logstash on Windows to import the data

.\bin\logstash.bat -f .\stackoverflow-surver.conf

Execute the following in Kibana

// (Optional) delete the raw index first when re-running the import
DELETE stackoverflow-survey-raw

// Inspect the field types of the imported data: they are all string
// We do not need full-text search on these fields, but we do need aggregations, so they should be mapped as keyword
GET stackoverflow-survey-raw


// Create the target index and set up its dynamic mapping
PUT final-stackoverflow-survey
{
  "mappings": {
    // Map all string fields as keyword
    "dynamic_templates": [
      {
        "string_as_keyword": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  },   
  "settings": {
    // Set the number of replica shards to 0
    "number_of_replicas": 0
  }
}

GET stackoverflow-survey-raw/_search
  • Use a Dynamic Template to handle the mapping of text fields
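
To confirm the dynamic template behaves as intended before reindexing everything, one option is to index a single test document and inspect the resulting mapping (the test document below is illustrative):

// Illustrative check: dynamic_templates are applied when a document with new fields is indexed
PUT final-stackoverflow-survey/_doc/test-1
{
  "Country": "China",
  "MainBranch": "I am a developer by profession"
}

// Country and MainBranch should now be mapped as keyword
GET final-stackoverflow-survey/_mapping

// Remove the test document before the real reindex
DELETE final-stackoverflow-survey/_doc/test-1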

Execute the following in Kibana

// Create an Ingest Pipeline that splits some fields and converts field types
PUT _ingest/pipeline/stackoverflow_pipeline
{
  "description": "Pipeline for stackoverflow survey",
  "processors": [
    {
      "split": {
        "field": "NEWPurchaseResearch",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "NEWSOSites",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "NEWStuck",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "DevType",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "NEWJobHunt",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "NEWJobHuntResearch",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "DatabaseDesireNextYear",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "DatabaseWorkedWith",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "LanguageWorkedWith",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "LanguageDesireNextYear",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "MiscTechDesireNextYear",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "MiscTechWorkedWith",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "PlatformDesireNextYear",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "PlatformWorkedWith",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "WebframeWorkedWith",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "WebframeDesireNextYear",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "NEWCollabToolsDesireNextYear",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "NEWCollabToolsWorkedWith",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "JobFactors",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "Ethnicity",
        "separator": ";"
      }
    },
    {
      "split": {
        "field": "Sexuality",
        "separator": ";"
      }
    },
    {
      "convert": {
        "field": "YearsCode",
        "type": "integer",
        "on_failure": [
          {
            "set": {
              "field": "YearsCode",
              "value": 0
            }
          }
        ]
      }
    },
    {
      "convert": {
        "field": "WorkWeekHrs",
        "type": "integer",
        "on_failure": [
          {
            "set": {
              "field": "WorkWeekHrs",
              "value": 0
            }
          }
        ]
      }
    },
    {
      "convert": {
        "field": "Age",
        "type": "integer",
        "on_failure": [
          {
            "set": {
              "field": "Age",
              "value": 0
            }
          }
        ]
      }
    },
    {
      "convert": {
        "field": "Age1stCode",
        "type": "integer",
        "on_failure": [
          {
            "set": {
              "field": "Age1stCode",
              "value": 0
            }
          }
        ]
      }
    },
    {
      "convert": {
        "field": "YearsCodePro",
        "type": "integer",
        "on_failure": [
          {
            "set": {
              "field": "YearsCodePro",
              "value": 0
            }
          }
        ]
      }
    }
  ]
}

// Reindex the data imported by Logstash into the final-stackoverflow-survey index, applying the ingest pipeline created above
POST _reindex
{
  "source": {
    "index": "stackoverflow-survey-raw"
  },
  "dest": {
    "index": "final-stackoverflow-survey",
    "pipeline": "stackoverflow_pipeline"
  }
}

GET final-stackoverflow-survey

GET final-stackoverflow-survey/_search
  • Create an Ingest Pipeline

    • Split some string fields into arrays
    • Convert some fields to integers
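
The pipeline can also be dry-run with the Simulate Pipeline API before the full reindex. The sample document below is illustrative; it includes every field the pipeline touches, because the split processors report an error when their target field is missing:

// Dry-run the pipeline on one sample document (values are illustrative)
POST _ingest/pipeline/stackoverflow_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "NEWPurchaseResearch": "Start a free trial;Ask developers I know",
        "NEWSOSites": "Stack Overflow;Stack Exchange",
        "NEWStuck": "Visit Stack Overflow;Google it",
        "DevType": "Developer, back-end;Developer, full-stack",
        "NEWJobHunt": "Better compensation;Curious about other opportunities",
        "NEWJobHuntResearch": "Read company reviews;Asked friends",
        "DatabaseDesireNextYear": "PostgreSQL;Redis",
        "DatabaseWorkedWith": "MySQL;Redis",
        "LanguageWorkedWith": "C#;JavaScript;PHP",
        "LanguageDesireNextYear": "Go;Rust",
        "MiscTechDesireNextYear": "Docker;Kubernetes",
        "MiscTechWorkedWith": "Docker",
        "PlatformDesireNextYear": "Linux;Docker",
        "PlatformWorkedWith": "Linux;Windows",
        "WebframeWorkedWith": "Laravel;Vue.js",
        "WebframeDesireNextYear": "Vue.js",
        "NEWCollabToolsDesireNextYear": "Slack;Jira",
        "NEWCollabToolsWorkedWith": "Slack",
        "JobFactors": "Languages, frameworks;Remote work options",
        "Ethnicity": "East Asian",
        "Sexuality": "Straight / Heterosexual",
        "YearsCode": "7",
        "YearsCodePro": "4",
        "WorkWeekHrs": "40",
        "Age": "28",
        "Age1stCode": "18"
      }
    }
  ]
}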

Building the Insights Dashboard

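The dashboard screenshots are not recoverable. As an illustration of the kind of aggregation the Kibana visualizations run under the hood (the exact panels differ), a terms aggregation on the languages field with a sub-aggregation on age:

// Illustrative: top 10 languages respondents worked with, plus the average age per language
GET final-stackoverflow-survey/_search
{
  "size": 0,
  "aggs": {
    "top_languages": {
      "terms": {
        "field": "LanguageWorkedWith",
        "size": 10
      },
      "aggs": {
        "avg_age": {
          "avg": { "field": "Age" }
        }
      }
    }
  }
}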

Elastic Certification

Certification

... (details omitted)

Exam Syllabus Summary

  • Installation and configuration

    • Deploy and configure a cluster according to requirements
    • Configure the nodes of a cluster
    • Secure a cluster
    • Configure RBAC for a cluster based on X-Pack
  • Indexing data

    • Define an index according to requirements
    • Perform indexing and CRUD operations on an index
    • Define and use an Index Alias
    • Define and use an Index Template
    • Define and use a Dynamic Template
    • Reindex documents with the Reindex API & Update By Query
    • Define an Ingest Pipeline (including the use of Painless scripts)
  • Queries

    • Query one or more fields with term or phrase queries
    • Use the Bool query
    • Highlight query results
    • Sort query results
    • Paginate query results
    • Use the Scroll API
    • Use fuzzy queries
    • Use Search Templates

      You may rarely use them in day-to-day work, but they cleanly separate the definition of a search from its use (see the sketch after this list)
    • Cross-cluster search
  • Aggregations

    • Metric & Bucket aggregations
    • Sub-aggregations
    • Pipeline aggregations
  • Mappings and text analysis

    • Define index mappings as required
    • Define custom analyzers as required
    • Define multi-fields for a field (sub-fields with different types and analyzers)
    • Define and query nested documents
    • Define and query parent/child documents
  • Cluster administration

    • Allocate an index's shards to specific nodes as required
    • Configure shard allocation awareness & forced awareness for indices
    • Diagnose shard issues and restore the cluster to a healthy state
    • Back up & restore a cluster or specific indices
    • Configure a cluster with a hot & warm architecture
    • Configure cross-cluster search
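
A minimal sketch of the Search Template item above, runnable against the survey index from the previous section; the template id survey_by_country and its country parameter are made up for illustration:

// Register a mustache search template (the id is illustrative)
PUT _scripts/survey_by_country
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": {
        "term": {
          "Country": "{{country}}"
        }
      }
    }
  }
}

// Use the template: callers only supply parameters, not the query definition
GET final-stackoverflow-survey/_search/template
{
  "id": "survey_by_country",
  "params": {
    "country": "China"
  }
}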

Mock Exam

Cluster Backup and Restore

  • Here, my_fs_backup is the snapshot repository that was created

  • A snapshot can be created for specific indices only

  • Before restoring, the index must first be deleted (or closed); otherwise an error is returned.

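The screenshots with the exact commands are gone, so here is a sketch of the typical sequence, assuming the my_fs_backup repository mentioned above; the repository location and the snapshot name are illustrative:

// Register a shared filesystem repository (its location must be listed under path.repo in elasticsearch.yml)
PUT _snapshot/my_fs_backup
{
  "type": "fs",
  "settings": {
    "location": "my_fs_backup_location"
  }
}

// Snapshot only the survey index
PUT _snapshot/my_fs_backup/snapshot_1?wait_for_completion=true
{
  "indices": "final-stackoverflow-survey"
}

// Delete the existing index before restoring, otherwise the restore fails
DELETE final-stackoverflow-survey

POST _snapshot/my_fs_backup/snapshot_1/_restore
{
  "indices": "final-stackoverflow-survey"
}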

