Cloud Study Notes 9. Project 1: Inverted Index

Project 1 : Inverted Index

Lab Setup

Environment: CDH 5.13.0 (see note 5, Word Count, for details)

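Building an Inverted Index

An inverted index (also called a reverse index, postings file, or inverted file) is an indexing method that stores, for full-text search, a mapping from each word to the locations where it occurs in a document or a set of documents.

The inverted index is a central part of typical search-engine retrieval algorithms, and it is the most commonly used data structure in document retrieval systems.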

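Building an inverted index takes two steps:

1. Assign an ID (DocID) to each original document to form a list. This is the document list on the left of Figure 1.
2. Tokenize the content of each document into terms. Number the terms and, with the term as the index key, store the related information (term frequency, document ID, position). This is the posting list on the right of Figure 1.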
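The two steps above can be sketched in plain Java, without Hadoop. This is only a minimal illustration: the class name `PositionalIndexSketch`, the `build` method, and the three sample sentences are made up for this note, and each posting is stored as a `DocID:position` string with 1-based word positions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory sketch of the two-step construction (not the Hadoop job). */
class PositionalIndexSketch {

    /** Maps each term to its postings, each written as "DocID:position" (both 1-based). */
    static Map<String, List<String>> build(List<String> docs) {
        Map<String, List<String>> index = new TreeMap<>();
        for (int i = 0; i < docs.size(); i++) {
            int docId = i + 1; // step 1: number the documents
            String text = docs.get(i).trim().toLowerCase();
            if (text.isEmpty()) {
                continue; // nothing to tokenize
            }
            String[] terms = text.split("\\s+"); // step 2: split into terms
            for (int pos = 0; pos < terms.length; pos++) {
                index.computeIfAbsent(terms[pos], k -> new ArrayList<>())
                     .add(docId + ":" + (pos + 1));
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Made-up sample documents, not the ones from Figure 1.
        List<String> docs = Arrays.asList(
                "deep ocean blue",
                "clouds drift slowly across the wide open sky",
                "clear blue sky today");
        Map<String, List<String>> index = build(docs);
        System.out.println("blue - " + index.get("blue")); // prints "blue - [1:3, 3:2]"
        System.out.println("sky - " + index.get("sky"));   // prints "sky - [2:8, 3:3]"
    }
}
```

Term frequency is not stored separately here; in this sketch it is simply the number of postings a term has for a given DocID.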

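Retrieval with an Inverted Index

First, build an inverted index over Document1, Document2, and Document3. Suppose we search for the keywords blue sky. Looking up each term gives its posting list, where each entry reads DocID:position:

blue - 1:3, 3:2

sky - 2:8, 3:3

Comparing the two lists, Document3 is the only document that contains both keywords, at positions 2 and 3 respectively, so Document3 is returned as the search result.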
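The lookup described above amounts to intersecting the DocIDs of the two posting lists. A minimal sketch, assuming postings are kept as `DocID:position` strings; the class `PostingLookupSketch` and its methods are hypothetical names used only for illustration:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical sketch: answer a multi-keyword query by intersecting posting lists. */
class PostingLookupSketch {

    /** Extracts the DocIDs from postings written as "DocID:position". */
    static SortedSet<Integer> docIds(List<String> postings) {
        SortedSet<Integer> ids = new TreeSet<>();
        for (String posting : postings) {
            ids.add(Integer.parseInt(posting.substring(0, posting.indexOf(':'))));
        }
        return ids;
    }

    /** Returns the DocIDs of the documents that contain every term of the query. */
    static SortedSet<Integer> search(Map<String, List<String>> index, String query) {
        SortedSet<Integer> result = null;
        for (String term : query.trim().toLowerCase().split("\\s+")) {
            List<String> postings = index.getOrDefault(term, Collections.<String>emptyList());
            SortedSet<Integer> ids = docIds(postings);
            if (result == null) {
                result = ids;          // first term: start from its documents
            } else {
                result.retainAll(ids); // later terms: keep only the common documents
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    public static void main(String[] args) {
        // The two posting lists from the example above.
        Map<String, List<String>> index = new HashMap<>();
        index.put("blue", Arrays.asList("1:3", "3:2"));
        index.put("sky", Arrays.asList("2:8", "3:3"));
        System.out.println(search(index, "blue sky")); // prints "[3]"
    }
}
```

This sketch only checks that both terms occur in the same document. Matching the exact phrase would additionally require comparing the stored positions (2 and 3 in Document3 are adjacent), which is left out here.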

Code

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {
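    public static class Map extends Mapper<Object, Text, Text, Text> {
        // Reused across calls to avoid allocating two Text objects per word.
        private final Text word = new Text();
        private final Text fileName = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // The name of the file behind the current split identifies the document.
            FileSplit split = (FileSplit) context.getInputSplit();
            fileName.set(split.getPath().getName());
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // blank lines and leading whitespace yield empty tokens
                }
                word.set(token.toLowerCase());
                context.write(word, fileName); // emit <word, fileName>
            }
        }
    }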

    public static class Reduce extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Input: <word, [file1, file1, file2, ...]>, one entry per occurrence.
            // Count how many times the word occurs in each file.
            HashMap<String, Integer> counts = new HashMap<>();
            for (Text record : values) {
                String file = record.toString();
                int times = counts.containsKey(file) ? counts.get(file) : 0;
                counts.put(file, times + 1);
            }

            // Output: <word, "file1:count<TAB>file2:count">, without a trailing tab.
            StringBuilder res = new StringBuilder();
            Iterator<String> iter = counts.keySet().iterator();
            while (iter.hasNext()) {
                String filePath = iter.next();
                res.append(filePath).append(':').append(counts.get(filePath));
                if (iter.hasNext()) {
                    res.append('\t');
                }
            }
            context.write(key, new Text(res.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Job.getInstance replaces the deprecated Job constructor.
        Job job = Job.getInstance(conf, "inverted index");
        job.setJarByClass(InvertedIndex.class);

        job.setMapperClass(Map.class);

        job.setReducerClass(Reduce.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
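To get a feel for what the job writes, the same word-to-file counting can be reproduced locally in plain Java. This is only a sketch under assumptions: `LocalInvertedIndexSketch`, its methods, and the file names `a.txt` and `b.txt` are made up, and sorted maps are used so the output order is deterministic, whereas the reducer above iterates a `HashMap` in unspecified order.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical local stand-in for the map and reduce phases (no Hadoop needed). */
class LocalInvertedIndexSketch {

    /** files maps a file name to its content; the result maps word -> (file -> count). */
    static Map<String, Map<String, Integer>> index(Map<String, String> files) {
        Map<String, Map<String, Integer>> result = new TreeMap<>();
        for (Map.Entry<String, String> file : files.entrySet()) {
            for (String word : file.getValue().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // skip empty tokens from leading whitespace
                }
                result.computeIfAbsent(word.toLowerCase(), k -> new TreeMap<>())
                      .merge(file.getKey(), 1, Integer::sum);
            }
        }
        return result;
    }

    /** Formats one output line as "word<TAB>file:count<TAB>file:count". */
    static String format(String word, Map<String, Integer> counts) {
        StringBuilder line = new StringBuilder(word);
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            line.append('\t').append(e.getKey()).append(':').append(e.getValue());
        }
        return line.toString();
    }

    public static void main(String[] args) {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("a.txt", "blue sky Blue");
        files.put("b.txt", "sky");
        for (Map.Entry<String, Map<String, Integer>> e : index(files).entrySet()) {
            System.out.println(format(e.getKey(), e.getValue()));
        }
        // prints:
        // blue	a.txt:2
        // sky	a.txt:1	b.txt:1
    }
}
```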

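Running the Job

In the project folder, create input/ and put any sample documents you like in it. Then run the code above, passing the input path as the first argument and the output path as the second (see note 5, Word Count, for details).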


s09g
