Document Classification with OpenNLP

Preface

This article looks at how to use OpenNLP for document classification.

DoccatModel

To classify documents you need a maximum entropy model, which OpenNLP represents as DoccatModel.
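To make the term "maximum entropy model" more concrete: such a model sums learned per-feature weights for each category, exponentiates the sums, and normalizes them so the results add up to 1. The sketch below uses only the JDK. The class name `MaxentSketch` and the weight values are made up for illustration and are not OpenNLP's internals.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: hypothetical weights, not values learned by OpenNLP.
final class MaxentSketch {

    // p(category | tokens) = exp(sum of matching weights) / normalizer
    static double[] probabilities(String[] tokens, Map<String, double[]> weights, int numCategories) {
        double[] scores = new double[numCategories];
        for (String token : tokens) {
            double[] w = weights.get(token);
            if (w == null) {
                continue; // a token without weights contributes nothing
            }
            for (int c = 0; c < numCategories; c++) {
                scores[c] += w[c];
            }
        }
        double normalizer = 0.0;
        for (int c = 0; c < numCategories; c++) {
            scores[c] = Math.exp(scores[c]);
            normalizer += scores[c];
        }
        for (int c = 0; c < numCategories; c++) {
            scores[c] /= normalizer;
        }
        return scores;
    }

    public static void main(String[] args) {
        // Index 0 stands for category "0", index 1 for category "1".
        Map<String, double[]> weights = new HashMap<>();
        weights.put("a", new double[]{-1.0, 1.0});
        weights.put("x", new double[]{1.0, -1.0});
        double[] p = probabilities(new String[]{"a"}, weights, 2);
        System.out.println(p[0] + " " + p[1]);
    }
}
```

With these invented weights, the token "a" pushes probability toward index 1 and "x" toward index 0, while a token with no weights leaves both categories at 0.5.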

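    import java.io.IOException;
    import java.util.Set;
    import java.util.SortedMap;

    import opennlp.tools.doccat.DoccatFactory;
    import opennlp.tools.doccat.DoccatModel;
    import opennlp.tools.doccat.DocumentCategorizer;
    import opennlp.tools.doccat.DocumentCategorizerME;
    import opennlp.tools.doccat.DocumentSample;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.ObjectStreamUtils;
    import opennlp.tools.util.TrainingParameters;
    import org.junit.Assert;
    import org.junit.Test;

    public class DocumentCategorizerMETest {

        @Test
        public void testSimpleTraining() throws IOException {
            // Hand-written training samples: a category label plus the document's tokens
            ObjectStream<DocumentSample> samples = ObjectStreamUtils.createObjectStream(
                    new DocumentSample("1", new String[]{"a", "b", "c"}),
                    new DocumentSample("1", new String[]{"a", "b", "c", "1", "2"}),
                    new DocumentSample("1", new String[]{"a", "b", "c", "3", "4"}),
                    new DocumentSample("0", new String[]{"x", "y", "z"}),
                    new DocumentSample("0", new String[]{"x", "y", "z", "5", "6"}),
                    new DocumentSample("0", new String[]{"x", "y", "z", "7", "8"}));

            // 100 training iterations; a cutoff of 0 keeps features that occur only once
            TrainingParameters params = new TrainingParameters();
            params.put(TrainingParameters.ITERATIONS_PARAM, 100);
            params.put(TrainingParameters.CUTOFF_PARAM, 0);

            // Train the model; "x-unspecified" is the language code
            DoccatModel model = DocumentCategorizerME.train("x-unspecified", samples,
                    params, new DoccatFactory());

            DocumentCategorizer doccat = new DocumentCategorizerME(model);

            // categorize expects pre-tokenized input and returns one probability per category
            double[] aProbs = doccat.categorize(new String[]{"a"});
            Assert.assertEquals("1", doccat.getBestCategory(aProbs));

            double[] bProbs = doccat.categorize(new String[]{"x"});
            Assert.assertEquals("0", doccat.getBestCategory(bProbs));

            //test to make sure sorted map's last key is cat 1 because it has the highest score.
            SortedMap<Double, Set<String>> sortedScoreMap = doccat.sortedScoreMap(new String[]{"a"});
            Set<String> cat = sortedScoreMap.get(sortedScoreMap.lastKey());
            Assert.assertEquals(1, cat.size());
        }
    }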

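For ease of testing, the DocumentSample instances used as training data are written by hand here.
The categorize method returns an array of probabilities, one per category, and getBestCategory uses that array to return the best-matching category.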
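The relationship between the probability array and the chosen category can be sketched in plain Java. This is a simplified illustration of the idea, not OpenNLP's source: the class `BestCategorySketch`, its method names, and the probability values are hypothetical. It also shows why the last key of a sorted score map is the highest score.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper, written for illustration only.
final class BestCategorySketch {

    // Return the label whose probability is largest (the first one wins on ties).
    static String bestCategory(String[] categories, double[] probs) {
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) {
                best = i;
            }
        }
        return categories[best];
    }

    // Group labels by score. TreeMap sorts keys ascending, so lastKey() is the top score.
    static SortedMap<Double, Set<String>> sortedScores(String[] categories, double[] probs) {
        SortedMap<Double, Set<String>> map = new TreeMap<>();
        for (int i = 0; i < probs.length; i++) {
            map.computeIfAbsent(probs[i], k -> new HashSet<>()).add(categories[i]);
        }
        return map;
    }

    public static void main(String[] args) {
        String[] categories = {"1", "0"};
        double[] probs = {0.9, 0.1}; // made-up values
        System.out.println(bestCategory(categories, probs));
        System.out.println(sortedScores(categories, probs).lastKey());
    }
}
```

Because scores are map keys, two categories with exactly the same score would share one entry, which is why the test above checks the size of the set stored under the last key.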

The output is as follows:

    Indexing events with TwoPass using cutoff of 0

        Computing event counts...  done. 6 events
        Indexing...  done.
    Sorting and merging events... done. Reduced 6 events to 6.
    Done indexing in 0.13 s.
    Incorporating indexed data for training...
    done.
        Number of Event Tokens: 6
            Number of Outcomes: 2
          Number of Predicates: 14
    ...done.
    Computing model parameters ...
    Performing 100 iterations.
      1:  ... loglikelihood=-4.1588830833596715    0.5
      2:  ... loglikelihood=-2.6351991759048894    1.0
      3:  ... loglikelihood=-1.9518912133474995    1.0
      4:  ... loglikelihood=-1.5599038834410852    1.0
      5:  ... loglikelihood=-1.3039748361952568    1.0
      6:  ... loglikelihood=-1.1229511041438864    1.0
      7:  ... loglikelihood=-0.9877356230661396    1.0
      8:  ... loglikelihood=-0.8826624290652341    1.0
      9:  ... loglikelihood=-0.7985244514476817    1.0
     10:  ... loglikelihood=-0.729543972551105    1.0
    //...
     95:  ... loglikelihood=-0.0933856684859806    1.0
     96:  ... loglikelihood=-0.09245907503183291    1.0
     97:  ... loglikelihood=-0.09155090064000486    1.0
     98:  ... loglikelihood=-0.09066059844628399    1.0
     99:  ... loglikelihood=-0.08978764309881068    1.0
    100:  ... loglikelihood=-0.08893152970793908    1.0
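The first iteration line can be checked by hand. Assuming the model starts out assigning equal probability to both outcomes, each of the 6 events gets probability 0.5, so the log-likelihood is 6 × ln 0.5 ≈ -4.1589, which matches iteration 1. The trailing column appears to be the accuracy on the training events; that reading is an interpretation of the log and is not stated in the output itself. The class name `LogLikelihoodCheck` below is made up for this check.

```java
// Hypothetical helper that reproduces the first log-likelihood value in the log.
final class LogLikelihoodCheck {

    // Log-likelihood of numEvents events when each one gets probability 1 / numOutcomes.
    static double uniformLogLikelihood(int numEvents, int numOutcomes) {
        return numEvents * Math.log(1.0 / numOutcomes);
    }

    public static void main(String[] args) {
        // 6 events and 2 outcomes, as reported in the training log
        System.out.println(uniformLogLikelihood(6, 2));
    }
}
```

As training proceeds the log-likelihood climbs toward 0, meaning the model assigns ever higher probability to the correct category of each training sample.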

Summary

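OpenNLP's categorize method expects text that has already been tokenized, so calling it on its own is not very convenient. That is understandable if the library is designed around a pipeline, where tokenization and similar steps run earlier in the pipeline. This article only walks through the official test source code; readers can download a Chinese text classification training set, train a model on it, and then classify Chinese text.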
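Since categorize takes a String[] of tokens, Chinese text has to be segmented before it can be used for training or classification. The sketch below is a deliberately naive stand-in for a real Chinese word segmenter: it splits text into overlapping two-character tokens. The class `BigramTokenizerSketch` and its methods are hypothetical. The "category followed by space-separated tokens, one document per line" layout produced by `toTrainingLine` is what OpenNLP's DocumentSampleStream is understood to read; treat that as an assumption and check it against the OpenNLP manual.

```java
import java.util.ArrayList;
import java.util.List;

// Naive character-bigram tokenizer; a placeholder for a proper segmenter.
final class BigramTokenizerSketch {

    // Split text into overlapping two-character tokens, ignoring whitespace.
    static String[] bigrams(String text) {
        String compact = text.replaceAll("\\s+", "");
        int[] codePoints = compact.codePoints().toArray();
        List<String> tokens = new ArrayList<>();
        if (codePoints.length == 1) {
            tokens.add(compact); // a single character is its own token
        }
        for (int i = 0; i + 1 < codePoints.length; i++) {
            tokens.add(new String(codePoints, i, 2));
        }
        return tokens.toArray(new String[0]);
    }

    // Build one training line: the category, then the tokens, separated by spaces.
    static String toTrainingLine(String category, String[] tokens) {
        return category + " " + String.join(" ", tokens);
    }

    public static void main(String[] args) {
        String text = "\u6211\u7231\u5317\u4eac"; // four Chinese characters, escaped for portability
        String[] tokens = bigrams(text);
        // "travel" is an invented category label
        System.out.println(toTrainingLine("travel", tokens));
    }
}
```

For instance, bigrams("我爱北京") yields the three tokens 我爱, 爱北 and 北京, which can be passed to categorize or written to a training file. A dictionary-based or statistical segmenter would normally give better features than raw bigrams.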

doc

Document Categorizer API
