English Stemming Algorithms - Lovins, Porter - 冷且静

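There are many ways to stem English words, and in practice the task can draw on machine learning, data mining, and other fields.
This post covers a few of the original algorithms that are easy to implement: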

Lovins (1968)
Porter (1980)
Porter2 (2000)

1. Lovins

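The Lovins stemmer (1968) is the earliest published stemming algorithm.

1.1. Introduction

The algorithm is built from the following parts:

ending: a word suffix. There are 294 endings; the full list is in 1.Appendix.A
condition: the condition under which an ending may be removed. Each ending is tied to one condition, and there are 29 conditions; the full list is in 1.Appendix.B
transformation: a rule that recodes the end of the stem once the ending is gone. There are 35 transformations; the full list is in 1.Appendix.C

The algorithm runs in two steps:

For an English word, scan the ending list from the longest endings to the shortest and pick the first ending that matches the end of the word and whose condition is satisfied
Apply the transformations to the remaining stem so that its end takes the proper form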

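1.2. Example

Step one

Take the word nationally. Scanning the ending list from longest to shortest, the first match is ationally, listed in group .09. with condition B.
Condition B is "Minimum stem length = 3": after the ending is removed, the remaining stem must have at least 3 letters.
Removing ationally from nationally leaves only n, so the condition fails.

The scan continues and finds ionally in group .07. with condition A, "No restrictions on stem".
So ionally is chosen as the ending.

Step two

The stem of nationally is nat. None of the transformations matches it, so it is output unchanged.
As another example, take sitting: step one yields the stem sitt, and step two applies transformation 1 (remove one of a double letter), so the final output is sit.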
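The two steps above can be sketched in a few lines of Python. This is only an illustration, not a full Lovins stemmer: the ENDINGS table holds just three of the 294 endings, only conditions A, B and N and transformation 1 are implemented, and every function name is made up for this post. The reading of s** in condition N (the stem ends in s followed by any two letters) is an assumption on my part.

```python
# Tiny subset of the ending table: ending -> condition code.
ENDINGS = {
    "ationally": "B",
    "ionally": "A",
    "ing": "N",
}

def cond_a(stem):
    # A: no restrictions on stem
    return True

def cond_b(stem):
    # B: minimum stem length = 3
    return len(stem) >= 3

def cond_n(stem):
    # N: minimum stem length = 4 after s**, elsewhere = 3
    # (assumed reading: "s**" means the stem ends in s plus two letters)
    if len(stem) >= 3 and stem[-3] == "s":
        return len(stem) >= 4
    return len(stem) >= 3

CONDITIONS = {"A": cond_a, "B": cond_b, "N": cond_n}

# Letters covered by transformation 1.
DOUBLES = set("bdglmnprst")

def strip_ending(word):
    # Step one: try the longest possible ending first.
    for length in range(len(word) - 1, 0, -1):
        ending = word[-length:]
        code = ENDINGS.get(ending)
        if code is not None and CONDITIONS[code](word[:-length]):
            return word[:-length]
    return word

def undouble(stem):
    # Step two, transformation 1 only: remove one of a double letter.
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] in DOUBLES:
        return stem[:-1]
    return stem

def lovins_sketch(word):
    return undouble(strip_ending(word))
```

With these tables, lovins_sketch("nationally") gives nat and lovins_sketch("sitting") gives sit, the same results as the walkthrough above.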

1.Appendix.A List of endings

.11.
alistically B   arizability A   izationally B

.10.
antialness A    arisations A    arizations A    entialness A

.09.
allically C     antaneous A     antiality A     arisation A
arization A     ationally B     ativeness A     eableness E
entations A     entiality A     entialize A     entiation A
ionalness A     istically A     itousness A     izability A
izational A

.08.
ableness A      arizable A      entation A      entially A
eousness A      ibleness A      icalness A      ionalism A
ionality A      ionalize A      iousness A      izations A
lessness A

.07.
ability A       aically A       alistic B       alities A
ariness E       aristic A       arizing A       ateness A
atingly A       ational B       atively A       ativism A
elihood E       encible A       entally A       entials A
entiate A       entness A       fulness A       ibility A
icalism A       icalist A       icality A       icalize A
ication G       icianry A       ination A       ingness A
ionally A       isation A       ishness A       istical A
iteness A       iveness A       ivistic A       ivities A
ization F       izement A       oidally A       ousness A

.06.
aceous A        acious B        action G        alness A
ancial A        ancies A        ancing B        ariser A
arized A        arizer A        atable A        ations B
atives A        eature Z        efully A        encies A
encing A        ential A        enting C        entist A
eously A        ialist A        iality A        ialize A
ically A        icance A        icians A        icists A
ifully A        ionals A        ionate D        ioning A
ionist A        iously A        istics A        izable E
lessly A        nesses A        oidism A

.05.
acies A         acity A         aging B         aical A
alist A         alism B         ality A         alize A
allic BB        anced B         ances B         antic C
arial A         aries A         arily A         arity B
arize A         aroid A         ately A         ating I
ation B         ative A         ators A         atory A
ature E         early Y         ehood A         eless A
elity A         ement A         enced A         ences A
eness E         ening E         ental A         ented C
ently A         fully A         ially A         icant A
ician A         icide A         icism A         icist A
icity A         idine I         iedly A         ihood A
inate A         iness A         ingly B         inism J
inity CC        ional A         ioned A         ished A
istic A         ities A         itous A         ively A
ivity A         izers F         izing F         oidal A
oides A         otide A         ously A

.04.
able A          ably A          ages B          ally B
ance B          ancy B          ants B          aric A
arly K          ated I          ates A          atic B
ator A          ealy Y          edly E          eful A
eity A          ence A          ency A          ened E
enly E          eous A          hood A          ials A
ians A          ible A          ibly A          ical A
ides L          iers A          iful A          ines M
ings N          ions B          ious A          isms B
ists A          itic H          ized F          izer F
less A          lily A          ness A          ogen A
ward A          wise A          ying B          yish A

.03.
acy A           age B           aic A           als BB
ant B           ars O           ary F           ata A
ate A           eal Y           ear Y           ely E
ene E           ent C           ery E           ese A
ful A           ial A           ian A           ics A
ide L           ied A           ier A           ies P
ily A           ine M           ing N           ion Q
ish C           ism B           ist A           ite AA
ity A           ium A           ive A           ize F
oid A           one R           ous A

.02.
ae A            al BB           ar X            as B
ed E            en F            es E            ia A
ic A            is A            ly B            on S
or T            um U            us V            yl R
s' A            's A

.01.
a A             e A             i A             o A
s W             y B

1.Appendix.B List of conditions

A   No restrictions on stem
B   Minimum stem length = 3
C   Minimum stem length = 4
D   Minimum stem length = 5
E   Do not remove ending after e
F   Minimum stem length = 3 and do not remove ending after e
G   Minimum stem length = 3 and remove ending only after f
H   Remove ending only after t or ll
I   Do not remove ending after o or e
J   Do not remove ending after a or e
K   Minimum stem length = 3 and remove ending only after l, i or u*e
L   Do not remove ending after u, x or s, unless s follows o
M   Do not remove ending after a, c, e or m
N   Minimum stem length = 4 after s**, elsewhere = 3
O   Remove ending only after l or i
P   Do not remove ending after c
Q   Minimum stem length = 3 and do not remove ending after l or n
R   Remove ending only after n or r
S   Remove ending only after dr or t, unless t follows t
T   Remove ending only after s or t, unless t follows o
U   Remove ending only after l, m, n or r
V   Remove ending only after c
W   Do not remove ending after s or u
X   Remove ending only after l, i or u*e
Y   Remove ending only after in
Z   Do not remove ending after f
AA  Remove ending only after d, f, ph, th, l, er, or, es or t
BB  Minimum stem length = 3 and do not remove ending after met or ryst
CC  Remove ending only after l
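
Each condition is a small test on the stem that is left after the ending is removed. As a rough sketch, a few of them could be written as Python predicates like this (the function names are hypothetical, and the stem is assumed to be a lowercase string):

```python
def cond_b(stem):
    # B: minimum stem length = 3
    return len(stem) >= 3

def cond_e(stem):
    # E: do not remove ending after e
    return not stem.endswith("e")

def cond_f(stem):
    # F: minimum stem length = 3 and do not remove ending after e
    return len(stem) >= 3 and not stem.endswith("e")

def cond_h(stem):
    # H: remove ending only after t or ll
    return stem.endswith(("t", "ll"))

def cond_l(stem):
    # L: do not remove ending after u, x or s, unless s follows o
    if stem.endswith(("u", "x")):
        return False
    if stem.endswith("s"):
        return stem.endswith("os")
    return True
```

The remaining conditions follow the same pattern: a length check, a check on the last one or two letters of the stem, or both.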

1.Appendix.C List of transformations

1   remove one of double b, d, g, l, m, n, p, r, s, t
2   iev   ->   ief
3   uct   ->   uc
4   umpt  ->   um
5   rpt   ->   rb
6   urs   ->   ur
7   istr  ->   ister
7a  metr  ->   meter
8   olv   ->   olut
9   ul    ->   l except following a, o, i
10  bex   ->   bic
11  dex   ->   dic
12  pex   ->   pic
13  tex   ->   tic
14  ax    ->   ac
15  ex    ->   ec
16  ix    ->   ic
17  lux   ->   luc
18  uad   ->   uas
19  vad   ->   vas
20  cid   ->   cis
21  lid   ->   lis
22  erid  ->   eris
23  pand  ->   pans
24  end   ->   ens except following s
25  ond   ->   ons
26  lud   ->   lus
27  rud   ->   rus
28  her   ->   hes except following p, t
29  mit   ->   mis
30  ent   ->   ens except following m
31  ert   ->   ers
32  et    ->   es except following n
33  yt    ->   ys
34  yz    ->   ys
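
The transformation table lends itself to a table-driven implementation. The sketch below covers rule 1 and a handful of the other rules only. Two things in it are assumptions rather than something stated in the table: rule 1 is tried first, and after that only the first matching rule is applied. The names RULES and transform are invented for this example.

```python
# Letters covered by rule 1.
DOUBLES = set("bdglmnprst")

# (old suffix, new suffix, letters that block the rule when they come
# right before the old suffix). Only a few rules are included.
RULES = [
    ("iev", "ief", ""),    # rule 2
    ("uct", "uc", ""),     # rule 3
    ("umpt", "um", ""),    # rule 4
    ("rpt", "rb", ""),     # rule 5
    ("ul", "l", "aoi"),    # rule 9: except following a, o, i
    ("end", "ens", "s"),   # rule 24: except following s
    ("mit", "mis", ""),    # rule 29
]

def transform(stem):
    # Rule 1: remove one of a double letter.
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] in DOUBLES:
        stem = stem[:-1]
    for old, new, blocked in RULES:
        if stem.endswith(old):
            # The letter just before the old suffix ("" if there is none).
            before = stem[-len(old) - 1:-len(old)]
            if before and before in blocked:
                continue
            return stem[:-len(old)] + new
    return stem
```

For example, transform("believ") returns belief, transform("sitt") returns sit, and transform("send") is left as send because of the "except following s" clause.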

2. Porter

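2.1. Introduction

Vowels and consonants

The definitions of vowel and consonant differ slightly from the usual ones:

Vowel - A, E, I, O, U, and Y when it follows a consonant
Consonant - any letter other than A, E, I, O, U, and other than a Y that follows a consonant (so a Y at the start of a word or after a vowel is a consonant)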

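Grouping the letters of a word

A run of consecutive vowels is treated as one vowel group V, and a run of consecutive consonants as one consonant group C, so any word can be written as alternating C and V groups. For example: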

segmentfault -> s/e/gm/e/ntf/au/lt -> CVCVCVC
porter -> p/o/rt/e/r -> CVCVC
application -> a/ppl/i/c/a/t/io/n -> VCVCVCVC
apple -> a/ppl/e -> VCV

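Putting this together, every word can be written in terms of VC groups as: $$ [C](VC)^m[V] $$
Here square brackets mark an optional group and (VC)^m means VC repeated m times. The parameter m, called the measure, plays a role similar to the stem length in the Lovins conditions and is used in the checks that follow.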
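
To make the grouping concrete, here is a small Python sketch that computes the C/V pattern of a word and its measure m. The function names are my own, and the input is assumed to be a lowercase ASCII word.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """True if word[i] is a consonant under Porter's definition."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y is a consonant at the start of the word or after a vowel,
        # and a vowel after a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def pattern(word):
    """Collapse runs of consonants and vowels into single C and V marks."""
    out = []
    for i in range(len(word)):
        kind = "C" if is_consonant(word, i) else "V"
        if not out or out[-1] != kind:
            out.append(kind)
    return "".join(out)

def measure(word):
    """m in [C](VC)^m[V]: the number of VC pairs in the pattern."""
    return pattern(word).count("VC")
```

For the four words above this gives m = 3 for segmentfault, m = 2 for porter, m = 4 for application and m = 1 for apple.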

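Rules

The Porter algorithm is driven by rules of the form:

(condition) S1 -> S2

The condition is tested on the stem that remains after S1 is removed. Besides m, it can use the following notation:

m - the number of VC groups in the stem
* - stands for any run of characters; it is combined with a letter or with v, d, o (as in *S, *v*, *d, *o)
capital letter - a literal letter; for example *S means the stem ends with S
v - a vowel; *v* means the stem contains a vowel
d - a double consonant; *d means the stem ends with two identical consonants
o - the sequence cvc in which the second c is not W, X or Y; *o means the stem ends this way

S1 is the suffix of the word and S2 is the suffix that replaces it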

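Unlike Lovins, which produces its output in one pass, Porter runs a word through a chain of rules before it outputs the result.
Take hopping: the rule (*v*) ING -> is applied first and turns it into hopp.
Then the rule (*d and not (*L or *S or *Z)) -> single letter turns hopp into hop.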
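
The conditions *v*, *d and *o translate directly into small predicates. The following is a hedged sketch with names of my own choosing; strip_ing chains only the two rules from the hopping example and is not a complete step of the algorithm.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def contains_vowel(stem):
    # *v*: the stem contains a vowel
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):
    # *d: the stem ends with two identical consonants
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)

def ends_cvc(stem):
    # *o: the stem ends cvc, where the second c is not w, x or y
    n = len(stem)
    return (n >= 3
            and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1)
            and stem[-1] not in "wxy")

def strip_ing(word):
    # (*v*) ING ->
    if not word.endswith("ing"):
        return word
    stem = word[:-3]
    if not contains_vowel(stem):
        return word
    # (*d and not (*L or *S or *Z)) -> single letter
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        stem = stem[:-1]
    return stem
```

Here strip_ing("hopping") returns hop, while strip_ing("sing") leaves the word alone because the stem s contains no vowel.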

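Flow

The algorithm applies the rules from top to bottom. A few rules are special: when one of them fires, extra rules have to be processed.
Because there are many rules, they are grouped into steps. The grouping is mainly a logical one (although an implementation can also use it to optimize), and the algorithm simply runs the steps from first to last: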

do Step_1a
do Step_1b (if the second or third rule of Step 1b fires, do some extra work)
do Step_1c
do Step_2
do Step_3
do Step_4
do Step_5a
do Step_5b

The details of each step are listed in the appendices below

2.2. Example

2.Appendix Step 1a

      SSES  ->   SS
      IES   ->   I
      SS    ->   SS
      S     ->
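
Step 1a has no conditions, so it reduces to trying the suffixes from longest to shortest and stopping at the first match. A minimal sketch, assuming lowercase input (the function name is mine):

```python
def step_1a(word):
    # Rules are tried in table order; only the first match is applied.
    if word.endswith("sses"):
        return word[:-2]      # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]      # IES  -> I
    if word.endswith("ss"):
        return word           # SS   -> SS (blocks the S rule below)
    if word.endswith("s"):
        return word[:-1]      # S    ->
    return word
```

The SS -> SS rule looks like a no-op, but it is what keeps a word such as caress from losing its final s.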

2.Appendix Step 1b

(m>0) EED     ->   EE
(*v*) ED      ->
(*v*) ING     ->

If the second or third of the rules in Step 1b is successful, the following is done:

      AT      ->   ATE
      BL      ->   BLE
      IZ      ->   IZE
      (*d and not (*L or *S or *Z)) -> single letter
      (m=1 and *o)  ->   E
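
Step 1b is the one place where a rule triggers follow-up work, so it is worth sketching in full. The code below is my own illustration of the table above: the helper names are invented, the input is assumed to be lowercase, and when the EED rule matches but its condition fails the word is left untouched instead of falling through to the ED rule.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    # Count vowel-to-consonant transitions, i.e. the number of VC groups.
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)

def ends_cvc(stem):
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def fix_up(stem):
    # Extra rules run after ED or ING has been removed.
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"
    return stem

def step_1b(word):
    if word.endswith("eed"):
        # (m>0) EED -> EE
        return word[:-1] if measure(word[:-3]) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            # (*v*) ED -> and (*v*) ING ->
            return fix_up(stem) if contains_vowel(stem) else word
    return word
```

A few sample inputs: conflated becomes conflate, hopping becomes hop, filing becomes file, and feed stays feed because the stem f has m = 0.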

2.Appendix Step 1c

(*v*) Y       ->   I

2.Appendix Step 2

(m>0) ATIONAL ->   ATE
(m>0) TIONAL  ->   TION
(m>0) ENCI    ->   ENCE
(m>0) ANCI    ->   ANCE
(m>0) IZER    ->   IZE
(m>0) ABLI    ->   ABLE
(m>0) ALLI    ->   AL
(m>0) ENTLI   ->   ENT
(m>0) ELI     ->   E
(m>0) OUSLI   ->   OUS
(m>0) IZATION ->   IZE
(m>0) ATION   ->   ATE
(m>0) ATOR    ->   ATE
(m>0) ALISM   ->   AL
(m>0) IVENESS ->   IVE
(m>0) FULNESS ->   FUL
(m>0) OUSNESS ->   OUS
(m>0) ALITI   ->   AL
(m>0) IVITI   ->   IVE
(m>0) BILITI  ->   BLE
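
Steps 2, 3 and 4 all have the same shape: find the longest matching S1, check the measure of the stem, and swap in S2. Below is a table-driven sketch of Step 2 under that reading. STEP_2 mirrors the table above; the helper names are hypothetical and the input is assumed to be lowercase.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    # Count vowel-to-consonant transitions, i.e. the number of VC groups.
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

STEP_2 = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"),
    ("anci", "ance"), ("izer", "ize"), ("abli", "able"),
    ("alli", "al"), ("entli", "ent"), ("eli", "e"),
    ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"),
    ("fulness", "ful"), ("ousness", "ous"), ("aliti", "al"),
    ("iviti", "ive"), ("biliti", "ble"),
]

# Longest S1 first, so that e.g. IZATION wins over ATION.
BY_LENGTH = sorted(STEP_2, key=lambda rule: len(rule[0]), reverse=True)

def step_2(word):
    for s1, s2 in BY_LENGTH:
        if word.endswith(s1):
            stem = word[:-len(s1)]
            # Only the longest match is considered; if its (m>0)
            # condition fails, the word is left unchanged.
            return stem + s2 if measure(stem) > 0 else word
    return word
```

So relational becomes relate, while rational is left alone: ATIONAL matches, but the stem r has m = 0.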

2.Appendix Step 3

(m>0) ICATE   ->   IC
(m>0) ATIVE   ->
(m>0) ALIZE   ->   AL
(m>0) ICITI   ->   IC
(m>0) ICAL    ->   IC
(m>0) FUL     ->
(m>0) NESS    ->

2.Appendix Step 4

(m>1) AL      ->
(m>1) ANCE    ->
(m>1) ENCE    ->
(m>1) ER      ->
(m>1) IC      ->
(m>1) ABLE    ->
(m>1) IBLE    ->
(m>1) ANT     ->
(m>1) EMENT   ->
(m>1) MENT    ->
(m>1) ENT     ->
(m>1 and (*S or *T)) ION   ->
(m>1) OU      ->
(m>1) ISM     ->
(m>1) ATE     ->
(m>1) ITI     ->
(m>1) OUS     ->
(m>1) IVE     ->
(m>1) IZE     ->

2.Appendix Step 5a

(m>1) E   ->
(m=1 and not *o) E   ->

2.Appendix Step 5b

(m > 1 and *d and *L)   ->   single letter
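
Step 5 tidies up a trailing e and a trailing double l. A hedged sketch of both sub-steps follows; the function names are my own and the input is assumed to be lowercase.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    # Count vowel-to-consonant transitions, i.e. the number of VC groups.
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def ends_cvc(stem):
    # *o: the stem ends cvc, where the second c is not w, x or y
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3)
            and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def step_5a(word):
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        # (m>1) E ->   and   (m=1 and not *o) E ->
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            return stem
    return word

def step_5b(word):
    # (m > 1 and *d and *L) -> single letter
    # S1 is empty here, so the condition is tested on the whole word.
    if word.endswith("ll") and measure(word) > 1:
        return word[:-1]
    return word
```

For instance, probate becomes probat and controll becomes control, while rate and roll are left as they are.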

winterdawn
