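Preface

  • This article is compiled from Mastering Regular Expressions and Unicode Regular Expressions.
  • Examples use Python 3 by default, with Python 3's re module or the regex library. The \p{...} examples use regex, because the standard re module does not recognize \p{...}.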

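Basic Unicode property classes

\p{L}|\p{Letter} Letters
\p{M}|\p{Mark} Characters that cannot stand alone and must accompany a base character (accents, enclosing boxes, and so on)
\p{Z}|\p{Separator} Characters that separate things but have no visible form (various kinds of whitespace)
\p{S}|\p{Symbol} Various dingbats and symbols
\p{N}|\p{Number} Any kind of numeric character
\p{P}|\p{Punctuation} Punctuation characters
\p{C}|\p{Other} Everything else (rarely used for normal characters)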
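
As a rough illustration of these seven classes, the sketch below uses the standard unicodedata module rather than a regular expression. The first letter of the General_Category value that unicodedata.category returns corresponds to the one-letter classes above. The helper name major_classes is made up for this example, and results may differ slightly between Python versions because each ships its own Unicode database.

```python
import unicodedata
from collections import Counter


def major_classes(text: str) -> Counter:
    """Count characters by the first letter of their General_Category (L, M, Z, S, N, P, C)."""
    return Counter(unicodedata.category(ch)[0] for ch in text)


# 'A' and 'b' are letters, '1' a number, ' ' a separator,
# the euro sign a symbol, and '!' punctuation.
counts = major_classes('Ab1 \u20ac!')
print(dict(counts))
```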

Basic Unicode subproperties

Letter

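\p{Ll}|\p{Lowercase_Letter} Lowercase letters
\p{Lu}|\p{Uppercase_Letter} Uppercase letters
\p{Lt}|\p{Titlecase_Letter} Titlecase letters, used at the start of a word
\p{L&}|Shorthand for the union of \p{Ll}, \p{Lu}, and \p{Lt}
\p{Lm}|\p{Modifier_Letter} A small set of letter-like characters with special uses
\p{Lo}|\p{Other_Letter} Letters that have no case and are not modifiers,
         including letters from Hebrew, Arabic, Bengali, Thai, and Japanese.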

Mark

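\p{Mn}|\p{Non_Spacing_Mark} “Characters” that modify other characters,
         such as accents, umlauts, certain “vowel signs”, and tone marks.
\p{Mc}|\p{Spacing_Combining_Mark} Modifying characters that take up space of their own
        (most “vowel signs” in the languages that have them, including Bengali, Gujarati,
        Tamil, Telugu, Kannada, Malayalam, Sinhala, Myanmar, and Khmer).
\p{Me}|\p{Enclosing_Mark} Marks that can enclose other characters, such as circles, squares, and diamonds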
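
One common use of the Mn category is removing accents. The sketch below is only one possible approach, using the standard unicodedata module: it decomposes the text with NFD so that accents become separate combining characters, then drops everything whose category is Mn. The function name strip_marks is hypothetical.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Remove non-spacing marks (category Mn) after canonical decomposition."""
    # NFD splits a precomposed character such as U+00E9 into 'e' + U+0301.
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')


print(strip_marks('caf\u00e9 na\u00efve \u00c5ngstr\u00f6m'))
```

Only Mn is removed here. Spacing marks (Mc), such as many Indic vowel signs, are kept on purpose, since removing them would change the text rather than just strip decoration.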

Separator

\p{Zs}|\p{Space_Separator} Various kinds of spacing characters, such as the ordinary space, the no-break space,
         and the various fixed-width spaces.
\p{Zl}|\p{Line_Separator} The LINE SEPARATOR character (U+2028)
\p{Zp}|\p{Paragraph_Separator} The PARAGRAPH SEPARATOR character (U+2029)
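
The separator categories are handy for normalizing unusual whitespace. The following is a minimal sketch using the standard unicodedata module, with a made-up function name normalize_separators. It maps every Zs character to an ASCII space and the Zl and Zp characters to a newline. Note that TAB, LF, and CR are control characters (Cc), not separators, so they pass through unchanged.

```python
import unicodedata


def normalize_separators(text: str) -> str:
    """Map Zs characters to ' ' and Zl/Zp characters to a newline."""
    out = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat == 'Zs':
            out.append(' ')
        elif cat in ('Zl', 'Zp'):
            out.append('\n')
        else:
            out.append(ch)
    return ''.join(out)


# U+00A0 NO-BREAK SPACE, U+3000 IDEOGRAPHIC SPACE, U+2028 LINE SEPARATOR
print(repr(normalize_separators('a\u00a0b\u3000c\u2028d')))
```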

Symbol

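\p{Sc}|\p{Currency_Symbol} Currency symbols: $, ¥, ...
\p{Sm}|\p{Math_Symbol} Mathematical symbols: +, <, >, ...
\p{Sk}|\p{Modifier_Symbol} Mostly standalone versions of combining characters,
         which are full-fledged characters in their own right.
\p{So}|\p{Other_Symbol} Various dingbats, box-drawing symbols, Braille patterns,
         non-letter Chinese characters, and so on.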

Number

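\p{Nd}|\p{Decimal_Digit_Number} The digits 0 through 9 in various scripts (not including Chinese, Japanese, and Korean)
\p{Nl}|\p{Letter_Number} Mostly Roman numerals.
\p{No}|\p{Other_Number} Numbers used as superscripts or symbols,
         and characters that represent numbers but are not digits (not including Chinese, Japanese, and Korean characters).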
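
To make the three number categories concrete, the sketch below looks up a few sample characters with the standard unicodedata module. The helper describe is a name invented for this example. It returns the category together with the numeric value, or None when the Unicode database records no value for the character.

```python
import unicodedata


def describe(ch: str):
    """Return (General_Category, numeric value or None) for one character."""
    return unicodedata.category(ch), unicodedata.numeric(ch, None)


samples = [
    '5',        # ASCII digit
    '\u0663',   # ARABIC-INDIC DIGIT THREE
    '\u2167',   # ROMAN NUMERAL EIGHT
    '\u00b2',   # SUPERSCRIPT TWO
    '\u00bd',   # VULGAR FRACTION ONE HALF
    '\u4e94',   # CJK ideograph for "five"
]
for ch in samples:
    print(ch, *describe(ch))
```

The last sample shows the exclusion mentioned above: the Chinese numeral is classified as a letter (Lo), so \p{N} does not match it.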

Punctuation

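\p{Pd}|\p{Dash_Punctuation} Hyphens and dashes of all sorts
\p{Ps}|\p{Open_Punctuation} Characters such as ( and 《
\p{Pe}|\p{Close_Punctuation} Characters such as ) and 》
\p{Pi}|\p{Initial_Punctuation} Characters such as “ and «
\p{Pf}|\p{Final_Punctuation} Characters such as ” and »
\p{Pc}|\p{Connector_Punctuation} A few punctuation characters with special linguistic meaning, such as the underscore
\p{Po}|\p{Other_Punctuation} All remaining punctuation: !, &, ., :, and so on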
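
The punctuation subcategories are easy to mix up, so here is a small sketch that groups the punctuation in a string by subcategory, again using the standard unicodedata module. The function name punctuation_by_kind and the sample string are made up for illustration.

```python
import unicodedata
from collections import defaultdict


def punctuation_by_kind(text: str) -> dict:
    """Group punctuation characters by their two-letter category (Pd, Ps, Pe, ...)."""
    groups = defaultdict(list)
    for ch in text:
        cat = unicodedata.category(ch)
        if cat.startswith('P'):
            groups[cat].append(ch)
    return dict(groups)


# Contains ASCII parentheses, guillemets (U+00AB, U+00BB), curly quotes
# (U+201C, U+201D), an underscore, a hyphen, angle brackets and '!'.
sample = '(a) \u00abb\u00bb \u201cc\u201d x_y - <z>!'
kinds = punctuation_by_kind(sample)
for cat in sorted(kinds):
    print(cat, kinds[cat])
```

The ASCII characters < and > do not show up in the result, because Unicode classifies them as math symbols (Sm) rather than punctuation.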

Other

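\p{Cc}|\p{Control} ASCII and Latin-1 control characters (TAB, LF, CR, and so on)
\p{Cf}|\p{Format} Invisible characters that indicate formatting
\p{Co}|\p{Private_Use} Code points allocated for private use (e.g. a company logo)
\p{Cs}|\p{Surrogate} One half of a surrogate pair in UTF-16 encoding
\p{Cn}|\p{Unassigned} Code points that currently have no character assigned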
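
A typical use of the Cc and Cf categories is cleaning invisible characters out of scraped or pasted text. The sketch below is one possible way to do it with the standard unicodedata module. The name strip_invisible and the choice to keep newline and tab are assumptions of this example, not something the categories dictate.

```python
import unicodedata

# Control characters that are usually worth keeping in plain text.
KEEP = {'\n', '\t'}


def strip_invisible(text: str) -> str:
    """Drop control (Cc) and format (Cf) characters, except those in KEEP."""
    return ''.join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) not in ('Cc', 'Cf')
    )


# U+200B ZERO WIDTH SPACE and U+FEFF (byte order mark) are Cf; NUL is Cc.
print(repr(strip_invisible('a\u200bb\ufeffc\x00d\n')))
```

Be careful with Cf in real text: it also contains the zero width joiner and non-joiner, which some scripts and emoji sequences need in order to render correctly.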

Unicode Scripts

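  • Mainly used to match text written in a particular script (writing system), and so, roughly, in a particular language
  • Example: matching Chinese characters (Han)

    >>> import regex
    >>> regex.findall(r'\p{Han}', '孔子/现代价值/Theory of "Knowing"')
    ['孔', '子', '现', '代', '价', '值']
  • List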

    \p{Common}
    \p{Arabic}
    \p{Armenian}
    \p{Bengali}
    \p{Bopomofo}
    \p{Braille}
    \p{Buhid}
    \p{Canadian_Aboriginal}
    \p{Cherokee}
    \p{Cyrillic}
    \p{Devanagari}
    \p{Ethiopic}
    \p{Georgian}
    \p{Greek}
    \p{Gujarati}
    \p{Gurmukhi}
    \p{Han}
    \p{Hangul}
    \p{Hanunoo}
    \p{Hebrew}
    \p{Hiragana}
    \p{Inherited}
    \p{Kannada}
    \p{Katakana}
    \p{Khmer}
    \p{Lao}
    \p{Latin}
    \p{Limbu}
    \p{Malayalam}
    \p{Mongolian}
    \p{Myanmar}
    \p{Ogham}
    \p{Oriya}
    \p{Runic}
    \p{Sinhala}
    \p{Syriac}
    \p{Tagalog}
    \p{Tagbanwa}
    \p{Tai_Le}
    \p{Tamil}
    \p{Telugu}
    \p{Thaana}
    \p{Thai}
    \p{Tibetan}
    \p{Yi}
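
Script properties can also split mixed text into runs of a single script. The sketch below assumes the third-party regex library is installed (pip install regex). The helper name runs and the sample sentence are made up for this example.

```python
import regex  # third-party library; the standard re module will not work here


def runs(script: str, text: str) -> list:
    """Return every maximal run of characters belonging to the given script."""
    return regex.findall(r'\p{%s}+' % script, text)


text = '日本語のテキストとカタカナ, plus some English'

for script in ('Han', 'Hiragana', 'Katakana', 'Latin'):
    print(script, runs(script, text))
```

Punctuation, digits, and spaces mostly belong to the Common script, so they are not matched by any of the four properties used here.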

Unicode Blocks

  • Mapping between regex block properties and Unicode code point ranges
  • List

    \p{InBasic_Latin}: U+0000–U+007F
    \p{InLatin-1_Supplement}: U+0080–U+00FF
    \p{InLatin_Extended-A}: U+0100–U+017F
    \p{InLatin_Extended-B}: U+0180–U+024F
    \p{InIPA_Extensions}: U+0250–U+02AF
    \p{InSpacing_Modifier_Letters}: U+02B0–U+02FF
    \p{InCombining_Diacritical_Marks}: U+0300–U+036F
    \p{InGreek_and_Coptic}: U+0370–U+03FF
    \p{InCyrillic}: U+0400–U+04FF
    \p{InCyrillic_Supplementary}: U+0500–U+052F
    \p{InArmenian}: U+0530–U+058F
    \p{InHebrew}: U+0590–U+05FF
    \p{InArabic}: U+0600–U+06FF
    \p{InSyriac}: U+0700–U+074F
    \p{InThaana}: U+0780–U+07BF
    \p{InDevanagari}: U+0900–U+097F
    \p{InBengali}: U+0980–U+09FF
    \p{InGurmukhi}: U+0A00–U+0A7F
    \p{InGujarati}: U+0A80–U+0AFF
    \p{InOriya}: U+0B00–U+0B7F
    \p{InTamil}: U+0B80–U+0BFF
    \p{InTelugu}: U+0C00–U+0C7F
    \p{InKannada}: U+0C80–U+0CFF
    \p{InMalayalam}: U+0D00–U+0D7F
    \p{InSinhala}: U+0D80–U+0DFF
    \p{InThai}: U+0E00–U+0E7F
    \p{InLao}: U+0E80–U+0EFF
    \p{InTibetan}: U+0F00–U+0FFF
    \p{InMyanmar}: U+1000–U+109F
    \p{InGeorgian}: U+10A0–U+10FF
    \p{InHangul_Jamo}: U+1100–U+11FF
    \p{InEthiopic}: U+1200–U+137F
    \p{InCherokee}: U+13A0–U+13FF
    \p{InUnified_Canadian_Aboriginal_Syllabics}: U+1400–U+167F
    \p{InOgham}: U+1680–U+169F
    \p{InRunic}: U+16A0–U+16FF
    \p{InTagalog}: U+1700–U+171F
    \p{InHanunoo}: U+1720–U+173F
    \p{InBuhid}: U+1740–U+175F
    \p{InTagbanwa}: U+1760–U+177F
    \p{InKhmer}: U+1780–U+17FF
    \p{InMongolian}: U+1800–U+18AF
    \p{InLimbu}: U+1900–U+194F
    \p{InTai_Le}: U+1950–U+197F
    \p{InKhmer_Symbols}: U+19E0–U+19FF
    \p{InPhonetic_Extensions}: U+1D00–U+1D7F
    \p{InLatin_Extended_Additional}: U+1E00–U+1EFF
    \p{InGreek_Extended}: U+1F00–U+1FFF
    \p{InGeneral_Punctuation}: U+2000–U+206F
    \p{InSuperscripts_and_Subscripts}: U+2070–U+209F
    \p{InCurrency_Symbols}: U+20A0–U+20CF
    \p{InCombining_Diacritical_Marks_for_Symbols}: U+20D0–U+20FF
    \p{InLetterlike_Symbols}: U+2100–U+214F
    \p{InNumber_Forms}: U+2150–U+218F
    \p{InArrows}: U+2190–U+21FF
    \p{InMathematical_Operators}: U+2200–U+22FF
    \p{InMiscellaneous_Technical}: U+2300–U+23FF
    \p{InControl_Pictures}: U+2400–U+243F
    \p{InOptical_Character_Recognition}: U+2440–U+245F
    \p{InEnclosed_Alphanumerics}: U+2460–U+24FF
    \p{InBox_Drawing}: U+2500–U+257F
    \p{InBlock_Elements}: U+2580–U+259F
    \p{InGeometric_Shapes}: U+25A0–U+25FF
    \p{InMiscellaneous_Symbols}: U+2600–U+26FF
    \p{InDingbats}: U+2700–U+27BF
    \p{InMiscellaneous_Mathematical_Symbols-A}: U+27C0–U+27EF
    \p{InSupplemental_Arrows-A}: U+27F0–U+27FF
    \p{InBraille_Patterns}: U+2800–U+28FF
    \p{InSupplemental_Arrows-B}: U+2900–U+297F
    \p{InMiscellaneous_Mathematical_Symbols-B}: U+2980–U+29FF
    \p{InSupplemental_Mathematical_Operators}: U+2A00–U+2AFF
    \p{InMiscellaneous_Symbols_and_Arrows}: U+2B00–U+2BFF
    \p{InCJK_Radicals_Supplement}: U+2E80–U+2EFF
    \p{InKangxi_Radicals}: U+2F00–U+2FDF
    \p{InIdeographic_Description_Characters}: U+2FF0–U+2FFF
    \p{InCJK_Symbols_and_Punctuation}: U+3000–U+303F
    \p{InHiragana}: U+3040–U+309F
    \p{InKatakana}: U+30A0–U+30FF
    \p{InBopomofo}: U+3100–U+312F
    \p{InHangul_Compatibility_Jamo}: U+3130–U+318F
    \p{InKanbun}: U+3190–U+319F
    \p{InBopomofo_Extended}: U+31A0–U+31BF
    \p{InKatakana_Phonetic_Extensions}: U+31F0–U+31FF
    \p{InEnclosed_CJK_Letters_and_Months}: U+3200–U+32FF
    \p{InCJK_Compatibility}: U+3300–U+33FF
    \p{InCJK_Unified_Ideographs_Extension_A}: U+3400–U+4DBF
    \p{InYijing_Hexagram_Symbols}: U+4DC0–U+4DFF
    \p{InCJK_Unified_Ideographs}: U+4E00–U+9FFF
    \p{InYi_Syllables}: U+A000–U+A48F
    \p{InYi_Radicals}: U+A490–U+A4CF
    \p{InHangul_Syllables}: U+AC00–U+D7AF
    \p{InHigh_Surrogates}: U+D800–U+DB7F
    \p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
    \p{InLow_Surrogates}: U+DC00–U+DFFF
    \p{InPrivate_Use_Area}: U+E000–U+F8FF
    \p{InCJK_Compatibility_Ideographs}: U+F900–U+FAFF
    \p{InAlphabetic_Presentation_Forms}: U+FB00–U+FB4F
    \p{InArabic_Presentation_Forms-A}: U+FB50–U+FDFF
    \p{InVariation_Selectors}: U+FE00–U+FE0F
    \p{InCombining_Half_Marks}: U+FE20–U+FE2F
    \p{InCJK_Compatibility_Forms}: U+FE30–U+FE4F
    \p{InSmall_Form_Variants}: U+FE50–U+FE6F
    \p{InArabic_Presentation_Forms-B}: U+FE70–U+FEFF
    \p{InHalfwidth_and_Fullwidth_Forms}: U+FF00–U+FFEF
    \p{InSpecials}: U+FFF0–U+FFFF
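
Because a block is just a contiguous code point range, the table above can be turned into an ordinary character class that also works with the standard re module. The sketch below shows one way to do that. The helper block_pattern is a name invented for this example, and the ranges are copied from the list.

```python
import re


def block_pattern(first: int, last: int) -> "re.Pattern[str]":
    """Compile a pattern matching runs of code points in [first, last]."""
    # \UXXXXXXXX escapes work for any code point in a str pattern.
    return re.compile('[\\U{:08X}-\\U{:08X}]+'.format(first, last))


CJK_UNIFIED = block_pattern(0x4E00, 0x9FFF)   # InCJK_Unified_Ideographs
HIRAGANA = block_pattern(0x3040, 0x309F)      # InHiragana

print(CJK_UNIFIED.findall('孔子/现代价值/Theory'))
print(HIRAGANA.findall('日本語のテキスト'))

# U+3400 is a Han character, but it sits in Extension A, outside this block.
print(CJK_UNIFIED.findall('\u3400'))
```

The last line illustrates why a block is not the same as a script: the Han script is spread over several blocks, so a single range misses some Chinese characters.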

Unicode code charts

Example

  • Text filtering: remove punctuation and other special characters, keeping only letters

    >>> import regex
    >>> regex.sub(r'[^\p{L}]', '', '1孔子/现代价值/Theory of "Knowing')
    '孔子现代价值TheoryofKnowing'
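
Where the regex library is not available, a similar filter can be approximated with the standard unicodedata module. This is only a sketch: the function name keep_categories is hypothetical, and the result may differ from regex on rare characters, because the two can be built against different Unicode versions.

```python
import unicodedata


def keep_categories(text: str, prefixes: tuple) -> str:
    """Keep only characters whose General_Category starts with one of the prefixes."""
    return ''.join(
        ch for ch in text
        if unicodedata.category(ch).startswith(prefixes)
    )


text = '1孔子/现代价值/Theory of "Knowing'

print(keep_categories(text, ('L',)))       # letters only, like [^\p{L}] removal
print(keep_categories(text, ('L', 'N')))   # letters and numbers
```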
This article is from qbit snap
