使用正则 Unicode 属性匹配单词

试图使用 unicode 属性匹配来实现跨语言的单词匹配

unicode property escapes 的提案里给了这样的解决方法

const regex = /([\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+)/gu;
const text = `
Amharic: የኔ ማንዣበቢያ መኪና በዓሣዎች ተሞልቷል
Bengali: আমার হভারক্রাফ্ট কুঁচে মাছ-এ ভরা হয়ে গেছে
Georgian: ჩემი ხომალდი საჰაერო ბალიშზე სავსეა გველთევზებით
Macedonian: Моето летачко возило е полно со јагули
Vietnamese: Tàu cánh ngầm của tôi đầy lươn
`;

let match;
while (match = regex.exec(text)) {
  const word = match[1];
  console.log(`Matched word with length ${ word.length }: ${ word }`);
}


Mark Unicode Category 能够匹配组合符,例如\u200D\uFE0F。按照上面的正则,组合符单独存在也会被识别为一个词,应该如何修改正则来避免?

提案链接:https://github.com/tc39/propo...

阅读 1.2k
撰写回答
你尚未登录,登录后可以
  • 和开发者交流问题的细节
  • 关注并接收问题和回答的更新提醒
  • 参与内容的编辑和改进,让解决方法与时俱进
推荐问题