I'm trying to use Unicode property escapes to match words across languages.
The Unicode property escapes proposal gives the following solution:
const regex = /([\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+)/gu;
const text = `
Amharic: የኔ ማንዣበቢያ መኪና በዓሣዎች ተሞልቷል
Bengali: আমার হভারক্রাফ্ট কুঁচে মাছ-এ ভরা হয়ে গেছে
Georgian: ჩემი ხომალდი საჰაერო ბალიშზე სავსეა გველთევზებით
Macedonian: Моето летачко возило е полно со јагули
Vietnamese: Tàu cánh ngầm của tôi đầy lươn
`;
let match;
while (match = regex.exec(text)) {
  const word = match[1];
  console.log(`Matched word with length ${ word.length }: ${ word }`);
}
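The Mark Unicode category matches combining characters such as \uFE0F (VARIATION SELECTOR-16), and \p{Join_Control} matches \u200D (ZERO WIDTH JOINER). With the regex above, such a character is recognized as a word even when it stands alone. How should the regex be modified to avoid this?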
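One possible direction, offered as a minimal sketch rather than a confirmed answer: keep the character class from the proposal, but put a negative lookahead in front of it so that a match cannot start with a Mark or Join_Control character. This assumes that a "word" must begin with a base character and that marks and joiners only count when attached to one. The names `wordRegex` and `findWords` are made up for illustration. Narrowing the first character to \p{Alphabetic} alone would not be enough, because some combining marks (for example many Indic vowel signs) are themselves Alphabetic.

```javascript
// Same character class as the proposal's regex, plus a negative lookahead
// (assumption: a word may not START with a combining mark or join control).
const wordRegex = /(?![\p{Mark}\p{Join_Control}])[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}\p{Join_Control}]+/gu;

// Hypothetical helper: returns every word found in `text` (empty array if none).
function findWords(text) {
  return text.match(wordRegex) || [];
}

// 'e' followed by U+0301 (combining acute accent) stays part of its word,
// while the stand-alone U+200D and U+FE0F are skipped.
const sample = 'cafe\u0301 \u200D \uFE0F 123';
console.log(findWords(sample)); // two matches: 'cafe\u0301' and '123'
```

Marks and joiners that directly follow a base character are still consumed by the `+`, so sequences like the Bengali and Amharic samples above keep their combining characters. Whether a leading mark should be dropped or reported as an error depends on the use case.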