How do I match English words in an English article with a regular expression? Thanks ^_^

I'm new here, so please bear with me.

Problem description

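Requirement: write a Java program that counts how many times each English word occurs in an article. When deciding what counts as a word, take into account the surrounding spaces, line breaks, and the hyphen "-"; a hyphen joins its parts into a single word. This should be done with regular expressions. The specific rules are: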

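  1. Treat each of the following as one word:
    don't, doesn't, didn't, can't, couldn't, wouldn't, isn't, aren't, wasn't, weren't
  2. Also treat each of the following as one word:
    he's, she's, I'm, you're, we're, they're
  3. Do not count the following; delete them:
    Shawn's, apple's, Jonas’, what's, 'twas
  4. ice-cream counts as one word when it is not split by a line break at the end of a line, and the hyphen in the middle must not be removed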

Source of the problem and my own approach

I read some material and wrote a first draft:
(?:she's|he's|they're|we're|you're|I'm|It's)|(?:isn't|aren't|doesn't|don't|didn't|haven't|hadn't|hasn't|can't|couldn't|wasn't|weren't|wouldn't )

The test string is:
She's"1.tom:'what's your name.' Jame's Janes', didn't, character,wasn't,
ice-cream,

Relevant code

(?:she's|he's|they're|we're|you're|I'm|It's)|(?:isn't|aren't|doesn't|don't|didn't|haven't|hadn't|hasn't|can't|couldn't|wasn't|weren't|wouldn't )

What result did you expect? What error did you actually see?

However, it does not correctly recognize ordinary words, hyphens, or line breaks.
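One way to see where the draft falls short is to run it against the test string and list what it finds. The sketch below is only an illustration: the class name `DraftRegexDemo` and the helper `findMatches` are made up for this example, and the stray space after `wouldn't` in the draft is assumed to be a typo and removed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DraftRegexDemo {
    // The draft pattern from the question (stray space after "wouldn't" removed).
    static final Pattern DRAFT = Pattern.compile(
        "(?:she's|he's|they're|we're|you're|I'm|It's)"
        + "|(?:isn't|aren't|doesn't|don't|didn't|haven't|hadn't|hasn't"
        + "|can't|couldn't|wasn't|weren't|wouldn't)");

    // Hypothetical helper: collects every substring the draft pattern finds.
    static List<String> findMatches(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = DRAFT.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String test = "She's\"1.tom:'what's your name.' Jame's Janes', didn't, character,wasn't,\nice-cream,";
        System.out.println(findMatches(test)); // prints [he's, didn't, wasn't]
    }
}
```

The output suggests three gaps: the pattern is case-sensitive and has no word boundaries (it finds `he's` inside `She's`), it only knows the listed contractions (so `tom`, `character`, and `ice-cream` are never matched), and it has no notion of hyphens or line breaks.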

Thanks in advance to the veterans for pointing the way! Please help me design this regular expression ^_^

2 answers

This basically meets your requirements:

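    public static int countWordsUsingRegex(String arg) {
        if (arg == null) {
            return 0;
        }

        // Split on the possessives to delete, on a trailing 's, on punctuation or
        // whitespace (except ' and -), and on a hyphen followed by a line break.
        // The specific alternatives come first; otherwise the generic separator
        // class would always win at the leading whitespace.
        // Adjust the "-\n" part yourself to suit your line-break handling.
        final String[] words = arg.split("\\s+Shawn's\\s+|\\s+apple's\\s+|\\s+Jonas'\\s+|\\s+what's\\s+|\\s+'twas\\s+|'s\\s+|[\\p{Punct}\\s&&[^'-]]+|-\n");

        for (String word : words) {
            System.out.println(word);
        }

        return words.length;
    }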
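As a possible alternative to splitting, the words can be matched directly and then filtered. The sketch below is not the approach from this answer; `WordCounter`, `KEEP`, `TOKEN`, and `count` are hypothetical names, and it assumes that rule 3 means "drop every apostrophe form that is not listed in rules 1 and 2". It does not handle a hyphen followed by a line break.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCounter {
    // Contractions from rules 1 and 2 that count as single words (lowercase).
    static final Set<String> KEEP = new HashSet<>(Arrays.asList(
        "don't", "doesn't", "didn't", "can't", "couldn't", "wouldn't",
        "isn't", "aren't", "wasn't", "weren't",
        "he's", "she's", "i'm", "you're", "we're", "they're"));

    // Letters, optionally joined by an apostrophe or hyphen, with an optional
    // leading or trailing apostrophe (for forms such as 'twas or Jonas').
    static final Pattern TOKEN = Pattern.compile(
        "['’]?[a-z]+(?:['’-][a-z]+)*['’]?", Pattern.CASE_INSENSITIVE);

    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            // Normalize the typographic apostrophe and the letter case.
            String word = m.group().replace('’', '\'').toLowerCase(Locale.ROOT);
            // Assumed reading of rule 3: drop apostrophe forms not on the keep list.
            if (word.contains("'") && !KEEP.contains(word)) {
                continue;
            }
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "She's here. He didn't eat Shawn's ice-cream, didn't he?";
        // prints {she's=1, here=1, he=2, didn't=2, eat=1, ice-cream=1}
        System.out.println(count(text));
    }
}
```

One known limitation of this sketch: a word wrapped in single quotes, such as 'tom', looks like an apostrophe form and is dropped too, because a regular expression alone cannot tell a quotation mark from an apostrophe.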

Would writing it this way work? It is a bit more cumbersome, but the accuracy should be fairly high.

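    // tokenizer1, tokenizer2, word and word2 are declared elsewhere.
    while (tokenizer1.hasMoreTokens()) {
        // The apostrophe is left out of the delimiters so that contractions
        // such as "didn't" stay in one token.
        word = tokenizer1.nextToken(" ,?.!:;\"`()[]\n").toLowerCase();
        if (word.matches("[a-z]+-")) { // the word ends with a hyphen at a line break
            word2 = tokenizer2.nextToken(" ,?.!:;'-"); // take the rest of the word from the start of the next line
            word += word2;
        } else if (word.contains("'")) {
            if (word.matches("isn't|aren't|doesn't|don't|didn't|haven't|hadn't|hasn't|can't|couldn't|wasn't|weren't|wouldn't")) {
                // a negative contraction: keep it as a single word
            }
        } else if (word.length() <= 1) {
            word = "";
        }
    }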
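The step that glues a line-end fragment to the start of the next line can also be done with one regular expression before tokenizing. This is a minimal sketch, not part of the answer above: `HyphenJoiner` and `joinLineEndHyphens` are hypothetical names, and keeping the hyphen after joining is an assumption that mirrors the `word += word2` step.

```java
class HyphenJoiner {
    // Joins a word that is split by a hyphen at the end of a line with its
    // continuation on the next line. The hyphen is kept (an assumption), so
    // "ice-\ncream" becomes "ice-cream".
    static String joinLineEndHyphens(String text) {
        // \R matches any line break (\n, \r\n, ...); [ \t]* skips spaces and
        // tabs around it. The lookarounds require a letter on both sides.
        return text.replaceAll("(?<=[A-Za-z])-[ \\t]*\\R[ \\t]*(?=[A-Za-z])", "-");
    }

    public static void main(String[] args) {
        String text = "I like ice-\ncream and ice-cream.";
        System.out.println(joinLineEndHyphens(text)); // prints: I like ice-cream and ice-cream.
    }
}
```

After this pass the text can be tokenized line-agnostically, so a second tokenizer for the next line is no longer needed.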