How can a Python regular expression match arbitrary text that ends with a double quote (")?


Posted on
2017-10-03

Updated on
2017-10-08

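I may not have described the problem very clearly...
This started with an English class assignment: we have to memorize more than five hundred words, and I didn't want to look them up one by one, so I decided to write a crawler that searches Bing Dictionary for each translation so that I can simply read the results.
The regular expression for this turned out to be hard to write (for me, at least...).

I wanted to use BeautifulSoup at first... but I'm not skilled enough to make it work...
So I took a clumsy route: fetch the page with requests.get(), keep the slice of the source that roughly contains the translation, and then apply a regular expression to it.
My code looks like this: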

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# vocabulary list: automatically fill in the Chinese translations
# Not done very well... needs improvement

import requests, re, os

# Read the file to get the word list
os.chdir("D:\\")
file = open("D:\\English.txt", 'r', encoding = 'UTF-8')
new_file = open("D:\\vocabulary_list.txt", "w", encoding = 'UTF-8')
vocabulary = file.read()
words = vocabulary.split()

# Handle each item in the list according to its type

words_list = []
for word in words:
    numRegex = re.compile(r'\d')
    numMo = numRegex.search(word)
    try:
        words_list.append(numMo.group())
    except:
        # Crawler
        kv = {'q' : word}
        r = requests.get("http://cn.bing.com/dict/search", params = kv)
        try:
            r.raise_for_status()
        except:
            print("Something went wrong here")
            
        text = r.text[400:600]
        regex = re.compile(r'(n|v|pron|adj|adv|num|art|prep|conj|int)(\.)(.*)')
        mo = regex.search(text)
        try:
            expression = mo.group()
            words_list.append(word + ' ' + expression)
        except:
            print('Not found')

# Write to the file

for word in words_list:
    new_file.write(word + '\n')
        
# Close the files

file.close()
new_file.close()
print('Done')

The files look roughly like this:
English.txt:

After the run, vocabulary_list.txt looks like this (the screenshot shows only a middle portion):

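I just feel it isn't done well enough...
I'd appreciate any advice for a beginner like me...
Thanks in advance to anyone kind enough to help~

The original question is above...
I revised the code following the suggestions, but it keeps reporting that li is NoneType... Posting it here to see what is wrong...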

# Revised vocabulary.py
import requests, os
import bs4
os.chdir("D:\\")
with open(".\\English.txt", 'r', encoding = 'UTF-8') as file_in:
    words = file_in.read().split()
with open(".\\new_vocabulary_list.txt",'w', encoding = 'UTF-8') as file_out:
    for word in words:
        file_out.write(word)
        file_out.write('\n')
        r = requests.get("http://cn.bing.com/dict/search?q="+word)
        r.encoding = r.apparent_encoding
        markup = r.text

        soup = bs4.BeautifulSoup(markup, "html.parser")
        root_element = soup.find(class_="qdef").find("ul") # Find the ul node under the node whose class is qdef (to see why it is that node, check the page source)
        for li in root_element.find_all("li"):
            file_out.write('\t')
            if 'web' not in li.find(class_="pos")['class']:
                file_out.write(li.find(class_="pos").string)
                file_out.write(' ')
                file_out.write(li.find(class_="def").string)
            file_out.write('\n')

regular expressions

python

7.8k views

2 answers

龙方淞

✓ Accepted

Here is a version adapted from your code first:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Fetch the page directly with the built-in urllib library
from urllib.request import urlopen
import re, os

# Read the file to get the word list
# os.chdir("D:\\")
# Wouldn't it be better to just use the files in the current directory?

file = open("./English.txt", 'r', encoding = 'UTF-8')
new_file = open("./vocabulary_list.txt", "w", encoding = 'UTF-8')
vocabulary = file.read()
words = vocabulary.split()

# Handle each item in the list according to its type

words_list = []
for word in words:
    numRegex = re.compile(r'\d')
    numMo = numRegex.search(word)
    try:
        words_list.append(numMo.group())
    except:
        r = urlopen("http://cn.bing.com/dict/search?q="+word)
        text = r.read().decode('UTF-8')
        # You should read up on greedy versus non-greedy matching in regular expressions
        regex = re.compile(r'(n|v|pron|adj|adv|num|art|prep|conj|int)(\.)(.*?)"')
        mo = regex.search(text)
        try:
            expression = mo.group()
            words_list.append(word + ' ' + expression[:-1])# drop the trailing "
        except:
            print('Not found')

# Write to the file

for word in words_list:
    new_file.write(word + '\n')
        
# Close the files

file.close()
new_file.close()
print('Done')
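
To make the greedy versus non-greedy remark concrete, here is a minimal sketch. The `sample` string is a made-up line of page source, not copied from Bing, and the names `POS`, `greedy`, `lazy` and `negated` are only for illustration. It compares the greedy `(.*)"`, the non-greedy `(.*?)"` used above, and a negated character class `([^"]*)"`, which is another common way to say "everything up to the next double quote".

```python
import re

# Part-of-speech prefix, same alternation as in the answer's pattern.
POS = r'(n|v|pron|adj|adv|num|art|prep|conj|int)(\.)'

# Hypothetical line of page source, made up for illustration.
sample = '<meta content="n. 苹果；苹果树" name="description" />'

# Greedy: .* runs to the end of the line, then backtracks to the LAST quote.
greedy = re.search(POS + r'(.*)"', sample).group(3)

# Non-greedy: .*? stops at the FIRST quote after the prefix.
lazy = re.search(POS + r'(.*?)"', sample).group(3)

# Negated class: [^"]* can never cross a quote, so it also stops at the first one.
negated = re.search(POS + r'([^"]*)"', sample).group(3)

print(repr(greedy))
print(repr(lazy))
print(repr(negated))
```

On this sample the greedy capture swallows the following attribute as well, while the other two stop right after the definitions, so no slicing of the page source is needed to keep the match short.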

Here is another version that I wrote with Beautiful Soup (I did no error handling, but it is easy to add):

from urllib.request import urlopen
import re, os
import bs4

with open("./English.txt", 'r') as file_in:
    words = file_in.read().split()
with open("./vocabulary_list.txt",'w') as file_out:
    for word in words:
        file_out.write(word)
        file_out.write('\n')
        markup = urlopen("http://cn.bing.com/dict/search?q="+word).read().decode("UTF-8")
        soup = bs4.BeautifulSoup(markup, "html.parser")
        root_element = soup.find(class_="qdef").find("ul") # Find the ul node under the node whose class is qdef (to see why it is that node, check the page source)
        for li in root_element.find_all("li"):
            file_out.write('\t')
            if 'web' not in li.find(class_="pos")['class']:
                file_out.write(li.find(class_="pos").string)
                file_out.write(' ')
                file_out.write(li.find(class_="def").string)
            file_out.write('\n')

Result:
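
As for the error handling this answer leaves out: `find()` in Beautiful Soup returns `None` when nothing matches, and chaining another call onto that `None` is a likely source of a NoneType error like the one reported in the question's update. Below is a minimal sketch of the same traversal with explicit `None` checks. The helper name `extract_definitions` and the `sample` markup are hypothetical; the class names `qdef`, `pos`, `def` and `web` are the ones used in the code above, and the real page layout may differ.

```python
import bs4


def extract_definitions(markup):
    """Return (part of speech, definition) pairs, or [] if the expected nodes are missing."""
    soup = bs4.BeautifulSoup(markup, "html.parser")
    qdef = soup.find(class_="qdef")
    if qdef is None:  # e.g. the word was not found, so the page has no definition box
        return []
    ul = qdef.find("ul")
    if ul is None:
        return []
    pairs = []
    for li in ul.find_all("li"):
        pos = li.find(class_="pos")
        definition = li.find(class_="def")
        if pos is None or definition is None:
            continue  # skip list items that lack either node
        if "web" in pos.get("class", []):
            continue  # skip the web-definition entry, as the original code does
        # get_text() also works when the node has nested children, where .string is None
        pairs.append((pos.get_text(strip=True), definition.get_text(strip=True)))
    return pairs


# Hypothetical markup, made up for illustration.
sample = """
<div class="qdef"><ul>
<li><span class="pos">n.</span><span class="def"><span>苹果</span></span></li>
<li><span class="pos web">网络</span><span class="def">苹果公司</span></li>
<li><span class="note">no pos here</span></li>
</ul></div>
"""

print(extract_definitions(sample))
print(extract_definitions("<html><body>no results</body></html>"))
```

Returning an empty list for a missing node lets the calling loop write the word and move on instead of stopping at the first word that has no definition box.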

nyrd33


Posted on
2017-10-05

Updated on
2017-10-05

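Matching web pages with regular expressions is a last resort; avoid it whenever you can, because many other methods are more efficient than regex.
A basic demonstration: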

In [27]: import re
    ...: import requests
    ...: from bs4 import BeautifulSoup
    ...:
    ...: key_word = "variables"
    ...:
    ...: response = requests.get("http://cn.bing.com/dict/search", params={"q": key_word}, verify=False)
    ...: soup = BeautifulSoup(response.text,"lxml")
    ...: content = soup.find("meta", attrs={"name": "description"})["content"]
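        # All definitions of the word sit in the content attribute of the meta node whose name attribute is "description"
    ...: print(content)
        # This yields every definition of the word; some extra content still needs further processing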
    ...:
必应词典为您提供variables的释义，美['veriəb(ə)l]，英['veəriəb(ə)l]，adj. 变量；可以调节的；可变的；形式多变的； n. 因素；变数；变元；变动；
网络释义： 变项；变量声明；变量窗口；

In [28]: content = re.sub(r"必应.*英.*?，|； (?=[a-z]{1}|网)", "\n    ", content[:-2])
        # Strip the redundant leading part and split at the same places the web page does to format the string; the [:-2] slice drops the trailing "；"
    ...: print(key_word, content)
        # Output
    ...:
variables
    adj. 变量；可以调节的；可变的；形式多变的
    n. 因素；变数；变元；变动
    网络释义： 变项；变量声明；变量窗口；

Wrap the above in a function and add a loop around the call, and you can conveniently go through every word:

import re
import requests
from bs4 import BeautifulSoup

def bing_dict_crawl(key_word):
    response = requests.get("http://cn.bing.com/dict/search", params={"q": key_word}, verify=False)
    soup = BeautifulSoup(response.text, "lxml")
    content = soup.find("meta", attrs={"name": "description"})["content"]
    content = re.sub(r"必应.*英.*?，|； (?=[a-z]{1}|网)", "\n    ", content[:-2])
    return key_word + content

openpath = "XXXXXXXXXXXXXXX"
savepath = "XXXXXXXXXXXXXXX"
meaning_list = []
with open(openpath, "r", encoding="utf-8") as openfile:
    wordlist = openfile.readlines()
for index, line in enumerate(wordlist):
    # Strip the trailing newline here; a call such as key_word.strip() cannot be a loop target
    key_word = line.strip()
    print("Fetching {0}, {1} words remaining".format(key_word, len(wordlist) - index - 1))
    content = bing_dict_crawl(key_word)
    meaning_list.append(content)
with open(savepath, "w", encoding="utf-8") as savefile:
    savefile.write("\n".join(sorted(meaning_list)))

Preview of the crawl and its result:

Fetching apply, 2 words remaining
Fetching audio, 1 words remaining
Fetching append, 0 words remaining
append
    v. 增补
    网络释义： 附加；追加；添加
apply
    v. 应用；使用；涂；敷
    网络释义： 申请；适用；套用
audio
    adj. 声音的；录音的
    网络释义： 音频；声卡；音效
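
The idea of reading the definitions from the description meta tag can also be tried offline. The sketch below swaps BeautifulSoup and lxml for the standard library's `html.parser`, purely so that it runs without extra packages; the class `MetaDescriptionParser`, the helper `meta_description` and the `sample_page` markup are hypothetical names and data made up for illustration, not the real Bing page.

```python
from html.parser import HTMLParser


class MetaDescriptionParser(HTMLParser):
    """Remember the content attribute of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attr_map = dict(attrs)  # attrs is a list of (name, value) pairs
        if attr_map.get("name") == "description":
            self.description = attr_map.get("content")


def meta_description(markup):
    """Return the description string, or None if the page has no such meta tag."""
    parser = MetaDescriptionParser()
    parser.feed(markup)
    return parser.description


# Hypothetical page source, made up for illustration.
sample_page = (
    '<html><head><meta charset="utf-8">'
    '<meta name="description" content="adj. 声音的；录音的； 网络释义： 音频；声卡；音效； ">'
    '</head><body></body></html>'
)

print(repr(meta_description(sample_page)))
print(meta_description("<html><head></head><body></body></html>"))
```

The returned string can then go through the same `re.sub` formatting step as in the answer; checking for `None` first avoids an exception on pages that lack the tag.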
