How to get plain text out of Wikipedia

Question

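I want to write a script that gets only the description section of a Wikipedia page. That is, when I say

/wiki bla bla bla

it will go to the Wikipedia page for bla bla bla, get the following, and return it to the chatroom:

"Bla Bla Bla" is the title of a song written by Gigi D'Agostino. He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"

How can I do this?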

Originally posted by Wifi; translation licensed under CC BY-SA 4.0

python mediawiki wikipedia wikipedia-api mediawiki-api

1 Answer

Community wiki, posted 2023-01-04

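Here are a few different possible approaches; use whichever works for you. All my code examples below use requests for HTTP requests to the API; you can install requests with pip install requests if you have Pip. They also all use the Mediawiki API, and two use the query endpoint; consult their documentation if you need more detail.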

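1. Use `extracts` to get a plain text representation of either the entire page or the page "extract" straight from the API

Note that this approach only works on MediaWiki sites with the TextExtracts extension. This notably includes Wikipedia, but not some smaller Mediawiki sites like, say, http://www.wikia.com/

You want to hit a URL like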

https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Bla_Bla_Bla&prop=extracts&exintro&explaintext

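Breaking that down, we have the following parameters (documented at https://www.mediawiki.org/wiki/Extension:TextExtracts#query+extracts ):

action=query, format=json and titles=Bla_Bla_Bla are all standard MediaWiki API parameters
prop=extracts makes us use the TextExtracts extension
exintro limits the response to content before the first section heading
explaintext makes the extract in the response plain text instead of HTML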
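
To make the mapping between that URL and its parameters concrete, here is a minimal sketch that rebuilds an equivalent URL with the standard library's urllib.parse.urlencode. The helper name build_extract_url is made up for illustration, and passing 1 as the value of exintro and explaintext is a choice made for this sketch (the URL above passes them with no value; the API treats a boolean parameter as true whenever it is present).

```python
from urllib.parse import urlencode

API_URL = 'https://en.wikipedia.org/w/api.php'


def build_extract_url(title):
    """Build a query URL asking TextExtracts for a page's plain-text intro.

    Hypothetical helper for illustration; not part of any library.
    """
    params = {
        'action': 'query',      # standard MediaWiki API parameters
        'format': 'json',
        'titles': title,
        'prop': 'extracts',     # use the TextExtracts extension
        'exintro': 1,           # only content before the first section heading
        'explaintext': 1,       # plain text instead of HTML
    }
    # urlencode escapes the title; spaces become '+'
    return API_URL + '?' + urlencode(params)


print(build_extract_url('Bla Bla Bla'))
```

The printed URL differs from the one above only in how the title's spaces and the two flags are written.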

Then parse the JSON response and pull out the extract:

>>> import requests
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'query',
...         'format': 'json',
...         'titles': 'Bla Bla Bla',
...         'prop': 'extracts',
...         'exintro': True,
...         'explaintext': True,
...     }
... ).json()
>>> # 'pages' is keyed by page ID, so take its only entry
>>> page = next(iter(response['query']['pages'].values()))
>>> print(page['extract'])
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
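
For the chat command in the question, the request above needs two small pieces around it: turning "/wiki bla bla bla" into a title, and coping with a page that does not exist. The following is a rough sketch under assumptions rather than a finished bot: both function names are hypothetical, the two sample dictionaries are hand-written stand-ins shaped like the API's default JSON output (not real responses), and the page ID 123 is invented.

```python
def title_from_command(message, prefix='/wiki '):
    """Return the title from a chat message like '/wiki bla bla bla', else None."""
    if not message.startswith(prefix):
        return None
    title = message[len(prefix):].strip()
    return title or None


def extract_from_response(data):
    """Return the plain-text extract from a decoded query response, or None."""
    pages = data.get('query', {}).get('pages', {})
    page = next(iter(pages.values()), None)
    # In the default JSON format a nonexistent page carries a 'missing' key
    if page is None or 'missing' in page:
        return None
    return page.get('extract') or None


# Hand-written sample data (assumed shapes, not real API output)
found = {'query': {'pages': {'123': {
    'pageid': 123, 'title': 'Bla Bla Bla',
    'extract': '"Bla Bla Bla" is the title of a song.'}}}}
missing = {'query': {'pages': {'-1': {
    'ns': 0, 'title': 'No such page', 'missing': ''}}}}

print(title_from_command('/wiki bla bla bla'))
print(extract_from_response(found))
print(extract_from_response(missing))
```

In a chat handler, the title would be passed as 'titles' in the requests.get call shown above, and the decoded .json() result handed to extract_from_response.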

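2. Use the `parse` endpoint to get the full HTML of the page, parse it, and extract the first paragraph

MediaWiki has a parse endpoint that you can hit with a URL like https://en.wikipedia.org/w/api.php?action=parse&page=Bla_Bla_Bla to get the HTML of a page. You can then parse it with an HTML parser like lxml (install it first with pip install lxml) to extract the first paragraph.

For example: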

>>> import requests
>>> from lxml import html
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'parse',
...         'page': 'Bla Bla Bla',
...         'format': 'json',
...     }
... ).json()
>>> # The rendered page HTML is stored under parse -> text -> '*'
>>> raw_html = response['parse']['text']['*']
>>> document = html.document_fromstring(raw_html)
>>> first_p = document.xpath('//p')[0]
>>> intro_text = first_p.text_content()
>>> print(intro_text)
"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.
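
Taking the very first <p> can be fragile: on some pages the first paragraph element in the returned HTML may be empty (this is an assumption about Wikipedia's rendered output, not something the answer above states). As a hedged variation, the sketch below picks the first paragraph that contains visible text. It uses the standard library's html.parser rather than lxml, purely to keep the example free of dependencies; the class and function names are made up, and the sample HTML is hand-written.

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collect the text of the first <p> element that has visible text."""

    def __init__(self):
        super().__init__()
        self.in_p = False     # True while inside a <p> we are reading
        self.chunks = []      # text pieces of the current paragraph
        self.result = None    # first non-empty paragraph, once found

    def handle_starttag(self, tag, attrs):
        if tag == 'p' and self.result is None:
            self.in_p = True
            self.chunks = []

    def handle_endtag(self, tag):
        if tag == 'p' and self.in_p:
            self.in_p = False
            text = ''.join(self.chunks).strip()
            if text:
                self.result = text

    def handle_data(self, data):
        if self.in_p:
            self.chunks.append(data)


def first_paragraph(raw_html):
    """Return the text of the first non-empty paragraph, or None."""
    parser = FirstParagraph()
    parser.feed(raw_html)
    parser.close()
    return parser.result


# Hand-written sample, not real API output
sample = ('<div><p class="mw-empty-elt">\n</p>'
          '<p><b>"Bla Bla Bla"</b> is a song by <a href="/wiki/x">Gigi</a>.</p>'
          '<p>Second paragraph.</p></div>')
print(first_paragraph(sample))
```

With lxml, the same idea would amount to looping over document.xpath('//p') and returning the first text_content() that is not blank.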

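3. Parse the wikitext yourself

You can use the query API to get the page's wikitext, parse it using mwparserfromhell (install it first using pip install mwparserfromhell), then reduce it down to human-readable text using strip_code. strip_code doesn't work perfectly at the time of writing (as shown clearly in the example below) but will hopefully improve.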

>>> import requests
>>> import mwparserfromhell
>>> response = requests.get(
...     'https://en.wikipedia.org/w/api.php',
...     params={
...         'action': 'query',
...         'format': 'json',
...         'titles': 'Bla Bla Bla',
...         'prop': 'revisions',
...         'rvprop': 'content',
...     }
... ).json()
>>> page = next(iter(response['query']['pages'].values()))
>>> # The raw wikitext of the latest revision is stored under '*'
>>> wikicode = page['revisions'][0]['*']
>>> parsed_wikicode = mwparserfromhell.parse(wikicode)
>>> print(parsed_wikicode.strip_code())
{{dablink|For Ke$ha's song, see Blah Blah Blah (song). For other uses, see Blah (disambiguation)}}

"Bla Bla Bla" is the title of a song written and recorded by Italian DJ Gigi D'Agostino. It was released in May 1999 as the third single from the album, L'Amour Toujours. It reached number 3 in Austria and number 15 in France. This song can also be heard in an added remixed mashup with L'Amour Toujours (I'll Fly With You) in its US radio version.

Background and writing
He described this song as "a piece I wrote thinking of all the people who talk and talk without saying anything". The prominent but nonsensical vocal samples are taken from UK band Stretch's song "Why Did You Do It"''.

Music video
The song also featured a popular music video in the style of La Linea. The music video shows a man with a floating head and no arms walking toward what appears to be a shark that multiplies itself and can change direction. This style was also used in "The Riddle", another song by Gigi D'Agostino, originally from British singer Nik Kershaw.

Chart performance
Chart (1999-00)PeakpositionIreland (IRMA)Search for Irish peaks23

References

External links

Category:1999 singles
Category:Gigi D'Agostino songs
Category:1999 songs
Category:ZYX Music singles
Category:Songs written by Gigi D'Agostino
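
The leftovers visible in that output (a template line at the top and the Category: lines at the bottom) can be filtered with plain string handling. This is a crude sketch, not a feature of mwparserfromhell: the function name tidy_stripped_text is hypothetical, it only drops whole lines that look like a single template or a category link, and the sample text is a shortened hand-made version of the output above.

```python
import re

# A line consisting of nothing but one {{...}} template
TEMPLATE_LINE = re.compile(r'^\{\{.*\}\}$')


def tidy_stripped_text(text):
    """Drop leftover template lines and 'Category:' lines from stripped wikitext."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if TEMPLATE_LINE.match(stripped) or stripped.startswith('Category:'):
            continue
        kept.append(line)
    # Remove blank lines left at the start and end by the filtering
    return '\n'.join(kept).strip()


sample = (
    '{{dablink|For other uses, see Blah (disambiguation)}}\n'
    '\n'
    '"Bla Bla Bla" is the title of a song.\n'
    '\n'
    'References\n'
    '\n'
    'Category:1999 singles\n'
    'Category:1999 songs'
)
print(tidy_stripped_text(sample))
```

Empty section headings such as "References" are left in place; removing them would need knowledge of the page structure that this line-based filter does not have.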

Originally posted by Mark Amery; translation licensed under CC BY-SA 3.0

This content is translated from Stack Overflow.
