
Using Python to scrape public data from Wikipedia, Baidu Baike, and similar sites and store it in a table is not difficult. But in many scenarios we are no longer satisfied with just storing the scraped data in a table; we also want to visualize it in a more intuitive way. In this case, for example, Python is used to scrape the host cities of past Winter Olympics from Wikipedia, and maps and image galleries are then generated from the data, with flexible sharing and collaboration on top. Doing the visualization and web publishing in Python after scraping would be complicated and inefficient, and out of reach for many non-programmers. Combining Python with SeaTable makes it easy: anyone can get started. As a new online collaborative spreadsheet and information management tool, SeaTable not only manages various types of data conveniently, but also provides rich data visualization features and a complete Python API.

This article shows how to scrape city data from Wikipedia with Python, automatically fill it into a SeaTable table, and use SeaTable's visualization plugins to generate maps, image galleries, and more. The figure below shows a basic table of the cities that have hosted the Winter Olympics.

Task objective: for each city, find its geographic location (latitude and longitude) via its Wikipedia link and fill it into the "Latitude and Longitude" field; at the same time, download a representative picture of the city from Wikipedia and upload it to the "City Picture" field.

Automatically fill the city's latitude and longitude into the "Latitude and Longitude" field of the table

Extracting information from web pages requires some simple Python scraping techniques. This task uses the requests and BeautifulSoup modules: requests fetches the page and returns its HTML, and BeautifulSoup parses that HTML into a DOM tree from which the desired information can be read out of the tags. Taking the latitude and longitude of a city on Wikipedia as an example, the structure of the DOM tree is as follows:

As long as a piece of information is visible on the web page, its location can be found in the DOM tree via the page source, and the desired content can be extracted with a little analysis. For the specific parsing methods, refer to the BeautifulSoup documentation.

The following code parses the latitude and longitude from a city's URL:

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Chamonix"  # Wikipedia link of the city
# Request the link; the response body is the HTML of the page
resp = requests.get(url)
# Feed the HTML into the BeautifulSoup parser
soup = BeautifulSoup(resp.content, "html.parser")

# Longitude: find the element whose class is "longitude" and read its text
lon = soup.find_all(attrs={"class": "longitude"})[0].string
# Latitude: find the element whose class is "latitude" and read its text
lat = soup.find_all(attrs={"class": "latitude"})[0].string

The longitude and latitude extracted above are in the standard degree/minute/second geographic format, e.g. 45° 55′ 23.16″ N, 6° 52′ 10.92″ E. To write them into the SeaTable table they need to be converted into decimal degrees, so a small conversion routine is required.
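As a rough idea, a minimal standalone sketch of such a conversion might look like this (the helper name `dms_to_decimal` is my own, not part of the script below):

```python
import re

def dms_to_decimal(dms: str) -> float:
    """Convert a DMS coordinate such as '45°55′23.16″N' to decimal degrees."""
    parts = re.split("°|′|″", dms)                 # e.g. ['45', '55', '23.16', 'N']
    hemisphere = parts[-1].strip()                 # 'N', 'S', 'E' or 'W'
    numbers = [p for p in parts[:-1] if p.strip()] # degrees, minutes, (optional) seconds
    sign = 1 if hemisphere in ("N", "E") else -1   # south/west become negative
    # degrees / 60**0 + minutes / 60**1 + seconds / 60**2
    return sign * sum(float(x) / 60 ** n for n, x in enumerate(numbers))

print(round(dms_to_decimal("45°55′23.16″N"), 6))  # → 45.9231
```

Filtering out empty parts also handles coordinates without a seconds component, such as 45°55′N.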

Automatically upload city pictures to the "City Picture" field of the table

In this task, besides the latitude and longitude, a picture also needs to be downloaded and uploaded to the table. Like the coordinates, the picture's source can be found in the DOM tree:

The src attribute of the img tag is the download link we need. Combined with the file-operation functions of the SeaTable API, the image can easily be downloaded and then uploaded to the table. The complete code for the task follows:

import requests
from bs4 import BeautifulSoup
import re
from seatable_api import Base, context
import os
import time
'''
This script scrapes data about the host cities of past Winter Olympics
from Wikipedia, parses it, and fills it into a SeaTable table.
The data includes the geographic coordinates and a representative picture of each city.
'''
SERVER_URL = context.server_url or 'https://cloud.seatable.cn/'
API_TOKEN  = context.api_token  or 'cacc42497886e4d0aa8ac0531bdcccb1c93bd0f5'
TABLE_NAME = "历届举办地"          # table: past host cities
URL_COL_NAME = "维基百科城市链接"   # column: Wikipedia city link
CITY_COL_NAME = "举办城市"         # column: host city
POSITION_COL_NAME = "经纬度"       # column: latitude and longitude
IMAGE_COL_NAME = "城市图片"        # column: city picture

def get_time_stamp():
    return str(int(time.time() * 10000000))

class Wiki(object):

    def __init__(self, authed_base):
        self.base = authed_base
        self.soup = None

    def _convert(self, tude):
        # Convert a degree/minute/second coordinate into decimal degrees
        # so it can be written into the table.
        multiplier = 1 if tude[-1] in ['N', 'E'] else -1
        return multiplier * sum(float(x) / 60 ** n for n, x in enumerate(tude[:-1]))

    def _format_position(self, coordinate):
        format_str_list = re.split("°|′|″", coordinate)
        if len(format_str_list) == 3:
            # No seconds component, e.g. 45°55′N; pad with "00"
            format_str_list.insert(2, "00")
        return format_str_list

    def _get_soup(self, url):
        # Initialize the DOM parser for the given page
        resp = requests.get(url)
        soup = BeautifulSoup(resp.content, "html.parser")
        self.soup = soup
        return soup

    def get_tu_position(self, url):
        soup = self.soup or self._get_soup(url)

        # Parse the page's DOM, extract the coordinate strings
        # and return them as decimal degrees
        lon = soup.find_all(attrs={"class": "longitude"})[0].string
        lat = soup.find_all(attrs={"class": "latitude"})[0].string

        converted_lon = self._convert(self._format_position(lon))
        converted_lat = self._convert(self._format_position(lat))

        return {
            "lng": converted_lon,
            "lat": converted_lat
        }

    def get_file_download_url(self, url):
        # Parse the DOM and extract the download link of the first infobox image

        soup = self.soup or self._get_soup(url)
        src_image_tag = soup.find_all(attrs={"class": "infobox ib-settlement vcard"})[0].find_all('img')
        src = src_image_tag[0].attrs.get('src')
        return "https:%s" % src

    def handle(self, table_name):
        base = self.base
        for row in base.list_rows(table_name):
            try:
                url = row.get(URL_COL_NAME)
                if not url:
                    continue
                row_id = row.get("_id")
                position = self.get_tu_position(url)
                image_file_download_url = self.get_file_download_url(url)
                extension = image_file_download_url.split(".")[-1]

                # Download the image to a temporary file
                image_name = "/tmp/wiki-image-%s-%s.%s" % (row_id, get_time_stamp(), extension)
                resp_img = requests.get(image_file_download_url)
                with open(image_name, 'wb') as f:
                    f.write(resp_img.content)
                # Upload the image to the base
                info_dict = base.upload_local_file(
                    image_name,
                    name=None,
                    relative_path=None,
                    file_type='image',
                    replace=True
                )

                # Write the coordinates and the uploaded image back to the row
                row_data = {
                    POSITION_COL_NAME: position,
                    IMAGE_COL_NAME: [info_dict.get('url'), ]
                }
                base.update_row(table_name, row_id, row_data)
                os.remove(image_name)
                self.soup = None  # reset the parser for the next row
            except Exception as e:
                print("error", row.get(CITY_COL_NAME), e)

def run():
    base = Base(API_TOKEN, SERVER_URL)
    base.auth()

    wo = Wiki(base)
    wo.handle(TABLE_NAME)

if __name__ == '__main__':
    run()

The following shows the table after running the script, which wrote the data automatically. Compared with searching online and filling in each row by hand, the automated script saves a great deal of time and is both accurate and efficient.

Use SeaTable's map plugin to automatically generate a city map

With the latitude and longitude obtained above, we can add a map plugin from the plugin panel of the SeaTable table; a few clicks are then enough to mark the cities on the map automatically based on the "Latitude and Longitude" field. You can also set different label colors and choose which fields are displayed directly or on hover. Compared with scanning each city row by row in the table, the map visualization is clearly more vivid and intuitive.

Visualize city pictures with SeaTable’s gallery plugin

The gallery plugin can also be placed on the table toolbar so it can be opened for viewing at any time. In the gallery plugin's settings, a few clicks are enough to display the pictures from the "City Picture" field as a gallery, and you can also set the title field and other fields to display. This is more attractive and convenient than browsing thumbnails in the table and greatly improves the viewing experience. Clicking a picture zooms in on it, and clicking a title jumps straight to the corresponding row in the table, where its content can be viewed and edited.

In addition, SeaTable supports flexible sharing and fine-grained permission control, which covers detailed and diverse sharing scenarios. For example, to let others view the map and gallery directly, you can add "map" and "gallery" as external apps from the plugin. You can explore further uses yourself, so they are not covered here.

Summary

As a new collaborative spreadsheet and information management tool, SeaTable is not only feature-rich but also easy to use. When writing Python programs, we can flexibly combine them with SeaTable's features, saving time and labor on programming, development, and maintenance, and quickly and easily building more interesting and complete applications, letting the tool deliver greater value.

