Hello everyone, I am Xiaocai.
A man who hopes to talk about architecture one day! If you also want to be that person, click Follow and keep me company, so that Xiaocai is no longer alone!
This article mainly introduces Selenium; refer to it if you need it.
If it helps, don't forget to leave a like ❥
The WeChat public account has been opened; students who haven't followed yet, please remember to follow!
Hello everyone. This account used to go by a different name, so don't get lost because of the new name or profile picture.
Recently, to broaden my language horizons, I spent about a week getting to know Python. After learning it, it really is delightful. I don't know about you, but when I first pick up a new language I find everything interesting and want to try it all.
Speaking of Python, everyone's first reaction is probably crawlers or automated testing; web development with Python comes up much less. Relatively speaking, the mainstream web development language in China is still Java~ But that doesn't mean Python is unsuited for web development. As far as I know, the commonly used web frameworks include Django and Flask, among others~
Django is a heavyweight framework. It provides many convenient tools and encapsulates a great deal, so you rarely need to reinvent the wheel yourself.
Flask's advantage is that it is small, but its disadvantage is also that it is small. Being flexible means you have to build more wheels yourself or spend more time on configuration.
But the focus of this article is neither Python web development nor a Python basics primer; it is Python automated testing and an introduction to crawlers~
In my opinion, if you have development experience in another language, Xiaocai recommends starting directly from a case and learning as you go. The grammar is largely similar (a later post will learn Python by comparing it with Java), and you can basically read the code. But if you have no development experience in any language, then Xiaocai still recommends learning from scratch; videos and books are both good choices. I recommend Liao Xuefeng's blog; his Python tutorial is quite good.
1. Automated testing
Python can do a lot of things, and a lot of interesting things.
To learn a language, you should of course find points of interest, which makes learning faster. For example, you might want to crawl the pictures or videos of a certain website, right?
What is automated testing? Automation + testing: once you have written a script (a .py file), running it will automatically execute the test process in the background for you. For automated testing there is a great tool to help you, and that is Selenium.
Selenium is a web automated-testing tool that can easily simulate a real user operating the browser. It supports all mainstream browsers, such as IE, Chrome, Firefox, Safari, Opera, etc. Here we demonstrate with Python, which is not to say Selenium only supports Python; it has client drivers for many programming languages. Let's go over a bit of syntax and build a simple example!
1) Pre-preparation
To ensure a smooth demonstration, we need to do some preparation first; otherwise the browser may fail to open properly~
Step 1
Check the browser version. We use Edge: enter edge://version in the address bar to check the version, then go to the driver store, Microsoft Edge WebDriver (windows.net), to install the driver matching that version.
Step 2
Unzip the downloaded driver file into the Scripts folder under your Python installation directory.
2) Browser operations
With the preparation done, let's look at the following simple code:
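A minimal sketch of such a script (assuming it is saved as autoTest.py; the Edge driver comes from the preparation step, and the short sleep is only there to keep the window visible for a moment):

```python
from selenium import webdriver
import time

driver = webdriver.ChromiumEdge()   # create the Edge browser object
driver.maximize_window()            # maximize the window
driver.get("http://www.baidu.com")  # open the Baidu homepage
time.sleep(3)                       # pause so the effect stays visible
```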
Besides the import, there are only 4 lines of code. Enter python autoTest.py in the terminal and you get the following demonstration:
You can see that this script automatically opens the browser, automatically maximizes the window, and automatically opens the Baidu webpage: three automated operations that bring our learning one step closer. Do you find it a bit interesting? Let's sink in step by step!
Here are a few common methods for browser operations:
Method | Description |
---|---|
webdriver.xxx() | Create a browser object |
maximize_window() | Maximize the window |
get_window_size() | Get the browser size |
set_window_size() | Set the browser size |
get_window_position() | Get the browser position |
set_window_position(x, y) | Set the browser position |
close() | Close the current tab/window |
quit() | Close all tabs/windows |
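To make the table concrete, here is a small sketch exercising a few of these calls (the window size and position values are arbitrary):

```python
from selenium import webdriver

driver = webdriver.ChromiumEdge()
driver.set_window_size(1024, 768)      # resize the window
print(driver.get_window_size())        # e.g. {'width': 1024, 'height': 768}
driver.set_window_position(100, 50)    # move the window
print(driver.get_window_position())    # e.g. {'x': 100, 'y': 50}
driver.quit()                          # close all windows and end the session
```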
These are of course just the basics of Selenium; the more impressive parts are yet to come~
When we open the browser, what we want is naturally more than the simple operation of opening a webpage; after all, a programmer's ambition is unlimited! We also want to manipulate page elements automatically, so we need to talk about Selenium's element-positioning operations.
3) Positioning elements
Element positioning on a page is nothing new to front-end developers; it is easily done with JS, for example:
- Position by id
document.getElementById("id")
- Position by name
document.getElementsByName("name")
- Position by tag name
document.getElementsByTagName("tagName")
- Position by class
document.getElementsByClassName("className")
- Position via CSS selector
document.querySelectorAll("css selector")
All of the above can select and position elements. Of course, the protagonist of this section is Selenium; as a leading automated-testing tool, how could it show weakness? It provides 8 ways to position page elements, as follows:
- id positioning
driver.find_element_by_id("id")
When we open the Baidu page, we can find that the id of the search input box is kw.
Knowing the element's id, we can use it to locate the element, as follows:
from selenium import webdriver
# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the Baidu homepage
driver.get("http://baidu.com")
# Locate the element by id
i = driver.find_element_by_id("kw")
# Type a value into the input box
i.send_keys("菜农曰")
- name attribute positioning
driver.find_element_by_name("name")
Locating by name is similar to locating by id: find the value of the name attribute, then call the corresponding API. Usage is as follows:
from selenium import webdriver
# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the Baidu homepage
driver.get("http://baidu.com")
# Locate the element by name
i = driver.find_element_by_name("wd")
# Type a value into the input box
i.send_keys("菜农曰")
- Class name positioning
driver.find_element_by_class_name("className")
Consistent with the id and name methods: find the corresponding className, then locate~
- Tag name positioning
driver.find_element_by_tag_name("tagName")
We rarely use this method in practice, because in HTML, function is defined by tags: an input is input, a table is table... Every element is a tag, and one tag often defines one type of function. A page may contain multiple divs, inputs, tables, and so on, so it is hard to position an element precisely by tag alone~ (see the sketch below)
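To see the ambiguity concretely, here is a small sketch using the plural find_elements_by_tag_name call (the count printed is whatever the page happens to contain):

```python
from selenium import webdriver

driver = webdriver.ChromiumEdge()
driver.get("http://www.baidu.com")
# The plural API returns a list; a page usually has many input tags
inputs = driver.find_elements_by_tag_name("input")
print(len(inputs))  # likely several, so the tag alone cannot pin down one element
```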
- css selector
driver.find_element_by_css_selector("cssVale")
This method needs to connect the five selectors of css
five selectors
- Element selector
The most common CSS selector is the element selector; in an HTML document, this selector usually refers to some HTML element, for example:
html {background-color: black;} p {font-size: 30px; background-color: gray;} h2 {background-color: red;}
- Class selector
A . followed by the class name forms a class selector, for example:
.deadline { color: red;} span.deadline { font-style: italic;}
- id selector
The id selector is somewhat similar to the class selector, but the difference is significant: an element cannot carry multiple ids the way it carries multiple classes; an element's id attribute is unique. An id selector is formed by prefixing the id value with #, for example:
#top { ...}
- Attribute selector
We can select elements based on their attributes and attribute values, for example:
a[href][title] { ...}
- Derived selector
Also known as the context selector, it uses the document's DOM structure for CSS selection. For example:
body li { ...} h1 span { ...}
Of course, this is only a brief introduction to selectors; you can look up more on your own~
Having understood selectors, we can happily locate elements with CSS selectors:
from selenium import webdriver
# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the Baidu homepage
driver.get("http://baidu.com")
# Locate the element with an id CSS selector
i = driver.find_element_by_css_selector("#kw")
# Type a value into the input box
i.send_keys("菜农曰")
- Link text positioning
driver.find_element_by_link_text("linkText")
This method is specially for locating text links. For example, on the Baidu homepage we can see link elements such as News, hao123, Map...
Then we can locate one by its link text:
from selenium import webdriver
# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the Baidu homepage
driver.get("http://baidu.com")
# Locate the element by its link text and click it
driver.find_element_by_link_text("hao123").click()
- Part of the link text
driver.find_element_by_partial_link_text("partialLinkText")
This method is a supplement to link_text. Sometimes the text of a hyperlink is very long; entering all of it would be troublesome and unsightly.
In fact, we only need a substring for selenium to understand what we want to select, and that is exactly what partial_link_text is for~ A sketch follows below.
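A minimal sketch of the idea, reusing the hao123 link from the previous example (any substring that uniquely matches will do):

```python
from selenium import webdriver

driver = webdriver.ChromiumEdge()
driver.get("http://baidu.com")
# "hao" is enough for selenium to match the "hao123" link
driver.find_element_by_partial_link_text("hao").click()
```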
- xpath path expression
driver.find_element_by_xpath("xpathName")
The positioning methods introduced above all assume an ideal state, where every element has a unique id, name, class, or hyperlink-text attribute that we can locate it by. But sometimes the element we want has no id, name, or class attribute, or several elements share the same attribute values, or the values change whenever the page refreshes. In that case we can only locate via xpath or CSS. Of course, you don't have to work out the xpath value yourself: open the page, find the element in the F12 developer tools, right-click it, and copy its xpath.
Then locate it in code:
from selenium import webdriver
# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the Baidu homepage
driver.get("http://www.baidu.com")
# Locate by xpath and type into the search box
driver.find_element_by_xpath("//*[@id='kw']").send_keys("菜农曰")
4) Element operations
Of course, what we want is not just to select elements, but to operate on them after selecting them. In the demonstrations above we have already used two operations, click() and send_keys("value"). Here we continue with several others~
Method | Description |
---|---|
click() | Click the element |
send_keys("value") | Simulate keyboard input |
clear() | Clear the element's content, e.g. an input box |
submit() | Submit a form |
text | Get the element's text content |
is_displayed() | Determine whether the element is visible |
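As a quick sketch, here are several of these operations chained on the Baidu search box (the kw id comes from the positioning section above):

```python
from selenium import webdriver

driver = webdriver.ChromiumEdge()
driver.get("http://www.baidu.com")

box = driver.find_element_by_id("kw")
print(box.is_displayed())   # True if the search box is visible
box.send_keys("selenium")   # simulate typing
box.clear()                 # clear the input box again
box.send_keys("菜农曰")
box.submit()                # submit the surrounding search form
```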
After reading these, don't they feel familiar? They are just the basic operations of JS~!
5) Practical exercises
After learning the operations above, we can simulate a shopping operation on the Xiaomi Mall. The code is as follows:
from selenium import webdriver

item_url = "https://www.mi.com/buy/detail?product_id=10000330"

# Load the Edge driver
driver = webdriver.ChromiumEdge()
# Maximize the window
driver.maximize_window()
# Open the product purchase page
driver.get(item_url)
# Set an implicit wait so a slow network doesn't leave the page half-loaded
driver.implicitly_wait(30)
# Choose the address
driver.find_element_by_xpath("//*[@id='app']/div[3]/div/div/div/div[2]/div[2]/div[3]/div/div/div[1]/a").click()
driver.implicitly_wait(10)
# Click to select the address manually
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div/div/div/div["
                             "1]/div/div/div[2]/span[1]").click()
# Select Fujian
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div/div/div/div/div/div/div["
                             "1]/div[2]/span[13]").click()
driver.implicitly_wait(10)
# Select the city
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div/div/div/div/div/div/div["
                             "1]/div[2]/span[1]").click()
driver.implicitly_wait(10)
# Select the district
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div/div/div/div/div/div/div["
                             "1]/div[2]/span[1]").click()
driver.implicitly_wait(10)
# Select the street
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div/div/div/div/div/div/div["
                             "1]/div[2]/span[1]").click()
driver.implicitly_wait(20)
# Click "add to cart"
driver.find_element_by_class_name("sale-btn").click()
driver.implicitly_wait(20)
# Click "go to cart to check out"
driver.find_element_by_xpath("//*[@id='app']/div[2]/div/div[1]/div[2]/a[2]").click()
driver.implicitly_wait(20)
# Click "proceed to checkout"
driver.find_element_by_xpath("//*[@id='app']/div[2]/div/div/div/div[1]/div[4]/span/a").click()
driver.implicitly_wait(20)
# Click "agree to the agreement"
driver.find_element_by_xpath("//*[@id='stat_e3c9df7196008778']/div[2]/div[2]/div/div/div/div[3]/button[1]").click()
The effect is as follows:
This is a practice run of what we have learned. Of course, if you run into a flash sale, you can also write a script to practice~ :boom: If the item is out of stock, we can add a while loop to poll until it becomes available!
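A minimal sketch of that polling idea, under the assumption that clicking the sale-btn element (the add-to-cart button from the script above) fails while the item is out of stock; the 3-second retry interval is arbitrary:

```python
import time
from selenium import webdriver

item_url = "https://www.mi.com/buy/detail?product_id=10000330"
driver = webdriver.ChromiumEdge()
driver.get(item_url)
driver.implicitly_wait(10)

while True:
    try:
        # Try to add to cart; this succeeds once the item is in stock
        driver.find_element_by_class_name("sale-btn").click()
        break
    except Exception:
        time.sleep(3)       # wait a moment, then reload and retry
        driver.refresh()
```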
2. Crawlers
Above we saw how to use Selenium for automated testing; usage must stay legal, of course~ Next we will demonstrate another powerful capability of Python: the crawler.
Before learning crawlers, we need to get to know a few necessary tools.
1) Page downloaders
Python's standard library already provides modules for HTTP requests, such as urllib, urllib2, and httplib, but their APIs are neither easy to use nor elegant~ Completing even the simplest task takes a lot of work and piles of method calls. Of course programmers could not bear this, so heroes from all sides developed all sorts of handy third-party libraries~
- requests
requests is an HTTP library for Python, released under the Apache2 license. It wraps Python's built-in modules at a high level, so that when making network requests users can conveniently do everything a browser can do~
- scrapy
The difference between requests and scrapy is perhaps that scrapy is a comparatively heavyweight framework. It is a website-level crawler, whereas requests is a page-level crawler whose concurrency and performance are not as strong as scrapy's.
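To close out this section, a minimal sketch of a page download with requests (the User-Agent header mimics a browser, as in the practice script later in this article):

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
res = requests.get('https://www.liaoxuefeng.com/', headers=headers)
print(res.status_code)   # 200 on success
html = res.text          # the downloaded page as a string
```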
2) Page parser
- BeautifulSoup
BeautifulSoup is a module that receives an HTML or XML string, formats it, and then lets you use the methods it provides to find specified elements quickly, making it simple to locate elements within HTML or XML. A small sketch follows.
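A small sketch of that workflow on a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

html = '<div class="x-wiki-content"><p>Hello, crawler</p></div>'
soup = BeautifulSoup(html, 'html.parser')
# Find an element by class and read its text
div = soup.find(class_='x-wiki-content')
print(div.text)  # Hello, crawler
```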
- scrapy.Selector
Selector is based on parsel, a more advanced wrapper, and selects a certain part of an HTML document via specific XPath or CSS expressions. It is built on the lxml library, which means the two are very similar in speed and parsing accuracy.
For concrete usage, please consult the Scrapy documentation; the introduction there is quite detailed.
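For comparison, a small sketch of scrapy's Selector on the same made-up snippet, showing equivalent XPath and CSS queries:

```python
from scrapy.selector import Selector

html = '<div class="x-wiki-content"><p>Hello, crawler</p></div>'
sel = Selector(text=html)
print(sel.xpath('//div[@class="x-wiki-content"]/p/text()').get())  # Hello, crawler
print(sel.css('div.x-wiki-content p::text').get())                 # Hello, crawler
```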
3) Data storage
Once we have crawled the content down, we need a corresponding storage medium to keep it.
Concrete database operations will be introduced in later web-development posts~
- txt text
Handled with ordinary file operations.
- sqlite3
SQLite, a lightweight database, is an ACID-compliant relational database management system contained in a relatively small C library. A small sketch follows.
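A minimal sketch of dropping crawled text into SQLite (the database file and table are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect('books.db')  # creates the file if it does not exist
conn.execute('CREATE TABLE IF NOT EXISTS chapter (title TEXT, content TEXT)')
conn.execute('INSERT INTO chapter VALUES (?, ?)', ('Intro', 'Hello, crawler'))
conn.commit()
conn.close()
```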
- mysql
No need for much of an introduction; you all know it, the old flame of web development.
4) Practical exercises
A web crawler is easier to understand as network data collection: programmatically request data (in HTML form) from a web server, then parse the HTML to extract the data you want.
We can simply divide it into three steps:
- Fetch the html data for the given url
- Parse the html to obtain the target data
- Store the data
Of course, all of this assumes you understand Python's basic syntax and the basics of HTML~
We will use requests + BeautifulSoup + text files for this exercise, assuming we want to crawl the content of teacher Liao Xuefeng's Python tutorial~
# Import the requests library
import requests
# Import file-handling libraries
import codecs
import os
from bs4 import BeautifulSoup

# Give the request a header that mimics the Chrome browser
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
server = 'https://www.liaoxuefeng.com/'
# Address of Liao Xuefeng's Python tutorial
book = 'https://www.liaoxuefeng.com/wiki/1016959663602400'
# Define the storage location
save_path = 'D:/books/python'
if os.path.exists(save_path) is False:
    os.makedirs(save_path)

# Fetch the content of one chapter
def get_contents(chapter):
    req = requests.get(url=chapter, headers=headers)
    html = req.content
    html_doc = str(html, 'utf8')
    bf = BeautifulSoup(html_doc, 'html.parser')
    texts = bf.find_all(class_="x-wiki-content")
    # Take the text of the content div; \xa0 is a non-breaking space
    content = texts[0].text.replace('\xa0' * 4, '\n')
    return content

# Write a chapter to a text file
def write_txt(chapter, content, code):
    with codecs.open(chapter, 'a', encoding=code) as f:
        f.write(content)

# Main method
def main():
    res = requests.get(book, headers=headers)
    html = res.content
    html_doc = str(html, 'utf8')
    # Parse the HTML
    soup = BeautifulSoup(html_doc, 'html.parser')
    # Collect all the chapter links
    a = soup.find('div', id='1016959663602400').find_all('a')
    print('Total chapters: %d' % len(a))
    for each in a:
        try:
            chapter = server + each.get('href')
            content = get_contents(chapter)
            chapter = save_path + "/" + each.string.replace("?", "") + ".txt"
            write_txt(chapter, content, 'utf8')
        except Exception as e:
            print(e)

if __name__ == '__main__':
    main()
When we run the program, we can see the crawled tutorial content under D:/books/python.
With that, we have implemented a simple crawler. But do crawl with caution~!
In this article we got a taste of both automated testing and crawlers; I hope it has sparked your interest~
Don't just talk, and don't be lazy; be a man who talks architecture together with Xiaocai~ Follow me and keep me company, so that Xiaocai is no longer alone. See you in the next one!
If you work a little harder today, you will have one less favor to beg tomorrow!
I am Xiaocai, a man who becomes stronger with you.💋
The WeChat public account has been opened; students who haven't followed yet, please remember to follow!