Preface
After a stretch of overtime, the project finally went live. I expected things to calm down, but the opposite happened and all sorts of problems surfaced. I work on a POS pre-transaction system, which handles business related to merchants' purchases and transactions and has to send an "interbank number" (the code that identifies a bank branch) to the upstream payment institution. Because the data in our system was incomplete, the branch often could not be found or the interbank number was wrong, and the upstream request failed.
To solve this, I asked an upstream organization for a copy of their branch information. It turned out to contain 140,000 records. While importing it, I found abnormal data: some branches located in Jiangxi, for example, carried a Beijing area code. After some investigation I found that there was quite a lot of such data, which was a real headache. At first I took the lazy route and fixed a record only when customer service reported it.
After two minutes of thought, I realized that at this rate I would be patching data every day. Short-term pain beats long-term pain, so I decided to fix everything in one go. I opened Baidu and, after some browsing, found that the following three websites have fairly complete branch information. The plan was to compare their data with the data in our system and correct ours accordingly.
- http://www.jsons.cn/banknum/
- http://www.5cm.cn/bank/{interbank number}/
- https://www.appgate.cn/branch/bankBranchDetail/{interbank number}
Analyze the website
On jsons.cn, enter the interbank number, select the query method, and click the query button. The result page flashes by and is then covered by an advertisement page, so quickly that it is easy to miss. That was not going to stop me. From a front-end perspective, it is clear that the table tag that displays the results is hidden so that the advertisement can be shown instead. So I opened the console and looked at the source code.
After some searching, I finally found the address of the details page.
From the steps above, crawling the data takes two steps: first query by interbank number, then open the details page to get the data we want. So the first task is to find the query interface, and I opened the familiar console once again.
From the figure above, we can see that these requests all fetch advertisements, and the interface we want is not among them. Did the result appear out of thin air? No. The site simply does not separate its front end from its back end, so there is no standalone data request to capture, and we need to start from the page source instead.
<html>
<body>
<form id="form1" class="form-horizontal" action="/banknum/" method="post">
<div class="form-group">
<label class="col-sm-2 control-label"> 关键词:</label>
<div class="col-sm-10">
<input class="form-control" type="text" id="keyword" name="keyword" value="102453000160" placeholder="请输入查询关键词,例如:中关村支行" maxlength="50" />
</div>
</div>
<div class="form-group">
<label class="col-sm-2 control-label"> 搜索类型:</label>
<div class="col-sm-10">
<select class="form-control" id="txtflag" name="txtflag">
<option value="0">支行关键词</option>
<option value="1" selected="">银行联行号</option>
<option value="2">支行网点地址</option>
</select>
</div>
</div>
<div class="form-group">
<label class="col-sm-2 control-label"> </label>
<div class="col-sm-10">
<button type="submit" class="btn btn-success"> 开始查询</button>
<a href="/banknum/" class="btn btn-danger">清空输入框</a>
</div>
</div>
</form>
</body>
</html>
By analyzing the code, we can get:
- Request address: http://www.jsons.cn/banknum/
- Request method: POST
Request parameters:
- keyword: the interbank number
- txtflag: 1 (the option for searching by interbank number)
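The request described above can also be sketched in plain Java. This is a minimal sketch that only builds the POST request with the JDK's `java.net.http` API (Java 11+) and does not send it. The class and method names (`BankNumQuery`, `formBody`, `buildRequest`) are my own for illustration. The URL and the two parameters come from the form source.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class BankNumQuery {

    // Same address as the page itself; only the HTTP method differs.
    static final String QUERY_URL = "http://www.jsons.cn/banknum/";

    /** Encodes the parameters as an application/x-www-form-urlencoded body. */
    static String formBody(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    /** Builds (but does not send) the POST request for one interbank number. */
    static HttpRequest buildRequest(String bankBranchCode) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("keyword", bankBranchCode);
        params.put("txtflag", "1"); // 1 = search by interbank number
        return HttpRequest.newBuilder(URI.create(QUERY_URL))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formBody(params)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = buildRequest("102453000160");
        System.out.println(request.method() + " " + request.uri());
        // To actually send it:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

The response is an HTML page rather than JSON, so it still has to be parsed for the result table.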
We can use Postman to verify that the interface works. The verification result is shown in the figure below:
The remaining two websites are simpler. You only need to put the interbank number into the URL and send a request to get the data, so I won't go into details here.
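For these two sites, building the request URL is just string substitution. A minimal sketch follows. The URL patterns are the ones listed earlier, while the class name `BranchUrls`, its methods, and the 12-digit check are my own assumptions. The check is based only on the sample value 102453000160, so adjust it if your codes look different.

```java
class BranchUrls {

    // %s is replaced by the interbank number.
    static final String FIVE_CM_URL = "http://www.5cm.cn/bank/%s/";
    static final String APP_GATE_URL = "https://www.appgate.cn/branch/bankBranchDetail/%s";

    /** Assumption: an interbank number is exactly 12 digits, like the sample value. */
    static String requireValid(String code) {
        if (code == null || !code.matches("\\d{12}")) {
            throw new IllegalArgumentException("Unexpected interbank number: " + code);
        }
        return code;
    }

    static String fiveCmUrl(String code) {
        return String.format(FIVE_CM_URL, requireValid(code));
    }

    static String appGateUrl(String code) {
        return String.format(APP_GATE_URL, requireValid(code));
    }

    public static void main(String[] args) {
        System.out.println(fiveCmUrl("102453000160"));
        System.out.println(appGateUrl("102453000160"));
    }
}
```

Validating the code before substitution also keeps stray characters out of the request path.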
Crawler preparation
After the analysis above, we have the interfaces we need. Everything is ready except the code. The principle of crawling is simple: parse the HTML elements, read the values we need, and save them. Since the project is developed in Java, I use Jsoup for this.
<!-- HTML parser -->
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.13.1</version>
</dependency>
Since the data on any single website may be incomplete, we crawl the sites one by one: try the first, and if it returns nothing, try the next, and so on. For this kind of scenario, a variant of the chain of responsibility design pattern fits well.
BankBranchVO branch information
import lombok.Builder;
import lombok.Data;

@Data
@Builder
public class BankBranchVO {
    /**
     * Branch name
     */
    private String bankName;
    /**
     * Interbank number
     */
    private String bankCode;
    /**
     * Province
     */
    private String provName;
    /**
     * City
     */
    private String cityName;
}
BankBranchSpider abstract class
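public abstract class BankBranchSpider {
    /**
     * The next spider in the chain
     */
    private BankBranchSpider nextSpider;

    /**
     * Parse the branch information
     *
     * @param bankBranchCode interbank number of the branch
     * @return branch information
     */
    protected abstract BankBranchVO parse(String bankBranchCode);

    /**
     * Set the next spider
     *
     * @param nextSpider the next spider
     */
    public void setNextSpider(BankBranchSpider nextSpider) {
        this.nextSpider = nextSpider;
    }

    /**
     * Whether to use the next spider.
     * Decides, based on the crawl result, whether the next website should be tried.
     *
     * @param vo branch information
     * @return true or false
     */
    protected abstract boolean useNextSpider(BankBranchVO vo);

    /**
     * Look up the branch information
     *
     * @param bankBranchCode interbank number of the branch
     * @return branch information
     */
    public BankBranchVO search(String bankBranchCode) {
        BankBranchVO vo = parse(bankBranchCode);
        // Delegate once: the next spider walks the rest of the chain itself.
        // (An "if" instead of a "while" avoids looping forever if a subclass
        // keeps returning true from useNextSpider.)
        if (useNextSpider(vo) && this.nextSpider != null) {
            vo = nextSpider.search(bankBranchCode);
        }
        if (vo == null) {
            throw new SpiderException("Unable to get branch information: " + bankBranchCode);
        }
        return vo;
    }
}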
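To see how the fallback in `search` behaves without hitting any website, here is a minimal, self-contained sketch. `ChainDemo`, `Spider`, and `StubSpider` are simplified stand-ins that I made up for illustration. They return plain strings instead of `BankBranchVO` and throw `IllegalStateException` instead of `SpiderException`.

```java
import java.util.ArrayList;
import java.util.List;

class ChainDemo {

    /** Simplified stand-in for BankBranchSpider; results are plain strings. */
    abstract static class Spider {
        private Spider next;

        /** Links the next spider and returns it, so calls can be chained. */
        Spider setNext(Spider next) {
            this.next = next;
            return next;
        }

        /** Returns the result, or null when this source has no data. */
        protected abstract String parse(String code);

        String search(String code) {
            String result = parse(code);
            if (result == null && next != null) {
                return next.search(code); // fall through to the next source
            }
            if (result == null) {
                throw new IllegalStateException("No branch info found: " + code);
            }
            return result;
        }
    }

    /** Hypothetical stub that records each call and returns a fixed answer. */
    static class StubSpider extends Spider {
        private final String name;
        private final String answer;
        private final List<String> calls;

        StubSpider(String name, String answer, List<String> calls) {
            this.name = name;
            this.answer = answer;
            this.calls = calls;
        }

        @Override
        protected String parse(String code) {
            calls.add(name);
            return answer;
        }
    }

    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        Spider first = new StubSpider("first", null, calls);
        first.setNext(new StubSpider("second", "Some Branch", calls))
             .setNext(new StubSpider("third", "unused", calls));
        // The first stub has no data, the second answers, the third is never asked.
        System.out.println(first.search("102453000160") + " via " + calls);
    }
}
```

The third stub is never called because the second one already returned a result. That is the point of the chain: later sites are only queried when earlier ones come back empty.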
The parsing method differs from site to site, but in every case it boils down to reading values out of HTML tags. There are many ways to do this. My implementation is posted below for reference only.
JsonCnSpider
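import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import lombok.extern.slf4j.Slf4j;
import org.jsoup.Jsoup;
import org.jsoup.internal.StringUtil;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

@Slf4j
public class JsonCnSpider extends BankBranchSpider {
    /**
     * Crawl URL
     */
    private static final String URL = "http://www.jsons.cn/banknum/";

    @Override
    protected BankBranchVO parse(String bankBranchCode) {
        try {
            log.info("jsons.cn - branch lookup: {}", bankBranchCode);
            // Set the request parameters
            Map<String, String> map = new HashMap<>(2);
            map.put("keyword", bankBranchCode);
            map.put("txtflag", "1");
            // Query the branch list
            Document doc = Jsoup.connect(URL).data(map).post();
            Element tbody = doc.selectFirst("tbody");
            Element row = tbody == null ? null : tbody.selectFirst("tr");
            if (row == null) {
                return null;
            }
            Elements td = row.select("td");
            // The details link is read from the fourth cell, so we need at least 4 cells
            if (td.size() < 4) {
                return null;
            }
            // Get the details URL
            Element link = td.get(3).selectFirst("a");
            String detailUrl = link == null ? "" : link.attr("href");
            if (StringUtil.isBlank(detailUrl)) {
                return null;
            }
            log.info("jsons.cn - branch details - interbank number: {}, details page: {}", bankBranchCode, detailUrl);
            // Get the details
            Elements footers = Jsoup.connect(detailUrl).get().select("blockquote").select("footer");
            if (footers.size() < 5) {
                return null;
            }
            // Each value is the third child node of its footer element
            String bankName = footers.get(1).childNode(2).toString();
            String bankCode = footers.get(2).childNode(2).toString();
            String provName = footers.get(3).childNode(2).toString();
            String cityName = footers.get(4).childNode(2).toString();
            return BankBranchVO.builder()
                    .bankName(bankName)
                    .bankCode(bankCode)
                    .provName(provName)
                    .cityName(cityName)
                    .build();
        } catch (IOException e) {
            log.error("jsons.cn - branch lookup failed: {}, reason: {}", bankBranchCode, e.getLocalizedMessage());
            return null;
        }
    }

    @Override
    protected boolean useNextSpider(BankBranchVO vo) {
        return vo == null;
    }
}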
FiveCmSpider
import java.io.IOException;
import lombok.extern.slf4j.Slf4j;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

@Slf4j
public class FiveCmSpider extends BankBranchSpider {
    /**
     * Crawl URL
     */
    private static final String URL = "http://www.5cm.cn/bank/%s/";

    @Override
    protected BankBranchVO parse(String bankBranchCode) {
        log.info("5cm.cn - branch lookup: {}", bankBranchCode);
        try {
            Document doc = Jsoup.connect(String.format(URL, bankBranchCode)).get();
            Elements tr = doc.select("tr");
            Elements h1 = doc.select("h1");
            // Treat a page without the expected rows or heading as "not found"
            if (tr.isEmpty() || h1.isEmpty()) {
                return null;
            }
            Elements td = tr.get(0).select("td");
            if (td.size() < 4 || "".equals(td.get(1).text())) {
                return null;
            }
            String bankName = h1.get(0).text();
            String provName = td.get(1).text();
            String cityName = td.get(3).text();
            return BankBranchVO.builder()
                    .bankName(bankName)
                    .bankCode(bankBranchCode)
                    .provName(provName)
                    .cityName(cityName)
                    .build();
        } catch (IOException e) {
            log.error("5cm.cn - branch lookup failed: {}, reason: {}", bankBranchCode, e.getLocalizedMessage());
            return null;
        }
    }

    @Override
    protected boolean useNextSpider(BankBranchVO vo) {
        return vo == null;
    }
}
AppGateSpider
import java.io.IOException;
import lombok.extern.slf4j.Slf4j;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
import org.springframework.util.StringUtils;

@Slf4j
public class AppGateSpider extends BankBranchSpider {
    /**
     * Crawl URL
     */
    private static final String URL = "https://www.appgate.cn/branch/bankBranchDetail/";

    @Override
    protected BankBranchVO parse(String bankBranchCode) {
        try {
            log.info("appgate.cn - branch lookup: {}", bankBranchCode);
            Document doc = Jsoup.connect(URL + bankBranchCode).get();
            Elements tr = doc.select("tr");
            // Rows 1 to 3 are read below, so at least 4 rows are required
            if (tr.size() < 4) {
                return null;
            }
            String bankName = tr.get(1).select("td").get(1).text();
            if (!StringUtils.hasText(bankName)) {
                return null;
            }
            String provName = tr.get(2).select("td").get(1).text();
            String cityName = tr.get(3).select("td").get(1).text();
            return BankBranchVO.builder()
                    .bankName(bankName)
                    .bankCode(bankBranchCode)
                    .provName(provName)
                    .cityName(cityName)
                    .build();
        } catch (IOException e) {
            log.error("appgate.cn - branch lookup failed: {}, reason: {}", bankBranchCode, e.getLocalizedMessage());
            return null;
        }
    }

    @Override
    protected boolean useNextSpider(BankBranchVO vo) {
        return vo == null;
    }
}
Initialize the crawler
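import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// @Configuration is the conventional home for @Bean factory methods
@Configuration
public class BankBranchSpiderBean {
    @Bean
    public BankBranchSpider bankBranchSpider() {
        JsonCnSpider jsonCnSpider = new JsonCnSpider();
        FiveCmSpider fiveCmSpider = new FiveCmSpider();
        AppGateSpider appGateSpider = new AppGateSpider();
        // Chain order: jsons.cn -> 5cm.cn -> appgate.cn
        jsonCnSpider.setNextSpider(fiveCmSpider);
        fiveCmSpider.setNextSpider(appGateSpider);
        return jsonCnSpider;
    }
}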
Crawling interface
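import lombok.AllArgsConstructor;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@AllArgsConstructor
@RequestMapping("/bank/branch")
public class BankBranchController {
    private final BankBranchSpider bankBranchSpider;

    /**
     * Look up the branch information
     *
     * @param bankBranchCode interbank number of the branch
     * @return branch information
     */
    @GetMapping("/search/{bankBranchCode}")
    public BankBranchVO search(@PathVariable("bankBranchCode") String bankBranchCode) {
        return bankBranchSpider.search(bankBranchCode);
    }
}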
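Once the crawler returns a branch, the comparison with the system's data mentioned in the preface could be sketched roughly as follows. This is only an illustration under my own assumptions: `BranchDiff` and `Branch` are made-up names (`Branch` mirrors the four fields of `BankBranchVO`), the sample values are invented, and exact string equality may be too strict if the two sources spell province or city names differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

class BranchDiff {

    /** Minimal stand-in for BankBranchVO with the same four fields. */
    static final class Branch {
        final String bankName;
        final String bankCode;
        final String provName;
        final String cityName;

        Branch(String bankName, String bankCode, String provName, String cityName) {
            this.bankName = bankName;
            this.bankCode = bankCode;
            this.provName = provName;
            this.cityName = cityName;
        }
    }

    /** Lists the fields where the stored record disagrees with the crawled one. */
    static List<String> mismatches(Branch stored, Branch crawled) {
        List<String> diffs = new ArrayList<>();
        if (!Objects.equals(stored.bankName, crawled.bankName)) {
            diffs.add("bankName");
        }
        if (!Objects.equals(stored.provName, crawled.provName)) {
            diffs.add("provName");
        }
        if (!Objects.equals(stored.cityName, crawled.cityName)) {
            diffs.add("cityName");
        }
        return diffs;
    }

    public static void main(String[] args) {
        // Invented sample: a Jiangxi branch stored with Beijing as its region.
        Branch stored = new Branch("Sample Branch", "102453000160", "Beijing", "Beijing");
        Branch crawled = new Branch("Sample Branch", "102453000160", "Jiangxi", "Nanchang");
        System.out.println(mismatches(stored, crawled));
    }
}
```

Records with a non-empty mismatch list are the candidates for correction.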
Demo
Crawl success
Failed to crawl
Code address
Summary
The main difficulty of this crawler lies in jsons.cn. Its query interface is buried in the page source, so it takes some time to find. The request address is also identical to the page address and only the request method differs, which is easy to overlook. The other two sites are simple by comparison: just substitute the interbank number into the URL. None of the three sites has an anti-crawling mechanism, so the data is easy to obtain.
End
If you found this helpful, feel free to comment and like, or visit my homepage, where you may find other articles you enjoy. A follow is welcome too. Thank you.
I am a tech geek with a difference: I make a little progress every day and experience a different life. See you next time!