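I can use webmagic to scrape all the links on a page without any trouble and save them in a collection. However, once I run a fairly ordinary piece of code, the number of links in that collection changes. This seems very strange; can anyone explain it?
The code is as follows: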
List<String> homeworkUrlList = page.getHtml().xpath("//tr[@class='altrow']//td[1]").links().all();
if (homeworkUrlList != null && homeworkUrlList.size() > 0) {
//    JSONArray jsonArray = new JSONArray();
//    List<String> homeworkUrlList2 = homeworkUrlList;
//    int courseid = Integer.parseInt(pageUrl.substring(pageUrl.indexOf("=")+1));
//    for (String homeworkUrl : homeworkUrlList) {
//        int homeworkId = Integer.parseInt(homeworkUrl.substring(homeworkUrl.lastIndexOf("=")+1));
//        tdid2 = tdid + String.valueOf(homeworkId);
//        finish = page.getHtml().xpath("//td[@id='"+tdid2+"']/span/text()").toString();
//        //System.out.println(finish);
//
//        // "未提交" is the status text shown on the page; it means "not submitted"
//        if (finish.equals("未提交")) {
//            JSONObject jsonobject = new JSONObject();
//            jsonobject.put("userid", login.id);
//            jsonobject.put("courseid", courseid);
//            jsonobject.put("homeworkid", homeworkId);
//            jsonArray.add(jsonobject);
//        }
//    }
//
//    if (jsonArray != null && jsonArray.size() > 0) {
//        page.putField("unfinish", jsonArray);
//    }
//    System.out.println("homework list:" + homeworkUrlList);
    System.out.println(homeworkUrlList.size());
    page.addTargetRequests(homeworkUrlList);
}
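Taking screenshots on the cloud server is inconvenient, so I have copied the results instead.
Below is part of the output before and after enabling the commented-out code. (In the output, "插入未提交作业表" means "inserting into the unsubmitted-homework table" and "登陆信息正确" means "login information is correct".)
Before enabling the commented-out code: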
10
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=559
15
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=568
11
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=282
15
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=708
15
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=201
15
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=132
15
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=164
11
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=444
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=166
1
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=247
15
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=206
7
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=1168
14
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=297
8
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=300
1
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=219
15
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=445
6
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=494
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=166
1
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=247
unfinish: [{"homeworkid":4404,"userid":1,"courseid":247}]
插入未提交作业表
3
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=512
unfinish: [{"homeworkid":9269,"userid":1,"courseid":512},{"homeworkid":8344,"userid":1,"courseid":512},{"homeworkid":8343,"userid":1,"courseid":512}]
插入未提交作业表
get page: http://www.scholat.com/course/S_oneHomework.html?courseId=247&homeworkId=4404
no match!
get page: http://www.scholat.com/course/S_oneHomework.html?courseId=512&homeworkId=9269
no match!
get page: http://www.scholat.com/course/S_oneHomework.html?courseId=512&homeworkId=8344
no match!
get page: http://www.scholat.com/course/S_oneHomework.html?courseId=512&homeworkId=8343
no match!
登陆信息正确!!!
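In other words, homeworkUrlList should contain many links, but after I run the commented-out part of the code it contains far fewer. I have thought about it at length and cannot work out why; I hope someone can point me in the right direction!
Below is the pipeline code: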
if (Pattern.matches(pattern3, resultItems.getRequest().getUrl())) {
    JSONArray jsonArray = resultItems.get("unfinish");
    if (jsonArray != null && jsonArray.size() > 0) {
        for (int i = 0; i < jsonArray.size(); i++) {
            Unfinish unfinish = new Unfinish();
            unfinish.setUserId(jsonArray.getJSONObject(i).getInt("userid"));
            unfinish.setCourseId(jsonArray.getJSONObject(i).getInt("courseid"));
            unfinish.setHomeworkId(jsonArray.getJSONObject(i).getInt("homeworkid"));
            // Prints "inserting into the unsubmitted-homework table"
            System.out.println("插入未提交作业表");
            unfinishDao.add(unfinish);
        }
    }
}
Could the pipeline code be what is affecting it? On the surface, though, the pipeline code only reads the data and stores it.
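
One thing that might be worth ruling out, purely as a guess: if the XPath for `finish` matches nothing, `toString()` may return null, and `finish.equals("未提交")` would then throw before the `println` and `addTargetRequests` lines are ever reached, so the affected pages would print no count at all. The sketch below is plain Java with no webmagic dependency; `UnfinishedFilter`, `StatusLookup`, and the third URL are hypothetical names and data invented for illustration. It shows a null-safe comparison plus per-link error logging, so a single bad link cannot abort the whole loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UnfinishedFilter {

    // Hypothetical stand-in for the page.getHtml().xpath(...).toString() lookup,
    // which is assumed here to return null when nothing matches.
    interface StatusLookup {
        String statusOf(int homeworkId);
    }

    // Same parsing as in the question: take everything after the last '='.
    static int parseHomeworkId(String url) {
        return Integer.parseInt(url.substring(url.lastIndexOf('=') + 1));
    }

    static List<Integer> unfinishedIds(List<String> urls, StatusLookup lookup) {
        List<Integer> result = new ArrayList<>();
        for (String url : urls) {
            try {
                int id = parseHomeworkId(url);
                String finish = lookup.statusOf(id);
                // Constant-first equals never throws when finish is null.
                if ("未提交".equals(finish)) {
                    result.add(id);
                }
            } catch (RuntimeException e) {
                // Log and continue instead of letting one link abort the loop.
                System.err.println("skipping " + url + ": " + e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://www.scholat.com/course/S_oneHomework.html?courseId=512&homeworkId=9269",
            "http://www.scholat.com/course/S_oneHomework.html?courseId=512&homeworkId=8344",
            // Hypothetical malformed link, to exercise the catch branch.
            "http://www.scholat.com/course/S_oneHomework.html?courseId=512&homeworkId=abc");
        // Only 9269 is reported as "not submitted"; the others yield null.
        List<Integer> ids = unfinishedIds(urls, id -> id == 9269 ? "未提交" : null);
        System.out.println(urls.size() + " links, unfinished: " + ids);
    }
}
```

If the counts come back once the loop is guarded like this, the list itself was never shrinking; the pages were most likely failing partway through processing. Checking the crawler's error log for stack traces would confirm or rule this out.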