WebMagic: the collection of scraped links changes unexpectedly

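I can use WebMagic to scrape all the links on a page and store them in a list without any trouble. However, once I run an ordinary-looking piece of code, the number of links in that list changes. This seems very strange; can anyone explain it?
The code is as follows: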

List<String> homeworkUrlList = page.getHtml().xpath("//tr[@class='altrow']//td[1]").links().all();
            if(homeworkUrlList!=null && homeworkUrlList.size()>0){
//                JSONArray jsonArray = new JSONArray();
//                List<String> homeworkUrlList2 = homeworkUrlList;
//                int courseid = Integer.parseInt(pageUrl.substring(pageUrl.indexOf("=")+1));
//                for(String homeworkUrl: homeworkUrlList){
//                    int homeworkId = Integer.parseInt(homeworkUrl.substring(homeworkUrl.lastIndexOf("=")+1));
//                    tdid2 = tdid+String.valueOf(homeworkId);
//                    finish = page.getHtml().xpath("//td[@id='"+tdid2+"']/span/text()").toString();
//                    //System.out.println(finish);
//                    
//                    if(finish.equals("未提交")){ // "未提交" means "not submitted"
//                        JSONObject jsonobject = new JSONObject();
//                        jsonobject.put("userid", login.id);
//                        jsonobject.put("courseid",courseid );
//                        jsonobject.put("homeworkid", homeworkId);
//                        jsonArray.add(jsonobject);        
//                    }
//                }
//                
//                if(jsonArray != null && jsonArray.size()>0){
//                    page.putField("unfinish", jsonArray);
//                }
//                System.out.println("homework list:"+homeworkUrlList);
                System.out.println(homeworkUrlList.size());
                page.addTargetRequests(homeworkUrlList);
            }
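
One thing that may be worth ruling out (an assumption; nothing in the output confirms it): the commented-out block calls `Integer.parseInt` on the tail of each URL and `finish.equals(...)` on the result of an XPath lookup. If the tail is not numeric, or if the lookup matches nothing and `finish` is null, an exception would be thrown before `System.out.println(homeworkUrlList.size())` and `page.addTargetRequests(...)` are reached, so that page's count would never be printed. The sketch below shows defensive versions of both steps. `HomeworkParsing`, `parseTrailingId` and `isUnsubmitted` are hypothetical names introduced for illustration; they are not part of WebMagic.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration; not part of WebMagic.
class HomeworkParsing {

    /** Returns the integer after the last '=' in the URL, or -1 if it is missing or not numeric. */
    static int parseTrailingId(String url) {
        if (url == null) {
            return -1;
        }
        int eq = url.lastIndexOf('=');
        if (eq < 0 || eq == url.length() - 1) {
            return -1; // no '=' at all, or nothing after it
        }
        try {
            return Integer.parseInt(url.substring(eq + 1).trim());
        } catch (NumberFormatException e) {
            return -1; // tail is not a number
        }
    }

    /** Null-safe status check: a null status returns false instead of throwing. */
    static boolean isUnsubmitted(String finish) {
        return finish != null && "未提交".equals(finish.trim());
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://www.scholat.com/course/S_oneHomework.html?courseId=247&homeworkId=4404",
                // Assumed malformed link, used only to exercise the fallback path.
                "http://www.scholat.com/course/S_oneHomework.html?courseId=247&homeworkId=");
        for (String url : urls) {
            System.out.println(parseTrailingId(url)); // prints 4404, then -1
        }
        System.out.println(isUnsubmitted(null)); // prints false
    }
}
```

With helpers like these, a page whose status cell is missing would be skipped rather than aborting the whole method, which makes it easier to tell whether the list itself is really changing.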

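Taking a screenshot on the cloud server is inconvenient, so I have copied the output instead.
Below is part of the output before and after running the commented-out code: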

Before running the commented-out code:

10
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=559
15
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=568
11
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=282
15
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=708
15
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=201
15
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=132
15
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=164
11
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=444
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=166
1
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=247
15
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=206
7
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=1168
14
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=297
8
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=300
1
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=219
15
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=445
6
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=494

After running the commented-out code:

get page: http://www.scholat.com/course/S_homeworkList.html?courseId=166
1
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=247
unfinish:    [{"homeworkid":4404,"userid":1,"courseid":247}]
插入未提交作业表
3
get page: http://www.scholat.com/course/S_homeworkList.html?courseId=512
unfinish:    [{"homeworkid":9269,"userid":1,"courseid":512},{"homeworkid":8344,"userid":1,"courseid":512},{"homeworkid":8343,"userid":1,"courseid":512}]
插入未提交作业表
get page: http://www.scholat.com/course/S_oneHomework.html?courseId=247&homeworkId=4404
no match!
get page: http://www.scholat.com/course/S_oneHomework.html?courseId=512&homeworkId=9269
no match!
get page: http://www.scholat.com/course/S_oneHomework.html?courseId=512&homeworkId=8344
no match!
get page: http://www.scholat.com/course/S_oneHomework.html?courseId=512&homeworkId=8343
no match!
登陆信息正确!!!

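In other words, homeworkUrlList should contain many links, but after the commented-out part runs it contains fewer. I have thought about this at length and cannot work out why; I would appreciate any pointers.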
Below is the pipeline code:

if(Pattern.matches(pattern3, resultItems.getRequest().getUrl())){
            JSONArray jsonArray = resultItems.get("unfinish");
            if(jsonArray!=null && jsonArray.size()>0){
                for(int i=0;i<jsonArray.size();i++){
                    Unfinish unfinish = new Unfinish();
                    unfinish.setUserId(jsonArray.getJSONObject(i).getInt("userid"));
                    unfinish.setCourseId(jsonArray.getJSONObject(i).getInt("courseid"));
                    unfinish.setHomeworkId(jsonArray.getJSONObject(i).getInt("homeworkid"));
                    System.out.println("插入未提交作业表"); // prints "inserting into the unsubmitted-homework table"
                    unfinishDao.add(unfinish);
                }
            }
        }

Could the pipeline code be affecting it? On the surface, though, the pipeline only reads the data and stores it.
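
To check whether anything is really modifying the list, one option is to take a read-only snapshot before the loop and compare sizes afterwards. The sketch below is only a diagnostic idea, not a confirmed explanation. It also shows that an assignment like the commented-out `List<String> homeworkUrlList2 = homeworkUrlList;` creates a second reference to the same list, not a copy. `ListMutationCheck` and `frozenCopy` are hypothetical names, and the sample values "a", "b", "c" stand in for real links.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical diagnostic helper for illustration; not part of WebMagic.
class ListMutationCheck {

    /** Returns an independent, read-only copy: later changes to the source do not show up, and writes throw. */
    static List<String> frozenCopy(List<String> source) {
        return Collections.unmodifiableList(new ArrayList<>(source));
    }

    public static void main(String[] args) {
        List<String> links = new ArrayList<>(Arrays.asList("a", "b", "c"));
        List<String> alias = links;                // same object, like homeworkUrlList2
        List<String> snapshot = frozenCopy(links); // independent copy

        alias.remove(0);                           // mutating the alias also changes 'links'
        System.out.println(links.size());          // prints 2
        System.out.println(snapshot.size());       // prints 3

        try {
            snapshot.add("d");
        } catch (UnsupportedOperationException e) {
            System.out.println("write rejected");  // the snapshot cannot be modified
        }
    }
}
```

If the snapshot and the original still have the same size after the loop, the list was not changed, and the missing counts would more likely come from the method exiting early than from the pipeline.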
