Scraping a Website with Racket

  • Personal History with Early Blogging Tools: From 2002 to 2009, I kept a personal blog with Radio Userland, which was built on Userland Frontier. Frontier was a capable system with its own scripting language and object database. Radio Userland was aimed at early bloggers and offered local editing and free hosting. I posted on 592 days over those seven years, mostly in a linkblog style. I had forgotten about the blog until recently, when I found it at http://radio-weblogs.com.
  • Web Scraping with Racket: I decided to scrape the site with Racket, partly because Rash makes it quick to call command-line programs.
  • Walking Backwards in Time: I built a custom functional iterator with the Gregor library that walks back through the days one at a time and calls a provided function on each date.
  • Parsing Content: Racket's http-client returns an HTML response as S-expressions by default. This behavior can be turned off with (current-http-client/response-auto #f). I used the sxpath library to find the HTML element that holds the blog posts, and xexp->html to translate the S-expressions back into HTML.
  • Putting It Together: I combined several Racket libraries (gregor, http-client, sxml/sxpath, and html-writing) to iterate through the dates, scrape the blog content, and process and write out the HTML. It was a fun challenge, and the Racket ecosystem made the complicated parts easier.
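
The date-walking step described above can be sketched roughly as follows. This is a minimal illustration, not the code from the original post: the helper names walk-back-days and day->path are hypothetical, and the one-page-per-day YYYY/MM/DD.html path layout is an assumption about how the archive is organized. Only Gregor's date, -days, date<? and ~t are used.

```racket
#lang racket/base
;; Sketch only: walk backwards one day at a time and call a
;; caller-supplied function on each date.
(require gregor)

;; walk-back-days is a hypothetical helper, not taken from the post.
;; Calls (proc day) for every day from `start` back to `stop`
;; (inclusive) and returns the results as a list, newest day first.
(define (walk-back-days start stop proc)
  (let loop ([day start] [acc '()])
    (if (date<? day stop)
        (reverse acc)
        (loop (-days day 1) (cons (proc day) acc)))))

;; Assumed layout: one archive page per day at YYYY/MM/DD.html.
(define (day->path day)
  (string-append (~t day "yyyy/MM/dd") ".html"))

;; Example: build the archive paths for four days, crossing a year boundary.
(define sample-paths
  (walk-back-days (date 2009 1 2) (date 2008 12 30) day->path))
```

In a real scraper, the function passed in would fetch and parse each day's page instead of only building a path. Taking the per-day action as an argument keeps the date arithmetic separate from the network and parsing code.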