Monday 22 March 2010

Review of Python Crawling Tools

http://www.ohloh.net/p/WenChuan

Web crawlers are funny things. I have had a little experience of working with web crawlers in the past, with mixed results. Many were too difficult to configure or simply not robust enough. At this stage of the project I would ideally like a crawler that is easy to adapt and yet trustworthy enough to set running and leave alone.

Since I don't want to crawl very large portions of the internet, just small corners of it, I will be happy with something simple. Also worth trialling are alternatives to "doing it yourself", such as Yahoo Pipes, the 80legs.com crawling service and the Yahoo BOSS search engine builder. Whenever possible I will consider using these services, simply because of the time needed to crawl for data.

My preferred language is Python, but I wonder if there is a more web-centric language to consider? I vaguely remember Rebol having URLs as base types. Maybe Rebol is worth exploring.

I started by looking at a list of Python crawlers on Ohloh (http://www.ohloh.net/p?sort=users&q=python+crawler) and tested the following....

Testing Notes...


Ruya (Not bad; has before_crawl and after_crawl functions that you override; a sketch of that hook pattern follows)
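
I don't have Ruya's actual class names to hand, so this is only a hypothetical, self-contained illustration of that hook pattern (HookCrawler and everything in it are stand-ins, not Ruya's API; Python 2, like everything else here): a base crawler calls the hooks around each fetch, and you subclass it and override them.

import urllib2

class HookCrawler(object):
    """Toy stand-in for a crawler base class with overridable hooks."""

    def crawl(self, urls):
        for url in urls:
            self.before_crawl(url)
            body = urllib2.urlopen(url).read()
            self.after_crawl(url, body)

    def before_crawl(self, url):
        pass    # override in a subclass, e.g. to filter or log URLs

    def after_crawl(self, url, body):
        pass    # override in a subclass, e.g. to parse or store the page

class MyCrawler(HookCrawler):
    def before_crawl(self, url):
        print("fetching %s" % url)

    def after_crawl(self, url, body):
        print("got %d bytes from %s" % (len(body), url))

MyCrawler().crawl(["http://www.example.com/"])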

Mechanize (a kind of browser simulation, including form filling)


from mechanize import Browser

br = Browser()
br.open("http://www.example.com/")
# follow the second link (nr=1, zero-based) whose text matches this regular expression
response1 = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
assert br.viewing_html()
print br.title()
print response1.geturl()


Crawl-e (Not bad)
The CRAWL-E developers are very familiar with how TCP and HTTP work and, using that knowledge, have written a web crawler intended to maximize TCP throughput. The benefit is realized when crawling web servers that use persistent HTTP connections, as numerous requests are made over a single TCP connection, increasing throughput.
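
This is not CRAWL-E's API, just a bare Python 2 standard-library illustration of that persistent-connection point: with HTTP/1.1 keep-alive, several requests go over one TCP connection instead of reconnecting for each page (www.example.com and the paths are placeholders).

import httplib

conn = httplib.HTTPConnection("www.example.com")
for path in ["/", "/about", "/contact"]:
    conn.request("GET", path)
    response = conn.getresponse()
    body = response.read()   # read fully before reusing the connection
    print("%s -> %d (%d bytes)" % (path, response.status, len(body)))
conn.close()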


Squzer - missing in action

SuperCrawler - not bad but the code looks a bit terse.

Fseek
Fseek is a Python-based web crawler. The user interface is implemented using Django, the back-end uses pyCurl to fetch pages, and Pyro is used for IPC.
It says it's Django but it's not. I liked the idea of it being "presentation ready"... that could be handy. (A minimal pyCurl fetch sketch follows these notes.)
     cannot create writable FSEEK_DATA_DIR /var/fseek.
     ImportError: No module named Pyro.naming
     failed to load external entity "/var/fseek/solr/etc/jetty.xml"
     Needed Pyro
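
This is not Fseek itself, just a minimal sketch of fetching a single page with pycurl, the library its back-end reportedly uses (the URL is a placeholder; Python 2 again).

import pycurl
from StringIO import StringIO

buf = StringIO()
curl = pycurl.Curl()
curl.setopt(pycurl.URL, "http://www.example.com/")
curl.setopt(pycurl.WRITEFUNCTION, buf.write)   # collect the body in memory
curl.setopt(pycurl.FOLLOWLOCATION, 1)          # follow redirects
curl.perform()
print("status %d, %d bytes" % (curl.getinfo(pycurl.HTTP_CODE), len(buf.getvalue())))
curl.close()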

WebChuan (XPath-y)

WebChuan is a set of open source libraries and tools for fetching and parsing the web pages of a website. It is written in Python, based on Twisted and lxml.
It is inspired by GStreamer. WebChuan is designed to be the back-end of a web bot; it is easy to use, powerful, flexible, reusable and efficient.
     There was an error in the setup.py script; removing the description made it work. (A short lxml/XPath reminder is sketched below.)
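
Not WebChuan's own API, just a reminder of what the lxml/XPath side of this looks like: parse a fetched page and pull values out with XPath expressions.

import lxml.html

html = "<html><head><title>Example</title></head>" \
       "<body><a href='/a'>A</a> <a href='/b'>B</a></body></html>"
doc = lxml.html.fromstring(html)

print(doc.xpath("//title/text()"))   # ['Example']
print(doc.xpath("//a/@href"))        # ['/a', '/b']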


Jazz crawler
    Broke on Unicode data (added any2ascii wrappers and that solved it -- a hack)
    Then borked on draw_graph (probably too big a graph)
    Interesting because it builds a graph (and calculates PageRank) of the data gathered, enabling a simple out-of-the-box visualisation... see the sketch below.
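
Not Jazz crawler's own code: a minimal sketch of the same idea using networkx (assumed to be installed), building a directed link graph from hypothetical crawl output and computing PageRank over it.

import networkx as nx

# hypothetical (from_url, to_url) pairs harvested during a crawl
links = [
    ("http://example.com/", "http://example.com/about"),
    ("http://example.com/", "http://example.com/blog"),
    ("http://example.com/blog", "http://example.com/"),
]

graph = nx.DiGraph()
graph.add_edges_from(links)

# PageRank score per crawled URL, highest first
for url, score in sorted(nx.pagerank(graph).items(), key=lambda p: -p[1]):
    print("%.4f %s" % (score, url))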



Harvestman
    I have used this before and found it "too powerful"; I never did work out how to get it not to save crawled files to disk. I was unhappy with the config.xml approach to running the crawler too. If Scrapy proves unsuccessful I will return to this, because it's a great product.

ExternalSiteCatalog - an integration of Harvestman with Plone - may be useful later.

AWCrawler - a Python spider that saves data to Amazon S3, etc.

Scrapy (XPath and pipelines)
  
Scrapy is new (to me) and looks easy to modify. I will experiment with this, though I will need to learn some XPath to begin with.
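
As a starting point, here is a rough sketch of a Scrapy spider adapted from its tutorial; module paths and attribute names shifted between early Scrapy releases, so treat this as an approximation rather than copy-ready code (example.com and the XPath expressions are placeholders).

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class ExampleSpider(BaseSpider):
    name = "example"               # some early releases use domain_name instead
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # pull the page title and the outgoing links with XPath
        title = hxs.select("//title/text()").extract()
        links = hxs.select("//a/@href").extract()
        self.log("title=%r, %d links" % (title, len(links)))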
