My particular problem is this. I want a crawler that is easy to use; something like this would be nice...
from crawler import Crawler

class myCrawler(Crawler):
    def handle(self, url, html):
        # do_something_here
        pass

c = myCrawler(url="http://wherever.com")
c.crawl_type = "nice"
c.run()
... and also, I'd like it to be well designed enough not to run out of memory, to be well documented, to know about the strange unicode formats that live out there on the web, not to open too many sockets, to do the right thing when a URL times out... and hey, to be quick it probably should be threaded to boot.
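A minimal sketch of how three items on that wish list (odd encodings, timeouts and a cap on open sockets) might be handled with the standard library alone. It is written for Python 3, and the names decode_body, fetch and fetch_all are made up for illustration; they do not come from any of the crawlers mentioned here. Memory use is not addressed.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def decode_body(raw, declared_charset=None):
    """Decode bytes, trying the declared charset first, then common fallbacks."""
    for encoding in (declared_charset, "utf-8", "cp1252"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset: try the next candidate
    # latin-1 maps every byte to a character, so this last resort cannot fail
    return raw.decode("latin-1")


def fetch(url, timeout=10):
    """Fetch one URL; return (url, text), or (url, None) on errors and timeouts."""
    try:
        with urlopen(url, timeout=timeout) as response:
            charset = response.headers.get_content_charset()
            return url, decode_body(response.read(), charset)
    except OSError:  # URLError, HTTPError and socket timeouts are OSError subclasses
        return url, None


def fetch_all(urls, fetcher=fetch, max_workers=4):
    """Fetch URLs in threads, with at most max_workers requests in flight."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fetcher, urls))
```

The thread pool size doubles as the socket limit, since each worker holds at most one connection at a time. Passing a different fetcher makes it possible to try the threading part without touching the network.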
After days (and days) of downloading and trying out the crawlers out there, a few have been worth noting.
...but I have also tried crawl-e, creepycrawler, curlspider, flyonthewall, fseek, hmcrawler, jazz-crawler, ruya, supercrawler, webchuan, webcrawling, yaspider and more! All of these are rubbish...
The problem is this. It IS possible to write a crawler in 20 or so lines of Python that will work... except it won't. It won't handle pages that redirect to themselves, it won't handle links that are ../../ relative, it won't be controllable in any way.
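To make those gaps concrete, here is a rough sketch (Python 3, standard library only) of the extra pieces a toy crawler needs: resolving relative links against the page URL, remembering what has already been queued, and a page limit as a crude control. The names LinkParser, absolute_links and crawl are hypothetical, and fetch is assumed to be any callable that returns HTML text or None.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for name, v in attrs if name == "href" and v)


def absolute_links(base_url, html):
    """Resolve every link against the page URL and drop #fragments."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.hrefs]


def crawl(start_url, fetch, handle, max_pages=100):
    """Breadth-first crawl of a single host; returns the set of URLs seen."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # fetch failed or was skipped
            continue
        pages += 1
        handle(url, html)
        for link in absolute_links(url, html):
            # Stay on the starting host and never queue a URL twice
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

With this, a link such as ../../index.html found on http://example.com/a/b/c.html resolves to http://example.com/index.html. The seen set stops a page that points back at itself from being queued again; what happens on an HTTP redirect is left to whatever fetch does.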
My problem with the two "best contenders" was this. Firstly, HarvestMan, no matter how I configure it, always saves the crawled pages to disk. I had this problem in 2006ish, and I still have it today. Anand (lovely guy), the developer, has rolled in many of my change requests since then, so the pseudo code above is almost a reality, but only almost. HarvestMan is still a front-runner, if only because of the thought that has gone into the configuration options.
Scrapy looked very cool too, but it seems more geared towards getting specific data from known pages, rather than wandering around the web willy-nilly. I would have to create my own crawler within Scrapy. I may come back to this, but in general, armed with Python and a few regular expressions, you can't half get a lot done.
So, in my attempt to get "round the loop" once, that is, to a. gather some data from a few sites (namely, the University of York's sites), b. manipulate it in some way (in this case, pump it at Open Calais and see what we get back... poor man's artificial intelligence) and then c. present it (maybe as a tag cloud, or something more fancy if I have time), I needed a crawler that was very simple to use.
So ...
a. Despite wanting to use an "off the shelf" crawler, I found a crawler that almost worked and hacked it until it worked. It's not threaded, it's not clever, but it does the job.
I had to do some hand pruning, looking at the mime-types of the pages returned and removing Betsie pages (of which there were thousands)... but I will try and roll that back into the crawler.
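Rolled into the crawler, that pruning could be a single check before a page is stored. The sketch below (Python 3) is only a guess at the rule: it assumes Betsie pages can be spotted by the word "betsie" in the URL, which may not hold for every site, and keep_page is a made-up name.

```python
def keep_page(url, content_type, skip_markers=("betsie",)):
    """Return True for HTML pages whose URL contains none of the skip markers."""
    # Strip parameters: "text/html; charset=utf-8" becomes "text/html"
    mime_type = (content_type or "").split(";")[0].strip().lower()
    if mime_type != "text/html":
        return False
    lowered = url.lower()
    # Assumption: unwanted pages are recognisable by a marker in the URL
    return not any(marker in lowered for marker in skip_markers)
```

Calling keep_page(url, content_type) with the Content-Type header of each response would then decide whether the page is worth saving at all.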
b. I found a Django application called django_calais. After adding the last line to my Page model, like this...
from datetime import datetime

from django.db import models

class Page(models.Model):
    url = models.URLField(unique=True, null=False, blank=False, db_index=True)
    title = models.CharField(max_length=300, null=True, blank=True)
    crawl_date = models.DateTimeField(default=datetime.now)
    html = models.TextField()
    type = models.CharField(max_length=100, null=True, blank=True)
    size = models.IntegerField(null=True, blank=True)
    calais_content_fields = [('title', 'text/txt'), ('url', 'text/html'), ('html', 'text/html')]
... I could then run....
def analyze():
    # Only pages stored with an HTML mime-type are sent for analysis
    pages = Page.objects.filter(type="text/html")
    for page in pages:
        print page.__unicode__(), page.url
        try:
            CalaisDocument.objects.analyze(page, fields=[('title', 'text/txt'), ('url', 'text/html'), ('html', 'text/html')])
        except Exception as err:
            # Report the failure and carry on with the next page
            print err
.... and have my Calais application populated with People, Organisations, Companies, Facilities etc., all of which are related back to my Page model.
I haven't got to the presentation stage yet, as the Calais analyzer is still running, BUT after getting quite anxious about spending too long looking for an adequate crawler, the semantic bit has already proven itself. So maybe tomorrow I will be able to present some data...
And then I'd better check it in to a repository or something. 

 
 

Tom:
Tom Tague from OpenCalais here. Please take a moment to let us know how the OpenCalais extraction goes. I'm certain it won't be perfect with this content set - but very curious to see how it does.
Regards,
Tague, thanks for getting in touch. The content set is weirder than weird (trust me).
So far, my initial reaction is that the results are WONDERFUL. I will need to look deeper into the results to see how I present them, but the fact that it accurately identifies people and concepts like "external examiner" or "Chemistry Clean Technology" is amazing.
The fact that it also happily returns "quantum-chemical calculations" is astonishing.