Monday, 7 February 2011

Progress February 2011

Over the last few months I've tried a number of novel approaches to try and make sense of the data that the crawler and semantifier return.

Many of these approaches (based on the Neo4J work) still have some potential ( I think ), but I wasn't able to prove it.

Recently, if only in an effort to see what's there, I went back to basics and do the simplest thing possible (almost). The Home Page now looks like this (see below)... a simple tag cloud showing the most frequent concepts and images of the people who's pages were found to contain squarish, largeish images. There is also a type-ahead scrolling search box. Showing people what is in the database, before they search is a really useful means of helping people not to search for things that aren't there.

I decided to attempt to return only related people and then overlapping concepts (for any given concept). This would mean that a search for plankton might return ...

Plankton -> (Peter Daines, Andy Bird, Rob Raiswell, Michael Krom etc) which would then subsequently return the connected concepts of (Fisheries, Aquatic ecology, Climate change, Environment etc).

The result of this really simple approach looks like this...

...and bizarrely, because the concepts are interlinked, the visualisation sorts itself out. That is to say, previously I was attempting to fingerprint the connections between concepts and then sort them, but found that the visualisation ( trying to find the best layout for interconnected concepts) actually does the same job, but in 2D space rather than in a matrix.

A rough beta of this work is online here . Be gentle with it, it's not finished. I now want to see if I can create some embed code that might be usefully added to another site - showing a mini collection of related concepts and URLs.

Friday, 6 August 2010

Neo4J + Python Crawler + Open Calais + Gephi

So, today after having spent the morning getting our LDAP server to talk to our instance (it worked!) I thought I'd have a bash at this. The plan was/is to create a crawler that gets a web page, finds the pages that it links to (including Word or PDF files),  pumps any data found at the Open Calais API and saves the semantic Entities returned into a Neo4J database then go and look at it with the Gephi visualising tool ( see previous posts ).

This is all kinda new to me, I have no idea if what I'm doing makes sense, but I hope that once I can "look" at data, I'll be able to figure out a way of pruning it into something usable.

The (ropey) code is here and the visualisation is shown above. I have no idea if this will be "traversable" yet, but it kind of proves to me that it's doable. Ideally I want to crawl a given pile of pages into one big soup and then get from page B to page X and "discover" the shortest route between them...

How hard can it be :-)

5    import urllib2, urllib,  traceback, re, urlparse, socket, sys, random, time 
6    from pprint import pprint 
7    import codecs, sys 
8    from calais import Calais      # 
9    from scraper import fetch     # 
11   streamWriter = codecs.lookup('utf-8')[-1] 
12   sys.stdout = streamWriter(sys.stdout) 
15   calaisapi = Calais(API_KEY, submitter="python-calais demo") 
17   def analyze(url=None): 
18       result = calaisapi.analyze_url( url ) 
19       return result 
21   def neo_result(result, url): 
22       import neo4j 
24       db = neo4j.GraphDatabase( "simple_neo_calais_test" ) 
26       result.print_summary( ) 
28       with db.transaction: 
29           # Create the page index 
30           pages = db.index("Pages", create=True) 
31           page_node = pages[url] # does this page exist yet? 
32           if not page_node: 
33               page_node = db.node(url=url) # create a page 
34               pages[ url ] = page_node # Add to index 
35               print "Created:" , url 
36           else: 
37               print "Exists already:" , url 
39           print len(result.entities), "Calais Entities" 
40           for e in result.entities: 
41               print result.doc['info']['externalID']    #URL 
42               entity_type =  e['_type'] 
43               entity_value = e['name'] 
44               relevance = e['relevance'] 
45               instances = e['instances'] 
47               print entity_type, entity_value, relevance, instances #instances is a list of contexts 
49               #Create an entity 
50               entity = db.node(value=entity_value, relevance=relevance ) 
51               entity_type = db.node( name= entity_type ) 
52               entity_type.is_a( entity )    # e.g Amazon is_a Company 
53               page_node.has( entity_type ) 
55       db.shutdown() 
60   def print_result(result): 
61       'Custom code to just show certain bits of the result obj' 
62       result.print_summary( ) 
64       print "Entities" 
65       for e in result.entities: 
66           print e['_type'], ":", (e['name']) 
67           print e.keys() 
68           print 
70       print "Topics" 
71       for t in result.topics: 
72           print t['category'] 
73           #print t 
75       print 
76       print "Relations" 
77       print result.print_relations( ) 
81   suffixes_to_avoid=[ 'css','js','zip', ] 
82   def get_links( data, url='',suffixes_to_avoid=suffixes_to_avoid ): 
83       'I know I should use BeautifulSoup or lxml' 
84       links = [ ] 
85       icos = [ ] 
86       feeds = [ ] 
87       images = [ ] 
88       try: 
89           found_links = re.findall(' href="?([^\s^"\'#]+)', data)# think this strips off anchors 
91           for link in found_links: 
92               # fix up relative links 
93               link = urlparse.urljoin(url, link) 
94               link = link.replace("/..", "") # fix relative links 
96               #check to see if path is just "/"... for example, 
97               path = urlparse.urlsplit(link)[2] 
98               if path  == "/": 
99                   link = link[:-1]#take off the trailing slash (or should we put it on?) 
101              #just in case fixups 
102              link = link.replace("'", "") 
104              if link not in links and link[:7] == 'http://' : #avoid mailto:, https:// 
105                  if "." in path: 
106                      suffix_found = path.split(".")[-1] 
107                      print suffix_found, link 
108                      if suffix_found  in suffixes_to_avoid: 
109                          pass 
110                      else: 
111                          if suffix_found == 'ico': 
112                              icos.append( link ) 
113                          elif suffix_found == 'rss' or suffix_found=='xml': 
114                              feeds.append ( link ) 
115                          elif suffix_found in ['gif', 'png', 'jpg', 'jpeg']: 
116                              images.append( link ) 
117                          else: 
118                              links.append( link ) 
119                  else: 
120                      links.append( link ) 
123      except Exception, err: 
124          print err 
126      return links, icos, images, feeds 
128  def get_images(content, url): 
129      #ico 
130      #feeds 
131      #images 
132      #text 
133      found_images = re.findall(' src="?([^\s^"\'#]+)', content) 
134      images = [ ] 
137      for image in found_images: 
138          image = urlparse.urljoin(url, image) 
139          image = image.replace("/..", "") 
140          path = urlparse.urlsplit(image)[2] 
142          if path[-3:] == ".js": 
143              pass 
144          else: 
145               images.append( image ) 
146      return images 
148  def get_icos(content, url): 
149      #href="/static/img/favicon.ico" 
150      #image/x-icon  
151      found_images = re.findall(' href="?([^\s^"\'#]+\.ico)', content) 
152      images = [ ] 
155      for image in found_images: 
156          image = urlparse.urljoin(url, image) 
157          image = image.replace("/..", "") 
158          path = urlparse.urlsplit(image)[2] 
160          if path[-3:] == ".js": 
161              pass 
162          else: 
163               images.append( image ) 
165      print images 
166      if len(images)> 0: 
167          print len(images), "icos" 
168          image = images[ 0 ] #just get the first 
169      else: 
170          #try a default... you never know... might be aliased anyway... 
171          scheme, domain, path, query, x = urlparse.urlsplit( url ) 
172          image = '%s://%s/favicon.ico' % (scheme, domain) 
174      return image 
177  mimes_u_like = [ 'text/html', 'application/pdf', ] 
178  def get( url, mimes_u_like=mimes_u_like, fetch_links=False ): 
179      url, status, message, headers, content  = fetch(url) 
180      mime = headers['content-type'].split(";")[0].strip() 
182      page = {'url':url, 
183              'mime':mime, 
184              'status':status, 
185              'message':message, 
186              'headers':headers, 
187              'content':content, 
188              'links':[], 
189              'images':[], 
190              'feeds': [] } 
193      ico =  get_icos( content, url ) 
194      page['ico'] = ico 
196      images_found =  get_images( content, url ) 
197      page['images'] = images_found 
199      if fetch_links == False: 
200          return page     # this is wrong... I want to discover the the links without necessarily getting their data 
201      else: 
202          links_found, icos, images, feeds = get_links( content, url ) 
203          print url 
204          print len( links_found ), "links found" 
205          links = [ ] 
206          print len( icos ), "icos" 
207          print len( images ), "images" 
208          print images 
209          print len( feeds ), "feeds" 
212          for link in links_found: 
213              url, status, message, headers, content  = fetch(link) 
214              mime = headers['content-type'].split(";")[0].strip() 
215              if mime in mimes_u_like: 
216                  print "\t", status, mime, link 
217                  if status == 200: 
218                      links.append( {'url':url, 
219                                     'status':status, 
220                                     'message':message, 
221                                     'headers':headers, 
222                                     'mime':mime, 
223                                     'content':content} 
224                                    ) 
226          print len(links) , "Used" 
227          page['links'] = links 
228          return page 
230  if __name__ == '__main__': 
231      url = '' 
232      print "Getting:", url 
233      page = get( url, fetch_links=True ) 
235      result = analyze( url ) 
236      neo_result ( result, url ) 
238      print "Getting %s links.." % str(len(page['links'])) 
239      for link_dict in page['links']: 
240          print "Getting link:", link_dict['url'] 
241          try: 
242              result = analyze( link_dict['url'] ) 
243              neo_result ( result, link_dict['url'] ) 
244          except Exception, err: 
245              print    link_dict['url'], err 
247      print "Done!" 

Wednesday, 4 August 2010

Django and Neo4J

So, today I was tempted by the approach taken in these slides about Neo4J and Django. My hope was/is that whilst the Neo4J implementation isn't complete, because I imagine many of the concepts simply won't easily map from a graph database to a SQL one, at least I might be able to work in a more familiar way in order to be able to learn about neo4j.

I started by grabbing the files from the model folder from here.... adding...

NEO4J_RESOURCE_URI = '/Users/tomsmith/neo4django_db' my file and then adding this code to a file...

from neo4j.model import django_model as models

class Actor(models.NodeModel):
    name= models.Property(indexed=True)
    href = property(lambda self:('/actor/%s/' %

    def __unicode__(self):

class Movie(models.NodeModel):
    title = models.Property(indexed=True)
    year = models.Property()
    href = property(lambda self:('/movies/%s/' %

    actors = models.Relationship(Actor,type=models.Outgoing.ACTS_IN,related_name="acts_in",)

    def title_length(self):
return len(self.title)

    def __unicode__(self):
return self.title

... and then from within python shell I could...

>> import stuff.models as m

>> movie = m.Movie(title="Fear and Loathing in Las Vegas",year=1998)

>>>>> movie = m.Movie(title="The Brothers Grimm",year=2005) 
>>> movie = m.Movie(title="Twelve Monkeys",year=1995) 
>>> movie = m.Movie(title="Memento",year=2000) 

... now I can...

>> movies = m.Movie.objects.all()
>>> for movie in movies:print, movie, movie.href
7 Memento /movies/7/
6 Twelve Monkeys /movies/6/
5 The Brothers Grimm /movies/5/
4 Fear and Loathing in Las Vegas /movies/4/
3 The Book of Eli /movies/3/
1 10,000 years B.C. /movies/1/

... which is fantastic! Don't you think?! The django_model is handling the db.transaction stuff. I can even do a ...

>> reload( m ) 

... and it doesn't break (because the database is already connected) as my earlier code did. (I think I can even run my django instance AND be connected to the neo4j database at the same time.. phew!).

 What I think is fantastic about it is that using this approach I get to work with objects in a familiar way. And by that, I mean, having a schema-less database is all well and good, but almost the first thing I seem to find myself doing is creating similar types of objects. So being able to create classes (maybe even with subclasses) seems perfect for my needs ... in theory. In theory, I'd like to be able to define classes with properties but then on-the-fly maybe add extra properties to objects (without changing my class definition).

AND I can add methods to my classes... or can I ?  I added a simple function to my Movie class...

     def title_length(self):
return len(self.title)

...and ran... 

>>> movies = m.Movie.objects.all()
>>> for movie in movies:print, movie, movie.href, movie.title_len()

...and got...

AttributeError: 'Movie' object has no attribute 'title_len'

... Boo! At this point after doing looking at the Movie objects, I found ...

>>> dir( m.Movie.objects )
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slotnames__', '__str__', '__subclasshook__', '__weakref__', '_copy_to_model', '_inherited', '_insert', '_set_creation_counter', '_update', 'aggregate', 'all', 'annotate', 'complex_filter', 'contribute_to_class', 'count', 'create', 'creation_counter', 'dates', 'defer', 'distinct', 'exclude', 'extra', 'filter', 'get', 'get_empty_query_set', 'get_or_create', 'get_query_set', 'in_bulk', 'iterator', 'latest', 'model', 'none', 'only', 'order_by', 'reverse', 'select_related', 'update', 'values', 'values_list'] I did a ...

>>> movies = m.Movie.objects.all().order_by('year')

...and got ...

Traceback (most recent call last):
  File "", line 1, in
AttributeError: 'NodeQuerySet' object has no attribute 'order_by'

...and the same is true for .get_or_create() or .values() or many other methods.

... so (and this isn't a criticism) it seems that the Django model is a "work in progress" (the cat ate his source code) with some of the methods still to be completed. At this point I should probably start looking at the django_model source code and begin adding to the functionality it offers... except ...

  1. I probably don't have the ability to create a database adapter, especially for one I don't understand
  2. I don't know if shoe-horning neo4j into Django is a good idea. I'm not saying it isn't ( so far, it's tantalising ) I'm just saying I don't know.
  3. When I find errors I'm not sure if it's me or not...
...for example, I tried...

>>> actor = m.Actor(name="Joseph Melito")
>>> actor = m.Actor(name="Bruce Willis")
>>> actor = m.Actor(name="Jon Seda")

...and all was fine and dandy and then I tried ...

>>> movie = m.Movie.objects.get(title="Twelve Monkeys")
>>> movie.actors.add ( actor )

... and got...

Traceback (most recent call last):
  File "", line 1, in
  File "/opt/local/lib/python2.5/site-packages/", line 623, in __get__
    return self._get_relationship(obj, self.__state_for(obj))
  File "/opt/local/lib/python2.5/site-packages/", line 708, in _get_relationship
    states[] = state = RelationshipInstance(self, obj)
  File "/opt/local/lib/python2.5/site-packages/", line 750, in __init__
    self.__removed = pyneo.python.set() # contains relationships
AttributeError: 'module' object has no attribute 'set'

... I wonder if Tobias has done more work on the Relationships class at all?

Tuesday, 3 August 2010

Visualisation of a Neo4J database with Gephi

1.  Get the version of Gephi app that can read neo4j databases (not the main one)
bzr branch

2. Get Netbeans, open the project you've just downloaded and build it as a Mac application (you'll find it in a folder called dist). As it happens, Netbean is quite a lovely python IDE, once you've added a few plugins. I added a python plugin, a regular expression one and found I could pretty much use the python debugger out of the box (more than I can say for Eclipse) and watch variables etc. Very nice.

3. Run the code from yesterday and open the DB that gets created with Gephi... alter the layouts and you get something like this...

... which was a small crawl of a small site (showing IDs).

I have no absolutely no idea what this all means but isn't it lovely to look at? I think I'm onto something.

Working Neo4J / Python code

So, this sort of works. The team at neo4j have made the iteration work for me... Usage:


... and what it does it go to that web site and grabs a few links, following them, adding the pages and their data to a neo4j graph database.

import neo4j #See
import urllib2, traceback, re, urlparse, socket, sys, random, time

socket.setdefaulttimeout(4) #seconds

'''This is my attempt to begin to make proper classes and functions to talk to the neo4j database, it's not meant to be fancy or clever, I just want to be able to CRUD ok-ish'''

 db = neo4j.GraphDatabase("crawler_example3_db")
 with db.transaction:
  #How do we delete an index... which might have old nodes in? This doesn't work...any ideas... how do I empty an index, should I? Does deleting a node delete its refererence from the index
   for ref in db.index('pages'):
  except Exception, err:
   print err
  pages = db.index("pages", create=True) # create an index called 'pages'
  #print "Index created"
 #return db, pages
except Exception, err:
 print err
 #print "db:", db

#############     UTILITY FUNCTIONS         #############

def get_links( data, url=''):
 'I know I should use BeautifulSoup or lxml, but for simplicity it's a regex..ha'
 links = [ ]
  found_links = re.findall(' href="?([^\s^"\'#]+)', data)
  for link in found_links:
   #print link
   link = urlparse.urljoin(url, link)
   link = link.replace("/..", "") #fix relative links
   link = link.replace("'", "")
   if link not in links and link[:7] == 'http://' :#avoid mailtos
    links.append( link )
 except Exception, err:
  print err
 return links
def wget(url):
  handle = urllib2.urlopen(url)
  if 'text/html' in handle.headers['content-type']:
   data = unicode(,  errors='ignore')
   return data
 except Exception, e:
  print "Error wgetting", e, url
  #print traceback.print_exc()
 return None

def delete_all_pages():
 with db.transaction:
   for node in db.node:
    id  =
    node.delete( )
    print id, "deleted"
  except Exception, err:
   print err
   pass # fails on last iteration?
 print "All pages deleted!"

def create_page( url, code=200 , follow=True):
 '''this creates a page node then adds it to the "pages" index, if it's there already it'll get it'''
 # Get the actual page data from the web...
 data = wget( url ) #get the actual html
 if not data:
  return None
  print "Boo!", url
  data_len = len(data)
  page_node = pages[url] # does this page exist yet?
  if not page_node:
   page_node = db.node(url=url, code=code, data=str(data), data_len=data_len) # create a page
   pages[ url ] = page_node # Add to index
   print "Created:" , url
   print "Exists already:" , url
  #Now create a page for every link in the page
  if follow == True:
   print "\tfollowing links"
   i = 0
   links = get_links(data, url)
   print len( links ) , "found"
   if len(links) > 20:
    links = links[:19]
   for link in links:
    print "\tcreating:",  link
    create_page( link, follow=False)
  return page_node

def get_one(url):
 'given a url with return a node obj'
 with db.transaction:
   node = pages [ url ]
   if node == None:
    print "Node is none! Creating..."
    create_page( url, code=200 , follow=False)
   return node
  except Exception, err:
   print err
 return None

def list_all_pages( ):
  # Just iterate through the pages to make sure the data in in there...
  print "Listing all pages..."
 with db.transaction: 
  for node in db.node:
    print node
    #print node['data'][:150], "..."
   except Exception, err:
    print err
 print "...done listing!"

def delete_one( url ):
 with db.transaction:
   node = pages[ url ]
   if node:
    node.delete( )
    print "Node with id=",, "deleted"
    #delete from index
    del pages[url]
  except Exception, err:
   print "Node probably not found:", err
   print dir(node) # let's have a look

def find_links_between_pages( ):
 #'Just iterate through the pages to make sure the data in in there...'
  print "Linking all pages..."
 with db.transaction:
   for node in db.node:
     print str(node), node['url'], node['data_len']
     links = get_links( node['data'], node['url'])
     for link in links:
       print link
       #look to see if a node with that url exists, if it doesn't it's created...
       other_node = get_one( link )
       if not other_node:
        other_node.links_to( node )
      except Exception, err:
       print err
    except Exception, err:
     print err
  except Exception, err:
   pass # fails on last iteration?
   print err

class Backlink(neo4j.Traversal):
 types = [neo4j.Outgoing.links_to]
 order = neo4j.DEPTH_FIRST
 stop = neo4j.STOP_AT_END_OF_GRAPH
 def isReturnable(self, pos):
  if pos.is_start: return False
  else:return pos.last_relationship.type == 'links_to'

if __name__ == '__main__':

 #### LET'S GET STARTED ######
  url = sys.argv[1]
  print "No url passed, using ''"
  url = ''
 print "Starting:", url
 ######## DELETE ALL THE PAGES WE HAVE SO FAR ############
 # Avoiding this because of ...
 #jpype._jexception.RuntimeExceptionPyRaisable: org.neo4j.graphdb.NotFoundException: Node[0] 
 #... errors later...
 #print "Deleting existing pages..."
 #delete_all_pages( ) #we may want to add new data to each page... forget this for now
 ########        ADD SOME ACTUAL PAGES        #################
 #print "Creating some pages "
 with db.transaction:
  create_page(  url, 200, follow=True ) #Also fetches some pages linked from this page
 ########    NOW GET SOME NODES OUT     ############### 
 print "Has our data made it to the database? Let's see..."
 # Do some fishing for nodes...
 with db.transaction:
   node = get_one( url )
   print '\tget one:' + str( node ) , "... oh yes!"
   print "\tid:", , "url:", node['url'], "data-length:", node['data_len']
   #This should fail
   #print "This SHOULD fail.."
   #node = get_one( '' )
   #print 'get one:' + str( node ) 
   #print "id:", , "url:", node['url'], "data length:", node['data_len']
  except Exception, err:
   print "Probably not a node with that URL"
   print err 
 #########  TRY TO ITERATE ALL PAGES FETCHED ################
 list_all_pages(  )
 #########    TRY TO DELETE ONE ################
 #delete_one( url ) 
 # Now let's see if it has gone
 #list_all_pages( ) #or maybe later
 print "Doing linking pages..."
 find_links_between_pages( ) #goes to every page, looking for other pages in the database
 print "shutting down, saving all the changes, etc"