Thursday 18 March 2010

Milestones

Milestone 1: People and Site

a. People. The first milestone is about assembling a team of people engaged with the as-yet non-existent PPPeople PPPowered technology. This team will provide the data, particularly their social media usage and profiles, on which the initial data mining can be built.

Goals: To have at least 20 people, across a number of teams, willing to work with me on the JISC project, providing hard data, usage and feedback.

b. Presentation. The data that is ultimately mined needs to be "hosted" in a software tool that provides a level of service valuable enough to be used frequently. This will ensure that meaningful relationships between people are presented and then subsequently pruned by participants. The choice of tool at this point is initially between WordPress, Drupal, Elgg and Liferay.

The conceptual framing of what this tool should be (a people directory, a people discovery tool, a profile repository or a personal news aggregator) is quite important with regard to setting expectations (for re-visiting the site).

Outcomes: 

  1. To have a site up and running, available to staff at the University of York, that at least displays their profiles in some way.
  2. A blog post reporting on the project so far, the plan, the tool chosen (with reasons), and invitations to help with the plan (which semantic repositories to use, and how; which crawling tools to use).
  3. A review of the approaches taken by other social networking sites, to gather any useful ideas. For example, are there benefits to be gained by leveraging an existing social network and evangelising its use in order to ease data gathering?

This phase is very dependent on people's time availability and willingness to engage, and also on the richness of the data returned from the survey. If nobody at all uses social media or completes the survey, we won't have a lot to begin with.


Milestone 2: Just Data and Display

The first stages of displaying mined data will avoid the complexities of semantic reasoning about the data, or of using unusual sources of data, and begin by simply asking people to complete a survey. This will hopefully result in a spreadsheet of sites, blogs and social media memberships (Twitter, LinkedIn, CiteULike etc.) and will form the basis for exploring slightly less explicit data (connections on LinkedIn, followers on Twitter, mentions on Twitter or on other people's blogs).
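As a rough illustration of what we would do with that spreadsheet, the sketch below reads an exported CSV into simple per-person records. The file name and column names are assumptions, not the final survey design.

    # A minimal sketch: read the survey spreadsheet (exported as CSV) into
    # a list of per-person records. Column names are hypothetical.
    import csv

    def load_profiles(path):
        """Return one dict per survey respondent."""
        with open(path, newline='') as f:
            return list(csv.DictReader(f))

    profiles = load_profiles('survey_results.csv')
    for person in profiles:
        print(person['name'], person.get('twitter', ''))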

To display this data, initially we will need to adapt the profile pages in the presentation tool, perhaps including some RSS aggregation.
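To give a feel for the aggregation side, something like the following could pull a member's latest posts into their profile. This is a sketch only, using the feedparser library; the feed URL is a placeholder standing in for addresses gathered by the survey.

    # A minimal sketch: pull recent items from a member's feeds with
    # feedparser, ready for display on their profile page.
    import feedparser

    def recent_items(feed_urls, limit=5):
        """Return (title, link) pairs for the newest entries in each feed."""
        items = []
        for url in feed_urls:
            parsed = feedparser.parse(url)
            for entry in parsed.entries[:limit]:
                items.append((entry.get('title', ''), entry.get('link', '')))
        return items

    for title, link in recent_items(['http://example.org/blog/feed']):
        print(title, link)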

Outcomes: 

  1. A survey asking people to identify their social media accounts and interests, in the form of keywords, URLs etc.
  2. A site with at least 20 members showing their "simple" social media membership and some of the data (recent tweets, friends etc). 
  3. The initial adaptation code (module/plugin) uploaded to the SVN site.
  4. A blog post showing developments and with comments from members.

This will require a good understanding of the plug-in architecture of the presentation tool. The whole point of this project is for its output to be integrated into a tool that is usable and used.


Milestone 3: More Deeply Mined Data

This stage looks to find information beyond what is explicitly given, perhaps crawling Google searches and finding more distant links between resources. Ideally, any crawling or querying tools should either be well integrated into the presentation tool or be componentized and talk to the presentation tool via XML-RPC or similar.
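To make the componentized option concrete, a standalone crawler could expose its results over XML-RPC using nothing beyond Python's standard library. This is a sketch only; the method name, port and return value are placeholders, not a design decision.

    # A minimal sketch: a crawler component serving results over XML-RPC.
    from xmlrpc.server import SimpleXMLRPCServer

    def links_for(person):
        """Return a (placeholder) list of URLs discovered for a person."""
        return ['http://example.org/found-by-crawler']

    server = SimpleXMLRPCServer(('localhost', 8001), allow_none=True)
    server.register_function(links_for, 'links_for')
    server.serve_forever()

    # The presentation tool (or a plugin within it) would then call:
    #   from xmlrpc.client import ServerProxy
    #   ServerProxy('http://localhost:8001').links_for('Jane Doe')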

We will experiment with (probably Python-based) web crawlers such as Domo, Harvestman, Mechanize and Scrapy; a rough Scrapy sketch follows the list below. In addition, we will look at the free or low-cost tools and services available. These may include...


  1. Yahoo Pipes - online data manipulation and routing
  2. 80 legs - a crawling service
  3. Maltego - a desktop open source forensics application
  4. Picalo - desktop forensics application (or any other tool)
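
For the crawler experiments, a Scrapy spider that collects outbound links from a member's homepage might look roughly like this. It is a sketch only; the start URL and output fields are placeholders, and the exact API will depend on the Scrapy version we settle on.

    # A minimal sketch: a Scrapy spider that yields every outbound link
    # found on a (placeholder) start page.
    import scrapy

    class ProfileLinkSpider(scrapy.Spider):
        name = 'profile_links'
        start_urls = ['http://example.org/~someone/']

        def parse(self, response):
            for href in response.css('a::attr(href)').extract():
                yield {'source': response.url, 'link': response.urljoin(href)}

    # Run with: scrapy runspider profile_links.py -o links.json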


Outcomes:

  1. More interesting data, richer connections
  2. A generic crawling methodology 

Whilst I am more than familiar with creating simple crawlers, this will need a more standalone, robust, better-architected approach.


Milestone 4: Semantified Data

This stage will look at how to apply semantic tools to gain a more reasoned understanding of the data gathered. We will be looking at ...

  1. Yahoo Boss for general searching.
  2. OpenCalais, to attempt to discover entities in unstructured data
  3. DBPedia, Edina, Freebase for understanding entities like Towns, Universities, Concepts etc
  4. ePrints research repository to connect people via research outputs

SPARQL is very new to me. I need more understanding of RDF, Linked Data etc. Although the JISC Dev8D conference gave me new insights into the possibilities presented by Linked Data and open data, I still feel I have a way to go to fully understand this area.
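As a first small step with SPARQL, the sketch below queries DBpedia from Python using the SPARQLWrapper library. The query itself (listing a few universities) is just an illustration, not part of the project design.

    # A minimal sketch: ask DBpedia for some universities via SPARQL.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper('http://dbpedia.org/sparql')
    sparql.setQuery("""
        PREFIX dbo:  <http://dbpedia.org/ontology/>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?university ?label WHERE {
            ?university a dbo:University ;
                        rdfs:label ?label .
            FILTER (lang(?label) = 'en')
        } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for binding in results['results']['bindings']:
        print(binding['label']['value'])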

Milestone 5: Slightly More Sophisticated Presentation

Depending on the data gathered, there will be opportunities to present the data in more interesting and usable ways. Initially we will attempt to use simple visualisations such as timelines, tree maps, word clouds etc., which can be easily integrated into the presentation tool.
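For example, the data behind a word cloud of members' declared interests is little more than a frequency count. The sketch below uses placeholder interests; the weights would be handed to whatever word-cloud widget the presentation tool supports.

    # A minimal sketch: weight interest terms by how often members declare them.
    from collections import Counter

    interests = ['e-learning', 'python', 'linked data', 'python', 'rss']
    weights = Counter(interests)
    for term, count in weights.most_common():
        print(term, count)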



More ambitious visualisations, such as the Mention Map example, will be explored if appropriate.

I need to understand more about the maths behind networks and visualisation. Luckily, Gustav Delius is around to advise.
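As a small taste of the network maths involved, the mined "who follows whom" data can be treated as a graph and scored with an off-the-shelf centrality measure. The sketch below uses the networkx library with placeholder edges.

    # A minimal sketch: build a directed "follows" graph and rank people
    # by degree centrality.
    import networkx as nx

    follows = [('alice', 'bob'), ('carol', 'bob'), ('bob', 'alice')]
    graph = nx.DiGraph(follows)
    for person, score in nx.degree_centrality(graph).items():
        print(person, round(score, 2))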







