Jan 6, 2012

Geek Crawlers isn't dead yet!

As promised in the last post, I have started some work towards structuring the crawling script to support pluggable components. The basic design is far from elegant or complete, but at least there has been a start. If you are interested, you can find the new source code on my GitHub repository here.

More on this, soon!

Jan 3, 2012

GC: Geek Crawlers

As part of my job at ThoughtWorks, there are a lot of things that I get to do in interesting ways. We have been brainstorming on how to increase the effectiveness of some of the geek community events that we run. As far as I understand, there are three primary things that we need to ensure:

  • Gather people passionate about technology.
  • Gather people passionate about technology.
  • Gather people passionate about technology.
And that's about it.

However, it's easier said than done. As much as we would like to have more and more passionate people joining the events, there is either an apparent lack of visibility of our events to the right set of people, or the events themselves are not compelling enough (which ties back to the lack of participation by people who bring value in terms of both passion and knowledge).

We have tried leveraging the existing forums and meetup lists to make sure word gets around, but I personally think this is so vital to creating the right ecosystem that we need to work harder than just that.

So I decided it's about time I dug up some of those folks and invited them personally. The idea is to find people who are teeming with knowledge and are visible on different forums like StackOverflow, Twitter, HackerNews, Quora etc.

In short, crawl for intelligence. If only it were that easy.

In my effort to find the right set of people, I have started crawling some of these forums for their user base, starting with StackOverflow.

I decided to take Mechanize for a spin (a Nokogiri-based scraping library, which wraps Nokogiri's parsing capabilities in nice, syntactically sugary objects). Along the way, I also chose MongoDB (via Mongoid) as the local data store, for no particular reason.
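To give a flavour of that sugar: a Mechanize page is still a Nokogiri document underneath, so the usual search/CSS calls just work. The selector below is only an illustration, not necessarily StackOverflow's actual markup.

```ruby
require 'mechanize'

agent = Mechanize.new
page  = agent.get('http://stackoverflow.com/users')  # returns a Mechanize::Page

# search/at delegate to Nokogiri, so CSS and XPath both work on the page object.
page.search('div.user-details a').each do |link|
  puts "#{link.text} -> #{link['href']}"
end
```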

My objectives were simple:
  • Crawl the user base that StackOverflow has.
  • Store all users in Mongo with location, name, reputation and a URL pointing to the resource.
  • Using MongoDB's regex capability, filter people by different criteria, like location.
  • Manually google-stalk relevant users and make a list of whom I should approach in a non-intrusive way. (I consider invite emails spam, and before I make any moves I want to give it some hard thought as to what would be a good way to invite someone, or let them know of events that are happening.)
Here is a small script that I wrote to do exactly the crawling bit.
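Roughly, the script boils down to something like this. This is a simplified sketch rather than the script itself: the Mongoid model just mirrors the fields listed above, while the query string, selectors and pagination handling are approximations of what the /users listing page looks like, not taken from the actual source.

```ruby
require 'mechanize'
require 'mongoid'

# Assumes Mongoid is already configured (e.g. via a mongoid.yml pointing at a local MongoDB).
class User
  include Mongoid::Document
  field :name,       type: String
  field :location,   type: String
  field :reputation, type: Integer
  field :url,        type: String
end

agent = Mechanize.new
page_number = 1

loop do
  # Walk the paginated user listing until a page comes back empty.
  listing = agent.get("http://stackoverflow.com/users?page=#{page_number}")
  users   = listing.search('div.user-details')
  break if users.empty?

  users.each do |node|
    link       = node.at('a')
    location   = node.at('.user-location')
    reputation = node.at('.reputation-score')

    User.create!(
      name:       link.text.strip,
      url:        "http://stackoverflow.com#{link['href']}",
      location:   location && location.text.strip,
      reputation: reputation && reputation.text.delete(',').to_i
    )
  end

  page_number += 1
end
```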


I am certain beyond doubt that there are better ways to do this. I also think there is an SO API available to do this a little better, but I really wanted to explore Mechanize a bit along the way.

Anyway, in a couple of hours I had all the StackOverflow users residing in MongoDB. The total number of users I could crawl was 663,876, which is a decent number to start off with, although it does not in any way tell me who is in the same geographical location as I am. That, however, was a job easily done with MongoDB's regex matches.
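For the record, the location filter is more or less a one-liner with Mongoid. The city regex and the limit below are just placeholders, and the order_by syntax assumes a reasonably recent Mongoid.

```ruby
# Case-insensitive regex match on the stored location, highest reputation first.
locals = User.where(location: /pune/i).order_by(reputation: :desc)

locals.limit(50).each do |u|
  puts "#{u.name} (#{u.reputation}) - #{u.url}"
end
```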

Sorted by SO reputation, there is a lot of data that I can explore on each user, but it's still a task I have to do manually for now. I am inclined to hack together a quick component that calls the social networks' APIs to get more relevant information. Of course it isn't going to be definitive and would still require manual intervention, but if it could do a quarter of what Rapportive does today (and I must say, it does it well), I would think of it as a successful experiment.

And yet, this is merely one site that I have had an opportunity to crawl with considerable success. I am looking forward to extending this script into more structured components, where the crawlers can be generic while the information extractors are pluggable components tailored for each portal.
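To make that concrete, the split I have in mind is roughly the following. The names are illustrative only, anything that responds to url_for and extract could be plugged in, and the User model is the one from the crawling sketch above.

```ruby
require 'mechanize'

# A generic crawler that walks pages and leaves the site-specific parsing
# to whichever extractor it is handed (plain duck typing).
class Crawler
  def initialize(extractor, agent = Mechanize.new)
    @extractor = extractor
    @agent     = agent
  end

  def crawl
    page_number = 1
    loop do
      page    = @agent.get(@extractor.url_for(page_number))
      records = @extractor.extract(page)
      break if records.empty?

      records.each { |attributes| User.create!(attributes) }
      page_number += 1
    end
  end
end

# One extractor per portal; this one mirrors the StackOverflow sketch above.
class StackOverflowExtractor
  def url_for(page_number)
    "http://stackoverflow.com/users?page=#{page_number}"
  end

  def extract(page)
    page.search('div.user-details').map do |node|
      link = node.at('a')
      { name: link.text.strip, url: "http://stackoverflow.com#{link['href']}" }
    end
  end
end

Crawler.new(StackOverflowExtractor.new).crawl
```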

And then I am also tempted to throw a bit of map-reduce into the mix to speed things up. Considering that the SO user base isn't as vast as it would be on other sites, I definitely need to structure this better as I go along.
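As a first, trivial taste, MongoDB's own map-reduce can already run simple aggregations over the stored users. This is an aggregation rather than a way to parallelise the crawl itself, and it assumes a Mongoid version that exposes map_reduce on criteria; counting users per location would look something like this:

```ruby
map = <<-JS
  function() { emit(this.location, 1); }
JS

reduce = <<-JS
  function(key, values) { return Array.sum(values); }
JS

# Counts crawled users per location; out(inline: 1) returns the results
# directly instead of writing them to a collection.
User.all.map_reduce(map, reduce).out(inline: 1).each do |doc|
  puts "#{doc['_id']}: #{doc['value']}"
end
```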

I'll come back soon and post more results about how this goes. If nothing else, it is definitely a great learning experience, as it involves playing with large datasets.