Thursday, May 31, 2012

Image crawler engine using "anemone + geospider + redirect_follower + memCached"

Recently our team at Kiprosh built an image crawler engine with the following requirements (to be developed strictly within one week or less).

1) crawl and spider all images from a given URL (http or https)
2) run the crawl as a background process
3) scrape up to 3 levels deep per link (configurable depth)
4) save image URLs in the DB, with caching
5) keep displaying the crawled images in the UI
6) ability to tag these images
7) ability to "multi select" images (using shift + mouse clicks) to tag and untag them
8) good-looking UI with ajax, pjax for pagination, tagging and ability to cache
9) search feature based on tags
10) multi size crawl feature

After thorough research and a quick PoC, we used gems like geo-spider, anemone, redirect_follower and memcached to build this crawler engine. Thanks to these gems, the overall app turned out to be stable, scalable, fast and elegant. There were other gems comparable to geo-spider, but for our requirements geo-spider served the specific purpose of retrieving the metadata we needed from the source URLs. Anemone is another cool gem: it handles depth crawling of a URL, which the other gems and patterns we tried earlier did not let us do.
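To illustrate the depth crawling described above, here is a minimal sketch (not the actual engine code) of how Anemone could collect image URLs. It assumes Anemone's `depth_limit` option and `on_every_page` hook; the method names `absolute_image_urls` and `crawl_image_urls` are hypothetical helpers introduced for this example.

```ruby
require "uri"

# Resolve raw <img src> values against the page URL and keep only http(s) results.
def absolute_image_urls(page_url, srcs)
  urls = srcs.map do |src|
    begin
      URI.join(page_url, src.strip).to_s
    rescue URI::Error
      nil # skip values that cannot be parsed as a URI
    end
  end
  urls.compact.select { |u| u.start_with?("http://", "https://") }.uniq
end

# Crawl start_url up to `depth` links deep and collect image URLs.
# Needs the anemone gem; it is loaded lazily so the helper above works without it.
def crawl_image_urls(start_url, depth: 3)
  require "anemone"
  found = []
  Anemone.crawl(start_url, depth_limit: depth) do |anemone|
    anemone.on_every_page do |page|
      next if page.doc.nil? # non-HTML responses have no parsed document
      srcs = page.doc.css("img").map { |img| img["src"] }.compact
      found.concat(absolute_image_urls(page.url.to_s, srcs))
    end
  end
  found.uniq
end
```

Calling `crawl_image_urls("https://example.com", depth: 3)` would return the unique absolute image URLs found within three levels of the start page; in a real app this call would run inside a background job rather than in the request cycle.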

Links to these gems and their respective project pages:
GeoSpider, Anemone, Redirect_Follower, Memcached

On Heroku, we had to use the following gems for caching.

#gem "memcached-northscale", "~> 0.19.5.4"
#gem 'memcached-northscale'
#gem 'dalli'

On our dedicated node, memcached worked just fine without any customization or alternative gem versions.
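As a rough sketch of how crawled image URLs could be cached, the class below wraps any client that responds to `get(key)` and `set(key, value, ttl)`, which is the shape of `Dalli::Client`. The names `ImageUrlCache` and `MemoryClient` are hypothetical and introduced only for this example; `MemoryClient` is an in-memory stand-in so the sketch runs without a memcached server.

```ruby
require "digest/sha1"

# Cache-aside wrapper: return cached image URLs for a page, or compute and store them.
class ImageUrlCache
  def initialize(client, ttl: 3600)
    @client = client # any object with get(key) and set(key, value, ttl)
    @ttl = ttl       # expiry in seconds
  end

  def fetch(page_url)
    # Hash the URL so the key stays short, whatever the URL length.
    key = "images:#{Digest::SHA1.hexdigest(page_url)}"
    cached = @client.get(key)
    return cached unless cached.nil?

    urls = yield # cache miss: let the caller crawl the page
    @client.set(key, urls, @ttl)
    urls
  end
end

# In-memory stand-in with the same get/set shape (ignores the TTL).
class MemoryClient
  def initialize
    @store = {}
  end

  def get(key)
    @store[key]
  end

  def set(key, value, _ttl = nil)
    @store[key] = value
    true
  end
end
```

With the dalli gem installed, `ImageUrlCache.new(Dalli::Client.new("localhost:11211"))` should be a drop-in replacement for the stand-in; the server address here is an assumption.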

Queue_Classic and Mechanize Gems

Creating a list of useful (indeed very useful) gems for future reference. We recently used the following gems in the Rails apps (actually products and tools) that we are developing for our clients.

1) queue_classic - Though we have used Resque (backed by Redis) and RabbitMQ in three or four apps in the past, for this specific requirement we wanted a fast, low-maintenance message queue with a simple and intuitive user experience. queue_classic is built on PostgreSQL, which avoids the need to add Redis or 0MQ on Heroku. It does not add much database load; on the contrary, it is pretty efficient because it relies on inherently reliable PostgreSQL features such as LISTEN/NOTIFY. (Oracle offers comparable notification mechanisms too.) Thus, to avoid running a Resque worker on Heroku, and because of its sheer simplicity, we opted for queue_classic. BTW, queue_classic is used extensively by the Heroku Postgres team to monitor the health of their customer databases, processing hundreds of jobs per second.

RailsCasts - http://railscasts.com/episodes/344-queue-classic
Source page and more information is at - https://github.com/ryandotsmith/queue_classic
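A minimal sketch of how a job might be enqueued: queue_classic takes a "Class.method" string plus JSON-serializable arguments via `QC.enqueue`, and a worker later resolves and calls that method. The `Notifier` module, the `enqueue_notification` helper and the `InlineQueue` stand-in are hypothetical names for illustration; `InlineQueue` only mimics the dispatch in memory and is not queue_classic's own worker.

```ruby
# Hypothetical job target: a plain module method with simple arguments.
module Notifier
  def self.deliver(user_id, message)
    "user #{user_id}: #{message}"
  end
end

# Enqueue through queue_classic by default; another queue object can be injected.
def enqueue_notification(user_id, message, queue: nil)
  queue ||= begin
    require "queue_classic" # needs the gem and a configured PostgreSQL database
    QC
  end
  queue.enqueue("Notifier.deliver", user_id, message)
end

# In-memory stand-in that runs the job immediately, resolving the
# "Class.method" string the way a worker would have to.
class InlineQueue
  attr_reader :results

  def initialize
    @results = []
  end

  def enqueue(method_name, *args)
    receiver, message = method_name.split(".")
    @results << Object.const_get(receiver).public_send(message, *args)
  end
end
```

In production the call would simply be `enqueue_notification(42, "hello")`, with a separate worker process picking the job up from the database.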

2) Mechanize - What a wonderful gem, we must say :) We have done lots of scraping, but Mechanize is really handy for automating interaction with websites. We are building a web-based tool for an enterprise to automate a large number of routine tasks for their helpdesk support staff. Mechanize scripts help us execute these routine tasks, which saves a ton of time for the support team.

RailsCasts - http://railscasts.com/episodes/191-mechanize
Source page and more information is at https://github.com/tenderlove/mechanize
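For a flavour of what such an automation script can look like, here is a small sketch of a form login with Mechanize. It relies on `Mechanize#get`, `Page#form_with`, `Form#[]=` and `Form#submit`; the form name `login`, the field names and the helper `login_and_fetch_title` are assumptions made up for this example, not taken from our tool.

```ruby
# Log in through a form and return the title of the page that follows.
# The form name and field names below are hypothetical placeholders.
def login_and_fetch_title(login_url, username, password, agent: nil)
  agent ||= begin
    require "mechanize" # loaded lazily; only needed when no agent is injected
    Mechanize.new
  end

  page = agent.get(login_url)
  form = page.form_with(name: "login")
  raise ArgumentError, "login form not found" if form.nil?

  form["username"] = username
  form["password"] = password
  form.submit.title # submit returns the next page
end
```

Passing the agent in as a parameter keeps the script easy to exercise against a fake page, while real runs just call `login_and_fetch_title(url, user, pass)` and let Mechanize handle cookies and redirects.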