Thursday, May 31, 2012

Image crawler engine using "anemone + geospider + redirect_follower + memCached"

Recently our team at Kiprosh built an image crawler engine with the following requirements (to be developed strictly within one week or less).

1) crawl and spider all images from a given URL (http or https)
2) crawl it as a background process
3) scrape up to 3 levels deep per link (configurable depth)
4) save image URLs in the DB, with caching
5) keep displaying the crawled images in the UI as they arrive
6) ability to tag these images
7) ability to "multi select" (using shift + mouse clicks) images to tag and untag them
8) a polished UI using ajax and pjax for pagination and tagging, with the ability to cache
9) search feature based on tags
10) multi-size crawl feature

After thorough research and a quick PoC, we built this crawler engine on the geo-spider, anemone, redirect_follower and memcached gems. The overall app turned out to be very stable, scalable, fast and elegant, largely thanks to these awesome gems. There were alternatives to geo-spider, but for our requirements geo-spider served the specific purpose of retrieving the metadata we needed from the source URLs. Anemone is another cool gem: it handled depth crawling of a URL, which the other gems and patterns we tried earlier did not let us do.
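To make the depth-limited crawl idea concrete, here is a minimal sketch in plain Ruby. It is not Anemone's API and not our production code: the names crawl_images, extract_attr, absolutize and the fetcher argument are made up for illustration, and a regex stands in for a real HTML parser only to keep the sketch free of dependencies.

```ruby
require "uri"
require "set"

# Hypothetical helper: collect the values of `attr` on every `tag` in raw HTML.
# A regex keeps the sketch dependency-free; a real crawler would use a parser.
def extract_attr(html, tag, attr)
  html.scan(/<#{tag}\b[^>]*?\b#{attr}\s*=\s*["']([^"']+)["']/i).flatten
end

# Hypothetical helper: resolve a possibly relative reference against the page
# URL. Returns nil when the reference cannot be parsed as a URI.
def absolutize(base, ref)
  URI.join(base, ref).to_s
rescue URI::Error
  nil
end

# Breadth-first crawl starting at start_url. The start page is depth 0 and
# links are followed until max_depth is reached. `fetcher` is any callable
# that takes a URL string and returns the page HTML, or nil on failure.
def crawl_images(start_url, fetcher:, max_depth: 3)
  visited = Set.new
  images  = Set.new
  queue   = [[start_url, 0]]

  until queue.empty?
    url, depth = queue.shift
    next if visited.include?(url)
    visited << url

    html = fetcher.call(url)
    next if html.nil?

    # Record every image on this page as an absolute URL.
    extract_attr(html, "img", "src").each do |src|
      abs = absolutize(url, src)
      images << abs if abs
    end

    # Do not follow links from pages that are already at the depth limit.
    next if depth >= max_depth

    extract_attr(html, "a", "href").each do |href|
      link = absolutize(url, href)
      queue << [link, depth + 1] if link && link.start_with?("http://", "https://")
    end
  end

  images.to_a
end
```

Passing the fetcher in as an argument keeps the traversal logic separate from the HTTP layer, so redirect handling or a background worker can be plugged in without touching the crawl loop.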

Links to these gems and their respective project pages:
GeoSpider, Anemone, Redirect_Follower, Memcached

On Heroku, we had to use the following gems for caching.

#gem "memcached-northscale", "~> 0.19.5.4"
#gem 'memcached-northscale'
#gem 'dalli'

On our dedicated node, memcached worked just fine as-is, without any customization or alternative gem versions.
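For readers wondering how the cache fits in, a cache-aside lookup for crawled image URLs could look roughly like the sketch below. This is an assumption about the general pattern, not the code we shipped: MemoryCache is a made-up in-memory stand-in exposing only get and set (the same two calls a memcached client such as dalli provides), and cached_image_urls is a hypothetical helper name.

```ruby
require "digest"

# Hypothetical in-memory stand-in for a memcached client, for illustration.
# In a real app this object would be replaced by an actual cache client.
class MemoryCache
  def initialize
    @store = {}
  end

  def get(key)
    @store[key]
  end

  def set(key, value)
    @store[key] = value
  end
end

# Cache-aside lookup: return the cached image list for page_url if present,
# otherwise run the block (the expensive crawl), store its result and return it.
def cached_image_urls(cache, page_url)
  # memcached keys are limited in length and must not contain whitespace,
  # so the URL is hashed instead of being used as the key directly.
  key = "images:#{Digest::SHA1.hexdigest(page_url)}"

  hit = cache.get(key)
  return hit unless hit.nil?

  urls = yield
  cache.set(key, urls)
  urls
end
```

With this shape the block only runs on a cache miss, so repeated requests for the same page are served from the cache instead of triggering a new crawl.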
