# Building a distributed web-crawler in Elixir

Here’s an n-part tutorial on getting a distributed web-crawler running with Elixir. What’s the motivation for this yak-shaving project? Essentially, someone asked me. So here goes…

The crawler has two main tasks, downloading pages and parsing them for new links, plus a few requirements:

• Download pages and parse them for new links.
• Spawn or destroy worker nodes as required and have the crawl pick back up where it left off.
• Limit how often a worker accesses a given website, to avoid getting banned.
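The link-extraction part of the first bullet can be sketched quickly. The snippet below is a Python illustration using the standard library's `html.parser` (the real crawler is written in Elixir; the class name and sample page here are made up for illustration):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag seen in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/wiki/Elixir">Elixir</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/wiki/Elixir']
```

Any links found this way get pushed back onto the queue for other workers to pick up.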

The architecture of the crawler could be done a few different ways. We’ll have a queue that workers pull items from, storing the results back on a central storage node. Alternatively, a single queue reader could push URLs out to the workers to download. To simplify matters, we’ll run one central node with Redis that stores both the state of the crawler and all the downloaded pages.
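To make that flow concrete, here is a sketch of a single worker iteration, with plain Python sets standing in for the Redis structures. Only the `download_queue` name comes from the real setup; `seen`, `last_access`, and the helper names are hypothetical:

```python
import time
from urllib.parse import urlparse

# In-memory stand-ins for the Redis data on the storage node.
# "download_queue" matches the Redis set seeded later in this post;
# "seen" and "last_access" are illustrative names.
download_queue = set()
seen = set()
last_access = {}       # host -> time of the worker's last request
MIN_INTERVAL = 1.0     # minimum seconds between hits to the same host

def allowed(url):
    """Per-host rate limit: refuse hosts we hit too recently."""
    host = urlparse(url).netloc
    now = time.monotonic()
    if now - last_access.get(host, float("-inf")) < MIN_INTERVAL:
        return False
    last_access[host] = now
    return True

def crawl_step(fetch, parse_links):
    """One worker iteration: pop a URL, fetch it, requeue new links."""
    if not download_queue:
        return
    url = download_queue.pop()          # like Redis SPOP
    if url in seen:
        return
    if not allowed(url):
        download_queue.add(url)         # try again later
        return
    seen.add(url)
    for link in parse_links(fetch(url)):
        if link not in seen:
            download_queue.add(link)
```

`fetch` and `parse_links` are passed in so the loop stays testable; in the real crawler they would be the HTTP client and the link parser, and the queue operations would be Redis commands such as SPOP and SADD against the storage node.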

Here’s a summary of the posts that go through building this crawler:

If you just want to use the application and not go through all the posts, the example below will set up a storage node with two worker nodes.

• Install Vagrant and Ansible on your local box. If you’re using Vagrant with LXC, that’s great; otherwise you can use VirtualBox, though I haven’t tested that myself.
• Clone the repo.
git clone https://github.com/tjheeta/elixir_web_crawler.git

• Alter elixir_web_crawler/ansible/playbook.yml to adjust the limit higher or lower as needed.
• Bring up the VMs.
cd elixir_web_crawler && vagrant up

• Seed the download queue with a starting URL.
vagrant ssh storage1 -c 'redis-cli sadd download_queue https://en.wikipedia.org/wiki/Main_Page'

• Watch a worker’s log to see it crawling.
vagrant ssh worker1 -c 'less /etc/service/runit_crawlr/log/main/current'

• Check the downloaded pages on the storage node.
vagrant ssh storage1 -c 'find /home/erlang/dl/'