10 Dec 2014

Building a distributed web-crawler in Elixir

This is an n-part tutorial on getting a distributed web-crawler running with Elixir. The motivation for this yak-shaving project? Essentially, someone asked me. So here goes…

The crawler has two main tasks and a few requirements:

Download the pages and store them on some node.
Parse the pages for new links.
Ability to spawn or destroy worker nodes as required and have the crawl pick back up where it left off.
Ability to limit the number of times a worker accesses a website, to avoid getting banned.
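To make the second task concrete, here is a minimal Python sketch of pulling links out of a downloaded page. It is only an illustration, not the crawler's actual Elixir code: the names LinkCollector and extract_links are hypothetical, and the choice to drop fragments and non-HTTP links is an assumption about what a crawler would want to queue.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL, drop "#fragment".
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Assumption: only http(s) links are worth downloading.
                if absolute.startswith(("http://", "https://")):
                    self.links.add(absolute)


def extract_links(base_url, html):
    """Return the set of absolute links found in html (hypothetical helper)."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

Returning a set means a link that appears several times on one page is only queued once.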

The crawler could be architected in a few different ways. Here we’ll have a queue, with workers pulling items from it and storing the results back on a central storage node. An alternative would be a queue reader that sends the URLs to the workers to download. To simplify matters, we’ll have only one central node, running Redis, which stores both the state of the crawler and all the downloaded pages.
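The pull-based design can be sketched in a few lines of Python. This is a rough model under stated assumptions, not the real implementation: plain Python sets stand in for the Redis sets on the storage node (the set names are the ones mentioned later in this post), the exact way URLs move between those sets is my guess, and Storage, download_step, parse_step, fetch and extract_links are all hypothetical names.

```python
class Storage:
    """In-memory stand-in for the central storage node."""

    def __init__(self):
        self.download_queue = set()
        self.parse_queue = set()
        self.download_finished = set()
        self.pages = {}  # url -> downloaded body


def download_step(storage, fetch):
    """A worker pulls one URL, downloads it, and stores the result centrally."""
    if not storage.download_queue:
        return None
    url = storage.download_queue.pop()  # arbitrary member, like Redis SPOP
    if url not in storage.download_finished:
        storage.pages[url] = fetch(url)
        storage.download_finished.add(url)
        storage.parse_queue.add(url)
    return url


def parse_step(storage, extract_links):
    """A worker pulls one stored page and queues any links not yet downloaded."""
    if not storage.parse_queue:
        return None
    url = storage.parse_queue.pop()
    for link in extract_links(storage.pages[url]):
        if link not in storage.download_finished:
            storage.download_queue.add(link)
    return url
```

Because all state lives in the storage object rather than in the workers, a worker can disappear at any point and another one can carry on from the same queues, which is the property the requirements above ask for.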

So here’s a summary of a few posts that go through building this crawler:

If you just want to use the application without going through all the posts, the example below will set up a storage node with two worker nodes.

Install Vagrant and Ansible on your local box. If you’re using Vagrant with LXC, that’s great; otherwise you can use VirtualBox, though I haven’t tested it myself.
Clone the repo.

git clone https://github.com/tjheeta/elixir_web_crawler.git

Optionally, edit elixir_web_crawler/ansible/playbook.yml to raise or lower limit (100 by default), then bring the nodes up:

cd elixir_web_crawler && vagrant up

Add Wikipedia to the download queue:

vagrant ssh storage1 -c 'redis-cli sadd download_queue https://en.wikipedia.org/wiki/Main_Page'

To see the download logs on worker1:

vagrant ssh worker1 -c 'less /etc/service/runit_crawlr/log/main/current'

To see the downloaded files:

vagrant ssh storage1 -c 'find /home/erlang/dl/'

The Redis sets of interest are parse_queue, download_queue, and download_finished. The workers will stop downloading from Wikipedia once the limit has been reached, which is 100 by default. If you’re interested in how it was built, please go through the posts listed above.
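As a rough illustration of what such a limit does, here is a small Python sketch of a per-site download cap. It is not how the crawler itself implements the limit: HostLimiter is a hypothetical name, counting per hostname is an assumption, and the counters here live in process memory, whereas a distributed crawler would need to keep them somewhere shared such as the storage node.

```python
from collections import Counter
from urllib.parse import urlparse


class HostLimiter:
    """Allows at most `limit` downloads per hostname (hypothetical helper)."""

    def __init__(self, limit=100):
        self.limit = limit
        self.counts = Counter()

    def allow(self, url):
        """Return True and count the download if the host is under its limit."""
        host = urlparse(url).netloc
        if self.counts[host] >= self.limit:
            return False
        self.counts[host] += 1
        return True


limiter = HostLimiter(limit=2)
for url in ["https://en.wikipedia.org/wiki/A",
            "https://en.wikipedia.org/wiki/B",
            "https://en.wikipedia.org/wiki/C"]:
    # The third URL is refused because the host has already been hit twice.
    print(url, limiter.allow(url))
```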
