# Building a distributed web-crawler in Elixir

Here’s an n-part tutorial on getting a distributed web-crawler running with Elixir. What’s the motivation for this yak-shaving project? Essentially, someone asked me. So here goes…

The crawler has two main tasks, downloading pages and parsing them for new links, plus a few requirements:

• Download pages and parse them for new links.
• Spawn or destroy worker nodes as required and have the crawl pick back up where it left off.
• Limit how often a worker accesses a given website, to avoid getting banned.
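The link-extraction part of the first bullet can be sketched quickly. The snippet below is a Python illustration using the standard library's `html.parser` (the real crawler is written in Elixir; the class name and sample page here are made up for illustration):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag seen in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/wiki/Elixir">Elixir</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/wiki/Elixir']
```

Any links found this way get pushed back onto the queue for other workers to pick up.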

The architecture of the crawler could be done a few different ways. We’ll have a queue that workers pull items from, storing the results back on a central storage node. Alternatively, a single queue reader could push URLs out to the workers to download. To simplify matters, we’ll run one central node with Redis that stores both the state of the crawler and all the downloaded pages.
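To make that flow concrete, here is a sketch of a single worker iteration, with plain Python sets standing in for the Redis structures. Only the `download_queue` name comes from the real setup; `seen`, `last_access`, and the helper names are hypothetical:

```python
import time
from urllib.parse import urlparse

# In-memory stand-ins for the Redis data on the storage node.
# "download_queue" matches the Redis set seeded later in this post;
# "seen" and "last_access" are illustrative names.
download_queue = set()
seen = set()
last_access = {}       # host -> time of the worker's last request
MIN_INTERVAL = 1.0     # minimum seconds between hits to the same host

def allowed(url):
    """Per-host rate limit: refuse hosts we hit too recently."""
    host = urlparse(url).netloc
    now = time.monotonic()
    if now - last_access.get(host, float("-inf")) < MIN_INTERVAL:
        return False
    last_access[host] = now
    return True

def crawl_step(fetch, parse_links):
    """One worker iteration: pop a URL, fetch it, requeue new links."""
    if not download_queue:
        return
    url = download_queue.pop()          # like Redis SPOP
    if url in seen:
        return
    if not allowed(url):
        download_queue.add(url)         # try again later
        return
    seen.add(url)
    for link in parse_links(fetch(url)):
        if link not in seen:
            download_queue.add(link)
```

`fetch` and `parse_links` are passed in so the loop stays testable; in the real crawler they would be the HTTP client and the link parser, and the queue operations would be Redis commands such as SPOP and SADD against the storage node.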

Here’s a summary of the posts that go through building this crawler:

If you just want to use the application and not go through all the posts, the example below will set up a storage node with two worker nodes.

• Install Vagrant and Ansible on your local box. If you’re using Vagrant with LXC, that’s great; otherwise you can use VirtualBox, though I haven’t tested that myself.
• Clone the repo.
git clone https://github.com/tjheeta/elixir_web_crawler.git

• Alter elixir_web_crawler/ansible/playbook.yml to adjust the limit higher or lower as needed.
• Bring up the VMs.
cd elixir_web_crawler && vagrant up

• Seed the download queue with a starting URL.
vagrant ssh storage1 -c 'redis-cli sadd download_queue https://en.wikipedia.org/wiki/Main_Page'

• Watch a worker’s log to see it crawling.
vagrant ssh worker1 -c 'less /etc/service/runit_crawlr/log/main/current'

• Check the downloaded pages on the storage node.
vagrant ssh storage1 -c 'find /home/erlang/dl/'