Building a distributed web crawler in Elixir
Here’s an n-part tutorial on getting a distributed web crawler running with Elixir. What’s the motivation for this yak-shaving project? Essentially, someone asked me. So here goes…
The crawler has two main tasks and a few requirements:
- Download the pages and store them on some node.
- Parse the pages for new links.
- Ability to spawn or destroy worker nodes as required and have the crawl pick back up where it left off.
- Ability to limit the number of times a worker accesses a website to avoid getting banned.
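The "parse the pages for new links" task can be sketched roughly as below. This is a minimal illustration using only the Elixir standard library, not the code from the repo: the module name `LinkSketch` and the function `extract/2` are made up for this example, and a regex is a crude stand-in for a real HTML parser.

```elixir
defmodule LinkSketch do
  # Hypothetical helper, not taken from the repo: pulls double-quoted href
  # values out of an HTML string and resolves them against the page's URL.
  def extract(html, base_url) do
    # Crude pattern; a real crawler would use an HTML parser instead.
    href = ~r/href\s*=\s*"([^"]+)"/i

    href
    |> Regex.scan(html, capture: :all_but_first)
    |> List.flatten()
    # Turn relative links such as "/wiki/Elixir" into absolute URLs.
    |> Enum.map(fn link -> base_url |> URI.merge(link) |> URI.to_string() end)
    # Keep only links a downloader could fetch over HTTP(S).
    |> Enum.filter(&String.starts_with?(&1, ["http://", "https://"]))
    |> Enum.uniq()
  end
end
```

Calling `LinkSketch.extract(html, "https://en.wikipedia.org/wiki/Main_Page")` would return a de-duplicated list of absolute URLs that could then be added to the download queue.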
The architecture of the crawler could be done a few different ways. We’ll have a queue with workers pulling items from it and storing the results back on a central storage node. An alternative would be a queue reader that sends the URLs to the workers to pull down. To simplify matters, we’ll have only one central node running Redis, which stores both the state of the crawler and all the downloaded pages.
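To make the "workers pull from a queue and store back centrally" shape concrete, here is a small in-memory sketch. It assumes nothing about the real implementation: an `Agent` stands in for the central Redis node, and `QueueSketch`, `pull/1`, `store/3`, and `work/2` are hypothetical names invented for this example.

```elixir
defmodule QueueSketch do
  # Hypothetical in-memory stand-in for the central storage node.
  # State: a list of URLs still to fetch and a map of url => body.
  def start_link(urls) do
    Agent.start_link(fn -> %{queue: urls, stored: %{}} end)
  end

  # A worker asks for the next URL; returns {:ok, url} or :empty.
  def pull(pid) do
    Agent.get_and_update(pid, fn
      %{queue: [url | rest]} = state -> {{:ok, url}, %{state | queue: rest}}
      %{queue: []} = state -> {:empty, state}
    end)
  end

  # A worker hands the downloaded body back to the central node.
  def store(pid, url, body) do
    Agent.update(pid, fn state ->
      %{state | stored: Map.put(state.stored, url, body)}
    end)
  end

  def stored(pid), do: Agent.get(pid, fn state -> state.stored end)

  # Worker loop: pull, fetch, store, repeat until the queue is empty.
  # `fetch` is any one-argument function that turns a URL into a body.
  def work(pid, fetch) do
    case pull(pid) do
      {:ok, url} ->
        store(pid, url, fetch.(url))
        work(pid, fetch)

      :empty ->
        :done
    end
  end
end
```

Because every worker only talks to the central process, workers can be started or stopped freely: whatever is left in the queue gets picked up by whoever pulls next.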
Here’s a summary of the posts that go through building this crawler:
- Connecting Erlang nodes together
- Setting up a Redis pool with poolboy
- Saving files on a remote node
- Supervision trees
- Startup and runit
Let’s say you just want to use the application and not go through all the posts. The steps below will set up a storage node and two worker nodes.
- Install Vagrant and Ansible on your local box. If you’re using Vagrant with LXC, that’s great; otherwise, you can use VirtualBox, though I haven’t tested it myself.
- Clone the repo.
git clone https://github.com/tjheeta/elixir_web_crawler.git
- Optionally, alter elixir_web_crawler/ansible/playbook.yml to raise or lower the limit.
- Bring up the machines.
cd elixir_web_crawler && vagrant up
- Add Wikipedia to the download queue
vagrant ssh storage1 -c 'redis-cli sadd download_queue https://en.wikipedia.org/wiki/Main_Page'
- To see the download logs on worker1
vagrant ssh worker1 -c 'less /etc/service/runit_crawlr/log/main/current'
- To see the files downloaded
vagrant ssh storage1 -c 'find /home/erlang/dl/'
The Redis sets of interest are parse_queue, download_queue, and download_finished. The workers will stop downloading Wikipedia once the limit, which is 100 by default, has been reached. If you’re interested in how it was built, please go through the posts listed above.
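For a feel of how three sets like these could fit together, here is a sketch that models them with `MapSet` instead of Redis. The flow shown is my assumption for illustration, not something confirmed by the repo: a URL moves from download_queue to both parse_queue and download_finished, parsing feeds new links back into download_queue, and the limit is treated as a cap on finished downloads. `CrawlStateSketch` and its functions are hypothetical names.

```elixir
defmodule CrawlStateSketch do
  # Assumed flow, not confirmed by the repo:
  #   download_queue -> (download) -> parse_queue + download_finished
  #   parse_queue    -> (parse)    -> new links into download_queue
  def new(seed_urls) do
    %{
      download_queue: MapSet.new(seed_urls),
      parse_queue: MapSet.new(),
      download_finished: MapSet.new()
    }
  end

  # Record a finished download, unless `limit` downloads are already done
  # or the URL was never queued.
  def download(state, url, limit) do
    cond do
      MapSet.size(state.download_finished) >= limit ->
        {:limit_reached, state}

      not MapSet.member?(state.download_queue, url) ->
        {:not_queued, state}

      true ->
        {:ok,
         %{
           state
           | download_queue: MapSet.delete(state.download_queue, url),
             parse_queue: MapSet.put(state.parse_queue, url),
             download_finished: MapSet.put(state.download_finished, url)
         }}
    end
  end

  # Record a parsed page: drop it from parse_queue and queue any links
  # that have not been downloaded yet.
  def parsed(state, url, links) do
    fresh = links |> MapSet.new() |> MapSet.difference(state.download_finished)

    %{
      state
      | parse_queue: MapSet.delete(state.parse_queue, url),
        download_queue: MapSet.union(state.download_queue, fresh)
    }
  end
end
```

Using sets rather than lists means a link discovered on many pages is only queued once, and checking it against download_finished keeps the crawler from fetching the same page twice.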