Ansible autoscale (tower) alternatives
In my last post, I wanted to deploy a cluster of workers, but stopped after setting up autoscaling groups because I wanted a “nice” way to build them up. I could have plowed on with a userdata script built with the original master server IP, followed by using ansible-playbook to download and execute some plays. In fact, that seems awfully reasonable to me right now instead of going on this yak-shaving adventure. Unfortunately, there are times for push and there are times for pull, and this is a time for pull.
But let’s press on. We would still like a way of setting up clusters of servers that doesn’t depend on either baking IPs into a userdata script or moving that abstraction up to S3 and putting the IP there. Those solutions do work, but are sub-optimal for anything remotely complex with multiple masters or partitions. The ansible knockd comment pretty much says it all - “And at that point I might as well just use chef – this whole setup is to avoid running an open service and having to manage agents and certs.”
Ansible tower does support autoscaling. They do callbacks with a shared secret, with some userdata script that calls back to the server. The server then verifies that the node doing the callback is indeed in the correct ec2 autoscale group, etc. and kicks off the playbooks. But it’s not free and looks like it requires some GUI configuration.
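For reference, a Tower-style provisioning callback from userdata boils down to a single authenticated POST; the host, job template id, and host_config_key below are placeholders, not a real setup:

```shell
#!/bin/bash
# Hypothetical sketch of a Tower provisioning callback from userdata.
# TOWER_HOST, TEMPLATE_ID, and HOST_CONFIG_KEY are placeholder values.
TOWER_HOST="tower.example.com"
TEMPLATE_ID="42"
HOST_CONFIG_KEY="sharedsecret"
URL="https://${TOWER_HOST}/api/v1/job_templates/${TEMPLATE_ID}/callback/"
echo "POST host_config_key=${HOST_CONFIG_KEY} to ${URL}"
# Tower checks that the caller is in the template's inventory before
# launching the playbook; the actual request would be:
# curl --data "host_config_key=${HOST_CONFIG_KEY}" "$URL"
```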
Some must-haves regarding this:
- No messing around with GUI’s.
- The ability to lose the master’s state and rebuild it from config files.
- The ability to set up a temporary cluster and later tear it down without any headache.
- The “master” can be either temporary or rebuilt without any hassle.
There’s semaphore, which seems to be just a GUI for ansible. Even after some fixes, it requires a lot of manual intervention and doesn’t support autoscaling yet.
This type of stuff makes me want to go back to Chef. I could easily automate the creation of a temporary chef server, run my stuff and be done. In fact, that’s exactly what I’ve been doing for the past year for all my vagrant lxc hosts since I didn’t find chef-zero good enough at the time. Of course, what I wanted to avoid was the 30 minutes and 300 lines of code to bootstrap a chef-server.
But to continue on this adventure, we’ll need to set up a server on EC2 to execute commands on all the other nodes. So, taking inspiration from the knockd post, I’ve created a role that should take care of the autoscaling problem.
https://github.com/tjheeta/ansible_asg_master
It does the following during the bootstrap:
- ansible.yml - sets up ansible with .boto.cfg, .ansible.cfg, and the private key specified
- sync.yml - copies over roles, inventory, and playbooks from your local machine specified in the variables ansible_asg_local*.
- app.yml - sets up the callback application. There are some variables that need to be set such as port and hashes.
The callback application:
- responds over HTTP to /api/ec2/asg/$id.
- If the variable ansible_asg_app['hashes'] has been set up, it uses that to translate $id -> playbook. This is a bit of security through obscurity, but may help if you want to set up a public-facing server and don’t want to expose all the playbook names. If it’s not set, it will just use the $id as the playbook, which is the default.

~~~
ansible_asg_app:
  hashes:
    596524d707e18a17ee521f7710d628e3: worker.yml
~~~
- checks that the calling IP belongs to an instance in the region on your EC2 account. This is where the security happens.
- runs the ansible playbook specified on the calling ip
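The $id -> playbook translation can be sketched in a few lines of shell (illustrative only, not the role’s actual code):

```shell
#!/bin/bash
# Illustrative sketch of the $id -> playbook translation, not the role's
# actual code. With hashes configured, only known hashes resolve to a
# playbook; an unknown id is rejected. Without hashes, $id itself is the
# playbook name.
resolve_playbook() {
  local id="$1"
  case "$id" in
    # entries from ansible_asg_app['hashes']:
    596524d707e18a17ee521f7710d628e3) echo "worker.yml" ;;
    *) echo "" ;;  # unknown id -> reject the callback
  esac
}

resolve_playbook 596524d707e18a17ee521f7710d628e3  # prints worker.yml
```

After the playbook name is resolved and the caller’s IP is verified against EC2, the app runs the playbook against that IP.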
Setting up the ansible autoscale master
We can test this out with the example playbook in the repo. It will set up the ansible_asg_master, the launch configuration, and then the autoscale group. The newly created worker will then connect back to the ansible_asg_master on the private IP address, which will launch an ansible run. The userdata script specifies the playbook to run, in this case:
~~~
- name: Create autoscale group
  hosts: localhost
  connection: local
  gather_facts: False
  vars:
    # This will get set to run during the callback to the master server
    userdata_playbook: worker.yml
~~~
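What the templated userdata script boils down to is a callback from the new instance to the master; this is a hypothetical sketch, with the master IP, port, and playbook id as placeholders baked in at launch-configuration time:

```shell
#!/bin/bash
# Hypothetical sketch of the userdata callback; the role templates the
# real script. MASTER_IP, PORT, and PLAYBOOK_ID are placeholders.
MASTER_IP="172.31.50.19"  # private IP of the ansible_asg_master
PORT="9001"               # ansible_asg_app port
PLAYBOOK_ID="worker.yml"  # or a hash from ansible_asg_app['hashes']
URL="http://${MASTER_IP}:${PORT}/api/ec2/asg/${PLAYBOOK_ID}"
echo "calling back to ${URL}"
# Retry, since the instance may boot before the master is reachable:
# for i in $(seq 1 10); do curl -sf "$URL" && break; sleep 30; done
```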
And our worker.yml in example/playbooks is just this:
~~~
---
- hosts: all
  sudo: true
  tasks:
    - shell: touch /tmp/ran_the_playbook
~~~
Here are the steps to test it out; we’ll be using the playbook in the repo and uploading it to the ansible master on EC2:
Clone it
~~~
git clone https://github.com/tjheeta/ansible_asg_master /home/yourname/ansible_asg_master
~~~
Make sure you have boto installed and set up the config:
~~~
sudo apt-get install python-boto

cat ~/.boto
[Credentials]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
~~~
Modify example/playbooks/group_vars/all and set up the variables:
~~~
ansible_asg_local_roles_path: /home/yourname/ansible_asg_master/example/roles
ansible_asg_local_inventory_path: /home/yourname/ansible_asg_master/example/inventory
ansible_asg_local_playbook_path: /home/yourname/ansible_asg_master/example/playbooks
aws_ssh_private_key: /home/self/ec2_private_key.pem
aws_remote_user: ubuntu
aws_access_key_id: YOUR_ACCESS_KEY
aws_secret_access_key: YOUR_SECRET_KEY
aws_region: us-east-1
aws_ami_id: ami-f2c74d9a
aws_instance_type: t1.micro
aws_key_name: ec2_key
ansible_asg_app:
  port: 9001
#  hashes:
#    1234: crawler_worker.yml
~~~
Create the master and wait for it to complete. The autoscale group is created in a separate second step because of some inventory caching funkiness.
~~~
ansible-playbook -i example/inventory --tags main example/playbooks/create.yml

PLAY [Create AWS resources] ***************************************************

TASK: [Create security group] *************************************************
changed: [localhost -> 127.0.0.1]

TASK: [create main instance] **************************************************
changed: [localhost -> 127.0.0.1]

... a lot more output ...

PLAY RECAP ********************************************************************
54.172.231.21              : ok=31   changed=28   unreachable=0    failed=0
localhost                  : ok=8    changed=4    unreachable=0    failed=0
~~~
Create the autoscale group and launch configuration.
~~~
ansible-playbook -i example/inventory --tags autoscale_group example/playbooks/create.yml

PLAY [Create autoscale group] *************************************************

TASK: [setup the launch config for the autoscale group for workers with the user_data scripts] ***
changed: [localhost]

TASK: [setup the autoscaling group] *******************************************
changed: [localhost]

PLAY RECAP ********************************************************************
localhost                  : ok=2    changed=2    unreachable=0    failed=0
~~~
Get the IP for the worker
~~~
./example/inventory/ec2.py --refresh-cache | grep -A2 tag_aws_autoscaling_groupName_worker_asg
"tag_aws_autoscaling_groupName_worker_asg": [
  "54.172.47.201"
],
~~~
Verify that the job ran correctly
~~~
ssh -i /home/self/ec2_private_key.pem ubuntu@54.172.47.201 "ls -l /tmp/ran_the_playbook"
-rw-r--r-- 1 root root 0 Nov 25 09:10 /tmp/ran_the_playbook
~~~
Log into the master and see if all the workers are available.
~~~
ubuntu@ip-172-31-50-19:~$ sudo su - ansible
ansible@ip-172-31-50-19:~$ export ANSIBLE_HOST_KEY_CHECKING=False
ansible@ip-172-31-50-19:~$ ansible -i inventory/ -m ping tag_aws_autoscaling_groupName_worker_asg
54.165.76.124 | success >> {
    "changed": false,
    "ping": "pong"
}
~~~
Success. Of course, if you don’t have success, there are log files on the ansible master in /var/log/ansible_asg/
Now that we have a master, we can do other things, such as run ansible directly on the master. We can set up cron jobs on our workers to periodically poll for any updates in the cluster. We can do our autoscaling. And another yak has been shaved.
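For instance, a worker could re-apply its playbook on a schedule with ansible-pull; the cron entry below is a sketch, and the repo URL, checkout directory, and log path are all assumptions:

```shell
#!/bin/bash
# Sketch of a cron entry for periodic pull-based updates. The repo URL,
# checkout directory, and log path are illustrative assumptions.
CRON_LINE='*/15 * * * * root ansible-pull -U https://github.com/tjheeta/ansible_asg_master -d /var/lib/ansible/local worker.yml >> /var/log/ansible-pull.log 2>&1'
echo "$CRON_LINE"
# To install it on a worker, you might write this line to a file like
# /etc/cron.d/ansible-pull.
```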
tl;dr - https://github.com/tjheeta/ansible_asg_master - sets up an ansible master and allows autoscaling. Examples are in the repo.