15 Apr 2015

Ansible vs Chef

I wrote an earlier post about evaluating Ansible as an alternative to Chef. After spending many years with Chef, I’ve found that Ansible is a lot easier to manage at startups. It’s easier to train developers, it’s easier to manage inventory and orchestration, and it works reasonably well on the scale of thousands of hosts. And let’s face it, if you have more than that, you’ll have to start partitioning. If you’re using Puppet, you’ll have multiple puppet masters, and I’m assuming the same is true of Salt. For Chef, there’s the enterprise offering, which gets pretty expensive, and then you’ll have to deal with a slow search, so invariably there’s some usage of chef-solo going on and some self-configuration. But for most use cases, Ansible is just fine: it is quick to set up and ridiculously easy to maintain. This is a fairly advanced post comparing Chef to Ansible and the patterns that are available.

This is such a big post it deserves a table of contents.

The Basics
Chef vs Ansible How To / Cheatsheet / Guide
Conclusion

The Basics

Workflow management

With Chef, this is a headache since the source of truth is the Chef Server, not git. There are a couple of ways to manage this:

A branch for development, staging, and production which gets uploaded to different Chef organizations.
Changing the cookbook version on each commit and setting the versions in the environment file (gets hairy for hotfixes, etc)
Developing your own custom release logic that modifies the environments.

With Ansible, what you see on disk is what you’re deploying. Everything is in source control. You can hotfix one branch, merge it back to develop, without any of the other headaches.

Dependency management

With Chef, the best practice is to use something like Berkshelf, which will download and manage all the community cookbooks. It works most of the time, except when it doesn’t, in which case life becomes very painful. The Chef Development Kit (ChefDK) is required. The “Berkshelf Way” involves having a git repo for every cookbook and managing the versioning through the Berksfile.lock. If you read through that link, there are a lot of moving parts.

With Ansible, there’s ansible-galaxy for community roles, but I prefer using git submodules. For instance, let’s say we want to include the ever popular rvm role:

git submodule add git@github.com:rvm/rvm1-ansible.git rvm1-ansible
git submodule update --init

This locks in the module at the commit that you want and there are no extra dependencies. Everything is in git and we’re golden. Ansible doesn’t have the dependencies of dependencies of dependencies in community cookbooks that Chef has. Yet, at least.

Maintenance

Chef has both a server and client component. These will have to be periodically upgraded. There’s also the cost of maintaining a chef server or paying for hosted chef.

Ansible wins this one.

Speed

There’s a post here comparing Ansible vs Salt for speed. It’s an interesting post, because there are a few types of speed: speed of development and speed of deploy. I’ve never used Salt, and certainly ZeroMQ with an agent is going to have a faster startup connection than SSH. I’m going to take his word on the execution speed as well, since I’m not going to deploy a bunch of hosts to test it. But since this is an Ansible vs. Chef post, I’d have to say Ansible is faster to both deploy and develop in. Hands down.

Fact caching

Before every run, Ansible connects to the hosts in question and gathers information about them. Usually in server based architectures, the servers will store all this information and provide information about other hosts, groups, roles. Ansible does this before the execution of the playbook. To speed this up, you can start caching the facts in redis. With all the facts available, the execution becomes faster than server based searches. For instance, chef-client needs to connect to the server, do the search, then get and parse the results. For ansible, the results are already all there. I’ll explain this magic later when I talk about search.

Scalability

SSH vs Agent or Push vs Pull

SSH connections can be optimized using ControlPersist and Pipelining. It’s certainly as fast as firing up chef-client and having it pull down all the cookbooks.

The problems come when you can’t reach a host. In a pull based system, the host will retry, so essentially, you’ll have to do the same thing with ansible. Any host that fails in an ansible-playbook will be listed in failed_playbook.retry and ansible-playbook can be called on only those hosts on a secondary run. You can put the ansible-playbook in cron on a central server, thereby mimicking the behaviour of a local agent run. You can use Ansible Tower or some homegrown alternatives to deal with autoscaling, where new hosts connect back and trigger ansible runs. There are a few options to make it equivalent to a centralized server.

A major advantage of not having an agent is that there is no maintenance cost of upgrading. Going from Chef 0.7 to 0.9 to 10 to 11 to 12, and dealing with the changes in syntax, etc. is really a lot of extra work.

And regardless, in a pull based system, deploys still have to be triggered, right?

Raw numbers

Just some rough numbers here. Each ssh connection takes about 2 MB resident. The python process itself that generates the command is around 20 MB resident. Let’s round it up to 30 MB per fork. If you want to have 1000 forks, that will cost about 30 GB of memory. It’s not Erlang, that’s for sure, but back to reality, do you have 10000 machines that you want to deploy to all at the same time? The bottom line is that ansible is heavy on memory usage, so you’re going to need to add more memory or you’re going to need to wait.
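To make the arithmetic above easy to replay with your own numbers, here is a small Python sketch. The 30 MB per fork default is just the rounded-up estimate from this post, not a measured constant, and the function name is made up for illustration.

```python
def fork_memory_gb(forks, mb_per_fork=30):
    """Rough resident memory, in decimal GB, for a given number of Ansible forks.

    mb_per_fork defaults to the rounded-up estimate from this post
    (about 2 MB for the ssh connection plus about 20 MB for the python process).
    """
    return forks * mb_per_fork / 1000


# Print the estimate for a few fork counts.
for forks in (50, 250, 1000):
    print(f"{forks:>5} forks -> about {fork_memory_gb(forks):.1f} GB")
```

Plug in whatever per-fork figure you actually observe on your control host; the point is only that memory grows linearly with the forks setting.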

For more speed:

Partition the deployment over multiple masters by using redis fact caching and the limit parameter.
Create a smaller changeset by using tags.

Ansible also has the concept of rolling updates.

Regardless, with Chef, Puppet, Salt, etc., you’re going to have to partition at some point. It’s a question of when, and then there’s the tradeoff of having to maintain the master. Pick your poison.

Chef vs Ansible How To / Cheatsheet / Guide

This section describes how to use Ansible for those coming from a Chef background. Since Ansible also does orchestration, some concepts aren’t in Chef.

Search

If there’s no central server, how does one find other hosts like a database, other members of a cluster, the slaves, etc? I didn’t quite understand how much easier it is not to have a search. Earlier I mentioned that ansible compiles and stores all the relevant information before the start of the playbook. During the run, it provides several magic variables:

groups - provides the list of all the groups that are set up and the hosts in each group. Functionally equivalent to chef roles.
group_names - on a particular host, this variable will tell which groups the host belongs to.
hostvars - information about every host is in this hash

Let’s say we’ve defined the group name in the variable mysql_master_group. If the app server needs to know where the database master is, the template would look like the following:

staging:
  host: {{ hostvars[groups[mysql_master_group][0]]['ansible_eth0']['ipv4']['address'] }}

groups[mysql_master_group] - this returns an array of hosts belonging to mysql_master_group which could be ['db1','db2']
groups[mysql_master_group][0] - this returns the first item - which would return 'db1'
hostvars[groups[mysql_master_group][0]] - this contains all the information gathered about the host.
hostvars[groups[mysql_master_group][0]]['ansible_eth0']['ipv4']['address'] - this returns the ip address to place in the template.

Or if you want to list a bunch of hosts in a template:

{% for host in groups[mysql_master_group] %}
   {{ hostvars[host]['ansible_eth0']['ipv4']['address'] }}
{% endfor %}

All this takes place within the Jinja2 templating. The playbook doesn’t need any code or logic in it. Chef requires the search be part of the recipe, which then calls the template with the correct variables. Ansible skips straight to the template. No search required, and even better, there is no eventual consistency.
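If the chain of lookups is hard to picture, the following Python sketch mimics it with plain dictionaries. The host names, group names, and addresses are made up for illustration; in a real run Ansible builds groups and hostvars from the inventory and the gathered facts.

```python
# Hypothetical stand-ins for the data Ansible exposes to templates.
groups = {"mysql_master": ["db1", "db2"], "app": ["app1"]}
hostvars = {
    "db1": {"ansible_eth0": {"ipv4": {"address": "10.0.3.172"}}},
    "db2": {"ansible_eth0": {"ipv4": {"address": "10.0.3.173"}}},
    "app1": {"ansible_eth0": {"ipv4": {"address": "10.0.3.10"}}},
}
mysql_master_group = "mysql_master"


def first_host_ip(group_name, interface="ansible_eth0"):
    """Same walk as hostvars[groups[group][0]][interface]['ipv4']['address']."""
    first_host = groups[group_name][0]
    return hostvars[first_host][interface]["ipv4"]["address"]


def group_ips(group_name, interface="ansible_eth0"):
    """Same idea as the for loop in the template: one address per host in the group."""
    return [hostvars[host][interface]["ipv4"]["address"] for host in groups[group_name]]


print(first_host_ip(mysql_master_group))
print(group_ips(mysql_master_group))
```

It is nothing more than nested dictionary access, which is why it can live entirely in the template.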

Inventory

That’s all well and good, but you still need to know what hosts are in what group.

In Chef, the nodes are registered with the server and as such, they will be part of the search. Well, not quite, they will be available eventually, and only if the run succeeds. This behaviour ends up giving ordering issues if one node is up before the other or perhaps other problems if a run failed transiently.

With ansible, instead of nodes being registered with the chef server, ansible uses a variety of inventory scripts against ec2, rackspace, etc. to come up with a list of hosts and groups. For EC2, the group information will come from the tags, with Openstack it could come from the metadata. There are no hard and fast rules as you control the behaviour with the scripts. The directory that ansible uses for inventory can contain multiple scripts. In fact, you can use Chef’s inventory in ansible if for some reason you ever wanted to convert to ansible.
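For a feel of what an inventory script does, here is a minimal Python sketch. It assumes the usual dynamic inventory contract (the script is called with --list and prints JSON containing groups and an optional _meta.hostvars section). The instance list is hard-coded and hypothetical; a real script would query EC2 tags, Openstack metadata, or whatever your source of truth is.

```python
#!/usr/bin/env python
import json
import sys

# Hypothetical source data standing in for a cloud API response.
INSTANCES = [
    {"name": "db1", "address": "10.0.3.172", "tags": {"group": "db"}},
    {"name": "app1", "address": "10.0.3.10", "tags": {"group": "app"}},
    {"name": "app2", "address": "10.0.3.11", "tags": {"group": "app"}},
]


def build_inventory(instances):
    """Group hosts by their 'group' tag and attach per-host variables."""
    inventory = {"_meta": {"hostvars": {}}}
    for inst in instances:
        group = inst["tags"]["group"]
        inventory.setdefault(group, {"hosts": []})["hosts"].append(inst["name"])
        # ansible_ssh_host is the connection variable name from the Ansible 1.x era.
        inventory["_meta"]["hostvars"][inst["name"]] = {
            "ansible_ssh_host": inst["address"]
        }
    return inventory


if __name__ == "__main__":
    if len(sys.argv) > 1 and sys.argv[1] == "--host":
        # Host variables are already provided through _meta, so return nothing extra.
        print(json.dumps({}))
    else:
        print(json.dumps(build_inventory(INSTANCES), indent=2))
```

Since the grouping logic is just code you own, changing where group membership comes from is a matter of changing build_inventory.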

Orchestration and add_host

So with Chef, you have probably written an orchestration layer using Fog or something like that. But ansible comes with an orchestration layer, which makes things very convenient.

So let’s say, for argument’s sake, that you have a program that configures an entire environment for you. It creates the hosts in EC2 in the correct order, bootstraps chef, and then runs it with the correct roles/recipes. With ansible, although we have dynamic inventory, the inventory list and the facts about the inventory are only gathered at the beginning of the run. Ansible deals with this issue with a combination of register and add_host when the hosts are first created. I talked about ansible and ec2 earlier. This looks like the following snippet of code:

- name: create instances
  hosts: localhost
  tasks:
    - ec2:
        key_name: "{{ aws_key_name }}"
        region: "{{ aws_region }}"
        group: [ "default", "app_fw" ]
        instance_type: "{{ aws_instance_type }}"
        instance_tags:
          group: app
        count_tag: 
          group: app
        exact_count: 5
        image: "{{ aws_ami_id }}"
        wait: yes
      register: ec2_app_hosts

    - add_host: hostname={{ item.public_ip }} groupname=app
      with_items: "{{ ec2_app_hosts.tagged_instances }}"

- name: configure db
  hosts: db
  roles: 
    - db
    - { role: app, deploy_only: true }
  post_tasks:
    - shell: /opt/app/migrate_db

- name: configure app
  hosts: app
  roles: app

The first task creates 5 instances of an app server with the group app on ec2. The results of creation, i.e. the information about the hosts, will be registered in the ec2_app_hosts variable. This is then used in the next task, add_host, which adds all those IPs to the app group for later use in the playbook. The ec2 inventory script will show these on the next run of ansible, but on the first run, we can still get the same orchestration effects. Also, the count_tag and exact_count need to be specified, otherwise we could end up with a lot of extra hosts. Essentially, we’ve gotten a free orchestration layer without doing any work.

Imagine wanting to do a database migration before a deployment. We’d want to run the migration on a single host and then deploy code everywhere else. In Chef, if chef-client is running periodically, we’d have to write a migration guard that makes sure that the database has run all the migrations before the deployment of code and leave the chef-client in a wait state or exited. In ansible, we save a lot of time and just run the migration on one host and then do the deployment.

Orchestration. Once you got it, you can’t live without it.

Conditionals

Ansible doesn’t have the full power of the Ruby language to use inside the playbooks, and complex conditionals take two steps. For instance, if you want to check if a file exists, you’ll have to register the result of a task in a variable and then use when.

- stat: path=/tmp/thefile get_md5=no get_checksum=no
  register: st
- debug: var=st
- shell: touch /tmp/thefile
  when: not st.stat.exists

Instead of assigning something to a variable, you’ll have to use set_fact to make a fact available in later tasks if you want to do something conditionally.

- set_fact: server_memory={{ (ansible_memtotal_mb * 0.65 / 100) | round }}
  when: ansible_virtualization_type != "lxc"
- set_fact: server_memory=256
  when: ansible_virtualization_type == "lxc"

But you aren’t limited to just the modules provided by ansible; you can run anything via the shell, register the output, and then use that output in a when clause. My favorite example of this is rebooting and continuing to deploy. Chef does this via eventual consistency; ansible has a module called wait_for which, combined with delegate_to, allows this to happen immediately.

- name: Check for reboot hint.
  shell: if [ $(readlink -f /vmlinuz) != /boot/vmlinuz-$(uname -r) ]; then echo 'reboot'; else echo 'no'; fi
  ignore_errors: true
  register: reboot_hint

- name: Rebooting ...
  command: shutdown -r now "Ansible kernel update applied"
  async: 0
  poll: 0
  ignore_errors: true
  when: reboot_hint.stdout.find("reboot") != -1

- name: Wait for thing to reboot...
  wait_for: host="{{ ansible_default_ipv4.address }}" port=22
  delegate_to: localhost

- name: Do something else
  shell: touch /tmp/newfile

The conditional syntax is a bit wordier, but it gets the job done.

Storing sensitive data

Chef has encrypted data bags. This means the encryption key exists on all the clients and the encrypted data bags need to be uploaded to the chef server. How do you encrypt it locally? As far as I know, you don’t without writing yourself a plugin that calls Chef::EncryptedDataBagItem.new and Chef::EncryptedDataBagItem.encrypt_data_bag_item(edited.to_hash, secret). The alternative to this is editing the data bag on the server with knife data bag edit and then downloading it locally with knife data bag show -Fj without the secret key set and then commit that to git.

Ansible has the command called ansible-vault. You edit the file locally. You save it in source control. Done.

Variables

In Chef, it’s sometimes very difficult to track down where and how a variable was set or overridden. You have to make rules for your team as to where things are set or not because otherwise, bad things can happen. Chef actually changed the precedence order from 10 to 11 and we had to rewrite a lot of our code to account for it.

Ansible errs on the side of simplicity. There are only a few logical places to load variables from:

role/defaults/main.yml
group_vars/all/main.yml or any other file in the directory
group_vars/the_group_name directory
inside the play itself by using include_vars
the global playbook can load variables

I believe that is the correct precedence order, but I usually just set variables in the first three. A concrete example will be in the next section.

- name: some example
  hosts: localhost
  vars_files:
    - private/ssh_key.yml
    - private/nova.yml
  vars:
    somevar: val1
    somevar2: val2
    somelist:
      - test1
      - test2
    somedict:
      key1: val1
      key2: val2

Because variables are pretty much always replaced, each variable should be prefixed by the role that it belongs to so you don’t have to worry about the scope. So if you’re creating a role for an app server in the role myapp, you wouldn’t use the variable memory, but myapp_memory. This ensures that there are no collisions across different roles.
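To see why the prefix matters, here is a deliberately simplified Python model of a flat variable namespace where a later definition replaces an earlier one. The role names and values are hypothetical, and this is not how Ansible resolves precedence internally; it only illustrates the collision.

```python
def flatten_role_vars(*role_vars):
    """Merge per-role variable dicts into one flat namespace; later dicts overwrite earlier ones."""
    flat = {}
    for variables in role_vars:
        flat.update(variables)
    return flat


# Two roles both use the bare name "memory": one value silently disappears.
unprefixed = flatten_role_vars({"memory": 512}, {"memory": 2048})

# With role prefixes, both values survive side by side.
prefixed = flatten_role_vars({"myapp_memory": 512}, {"mysql_memory": 2048})

print(unprefixed)
print(prefixed)
```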

Note that dictionaries (hashes) are different from lists (arrays), and ansible supports a different iterator for each: with_dict and with_items. More on that later.

Environments, otherwise known as group_vars

The next major item a configuration management system needs to do is have the ability to override variables per environment. Development != Staging != Production.

Let’s talk about real-world examples, like credentials and authorizations. They will be different in each environment and they need to be encrypted. And generally speaking, we want a set of generic credentials that are overridden at the environment/group level.

In Chef, we have the encrypted data bags and we’ll need to merge these from generic to the environment. We’ll need to write a library function that does this merging and it will go something like this:

def self.fetch_merged_databag(dbag_name)
    generic = Chef::EncryptedDataBagItem.load(dbag_name, 'generic').to_hash
    env_dbag = nil
    begin
      env_dbag = Chef::EncryptedDataBagItem.load(dbag_name, $node.chef_environment).to_hash
    rescue
      env_dbag = generic
    end
    dbag_info = Chef::Mixin::DeepMerge.merge(generic, env_dbag)
    JSON.parse(dbag_info.to_hash.to_json)
end

In ansible, to get the same behaviour, we set hash_behaviour = merge in ansible.cfg. The variables can be overridden in the group_vars directory, and the variables set and loaded in all are overridden in the specific environment group. The only time there may be confusion is if a host is part of two groups, let’s say production and app, and the variable is set in both places.

group_vars/all/app.yml
group_vars/all/creds.yml
group_vars/staging/creds.yml
group_vars/staging/something.yml
group_vars/production/creds.yml
group_vars/production/something.yml
group_vars/app/creds.yml

Essentially, we’re getting the same behaviour for free. The only difference is that we have to model our data in a way that works with the iterators that are provided. I’ve put this in another post.
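As an illustration of what hash_behaviour = merge buys you compared with the default (replace), here is a Python sketch of a recursive dictionary merge. The variable names and values are hypothetical, and the function is a stand-in for the idea, not Ansible’s own implementation.

```python
import copy


def merge_hashes(base, override):
    """Recursively merge override into a copy of base; non-dict values are replaced."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_hashes(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical contents of group_vars/all/creds.yml and group_vars/staging/creds.yml
all_creds = {"myapp_db": {"user": "app", "password": "generic", "port": 3306}}
staging_creds = {"myapp_db": {"password": "staging-secret"}}

# hash_behaviour = merge: staging only overrides the keys it sets.
merged = merge_hashes(all_creds, staging_creds)

# Default behaviour (replace): the whole myapp_db hash is swapped out.
replaced = {**all_creds, **staging_creds}

print(merged)
print(replaced)
```

With replace, the staging file would have to repeat user and port; with merge, it only carries what differs, which is the same effect as the Chef library function above.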

Ansible patterns

I’ve briefly mentioned delegate_to in an earlier example. Essentially, the task that is running on the present host can be delegated to run on a completely different host. The example used in the documentation is adding and removing hosts from a load balancer.

For another common example, let’s say that you have a bunch of roles and need to specify backups and monitoring. Where should those be? The cleanest place for them to be would be in the role itself, so when it is deployed, it is responsible for ensuring all of that exists.

In Chef, this gets very complicated to do, you have to set some variables on the chef server, which then get read by the monitoring node on the next run, and the checks get instantiated.

With ansible, it is borderline trivial. Let’s go through this for sensu:

The check needs to be registered on sensu server.
The sensu api possibly needs to reload to know about the check.
The client.json on the monitored node needs to be subscribing to the check.

Delegate to

First, let’s say we want to define a check in the mysql role, and have it written to all the sensu masters. It is trivially done below with each sensu node, as we’re using with_items: groups[sensu_monitoring_master_group].

- template: 
    src: checks/mysql_connections.json.j2 
    dest: /etc/sensu/conf.d/checks/mysql_connections.json
  delegate_to: "{{ item }}"
  with_items: "{{ groups[sensu_monitoring_master_group] }}"
  notify: restart_sensu_api

Note that there is a notify statement for the sensu api to restart if the template has been created/changed. Now this would get really ugly if we had to specify the restart handler in every role. Luckily, this is handled by dependencies.

Dependencies

Like Chef, where cookbooks depend on other cookbooks, roles can depend on other roles. Of course, an ansible role is more like an LWRP per se, and in ansible, everything is an LWRP, but that’s for the next section. For the handler to be present in the deployment of the mysql role, we include it in the meta.

cat mysql/meta/main.yml 
---
dependencies:
  - { role: sensu, sensu_client: true }

And we can define restart_sensu_api in the sensu role handlers as:

- name: restart_sensu_api
  shell: /etc/init.d/sensu-api reload
  delegate_to: "{{ item }}"
  with_items: "{{ groups[sensu_monitoring_master_group] }}"

So the handler gets called on the database node, but executes a reload on the configuration on only the sensu nodes.

That is some magic.

But there is one more step in configuring sensu. The check is specified and setup on the sensu server, but the database node now needs to subscribe to that check and possibly merge the configuration with any other checks. It’s not as simple as creating a subscription file in conf.d and moving on. The json for the client looks like:

{
  "client": {
     "keepalive": {
        "thresholds": {
            "critical": 300, 
            "warning": 300
         }
     }, 
     "subscriptions": [
        "all", 
        "common_community_plugins", 
        "clamav", 
        "db_master", 
        "vagrant", 
        "db1", 
        "mysql"
     ], 
     "name": "db1", 
     "address": "10.0.3.172"
  }
}

Chef certainly would excel in this case, but you can do this in ansible, too, with an extra step. But first an interlude to bring you the equivalent of LWRP’s.

LWRP - Lightweight Resource Providers

In Chef, LWRP’s are a best practice. It’s essentially parameterizing multiple tasks into one. For instance, there is a runit cookbook which has a runit_service LWRP which creates the run files, sets up the log directories, and links them all together. For the community runit cookbook, this takes about 800 lines, but a barebones implementation would probably be about 200 or so.

In ansible, task files and roles are both LWRP. So here’s a role that installs runit (https://github.com/tjheeta/ansible-runit-role) and has a task called scaffold which is essentially meant to be used as an LWRP:

- name: Create runit dir {{ runit_scaffold_name }}
  file: dest={{ runit_sv_dir }}/{{ runit_scaffold_name }}/log state=directory mode=0755 recurse=true
  tags: runit_scaffold

- name: Create runit log destination dir for {{ runit_scaffold_name }}
  file: dest=/var/log/{{ runit_scaffold_name }} state=directory mode=0755 recurse=true owner={{ runit_scaffold_owner }}
  tags: runit_scaffold

- name: Create runit log run file {{ runit_scaffold_name }}
  copy: dest={{ runit_sv_dir }}/{{ runit_scaffold_name }}/log/run mode=0755 content="#!/bin/sh\nexec svlogd -tt /var/log/{{ runit_scaffold_name }}\n"
  tags: runit_scaffold

# https://groups.google.com/forum/#!topic/ansible-project/Lhr0mqm91TQ
# No notify logic attached to include, so have to manually check if changed
- name: Create runit run file {{ runit_scaffold_name }}
  copy:
    dest: "{{ runit_sv_dir }}/{{ runit_scaffold_name }}/run"
    mode: 0755
    content: "{{ runit_scaffold_run_content }}"
  register: runit_scaffold_changed
  tags: runit_scaffold

- name: Link runit {{ runit_scaffold_name }} to /etc/service/
  file: src={{ runit_sv_dir }}/{{ runit_scaffold_name }} dest=/etc/service/{{ runit_scaffold_name }} state=link
  tags: runit_scaffold

And we can set up a new runit service by passing in arguments to the include statement:

- include: ../../runit/tasks/scaffold.yml
  notify: restart_your_service
  vars:    
    runit_scaffold_name: yourservice
    runit_scaffold_owner: yourowner
    runit_scaffold_run_content: |
      #!/bin/sh
      exec 2>&1
      ulimit -n 8192
      exec chpst -u {{ runit_scaffold_owner }} /usr/local/bin/yourservice

# Note that notify on include doesn't work yet
# https://groups.google.com/forum/#!topic/ansible-project/Lhr0mqm91TQ
- name: Restart if changed
  shell: echo "Restarting..."
  notify: restart_your_service
  when: runit_scaffold_changed.changed == true

In ansible v2, the ability to use include and with_items together will be back making this even better.

So back to our sensu example, we can create a file in our sensu role called subscribe.yml:

- name: Create the subscription
  file: name=/etc/sensu/ansible_state/subscriptions/{{ subscription }} state=touch
  notify: rebuild_sensu_client_configuration

add an additional handler which executes a script to create client.json with all the subscriptions in /etc/sensu/ansible_state/subscriptions:

- name: rebuild_sensu_client_configuration
  shell: |
      cd /etc/sensu/ansible_state
      ./merge_subscriptions.py ./client.json subscriptions /etc/sensu/conf.d/client.json
      /etc/init.d/sensu-client restart

and call this in our mysql role:

- include: ../../sensu/tasks/subscribe.yml
  vars:
    subscription: mysql

Pretty simple and easy to do. Doing the same in chef takes a few hours at least.
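The merge_subscriptions.py script itself isn’t shown in this post, so the following is only a hypothetical Python sketch of what such a script could look like. It assumes the three arguments from the handler above (a base client.json, the directory of touched subscription files, and the output path) and that each file name in the directory is a subscription name.

```python
#!/usr/bin/env python
import json
import os
import sys


def merge_subscriptions(base_config, subscription_names):
    """Return a copy of the client config with the extra subscriptions added."""
    merged = json.loads(json.dumps(base_config))  # cheap deep copy of JSON data
    client = merged.setdefault("client", {})
    existing = client.get("subscriptions", [])
    client["subscriptions"] = sorted(set(existing) | set(subscription_names))
    return merged


def main(base_path, subscriptions_dir, output_path):
    """Read the base config, add one subscription per file name, write the result."""
    with open(base_path) as handle:
        base_config = json.load(handle)
    names = os.listdir(subscriptions_dir)
    with open(output_path, "w") as handle:
        json.dump(merge_subscriptions(base_config, names), handle, indent=2)


if __name__ == "__main__" and len(sys.argv) == 4:
    main(*sys.argv[1:4])
```

Because the state is just a directory of empty files, any role can add a subscription with a single touch, and the handler rebuilds client.json from scratch each time.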

Note that roles can also be parameterized in both playbooks and dependencies:

- name: setup solr master
  hosts: solr_master
  roles:
    - { role: solr,
              solr_enable_master: true,
              solr_enable_slave: false
      }

- name: setup solr slave
  hosts: solr_slave
  roles:
    - { role: solr,
              solr_enable_master: false,
              solr_enable_slave: true,
              solr_master_group: solr_master
      }

Wait for

The action wait_for is specific to Ansible. Earlier, I went through an example where a box was conditionally rebooted by using delegate_to and wait_for. This task greatly simplifies configuration management when you want to make sure something is up before executing the next task. See the wait_for module documentation for the available options.

There are also retry loops to check if something is available:

- name: Wait until app is available
  shell: curl --head --silent http://localhost:8080/
  register: result
  until: result.stdout.find("200 OK") != -1
  retries: 12
  delay: 5

Validate

This is available for both the lineinfile and template modules. It ensures that you don’t accidentally nuke a set of boxes or configuration. There is no equivalent of this in Chef, but I’ve written some libraries to do so.

# Validate the sudoers file before saving
- lineinfile: dest=/etc/sudoers state=present regexp='^%ADMIN ALL\=' line='%ADMIN ALL=(ALL) NOPASSWD:ALL' validate='visudo -cf %s'

# Copy a new "sudoers" file into place, after passing validation with visudo
- template: src=/mine/sudoers dest=/etc/sudoers validate='visudo -cf %s'

There are a bunch of other examples from Jan-Piet

Run once

This defines a singleton method that can be used for database migrations if you don’t want to put it in the orchestration layer:

- command: /opt/application/upgrade_db.py
  run_once: true
  delegate_to: web01.example.org

Lookups

Lookups are an Ansible-only feature that pulls in data from outside sources. This won’t be used very often, but it’s handy if you need to look up dns, etc. It is also a quicker way to get a value than registering a variable.

For instance, using an environment variable:

- debug: msg="{{ lookup('env','HOME') }} is an environment variable"

Setting a fact based on a template:

- set_fact: 
    some_template_content: "{{ lookup('template', '../templates/template.j2') }}"

Or even a faster way of getting content than using shell and register:

- debug: msg="{{ lookup('pipe','date') }} is the raw result of running this command"

Conclusion

So far, I haven’t found anything possible in Chef that isn’t in Ansible. Of course, if you have found something that can’t be done in Ansible, I’d like to know what it is. It might not be perfect for every use case, but it’s sooo easy to use. The migration path is fairly straightforward. The wins over chef are:

orchestration
maintenance costs
speed of development/deployment
workflow
simplicity.

It’s tough to beat.

tl;dr - It took me 12 hours to write this epic. Please go read it.
