Chef Cookbook dependency management and the environment cookbook pattern / by Matt Wrock

Last week I discussed how we at CenturyLink Cloud are approaching the versioning of cookbooks, environments and data bags, focusing on our strategy of generating version numbers from git commit counts. In this post I’d like to explore one of the concrete values these version numbers provide in your build pipeline. We’ll explore how these versions can eliminate surprise breakages that occur when the cookbook dependencies in production differ from the ones you used in your tests. You are testing your cookbooks, right?

A cookbook is its own code plus the sum of its dependencies

My build pipeline must be able to guarantee, to the furthest extent possible, that my build conditions match the conditions of production. So if I test a cookbook with a certain set of dependencies and the tests pass, I want a high level of confidence that this same cookbook will converge successfully in production environments. However, if I promote the cookbook to production and my production environment has different versions of the dependent cookbooks, this confidence is lost because my tests accounted for different circumstances.

Even if these different versions are just at the patch level (the third number in the semantic versioning schema), I still consider this an important difference. I know that only changes to the major version number should include breaking changes, but let’s just see a show of hands from those who have deployed a bug fix patch that regressed elsewhere…I thought so. You can put your hands down now, everyone. Regardless, it can be quite easy for even major version changes to slip in if your build process does not account for this.

Common dependency breakage scenarios

We will assume that you are using Berkshelf to manage dependencies and that you keep your Berksfile.lock files synced and version controlled. Let’s explore why this is not enough to protect against dependency change creep.

Relaxed version constraints

Even if you have rigorously ensured that your own metadata.rb files include explicit version constraints, there is no guarantee that every downstream community cookbook has specified constraints in its metadata.rb file. You might think this is exactly where the Berksfile.lock saves you. A Berksfile.lock snaps the entire dependency graph to the exact versions of the last sync. If this lock file is in source control, you can be assured that fellow teammates and your build system are testing the top level cookbook with all the same dependency versions. So far so good.
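For illustration, explicit constraints in a top level cookbook’s metadata.rb might look like this (the cookbook names and versions here are invented):

name    'my_web_app'
version '1.0.42'

# explicit constraints on the direct dependencies
depends 'apt', '= 2.6.0'
depends 'nginx', '= 2.7.4'

Note that nginx’s own metadata.rb may still declare its dependencies with no constraints at all, and nothing in your file can prevent that.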

Now you upload the cookbooks to the chef server, and when it’s time for a node to converge against your cookbook changes, where is your Berksfile.lock now? Unless you have something in place to account for this, chef-client is simply going to get the highest available version of any cookbook dependency without constraints. If anyone at any time uploads a higher version of a community cookbook that other cookbooks had been tested against at lower versions, a break can easily occur.

Dependency islands in the same environment

This is related to the scenario just described and explains how the cookbook dependencies uploaded from one successful cookbook test run can break sibling cookbooks in the same environment.

A Berksfile.lock captures the dependency graph resolved by the very first berks install. Therefore you can create two cookbooks that have all of the same cookbook dependencies, but if you generate their Berksfile.lock files even hours apart, there is a perfectly reasonable possibility that the two cookbooks will lock different versions of those dependencies.

Once both sets are uploaded to the chef server, and unless an explicit version constraint is specified, the highest eligible version wins, and this may be a different version than the one some of the cookbooks using this dependency were tested against. So now you can only hope that everything works when nodes start converging.
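To make this concrete, imagine cookbooks A and B both depend on apt with no constraint, and their lock files were generated a day apart (the excerpts below sketch the Berkshelf 3 lock format; the versions are invented):

cookbook_a/Berksfile.lock (generated Monday):

GRAPH
  apt (2.5.0)

cookbook_b/Berksfile.lock (generated Tuesday, after apt 2.6.0 shipped):

GRAPH
  apt (2.6.0)

After both sets are uploaded, an unconstrained node will converge cookbook A with apt 2.6.0, a version A was never tested against.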

Poorly timed application of cookbook constraints

You may be thinking the obvious remedy to all of these dependency issues is to add cookbook constraints to your environment files. I think you are definitely on the right track. This eliminates the mystery version creep scenarios: you can look at your environment file and know exactly which versions of which cookbooks will be used. However, in order for this to work, it has to be carefully managed. If I promote a cookbook version along with all of its dependencies by updating the version constraints in my environment, can I be guaranteed that all other cookbooks in the environment sharing those downstream dependencies have been tested with the new versions?

I do believe that constraining environment versions is important. These constraints can serve the same function as your Berksfile.lock file within any chef server environment. However, unless the entire matrix of constraints matches up against what has been tested, these constraints provide inadequate safety.

Safety guidelines for cookbook constraints in an environment

A constraint for every cookbook

It is not only your internal top level cookbooks that should be constrained; all dependent cookbooks should include constraints as well. Any cookbook missing a constraint opens the possibility that an untested dependency graph will be introduced.

All constraints point to exact versions (no pessimistic constraints)

I do think pessimistic constraints are fine in your Berksfile or metadata.rb files. This basically says you are ok with upgrading dependencies within dev/test scenarios, but once you need to establish a baseline of a known set of good cookbooks, you want that known group declared rigidly. Unless you point to precise versions, you are stating that you are ok with “inflight” change, and that is exactly the kind of change that can bring down your system.
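To make the distinction concrete, here is a sketch of a fully constrained environment file using only exact (=) constraints; the cookbook names and versions are placeholders:

{
  "name": "production",
  "json_class": "Chef::Environment",
  "chef_type": "environment",
  "cookbook_versions": {
    "my_web_app": "= 1.0.42",
    "apt": "= 2.6.0",
    "nginx": "= 2.7.4"
  }
}

A pessimistic constraint like "~> 2.6" here would invite exactly the kind of inflight change described above.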

Test all changes and all cookbooks potentially affected by changes

You will need to be able to identify the cookbooks that had no direct changes applied but are part of the same graph undergoing change. In other words, if both cookbook A and cookbook B depend on C and C gets bumped as you are developing A, you must be able to automatically identify that B is potentially affected and must be tested to validate your changes to A, even though no direct changes were made to B. Not until A, B, and C all converge successfully can you consider the changes ready to be included in your environment.

Leveraging the environment cookbook pattern to keep your tree whole

The environment cookbook pattern was introduced by Jamie Winsor in this post. The environment cookbook can be thought of as the trunk or root of your environment. You may have several top level cookbooks that have no direct dependencies on one another, and you are therefore subject to the dependency islands referred to above. However, if you have a common root declaring a dependency on all of your top level cookbooks, you now have a single coherent graph that represents all cookbooks.

The environment cookbook pattern prescribes the inclusion of an additional cookbook for every environment to which you want to apply this versioning rigor. This cookbook is called the environment cookbook and includes only four files:

README.md

Provides thorough documentation of the cookbook’s usage.

metadata.rb

Includes one dependency for each top level cookbook in your environment (see the sketch following this list).

Berksfile and Berksfile.lock

These express the canonical dependency graph of your chef environment. Jamie suggests that this is the only Berksfile.lock you need to keep in source control. While I agree it’s the only one that “needs” to be in source control, I do see value in keeping the others. I think keeping the “child” Berksfile.lock files in sync means the top level dependencies fluctuate less often, which provides a bit more stability during development.
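As a sketch, the metadata.rb of an environment cookbook can be as simple as this (the names are placeholders for your own top level cookbooks):

name    'my_environment'
version '1.0.7'

# one dependency per top level cookbook; everything else
# enters the graph transitively through these
depends 'my_web_app'
depends 'my_database'
depends 'my_load_balancer'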

Generating cookbook constraints against an environment cookbook

Some will suggest using berks apply in the environment cookbook, pointing it at the environment you want to constrain. I personally do not like this method because it uploads the constraints straight to the environment on the chef server. I want to generate the environment file locally first, where I can run tests against it and keep it under version control.

At CenturyLink Cloud we have steps in our CI pipeline that not only add the correct constraints but also allow us to identify all cookbooks impacted by those constraints and ensure that all impacted cookbooks are then tested against the exact same set of dependencies. Here is the flow we are currently using:

Generating new cookbook versions for changed cookbooks

As stated in the safety guidelines above, this not only means that cookbooks with changed code get a version bump, it also means that any cookbook that takes a dependency on one of these changed cookbooks gets a bump. Please refer to my last post, which describes the version numbering strategy. This is a three step process:

  1. Initial versioning of all cookbooks in the environment. This results in all directly changed cookbooks getting bumped.
  2. Sync all individual Berksfile.lock files. This will effectively update the Berksfile.locks of all dependent cookbooks (a minimal sketch of this step follows the list).
  3. A second versioning pass that ensures that all cookbooks affected by the Berksfile.lock updates also get a version bump.
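The sketch below shows one way to implement step 2, assuming all cookbooks live in sibling directories under a common cookbook_root; it simply re-resolves each cookbook’s Berksfile so its lock file picks up the newly bumped versions:

require 'berkshelf'

Dir.glob(File.join(cookbook_root, '*', 'Berksfile')).each do |berks_name|
  # install resolves the dependency graph and rewrites the
  # Berksfile.lock sitting next to each Berksfile
  Berkshelf::Berksfile.from_file(berks_name).install
end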

Generate master list of cookbook constraints against the environment cookbook

Using the Berksfile of the environment cookbook, we will apply the new cookbook versions to a test environment file:

def constrain_environment(environment_name, cookbook_name)
  dependencies = environment_dependencies(cookbook_name)
  env_file = File.join(config.environment_directory,
    "#{environment_name}.json")
  content = JSON.parse(File.read(env_file))

  # replace any existing constraints with the exact versions
  # locked by the environment cookbook's Berksfile
  content['cookbook_versions'] = {}
  dependencies.each do |dep|
    content['cookbook_versions'][dep.name] = dep.locked_version.to_s
  end

  File.open(env_file, 'w') do |out|
    out << JSON.pretty_generate(content)
  end
  dependencies
end

# resolves the full dependency graph of the given cookbook's Berksfile
def environment_dependencies(cookbook_name)
  berks_name = File.join(config.cookbook_directory,
    cookbook_name, "Berksfile")
  berksfile = Berkshelf::Berksfile.from_file(berks_name)
  berksfile.list
end

This will result in a test.json environment file getting all of the cookbook constraints for the environment. Another positive byproduct of this code is that it will force a build failure in the event of version conflicts.

It is very possible that one cookbook will declare a dependency with an explicit version while another cookbook declares the same cookbook dependency with a different version constraint. In these cases the Berkshelf list command invoked above will fail because it cannot satisfy both constraints. It’s good that it fails now, so you can align the versions before the final constraints are locked, rather than discovering the conflict during a chef client run on a node.
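In our pipeline this amounts to a single call (the environment and cookbook names here are placeholders):

# resolve the environment cookbook's Berksfile and write exact
# constraints into environments/test.json
constrain_environment('test', 'my_environment')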

Run kitchen tests for impacted cookbooks against the test.json environment

How do we identify the impacted cookbooks? Well, as we saw above, every cookbook that was either directly changed or impacted via a transitive dependency got a version bump. Therefore it’s a matter of comparing a cookbook’s new version to the version in the last known good tested environment. I've created an is_dirty function to determine if a cookbook needs to be tested:

def is_dirty(environment_name, cookbook_name, environment_cookbook)
  dependencies = environment_dependencies(environment_cookbook)
  cb_dependency = dependencies.find { |dep| dep.name == cookbook_name }

  env_file = File.join(config.environment_directory,
    "#{environment_name}.json")
  content = JSON.parse(File.read(env_file))

  # an environment with no recorded constraints has validated
  # nothing yet, so treat the cookbook as dirty
  return true unless content.has_key?('cookbook_versions')
  return true unless content['cookbook_versions'].has_key?(cookbook_name)

  # dirty if the locked version no longer matches the last
  # known good constraint
  cb_dependency.locked_version.to_s !=
    content['cookbook_versions'][cookbook_name]
end

This method takes the environment that represents my last known good environment (the one where all the tests passed), the cookbook to check for dirty status, and the environment cookbook. If the cookbook is clean, it effectively passes this build step.
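Filtering the full cookbook list down to the set needing a kitchen run then looks something like this sketch, reusing the methods above with the same placeholder names:

# queue every cookbook whose version differs from the
# last known good environment
to_test = environment_dependencies('my_environment').select do |dep|
  is_dirty('last_known_good', dep.name, 'my_environment')
end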

In a future post I may go into detail regarding how we utilize our build server to run all of these tests concurrently from the same git commit and aggregate the results into a single master integration result.

Create a new Last Known Good environment

If any test fails, the entire build fails and it all stops for further investigation. If all tests pass, we run through the above constrain_environment method again to produce the final cookbook constraints of our Last Known Good environment, which serves as a release candidate of cookbooks that can converge our canary deployment group. The deployment process is a topic for a separate post.

The Kitchen-Environment provisioner

One problem we hit early on was that when test-kitchen generated the Berksfile dependencies to ship to the test instance, the versions it generated could differ from the versions in the environment file. This was because Test Kitchen’s chef-zero provisioner, like most of the other chef provisioners, runs a berks vendor against the Berksfile of the individual cookbook under test. This may produce different versions than a berks vendor against the environment cookbook, which also illustrates why we follow this pattern. When this happens, it means that the individual cookbook on its own runs with different dependencies than it would under a chef server.

What we needed was a way for the provisioner to run berks vendor against the environment cookbook. The following custom provisioner does just this:

require "kitchen/provisioner/chef_zero"

module Kitchen

  module Provisioner

    class Environment < ChefZero

      def create_sandbox
        super
        prepare_environment_dependencies
      end

      private

      def prepare_environment_dependencies
          tmp_env = "TMP_ENV"
          path = File.join(tmpbooks_dir, tmp_env)
          env_berksfile = File.expand_path(
            "../#{config[:environment_cookbook]}/Berksfile", 
            config[:kitchen_root])    

          info("Vendoring environment cookbook")    
          ::Berkshelf.set_format :null

          Kitchen.mutex.synchronize do
             Berkshelf::Berksfile.from_file(env_berksfile).vendor(path)

              # we do this because the vendoring converts metadayta.rb
              # to json. any subsequent berks command on the 
              # vendored cookbook will fail
              FileUtils.rm_rf Dir.glob("#{path}/**/Berksfile*")

              Dir.glob(File.join(tmpbooks_dir, "*")).each do | dir |
                cookbook = File.basename(dir)
                if cookbook != tmp_env
                  env_cookbook = File.join(path, cookbook)
                  if File.exist?(env_cookbook)
                    debug("copying #{env_cookbook} to #{dir}")
                    FileUtils.copy_entry(env_cookbook, dir)
                  end
                end
              end
              FileUtils.rm_rf(path)
          end
      end

      def tmpbooks_dir
        File.join(sandbox_path, "cookbooks")
      end

    end
  end
end
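Assuming the class above is saved where test-kitchen can require it as kitchen/provisioner/environment, a cookbook can opt into it from its .kitchen.yml like this (the environment cookbook name is a placeholder):

provisioner:
  name: environment
  environment_cookbook: my_environment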

Environments as cohesive units

This is all about treating any chef environment as a cohesive unit wherein any change introduced must be considered against all parts involved. One may find this to be overly rigid or change averse. One belief I have regarding continuous deployment is that in order to be fluid and nimble, you must have rigor. There is no harm in a high bar for build success as long as it is all automated, making the rigor easy to apply. Having a test framework that guides us toward success is what can separate a continuous deployment pipeline from a continuous hot fix fire drill.