I'm so excited. I'm about to join Chef and I think I like it! by Matt Wrock

A little over a year ago, a sparkly devops princess named Jennifer Davis came to my workplace at CenturyLink Cloud and taught me a thing or two about this thing she called Chef.  I had been heads down for the previous 3 years trying to automate windows. While wandering in the wilderness of windows automation I had previously bumped into Chocolatey where I met my good friend Rob Reynolds who had told me stories about these tools called Puppet, Chef, and Vagrant. Those were the toys that the good little girls and boys on the other side of the mountain (ie linux land) used for their automation. Those tools sounded so awesome but occupied with my day job at Microsoft that was not centered around  automation and my nights and weekends developing Boxstarter, I had yet to dabble in these tools myself.

So when the opportunity came to join a cloud computing provider CenturyLink Cloud and focus on data center automation for a mixed windows and linux environment, I jumped at the opportunity. Thats where Jennifer entered the scene and opened a portal door leading from my desert to a land of automation awe and wonder and oh the things I saw. Entire communities built, thriving and fighting all about transforming server configuration from lines of code. To these people, "next" and "finish" were commands and not buttons. These were my people and with them I wanted to hang.

There were windows folk I spied through the portal. Jeffrey Snover had been there for quite some time in full robe and sandals pulling windows folk kicking and screaming through to the other side of the portal. Steven Murauwski was there too translating Jeffrey's teachings and pestering Jeffrey's disciples to write tests because that's how its done. Rob had been sucked through the portal door at least a year prior and was working for Puppet. Hey look he's carrying a Mac just like all the others. Is that some sort of talisman? Is it the shared public key needed to unveil their secrets? I don't know...but its so shiny.

So what the heck, I thought. Tonight we'll put all other things aside and I followed Jennifer through the portal and never looked back. There was much ruby to be learned and written. Its not that different from powershell and had many similarities to C# but it was indeed its own man and would occasionally hit back when I cursed it. One thing I found was that even more fun that using Chef to automate CenturyLink data centers, was hacking on and around the automation tooling like Chef, Vagrant, Ruby's WinRM implementation, Test-Kitchen, etc.

There was so much work to be done here to build a better bridge between windows and the automation these tools had to offer. This was so exciting and Chef seemed to be jumping all in to this effort as if proclaiming I know, I know, I know, I know, I want you, I want you. Before long Steven took up residence with Chef and Snover seemed to be coming over for regular visits where he was working with Chef to feed multitudes on just a couple loaves of DSC. I helped a little here and there along with a bunch of other smart and friendly people.

Well meanwhile on the CenturyLink farm, we made alot of progress and added quite a bit of automation to building out our data centers. Recently things have been winding down for me on that effort and it seemed a good time to asses my progress and examine new possibilities. Then it occured to me, if Jennifer can be a sparkly devops princess at chef, could I be one too?

We may not yet know the answer to that question yet but I intend to explore it and enter into the center sphere of its celestial force.

So I'll see my #cheffriends Monday. I'm so excited and I just cant hide it!

A dirty check for Chef cookbooks: testing everything that has changed and only when it is changed by Matt Wrock

I was recently pinged by a #cheffriend of mine, Alex Vinyar. A few months back, I sent him a bit of code that I wrote at CenturyLink Cloud that runs on our build agents to determine which cookbooks to test after a cookbook commit is pushed to source control. He just mentioned to me that he has gotten some positive feedback on that code so I thought I'd blog on the details of that for others who may be interested. This can also be considered a follow up to my post from almost a year ago on cookbook dependency management where I shared some theory I was pondering at the time on this subject. This post covers how I followed up on that theory with working code.

I'm going to cover two key strategies:

  1. Testing all cookbooks that depend on a changed cookbook
  2. Ensuring that once a cookbook is tested and succeeds, it is never tested again as long as it and its dependencies do not change

I'll share the same code I shared with Alex (and more) in a bit.

Not enough to test just the cookbook changed

Lets consider this scenario: You manage infrastructure for a complex distributed application. You maintain about a baker's dozen cookbooks - see what I did there?...baker...chef...look at me go girl! You have just committed and pushed a change to one of your lower level library cookbooks that exposes functions used by several of your other cookbooks. We'd expect your pipeline to run tests on that library cookbook but is that enough? What if your upstream cookbooks that depend on the library cookbook "blow up" when they converge with the new changes. This is absolutely possible and not just for library cookbooks.

I wanted a way for a cookbook commit to trigger not only its own chefspecs and test-kitchen tests but also the tests of other cookbooks that depend on it.

Kitchen tests provide slow but valuable feedback, don't run more than needed

Here is another scenario: You are now testing the entire dependency tree touched by a change...yay! But failure happens and your build fails. However the failure may not have occurred at the root of your tree (the cookbook that actually changed). Here are some possibilities:

  1. Maybe there was some transient network glitch that caused a upstream cookbook to fail
  2. Gem dependency changes outside of your own managed cookbook collection (you know - the dark of semver - minor build bumps that are really "major")
  3. Your build has batched more than one root cookbook change. One breaks but the others pass

So an hour's worth (or more) of testing 10 cookbooks results in a couple broken kitchen tests. You make some fixes, push again and have to wait for the entire set to build/test all over again.

I have done this before. Just look at my domain name!

However many of these tests may be unnecessary now. Some of the tests that has succeeded in the filed batch may be unaffected by the fix.

Finding dependencies

So here is the class we use at CenturyLink Cloud to determine when one cookbook as been affected by a recent change. This is what I sent to Alex. I'll walk you through some of it later.

require 'berkshelf'
require 'chef'
require 'digest/sha1'
require 'json'

module Promote
  class Cookbook
    def initialize(cookbook_name, config)
      @name = cookbook_name
      @config = config
      @metadata_path = File.join(path, 'metadata.rb')
      @berks_path = File.join(path, "Berksfile")
    end

    # List of all dependent cookbooks including transitive dependencies
    #
    # @return [Hash] a hash keyed on cookbook name containing versions
    def dependencies
      @dependencies ||= get_synced_dependencies_from_lockfile
    end

    # List of cookbooks dependencies explicitly specified in metadata.rb
    #
    # @return [Hash] a hash keyed on cookbook name containing versions
    def metadata_dependencies
      metadata.dependencies
    end

    def version(version=nil)
      metadata.version
    end

    def version=(version)
      version_line = raw_metadata[/^\s*version\s.*$/]
      current_version = version_line[/('|").*("|')/].gsub(/('|")/,"")
      if current_version != version.to_s
        new_version_line = version_line.gsub(current_version, version.to_s)
        new_content = raw_metadata.gsub(version_line, new_version_line)
        save_metadata(new_content)
      end
      version
    end

    # Stamp the metadata with a SHA1 of the last git commit hash in a comment
    #
    # @param [String] SHA1 hash of commit to stamp
    def stamp_commit(commit)
      commit_line = "#sha1 '#{commit}'"
      new_content = raw_metadata.gsub(/#sha1.*$/, commit_line)
      if new_content[/#sha1.*$/].nil?
        new_content += "\n#{commit_line}"
      end
      save_metadata(new_content)
    end

    def raw_metadata
      @raw_metadata ||= File.read(@metadata_path)
    end

    def metadata
      @metadata ||=get_metadata_from_file
    end

    def path
      File.join(@config.cookbook_directory, @name)
    end

    # Performs a berks install if never run, otherwise a berks update
    def sync_berksfile(update=false)
      return if berksfile.nil?

      update = false unless File.exist?("#{@berks_path}.lock")
      update ? berksfile_update : berksfile.install
    end

    # Determine if our dependency tree has changed after a berks update
    #
    # @return [TrueClass,FalseClass]
    def dependencies_changed_after_update?
      old_deps = dependencies
      sync_berksfile(true)
      new_deps = get_dependencies_from_lockfile

      # no need for further inspection is 
      # number of dependencies have changed
      return true unless old_deps.count == new_deps.count

      old_deps.each do |k,v|
        if k != @name
          return true unless new_deps[k] == v
        end
      end

      return false
    end

    # Hash of all dependencies. Guaranteed to change if 
    # dependencies of their versions change
    #
    # @param [environment_cookbook_name] Name of environmnt cookbook
    # @return [String] The generated hash
    def dependency_hash(environment_cookbook_name)
      # sync up with the latest on the server
      sync_latest_app_cookbooks(environment_cookbook_name)
      
      hash_src = ''
      dependencies.each do | k,v |
        hash_src << "#{k}::#{v}::"
      end
      Digest::SHA1.hexdigest(hash_src)
    end

    # sync our environmnt cookbook dependencies 
    # with the latest on the server
    #
    # @return [Hash] the list of cookbooks updated
    def sync_latest_app_cookbooks(environment_cookbook_name)
      result = {}

      #Get a list of all the latest cookbook versions on the server
      latest_server_cookbooks = chef_server_cookbooks(1)
      
      env_cookbook = Cookbook.new(environment_cookbook_name, @config)
      
      # iterate each cookbook in our environment cookbook
      env_cookbook.metadata_dependencies.each do |key|
        cb_key = key[0] # cookbook name

        # skip if its this cookbook or if its not on the server
        next if cb_key == @name || !latest_server_cookbooks.has_key?(cb_key)
        latest_version = latest_server_cookbooks[cb_key]['versions'][0]['version']
        
        # if the server as a later version than this cookbook
        # we do a berks update to true up with the server
        if dependencies.keys.include?(cb_key) &&
          latest_version > dependencies[cb_key].to_s
          
          berksfile_update(cb_key)
          result[cb_key] = latest_version
        end
      end

      # return the updated cookbooks
      result
    end

    private

    def chef_server_cookbooks(versions = 'all')
      Chef::Config[:ssl_verify_mode] = :verify_none
      rest = Chef::REST.new(
        @config.chef_server_url, 
        @config.node_name, 
        @config.client_key
      )
      rest.get_rest("/cookbooks?num_versions=#{versions}")
    end

    def berksfile
      @berksfile ||= get_berksfile
    end

    def get_berksfile
      return unless File.exist?(@berks_path)
      Berkshelf.set_format :null
      Berkshelf::Berksfile.from_file(@berks_path)
    end

    def save_metadata(content)
      File.open(@metadata_path, 'w') do |out|
        out << content
      end

      # force subsequent usages of metadata to refresh from file
      @metadata=nil
      @raw_metadata = nil
    end

    def get_metadata_from_file
      metadata = Chef::Cookbook::Metadata.new
      metadata.from_file(@metadata_path)
      metadata
    end

    def get_synced_dependencies_from_lockfile
      # install to make sure we have a lock file
      berksfile.install

      get_dependencies_from_lockfile
    end

    # get the version of every dependent cookbook from berkshelf
    def get_dependencies_from_lockfile
      berks_deps = berksfile.list
      deps = {}

      berks_deps.each {|dep| deps[dep.name] = dep.locked_version}
      deps
    end

    def berksfile_update(cookbooks = nil)
      cookbooks.nil? ? berksfile.update : berksfile.update(cookbooks)
      @dependencies = nil
    end
  end
end

This class is part of a larger "promote" gem that we use to promote cookbooks along our delivery pipeline. The class represents an individual cookbook and can be instantiated given an individual cookbook name and a config.

The config class (not shown here) is an extremely basic data class that encapsulates the properties of the chef environment.

Some assumptions

This code will only work given a couple conditions are true. One is fairly common and the other not so much. Even if these conditions do not exist in your environment you should be able to groc the logic.

Using Berkshelf

You are using berkshelf to manage your dependencies. This class uses the berkshelf API to find dependencies of other cookbooks.

Uning the Environment Cookbook Pattern

We use the environment cookbook pattern to govern which cookbooks are considered part of a deployable application. You can read more about this pattern here but an environment cookbook is very simple. It is just a metadata.rb and a Bekshelf.lock. The metadata file includes a depends for each of my cookbooks responsible for infrastructure used in the app. The Berkshelf.lock file uses that to build the entire tree of not only "my own" cookbooks but all third party cookbooks as well.

The dependency hash

For the purposes of this post, the dependency_hash method of the above class is the most crucial. This hash represents all cookbook dependencies and their version at the current point in time. If I were to persist this hash (more on that in a bit) and then later regenerate it with a hash that is at all different from the persisted one, I know something has changed. I don't really care where. Any change at all anywhere in a cookbook's dependency tree means it is at risk of breaking and should be tested.

The "Last Known Good" environment

Every time we test a cookbook and it passes, we store its current dependency hash in a special chef environment called LKG (last known good). This is really a synthetic environment with the same structure as any chef environment but the version constraints may look a bit strange:

{
  name: "LKG_platform",
  chef_type: "environment",
  json_class: "Chef::Environment",
  cookbook_versions: {
    p_rabbitmq: "1.1.15.2d88c011bfb691e9de0df593e3f25f1e5bf7cbba",
    p_elasticsearch: "1.1.79.2dd2472fbd47bf32db4ad83be45acb810b788a90",
    p_haproxy: "1.1.47.75dfc55b772fa0eb08b6fbdee02f40849f37ca77",
    p_win: "1.1.121.fa2a465e8b964a1c0752a8b59b4a5c1db9074ee5",
    p_couchbase: "1.1.92.325a02ab22743eccc17756539d3fac57ce5ef952",
    elasticsearch_sidec: "0.1.7.baab30d04181368cd7241483d9d86d2dffe730db",
    p_orchestration: "0.1.48.3de2368d660a4c7dfd6dba2bc2c61beeea337f70"
  }
}

We update this synthetic environment any time a cookbook test runs and passes. Here is a snippet of our Rakefile that does this:

cookbook = Promote::Cookbook.new(cookbook_name, config)
environment = Promote::EnvironmentFile.new("LKG", config)
lkg_versions = environment.cookbook_versions
cookbook_version = "#{cookbook.version.to_s}.#{hash}"

lkg_versions[cookbook_name] = cookbook_version

environment.write_cookbook_versions(lkg_versions)

This is just using the cookbook and environment classes of our Promote gem to write out the new versions and following this, the environment like all of our environments is committed to source control.

The dirty check

We have now seen how we generate a dependency hash and how we persist that on successful tests. Here is where we actually compare our current dependency hash and the last known hash. We need to determine before running integration tests on a given cookbook if it is any different than the last time we succesfully tested it. If we previously tested it and neither it nor any of its dependencies (no matter how deep) have changed, we really do not need or want to test it again.

def is_dirty(
  source_environment,
  lkg_environment,
  cookbook_name,
  dependency_hash = nil
)
  # Get the environment of the last good tests
  lkg_env_file = EnvironmentFile.new(lkg_environment, config)

  # if this cookbook is not in the last good tests, we are dirty
  unless lkg_env_file.cookbook_versions.has_key?(cookbook_name)
    return true
  end

  # get the current environment under test
  source_env_file = EnvironmentFile.new(source_environment, config)
  
  # get the version of this cookbook from the last good tests
  # it will look like 1.1.15.2d88c011bfb691e9de0df593e3f25f1e5bf7cbba
  # thi is the normal version followed by the dependency hash tested
  lkg_parts = lkg_env_file.cookbook_versions[cookbook_name].split('.')

  # Get the "plain" cookbook version
  lkg_v = lkg_parts[0..2].join('.')
  
  # the dependency hash last tested
  lkg_hash = target_parts[3]
  
  # we are dirty if our plain version or dependency hash has changed
  return source_env_file.cookbook_versions[cookbook_name] != lkg_v || 
    dependency_hash != lkg_hash
end

This method takes the environment currently under test, our LKG environment that has all the versions and their dependency hashes last tested, the cookbook to be tested and its current dependency hash. It returns whether or not our cookbook is dirty and needs testing.

Prior to calling this code, our build pipeline edits our "test" environment (the "source_environment" above). This environment holds the cookbook versions of all cookbooks being tested including the last batch of commits. We use a custom kitchen provisioner plugin to ensure Test-Kitchen uses these versions and not just the versions in each cookbooks berkshelf lock file. As I discussed in my dependency post from last year, its important that all changes are tested together.

So there you have it. Now Hurry up and Wait for your dirty cookbooks to run their tests!

Detecting a hung windows process by Matt Wrock

A couple years ago, I was adding remoting capabilities to boxstarter so that one could setup any machine in their network and not just their local environment. As part of this effort, I ran all MSI installs via a scheduled task because some installers (mainly from Microsoft) would pull down bits via windows update and that always fails when run via a remote session. Sure the vast majority of MSI installers will not need a scheduled task but it was just easier to run all through a scheduled task instead of maintaining a whitelist.

I have run across a couple installers that were not friendly to being run silently in the background. They would throw up some dialog prompting the user for input and then that installer would hang forever until I forcibly killed it. I wanted a way to automatically tell when "nothing" was happening as opposed to a installer actually chugging away doing stuff.

Its all in the memory usage

This has to be more than a simple timeout based solution  because there are legitimate installs that can take hours. One that jumps to mind rhymes with shequel shurver. We have to be able to tell if the process is actually doing anything without the luxury of human eyes staring at a progress bar. The solution I came up with and am going to demonstrate here has performed very reliable for me and uses a couple memory counters to track when a process is truly idle for long periods of time.

Before I go into further detail, lets look at the code that polls the current memory usage of a process:

function Get-ChildProcessMemoryUsage {
    param(
        $ID=$PID,
        [int]$res=0
    )
    Get-WmiObject -Class Win32_Process -Filter "ParentProcessID=$ID" | % { 
        if($_.ProcessID -ne $null) {
            $proc = Get-Process -ID $_.ProcessID -ErrorAction SilentlyContinue
            $res += $proc.PrivateMemorySize + $proc.WorkingSet
            $res += (Get-ChildProcessMemoryUsage $_.ProcessID $res)
        }
    }
    $res
}

Private Memory and Working Set

You will note that the memory usage "snapshot" I take is the sum of PrivateMemorySize and WorkingSet. Such a sum on its own really makes no sense so lets see what that means. Private memory is the total number of bytes that the process is occupying in windows and WorkingSet is the number of bytes that are not paged to disk. So WorkingSet will always be a subset of PrivateMemory, why add them?

I don't really care at all about determining how much memory the process is consuming in a single point in time. What I do care about is how much memory is ACTIVE relative to a point in time just before or after each of these snapshots. The idea is that if these numbers are changing, things are happening. I tried JUST looking at private memory and JUST looking at working set and I got lots of false positive hangs - the counts would remain the same over fairly long periods of time but the install was progressing.

So after experimenting with several different memory counters (there are a few others), I found that if I looked at BOTH of these counts together, I could more reliably use them to detect when a process is "stuck."

Watching the entire tree

You will notice in the above code that the Get-ChildProcessMemoryUsage function recursively tallies up the memory usage counts for the process in question and all child processes and their children, etc. This is because the initial installer process tracked by my program often launches one or more subprocesses that do various bits of work. If I only watch the initial root process, I will get false hang detections again because that process may do nothing for long periods of time as it waits on its child processes.

Measuring the gaps between changes

So we have seen how to get the individual snapshots of memory being used by a tree of processes. As stated before, these are only useful in relationships between other snapshots. If the snapshots fluctuate frequently, then we believe things are happening and we should wait. However if we get a long string where nothing happens, we have reason to believe we are stuck. The longer this string, the more likely our stuckness is a true hang.

Here are memory counts where memory counts of processes do not change:

VERBOSE: [TEST2008]Boxstarter: SqlServer2012ExpressInstall.exe 12173312
VERBOSE: [TEST2008]Boxstarter: SETUP.EXE 10440704
VERBOSE: [TEST2008]Boxstarter: setup.exe 11206656
VERBOSE: [TEST2008]Boxstarter: ScenarioEngine.exe 53219328
VERBOSE: [TEST2008]Boxstarter: Memory read: 242688000
VERBOSE: [TEST2008]Boxstarter: Memory count: 0
VERBOSE: [TEST2008]Boxstarter: SqlServer2012ExpressInstall.exe 12173312
VERBOSE: [TEST2008]Boxstarter: SETUP.EXE 10440704
VERBOSE: [TEST2008]Boxstarter: setup.exe 11206656
VERBOSE: [TEST2008]Boxstarter: ScenarioEngine.exe 53219328
VERBOSE: [TEST2008]Boxstarter: Memory read: 242688000
VERBOSE: [TEST2008]Boxstarter: Memory count: 1
VERBOSE: [TEST2008]Boxstarter: SqlServer2012ExpressInstall.exe 12173312
VERBOSE: [TEST2008]Boxstarter: SETUP.EXE 10440704
VERBOSE: [TEST2008]Boxstarter: setup.exe 11206656
VERBOSE: [TEST2008]Boxstarter: ScenarioEngine.exe 53219328
VERBOSE: [TEST2008]Boxstarter: Memory read: 242688000
VERBOSE: [TEST2008]Boxstarter: Memory count: 2

These are 3 snapshots of the child processes of the 2012 Sql Installer on an Azure instance being installed by Boxstarter via Powershell remoting. The memory usage is the same for all processes so if this persists, we are likely in a hung state.

So what is long enough?

Good question!

Having played and monitored this for quite some time, I have come up with 120 seconds as my threshold - thats 2 minutes for those not paying attention. I think quite often that number can be smaller but I am willing to error conservatively here. Here is the code that looks for a run of inactivity:

function Test-TaskTimeout($waitProc, $idleTimeout) {
    if($memUsageStack -eq $null){
        $script:memUsageStack=New-Object -TypeName System.Collections.Stack
    }
    if($idleTimeout -gt 0){
        $lastMemUsageCount=Get-ChildProcessMemoryUsage $waitProc.ID
        $memUsageStack.Push($lastMemUsageCount)
        if($lastMemUsageCount -eq 0 -or (($memUsageStack.ToArray() | ? { $_ -ne $lastMemUsageCount }) -ne $null)){
            $memUsageStack.Clear()
        }
        if($memUsageStack.Count -gt $idleTimeout){
            KillTree $waitProc.ID
            throw "TASK:`r`n$command`r`n`r`nIs likely in a hung state."
        }
    }
    Start-Sleep -Second 1
}

This creates a stack of these memory snapshots. If the last snapshot is identical to the one just captured, we add the one just captured to the stack. We keep doing this until one of two thing happens:

  1. A snapshot is captured that varies from the last one recorded in the stack. At this point we clear the stack and continue.
  2. The number of snapshots in the stack exceeds our limit. Here we throw an error - we believe we are hung.

I hope you have found this information to be helpful and I hope that none of your processes become hung!