Getting Ready…Troubleshooting unattended windows installation / by Matt Wrock

[Screenshot: ready.PNG, the Windows “Getting Ready” screen]

I install Windows (and Linux) A LOT in my role at CenturyLink Cloud automating our infrastructure rollout and management. Sometimes things go wrong. Usually, if our provisioning code has been waiting for more than a few minutes for the machine to be reachable, I know something is not right. So I might pop open a VMware console and see this ever familiar screen. The Windows installation is “Getting Ready.” That may fill one with the adrenaline of sweet anticipation but I know this only ends in disappointment. I can assure you that if Windows is not ready now, it will never be ready. As in never, ever, ever ready.

In the past I have sat staring into the spinning circle of emptiness wondering what in god’s name Windows is doing. There are no error messages and usually nothing helpful in the VMware events other than telling me that the OS customization has failed. Mmm…thanks. Sometimes after 5 or 15 minutes, the OS may come to life but often not in a state that our provisioning can connect to over WinRM. I’m usually caught off guard by this since I have been spending the past several minutes in a very intense Vulcan mind meld with my monitor, hoping somehow to break through and thinking I’m just beginning to feel the silent, cold, lonely suffering of a failed domain join when suddenly I am asked to press ctrl+alt+delete. Well…ok…I will…and slowly, as if just awoken from one of those Inception dreams within a dream within another dream and having aged hundreds of years, I type just that: ctrl+alt+delete.

OK. You got me. Ctrl+Alt+Del does not work in a VMware console, but you get the idea. Anyhoo, I next run off to the event logs, reading lots and lots of events that are entirely unhelpful and provide no clues. Usually this all ends up being some stupid error like providing a faulty domain admin password to the unattend file. Not too long ago we added code to our Windows provisioning that adds a second NIC, and that introduced a few issues leading to this phenomenon until I got the sequence of adding the NIC, disabling it, configuring it and enabling it just right. But a couple weeks ago I ran into a new issue that really stumped me and that I was not able to solve by looking over my provisioning code or configuration data. This prompted me to research how to get to the bottom of what's going on when Windows is “Getting Ready.” In this post I will cover what I learned and hopefully reveal clues that can help others figure out how to get out of these installation hang-ups.

Overview of CenturyLink Cloud’s server provisioning sequence

It may help to point out roughly how we go about installing our Windows boxes. Our methods may be different from yours but that should be irrelevant: the techniques here for troubleshooting Windows installation hangs and errors should be applicable to just about any unattended Windows install. Our Windows servers run Server 2012 R2, so older OSs may well behave differently.

We have been using Chef for our server automation and, in particular, Chef-Metal for our provisioning process. We have written a custom Chef-Metal vSphere driver that leverages the rbvmomi Ruby library to interact with the VMware vSphere API. The driver does all the footwork of going to the right host, cloning an initial VM template, hooking up the right data stores, setting up initial networking, etc. It also calls into VMware’s guest OS customization configuration, which produces a Windows unattend.xml file, also known as an answer file. VMware Tools injects this file into the setup, and Windows then uses it to drive its installation.

Our unattend file ends up being pretty simple. It performs a domain join and runs some scripts that tweak WinRM so our provisioner can talk to the machine, install the Chef client and kick off the appropriate cookbooks and recipes, making the machine a “real boy” in the end. We run a mix of Windows and Linux, and everything goes through this same sequence. Of course, the Linux boxes don’t have unattend.xml files generated, but they do have their own OS customization process that configures initial networking.

If everything goes right, this takes about 5 minutes from the initial cloning until the machine can receive network traffic and begin its convergence to whatever role that machine will fill: web server, RabbitMQ server, CouchDB server, etc. It really doesn't matter if it's Windows or Linux; 5 minutes is roughly the norm. BTW: for most of our automation testing of Linux machines we use Docker, which is nearly instantaneous, but we do not use that in production (yet).

Breaking through Getting Ready

So what can one do when the Windows install gets “stuck” in this Getting Ready state? Shift-F10 is your friend. I don’t think it matters what hypervisor infrastructure you are using or even if this is a bare metal install. We use VMware but this should work on Hyper-V, VirtualBox, etc. Shift-F10 will immediately open a CMD.exe as administrator if pressed during the unattended install phase.

From here you can start poring through logs and can even open regedit and other GUI-based tools if necessary, but this command prompt is usually enough to find out what is happening.

Where are the logs?

As I have stated above, I have personally not found the VMWare events or the machine event logs to be much help. Your mileage may vary but you are likely going to want to find the unattend activity log which is located, of course, in

c:\windows\panther\UnattendGC\setupact.log

I don’t know what Panther is. I like to think there was some MS Windows team back in the early 90’s that called themselves the panther team, pioneering the way forward in Windows automation. I also like to think they used gang-like panther calls to communicate with one another when spotting each other in the cafeteria or the campus store. They may have worn special jackets with the wild face of a panther on the back and perhaps some had tattoos or some form of tribal scarification applied resembling panther-like imagery. Who knows…I can only guess.

At least in my case this is where the answers were found. Certainly they will be here if the issue is related to the domain join which mine usually tend to be. If the authentication with the domain admin account is at fault, that should be clear here. For instance:

2014-09-06 22:30:10, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:30:20, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:30:30, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:30:40, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:30:51, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:31:01, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:31:11, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:31:22, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:31:32, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...
2014-09-06 22:31:42, Warning  [DJOIN.EXE] Unattended Join: NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...

The key above is the hex error code. Given the nature of the hexadecimal numeric format, the root cause is often immediately obvious, and if not, a Google search usually points you to a more specific message.
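
For what it's worth, the code in the sample above decodes without a search: 0x775 is decimal 1909, the Win32 error ERROR_ACCOUNT_LOCKED_OUT. Below is a minimal Python sketch that pulls the hex codes out of NetJoinDomain warnings and translates a few common ones. The names summarize_join_failures, describe and KNOWN_CODES are hypothetical helpers made up for illustration, the table is a tiny hand-picked subset of the Win32 system error codes, and the regex assumes the exact line format shown in the log excerpt.

```python
import re
from collections import Counter

# Hand-picked subset of Win32 system error codes (illustrative, not exhaustive).
KNOWN_CODES = {
    0x52E: "ERROR_LOGON_FAILURE (unknown user name or bad password)",
    0x54B: "ERROR_NO_SUCH_DOMAIN (domain does not exist or could not be contacted)",
    0x775: "ERROR_ACCOUNT_LOCKED_OUT (the account used for the join is locked out)",
}

# Assumes the message format seen in the setupact.log excerpt above.
JOIN_FAILURE = re.compile(r"NetJoinDomain attempt failed: (0x[0-9A-Fa-f]+)")


def summarize_join_failures(lines):
    """Count NetJoinDomain failures per numeric error code."""
    counts = Counter()
    for line in lines:
        match = JOIN_FAILURE.search(line)
        if match:
            counts[int(match.group(1), 16)] += 1
    return counts


def describe(code):
    """Render a code as hex and decimal, with a name if it is in KNOWN_CODES."""
    name = KNOWN_CODES.get(code, "not in the table, look it up")
    return f"0x{code:x} (decimal {code}): {name}"


if __name__ == "__main__":
    sample = [
        "2014-09-06 22:30:10, Warning  [DJOIN.EXE] Unattended Join: "
        "NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...",
        "2014-09-06 22:30:20, Warning  [DJOIN.EXE] Unattended Join: "
        "NetJoinDomain attempt failed: 0x775, will retry in 10 seconds...",
    ]
    for code, count in summarize_join_failures(sample).items():
        print(f"{count} x {describe(code)}")
```

A machine stuck in setup will not have Python on it, so this assumes you have copied setupact.log somewhere else first, or simply use it as a cheat sheet for reading the codes by eye.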

In my recent stump scenario, the issue was that the domain controller could not be found. It turned out that although I was explicitly giving the domain controller IPs as the DNS servers to use, I was assigning the machine IP via DHCP and the DHCP server pointed to a different pair of DNS servers. For whatever reason, Windows was choosing to use those servers and was therefore unable to resolve the domain name to its correct domain controllers. There are also many other non-domain-join details to be found in this log.

Other log locations that may be helpful

If, for whatever reason, the unattend activity log does not have helpful information, there are a few more places to look, namely all files and subdirectories under:

c:\windows\panther
c:\windows\debug
c:\windows\temp

If you too are using VMware Tools to drive the OS customization, you will find logs specific to VMware’s work in c:\windows\temp. Many of the logs in the directories mentioned above may duplicate one another but some may have more granular detail than others.
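
Since that is a lot of files to page through by hand, here is a rough Python sketch of sweeping those directories for suspicious lines. Everything in it is an assumption for illustration: find_suspect_lines is a hypothetical helper, the keyword list ("error", "fail") and the .log extension filter are guesses at what is worth looking at, and files are read as UTF-8 with undecodable bytes replaced, so any log written in another encoding (UTF-16, for example) will not match.

```python
import os

# The three directories mentioned above.
LOG_DIRS = [r"c:\windows\panther", r"c:\windows\debug", r"c:\windows\temp"]

# Assumed keywords; adjust to taste.
KEYWORDS = ("error", "fail")


def find_suspect_lines(roots, keywords=KEYWORDS, extensions=(".log",)):
    """Yield (path, line_number, text) for each line containing a keyword."""
    for root in roots:
        # os.walk silently yields nothing for a directory that does not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                if not name.lower().endswith(extensions):
                    continue
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "r", encoding="utf-8", errors="replace") as handle:
                        for number, text in enumerate(handle, start=1):
                            lowered = text.lower()
                            if any(keyword in lowered for keyword in keywords):
                                yield path, number, text.rstrip()
                except OSError:
                    # Locked or unreadable file: skip it and keep going.
                    continue


if __name__ == "__main__":
    for path, number, text in find_suspect_lines(LOG_DIRS):
        print(f"{path}:{number}: {text}")
```

As with any keyword sweep, expect noise: plenty of lines mention "error" while reporting that there was none, so treat the output as a list of places to start reading rather than a diagnosis.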

I certainly hope this helps. If it does and you so happen to spot me in a crowd, let out a wild panther shriek and I promise to return with the same.