Tuesday, August 7, 2012

ESXi Rebuild

While reviewing some material for the Microsoft Exchange 2007 70-236 exam, I rebooted one of my Exchange VMs and was greeted with a blue screen informing me that the registry was corrupted. 

After trying to boot into safe mode and then with the last known good configuration, I attempted a system repair using the Windows Server 2008 disc. I was able to repair one of the hives; however, after rebooting I was hit with another blue screen stating that a different hive was corrupted.

After doing some research online, I found via Microsoft's Knowledge Base that there might be an issue with the hard drive. I then remembered that I'd seen some file system errors on some of my Linux VMs, and after logging into a few of them, I noted that almost all of them had file system errors and their file systems were mounted read-only. I booted the ESXi server off of a Linux USB key and found that one of the hard drives had 50 bad sectors on it. I tried running fsck from the ESXi command line, but it appeared to be available only when the host itself could not boot, not as a general command-line tool.
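
For anyone chasing similar symptoms, the checks below are roughly what this looks like from a Linux live USB and from inside an affected guest. The device names are placeholders for whatever your hardware enumerates as, so treat it as a sketch rather than the exact commands I ran:

    # From a Linux live USB booted on the ESXi host; /dev/sda is a placeholder
    # for whichever device the suspect drive shows up as.
    smartctl -a /dev/sda | grep -iE 'reallocated|pending|uncorrectable'
    badblocks -sv /dev/sda    # non-destructive, read-only surface scan

    # Inside an affected Linux guest, a read-only root can be checked from
    # single-user or rescue mode and then remounted.
    fsck -f /dev/sda1
    mount -o remount,rw /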

I figured that since the drive was bad, I might as well buy a new drive, take an image of the existing drive, and then write the image to the new drive. I opted to use CloneZilla to take the image, but ran into the issue of CloneZilla storing the image files locally on the bootable USB drive before copying them to their final destination (an SMB share I'd set up on my desktop).
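
An alternative I considered in hindsight is to skip CloneZilla and image the failing drive straight to the SMB share from a Linux live USB using GNU ddrescue, so nothing gets staged on the USB key. The share path, credentials, and device name below are all placeholders:

    # Mount the SMB share, then image the drive directly onto it.
    mkdir -p /mnt/backup
    mount -t cifs //desktop/backup /mnt/backup -o username=me

    # ddrescue handles bad sectors more gracefully than a plain dd and keeps
    # a map file so an interrupted copy can be resumed.
    ddrescue -d /dev/sda /mnt/backup/esxi-disk.img /mnt/backup/esxi-disk.map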

I figured that if I installed the new drive, did a clean install of ESXi 5 on it, and then added the existing datastore from the old drive, I could just copy the VMs over and then get rid of the old datastore. After completing the install, I went into the datastore browser, selected all of the VMs, and moved them to the new datastore. This was a mistake. It caused the host to almost completely lock up, to the point where I had to restart the management agents to regain control. After doing that I was left with partially copied VMs. When I tried to move the vmdk files over to the new datastore, ESXi told me that the files already existed there. When I logged in via SSH to verify, I found no trace of the files on the second datastore, but it appeared that ESXi had somehow created a symbolic link between the source and destination during the move. I ended up having to rename the offending vmdk files and then point the disks in the VMs to the renamed files. I was able to move almost all of the VMs successfully. One wouldn't boot because I selected 'I moved it' instead of 'I copied it' when starting it, and four others appeared to have been corrupted by the bad sectors.
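
If you end up in the same spot, vmkfstools handles the renaming and copying much more safely than shuffling files around in the datastore browser. The volume and VM names below are placeholders, and this is a sketch of the approach rather than a transcript of what I ran:

    # Restart the management agents if the host stops responding.
    /sbin/services.sh restart

    # Rename a vmdk properly -- unlike a plain mv, this updates the
    # descriptor and the -flat extent together.
    vmkfstools -E /vmfs/volumes/old_datastore/vm1/vm1.vmdk \
               /vmfs/volumes/old_datastore/vm1/vm1-renamed.vmdk

    # Copy a disk to the new datastore rather than moving it in the browser.
    vmkfstools -i /vmfs/volumes/old_datastore/vm1/vm1-renamed.vmdk \
               /vmfs/volumes/new_datastore/vm1/vm1.vmdk -d thin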

After migrating the VMs and verifying they all worked, I applied the latest ESXi patch to the host. Prior to applying the patch, I was able to unmount and remount the old datastore to verify that none of the VMs depended on any files on it. However, after the patch I was unable to unmount or delete the datastore, because during the patching process a diagnostic partition had been created on it. When I tried to delete the partition from the ESXi command line, I couldn't because it was in use, which leads me to believe that VMFS doesn't allow deletion of any files or resources that are in use. I ended up shutting down the machine and physically disconnecting the drive. After booting back up, the datastore was gone (of course), and I then proceeded to add the old drive to the new datastore, regaining my 1.5TB of storage.
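
In hindsight, that diagnostic partition is the host's coredump partition, and it can probably be released from the command line before resorting to pulling the drive. Something like the following should do it on ESXi 5 (the datastore label is a placeholder, and I'm going from the documented esxcli syntax rather than commands I actually ran):

    # See where the diagnostic/coredump partition currently lives.
    esxcli system coredump partition get

    # Deactivate it and release the configuration.
    esxcli system coredump partition set --enable false
    esxcli system coredump partition set --unconfigure

    # With the partition released, the old datastore should unmount.
    esxcli storage filesystem unmount -l old_datastore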

Key takeaways from this were:
1) Don't buy cheap hardware.
2) VMFS doesn't allow for deletion of files that are in use.
3) It is best to install the hypervisor on a separate physical disk to make disaster recovery easier.

This was quite the ordeal, but also a learning experience.