Friday, May 18, 2012

XenServer losing space on SR, and Snapshots fail due to chain count at 30


I have had two issues lately with my XenServer (6.0.2).  First, I was using a snapshot technique to make a complete image backup of each VM.  This was working great... until I reached about 30 backups of any particular VM, at which point the snapshot would fail with an error that the chain count was too high.  Apparently, coalescing is supposed to merge the pieces back together and reset the chain count, so I gather coalescing was not happening.


As I began to look into this issue more, I found that I had also lost a ton of space on my SR.  It appeared that there were numerous lost or orphaned VDIs that Xen no longer knew about but that were still taking up space on the SR.  I believe these lost VDIs were also part of the chain-count snapshot problem.


I've learned that xe sr-scan uuid=<sr-uuid> may be the key to all of this.
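If you need to find the SR uuid first, something like this should do it (<sr-uuid> is just a placeholder for your own SR's uuid):

# list SRs to find the uuid of the one losing space
xe sr-list params=uuid,name-label,physical-utilisation
# kick off the scan (substitute your SR's uuid for <sr-uuid>)
xe sr-scan uuid=<sr-uuid>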


When you run this from the console (I believe it should be run on the pool master), it usually returns to the command prompt fairly quickly (within a few seconds).  What I've learned is that running the command simply starts the SR scan; you then need to monitor the SM log at /var/log/SMlog .
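For what it's worth, plain tail/grep is enough to watch it, something like:

# follow the log live while the scan and garbage collection run
tail -f /var/log/SMlog
# or pull out the most recent exception block (the banner shown further down) after things quiet down
grep -A 20 'E X C E P T I O N' /var/log/SMlog | tail -n 25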


If you check it right away, the sr-scan may not have finished.  I found that, using XenCenter, I could see a lot of activity on the disk (or network) of the host I was running the scan from.  Obviously, this will also depend on the amount of activity on your VMs.  Once you see the stats drop on your host, check the log again.  When the scan finished, I saw the exception logged below:


<24565> 2012-05-17 16:45:28.148185               ***********************
<24565> 2012-05-17 16:45:28.148237               *  E X C E P T I O N  *
<24565> 2012-05-17 16:45:28.148288               ***********************
<24565> 2012-05-17 16:45:28.148350      gc: EXCEPTION util.SMException, os.unlink(/var/run/sr-mount/2fb93e09-a88c-e02e-183b-cfb1b3386f6c/7d1b380a-d812-4270-a673-358cfefa8ead.vhd) failed
<24565> 2012-05-17 16:45:28.148402        File "/opt/xensource/sm/cleanup.py", line 2515, in gc
    _gc(None, srUuid, dryRun)
  File "/opt/xensource/sm/cleanup.py", line 2417, in _gc
    _gcLoop(sr, dryRun)
  File "/opt/xensource/sm/cleanup.py", line 2387, in _gcLoop
    sr.garbageCollect(dryRun)
  File "/opt/xensource/sm/cleanup.py", line 1438, in garbageCollect
    self.deleteVDIs(vdiList)
  File "/opt/xensource/sm/cleanup.py", line 1862, in deleteVDIs
    SR.deleteVDIs(self, vdiList)
  File "/opt/xensource/sm/cleanup.py", line 1452, in deleteVDIs
    self.deleteVDI(vdi)


Now, note the os.unlink line.  (I have seen another post that mentions a different error, but the idea is the same: find the VDI that is causing trouble and deal with it.)  It contains a path you can change into, made up of the SR uuid followed by the uuid of the VHD/VDI.  So at this point, I knew there was some problem with that VDI.  Then I ran xe vdi-list:


[root@v1 2fb93e09-a88c-e02e-183b-cfb1b3386f6c]# xe vdi-list uuid=7d1b380a-d812-4270-a673-358cfefa8ead

uuid ( RO)                : 7d1b380a-d812-4270-a673-358cfefa8ead
          name-label ( RW): base copy
    name-description ( RW): 
             sr-uuid ( RO): 2fb93e09-a88c-e02e-183b-cfb1b3386f6c
        virtual-size ( RO): 536870912000
            sharable ( RO): false
           read-only ( RO): true

Sometimes this would return nothing.  If this was the case, I would just start the sr-scan again.  If it returned a vdi, as above, then I would check to see if the file existed.  I would look in 

/var/run/sr-mount/<sr-uuid>/ (this is the location where valid VDIs/VHDs should be)

and do an 

ls -l 7d1b380a-d812-4270-a673-358cfefa8ead.vhd

In my case, the file never existed.  So I was pretty certain it was safe to "delete" this VDI, since it didn't really exist anyway.  To do this, I used xe vdi-forget:

[root@v1 2fb93e09-a88c-e02e-183b-cfb1b3386f6c]# xe vdi-forget uuid=7d1b380a-d812-4270-a673-358cfefa8ead

If the VDI file had existed when I did the ls, I would have used vdi-destroy instead.  I probably would have backed up the file before doing the destroy, just in case...
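Something along these lines is what I have in mind (the uuid and backup path here are only placeholders, not values from my SR):

# copy the VHD somewhere safe first, just in case
cp /var/run/sr-mount/<sr-uuid>/<vdi-uuid>.vhd /path/to/backup/
# then remove both the record and the on-disk data
xe vdi-destroy uuid=<vdi-uuid>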

Once I had done the vdi-forget, I re-ran the sr-scan and dealt with the next error as it arose...

As I slowly cleaned up more and more errors, sometimes the scan would take 20+ minutes before it gave me an exception.

---

I have been monitoring and doing this for about 24 hours now.  The number of VDIs marked "base copy" has dropped from over 200 to under 80 so far...  The SR is also reporting 600 GB less used so far (600 GB reclaimed)...  I'm hopeful that once all the errors are cleaned up, sr-scan will run successfully, coalesce the chains (so snapshots work again), and restore some or all of the lost space on the SR...
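In case it's useful, this is roughly how the counts and SR usage can be checked (substitute your own SR uuid; --minimal returns a comma-separated list of uuids):

# count the "base copy" VDIs left on the SR
xe vdi-list sr-uuid=<sr-uuid> name-label="base copy" --minimal | tr ',' '\n' | grep -c .
# see how much space the SR reports as used
xe sr-list uuid=<sr-uuid> params=physical-utilisation,physical-size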

Also worth noting: so far, at least for me, none of the problem VDIs have actually existed on disk; I guess they were just still listed in the Xen SR database or something...


--- 


Update 2012-05-20


After continuing the above process for about three days, sr-scan finally completed successfully.  I'm happy to report that my VDI chain counts have been reset, and I have regained much of the "lost" space on my SR.