Friday, May 18, 2012

XenServer losing space on SR, and Snapshots fail due to chain count at 30




I have had 2 issues lately with my XenServer (6.0.2).  First of all, I was using a snapshot technique to make a complete image backup of each VM.  This was working great... until I reached about 30 backups of any particular VM; then the snapshot would fail, with an error that the chain count was too high.  Apparently, coalescing is supposed to put the pieces back together and reset the chain count, so I gather coalescing was not happening.
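
For reference, a typical snapshot-then-export backup recipe looks roughly like this (the uuids and the backup path below are just placeholders, not from my actual script):

# snapshot the running VM, make the snapshot exportable, export it, then remove it
xe vm-snapshot uuid=<vm-uuid> new-name-label=backup-snap
xe template-param-set is-a-template=false uuid=<snapshot-uuid>
xe vm-export vm=<snapshot-uuid> filename=/mnt/backup/myvm.xva
xe vm-uninstall uuid=<snapshot-uuid> force=true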


As I began to look into this issue more, I found that I had also lost a ton of space on my SR.  It appeared that there were numerous lost or orphaned VDIs that Xen no longer knew about but that were still taking up space on the SR.  I believe these lost VDIs were also a part of the chain count snapshot problem.


I've learned that the xe sr-scan uuid= may be the key to all of this.


When you run this from the console (I believe it should be run on the pool master), it usually returns to the command prompt fairly quickly (within a few seconds).  What I've learned is that running this command simply starts the SR scan.  Then you must monitor the SM log, /var/log/SMlog.
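
In other words, from the pool master console, run something like the following and then watch the log (the SR uuid here is the one from my pool; use your own):

xe sr-scan uuid=2fb93e09-a88c-e02e-183b-cfb1b3386f6c
tail -f /var/log/SMlog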


If you check it right away, the sr-scan may not have finished.  I found that, using XenCenter, I could see a lot of activity on the disk (or network) of the host I was running the scan from.  Obviously, this will also depend on the amount of activity on your VMs.  Once you see the stats drop on your host, check the log again.  When it finished, I saw the exception logged as below:


<24565> 2012-05-17 16:45:28.148185               ***********************
<24565> 2012-05-17 16:45:28.148237               *  E X C E P T I O N  *
<24565> 2012-05-17 16:45:28.148288               ***********************
<24565> 2012-05-17 16:45:28.148350      gc: EXCEPTION util.SMException, os.unlink(/var/run/sr-mount/2fb93e09-a88c-e02e-183b-cfb1b3386f6c/7d1b380a-d812-4270-a673-358cfefa8ead.vhd) failed
<24565> 2012-05-17 16:45:28.148402        File "/opt/xensource/sm/cleanup.py", line 2515, in gc
    _gc(None, srUuid, dryRun)
  File "/opt/xensource/sm/cleanup.py", line 2417, in _gc
    _gcLoop(sr, dryRun)
  File "/opt/xensource/sm/cleanup.py", line 2387, in _gcLoop
    sr.garbageCollect(dryRun)
  File "/opt/xensource/sm/cleanup.py", line 1438, in garbageCollect
    self.deleteVDIs(vdiList)
  File "/opt/xensource/sm/cleanup.py", line 1862, in deleteVDIs
    SR.deleteVDIs(self, vdiList)
  File "/opt/xensource/sm/cleanup.py", line 1452, in deleteVDIs
    self.deleteVDI(vdi)


Now, note the os.unlink line.  (I have seen another post that mentions a different error, but the idea is the same: find the vdi that is causing trouble and deal with it.)  It contains a path you can change into, made up of the SR uuid followed by the uuid of the vhd/vdi.  So at this point, I knew there was some problem with that vdi.  Then I ran xe vdi-list:


[root@v1 2fb93e09-a88c-e02e-183b-cfb1b3386f6c]# xe vdi-list uuid=7d1b380a-d812-4270-a673-358cfefa8ead

uuid ( RO)                : 7d1b380a-d812-4270-a673-358cfefa8ead
          name-label ( RW): base copy
    name-description ( RW): 
             sr-uuid ( RO): 2fb93e09-a88c-e02e-183b-cfb1b3386f6c
        virtual-size ( RO): 536870912000
            sharable ( RO): false
           read-only ( RO): true

Sometimes this would return nothing.  If this was the case, I would just start the sr-scan again.  If it returned a vdi, as above, then I would check to see if the file existed.  I would look in 

/var/run/sr-mount/<sr-uuid>/ (this is the location where valid vdis/vhds should be)

and do an 

ls -l 7d1b380a-d812-4270-a673-358cfefa8ead.vhd

In my case, this file never existed.  So I was pretty certain it was safe to "delete" this vdi, as it didn't really exist anyway.  To do this, I used: xe vdi-forget

[root@v1 2fb93e09-a88c-e02e-183b-cfb1b3386f6c]# xe vdi-forget uuid=7d1b380a-d812-4270-a673-358cfefa8ead

If the vdi had existed when I did the ls, I would have used vdi-destroy instead.  I probably would have backed up the file before doing the destroy, just in case...
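
Putting those checks together, the routine for each problem vdi boils down to roughly this (the uuids are the ones from my example, and the backup location is just a placeholder):

cd /var/run/sr-mount/2fb93e09-a88c-e02e-183b-cfb1b3386f6c
ls -l 7d1b380a-d812-4270-a673-358cfefa8ead.vhd

# file missing: the vdi is only a stale record, so forget it
xe vdi-forget uuid=7d1b380a-d812-4270-a673-358cfefa8ead

# file present: copy it somewhere safe first, then destroy the vdi
cp 7d1b380a-d812-4270-a673-358cfefa8ead.vhd /path/to/backup/
xe vdi-destroy uuid=7d1b380a-d812-4270-a673-358cfefa8ead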

Once I had done the vdi-forget, I re-ran the sr-scan and dealt with the next error as it arose...

As I slowly cleaned up more and more errors, sometimes the scan would take 20+ minutes before it gave me an exception.
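
If you don't want to read through the whole log each time, one way to jump to the latest failure (just an example, any way of reading SMlog works) is:

grep -A 15 "E X C E P T I O N" /var/log/SMlog | tail -20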

---

I have been monitoring and doing this for about 24 hours now.  The number of vdis marked "base copy" has dropped from over 200 to under 80 so far...  The SR is also reporting 600GB less used (600GB reclaimed)...  I'm hopeful that once all the errors are cleaned up, sr-scan will run successfully, coalesce the chains (so snapshots work again), and restore some or all of the lost space on the SR...

Also of note, at least so far for me, none of the problem vdis has actually existed on disk; I guess they were just still listed in the Xen SR DB or something...
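
If you want to track the same numbers yourself, one way (with your SR uuid in place of the placeholder) is to count the leftover "base copy" vdis and check the SR's utilisation:

xe vdi-list sr-uuid=<sr-uuid> name-label="base copy" params=uuid | grep -c uuid
xe sr-param-list uuid=<sr-uuid> | grep physical-utilisation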


--- 


Update 2012-05-20


After continuing the above process for about 3 days, sr-scan finally completed successfully.  I'm happy to report that my vdi chain-counts have been reset, and I have regained much of my "lost" space from my SR.




4 comments:

Anonymous said...

Hi,

I am wondering if this is caused by the manner in which you are creating and destroying snapshots.

Once a snapshot is created, you can get rid of it using 2 methods.

xe snapshot-destroy snapshot-uuid=

OR

xe snapshot-uninstall snapshot-uuid=

A snapshot-destroy will destroy the metadata or checkpoint for the snapshot, but leave the associated vdi behind.

Whereas a snapshot-uninstall gets rid of the metadata as well as destroying the vdi associated with the snapshot.

My instinct says you are issuing a snapshot-destroy, which leaves all the snapshot vdis behind. And once you have manually destroyed all of the vdis, it coalesces back.

Hope this makes sense, and helps bring closure to your problem.

Thank you.

Unknown said...

I was having a similar problem, and this Citrix article helped a bunch.

http://support.citrix.com/article/CTX123400/

I was having a coalescing problem with one VM's snapshots, apparently. After running the command:

xe host-call-plugin host-uuid= plugin=coalesce-leaf fn=leaf-coalesce args:vm_uuid=


against all my VMs, I instantly got my disk space back. Cheers

Dandy341 said...

I've been trying to clean up several snapshots after moving VMs between hosts. I also moved some from SRs on the SAN to local storage.
I started out with a snapshot list and then "xe snapshot-destroy uuid=the uuid I'm removing". Nothing ever happens even after waiting for two hours.
Is this normal?

Juan said...

Hi,
I have the same problem with "the snapshot chain is too long" when doing a snapshot in XenCenter.

I did a vm-list and there were about 35 items. I'm deleting (vm-forget) some of them, but some items cannot be forgotten because they are in use, and I don't know where or by what.

I can't do the sr-scan because it says "the uuid you supplied was invalid", but it was the correct uuid of the VM or XenServer.

I don't know what to do; I need to make snapshots for backups.

please help!

Thanks