
Thursday, February 26, 2015

Recover Openstack Ceph data with missing/no monitor(s)

Recently my ceph monitors got blown away, and with them all of the cluster metadata the monitors held. Using the technique below I was able to recover some of my data, but it took a lot of sleuthing.

In the top left corner of the screenshot is the script running in a loop over all of the unique 'header' files from the various osds.

The main script is in the top right corner.  Essentially we traverse the servers (nodes) and ceph osd instances throughout the cluster, collecting files (with find) that match the wildcard and are bigger than a byte.
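
Roughly, the traversal looks like this (the node names and osd path are assumptions from my setup; the prefix is the one discussed below):

    # hypothetical node names; substitute your own cluster members
    NODES="node1 node2 node3"
    PREFIX="13f2a30976b17"
    for node in $NODES; do
        # collect every object file matching the image prefix that is
        # larger than one byte
        ssh root@$node "find /var/lib/ceph/osd -name '*${PREFIX}*' -size +1c"
    done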

The "wildcard" is the key, "13f2a30976b17", which appears in the replicated header file names for each rbd image on your ceph cluster. If you had 10 images with 3 replicas, you would find 30 header files in your cluster, with identical names across the replicas. That is fine even when replicas land on the same server, because they live in separate osd data folders.
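
As a sanity check, counting hits for one prefix on a single node should line up with that replica math (the exact on-disk header name varies with the rbd format, so treat the pattern as illustrative):

    # identically named replicas live under different osd data folders
    # (ceph-0, ceph-1, ...), so a plain find can see all of them
    find /var/lib/ceph/osd -name '*header*13f2a30976b17*' | wc -l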

Using SSH, we fetch a list of all the matching files on an osd instance and dump it to a temp file. We then cut on the slash (/) path separator, write just the file names to a new file, and remove the temp.
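
Per node, that step looks something like this (reusing NODES and PREFIX from the sketch above, and assuming the default /var/lib/ceph path, which is where the field numbers in the notes below come from):

    for node in $NODES; do
        # full paths to a temp file, then keep just "<pg-dir>/<file name>"
        ssh root@$node "find /var/lib/ceph/osd -name '*${PREFIX}*' -size +1c" > /tmp/$node.paths
        cut -d/ -f8,9 /tmp/$node.paths > $node.files
        rm /tmp/$node.paths
    done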

We then dump all the files into a csv, with the osd node location in column 1 and the file name in column 2. The -u switch on sort keeps only unique instances, so replicas are dropped.
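
A sketch of that step, building on the per-node file lists above:

    # node in column 1, file name in column 2; sorting on column 2 with
    # -u keeps one row per unique object name, dropping the replicas
    for node in $NODES; do
        sed "s|^|$node,|" $node.files
    done | sort -t, -k2 -u > objects.csv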

We then execute a little script called scp obs. The tricky part here is the backslash in the ceph file names: use double quotes in the scp command and put \\ in front of each \. So that's three backslashes, surrounded by double quotes, in the scp command.
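
The quoting is easy to get wrong, so it's worth spelling out. Typed by hand, the remote path needs three backslashes inside double quotes: the local shell eats one, the remote shell spawned by scp eats another, and the literal backslash in the file name survives. When the path is built in a variable instead, the local-shell pass is skipped, so doubling each backslash is enough (the path here is my default, and the pg directory varies per object):

    # $name is one object file name (it contains literal backslashes);
    # scp's remote shell eats one level of escaping, so double each \
    esc=${name//\\/\\\\}
    scp "root@$node:/var/lib/ceph/osd/ceph-0/current/4.4f/$esc" ./objects/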

Finally, once we have all the object files, we 'dd' them together as the final output.
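
The reassembly hinges on the hex block index at the end of each object name: multiply it by the object size to get the byte offset in the image. A sketch, assuming the default 4MB rbd object size and file names trimmed down to prefix-plus-index; missing objects simply leave holes in the sparse output:

    OBJ_SIZE=$((4 * 1024 * 1024))   # default rbd object size
    for f in ./objects/*13f2a30976b17.*; do
        idx=${f##*.}    # trailing hex block index from the name
        # place the object at its offset; notrunc preserves earlier writes
        dd if="$f" of=recovered.raw bs=$OBJ_SIZE seek=$((16#$idx)) conv=notrunc
    done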

Two quick notes:

In my cut command I use columns #8 and #9. Thinking about it, this could give you a different result depending on where your osd data folder is. Mine is the default path, /var/lib/ceph/osd/ceph-0/current/4.4f/.
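
To see why those field numbers are tied to the default path (field 1 is the empty string in front of the leading slash, so the pg directory lands at 8 and the object name at 9):

    $ echo "/var/lib/ceph/osd/ceph-0/current/4.4f/some_object" | cut -d/ -f8,9
    4.4f/some_object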

For my convenience, at the end I mv the "raw" file to a .qcow2 extension, since I know that is what these images are. This is based on the output of hexdump -C -n 4 -s 0 $first_block, where first_block is the object whose name ends in 16 zeroes (the first block in the object group). It shows me the magic at the start of the first block, which is 'QFI' for qcow2.
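
For reference, the qcow2 magic is the four bytes 51 46 49 fb, so the check looks something like this (the file name is a placeholder):

    $ hexdump -C -n 4 first_block
    00000000  51 46 49 fb                                       |QFI.|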

I even converted one of the qcow2 files to a VDI and booted it successfully in VirtualBox.
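
If you want to reproduce that last step, qemu-img handles the conversion (file names are placeholders):

    # convert the recovered image into a VirtualBox disk
    qemu-img convert -f qcow2 -O vdi recovered.qcow2 recovered.vdi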

The bash scripts can be found here:
https://github.com/charlescva/ceph-recovery

UPDATE:
It is the next morning, and I let my script run overnight. Check it out. :)