
Thursday, February 26, 2015

Recover Openstack Ceph data with missing/no monitor(s)

Recently my Ceph monitors got blown away, along with all of the metadata associated with them.  Using this technique I was able to recover some of my data, but it took a lot of sleuthing.



In the top left corner is the script running in a loop over all of the unique 'header' files from the various OSDs.

The main script is in the top right corner.  Essentially we traverse the servers (nodes) and ceph osd instances throughout the cluster, collecting files (with find) that match the wildcard and are bigger than a byte.

The "wildcard" is the key, "13f2a30976b17" which is defined as replicated header file names for each rbd image on your ceph cluster.  If you had 10 images, with 3 replicas, you would find 30 header files in your cluster, with identical names for the replicas.  This would be okay, even if they are on the same server; because they are in separate osd data folders.

Using SSH we fetch a list of all the matching files on an OSD instance and dump it to a temp file.  We do a cut on the slash (/) folder separator, dump just the file names into a new file, and remove the temp file.

We then dump all the files into a CSV, with the OSD node location in column 1 and the file name in column 2.  The -u switch only snags unique instances, so replicas are dropped.
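A minimal sketch of that collection step, assuming hypothetical node names, the default OSD path, and the example prefix from above (the real scripts live in the repo linked at the end of this post):

# Loop over the OSD nodes, find non-empty object files for one image prefix,
# and build a CSV of "node,filename", keeping only unique file names.
prefix="13f2a30976b17"
for node in node-15 node-16 node-17 node-18; do        # hypothetical node names
    ssh root@$node "find /var/lib/ceph/osd/ -type f -name \"*${prefix}*\" -size +1c" \
        | while read -r path; do
            echo "$node,$(basename "$path")"
        done
done | sort -t, -u -k2,2 > objects.csv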

We then execute a little script called scp obs.  The tricky part here is the backslash in the Ceph object file names: use double quotes around the path in the scp command and escape the \ with \\, so that's three backslashes, inside double quotes, in the scp command.
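A rough example of what that escaping looks like (the host, placement-group folder, and object name here are placeholders):

# Three backslashes inside double quotes survive the local and remote shells and
# arrive as the single literal backslash that is part of the object file name.
scp "root@node-15:/var/lib/ceph/osd/ceph-0/current/4.4f/rbd\\\udata.13f2a30976b17.0000000000000000__head_XXXXXXXX__4" ./objects/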

Finally, once we have all the object files, we 'dd' them together as the final output.
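The reassembly can be sketched roughly like this, assuming the default 4 MB RBD object size and the usual object-name layout, where the 16-digit hex field in each name is the block index (the parsing details here are assumptions, not the exact script from the repo):

# Write every object into the output file at offset (block index * object size).
obj_size=$((4 * 1024 * 1024))
for f in objects/rbd\\udata.13f2a30976b17.*; do
    idx_hex=$(printf '%s\n' "$f" | sed 's/.*udata\.[0-9a-f]*\.\([0-9a-f]\{16\}\).*/\1/')
    idx=$((16#$idx_hex))
    dd if="$f" of=recovered.raw bs=$obj_size seek=$idx conv=notrunc 2>/dev/null
done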

Two quick notes,

In my cut command I use columns #8 and #9.  Thinking about it, this could give you a different result depending on where your OSD data folder is.  Mine is the default path, /var/lib/ceph/osd/ceph-0/current/4.4f/.

For my convenience, at the end I mv the "raw" file to qcow2, since I know that is what these images are.  This is based on the output of hexdump -C -n 4 -s 0 $first-block, where first-block is the object whose index is 16 zeroes (the first block in the object group).  It basically shows me the header of the first block, which is 'QFI' for qcow2.
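For reference, the qcow2 magic is the four bytes 0x51 0x46 0x49 0xfb, so the check looks something like this (the object name is a placeholder):

hexdump -C -n 4 -s 0 "rbd\udata.13f2a30976b17.0000000000000000__head_XXXXXXXX__4"
# 00000000  51 46 49 fb                                       |QFI.|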

I even converted one of the qcow2 files to a VDI and booted it successfully in VirtualBox.

The bash scripts can be found here:
https://github.com/charlescva/ceph-recovery

UPDATE:
It is the next morning, and I let my script run overnight. Check it out. :)


Wednesday, February 18, 2015

Administering Fuel with Openstack Juno Services

I have recently started using Openstack in an environment with 'production' value.  By this I mean that our Openstack instance is becoming a critical component of our business infrastructure, and at this point several development support services are tenants within it.

Openstack is not an easy solution.  Almost every core service is distributed, decentralized, and utilizes the full scope of its dependencies.  This results in good news and bad news.  The good news is that your infrastructure is so loosely coupled that failures will USUALLY be localized to a specific process or configuration setting.  The bad news is that, until you learn the terminology and components, you'll be running around like a madman trying to find the various configs and error logs.

Ceph

First you will need to ensure your file system is stable.  Ceph has been with Openstack for a long time.  Yes, it is different from any other file system you're likely used to, which means you'll have to learn something new.  One of the biggest issues with migration and spawning VMs can stem from failures to read/write RAW data to the distributed file system.

The best thing to do first is read over this paper on RUSH, or Replication Under Scalable Hashing: http://www.ssrc.ucsc.edu/Papers/honicky-ipdps04.pdf.

The gist of this paper should help you understand that Ceph clients in Openstack use the Jenkins hash (http://en.wikipedia.org/wiki/Jenkins_hash_function) with a tree of weighted buckets (the CRUSH map, http://ceph.com/docs/master/rados/operations/crush-map/) and a map with a default of 256 placement groups (http://ceph.com/docs/master/rados/operations/placement-groups/) to figure out where objects are stored.  Also, Ceph is not a file system, per se, but an "object store", which means there is no central server the clients must negotiate with to read and write object data.  The Ceph documentation is phenomenal, and you should familiarize yourself with it as much as you can.  Most of your questions are answered in the documentation; you'll just need to be patient, read it all at a decent pace, and let the information resonate for a night before digging into it again.  After a couple of days it will start to make more sense.  Here are some common commands to take a peek at:

  • ceph osd tree
  • ceph -w
  • ceph osd reweight (don't just run this randomly, understand what it does first)
Also keep in mind there have been bug reports regarding applying a new CRUSH map to a running cluster, so spend a lot of time looking at a sample CRUSH map in a test cluster before applying a new one.  It is likely that you can resolve a lot of your issues by using reweight and/or modifying the number of replicas in heavily used storage pools, like your Openstack volumes, images and compute pool for ephemeral storage.
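A few examples of what that looks like in practice (the OSD id and pool name are placeholders; try these against a test cluster first):

ceph osd tree                      # show the CRUSH hierarchy and current weights
ceph osd reweight 6 0.8            # temporarily lower the weight of osd.6
ceph osd pool get volumes size     # current replica count for the 'volumes' pool
ceph osd pool set volumes size 3   # change the replica count for that pool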

RBD (Rados Block Device)

RBD is used on top of the Ceph object store.  This provides the API Openstack uses to connect your volumes and images to the hypervisor you're using (Hopefully QEMU, because I like it and want it supported).  Here are some helpful commands:
  • rados df
  • rbd import
  • rbd export
  • rbd ls|rm|mv
  • qemu-img convert (although not rbd specific, relevant when dealing with RAW rbd images and compressing them to qcow2 for moving across the network; see the quick sketch below)
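As a quick illustration, pulling a volume out of the cluster and shrinking it for transport looks roughly like this (pool and image names are placeholders; the same pattern shows up in the upgrade post further down):

rados df                                                 # per-pool object/byte usage
rbd ls --pool volumes                                    # list images in the 'volumes' pool
rbd export --pool volumes volume-<uuid> ./vol.raw        # dump the image to a local raw file
qemu-img convert -f raw -O qcow2 ./vol.raw ./vol.qcow2   # compress for the trip over the network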
In an earlier post on this blog, you will see my experience upgrading Openstack, where I manually migrated each of my VMs from an Icehouse cluster to Juno.  I had some hardware constraints and it was tough, but in the end it worked very well.

nova, cinder, glance CLI

You won't get by on the UI alone.  The bash command line on an Openstack controller is your best tool.  Don't be afraid to poke around the mysql databases for cinder, glance and nova.  Use the nova, glance and cinder tools with the 'help' argument and read the usage.  These tools are required to communicate with the API in a standardized way that is supported by the developers of Openstack.  If you're using 3rd party providers like Mirantis Fuel for Openstack, then you will need to use their documentation for maintaining Openstack environments.  Be advised, some of these 3rd party tools lack the support and capability to perform some of the tasks you will need to know to properly maintain the environment.

Here are the ones to know:
  • nova boot
    • --availability-zone
    • --nic id
    • --flavor
    • flags for Volume or Image backed.
  • nova service-list
  • nova service-delete (gets a mention because it's not in Havana, but it is in Juno!)
Seriously though, use mysql and don't be afraid to adjust the instance metadata.  Sometimes a VM is actually OFF, but the Horizon UI will show it as 'Shutting Down...' or 'Running'.  You can verify the status of your VM by SSHing into the compute node hosting the instance and, as root, running:

# ps -ef | grep kvm

You'll see the instance id in the run command, as well as a bunch of other args.  Be advised, the domain.xml virsh uses is generated in code by python, using the information in mysql to do so.  So modifying things like the video driver or video RAM requires changes to the flavor and image metadata.  I recently saw in Juno an option to nova boot with args passing metadata key/values to set in the virsh domain, although I have not tried it yet.  I believe it is here: http://docs.openstack.org/cli-reference/content/novaclient_commands.html#novaclient_subcommand_boot, and the boot option appears to be --image-with.
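If you do end up poking at the database, a hedged example of cross-checking (and only then carefully correcting) the stored state is below; the column names assume the standard nova schema, and you should back up the database before running any UPDATE:

mysql nova -e "SELECT uuid, vm_state, task_state, power_state FROM instances WHERE uuid='<instance-uuid>';"
# If the VM is really off but stuck in a task state, something like this clears it
# (power_state 4 means SHUTDOWN in nova):
# mysql nova -e "UPDATE instances SET task_state=NULL, vm_state='stopped', power_state=4 WHERE uuid='<instance-uuid>';"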

Neutron

Neutron is a bit overwhelming.  Just know that the Open vSwitch service on your compute nodes handles the networking for the VMs running there.  Just because your L3 agent(s) are down and you cannot get to a VM using its public IP does not mean that the VM is off; it just means that the external connection isn't being routed.  Ensure all of these services are running and configured correctly (a couple of quick checks follow below).  This section is intentionally short because of the vast configuration options with neutron.
  • neutron agent-list
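A couple of quick checks I lean on (agent and log names may vary by distro and release, so treat these as a sketch):

neutron agent-list                                 # 'xxx' in the alive column means the agent is down
ovs-vsctl show                                     # on a compute node: the bridges, ports and tunnels Open vSwitch knows about
tail -f /var/log/neutron/openvswitch-agent.log     # assumed log path; check /var/log/neutron/ on your node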
Lastly, I need to thank the developers at Mirantis Fuel and others hanging out on the freenode IRC channel #fuel.  I could not have learned as much as I know at this point without the help of a few users in there.  Thank you guys for your gracious support throughout my adoption of Openstack.

Monday, January 26, 2015

Upgrading Openstack with Fuel

DISCLAIMER:  This is a development environment as well as a work in progress.  Do not attempt this on a 'production' system without going through the process first on a non-production system and learning how to troubleshoot it at various points.

Also, this is an 'in place' upgrade, which means you need to have roughly less than half of your hardware resources in use, so you can create a new environment alongside the old one and migrate everything.  Unless you only plan to migrate a few things; however, that is an entirely different subject.

I have tried to break this process down into the following steps:

  1. Plan Upgrade
  2. Migrate Instances(VMs) to free up additional nodes.
  3. Remove free Node's Ceph OSD from Ceph Cluster
  4. Clean up Neutron agents.
  5. Disable Nova Services
  6. Upgrade your Fuel Server
  7. Deploy new Environment (Juno here)
  8. Export Volumes from Old (Icehouse) Environment
  9. Import Volumes as Images and boot new VMs w/ new Volume
  10. Repeat Step 9 for all instances to keep, and Delete the old Environment


We have been running a "steady" instance of Fuel 5.1 with 3 controllers configured with "HA" (High-Availability) and 4 Nodes operating as both CEPH and Compute nodes.

We started finding small timeout issues and bugs w/ 5.1, so we decided it was time to upgrade.  6.0 is out, so we had to download the 5.1.1 update and the 6.0 update, and run them both in order.  Here are the links to the software and installation guide(s).  Please be advised, this information is current as of January 26, 2015.

https://www.fuel-infra.org/ - Latest Download.  (Fuel 6.0 for me today.)


Step 1.) Plan

Our plan is to decommission a single Compute/Storage (Ceph OSD) node and a single Controller.  Fuel will not like running HA with 2 controllers, but it should be okay to deal with that while we migrate our environment.

Step 2.) Migrate instances and free up a node.

Live Migration Guide - Using this guide, I was able to determine the instances I had running on Node-18 and migrate them to a node that had fewer.

[root@node-10 ~]# nova host-describe node-18.ccri.com
+------------------+----------------------------------+-----+-----------+---------+
| HOST             | PROJECT                          | cpu | memory_mb | disk_gb |
+------------------+----------------------------------+-----+-----------+---------+
| node-18.ccri.com | (total)                          | 24  | 72492     | 8731    |
| node-18.ccri.com | (used_now)                       | 13  | 27136     | 260     |
| node-18.ccri.com | (used_max)                       | 10  | 20480     | 200     |
| node-18.ccri.com | dc25784fc9d94e58b3887045756cf9e8 | 8   | 16384     | 160     |
| node-18.ccri.com | 0dc5c66d16b04d48b07c868cc195f46a | 2   | 4096      | 40      |
+------------------+----------------------------------+-----+-----------+---------+
[root@node-10 ~]# nova list --host node-18.ccri.com --all-tenants
+--------------------------------------+----------+--------+------------+-------------+-------------------------------------+
| ID                                   | Name     | Status | Task State | Power State | Networks                            |
+--------------------------------------+----------+--------+------------+-------------+-------------------------------------+
| b58af781-bb57-4c35-bbe5-4153e2d4bb6e | alfresco | ACTIVE | -          | Running     | net04=192.168.111.10, 192.168.3.128 |
| 2006d7db-d18e-4390-ae1c-40dd77644853 | hannibal | ACTIVE | -          | Running     | ONR=172.16.0.39, 192.168.3.161      |
+--------------------------------------+----------+--------+------------+-------------+-------------------------------------+

In my case, node-15 had the fewest instances, so I decided to migrate the two instances on 18 over to 15, starting with 'alfresco'.

[root@node-10 ~]# nova live-migration b58af781-bb57-4c35-bbe5-4153e2d4bb6e node-15.ccri.com
ERROR: HTTPConnectionPool(host='192.168.3.100', port=8774): Max retries exceeded with url: /v2/0dc5c66d16b04d48b07c868cc195f46a/servers/b58af781-bb57-4c35-bbe5-4153e2d4bb6e/action (Caused by : )


Notice the ERROR message.  I believe this is because the migration took longer than expected.  However, I verified through the nova list command, as well as the Horizon UI, that the server was still migrating hosts, so I waited.  It finished within 5 minutes, and I then verified:


[root@node-10 ~]# nova list --host node-18.ccri.com --all-tenants
+--------------------------------------+----------+--------+------------+-------------+--------------------------------+
| ID                                   | Name     | Status | Task State | Power State | Networks                       |
+--------------------------------------+----------+--------+------------+-------------+--------------------------------+
| 2006d7db-d18e-4390-ae1c-40dd77644853 | hannibal | ACTIVE | -          | Running     | ONR=172.16.0.39, 192.168.3.161 |
+--------------------------------------+----------+--------+------------+-------------+--------------------------------+
[root@node-10 ~]# nova list --host node-15.ccri.com --all-tenants
+--------------------------------------+-----------------+--------+------------+-------------+--------------------------------------+
| ID                                   | Name            | Status | Task State | Power State | Networks                             |
+--------------------------------------+-----------------+--------+------------+-------------+--------------------------------------+
| 2e1057ee-48d5-4b7f-aa9e-14b0103535ec | Mantis          | ACTIVE | -          | Running     | ONR=172.16.0.10, 192.168.3.138       |
| b58af781-bb57-4c35-bbe5-4153e2d4bb6e | alfresco        | ACTIVE | -          | Running     | net04=192.168.111.10, 192.168.3.128  |
| 655fba2a-9867-4305-935c-e6b3c3a84368 | docker-registry | ACTIVE | -          | Running     | net04=192.168.111.7, 192.168.3.130   |
| f368bab9-e054-4bda-84ee-e5633e6381cb | docker01        | ACTIVE | -          | Running     | DS Network=172.16.0.4, 192.168.3.140 |
| b336499d-7314-464f-9f98-ee1ed0ddd787 | inventory       | ACTIVE | -          | Running     | net04=192.168.111.8, 192.168.3.131   |
| 1b57d04c-29c7-4a1b-8cac-114f491ec5d3 | onr-node-4      | ACTIVE | -          | Running     | ONR=172.16.0.54, 192.168.3.167       |
+--------------------------------------+-----------------+--------+------------+-------------+--------------------------------------+


Also notice that the public IP did not change.  This is good. :)  Repeat this process to free up a second node to support the Ceph install on Juno, or find a new server in your budget to use.

Step 3.) Remove the Ceph OSD from Ceph cluster

Since I am removing Node-18, I will remove its Ceph OSD from the cluster, working from Node-18 itself!

http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#take-the-osd-out-of-the-cluster

List some pools for sanity.

 [root@node-18 ~]# ceph osd lspools
0 data,1 metadata,2 rbd,3 images,4 volumes,5 .rgw.root,6 compute,7 .rgw.control,8 .rgw,9 .rgw.gc,10 .users.uid,11 .rgw.buckets.index,12 .rgw.buckets,

Determine which OSD ID this node has.  #6 in this case

[root@node-18 ~]# ps -ef | grep ceph
root      3258     1  2  2014 ?        1-01:46:17 /usr/bin/ceph-osd -i 6 --pid-file /var/run/ceph/osd.6.pid -c /etc/ceph/ceph.conf --cluster ceph
root     11490 10726  0 17:25 pts/0    00:00:00 grep ceph

Mark it for removal

[root@node-18 ~]# ceph osd out 6
marked out osd.6. 

Watch the rebalance happen

[root@node-18 ~]# ceph -w
    cluster 994f6ed1-69c0-4e8b-8c76-fc1186c7eda5
     health HEALTH_WARN mon.node-10 low disk space; mon.node-12 low disk space; mon.node-13 low disk space
     monmap e3: 3 mons at {node-10=10.10.20.3:6789/0,node-12=10.10.20.5:6789/0,node-13=10.10.20.6:6789/0}, election epoch 676, quorum 0,1,2 node-10,node-12,node-13
     osdmap e217: 7 osds: 7 up, 6 in
      pgmap v6442467: 5312 pgs, 13 pools, 1301 GB data, 301 kobjects
            2304 GB used, 5895 GB / 8199 GB avail
                   1 active+clean+scrubbing+deep
                5311 active+clean

You will see a lot of entries that describe what is happening.  Most importantly, something like this:

9854/630322 objects degraded (1.563%)

This means that ceph's RBD objects (which hold your openstack data) do not have enough replicas for 9854 objects.  Ceph will copy the missing replicas to other hosts, so you will have to wait until all objects that were on your OSD node are rebalanced.  This will use a lot of network I/O, and your active VMs will suffer, so warn your users before doing this.

Again, ensure this is done on any additional nodes you want to delete.
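Once ceph -w settles back to active+clean, the rest of the removal (per the Ceph docs linked above) goes roughly like this; the stop command depends on your init system:

ssh node-18 'stop ceph-osd id=6'     # or: service ceph stop osd.6
ceph osd crush remove osd.6          # take it out of the CRUSH map
ceph auth del osd.6                  # remove its auth key
ceph osd rm 6                        # remove the OSD from the cluster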

Step 4.) Cleanup Neutron

You may notice that once your node is gone, there are some stale neutron agents marked dead.

The following command will list all dead (xxx) agents and show their details.  Change 'agent-show' to 'agent-delete' to remove them permanently:

for i in $(neutron agent-list | grep "xxx" | awk '{print $2}'); do neutron agent-show $i; done;

Step 5.) Cleanup Nova Services

Just like the neutron agents, the nova services on your old node may still show up in Horizon as 'down'.  You can use something like the command below to disable them.

for i in $(nova service-list | grep node-13 | awk '{print $2}'); do nova service-disable node-13.ccri.com $i; done;

I did not figure out how to delete the services, but it doesn't really matter, because I will be deleting the entire environment once the upgrade and migrations are complete.


Step 6.) Upgrade to Fuel 6.0 if you didn't already.

Once your instances have all been migrated, you should be able to use the Fuel UI to decommission the node and one controller.  Run the update.sh either before or after; I did it before migrating instances.

Step 7.) Create the new Environment with Fuel UI.

You should now have a free controller and nodes on which to build a Juno Openstack environment, so you can start migrating your instances from the old Icehouse Openstack environment, hopefully with virtually zero downtime.

Ceph installation requires at least 2 nodes.

Step 8.) Export Volumes from Old Environment

There are likely a variety of ways to import/export volumes in openstack.  I have found the following method works well.

First, find a place on your old controller w/ extra disk.  Generally /var/lib/mongo has a lot of space with the default partitioning.  Locate the UUID for a volume using nova or cinder list; instance IDs are used for the ephemeral disks, volume IDs are used for 'volume' disks.  Make sure the instance using the disk is shut off.

Export it with rbd, then compress it to qcow2 so you can pass it over the Fuel network to your other environment's controller.  In this example, I am exporting a 'Volume' as a 'raw' disk, then converting it.

[root@node-10 mongo]# rbd export --pool=volumes volume-0f2a87ec-74c5-4356-a4e7-12fffd6fe5ea docker-registry.raw                                                                                                                               
Exporting image: 100% complete...done.
[root@node-10 mongo]# qemu-img convert -f raw -O qcow2 ./docker-registry.raw docker-registry.qcow2

To support SCP, on the old environment's controller modify /etc/ssh/sshd_config and set PasswordAuthentication to 'yes' at the bottom of the file (it also appears in the middle of the file, but commented out).  Then # useradd temp, set the password with # passwd temp, and # service sshd restart, and you should now be able to scp the data from the new controller.
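Roughly, the changes on the old controller boil down to the following (back up sshd_config first; the sed assumes the active line currently reads 'PasswordAuthentication no'):

sed -i 's/^PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
useradd temp && passwd temp
service sshd restart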

Step 9.) Import Volumes as Images and launch

Here, my new Juno environment's controller is 'node-35'

[root@node-35 ~]# scp temp@node-10:/var/lib/mongo/docker-registry.qcow2 .                                                                   
Warning: Permanently added 'node-10,10.20.0.3' (RSA) to the list of known hosts.
temp@node-10's password:
docker-registry.qcow2                                                  65% 7489MB 90.3MB/s    00:21 ETA

Get the byte size first:
[root@node-35 ~]# ls -al docker-registry.qcow2 

-rw-r--r-- 1 root root 11334057984 Feb 11 16:34 docker-registry.qcow2

And now, import it as an image with glance:

[root@node-35 ~]# glance image-create --size 11334057984 --name docker-registry --store rbd --disk-format qcow2 --container-format bare --file ./docker-registry.qcow2

(Screenshot: importing into Glance while watching Ceph/rbd)

+------------------+--------------------------------------+
| Property         | Value                                |
+------------------+--------------------------------------+
| checksum         | a209fafa8ae5369e0a93b30e41c4e27c     |
| container_format | bare                                 |
| created_at       | 2015-02-11T16:40:09                  |
| deleted          | False                                |
| deleted_at       | None                                 |
| disk_format      | qcow2                                |
| id               | 22209ea8-2287-425b-9e45-c79ec210d380 |
| is_public        | False                                |
| min_disk         | 0                                    |
| min_ram          | 0                                    |
| name             | docker-registry                      |
| owner            | aeea9a5fd7284450a3468915980a8c45     |
| protected        | False                                |
| size             | 11334057984                          |
| status           | active                               |
| updated_at       | 2015-02-11T16:47:53                  |
| virtual_size     | None                                 |
+------------------+--------------------------------------+


At this point, you should be able to launch a new instance from the converted Image: specify this glance image as the source and create a new Volume with the size of the original volume.  In my case with this docker registry, that was 100GB, even though qcow2 compressed it down to 11GB.
Screenshots: creating a Volume from the Image (with the specified size of the original volume, not the qcow2 size!), then booting an instance from the new Volume.
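For reference, a rough CLI equivalent of those UI steps (the image ID and the 100GB size are from this example; the flavor and network IDs are placeholders):

cinder create --image-id 22209ea8-2287-425b-9e45-c79ec210d380 --display-name docker-registry-vol 100
nova boot --flavor <flavor> --boot-volume <new-volume-id> --nic net-id=<net-id> docker-registry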


If at any point the UI shows an error, just watch the OSD pool stats: # watch ceph osd pool stats volumes
You should see client io that is pretty heavy.  When it is done, you can refresh your volumes in the Horizon UI and the new volume should show up as 'Available'.

Step 10.) "Rinse and Repeat"

Go ahead and repeat steps 8 and 9 for all the instances you want migrated.  Set up your public IPs, update your external DNS entries, etc., and wait a day to make sure things are stable.  Afterwards, go ahead and delete the old environment, add the freed nodes to your new environment, and migrate some instances to lighten the load on your first couple of nodes, and you should be good to go!