DISCLAIMER: This describes a development environment and is a work in progress. Do not attempt this on a 'production' system without first going through the process on a non-production system and learning how to troubleshoot it at various points.
Also, this is an 'in place' upgrade, which means you need to have less than roughly 49% of your hardware resources in use, so that you can create a new environment alongside the old one and migrate everything. The exception is if you only plan to migrate a few things, but that is an entirely different subject.
I have tried to break this process down into the following steps:
- Plan Upgrade
- Migrate Instances (VMs) to free up additional nodes.
- Remove the freed Node's Ceph OSD from the Ceph Cluster
- Clean up Neutron agents.
- Disable Nova Services
- Upgrade your Fuel Server
- Deploy new Environment (Juno here)
- Export Volumes from Old (Icehouse) Environment
- Import Volumes as Images and boot new VMs w/ new Volume
- Repeat Steps 8 and 9 for all instances to keep, and Delete the old Environment
We have been running a "steady" instance of Fuel 5.1 with 3 controllers configured for "HA" (High-Availability) and 4 nodes operating as both Ceph and Compute nodes.
We started finding small timeout issues and bugs with 5.1, so we decided it was time to upgrade. 6.0 is out, so we had to download both the 5.1.1 update and the 6.0 update, and run them in order. Here are the links to the software, and installation guide(s). Please be advised, this information is current as of January 26, 2015.
Step 1.) Plan
Our plan is to decommission a single Compute/Storage (Ceph OSD) node and a single Controller. Fuel will not like running HA with 2 controllers, but it should be okay to live with that while we migrate our environment.
Step 2.) Migrate instances and free up a node.
Live Migration Guide - Using this guide, I was able to determine which instances I had running on node-18, and migrate them to a node that had fewer.
[root@node-10 ~]# nova host-describe node-18.ccri.com
+------------------+----------------------------------+-----+-----------+---------+
| HOST | PROJECT | cpu | memory_mb | disk_gb |
+------------------+----------------------------------+-----+-----------+---------+
| node-18.ccri.com | (total) | 24 | 72492 | 8731 |
| node-18.ccri.com | (used_now) | 13 | 27136 | 260 |
| node-18.ccri.com | (used_max) | 10 | 20480 | 200 |
| node-18.ccri.com | dc25784fc9d94e58b3887045756cf9e8 | 8 | 16384 | 160 |
| node-18.ccri.com | 0dc5c66d16b04d48b07c868cc195f46a | 2 | 4096 | 40 |
+------------------+----------------------------------+-----+-----------+---------+
[root@node-10 ~]# nova list --host node-18.ccri.com --all-tenants
+--------------------------------------+----------+--------+------------+-------------+-------------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+----------+--------+------------+-------------+-------------------------------------+
| b58af781-bb57-4c35-bbe5-4153e2d4bb6e | alfresco | ACTIVE | - | Running | net04=192.168.111.10, 192.168.3.128 |
| 2006d7db-d18e-4390-ae1c-40dd77644853 | hannibal | ACTIVE | - | Running | ONR=172.16.0.39, 192.168.3.161 |
+--------------------------------------+----------+--------+------------+-------------+-------------------------------------+
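To get a quick read on how loaded a host is before choosing a migration target, the host-describe table can be reduced to percentages. This is a minimal sketch and not part of the original procedure: `host_usage` is a hypothetical helper name, and it assumes the column order shown above (HOST, PROJECT, cpu, memory_mb, disk_gb).

```shell
# host_usage: hypothetical helper. Reads `nova host-describe` output on
# stdin and prints how much of each resource is in use right now.
# Assumes the column order HOST | PROJECT | cpu | memory_mb | disk_gb.
host_usage() {
  awk -F'|' '
    # Trim whitespace around every cell.
    { for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i) }
    $3 == "(total)"    { tc = $4 + 0; tm = $5 + 0; td = $6 + 0 }
    $3 == "(used_now)" { uc = $4 + 0; um = $5 + 0; ud = $6 + 0 }
    END {
      # Percentages are truncated to whole numbers.
      if (tc > 0 && tm > 0 && td > 0)
        printf "cpu=%d%% memory=%d%% disk=%d%%\n", int(uc * 100 / tc), int(um * 100 / tm), int(ud * 100 / td)
    }'
}
```

Usage would be `nova host-describe node-18.ccri.com | host_usage`. For the node-18 table above this works out to cpu=54% memory=37% disk=2%.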
In my case, node-15 had the fewest instances, so I decided to migrate the two instances on node-18 over to node-15, starting with 'alfresco'.
[root@node-10 ~]# nova live-migration b58af781-bb57-4c35-bbe5-4153e2d4bb6e node-15.ccri.com
ERROR: HTTPConnectionPool(host='192.168.3.100', port=8774): Max retries exceeded with url: /v2/0dc5c66d16b04d48b07c868cc195f46a/servers/b58af781-bb57-4c35-bbe5-4153e2d4bb6e/action (Caused by : )
Notice the ERROR message. I believe this is because the migration took longer than the client was willing to wait. However, I verified through the nova list command, as well as the Horizon UI, that the server was still migrating hosts, so I waited. It finished within 5 minutes, and I then verified:
[root@node-10 ~]# nova list --host node-18.ccri.com --all-tenants
+--------------------------------------+----------+--------+------------+-------------+--------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+----------+--------+------------+-------------+--------------------------------+
| 2006d7db-d18e-4390-ae1c-40dd77644853 | hannibal | ACTIVE | - | Running | ONR=172.16.0.39, 192.168.3.161 |
+--------------------------------------+----------+--------+------------+-------------+--------------------------------+
[root@node-10 ~]# nova list --host node-15.ccri.com --all-tenants
+--------------------------------------+-----------------+--------+------------+-------------+--------------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+-----------------+--------+------------+-------------+--------------------------------------+
| 2e1057ee-48d5-4b7f-aa9e-14b0103535ec | Mantis | ACTIVE | - | Running | ONR=172.16.0.10, 192.168.3.138 |
| b58af781-bb57-4c35-bbe5-4153e2d4bb6e | alfresco | ACTIVE | - | Running | net04=192.168.111.10, 192.168.3.128 |
| 655fba2a-9867-4305-935c-e6b3c3a84368 | docker-registry | ACTIVE | - | Running | net04=192.168.111.7, 192.168.3.130 |
| f368bab9-e054-4bda-84ee-e5633e6381cb | docker01 | ACTIVE | - | Running | DS Network=172.16.0.4, 192.168.3.140 |
| b336499d-7314-464f-9f98-ee1ed0ddd787 | inventory | ACTIVE | - | Running | net04=192.168.111.8, 192.168.3.131 |
| 1b57d04c-29c7-4a1b-8cac-114f491ec5d3 | onr-node-4 | ACTIVE | - | Running | ONR=172.16.0.54, 192.168.3.167 |
+--------------------------------------+-----------------+--------+------------+-------------+--------------------------------------+
Also notice that the public IP did not change. This is good. :) Repeat this process to free up a second node to support the Ceph install on Juno, or find a new server in your budget to use.
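If a node holds more than a couple of instances, the migrate-then-verify cycle above can be scripted. The following is only a sketch under assumptions: `instance_ids` and `migrate_all` are hypothetical helper names, the 15 second poll interval and the limit of 40 checks are arbitrary, and success is judged by polling `nova list` because the client may report an error while the migration is still in progress.

```shell
# instance_ids: hypothetical helper. Reads `nova list` table output on
# stdin and prints the instance UUID from each data row.
instance_ids() {
  awk -F'|' '
    NF > 2 {
      id = $2
      gsub(/[ \t]/, "", id)
      # A UUID is 36 characters of hex digits and dashes.
      if (length(id) == 36 && id ~ /^[0-9a-f-]+$/) print id
    }'
}

# migrate_all SRC DST: live-migrate every instance off SRC, one at a time.
# Assumption: an instance that no longer appears under SRC has moved.
migrate_all() {
  src=$1
  dst=$2
  for id in $(nova list --host "$src" --all-tenants | instance_ids); do
    nova live-migration "$id" "$dst" || echo "client error for $id, polling anyway" >&2
    tries=0
    while nova list --host "$src" --all-tenants | grep -q "$id"; do
      tries=$((tries + 1))
      if [ "$tries" -ge 40 ]; then
        echo "giving up on $id after $tries checks" >&2
        return 1
      fi
      sleep 15
    done
  done
}
```

Usage would be `migrate_all node-18.ccri.com node-15.ccri.com`. It is still worth confirming the result with `nova list --host` on the destination, as done above.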
Step 3.) Remove the Ceph OSD from Ceph cluster
Since I am removing node-18, I will take its Ceph OSD out of the cluster, working from node-18 itself.
http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#take-the-osd-out-of-the-cluster
List some pools for sanity.
[root@node-18 ~]# ceph osd lspools
0 data,1 metadata,2 rbd,3 images,4 volumes,5 .rgw.root,6 compute,7 .rgw.control,8 .rgw,9 .rgw.gc,10 .users.uid,11 .rgw.buckets.index,12 .rgw.buckets,
Determine which OSD ID this node has. #6 in this case
[root@node-18 ~]# ps -ef | grep ceph
root 3258 1 2 2014 ? 1-01:46:17 /usr/bin/ceph-osd -i 6 --pid-file /var/run/ceph/osd.6.pid -c /etc/ceph/ceph.conf --cluster ceph
root 11490 10726 0 17:25 pts/0 00:00:00 grep ceph
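Reading the OSD ID off the process list by eye works for one node; a small filter can do the same thing when a node runs several OSDs. This is a sketch, with `osd_ids` as a hypothetical helper name, and it assumes the daemon is started with `-i <id>` as in the output above.

```shell
# osd_ids: hypothetical helper. Reads `ps -ef` output on stdin and prints
# the numeric ID of every ceph-osd daemon found (one per line).
osd_ids() {
  sed -n 's/.*ceph-osd -i \([0-9][0-9]*\).*/\1/p'
}
```

Usage would be `ps -ef | osd_ids`, which prints 6 for the process list above.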
Mark it for removal
[root@node-18 ~]# ceph osd out 6
marked out osd.6.
Watch the rebalance happen
[root@node-18 ~]# ceph -w
cluster 994f6ed1-69c0-4e8b-8c76-fc1186c7eda5
health HEALTH_WARN mon.node-10 low disk space; mon.node-12 low disk space; mon.node-13 low disk space
monmap e3: 3 mons at {node-10=10.10.20.3:6789/0,node-12=10.10.20.5:6789/0,node-13=10.10.20.6:6789/0}, election epoch 676, quorum 0,1,2 node-10,node-12,node-13
osdmap e217: 7 osds: 7 up, 6 in
pgmap v6442467: 5312 pgs, 13 pools, 1301 GB data, 301 kobjects
2304 GB used, 5895 GB / 8199 GB avail
1 active+clean+scrubbing+deep
5311 active+clean
You will see a lot of entries that describe what is happening. The most important is something like this:
9854/630322 objects degraded (1.563%)
This means that 9854 of the objects Ceph stores (which hold your OpenStack data) currently have fewer replicas than they should. Ceph will copy the missing replicas to other hosts, so you will have to wait until all objects that were on your OSD node are rebalanced. This uses a lot of network I/O, and your active VMs will suffer, so warn your users before doing this.
Again, make sure this is done on any additional nodes you want to delete.
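Rather than staring at `ceph -w`, the wait can be scripted. This is a rough sketch under assumptions: `degraded_pct` and `wait_for_rebalance` are hypothetical helper names, the 30 second interval is arbitrary, and the check simply looks for the words degraded, recover, or backfill in `ceph -s`, which may not cover every state your Ceph version reports.

```shell
# degraded_pct: hypothetical helper. Reads Ceph status text on stdin and
# prints the percentage from a line such as
#   9854/630322 objects degraded (1.563%)
degraded_pct() {
  sed -n 's/.*objects degraded (\([0-9.]*\)%).*/\1/p'
}

# wait_for_rebalance: poll `ceph -s` until it no longer mentions degraded,
# recovering, or backfilling placement groups.
wait_for_rebalance() {
  while ceph -s | grep -Eq 'degraded|recover|backfill'; do
    ceph -s | degraded_pct
    sleep 30
  done
  echo "no degraded objects reported"
}
```

Run `wait_for_rebalance` on any node with the Ceph admin keyring after `ceph osd out`, and confirm with `ceph health` before removing the node.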
Step 4.) Cleanup Neutron
You may notice that once your node is gone, there are some stale neutron agents marked dead.
The following command lists all dead (xxx) agents and shows their details. Change 'agent-show' to 'agent-delete' to remove them permanently:
for i in $(neutron agent-list | grep "xxx" | awk '{print $2}'); do neutron agent-show $i; done;
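The one-liner above greps the whole row for "xxx", which would also match a host or agent name containing that string. A slightly stricter variant looks only at the alive column. This is a sketch: `dead_agent_ids` is a hypothetical helper name, and it assumes the table header contains columns named id and alive.

```shell
# dead_agent_ids: hypothetical helper. Reads `neutron agent-list` output
# on stdin and prints the ID of each agent whose "alive" column is xxx.
dead_agent_ids() {
  awk -F'|' '
    # Trim whitespace around every cell.
    { for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i) }
    # Header row: remember which column holds the alive marker.
    $2 == "id" { for (i = 1; i <= NF; i++) if ($i == "alive") col = i; next }
    col && $col == "xxx" { print $2 }'
}
```

Usage would be `for i in $(neutron agent-list | dead_agent_ids); do neutron agent-show $i; done`, again swapping in 'agent-delete' once you are happy with the list.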
Step 5.) Cleanup Nova Services
Just like the neutron services, the nova services on your old node may still show up in horizon as 'down'. You can use something like the command below to disable them.
for i in $(nova service-list | grep node-13 | awk '{print $2}'); do nova service-disable node-13.ccri.com $i; done;
I did not figure out how to delete the services, but it doesn't really matter, because I will be deleting the entire environment once the upgrade and migrations are complete.
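The loop above relies on the binary name being the first column of `nova service-list`. As far as I know, some client versions add a leading Id column, which would shift everything by one. The sketch below looks the columns up by header name instead. `host_binaries` is a hypothetical helper name, and the header names Binary and Host are an assumption about your client's output.

```shell
# host_binaries HOST: hypothetical helper. Reads `nova service-list`
# output on stdin and prints "<host> <binary>" for each service on HOST.
# HOST may be the full name or the short name before the first dot.
host_binaries() {
  awk -F'|' -v host="$1" '
    # Trim whitespace around every cell.
    { for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i) }
    # Until the header is seen, look for the Binary and Host columns.
    !bcol {
      for (i = 1; i <= NF; i++) {
        if ($i == "Binary") bcol = i
        if ($i == "Host") hcol = i
      }
      next
    }
    hcol && ($hcol == host || index($hcol, host ".") == 1) { print $hcol, $bcol }'
}
```

Usage would be `nova service-list | host_binaries node-13 | while read -r h b; do nova service-disable "$h" "$b"; done`.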
Step 6.) Upgrade to Fuel 6.0 if you didn't already.
Once all of your instances have been migrated, you should be able to use the Fuel UI to decommission the node and one controller. Run the update.sh either before or after. I did it before migrating instances.
Step 7.) Create the new Environment with Fuel UI.
You should now have a free controller and nodes to build a Juno OpenStack environment, so you can start migrating your instances from the old Icehouse OpenStack environment, hopefully with virtually zero downtime.
Ceph installation requires at least 2 nodes.
Step 8.) Export Volumes from Old Environment
There are likely a variety of ways to import/export volumes in OpenStack. I have found the following method works well.
First, find a place on your old controller with extra disk space. Generally /var/lib/mongo has a lot of space with default partitioning. Locate the UUID for a volume using nova or cinder list. Instance IDs are used for the ephemeral disks; volume IDs are used for 'volume' disks. Make sure the instance using the disk is shut off.
Export it with rbd, then convert it to qcow2 (which is much smaller, because unused space is not stored) so you can pass it over the Fuel network to your other environment's controller. In this example, I am exporting a 'Volume' as a 'raw' disk, then converting it.
[root@node-10 mongo]# rbd export --pool=volumes volume-0f2a87ec-74c5-4356-a4e7-12fffd6fe5ea docker-registry.raw
Exporting image: 100% complete...done.
[root@node-10 mongo]# qemu-img convert -f raw -O qcow2 ./docker-registry.raw docker-registry.qcow2
To support SCP, on the old environment's controller modify /etc/ssh/sshd_config and set PasswordAuthentication to 'yes'. The active setting is at the bottom of the file (it also appears in the middle of the file, but commented out). Then add a temporary user with # useradd temp, and set its password with # passwd temp
Restart sshd with # service sshd restart, and you should now be able to scp the data from the new controller.
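With more than a handful of volumes, the export and convert commands above can be wrapped in a function. This is a sketch only: `run` and `export_volume` are hypothetical names, and the `DRY_RUN` variable is my own addition so the commands can be previewed before anything is written to disk.

```shell
# run: print the command when DRY_RUN is set, otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# export_volume VOLUME_ID NAME: hypothetical wrapper around the two
# commands shown above. Writes NAME.raw and NAME.qcow2 in the current
# directory, so run it from somewhere with enough free space.
export_volume() {
  vol_id=$1
  name=$2
  run rbd export --pool=volumes "volume-$vol_id" "$name.raw" &&
    run qemu-img convert -f raw -O qcow2 "./$name.raw" "$name.qcow2"
}
```

For example, `DRY_RUN=1 export_volume 0f2a87ec-74c5-4356-a4e7-12fffd6fe5ea docker-registry` prints the same two commands used above without running them. The raw file is left in place, so delete it yourself once the qcow2 has been copied over.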
Step 9.) Import Volumes as Images and launch
Here, my new Juno environment's controller is 'node-35'
[root@node-35 ~]# scp temp@node-10:/var/lib/mongo/docker-registry.qcow2 .
Warning: Permanently added 'node-10,10.20.0.3' (RSA) to the list of known hosts.
temp@node-10's password:
docker-registry.qcow2 65% 7489MB 90.3MB/s 00:21 ETA
Get the byte size first:
[root@node-35 ~]# ls -al docker-registry.qcow2
-rw-r--r-- 1 root root 11334057984 Feb 11 16:34 docker-registry.qcow2
And now, import it as an image with glance:
[root@node-35 ~]# glance image-create --size 11334057984 --name docker-registry --store rbd --disk-format qcow2 --container-format bare --file ./docker-registry.qcow2
[Screenshot: Importing into Glance while watching Ceph/rbd]
+------------------+--------------------------------------+
| Property | Value |
+------------------+--------------------------------------+
| checksum | a209fafa8ae5369e0a93b30e41c4e27c |
| container_format | bare |
| created_at | 2015-02-11T16:40:09 |
| deleted | False |
| deleted_at | None |
| disk_format | qcow2 |
| id | 22209ea8-2287-425b-9e45-c79ec210d380 |
| is_public | False |
| min_disk | 0 |
| min_ram | 0 |
| name | docker-registry |
| owner | aeea9a5fd7284450a3468915980a8c45 |
| protected | False |
| size | 11334057984 |
| status | active |
| updated_at | 2015-02-11T16:47:53 |
| virtual_size | None |
+------------------+--------------------------------------+
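The byte count and the import can be combined so the size never has to be copied by hand. This is a sketch under assumptions: `run` and `import_image` are hypothetical names, `DRY_RUN` is my own addition for previewing, and the glance flags are the same ones used in the command above.

```shell
# run: print the command when DRY_RUN is set, otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# import_image FILE NAME: hypothetical wrapper. Measures FILE in bytes
# and passes that size to glance image-create.
import_image() {
  file=$1
  name=$2
  # The arithmetic expansion strips any padding wc adds around the number.
  size=$(( $(wc -c < "$file") ))
  run glance image-create --size "$size" --name "$name" --store rbd \
    --disk-format qcow2 --container-format bare --file "$file"
}
```

Usage would be `import_image ./docker-registry.qcow2 docker-registry`, with `DRY_RUN=1` set first if you only want to see the command.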
At this point, you should be able to create a new Volume in Horizon, specifying this glance image as the source and the size of the original volume, and then launch a new instance from that Volume. In my case, with this docker registry, the original volume was 100GB, even though the qcow2 file came out at only 11GB.
[Screenshot: Create Volume from Image]
[Screenshot: Create Volume from Image with Specified Size of Original Volume (Not qcow size!)]
[Screenshot: Booting instance from new Volume]
If at any point the UI reports an error, just watch the OSD pool stats: # watch ceph osd pool stats volumes
You should see client io that is pretty heavy. When it is done, refresh your volumes in the Horizon UI and the new volume should show as 'Available'.
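I did these steps in Horizon, but they can probably be done from the CLI as well. The sketch below is untested against this environment: `run`, `volume_from_image`, and `boot_from_volume` are hypothetical names, `DRY_RUN` is my own addition, and the flags are as I recall them from the cinder and nova clients of this era, so check `cinder help create` and `nova help boot` first. With Neutron you may also need to pass a network to nova boot.

```shell
# run: print the command when DRY_RUN is set, otherwise execute it.
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# volume_from_image IMAGE_ID NAME SIZE_GB: create a volume from a glance
# image. SIZE_GB is the size of the ORIGINAL volume, not the qcow2 file.
volume_from_image() {
  run cinder create --image-id "$1" --display-name "$2" "$3"
}

# boot_from_volume VOLUME_ID FLAVOR NAME: boot an instance whose root
# disk is the given volume.
boot_from_volume() {
  run nova boot --flavor "$2" --boot-volume "$1" "$3"
}
```

For the docker registry above that would be `volume_from_image 22209ea8-2287-425b-9e45-c79ec210d380 docker-registry 100`, followed by `boot_from_volume` with the new volume's ID once it shows as 'Available'.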
Step 10.) "Rinse and Repeat"
Go ahead and repeat steps 8 and 9 for all the instances you want migrated. Set up your public IPs, update your external DNS entries, etc., and wait a day to make sure things are stable. Afterwards, delete the old environment, add the freed nodes to your new environment, and migrate some instances to lighten the load on your first couple of nodes. You should then be good to go!