
Tuesday, January 27, 2015

How to configure Dell BMC

When configuring a Dell PowerEdge C6220 (or similar) to be remotely administered via the BMC, you will need to modify the BIOS configuration.  In this example, a dedicated NIC is used.
Steps:
  1. Boot the Server and Press F2 to enter the BIOS.
  2. From the BIOS, use the right arrow key to navigate to the "Server" menu, then move down to the BMC Administration.
  3. Configure the IP address for the BMC and set the Interface from "Shared-NIC" to "Dedicated NIC".
  4. Press Esc to get back to the main menu, then use the right arrow key to navigate all the way to the last menu and save the configuration. (Do not exit the BIOS at this time.)
  5. Connect an Ethernet cable to the dedicated BMC port (identified with an open-ended wrench icon) and plug the other end into your network LAN switch.
  6. From another PC, use a web browser to connect to http://<BMC-IP-address> (the address configured in step 3) and log in with the credentials root/root.


Once you are able to log in to the console, you will likely want to configure the remote KVM.  This is slightly more complex:
  1. Navigate to the vKVM settings and click the Launch link, then click "Launch Java KVM Client". This should launch a JNLP file with javaws. However, since Java 1.7.0_51, self-signed code cannot be executed by default.  The workaround is to create the file ~/.java/deployment/security/exception.sites and add the following lines (see the sketch after this list):
    1. http://<BMC-IP-address>
    2. https://<BMC-IP-address>
  2. Now, when you run the JNLP KVM client, you will be allowed to authorize the execution of self-signed code.
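
A minimal sketch of that workaround from a Linux client shell, assuming a hypothetical BMC address of 192.168.0.120 (substitute whatever you configured in the BIOS):

# hypothetical BMC address -- substitute the IP you set in step 3 above
BMC_IP=192.168.0.120
mkdir -p ~/.java/deployment/security
cat >> ~/.java/deployment/security/exception.sites <<EOF
http://${BMC_IP}
https://${BMC_IP}
EOF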


If you have configured everything correctly, you should be able to see the BIOS where we left off in step 4 above:

vMedia

vMedia is used to map your PC's CD-ROM drive, or a disk image (*.iso, *.dmg), to the server as a virtual drive.  It is configured from the web console as well.


Additional Information:

Use the PowerEdge BMC getting started guide: poweredge-c6105_User's Guide_en-us.pdf
If you get an error running the KVM ('login denied/not authorized'), edit the username/password inside the JNLP file. Change it to 'root/root' or whatever the security credentials are configured as, then relaunch the Java KVM client ($ javaws ~/Downloads/viewer.jnlp)
http://docs.oracle.com/javase/7/docs/technotes/guides/jweb/jcp/properties.html - Info about self signed code execution in jre 1.7.0 update 51
The Avocent KVM and vMedia require port 2068 to be accessible from the client to the server. (This means that if there is a firewall between the client and the server it intends to connect to, an exception must be made from the client to destination port 2068 on the server.)
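
For example, if the firewall in the path happens to be a Linux gateway running iptables, a rule along these lines would allow the traffic through (10.0.0.50 is a hypothetical BMC address; adjust to your environment):

# allow forwarded TCP traffic from the client side to the BMC on port 2068
iptables -A FORWARD -p tcp -d 10.0.0.50 --dport 2068 -j ACCEPT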

Monday, January 26, 2015

Upgrading Openstack with Fuel

DISCLAIMER:  This is a development environment as well as a work in progress.  Do not attempt this on a 'production' system without going through the process first on a non-production system and learning how to troubleshoot it at various points.

Also, this is an 'in place' upgrade, which means you need to have roughly half of your hardware resources free so you can create a new environment alongside the old one and migrate everything into it (unless you only plan to migrate a few things, which is an entirely different subject).

I have tried to break this process down in to the following steps:

  1. Plan Upgrade
  2. Migrate instances (VMs) to free up additional nodes.
  3. Remove free Node's Ceph OSD from Ceph Cluster
  4. Clean up Neutron agents.
  5. Disable Nova Services
  6. Upgrade your Fuel Server
  7. Deploy new Environment (Juno here)
  8. Export Volumes from Old (Icehouse) Environment
  9. Import Volumes as Images and boot new VMs w/ new Volume
  10. Repeat Step 9 for all instances to keep, and Delete the old Environment


We have been running a "steady" instance of Fuel 5.1 with 3 controllers configured for HA (High Availability) and 4 nodes operating as both Ceph and Compute nodes.

We started finding small timeout issues and bugs with 5.1, so we decided it was time to upgrade.  6.0 is out, so we had to download the 5.1.1 update and the 6.0 update and run them both in order.  Here are the links to the software and installation guide(s).  Please be advised, this information is current as of January 26, 2015.

https://www.fuel-infra.org/ - Latest Download.  (Fuel 6.0 for me today.)


Step 1.) Plan

Our plan is to decommission a single Compute/Ceph OSD node and a single controller.  Fuel will not like running HA with only 2 controllers, but it should be okay to deal with that while we migrate our environment.

Step 2.) Migrate instances and free up a node.

Live Migration Guide - Using this guide, I was able to determine which instances I had running on node-18 and migrate them to a node that had fewer.

[root@node-10 ~]# nova host-describe node-18.ccri.com
+------------------+----------------------------------+-----+-----------+---------+
| HOST             | PROJECT                          | cpu | memory_mb | disk_gb |
+------------------+----------------------------------+-----+-----------+---------+
| node-18.ccri.com | (total)                          | 24  | 72492     | 8731    |
| node-18.ccri.com | (used_now)                       | 13  | 27136     | 260     |
| node-18.ccri.com | (used_max)                       | 10  | 20480     | 200     |
| node-18.ccri.com | dc25784fc9d94e58b3887045756cf9e8 | 8   | 16384     | 160     |
| node-18.ccri.com | 0dc5c66d16b04d48b07c868cc195f46a | 2   | 4096      | 40      |
+------------------+----------------------------------+-----+-----------+---------+
[root@node-10 ~]# nova list --host node-18.ccri.com --all-tenants
+--------------------------------------+----------+--------+------------+-------------+-------------------------------------+
| ID                                   | Name     | Status | Task State | Power State | Networks                            |
+--------------------------------------+----------+--------+------------+-------------+-------------------------------------+
| b58af781-bb57-4c35-bbe5-4153e2d4bb6e | alfresco | ACTIVE | -          | Running     | net04=192.168.111.10, 192.168.3.128 |
| 2006d7db-d18e-4390-ae1c-40dd77644853 | hannibal | ACTIVE | -          | Running     | ONR=172.16.0.39, 192.168.3.161      |
+--------------------------------------+----------+--------+------------+-------------+-------------------------------------+

In my case, node-15 had the fewest instances, so I decided to migrate the two instances on node-18 over to node-15, starting with 'alfresco'.

[root@node-10 ~]# nova live-migration b58af781-bb57-4c35-bbe5-4153e2d4bb6e node-15.ccri.com
ERROR: HTTPConnectionPool(host='192.168.3.100', port=8774): Max retries exceeded with url: /v2/0dc5c66d16b04d48b07c868cc195f46a/servers/b58af781-bb57-4c35-bbe5-4153e2d4bb6e/action (Caused by : )


Notice the ERROR message.  I believe this is because the migration took longer than expected.  However, I verified through the nova list command, as well as the Horizon UI, that the server was still migrating hosts, so I waited.  It finished within 5 minutes, and I then verified:


[root@node-10 ~]# nova list --host node-18.ccri.com --all-tenants
+--------------------------------------+----------+--------+------------+-------------+--------------------------------+
| ID                                   | Name     | Status | Task State | Power State | Networks                       |
+--------------------------------------+----------+--------+------------+-------------+--------------------------------+
| 2006d7db-d18e-4390-ae1c-40dd77644853 | hannibal | ACTIVE | -          | Running     | ONR=172.16.0.39, 192.168.3.161 |
+--------------------------------------+----------+--------+------------+-------------+--------------------------------+
[root@node-10 ~]# nova list --host node-15.ccri.com --all-tenants
+--------------------------------------+-----------------+--------+------------+-------------+--------------------------------------+
| ID                                   | Name            | Status | Task State | Power State | Networks                             |
+--------------------------------------+-----------------+--------+------------+-------------+--------------------------------------+
| 2e1057ee-48d5-4b7f-aa9e-14b0103535ec | Mantis          | ACTIVE | -          | Running     | ONR=172.16.0.10, 192.168.3.138       |
| b58af781-bb57-4c35-bbe5-4153e2d4bb6e | alfresco        | ACTIVE | -          | Running     | net04=192.168.111.10, 192.168.3.128  |
| 655fba2a-9867-4305-935c-e6b3c3a84368 | docker-registry | ACTIVE | -          | Running     | net04=192.168.111.7, 192.168.3.130   |
| f368bab9-e054-4bda-84ee-e5633e6381cb | docker01        | ACTIVE | -          | Running     | DS Network=172.16.0.4, 192.168.3.140 |
| b336499d-7314-464f-9f98-ee1ed0ddd787 | inventory       | ACTIVE | -          | Running     | net04=192.168.111.8, 192.168.3.131   |
| 1b57d04c-29c7-4a1b-8cac-114f491ec5d3 | onr-node-4      | ACTIVE | -          | Running     | ONR=172.16.0.54, 192.168.3.167       |
+--------------------------------------+-----------------+--------+------------+-------------+--------------------------------------+


Also notice that the public IP did not change.  This is good. :)  Repeat this process to free up a second node to support the Ceph install on Juno, or find a new server in your budget to use.
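
If a node is running more than a couple of instances, a loop like the sketch below (built from the same nova commands used above) can push everything off of it in one pass; verify each instance afterwards with nova list:

# live-migrate every ACTIVE instance currently on node-18 over to node-15
for id in $(nova list --host node-18.ccri.com --all-tenants | grep ACTIVE | awk '{print $2}'); do
    nova live-migration $id node-15.ccri.com
done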

Step 3.) Remove the Ceph OSD from Ceph cluster

Since I am removing node-18, I will remove its Ceph OSD from the cluster, working from node-18 itself!

http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#take-the-osd-out-of-the-cluster

List some pools for sanity.

 [root@node-18 ~]# ceph osd lspools
0 data,1 metadata,2 rbd,3 images,4 volumes,5 .rgw.root,6 compute,7 .rgw.control,8 .rgw,9 .rgw.gc,10 .users.uid,11 .rgw.buckets.index,12 .rgw.buckets,

Determine which OSD ID this node has (osd.6 in this case).

[root@node-18 ~]# ps -ef | grep ceph
root      3258     1  2  2014 ?        1-01:46:17 /usr/bin/ceph-osd -i 6 --pid-file /var/run/ceph/osd.6.pid -c /etc/ceph/ceph.conf --cluster ceph
root     11490 10726  0 17:25 pts/0    00:00:00 grep ceph

Mark the OSD 'out' of the cluster

[root@node-18 ~]# ceph osd out 6
marked out osd.6. 

Watch the rebalance happen

[root@node-18 ~]# ceph -w
    cluster 994f6ed1-69c0-4e8b-8c76-fc1186c7eda5
     health HEALTH_WARN mon.node-10 low disk space; mon.node-12 low disk space; mon.node-13 low disk space
     monmap e3: 3 mons at {node-10=10.10.20.3:6789/0,node-12=10.10.20.5:6789/0,node-13=10.10.20.6:6789/0}, election epoch 676, quorum 0,1,2 node-10,node-12,node-13
     osdmap e217: 7 osds: 7 up, 6 in
      pgmap v6442467: 5312 pgs, 13 pools, 1301 GB data, 301 kobjects
            2304 GB used, 5895 GB / 8199 GB avail
                   1 active+clean+scrubbing+deep
                5311 active+clean

You will see a lot of entries that describe what is happening.  Most importantly, something like this:

9854/630322 objects degraded (1.563%)

This means that 9854 of Ceph's RBD objects (which hold your OpenStack data) do not currently have enough replicas.  Ceph will copy replicas to other hosts, so you will have to wait until all objects that were on your OSD node are rebalanced.  This will use a lot of network I/O, and your active VMs will suffer, so warn your users before doing this.

Again, ensure this is done on any additional nodes you want to delete.
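
Once the rebalance has finished (ceph -w shows everything active+clean again), the linked Ceph guide also covers actually removing the OSD from the CRUSH map and the cluster.  Roughly, and assuming the sysvinit service layout Fuel 5.1 used on CentOS (the service syntax may differ on your nodes):

# stop the OSD daemon on the node being removed
service ceph stop osd.6
# remove the OSD from the CRUSH map, delete its auth key, and remove it from the cluster
ceph osd crush remove osd.6
ceph auth del osd.6
ceph osd rm 6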

Step 4.) Cleanup Neutron

You may notice that once your node is gone, there are some stale neutron agents marked dead.

The following command will list all dead (xxx) agents and show details.  Change 'agent-show' to 'agent-delete' to remove them permanently:

for i in $(neutron agent-list | grep "xxx" | awk '{print $2}'); do neutron agent-show $i; done;
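
Once you have reviewed the output, the delete variant is the same loop with agent-delete swapped in:

# permanently remove every dead (xxx) neutron agent
for i in $(neutron agent-list | grep "xxx" | awk '{print $2}'); do neutron agent-delete $i; done;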

Step 5.) Cleanup Nova Services

Just like the Neutron services, the Nova services on your old node may still show up in Horizon as 'down'.  You can use something like the command below to disable them.

for i in $(nova service-list | grep node-13 | awk '{print $2}'); do nova service-disable node-13.ccri.com $i; done;

I did not figure out how to delete the services, but it doesn't really matter, because I will be deleting the entire environment once the upgrade and migrations are complete.


Step 6.) Upgrade to Fuel 6.0 if you didn't already.

Once your instances have all been migrated, you should be able to use the Fuel UI to decommission the node and one controller.  Run update.sh either before or after; I did it before migrating instances.

Step 7.) Create the new Environment with Fuel UI.

You should now have a free controller and nodes to build a Juno OpenStack environment and start migrating your instances from the old Icehouse OpenStack environment, hopefully with virtually zero downtime.

Ceph installation requires at least 2 nodes.

Step 8.) Export Volumes from Old Environment

There are likely a variety of ways to import/export volumes in openstack.  I have found the following method works well.

First, find a place on your old controller with extra disk.  Generally /var/lib/mongo has a lot of space with default partitioning.  Locate the UUID for the volume using nova or cinder list.  Instance IDs are used for ephemeral disks; volume IDs are used for 'volume' disks.  Make sure the instance using the disk is shut off.
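
For example (a sketch; the volume name is whatever yours is called):

cinder list | grep docker-registry    # grab the volume UUID
df -h /var/lib/mongo                  # confirm there is room to stage the export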

Export it with rbd, then compress it to qcow2 so you can pass it over the Fuel network to your other environment's controller.  In this example, I am exporting a 'Volume' as a raw disk, then converting it.

[root@node-10 mongo]# rbd export --pool=volumes volume-0f2a87ec-74c5-4356-a4e7-12fffd6fe5ea docker-registry.raw                                                                                                                               
Exporting image: 100% complete...done.
[root@node-10 mongo]# qemu-img convert -f raw -O qcow2 ./docker-registry.raw docker-registry.qcow2

To support SCP, modify /etc/ssh/sshd_config on the old environment's controller and set PasswordAuthentication to 'yes' (the active setting is at the bottom of the file; there is also a commented-out copy in the middle).  Then create a temporary user with # useradd temp, set its password with # passwd temp, and restart the service with # service sshd restart.  You should now be able to scp the data from the new controller.
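
A quick sketch of those steps, run as root on the old controller (the sed one-liner is just one way to flip the PasswordAuthentication setting; double-check the file afterwards):

sed -i 's/^PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
useradd temp
passwd temp            # set a throwaway password
service sshd restart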

Step 9.) Import Volumes as Images and launch

Here, my new Juno environment's controller is 'node-35'

[root@node-35 ~]# scp temp@node-10:/var/lib/mongo/docker-registry.qcow2 .                                                                   
Warning: Permanently added 'node-10,10.20.0.3' (RSA) to the list of known hosts.
temp@node-10's password:
docker-registry.qcow2                                                  65% 7489MB 90.3MB/s    00:21 ETA

Get the byte size first:
[root@node-35 ~]# ls -al docker-registry.qcow2 

-rw-r--r-- 1 root root 11334057984 Feb 11 16:34 docker-registry.qcow2

And now, import it as an image with glance:

[root@node-35 ~]# glance image-create --size 11334057984 --name docker-registry --store rbd --disk-format qcow2 --container-format bare --file ./docker-registry.qcow2

Importing into Glance while watching Ceph/rbd

+------------------+--------------------------------------+
| Property         | Value                                |
+------------------+--------------------------------------+
| checksum         | a209fafa8ae5369e0a93b30e41c4e27c     |
| container_format | bare                                 |
| created_at       | 2015-02-11T16:40:09                  |
| deleted          | False                                |
| deleted_at       | None                                 |
| disk_format      | qcow2                                |
| id               | 22209ea8-2287-425b-9e45-c79ec210d380 |
| is_public        | False                                |
| min_disk         | 0                                    |
| min_ram          | 0                                    |
| name             | docker-registry                      |
| owner            | aeea9a5fd7284450a3468915980a8c45     |
| protected        | False                                |
| size             | 11334057984                          |
| status           | active                               |
| updated_at       | 2015-02-11T16:47:53                  |
| virtual_size     | None                                 |
+------------------+--------------------------------------+


At this point, you should be able to launch a new instance from the converted image: specify this Glance image as the source and create a new volume with the size of the original volume.  In my case, with this docker registry, that was 100GB, even though qcow2 compressed it down to 11GB.
(Screenshots: creating a volume from the image, with the size of the original volume specified rather than the qcow2 size, and then booting the instance from the new volume.)
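
The same thing can be done from the CLI instead of Horizon. Roughly (the flavor and volume name are placeholders; the image ID is the one returned by glance above):

# create a 100 GB volume from the imported image, then boot an instance from it
cinder create --image-id 22209ea8-2287-425b-9e45-c79ec210d380 --display-name docker-registry-vol 100
nova boot --flavor m1.large --boot-volume <new-volume-id-from-cinder-list> docker-registry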


If at any point the UI shows an error, just watch the OSD pool stats: # watch ceph osd pool stats volumes
You should see fairly heavy client I/O.  When it is done, you can refresh your volumes in the Horizon UI and the new volume should show up as 'Available'.

Step 10.) "Rinse and Repeat"

Go ahead and repeat steps 8 and 9 for all the instances you want migrated.  Set up your public IPs, update your external DNS entries, etc., and wait a day to make sure things are stable.  Afterwards, go ahead and delete the old environment, add the freed nodes to your new environment, and migrate some instances to lighten the load on your first couple of nodes, and you should be good to go!


Monday, January 19, 2015

Docker Python API

Here is an example Python script using the docker-py API.  In this example, I start 3 containers: one with Accumulo, one with Apache YARN, and one with GeoServer.  I am also linking the containers so that they have hosts file entries to support the hostname lookups.

Additionally, I have declared some volumes that I bind to the host's home folder under ~/geomesa-docker-volumes/*

If you get version mismatch errors, just modify the version in the get_client_unsecure function.

#!/usr/bin/env python
# The unsecure client requires that your Docker daemon is listening on port 5555 in addition to the default unix socket.
# DOCKER_OPTS="-H unix:///var/run/docker.sock -H tcp://127.0.0.1:5555"
# $ sudo service docker(.io) restart

__author__ = 'championofcyrodiil'

import docker
import getpass
from subprocess import call

geoserver_image = "user:geoserver"
accumulo_image = "user:accumulo"
yarn_image = "user:yarn"
remote_docker_daemon_host = "127.0.0.1"
unsecure_docker_port = 5555

def get_client_unsecure(host, port):
    client = docker.Client(base_url="http://%s:%s" % (host, port), version="1.10")
    return client


def start_geomesa(user):
    #accumulo container
    accumulo_volumes = ['/opt/accumulo/accumulo-1.5.2/lib/ext/', '/data-dir/', '/data']
    accumulo_container = \
        dc.create_container(image=accumulo_image,
                            name=(user + 's-accumulo'),
                            tty=True,
                            stdin_open=True,
                            hostname='accumulo',
                            ports=[2181, 22, 50070, 50095, 50075, 9000, 9898, 3614],
                            volumes=accumulo_volumes,
                            mem_limit="4g")

    accumulo_binds = {
        '/home/' + getpass.getuser() + '/geomesa-docker-volumes/accumulo-libs':
        {
            'bind': '/opt/accumulo/accumulo-1.5.2/lib/ext/',
            'ro': False
        },
        '/home/' + getpass.getuser() + '/geomesa-docker-volumes/accumulo-data':
        {
            'bind': '/data-dir/',
            'ro': False
        },
        '/home/' + getpass.getuser() + '/geomesa-docker-volumes/hdfs-data':
        {
            'bind': '/data/',
            'ro': False
        }
    }
    dc.start(accumulo_container, publish_all_ports=True, binds=accumulo_binds)

    #YARN CONTAINER
    yarn_container = dc.create_container(image=yarn_image,
                                         name=(user + 's-yarn'),
                                         stdin_open=True,
                                         tty=True,
                                         hostname='yarn',
                                         ports=[8088, 8042, 22], mem_limit="2g")

    link = {(user + 's-accumulo'): 'accumulo'}
    dc.start(yarn_container,
             publish_all_ports=True,
             links=link)

    #geoserver container
    geoserver_container = dc.create_container(image=geoserver_image,
                                              name=(user + 's-geoserver'),
                                              stdin_open=True,
                                              tty=True,
                                              hostname='geoserver',
                                              ports=[8080, 22, 7979], mem_limit="2g")
    link = {(user + 's-accumulo'): 'accumulo', (user + 's-yarn'): 'yarn'}
    dc.start(geoserver_container,
             publish_all_ports=True,
             links=link)

if __name__ == '__main__':
    # unsecured connection on localhost (127.0.0.1)
    dc = get_client_unsecure(remote_docker_daemon_host, unsecure_docker_port)
    start_geomesa('test')
    call("./geomesa_info.py")

RabbitMQ handshake_timeout

Currently I am maintaining an Openstack cluster deployed via Mirantis Fuel 5.1 (Icehouse). Things were going well for a while, but at some point there were a lot of delays in requests to the APIs to perform various tasks such as creating an instance, volume, mounting, etc. This would cause failures and would regularly leave openstack objects in an inconsistent state. This is very frustrating and difficult to diagnose because you will see errors all over the place.

The issue for us was the default system swappiness setting of 60 on CentOS 6.  This caused a lot of messages to take longer than the RabbitMQ default of 3 seconds, resulting in a timeout and a failed request.

As root on all openstack controllers:
# sysctl vm.swappiness=10
# swapoff /dev/mapper/os-swap
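
To make the swappiness change survive a reboot, it can also be persisted (a sketch for CentOS 6):

echo 'vm.swappiness = 10' >> /etc/sysctl.conf
sysctl -p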

Additionally, it looks like Mirantis Fuel used LVM, which is likely slower than ext4 directly on non-LVM partitioned disks.

Also, make sure you have enough RAM to disable swap.  More importantly, make sure you have enough RAM for your OpenStack controller.

see: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-tunables.html

Update: This has been added to launchpad as a bug in 5.1, 6.0 and 6.1: https://bugs.launchpad.net/fuel/+bug/1413702