
Tuesday, January 27, 2015

How to configure Dell BMC

When configuring a Dell PowerEdge C6220 (or similar) to be remotely administered via the BMC, you will need to modify the BIOS configuration.  In this example, a dedicated NIC is used.
Steps:
  1. Boot the Server and Press F2 to enter the BIOS.
  2. From the BIOS, use the right arrow key to navigate to the "Server" menu, then move down to the BMC Administration.
  3. Configure the IP address for the BMC and set the Interface from "Shared-NIC" to "Dedicated NIC".
  4. Press Esc to get back to the main menu, then use the right arrow key to navigate all the way to the last menu and save the configuration. (Do not exit the BIOS at this time.)
  5. Connect an Ethernet cable to the dedicated BMC port (identified with an open-ended wrench icon) and plug the other end into your network LAN switch.
  6. From another PC, use a web browser to connect to http://<BMC-IP-address> (the address configured in step 3) and log in with the credentials root/root.


Once you are able to log in to the console, you will likely want to configure the remote KVM.  This is slightly more complex:
  1. Navigate to the vKVM settings and click the Launch link, then click "Launch Java KVM Client". This should launch a JNLP file with javaws. However, since Java 1.7.0_51, self-signed code cannot be executed by default.  The workaround is to create the file ~/.java/deployment/security/exception.sites and add the following lines (see the sketch after this list):
    1. http://<BMC-IP-address>
    2. https://<BMC-IP-address>
  2. Now, when you run the JNLP KVM client, you will be allowed to authorize the execution of self-signed code.
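
A minimal sketch of that workaround from a Linux client shell, assuming a hypothetical BMC address of 192.168.0.120 (substitute whatever you configured in the BIOS):

# hypothetical BMC address -- substitute the IP you set in step 3 above
BMC_IP=192.168.0.120
mkdir -p ~/.java/deployment/security
cat >> ~/.java/deployment/security/exception.sites <<EOF
http://${BMC_IP}
https://${BMC_IP}
EOF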


If you have configured everything correctly, you should be able to see the BIOS where we left off in step 4 above:

vMedia

vMedia is used to map your PC's CD-ROM drive, or a disk image (*.iso, *.dmg), to the server as a virtual drive.  It is configured from the web console as well.


Additional Information:

Use the PowerEdge BMC getting started guide: poweredge-c6105_User's Guide_en-us.pdf
If you get an error running the KVM ('login denied/not authorized'), edit the username/password inside the JNLP file. Change it to 'root/root' or whatever the security credentials are configured as, then relaunch the Java KVM client ($ javaws ~/Downloads/viewer.jnlp)
http://docs.oracle.com/javase/7/docs/technotes/guides/jweb/jcp/properties.html - Info about self signed code execution in jre 1.7.0 update 51
The Avocent KVM and vMedia require port 2068 to be accessible from the client to the server. (This means that if there is a firewall between the client and the server it intends to connect to, an exception must be made from the client to destination port 2068 on the server.)
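
For example, if the firewall in the path happens to be a Linux gateway running iptables, a rule along these lines would allow the traffic through (10.0.0.50 is a hypothetical BMC address; adjust to your environment):

# allow forwarded TCP traffic from the client side to the BMC on port 2068
iptables -A FORWARD -p tcp -d 10.0.0.50 --dport 2068 -j ACCEPT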

Monday, January 26, 2015

Upgrading Openstack with Fuel

DISCLAIMER:  This is a development environment as well as a work in progress.  Do not attempt this on a 'production' system without going through the process first on a non-production system and learning how to troubleshoot it at various points.

Also, this is an 'in place' upgrade, which means you need to have roughly half of your hardware resources free so you can create a new environment alongside the old one and migrate everything into it (unless you only plan to migrate a few things, which is an entirely different subject).

I have tried to break this process down in to the following steps:

  1. Plan Upgrade
  2. Migrate instances (VMs) to free up additional nodes.
  3. Remove free Node's Ceph OSD from Ceph Cluster
  4. Clean up Neutron agents.
  5. Disable Nova Services
  6. Upgrade your Fuel Server
  7. Deploy new Environment (Juno here)
  8. Export Volumes from Old (Icehouse) Environment
  9. Import Volumes as Images and boot new VMs w/ new Volume
  10. Repeat Step 9 for all instances to keep, and Delete the old Environment


We have been running a "steady" instance of Fuel 5.1 with 3 controllers configured for HA (High Availability) and 4 nodes operating as both Ceph and Compute nodes.

We started finding small timeout issues and bugs with 5.1, so we decided it was time to upgrade.  6.0 is out, so we had to download the 5.1.1 update and the 6.0 update and run them both in order.  Here are the links to the software and installation guide(s).  Please be advised, this information is current as of January 26, 2015.

https://www.fuel-infra.org/ - Latest Download.  (Fuel 6.0 for me today.)


Step 1.) Plan

Our plan is to decommission a single Compute/Ceph OSD node and a single controller.  Fuel will not like running HA with only 2 controllers, but it should be okay to deal with that while we migrate our environment.

Step 2.) Migrate instances and free up a node.

Live Migration Guide - Using this guide, I was able to determine which instances I had running on node-18 and migrate them to a node that had fewer.

[root@node-10 ~]# nova host-describe node-18.ccri.com
+------------------+----------------------------------+-----+-----------+---------+
| HOST             | PROJECT                          | cpu | memory_mb | disk_gb |
+------------------+----------------------------------+-----+-----------+---------+
| node-18.ccri.com | (total)                          | 24  | 72492     | 8731    |
| node-18.ccri.com | (used_now)                       | 13  | 27136     | 260     |
| node-18.ccri.com | (used_max)                       | 10  | 20480     | 200     |
| node-18.ccri.com | dc25784fc9d94e58b3887045756cf9e8 | 8   | 16384     | 160     |
| node-18.ccri.com | 0dc5c66d16b04d48b07c868cc195f46a | 2   | 4096      | 40      |
+------------------+----------------------------------+-----+-----------+---------+
[root@node-10 ~]# nova list --host node-18.ccri.com --all-tenants
+--------------------------------------+----------+--------+------------+-------------+-------------------------------------+
| ID                                   | Name     | Status | Task State | Power State | Networks                            |
+--------------------------------------+----------+--------+------------+-------------+-------------------------------------+
| b58af781-bb57-4c35-bbe5-4153e2d4bb6e | alfresco | ACTIVE | -          | Running     | net04=192.168.111.10, 192.168.3.128 |
| 2006d7db-d18e-4390-ae1c-40dd77644853 | hannibal | ACTIVE | -          | Running     | ONR=172.16.0.39, 192.168.3.161      |
+--------------------------------------+----------+--------+------------+-------------+-------------------------------------+

In my case, node-15 had the fewest instances, so I decided to migrate the two instances on node-18 over to node-15, starting with 'alfresco'.

[root@node-10 ~]# nova live-migration b58af781-bb57-4c35-bbe5-4153e2d4bb6e node-15.ccri.com
ERROR: HTTPConnectionPool(host='192.168.3.100', port=8774): Max retries exceeded with url: /v2/0dc5c66d16b04d48b07c868cc195f46a/servers/b58af781-bb57-4c35-bbe5-4153e2d4bb6e/action (Caused by : )


Notice the ERROR message.  I believe this is because the migration took longer than expected.  However, I verified through the nova list command, as well as the Horizon UI, that the server was still migrating hosts, so I waited.  It finished within 5 minutes, and I then verified:


[root@node-10 ~]# nova list --host node-18.ccri.com --all-tenants
+--------------------------------------+----------+--------+------------+-------------+--------------------------------+
| ID                                   | Name     | Status | Task State | Power State | Networks                       |
+--------------------------------------+----------+--------+------------+-------------+--------------------------------+
| 2006d7db-d18e-4390-ae1c-40dd77644853 | hannibal | ACTIVE | -          | Running     | ONR=172.16.0.39, 192.168.3.161 |
+--------------------------------------+----------+--------+------------+-------------+--------------------------------+
[root@node-10 ~]# nova list --host node-15.ccri.com --all-tenants
+--------------------------------------+-----------------+--------+------------+-------------+--------------------------------------+
| ID                                   | Name            | Status | Task State | Power State | Networks                             |
+--------------------------------------+-----------------+--------+------------+-------------+--------------------------------------+
| 2e1057ee-48d5-4b7f-aa9e-14b0103535ec | Mantis          | ACTIVE | -          | Running     | ONR=172.16.0.10, 192.168.3.138       |
| b58af781-bb57-4c35-bbe5-4153e2d4bb6e | alfresco        | ACTIVE | -          | Running     | net04=192.168.111.10, 192.168.3.128  |
| 655fba2a-9867-4305-935c-e6b3c3a84368 | docker-registry | ACTIVE | -          | Running     | net04=192.168.111.7, 192.168.3.130   |
| f368bab9-e054-4bda-84ee-e5633e6381cb | docker01        | ACTIVE | -          | Running     | DS Network=172.16.0.4, 192.168.3.140 |
| b336499d-7314-464f-9f98-ee1ed0ddd787 | inventory       | ACTIVE | -          | Running     | net04=192.168.111.8, 192.168.3.131   |
| 1b57d04c-29c7-4a1b-8cac-114f491ec5d3 | onr-node-4      | ACTIVE | -          | Running     | ONR=172.16.0.54, 192.168.3.167       |
+--------------------------------------+-----------------+--------+------------+-------------+--------------------------------------+


Also notice that the public IP did not change.  This is good. :)  Repeat this process to free up a second node to support the Ceph install on Juno, or find a new server in your budget to use.
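
If a node is running more than a couple of instances, a loop like the sketch below (built from the same nova commands used above) can push everything off of it in one pass; verify each instance afterwards with nova list:

# live-migrate every ACTIVE instance currently on node-18 over to node-15
for id in $(nova list --host node-18.ccri.com --all-tenants | grep ACTIVE | awk '{print $2}'); do
    nova live-migration $id node-15.ccri.com
done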

Step 3.) Remove the Ceph OSD from Ceph cluster

Since I am removing node-18, I will remove its Ceph OSD from the cluster, working from node-18 itself!

http://ceph.com/docs/master/rados/operations/add-or-rm-osds/#take-the-osd-out-of-the-cluster

List some pools for sanity.

 [root@node-18 ~]# ceph osd lspools
0 data,1 metadata,2 rbd,3 images,4 volumes,5 .rgw.root,6 compute,7 .rgw.control,8 .rgw,9 .rgw.gc,10 .users.uid,11 .rgw.buckets.index,12 .rgw.buckets,

Determine which OSD ID this node has (osd.6 in this case).

[root@node-18 ~]# ps -ef | grep ceph
root      3258     1  2  2014 ?        1-01:46:17 /usr/bin/ceph-osd -i 6 --pid-file /var/run/ceph/osd.6.pid -c /etc/ceph/ceph.conf --cluster ceph
root     11490 10726  0 17:25 pts/0    00:00:00 grep ceph

Mark the OSD 'out' of the cluster

[root@node-18 ~]# ceph osd out 6
marked out osd.6. 

Watch the rebalance happen

[root@node-18 ~]# ceph -w
    cluster 994f6ed1-69c0-4e8b-8c76-fc1186c7eda5
     health HEALTH_WARN mon.node-10 low disk space; mon.node-12 low disk space; mon.node-13 low disk space
     monmap e3: 3 mons at {node-10=10.10.20.3:6789/0,node-12=10.10.20.5:6789/0,node-13=10.10.20.6:6789/0}, election epoch 676, quorum 0,1,2 node-10,node-12,node-13
     osdmap e217: 7 osds: 7 up, 6 in
      pgmap v6442467: 5312 pgs, 13 pools, 1301 GB data, 301 kobjects
            2304 GB used, 5895 GB / 8199 GB avail
                   1 active+clean+scrubbing+deep
                5311 active+clean

You will see a lot of entries that describe what is happening.  Most importantly, something like this:

9854/630322 objects degraded (1.563%)

This means that 9854 of Ceph's RBD objects (which hold your OpenStack data) do not currently have enough replicas.  Ceph will copy replicas to other hosts, so you will have to wait until all objects that were on your OSD node are rebalanced.  This will use a lot of network I/O, and your active VMs will suffer, so warn your users before doing this.

Again, ensure this is done on any additional nodes you want to delete.
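
Once the rebalance has finished (ceph -w shows everything active+clean again), the linked Ceph guide also covers actually removing the OSD from the CRUSH map and the cluster.  Roughly, and assuming the sysvinit service layout Fuel 5.1 used on CentOS (the service syntax may differ on your nodes):

# stop the OSD daemon on the node being removed
service ceph stop osd.6
# remove the OSD from the CRUSH map, delete its auth key, and remove it from the cluster
ceph osd crush remove osd.6
ceph auth del osd.6
ceph osd rm 6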

Step 4.) Cleanup Neutron

You may notice that once your node is gone, there are some stale neutron agents marked dead.

The following command will list all dead (xxx) agents and show details.  Change 'agent-show' to 'agent-delete' to remove them permanently:

for i in $(neutron agent-list | grep "xxx" | awk '{print $2}'); do neutron agent-show $i; done;
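
Once you have reviewed the output, the delete variant is the same loop with agent-delete swapped in:

# permanently remove every dead (xxx) neutron agent
for i in $(neutron agent-list | grep "xxx" | awk '{print $2}'); do neutron agent-delete $i; done;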

Step 5.) Cleanup Nova Services

Just like the Neutron services, the Nova services on your old node may still show up in Horizon as 'down'.  You can use something like the command below to disable them.

for i in $(nova service-list | grep node-13 | awk '{print $2}'); do nova service-disable node-13.ccri.com $i; done;

I did not figure out how to delete the services, but it doesn't really matter, because I will be deleting the entire environment once the upgrade and migrations are complete.


Step 6.) Upgrade to Fuel 6.0 if you didn't already.

Once your instances have all been migrated, you should be able to use the Fuel UI to decommission the node and one controller.  Run update.sh either before or after; I did it before migrating instances.

Step 7.) Create the new Environment with Fuel UI.

You should now have a free controller and nodes to build a Juno OpenStack environment and start migrating your instances from the old Icehouse OpenStack environment, hopefully with virtually zero downtime.

Ceph installation requires at least 2 nodes.

Step 8.) Export Volumes from Old Environment

There are likely a variety of ways to import/export volumes in openstack.  I have found the following method works well.

First, find a place on your old controller with extra disk.  Generally /var/lib/mongo has a lot of space with default partitioning.  Locate the UUID for the volume using nova or cinder list.  Instance IDs are used for ephemeral disks; volume IDs are used for 'volume' disks.  Make sure the instance using the disk is shut off.
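
For example (a sketch; the volume name is whatever yours is called):

cinder list | grep docker-registry    # grab the volume UUID
df -h /var/lib/mongo                  # confirm there is room to stage the export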

Export it with rbd, then compress it to qcow2 so you can pass it over the Fuel network to your other environment's controller.  In this example, I am exporting a 'Volume' as a raw disk, then converting it.

[root@node-10 mongo]# rbd export --pool=volumes volume-0f2a87ec-74c5-4356-a4e7-12fffd6fe5ea docker-registry.raw                                                                                                                               
Exporting image: 100% complete...done.
[root@node-10 mongo]# qemu-img convert -f raw -O qcow2 ./docker-registry.raw docker-registry.qcow2

To support SCP, modify /etc/ssh/sshd_config on the old environment's controller and set PasswordAuthentication to 'yes' (the active setting is at the bottom of the file; there is also a commented-out copy in the middle).  Then create a temporary user with # useradd temp, set its password with # passwd temp, and restart the service with # service sshd restart.  You should now be able to scp the data from the new controller.
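
A quick sketch of those steps, run as root on the old controller (the sed one-liner is just one way to flip the PasswordAuthentication setting; double-check the file afterwards):

sed -i 's/^PasswordAuthentication no/PasswordAuthentication yes/' /etc/ssh/sshd_config
useradd temp
passwd temp            # set a throwaway password
service sshd restart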

Step 9.) Import Volumes as Images and launch

Here, my new Juno environment's controller is 'node-35'

[root@node-35 ~]# scp temp@node-10:/var/lib/mongo/docker-registry.qcow2 .                                                                   
Warning: Permanently added 'node-10,10.20.0.3' (RSA) to the list of known hosts.
temp@node-10's password:
docker-registry.qcow2                                                  65% 7489MB 90.3MB/s    00:21 ETA

Get the byte size first:
[root@node-35 ~]# ls -al docker-registry.qcow2 

-rw-r--r-- 1 root root 11334057984 Feb 11 16:34 docker-registry.qcow2

And now, import it as an image with glance:

[root@node-35 ~]# glance image-create --size 11334057984 --name docker-registry --store rbd --disk-format qcow2 --container-format bare --file ./docker-registry.qcow2

Importing into Glance while watching Ceph/rbd

+------------------+--------------------------------------+
| Property         | Value                                |
+------------------+--------------------------------------+
| checksum         | a209fafa8ae5369e0a93b30e41c4e27c     |
| container_format | bare                                 |
| created_at       | 2015-02-11T16:40:09                  |
| deleted          | False                                |
| deleted_at       | None                                 |
| disk_format      | qcow2                                |
| id               | 22209ea8-2287-425b-9e45-c79ec210d380 |
| is_public        | False                                |
| min_disk         | 0                                    |
| min_ram          | 0                                    |
| name             | docker-registry                      |
| owner            | aeea9a5fd7284450a3468915980a8c45     |
| protected        | False                                |
| size             | 11334057984                          |
| status           | active                               |
| updated_at       | 2015-02-11T16:47:53                  |
| virtual_size     | None                                 |
+------------------+--------------------------------------+


At this point, you should be able to launch a new instance from the converted image: specify this Glance image as the source and create a new volume with the size of the original volume.  In my case, with this docker registry, that was 100GB, even though qcow2 compressed it down to 11GB.
(Screenshots: creating a volume from the image, with the size of the original volume specified rather than the qcow2 size, and then booting the instance from the new volume.)
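
The same thing can be done from the CLI instead of Horizon. Roughly (the flavor and volume name are placeholders; the image ID is the one returned by glance above):

# create a 100 GB volume from the imported image, then boot an instance from it
cinder create --image-id 22209ea8-2287-425b-9e45-c79ec210d380 --display-name docker-registry-vol 100
nova boot --flavor m1.large --boot-volume <new-volume-id-from-cinder-list> docker-registry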


If at any point the UI shows an error, just watch the OSD pool stats: # watch ceph osd pool stats volumes
You should see fairly heavy client I/O.  When it is done, you can refresh your volumes in the Horizon UI and the new volume should show up as 'Available'.

Step 10.) "Rinse and Repeat"

Go ahead and repeat steps 8 and 9 for all the instances you want migrated.  Set up your public IPs, update your external DNS entries, etc., and wait a day to make sure things are stable.  Afterwards, go ahead and delete the old environment, add the freed nodes to your new environment, and migrate some instances to lighten the load on your first couple of nodes, and you should be good to go!


Monday, January 19, 2015

Docker Python API

Here is an example Python script using the docker-py API.  In this example, I start 3 containers: one with Accumulo, one with Apache YARN, and one with GeoServer.  I am also linking the containers so that they have hosts file entries to support the hostname lookups.

Additionally, I have declared some volumes that I bind to the host's home folder under ~/geomesa-docker-volumes/*

If you get version mismatch errors, just modify the version in the get_client_unsecure function.

#!/usr/bin/env python
# The unsecure client requires that your Docker daemon is listening on port 5555 in addition to the default unix socket.
# DOCKER_OPTS="-H unix:///var/run/docker.sock -H tcp://127.0.0.1:5555"
# $ sudo service docker(.io) restart

__author__ = 'championofcyrodiil'

import docker
import getpass
from subprocess import call

geoserver_image = "user:geoserver"
accumulo_image = "user:accumulo"
yarn_image = "user:yarn"
remote_docker_daemon_host = "127.0.0.1"
unsecure_docker_port = 5555

def get_client_unsecure(host, port):
    client = docker.Client(base_url="http://%s:%s" % (host, port), version="1.10")
    return client


def start_geomesa(user):
    #accumulo container
    accumulo_volumes = ['/opt/accumulo/accumulo-1.5.2/lib/ext/', '/data-dir/', '/data']
    accumulo_container = \
        dc.create_container(image=accumulo_image,
                            name=(user + 's-accumulo'),
                            tty=True,
                            stdin_open=True,
                            hostname='accumulo',
                            ports=[2181, 22, 50070, 50095, 50075, 9000, 9898, 3614],
                            volumes=accumulo_volumes,
                            mem_limit="4g")

    accumulo_binds = {
        '/home/' + getpass.getuser() + '/geomesa-docker-volumes/accumulo-libs':
        {
            'bind': '/opt/accumulo/accumulo-1.5.2/lib/ext/',
            'ro': False
        },
        '/home/' + getpass.getuser() + '/geomesa-docker-volumes/accumulo-data':
        {
            'bind': '/data-dir/',
            'ro': False
        },
        '/home/' + getpass.getuser() + '/geomesa-docker-volumes/hdfs-data':
        {
            'bind': '/data/',
            'ro': False
        }
    }
    dc.start(accumulo_container, publish_all_ports=True, binds=accumulo_binds)

    #YARN CONTAINER
    yarn_container = dc.create_container(image=yarn_image,
                                         name=(user + 's-yarn'),
                                         stdin_open=True,
                                         tty=True,
                                         hostname='yarn',
                                         ports=[8088, 8042, 22], mem_limit="2g")

    link = {(user + 's-accumulo'): 'accumulo'}
    dc.start(yarn_container,
             publish_all_ports=True,
             links=link)

    #geoserver container
    geoserver_container = dc.create_container(image=geoserver_image,
                                              name=(user + 's-geoserver'),
                                              stdin_open=True,
                                              tty=True,
                                              hostname='geoserver',
                                              ports=[8080, 22, 7979], mem_limit="2g")
    link = {(user + 's-accumulo'): 'accumulo', (user + 's-yarn'): 'yarn'}
    dc.start(geoserver_container,
             publish_all_ports=True,
             links=link)

if __name__ == '__main__':
    # unsecured connection on localhost (127.0.0.1)
    dc = get_client_unsecure(remote_docker_daemon_host, unsecure_docker_port)
    start_geomesa('test')
    call("./geomesa_info.py")

RabbitMQ handshake_timeout

Currently I am maintaining an Openstack cluster deployed via Mirantis Fuel 5.1 (Icehouse). Things were going well for a while, but at some point there were a lot of delays in requests to the APIs to perform various tasks such as creating an instance, volume, mounting, etc. This would cause failures and would regularly leave openstack objects in an inconsistent state. This is very frustrating and difficult to diagnose because you will see errors all over the place.

The issue for us was the default system swappiness setting of 60 on CentOS 6.  This caused a lot of messages to take longer than the RabbitMQ default of 3 seconds, resulting in a timeout and a failed request.

As root on all openstack controllers:
# sysctl vm.swappiness=10
# swapoff /dev/mapper/os-swap
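
To make the swappiness change survive a reboot, it can also be persisted (a sketch for CentOS 6):

echo 'vm.swappiness = 10' >> /etc/sysctl.conf
sysctl -p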

Additionally, it looks like Mirantis Fuel used LVM, which is likely slower than ext4 directly on non-LVM partitioned disks.

Also, make sure you have enough RAM to disable swap.  More importantly, make sure you have enough RAM for your OpenStack controller.

see: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Performance_Tuning_Guide/s-memory-tunables.html

Update: This has been added to launchpad as a bug in 5.1, 6.0 and 6.1: https://bugs.launchpad.net/fuel/+bug/1413702