fping Support in OpenStack

OpenStack is very good at launching virtual machines – that’s its purpose, isn’t it? But usually you want to monitor state of you machines somehow, and there are many reasonable ways.

  1. You can test daemons running on the machine, e.g., check up open ports or poll known services. Of course, this approach means that you know exactly what services should be running – and this is the most precise way to test system health.
  2. You can ask hypervisor if the machine is ok. That’s a very dirty check since hypervisor will likely report that VM is active while its operating system kernel can encounter problems.
  3. A compromise settlement may be pinging the machine. It’s a general solution since a lot of VMs respond to ping normally. Sure, VM can ignore ping or its daemons can have problems while host is responding to ping, but this solution is far easier to implement then check each machine according to an individual plan.

Let’s concentrate on the last two approaches. I would like to launch a machine and check it.

[root@a001 ~]# nova image-list
+--------------------------------------+--------------+--------+--------+
| ID                                   | Name         | Status | Server |
+--------------------------------------+--------------+--------+--------+
| 960dc70a-3e0e-496a-b8da-0e9cd91d3a44 | selenium-img | ACTIVE |        |
+--------------------------------------+--------------+--------+--------+
[root@a001 ~]# nova boot --flavor m1.small --image 960dc70a-3e0e-496a-b8da-0e9cd91d3a44 selenium-0
...
[root@a001 ~]# nova list
+--------------------------------------+-------------------+--------+-------------------------+
| ID                                   | Name              | Status | Networks                |
+--------------------------------------+-------------------+--------+-------------------------+
| a9060a07-d32a-4dcf-8387-1c7d69f897dc | selenium-0        | ACTIVE | selenium-net=10.109.0.4 |
+--------------------------------------+-------------------+--------+-------------------------+
[root@a001 ~]# fping 10.109.0.4
10.109.0.4 is unreachable

As you can see, VM status is reported as active, but the machine has not booted really. Even more, consider a damaged image (I use a text file for this purpose):

[root@a001 ~]# glance index 
ID                                   Name                           Disk Format          Container Format     Size          
------------------------------------ ------------------------------ -------------------- -------------------- --------------
7d8007fe-a63c-4d02-8edf-a6cc19fa1d73 text                           qcow2                ovf                           17043
[root@a001 ~]# nova boot --flavor m1.small --image 7d8007fe-a63c-4d02-8edf-a6cc19fa1d73 text-0
[root@a001 ~]# nova list
+--------------------------------------+-------------------+--------+-------------------------+
| ID                                   | Name              | Status | Networks                |
+--------------------------------------+-------------------+--------+-------------------------+
| a9060a07-d32a-4dcf-8387-1c7d69f897dc | selenium-0        | ACTIVE | selenium-net=10.109.0.4 |
| 461e73e4-7f88-4c8f-bb1f-49df9ec18d84 | text-0            | ACTIVE | selenium-net=10.109.0.5 |
+--------------------------------------+-------------------+--------+-------------------------+

Nova bravely reports that the new instance is active, but it obviously is not functioning: a text file is not a disk image with an operating system. And fping reveals that the VM is ill:

[root@a001 ~]# fping 10.109.0.5
10.109.0.5 is unreachable

We can extend nova API adding this fping feature. Nova will run fping for requested instances and report which ones seems to be truly alive. I have developed this extension and it was accepted to Grizzly on November 16, 2012 (https://github.com/openstack/nova/commit/a220aa15b056914df1b9debc95322d01a0e408e8).

fping API is simple and straightforward. We can ask to check all instances or a single one. In fact, we have two API calls.

  1. GET /os-fping/<uuid> – check a single instance.
  2. GET /os-fping?[all_tenants=1]&[include=uuid[,uuid...][&exclude=...] – check all VMs in the current project. If all_tenants is requested, data for all projects is returned (by default, this option is allowed only for admins). include and exclude are parameters specifying VM masks. These parameters are mutually exclusive and exclude is ignored if include is specified. Consider that VM list is VM_all, then if include list is set, the only VM_all * VM_to_include (set intersection) will be tested – thus we can check several instances in a single API call. If exclude list is provided, VM_all -
    VM_to_exclude
    (set difference) will be polled – thus we can skip testing for instances that are not supposed to respond to ping.

fping increases I/O load on nova-api node, so, by default, fping API is limited to 12 calls in an hour (nevertheless it’s a single or several instances poll).

I have added nova fping support to python-novaclient (https://github.com/openstack/python-novaclient/commit/ff69e4d3830f463afa48ca432600224f29a2c238) making easy to write a daemon in Python that will periodically check instance states and send notifications on detected problems. This daemon is available in Grid Dynamics Altai Private Cloud For Developers and is called instance-notifier (https://github.com/altai/instance-notifier). The daemon is installed and configured by Altai installer automatically. Despite Altai 1.0.2 runs Essex, not Grizzly, I have added nova-fping as an additional extension package.

Let’s see how to use fping from client side. We have three instances: selenium-0 (shut off), selenium-1 (up and running), and text (invalid image). Nova reports that they are active:

[root@a001 /]# nova list
+--------------------------------------+-------------------+--------+-------------------------+
| ID                                   | Name              | Status | Networks                |
+--------------------------------------+-------------------+--------+-------------------------+
| a9060a07-d32a-4dcf-8387-1c7d69f897dc | selenium-0        | ACTIVE | selenium-net=10.109.0.4 |
| 20325b87-6858-49df-ab30-795a189dd2ac | selenium-1        | ACTIVE | selenium-net=10.109.0.3 |
| 461e73e4-7f88-4c8f-bb1f-49df9ec18d84 | text-0            | ACTIVE | selenium-net=10.109.0.5 |
+--------------------------------------+-------------------+--------+-------------------------+

Check them with nova fping!

[root@a001 /]# python
Python 2.6.6 (r266:84292, Jun 18 2012, 14:18:47) 
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from novaclient.v1_1 import Client
>>> cl = Client("admin", "topsecret", "systenant", "http://localhost:5000/v2.0")
>>> ping_list = cl.fping.list()
>>> ping_list
[<Fping: 461e73e4-7f88-4c8f-bb1f-49df9ec18d84>, <Fping: a9060a07-d32a-4dcf-8387-1c7d69f897dc>, <Fping: 20325b87-6858-49df-ab30-795a189dd2ac>]
>>> import json
>>> print json.dumps([p._info for p in ping_list], indent=4)
[
    {
        "project_id": "4fd17bd4ac834dcf8ba1236368f79986", 
        "id": "461e73e4-7f88-4c8f-bb1f-49df9ec18d84", 
        "alive": false
    }, 
    {
        "project_id": "4fd17bd4ac834dcf8ba1236368f79986", 
        "id": "a9060a07-d32a-4dcf-8387-1c7d69f897dc", 
        "alive": false
    }, 
    {
        "project_id": "4fd17bd4ac834dcf8ba1236368f79986", 
        "id": "20325b87-6858-49df-ab30-795a189dd2ac", 
        "alive": true
    }
]

As expected, nova fping reported that only selenium-1 (id=20325b87-6858-49df-ab30-795a189dd2ac) is really alive.

So, fping in nova is a fast and quite reliable way to check instance health. Like a phonendoscope, it cannot provide full information, but if a human doesn’t respire, he’s likely to be dead.

fping-phonendoscope

About these ads

One thought on “fping Support in OpenStack

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s