Altai v1.1.1 is out

Hello everybody!

A new version of Altai Private Cloud for Developers, 1.1.1, is ready to use. In this release, we added support for reliable updates from any previous Altai version.

This release is recommended for everyone as a replacement for 1.1.0. The update procedure from any previous release is safe and automated – just follow our upgrade guide.

Other important changes:

  • guess the root partition of virtual machine images when setting login credentials;
  • add a script for LDAP configuration after Altai installation;
  • support instance live migration;
  • fix a VNC security problem affecting deleted instances;
  • add strong validation of Altai configuration parameters to catch installation/update problems earlier;
  • UI bugfixes;
  • correctly add NS server records on zone creation;
  • purge information about deleted instances from the database.

OpenStack Unit Testing: a Long Journey Starts With a Small Step

Alessio Ababilov

This time I would like to talk about the OpenStack unit tests we have been working on at Grid Dynamics.

First and foremost, writing unit tests for OpenStack is a tar pit: once you have started, you cannot stop. The task is difficult by itself; moreover, creating tests for such a rapidly developed project is like living on a volcano.

Our task seemed quite trivial. We have a set of projects with different test coverage, and we have to bring each of them to at least 80%. The Python coverage utility is at our service.

I have introduced the following infrastructure. We publish our patches on Gerrit, which replicates its repositories to another server for easy recovery. Jenkins monitors Gerrit and recalculates coverage after every patch publication. Coverage for each project is gathered into a table.

I hoped that coverage calculation would not be a big deal, but I was too optimistic. Imagine that we simply use the run_tests.sh script that is available for almost all the projects. Then we will face several problems.

  1. One does not simply use run_tests.sh. It installs a fresh virtual environment for Python packages using the corresponding pip-requires and test-requires files. This environment lacks some necessary packages (e.g., iso8601), and its eventlet is incompatible with the Python 2.6 that we use in CentOS. So, I had to prepare the virtual environment with a custom script. So does OpenStack Jenkins.
  2. Downloading and compiling all Python dependencies usually takes a lot of time. It was too long even for a Gentoo user, so I cached the environment, using a common one for all the packages.
  3. pip is not a reliable utility. During package installation, it sometimes erases all package data except for the metainformation. That means the package is considered installed while its files are gone, so you have to uninstall and reinstall it (a quick way to detect this state is sketched after this list).
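
A quick way to detect the broken state from point 3 is to compare what pkg_resources believes with what the interpreter can actually import. This is only an illustrative snippet, not part of our tooling:

# Illustrative check, not part of our scripts: pip left the metainformation
# behind, so pkg_resources thinks the package is installed, but the files
# are gone and the import fails.
import pkg_resources

def looks_broken(dist_name, module_name):
    try:
        pkg_resources.get_distribution(dist_name)  # metadata is present...
    except pkg_resources.DistributionNotFound:
        return False                               # not installed at all
    try:
        __import__(module_name)                    # ...but are the files there?
        return False
    except ImportError:
        return True                                # uninstall and reinstall

print looks_broken("iso8601", "iso8601")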

Well, now we have a shiny virtual environment. It’s time to run the tests and calculate coverage. Currently, OpenStack unit tests for Nova, Glance, Quantum, and Keystone are slow and take about 10-20 minutes on our testing server. Migrating from nosetests to testr is a good idea: it would allow the tests to run in parallel (nosetests parallelization is incompatible with the eventlet/greenlet libraries widely used in OpenStack). Nova already uses testr, but updating the other packages is a laborious task. Some tests depend on each other and cannot be run in parallel; here is an example of one such problem I detected and fixed: https://bugs.launchpad.net/quantum/+bug/1125951.

I have used incremental testing to increase the speed. I cache the .coverage file that accumulates statistics from previous runs and execute only those tests that were added or updated; the latter are determined with the git diff --name-only command. Incremental testing also solves another problem: summarizing coverage over all pending commits of a given project. These commits are siblings on the commit tree, so you cannot simply check out the latest one and run all published tests that have not been accepted yet.
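
Roughly, the incremental run can be sketched like this (a simplified sketch with hypothetical helper names, not the actual Jenkins job):

# Simplified sketch: pick only the test modules touched by the patch and feed
# them to the runner, letting "coverage run -a" append to the cached .coverage.
import subprocess

def changed_test_files(old_rev, new_rev, test_dir="nova/tests/"):
    """Return test files added or updated between two git revisions."""
    proc = subprocess.Popen(["git", "diff", "--name-only", old_rev, new_rev],
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    return [path for path in out.decode().splitlines()
            if path.startswith(test_dir) and path.endswith(".py")]

def run_incrementally(old_rev="HEAD~1", new_rev="HEAD"):
    tests = changed_test_files(old_rev, new_rev)
    if not tests:
        return
    # -a/--append accumulates statistics in .coverage instead of overwriting it;
    # the exact test runner invocation differs in the real job.
    subprocess.call(["coverage", "run", "-a", "-m", "nose"] + tests)
    subprocess.call(["coverage", "report"])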

OpenStack includes the Oslo project – a set of Python libraries shared by other projects. Unfortunately, Oslo is not a true library yet: its code is copied into other projects with a special script. Thus, our coverage statistics would be incorrect if we included uncovered Oslo code in the common report. So, I had to implement a blueprint to fix this issue.
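
The idea behind it can be sketched with coverage.py’s omit patterns (an illustration of the principle only, not the blueprint itself): the copied Oslo code conventionally lands under openstack/common/ inside each project, so it can be filtered out of the report.

# Illustration of the principle only: filter the copied Oslo code
# (<project>/openstack/common/...) out of the coverage report.
import coverage

cov = coverage.Coverage(omit=["*/openstack/common/*"])
cov.start()
# ... run the unit tests here ...
cov.stop()
cov.save()
cov.report()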

All preparations are done, now we are ready for testing!

Our testing process differs from the one used in OpenStack Jenkins. Sometimes upstream tests fail just because they were never run during verification, and sometimes they somehow pass in OpenStack jobs yet fail in my environment. Here is an example: Quantum’s test_policy tests pass in https://review.openstack.org/#/c/21091/, but they should fail because a necessary configuration parameter is not imported properly (my fix: https://review.openstack.org/#/c/21205/).

So, now the OpenStack community has a new team that looks at unit tests from a slightly different point of view, detecting problems and fixing them.

Altai v1.1.0 with LDAP/AD support

Hello, everybody!

We are glad to announce a new version of Altai Private Cloud for Developers.

This update introduces out-of-the-box LDAP/AD support in Altai Cloud. It should help to manage cloud users and authorization in companies where LDAP or AD is used.


Preconditions for Common OpenStack Client Library

OpenStack client packages have a long history. It began in November 2009, when the Rackspace Cloud Servers package was started. It provided a Python API (the cloudservers module) and a command-line script (cloudservers). Initially, the script was just a stub, but it grew into a useful CLI utility able to launch, stop, and resize virtual machines.

The cloudservers package introduced a library architecture that is still in use today. All entities can be split into five groups.

  1. Resources, e.g., a flavor, a server, or an image. Technically, a resource is a Python object and its class is a descendant of the Resource class.
  2. Managers – they provide operations on resources, for example, “list all flavors” or “delete an image”. So, we have a flavor manager, an image manager, and so on. As you may guess, manager classes are descendants of the Manager class.
  3. The HTTP client provides a convenient interface for the managers that send HTTP requests to the server. The HTTP client is also responsible for the authentication process, which changed a lot after the introduction of the Keystone service, so the newer HTTP clients are a bit more complicated.
  4. Exceptions are normal Python exceptions raised by the HTTP client for HTTP error codes. This is a more or less rich hierarchy with exceptions such as Unauthorized, BadRequest, or NotFound.
  5. The client (not to be confused with the HTTP client!) puts the HTTP client and various managers together (using class composition: the HTTP client and managers are members of the client). As a user, you create a client and can immediately perform any API calls:
    # this is the client
    client = Client(USERNAME, PASSWORD, PROJECT_ID, AUTH_URL)
    # client.flavors is a manager
    all_flavors = client.flavors.list()
    # and all_flavors is a list of resources
    print all_flavors                                         
    

The oldest of the currently alive clients (novaclient) was born in January 2011 as a fork of the Rackspace Cloud Servers package. Since then, the cloudservers library and CLI script have been renamed to novaclient. They support the Nova API, which has been growing all the time, but the two main functions (a Python API and a command-line client) remain unchanged.

About a year later, a new OpenStack client project called keystoneclient was started. It was flesh of novaclient’s flesh, with almost the same architecture save for a small difference: the client was a child class of the HTTP client, thus using inheritance, not composition. And, of course, keystoneclient has its own managers and resources (tenant, user, etc.).

A lot of the code required for the new client package was already written in novaclient (the base Resource, Manager, and HTTP client classes). But this code was copied into keystoneclient, not imported. On the one hand, this made the packages independent: you don’t have to install novaclient if you would like to use keystoneclient. On the other hand, the histories of the duplicated classes diverged, and they gained different features available in one package and absent in the other.

glanceclient used the same copying approach with the same benefits and pitfalls. However, quantumclient and swiftclient are completely different and I won’t discuss them here.

So, what do we have now?

  1. The Keystone server provides tokens with a limited time to live, so it is natural to get an “Unauthorized” error after a series of successful calls. The Nova and Keystone clients handle this situation correctly: they make one call to obtain a fresh token and retry the failed query (see the sketch after this list). glanceclient just raises an exception.
  2. The Keystone server supports two ways of authenticating against a tenant: with a user name and password, or with an unscoped token. In response to successful authentication, it returns a scoped token and a catalog of all OpenStack service endpoints (nova, glance, keystone, swift, etc.). keystoneclient supports both ways, while novaclient handles only user name and password authentication. glanceclient is even less capable: it requires a scoped token and a Glance server endpoint; it knows nothing about the clever Keystone service, so you have to do the dirty job yourself. By the way, glanceclient’s shell uses keystoneclient to issue this initial call to Keystone.
  3. All client constructors use different parameters. For example, the thing that is called password in keystoneclient is api_key in novaclient for historical reasons: it was called apikey (without an underscore!) in cloudservers three years ago.
  4. Clients have not only different constructors but also different behavior: keystoneclient authenticates immediately when you create the client object, while novaclient does it lazily during the first API call.
  5. Often you would like to make calls to different services. A dashboard or a common command-line tool usually requests the tenant list from Keystone and the image list from Glance, and sends a “launch an instance” command to Nova. With the current clients, it is difficult to share the same token and service endpoint catalog. A simple way would be to use a common HTTP client object, but that is impossible because of the incompatible architectures of the different client packages.
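
To illustrate the first point, here is a stripped-down sketch of the retry-on-expired-token behaviour (not the real novaclient or keystoneclient code):

# Stripped-down sketch of the behaviour described in point 1; the real
# novaclient/keystoneclient code is more involved.
class Unauthorized(Exception):
    pass

class SketchedHttpClient(object):
    def __init__(self, username, password):
        self.username = username
        self.password = password
        self.token = None

    def authenticate(self):
        # imagine a real request to Keystone here returning a fresh token
        self.token = "fresh-token"

    def request(self, method, url, retried=False):
        try:
            return self._raw_request(method, url, token=self.token)
        except Unauthorized:
            if retried:
                raise                # a second failure: give up
            self.authenticate()      # one call for a fresh token...
            return self.request(method, url, retried=True)  # ...and retry

    def _raw_request(self, method, url, token):
        raise NotImplementedError    # placeholder for the actual HTTP call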

To solve these problems, we could move the common code to a separate library that would be imported in all three clients. The common library would contain:

  • the base Resource class;
  • the base Manager class;
  • a rich Exceptions hierarchy;
  • a feature-rich HTTP client that supports all authentication methods, handles expired-token faults, and saves the whole service catalog returned by Keystone;
  • the base client class that contains an instance of HTTP client as a member: this way, several clients (e.g., a client for Nova and a client for Keystone) can share the same HTTP client.

I developed a sample implementation of this library and called it python-openstackclient-base. The library is used in Altai Private Cloud, a project of Grid Dynamics. python-openstackclient-base is easy to use:

from openstackclient_base.client import HttpClient
http_client = HttpClient(username="...", password="...", tenant_name="...", auth_uri="...")

# Nova Compute API client
from openstackclient_base.nova.client import ComputeClient
# create a client class and use servers manager
print ComputeClient(http_client).servers.list()   

# Identity (Keystone) Public API client
from openstackclient_base.keystone.client import IdentityPublicClient 
# use the same HTTP client as above
print IdentityPublicClient(http_client).tenants.list()

Now I’m going to submit this library to the oslo-incubator project and use it in all three clients. When oslo-incubator matures, it will be imported by OpenStack projects as I would like; for now, its code will just be copied literally into other projects. However, that is still satisfactory, since it achieves all the goals mentioned above.

fping Support in OpenStack

OpenStack is very good at launching virtual machines – that’s its purpose, isn’t it? But usually you want to monitor the state of your machines somehow, and there are several reasonable ways to do it.

  1. You can test the daemons running on the machine, e.g., check open ports or poll known services. Of course, this approach requires that you know exactly which services should be running – and it is the most precise way to test system health.
  2. You can ask the hypervisor if the machine is OK. That is a very rough check, since the hypervisor will likely report that the VM is active even while its operating system kernel is having problems.
  3. A compromise may be pinging the machine. It is a general solution, since most VMs respond to ping normally. Sure, a VM can ignore ping, or its daemons can have problems while the host still responds to ping, but this solution is far easier to implement than checking each machine according to an individual plan.

Let’s concentrate on the last two approaches. I would like to launch a machine and check it.

[root@a001 ~]# nova image-list
+--------------------------------------+--------------+--------+--------+
| ID                                   | Name         | Status | Server |
+--------------------------------------+--------------+--------+--------+
| 960dc70a-3e0e-496a-b8da-0e9cd91d3a44 | selenium-img | ACTIVE |        |
+--------------------------------------+--------------+--------+--------+
[root@a001 ~]# nova boot --flavor m1.small --image 960dc70a-3e0e-496a-b8da-0e9cd91d3a44 selenium-0
...
[root@a001 ~]# nova list
+--------------------------------------+-------------------+--------+-------------------------+
| ID                                   | Name              | Status | Networks                |
+--------------------------------------+-------------------+--------+-------------------------+
| a9060a07-d32a-4dcf-8387-1c7d69f897dc | selenium-0        | ACTIVE | selenium-net=10.109.0.4 |
+--------------------------------------+-------------------+--------+-------------------------+
[root@a001 ~]# fping 10.109.0.4
10.109.0.4 is unreachable

As you can see, the VM status is reported as ACTIVE, but the machine has not really booted. What’s more, consider a damaged image (I use a text file for this purpose):

[root@a001 ~]# glance index 
ID                                   Name                           Disk Format          Container Format     Size          
------------------------------------ ------------------------------ -------------------- -------------------- --------------
7d8007fe-a63c-4d02-8edf-a6cc19fa1d73 text                           qcow2                ovf                           17043
[root@a001 ~]# nova boot --flavor m1.small --image 7d8007fe-a63c-4d02-8edf-a6cc19fa1d73 text-0
[root@a001 ~]# nova list
+--------------------------------------+-------------------+--------+-------------------------+
| ID                                   | Name              | Status | Networks                |
+--------------------------------------+-------------------+--------+-------------------------+
| a9060a07-d32a-4dcf-8387-1c7d69f897dc | selenium-0        | ACTIVE | selenium-net=10.109.0.4 |
| 461e73e4-7f88-4c8f-bb1f-49df9ec18d84 | text-0            | ACTIVE | selenium-net=10.109.0.5 |
+--------------------------------------+-------------------+--------+-------------------------+

Nova bravely reports that the new instance is active, but it obviously is not functioning: a text file is not a disk image with an operating system. And fping reveals that the VM is ill:

[root@a001 ~]# fping 10.109.0.5
10.109.0.5 is unreachable

We can extend the Nova API with this fping feature: Nova will run fping for the requested instances and report which ones seem to be truly alive. I have developed this extension, and it was accepted into Grizzly on November 16, 2012 (https://github.com/openstack/nova/commit/a220aa15b056914df1b9debc95322d01a0e408e8).

The fping API is simple and straightforward. We can ask it to check all instances or a single one. In fact, there are two API calls (a sketch of the raw requests follows the list).

  1. GET /os-fping/<uuid> – check a single instance.
  2. GET /os-fping?[all_tenants=1][&include=uuid[,uuid...]][&exclude=...] – check all VMs in the current project. If all_tenants is requested, data for all projects is returned (by default, this option is allowed only to admins). include and exclude are parameters specifying lists of VMs; they are mutually exclusive, and exclude is ignored if include is specified. If the full VM list is VM_all, then, when an include list is given, only VM_all ∩ VM_to_include (the set intersection) is tested – this way we can check several instances in a single API call. If an exclude list is provided, VM_all - VM_to_exclude (the set difference) is polled – this way we can skip testing instances that are not supposed to respond to ping.
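
Assuming you already have a scoped token and the Nova endpoint for your tenant (the URL and token below are made up), the raw calls look roughly like this; the novaclient example later in this post is the more convenient way.

# Rough sketch of the raw API calls; the endpoint and token are placeholders.
import json
import requests

NOVA_URL = "http://nova-api:8774/v2/TENANT_ID"   # illustrative endpoint
HEADERS = {"X-Auth-Token": "SCOPED_TOKEN"}       # illustrative token

# check all VMs in the current project
resp = requests.get(NOVA_URL + "/os-fping", headers=HEADERS)
print json.dumps(resp.json(), indent=4)

# check a single instance
resp = requests.get(NOVA_URL + "/os-fping/a9060a07-d32a-4dcf-8387-1c7d69f897dc",
                    headers=HEADERS)
print json.dumps(resp.json(), indent=4)

# check only the listed instances (include and exclude are mutually exclusive)
resp = requests.get(NOVA_URL + "/os-fping",
                    params={"include": "a9060a07-d32a-4dcf-8387-1c7d69f897dc,"
                                       "461e73e4-7f88-4c8f-bb1f-49df9ec18d84"},
                    headers=HEADERS)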

fping increases the I/O load on the nova-api node, so, by default, the fping API is rate-limited to 12 calls per hour (regardless of whether a single instance or several instances are polled).

I have added nova fping support to python-novaclient (https://github.com/openstack/python-novaclient/commit/ff69e4d3830f463afa48ca432600224f29a2c238), which makes it easy to write a Python daemon that periodically checks instance states and sends notifications about detected problems. Such a daemon is available in Grid Dynamics Altai Private Cloud for Developers and is called instance-notifier (https://github.com/altai/instance-notifier). The daemon is installed and configured by the Altai installer automatically. Although Altai 1.0.2 runs Essex, not Grizzly, I have added nova-fping as an additional extension package.
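
For illustration, a trimmed-down polling loop in the spirit of instance-notifier could look like this (the real daemon is more elaborate, and notify() here is just a placeholder):

# Minimal polling loop in the spirit of instance-notifier; the real daemon
# is more elaborate, and notify() below is just a stand-in.
import time

from novaclient.v1_1 import Client

def notify(instance_id):
    # stand-in for a real notification (e-mail, messaging, etc.)
    print "instance %s does not respond to ping" % instance_id

def main(poll_interval=300):
    cl = Client("admin", "topsecret", "systenant",
                "http://localhost:5000/v2.0")
    while True:
        for state in cl.fping.list():
            if not state.alive:
                notify(state.id)
        time.sleep(poll_interval)

if __name__ == "__main__":
    main()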

Let’s see how to use fping from the client side. We have three instances: selenium-0 (shut off), selenium-1 (up and running), and text-0 (invalid image). Nova reports that all of them are active:

[root@a001 /]# nova list
+--------------------------------------+-------------------+--------+-------------------------+
| ID                                   | Name              | Status | Networks                |
+--------------------------------------+-------------------+--------+-------------------------+
| a9060a07-d32a-4dcf-8387-1c7d69f897dc | selenium-0        | ACTIVE | selenium-net=10.109.0.4 |
| 20325b87-6858-49df-ab30-795a189dd2ac | selenium-1        | ACTIVE | selenium-net=10.109.0.3 |
| 461e73e4-7f88-4c8f-bb1f-49df9ec18d84 | text-0            | ACTIVE | selenium-net=10.109.0.5 |
+--------------------------------------+-------------------+--------+-------------------------+

Check them with nova fping!

[root@a001 /]# python
Python 2.6.6 (r266:84292, Jun 18 2012, 14:18:47) 
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from novaclient.v1_1 import Client
>>> cl = Client("admin", "topsecret", "systenant", "http://localhost:5000/v2.0")
>>> ping_list = cl.fping.list()
>>> ping_list
[<Fping: 461e73e4-7f88-4c8f-bb1f-49df9ec18d84>, <Fping: a9060a07-d32a-4dcf-8387-1c7d69f897dc>, <Fping: 20325b87-6858-49df-ab30-795a189dd2ac>]
>>> import json
>>> print json.dumps([p._info for p in ping_list], indent=4)
[
    {
        "project_id": "4fd17bd4ac834dcf8ba1236368f79986", 
        "id": "461e73e4-7f88-4c8f-bb1f-49df9ec18d84", 
        "alive": false
    }, 
    {
        "project_id": "4fd17bd4ac834dcf8ba1236368f79986", 
        "id": "a9060a07-d32a-4dcf-8387-1c7d69f897dc", 
        "alive": false
    }, 
    {
        "project_id": "4fd17bd4ac834dcf8ba1236368f79986", 
        "id": "20325b87-6858-49df-ab30-795a189dd2ac", 
        "alive": true
    }
]

As expected, nova fping reported that only selenium-1 (id=20325b87-6858-49df-ab30-795a189dd2ac) is really alive.

So, fping in Nova is a fast and reasonably reliable way to check instance health. Like a phonendoscope, it cannot provide full information, but if a patient is not breathing, he is likely to be dead.


Altai v1.0.2 is out

Hello everybody!

A new version of Altai Private Cloud for Developers, 1.0.2, is ready to use. In this release, we reviewed and cleaned up third-party packages and made bugfixes, primarily to the user interface.

This release is recommended for everyone as a replacement for 1.0.1. The update procedure is safe and automated – just follow our upgrade guide.

What’s New in Altai 1.0.2 from Maintainer’s Point of View

A new version of Altai Private Cloud for Developers 1.0.2 was released.

The new release is devoted to cleaning up package dependencies. Also, a bunch of bugfixes was made, primarily to the user interface. Let’s see what’s new in Altai 1.0.2 from the maintainer’s point of view.

In previous releases, we had this motto: “Take basic CentOS/RHEL, take our source RPMs, and you will be able to build the whole Altai and install it.” Altai RPMs (both source and binary) were grouped into two repositories: “main” and “deps”. “deps” contained packages rebuilt from their third-party source RPMs without changes. All other packages went to “main”, including customized third-party software (like nginx with the upload module) and Altai’s own packages, like the Focus web UI. Since we built both “main” and “deps” packages, we signed them with the Grid Dynamics signature.

This model had one pitfall: we had to maintain plenty of well-known packages that are not included in basic CentOS/RHEL, such as RabbitMQ or Erlang. That made our repositories really huge: 500 MiB in total, 100 MiB for “main” and 400 MiB for “deps”! Imagine how wasteful it is to add these tons of unchanged third-party packages to every release. That’s why we tried the following solution in the previous release (1.0.1): chain the repositories, so that almost all unchanged packages are downloaded from the 1.0.0 release and the 1.0.1 repository contains only the packages to upgrade. As was shown in this article, YUM can handle thousands of repositories simultaneously without performance problems. So, the repository chain approach saves significant space for newer releases, but it leads to some maintenance problems.

For example, imagine that a package should be downgraded in the next release. We can call yum downgrade package-name in the Altai installer, but how can we guarantee that this package will not be accidentally updated later by the user during a yum update?

A more complex problem is that it is difficult to determine the list of all packages that belong to a given release if they are spread across lots of repositories. Moreover, building a new release repository as the next link in the repo chain is a nontrivial task.

Fortunately, if you decide to use EPEL packages, you can say farewell to all these obstacles. First, the repository becomes significantly smaller simply because you no longer have to rebuild heaps of packages: now we have only 160 MiB of binary packages. Second, with a small repository you don’t have to use a cunning repository chain – everything becomes transparent and easy to support.

It’s worth saying that using EPEL packages isn’t as simple as it seems. Some important Python libraries are installed in such locations that you would have to patch your programs, or they wouldn’t find their dependencies. We decided to reject those libraries and package them ourselves. Luckily, most EPEL packages could be used in Altai without complications.

While reviewing all the Altai packages, we chose a new repository layout. Let’s briefly describe it.

  • centos6: these packages are maintained and developed by the Grid Dynamics team. This group contains customized OpenStack and a lot of home-grown packages signed by the Altai team. The sources of these packages are available on GitHub.
  • deps: these packages are not developed by Grid Dynamics. This category includes the following subdirectories.
    1. centos6-updates – necessary update packages for CentOS 6, signed by CentOS.
    2. epel – necessary packages from the EPEL repositories, signed by EPEL.
    3. misc – packages built and signed by the Altai team.
    4. misc-srpms – source RPMs for misc, signed by the Altai team.

As you can see, we still provide the sources of all packages we have built, as is appropriate for an open source project.

As mentioned above, we keep Altai sources in git. There are two steps between a git repository and a binary RPM: first, a source RPM is built from the git repo; second, a binary RPM is built from the source one.

Each step is a non-trivial operation. A source RPM must contain all the information required for the package build, including the source tarball, the spec file, and possibly patches that should be applied to the unpacked tarball before the build. The ALT Linux team even developed a powerful toolkit called GEAR (Get Every Archive from git package Repository). GEAR contains tens of individual CLI programs for different purposes, including composing a source RPM from a git repository and importing a tarball into git. We used GEAR in previous releases, but the only feature we needed was git-to-source-RPM conversion. Moreover, almost every conversion was trivial, because we develop our software keeping in mind that it will be packaged into RPMs. GEAR, in its turn, is designed to maintain third-party software that is under active or slow development and needs to be patched before packaging.

Obviously, the multifunctional GEAR led to boilerplate configuration files. That’s why we simplified git-to-source-RPM conversion: in our case it can be done with a small and clear script. And there is no need to write a GEAR rules file: it’s sufficient to just place a spec file in the git repository.

Frankly speaking, the second step (source-to-binary RPM conversion) is trickier than the first, but, fortunately, there is a ready solution – the mock tool used in Fedora and EPEL. mock prepares a clean and safe chroot environment for the build operation. We already used mock for previous releases, and we continue to take advantage of it.

So, Altai 1.0.2 is easier to develop, maintain, and support, and at the same time more foolproof.

Hungry Process Breaks Your “while read” bash Cycle

Originally posted on Alessio Ababilov's Blog:

I am working on a build system that makes it easy to control several connected git repositories forming one project. The system is written in bash and uses lots of rarely used git and bash features.

Often I have to iterate over a table generated by git. For example, to see the changes between a commit and its parent, I run:

$ git diff-tree --no-commit-id -r 9b8b0f6150790d2a757cd2091ef91d3ebe9ce317 -- repos
:160000 160000 236fc8025f106375944457007f5a7a803297e683 f5ede37ddbf9eccd55012f1ddda3ae37259ca800 M	repos/altai/altai-installer
:160000 160000 2706a907bf2d136dd1f737e6c6cb4ca8e420329c 10a1bf5d8f716f30af089f1558eefbdeb07f9b3b M	repos/altai/nova-networks-ext
:160000 160000 7aaafb9f29b60ef0a4cf938b653de23354308be2 ad3725e92b08ca40cf65fb9ed604ae3285fee271 M	repos/altai/python-openstackclient-base
:160000 160000 748da9c4c1d058f96dd40ba328fd100719f768f7 eb568c5ffb4543b676208c96de7af2c62e455329 M	repos/openstack/glance

This output is easily parsed with bash’s while read:

$ git diff-tree --no-commit-id -r 9b8b0f6150790d2a757cd2091ef91d3ebe9ce317 -- repos | while read mode1 mode2 hash1 hash2 ignored path; do if [ "$mode2" == 160000 ]; then echo $path; fi; done
repos/altai/altai-installer
repos/altai/nova-networks-ext
repos/altai/python-openstackclient-base
repos/openstack/glance

The read command gets a line from stdin and sets the variables one by one. We…

View original 605 more words

OpenStack Migration from Diablo to Essex

Originally posted on Alessio Ababilov's Blog:

We have been using OpenStack at Grid Dynamics for more than a year. It is the basis of our private infrastructure, originally named Cloud For Grid Dynamics (C4GD) and now known as Altai. C4GD provided cheap and fast VM management for our developers’ needs, with reliable support. We were using the Diablo release and were happy with it.

On 5 April 2012, the shiny new Essex release came out, not without Grid Dynamics’ involvement (you can even find me in the list of contributors). I was challenged to investigate and prepare migration scripts for our cloud.

I started from my old scripts for installing Diablo and began writing a set of tools that make both migration and installation from scratch easy for different releases. You can see the result of my work on GitHub. These scripts work with OpenStack packaged into RPMs at Grid Dynamics.

Generally, OpenStack migrates rather well…

View original 1,186 more words

OpenStack EPEL: the Dependency Purgatory

Originally posted on Alessio Ababilov's Blog:

When you develop a software system, you can choose any solution between two extreme approaches.

  1. Build and maintain all your dependencies.
  2. Rely on external repositories and build only your specific packages.

Having chosen the first approach, you can be sure that your users will use your finely tuned packages of carefully chosen versions that are bound to work properly. And when you see a problem in a dependency, you freely patch it and… congratulations, now you are the happy maintainer of a zoo of numerous packages containing software written in several languages!

The second way is simple: you build a dozen of your own packages and publish a relatively small repository. And if your third-party dependencies become unavailable, it will be the user’s problem.

While developing Altai, we started with the first solution: a user installs basic RHEL/CentOS and just adds our repository. Nowadays, we are moving to the second…

View original 1,148 more words