CPAN RPM Packaging


Seeing as I have been asked about building RPM packages of CPAN modules today, I thought it was worth putting some information down in a blog post. I would love comments on this, so that I can improve the information and hopefully make my processes better.

Firstly, I target Centos 4 and 5 i386 only (although I am going to have to start building for x86_64 too) and build stuff for our own internal requirements, the latter meaning that the stuff I package depends very much on what I need for work at the time. It's also a little painful that I cannot easily make this work generally available, but getting these packages into the wild would be complex. Packaging for other RPM flavoured distros should be easy enough, but slightly different.

There is a set of rules we work to for our internal deployments:-

  1. Everything is packaged as RPMs.
  2. You do not cheat on dependencies – if I do yum install foobar then foobar had better bring in all the appropriate required packages/modules.
  3. We have 3 repositories – the first 2 are the standard Centos repos (or rather local mirrors) of the base OS and updates repositories. The third is our own local repository of packages we build. We require that all packages are signed, so we have our own internal GPG key for our repos (there is an example repository definition just after this list).
  4. All packages in our repository are built by us in a clean build environment and have no dependencies from outside the 3 repos we use.
  5. Packages should pass their own test suite (which we run as part of the build) unless there are exceptional circumstances.
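
As an illustration of rules 3 and 4, the internal repository is simply another yum repository definition on every machine, with GPG checking turned on. The snippet below is only a sketch – the repository name, URL and key location are placeholders rather than our real setup:-

[internal]
name=Internal packages
baseurl=http://repo.example.internal/centos/$releasever/$basearch/
enabled=1
gpgcheck=1
gpgkey=http://repo.example.internal/RPM-GPG-KEY-internal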

Although, as I stated above, we are building things for our own internal use, we do always contribute bug reports and fixes back to the upstream software provider (unless the patch is only needed due to our own bad practice) – maintenance is much easier if you keep close to upstream.

There are really 2 parts to building CPAN RPMs:-

  1. Making a source RPM (or even just the spec file)
  2. Building the binary RPMs

I’m going to consider these separately.

Creating The Source RPM

Mediocre Writers Borrow; Great Writers Steal – T.S. Eliot

Ideally use a source RPM package from a packager that you trust. In my experience the Fedora guys have now got the best RPM packaging of perl modules – they have a set of package rules they work to, and a good set of tools. If the Fedora folks have packaged a CPAN module then I will use their source RPM in preference to building one myself. If there are problems with that RPM then I will contribute fixes back to their packagers (or the upstream CPAN author). The best place to check for packages is the development sources directories – for example http://mirrors.kernel.org/fedora/development/source/SRPMS/

Fedora also produce the EPEL (Extra Packages for Enterprise Linux) additional packages for RHEL/Centos. Generally the only differences between these and the Fedora packages are slight changes in dependencies due to the different OS environment (and Fedora uses perl 5.10 rather than the perl 5.8 shipped with Centos/RHEL).

If you can’t find a prebuilt source RPM file you will need to create your own. Although this is a relatively simple task (albeit one for which documentation always seems hard to find), there are additional tools to do the heavy lifting for you – at the very least they produce the skeleton of the package, and in many cases they produce a completely usable package with no intervention required.

I used to use cpan2rpm to produce packages – the version I had was hacked around (not code I was proud of) to produce better build dependency listings.

However for the last couple of years I have been using cpanspec – another product of those incredible Fedora packagers. This produces a completely usable RPM in the vast majority of cases. A typical invocation (for me) looks like:-

$ cpanspec --follow --srpm SQL::Abstract

The --follow flag does its best to chase down dependencies, but you do normally need to package the prerequisites yourself before packaging the module you actually want.

cpanspec is packaged within the EPEL repositories.

Building RPMs

Basic information on building RPM packages can be found in the RPM Building Crash Course, as well as on the rpm.org wiki.

However it is best practice when building RPMs to build in a clean build environment, to ensure that dependency generation is correct, that no unpackaged files affect the build, and that the whole process is reproducible. This is done by having a separate chrooted build system that is regenerated for each new build.

Until a couple of years back I used the mach build system to produce clean builds (there is a HowTo document on this) – and I actually still use mach for building Centos 4 packages (because the system hasn’t broken, so I’ve had no need to fix it). However for Centos 5 (and for any new environments) I now use mock (again from Fedora).

mock is packaged within the EPEL repositories.

In both cases you will need to configure your build system for the right base OS and repositories.
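
To give a flavour of what a build looks like, rebuilding a source RPM under mock is a one-liner. Note that the config name and package version below are just examples, and will depend on how your mock configs are set up:-

$ mock -r epel-5-i386 --rebuild perl-SQL-Abstract-1.60-1.src.rpm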

If you are building significant numbers of packages you will need additional scripting to manage things such as adding built packages to the repository (createrepo), expiring superseded packages (repomanage) and checking that your repositories have no unfulfilled dependencies (repoclosure) – these programmes come from the createrepo and yum-utils packages.
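
For what it is worth, the sort of housekeeping I mean looks roughly like the following – the paths and repository id here are invented for the example, and repomanage just lists the superseded packages (removing them is up to you):-

$ createrepo /srv/repo/centos/5/i386
$ repomanage --old /srv/repo/centos/5/i386
$ repoclosure -r internal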


XML Processing


Recently I have had to revisit one of our systems that deal with XML call records (from a VOIP switch).

This system splits out Call Detail Records (CDRs) by customer. The version that was running was based on XML::Twig, which used to run acceptably fast (this code was written a number of years ago) and has the advantage of being relatively light on memory, as the document is processed a chunk at a time rather than being completely read into memory. However the system was apparently getting slower – mainly down to the volume of calls being detailed increasing by a substantial factor.

So last week I spent a while trying out different approaches to this problem (as well as investigating approaches for a more database driven storage system for the future).

For the specific problem of splitting the data based on the customer responsible for the CDRs, the fastest approach I managed to put together was based on XML::LibXML. This has the disadvantage that it has to read in the complete XML file (and these are getting to be multiple gigabytes per hour); however, the module is relatively light on memory compared to the other methods, and a simplified rewrite of my previous programme resulted in a better than 20-fold speed up – rather worth having.
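
For the curious, the shape of that filter is roughly as below. This is a cut-down sketch rather than the real programme – the element and file names (cdr, customer and so on) are invented for the example, and wrapping each output stream in a proper root element is omitted:-

use strict;
use warnings;
use XML::LibXML;

# Parse the whole CDR file (XML::LibXML reads the complete document).
my $parser = XML::LibXML->new;
my $doc    = $parser->parse_file( $ARGV[0] );

# One output filehandle per customer.
my %out;
for my $cdr ( $doc->findnodes('//cdr') ) {
    my $customer = $cdr->findvalue('./customer');
    unless ( $out{$customer} ) {
        open my $fh, '>', "cdrs-$customer.xml" or die "open: $!";
        $out{$customer} = $fh;
    }
    print { $out{$customer} } $cdr->toString, "\n";
}
close $_ for values %out;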

However this was basically a simple filter – splitting incoming data into several output streams based on a very simple criterion. Getting data fields out of the XML records with XML::LibXML appears to be relatively slow (and clumsy) – so, for example, if I want to extract all the fields into a database then the aggregate cost of accessing every field starts to mount up.

XML::Bare converts XML data files into perl hashes – either its own format, which includes metadata to aid reconstruction of the XML, or a basic hash format very similar to that used by the more ancient XML::Simple. It is fairly fast, although it appears to be rather more profligate with memory (for some reason it holds the complete XML file as a string as well as the hashified version – it also reads the whole file at once). If you are doing a lot of manipulation of the data within the XML file it might well be faster than using XML::LibXML.
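
A minimal sketch of what using it looks like (the element names are again invented, and this assumes multiple cdr elements under a cdrs root so that they come back as an array):-

use strict;
use warnings;
use XML::Bare;

my $ob   = XML::Bare->new( file => 'cdrs.xml' );
my $root = $ob->parse();    # XML::Bare's own format, with metadata
# $ob->simple() would give an XML::Simple-like hash instead

# Text content of a node ends up under the 'value' key.
for my $cdr ( @{ $root->{cdrs}{cdr} } ) {
    print $cdr->{customer}{value}, "\n";
}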

The big advantage of using XML::Twig originally was that it is a quite perlish way of manipulating XML, and additionally you can use the simplify operation to convert the XML data into a hash – useful for dealing with individual records within the XML set.
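
That style looks something like the sketch below (element names are again only illustrative) – each record is handled as it is parsed and then purged, so the whole document never sits in memory:-

use strict;
use warnings;
use XML::Twig;

XML::Twig->new(
    twig_handlers => {
        cdr => sub {
            my ( $twig, $elt ) = @_;
            my $rec = $elt->simplify;    # a hash, much as XML::Simple would give
            # ... do something with $rec->{customer} and friends ...
            $twig->purge;                # discard the chunk we have just handled
        },
    },
)->parsefile('cdrs.xml');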

However this cannot be done with XML::LibXML, and XML::Bare is too inflexible. So what would be useful is a fast mechanism for converting an XML::LibXML node into a hash, making access within that node much simpler (and quite likely quicker – although there is an initial cost of converting to a hash, as well as the memory cost).
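
To make the idea concrete, a naive (and certainly not fast) version of such a conversion might look like the following – this is only to illustrate the sort of interface I have in mind; it ignores attributes and is not a finished solution:-

use strict;
use warnings;
use XML::LibXML;

# Turn an XML::LibXML element into a nested hash: child elements become
# keys, repeated children become array refs, leaf text becomes scalars.
sub node_to_hash {
    my ($node) = @_;
    my @children = grep { $_->isa('XML::LibXML::Element') } $node->childNodes;
    return $node->textContent unless @children;

    my %h;
    for my $child (@children) {
        my $name  = $child->nodeName;
        my $value = node_to_hash($child);
        if ( exists $h{$name} ) {
            $h{$name} = [ $h{$name} ] unless ref $h{$name} eq 'ARRAY';
            push @{ $h{$name} }, $value;
        }
        else {
            $h{$name} = $value;
        }
    }
    return \%h;
}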

I’m hoping to be able to set aside a little time to look at this. However I expect people may suggest other approaches – unfortunately the documentation within the various XML modules is somewhat opaque, so it's quite likely I have missed a great big feature somewhere!