XML Processing

Recently I have had to revisit one of our systems that deals with XML call records (from a VoIP switch).

This system splits out Call Detail Records (CDRs) by customer. The version that was running was based on XML::Twig. When the code was written a number of years ago it ran acceptably fast, and it has the advantage of being relatively light on memory, as the document is processed a chunk at a time rather than being read into memory in full. However, the system was becoming noticeably slower, mainly because the volume of calls being detailed had increased by a substantial factor.

So last week I spent a while trying out different approaches to this problem (as well as investigating approaches for a more database driven storage system for the future).

For the specific problem of splitting the data based on the customer responsible for the CDRs, the fastest approach I managed to put together was based on XML::LibXML. This has the disadvantage that it has to read in the complete XML file (and these are getting to be multiple gigabytes per hour). However, the module is relatively light on memory compared to the other whole-file methods, and a simplified rewrite of my previous programme ran more than 20 times faster, which is rather worth having.
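As a rough illustration of that kind of filter, here is a minimal sketch using XML::LibXML. The element names (cdrs, cdr, customer, duration) and the split_by_customer helper are made up for the example, since the real record layout is not shown here, and it groups records in memory rather than writing them out to files.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical record layout; the real CDR schema will differ.
my $xml = <<'XML';
<cdrs>
  <cdr><customer>acme</customer><duration>42</duration></cdr>
  <cdr><customer>globex</customer><duration>7</duration></cdr>
  <cdr><customer>acme</customer><duration>13</duration></cdr>
</cdrs>
XML

# Parse the whole document, then group each serialised <cdr>
# under the text of its <customer> child.
sub split_by_customer {
    my ($xml_string) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml_string);
    my %by_customer;
    for my $cdr ($doc->findnodes('/cdrs/cdr')) {
        my $customer = $cdr->findvalue('customer');
        push @{ $by_customer{$customer} }, $cdr->toString;
    }
    return \%by_customer;
}

my $split = split_by_customer($xml);
for my $customer (sort keys %$split) {
    printf "%s: %d record(s)\n", $customer, scalar @{ $split->{$customer} };
}
```

For real files you would presumably call load_xml(location => $file) instead and print each record to a per-customer filehandle rather than collecting strings.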

However, this was basically a simple filter, splitting incoming data into several output streams based on a very simple criterion. Getting data fields out of the XML records with XML::LibXML appears to be relatively slow (and clumsy), so if, for example, I want to extract all the fields into a database, the aggregate cost of accessing every field starts to add up.

XML::Bare converts XML data files into Perl hashes: either its own format, which includes metadata to aid reconstruction back to XML, or a basic hash format very similar to that used by the older XML::Simple. It is fairly fast, although it appears to be rather more profligate with memory (it reads the whole file at once and, for some reason, holds the complete XML file as a string alongside the hashified version). If you are doing a lot of manipulation of the data within the XML file, it might well be faster than using XML::LibXML.

The big advantage of using XML::Twig originally was that it’s quite a perlish way of manipulating XML, and additionally you can use the simplify operation to convert the XML data into a hash, which is useful for dealing with individual records within the XML set.
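For anyone who has not used it, the XML::Twig style looks roughly like the sketch below. This is not the original programme: the element names are invented, and the per-customer duration total just stands in for whatever is done with each record.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical record layout; the real CDR schema will differ.
my $xml = <<'XML';
<cdrs>
  <cdr><customer>acme</customer><duration>42</duration></cdr>
  <cdr><customer>globex</customer><duration>7</duration></cdr>
  <cdr><customer>acme</customer><duration>13</duration></cdr>
</cdrs>
XML

my %total;    # total call duration per customer

my $twig = XML::Twig->new(
    twig_handlers => {
        # Called each time a complete <cdr> element has been parsed.
        cdr => sub {
            my ($t, $cdr) = @_;
            my $rec = $cdr->simplify;    # hash ref for this one record
            $total{ $rec->{customer} } += $rec->{duration};
            $t->purge;                   # free what has been parsed so far
        },
    },
);
$twig->parse($xml);

printf "%s: %d\n", $_, $total{$_} for sort keys %total;
```

The purge call in the handler is what keeps memory use low, since each record is discarded once it has been handled.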

However, XML::LibXML does not appear to offer this, and XML::Bare is too inflexible. So what would be useful is a fast mechanism for converting an XML::LibXML node into a hash, making access within that node much simpler (and quite likely quicker, although there is an initial cost of conversion to a hash, as well as the memory cost).
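A naive version of such a conversion might look like the sketch below. The node_to_hash function is hypothetical (it is not part of XML::LibXML), it ignores attributes and mixed content, and I have not measured whether it is actually fast.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper, not part of XML::LibXML.
#   leaf element           -> its text content
#   element with children  -> hash ref keyed by child element name
#   repeated child names   -> array ref of values
sub node_to_hash {
    my ($node) = @_;
    my @children = grep { $_->nodeType == XML_ELEMENT_NODE } $node->childNodes;
    return $node->textContent unless @children;

    my %hash;
    for my $child (@children) {
        my $name  = $child->nodeName;
        my $value = node_to_hash($child);
        if (!exists $hash{$name}) {
            $hash{$name} = $value;
        }
        elsif (ref $hash{$name} eq 'ARRAY') {
            push @{ $hash{$name} }, $value;
        }
        else {
            # Second occurrence of this name: promote to an array.
            $hash{$name} = [ $hash{$name}, $value ];
        }
    }
    return \%hash;
}

# Made-up record for illustration.
my $doc = XML::LibXML->load_xml(string =>
    '<cdr><customer>acme</customer><leg><seq>1</seq></leg><leg><seq>2</seq></leg></cdr>');
my $cdr = node_to_hash($doc->documentElement);

print $cdr->{customer}, "\n";
print $cdr->{leg}[1]{seq}, "\n";
```

One wrinkle with this shape is that a child which occurs once comes back as a scalar or hash, but as an array when it repeats, so callers have to check. This is the same trade-off that XML::Simple exposes through its ForceArray option.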

I’m hoping to be able to set aside a little time to look at this. However, I expect people may point me at other approaches. Unfortunately the documentation for the various XML modules is somewhat opaque, so it’s quite likely I have missed a great big feature somewhere!

Explore posts in the same categories: perl

5 Comments on “XML Processing”

  1. Chris Says:

    Perhaps you need something in a SAX? You could in theory parse your document into node hashes on the fly dynamically without having to read the full document into memory. I do something like this in XML::Toolkit, and there is an intro article at http://search.cpan.org/~grantm/XML-SAX-0.96/SAX/Intro.pod.

    XML::Twig has always been described to me as halfway between LibXML and SAX, and perhaps instead of moving to the one side you need the other?

  2. Chris Says:

    Right, so as soon as I remove the link to the XML.com articles from the last comment, XML.com decides to play nice. Kip Hampton wrote several good articles in 2001/2002 (http://www.xml.com/pub/au/83), including one specifically on SAX: http://www.xml.com/pub/a/2001/02/14/perlsax.html

  3. nperez Says:

    Have you considered making use of POE and POE::Filter::XML (inherently based on XML::LibXML) against a stream of XML? You could roll a system that uses POE::Wheel::FollowTail. This would allow you to plug in other POE technologies like one of the POE::Component::*DBI* modules to store the stuff into database. Just some thoughts.

  4. anonymouse Says:

    look at the pull parser interface for LibXML- XML::LibXML::Reader. it’s the fastest method to parse large files. XML::Twig is okay, but X::L::R kicks its butt.

  5. Matt S Trout Says:

    Basically – what anonymouse said. Use a pull based parser and your life will likely become much easier.

