XML/PostgreSQL performance slows down while processing

04-22-2006, 08:02 AM

I am trying to extract data from an XML file and fill a warehouse dimension with an insert/update step. Performance is great at the beginning of the process but slows down dramatically as it runs: from 400 rows/s on the XML input and 750 rows/s on insert/update to 33 rows/s and 66 rows/s respectively by the time 80% of the rows are done.

The XML file contains about 110,000 elements at the same hierarchy level, each with about 5 attributes.

The test computer is an Intel Pentium M 1.5 GHz with 1250 MB of RAM. Memory usage is about 500 MB and the CPU is at 100%.

I don't understand

04-22-2006, 08:12 AM
It's a known issue: the algorithm uses a DOM tree and looks up each element by position, which means cycling through the XML from the start for every request. The further toward the end you get, the longer it takes to find the required position.
Very large files suffer from this the most.
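
To make the slowdown concrete, here is a minimal sketch of a positional lookup that restarts from the first child on every call. It is not Kettle's actual code: the class name, the `nthChild` helper, the `row` tag and the `visited` counter are all made up for illustration, and only standard JAXP/DOM calls are used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

class PositionalLookupDemo {
    // Counts how many sibling nodes were visited across all lookups.
    static long visited = 0;

    // Hypothetical "get sub-node by number" helper: it rescans the
    // children from the first one on every call.
    static Node nthChild(Node parent, String tag, int nr) {
        int count = 0;
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            visited++;
            if (tag.equals(n.getNodeName())) {
                if (count == nr) return n;
                count++;
            }
        }
        return null;
    }

    // Builds <rows><row id="0"/>...</rows> in memory and returns the root element.
    static Node buildRoot(int rows) {
        StringBuilder sb = new StringBuilder("<rows>");
        for (int i = 0; i < rows; i++) sb.append("<row id=\"").append(i).append("\"/>");
        sb.append("</rows>");
        try {
            byte[] bytes = sb.toString().getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes)).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Reads every row by position and returns the total number of nodes visited.
    static long visitsFor(int rows) {
        Node root = buildRoot(rows);
        visited = 0;
        for (int i = 0; i < rows; i++) nthChild(root, "row", i);
        return visited;
    }

    public static void main(String[] args) {
        // Doubling the row count roughly quadruples the work.
        System.out.println("1000 rows: " + visitsFor(1000) + " node visits");
        System.out.println("2000 rows: " + visitsFor(2000) + " node visits");
    }
}
```

In this sketch, reading n rows by position visits n(n+1)/2 nodes in total, which is why throughput keeps dropping the further into the file you get.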

As usual, we'll find a fix for this.

There, now you understand ;-)


04-22-2006, 11:45 AM
Then would an immediate workaround be to split the XML file into smaller ones and process them sequentially?
That way, processing all the small files would be faster overall than processing the one big file...
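
A rough back-of-the-envelope estimate of what splitting would buy, assuming (as a simplification of the explanation above) that finding element number i costs about i steps. The split count k = 10 is just an example value, not something from this thread.

```latex
% One file with n elements: total lookup cost
C(n) = \sum_{i=1}^{n} i = \frac{n(n+1)}{2} \approx \frac{n^2}{2}

% k files with n/k elements each
C_k(n) = k \cdot \frac{(n/k)\,(n/k + 1)}{2} \approx \frac{n^2}{2k}

% Example: n = 110000, k = 10
C(110000) = 6\,050\,055\,000 \approx 6.05 \times 10^{9}
C_{10}(110000) = 10 \cdot 60\,505\,500 = 605\,055\,000 \approx 6.05 \times 10^{8}
```

Under that assumption, splitting into k files cuts the lookup work by roughly a factor of k, but each file is still quadratic in its own size.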

04-22-2006, 12:29 PM
Well, there is a lot of XML reader code that makes use of the XMLHandler.getSubNodeByNr() method.
This method is basically at fault here.

So I added a caching system (XMLHandlerCache, XMLHandlerCacheEntry) that caches the 500 most recent parent nodes. In tight loops, such as a large XML document with thousands of elements, it brings processing time back to linear.
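
For readers curious what such a cache could look like, here is a rough sketch of the general idea. It is not the actual XMLHandlerCache code: the class and method names are hypothetical, and only the 500-entry limit comes from the post above. It remembers the last match found under each parent node and resumes from there instead of rescanning from the first child.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;

class CachedLookupSketch {
    static final int MAX_ENTRIES = 500; // cache size mentioned in the post
    static long visited = 0;            // nodes visited, for measuring only

    // Cache key: same parent node (by identity) and same tag name.
    static final class Key {
        final Node parent; final String tag;
        Key(Node parent, String tag) { this.parent = parent; this.tag = tag; }
        @Override public boolean equals(Object o) {
            return o instanceof Key && ((Key) o).parent == parent && ((Key) o).tag.equals(tag);
        }
        @Override public int hashCode() {
            return System.identityHashCode(parent) * 31 + tag.hashCode();
        }
    }

    // Last match found for a key: its index and the node itself.
    static final class Hit {
        final int nr; final Node node;
        Hit(int nr, Node node) { this.nr = nr; this.node = node; }
    }

    // Access-ordered LinkedHashMap that evicts the least recently used entry.
    static final Map<Key, Hit> cache = new LinkedHashMap<Key, Hit>(16, 0.75f, true) {
        @Override protected boolean removeEldestEntry(Map.Entry<Key, Hit> eldest) {
            return size() > MAX_ENTRIES;
        }
    };

    static Node nthChild(Node parent, String tag, int nr) {
        Key key = new Key(parent, tag);
        Hit hit = cache.get(key);
        // Resume from the cached match when it is at or before the target.
        boolean resume = hit != null && hit.nr <= nr;
        Node n = resume ? hit.node : parent.getFirstChild();
        int count = resume ? hit.nr : 0;
        for (; n != null; n = n.getNextSibling()) {
            visited++;
            if (tag.equals(n.getNodeName())) {
                if (count == nr) { cache.put(key, new Hit(nr, n)); return n; }
                count++;
            }
        }
        return null;
    }

    // Builds <rows><row id="0"/>...</rows> in memory and returns the root element.
    static Node buildRoot(int rows) {
        StringBuilder sb = new StringBuilder("<rows>");
        for (int i = 0; i < rows; i++) sb.append("<row id=\"").append(i).append("\"/>");
        sb.append("</rows>");
        try {
            byte[] bytes = sb.toString().getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes)).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Reads every row by position and returns the total number of nodes visited.
    static long visitsFor(int rows) {
        Node root = buildRoot(rows);
        visited = 0;
        for (int i = 0; i < rows; i++) nthChild(root, "row", i);
        return visited;
    }

    public static void main(String[] args) {
        System.out.println("1000 rows: " + visitsFor(1000) + " node visits");
        System.out.println("2000 rows: " + visitsFor(2000) + " node visits");
    }
}
```

With sequential access the visit count in this sketch grows linearly with the number of rows. A lookup that jumps backwards falls back to a full scan from the first child.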

I just committed this code, so you can grab a new kettle.jar from the development packages in 5 minutes.

I still have to hunt down the memory issue. I'm looking for a decent profiler.

All the best,

04-22-2006, 01:51 PM
Hi Matt,

A profiler I can recommend is JProfiler: http://www.ej-technologies.com/products/jprofiler/overview.html
It is useful for both memory and CPU profiling.



04-22-2006, 02:28 PM
Thanks for the link Wim, but I don't feel like paying €400 as I don't use it THAT much. :-)
I've installed EclipseProfiler and after some patching it works fine for me.

At first glance, the problem seems to be in the DOM tree itself (35 MB for a 6 MB XML document).
Perhaps requesting children from a DOM tree somehow leaks memory.
Hmm, maybe we have to switch to another strategy altogether.

I'll investigate some more later.
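
The post does not say which other strategy is meant. One common alternative to building a full DOM tree is a streaming parser, so purely as an illustration, and assuming flat `row` elements with an `id` attribute (both names are made up), a single-pass read with the standard StAX API might look like this:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StreamingReadSketch {
    // Reads the "id" attribute of every <row> element in one forward pass,
    // without keeping the document tree in memory.
    static List<String> readIds(String xml) {
        List<String> ids = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "row".equals(reader.getLocalName())) {
                    // A null namespace means the attribute's namespace is not checked.
                    ids.add(reader.getAttributeValue(null, "id"));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return ids;
    }

    public static void main(String[] args) {
        String xml = "<rows><row id=\"a\"/><row id=\"b\"/><row id=\"c\"/></rows>";
        System.out.println(readIds(xml));
    }
}
```

The parser itself only holds the current event, so its memory use does not grow with the size of the file. The trade-off is that random access by position is no longer available.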