Hitachi Vantara Pentaho Community Forums

Thread: Invalid XML character - Unicode: 0x0?

  1. #1
    Join Date
    Sep 2008
    Posts
    20

    Invalid XML character - Unicode: 0x0?

    I have a transformation that worked consistently for a few weeks but now throws an error when it processes different data. The transformation is a simple Table input feeding a "Get Data from XML" step, followed by a Select values step and then a Table output.

    Here is what the logs say:

    HTML Code:
    2008/11/07 05:05:33 - Table input.0 - linenr 1100000
    2008/11/07 05:05:33 - Select values.0 - linenr 1100000
    2008/11/07 05:05:33 - Table output.0 - linenr 1100000
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) : Unexpected Error : org.pentaho.di.core.exception.KettleException: 
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) : org.dom4j.DocumentException: Error on line 139 of document  : An invalid XML character (Unicode: 0x0) was found in the element content of the document. Nested exception: An invalid XML character (Unicode: 0x0) was found in the element content of the document.
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) : Error on line 139 of document  : An invalid XML character (Unicode: 0x0) was found in the element content of the document. Nested exception: An invalid XML character (Unicode: 0x0) was found in the element content of the document.
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) : org.pentaho.di.core.exception.KettleException: 
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) : org.dom4j.DocumentException: Error on line 139 of document  : An invalid XML character (Unicode: 0x0) was found in the element content of the document. Nested exception: An invalid XML character (Unicode: 0x0) was found in the element content of the document.
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) : Error on line 139 of document  : An invalid XML character (Unicode: 0x0) was found in the element content of the document. Nested exception: An invalid XML character (Unicode: 0x0) was found in the element content of the document.
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) : 
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) :     at org.pentaho.di.trans.steps.getxmldata.GetXMLData.setDocument(GetXMLData.java:146)
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) :     at org.pentaho.di.trans.steps.getxmldata.GetXMLData.ReadNextString(GetXMLData.java:350)
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) :     at org.pentaho.di.trans.steps.getxmldata.GetXMLData.getXMLRowPutRowWithErrorhandling(GetXMLData.java:630)
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) :     at org.pentaho.di.trans.steps.getxmldata.GetXMLData.getXMLRow(GetXMLData.java:616)
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) :     at org.pentaho.di.trans.steps.getxmldata.GetXMLData.processRow(GetXMLData.java:566)
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) :     at org.pentaho.di.trans.step.BaseStep.runStepThread(BaseStep.java:2664)
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) :     at org.pentaho.di.trans.steps.getxmldata.GetXMLData.run(GetXMLData.java:832)
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) : Caused by: org.dom4j.DocumentException: Error on line 139 of document  : An invalid XML character (Unicode: 0x0) was found in the element content of the document. Nested exception: An invalid XML character (Unicode: 0x0) was found in the element content of the document.
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) :     at org.dom4j.io.SAXReader.read(SAXReader.java:482)
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) :     at org.dom4j.io.SAXReader.read(SAXReader.java:365)
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) :     at org.pentaho.di.trans.steps.getxmldata.GetXMLData.setDocument(GetXMLData.java:124)
    2008/11/07 05:05:58 - Get ALBUM data from XML.0 - ERROR (version 3.1.0, build 826 from 2008/09/30 18:32:36) :     ... 6 more
    I'm sure that it's something in the data but I don't know how to debug it. I have 2+ million records in my source table that I have going through the grinder and I can't really search for a 0x0 (which I think is NULL in Unicode speak). Could someone suggest an approach I could take in finding which record from the source table input is causing this problem? Also, is there something I can do in the designer that could be put inline that can normalize any data before it is handed off to the XML parser?

    Thanks for your help

  2. #2

    If you are using Oracle, I would replace NULL values using NVL (if numeric) or DECODE (if character) as you output data to XML.

  3. #3
    Join Date
    May 2006
    Posts
    4,882

    Look at line 139 of the XML file with a good text editor.

    The application generating your XML input file is wrong, or the input data is wrong... depending on your view.

    Regards,
    Sven
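
    The "look at line 139" advice also works when the XML lives in a string rather than a file. Below is a minimal sketch in plain JavaScript (meant for Node.js, not written for the Kettle JavaScript step). `showLine` is a hypothetical helper name, and the sketch assumes the parser counts a line break as \n, \r\n, or \r. It returns the requested line with control characters made visible as \uXXXX escapes.

```javascript
// Returns the requested 1-based line of `text`, with control characters
// (other than tab, which stays as-is) rendered as visible \uXXXX escapes.
// Returns null when the line number is out of range.
function showLine(text, lineNumber) {
  const lines = text.split(/\r\n|\r|\n/);
  const line = lines[lineNumber - 1];
  if (line === undefined) return null;
  return line.replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g, ch =>
    "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0").toUpperCase());
}

// Hypothetical three-line document with a NUL hidden on line 2.
const sample = "<a>\n<b>x\u0000y</b>\n</a>";
console.log(showLine(sample, 2));
```

    Applied to the failing XML_STRING value with line number 139, this would show exactly where the NUL sits without relying on an editor that may hide it.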

  4. #4
    Join Date
    Sep 2008
    Posts
    20

    The challenge is that I have the XML stored in a table in Oracle. Here is how the transformation is set up:
    1. Table input: SELECT query from Oracle database
    2. Get XML data: Using option "XML source defined in a field" and setting it to field "XML_STRING" from database result
    3. Select values (to drop superfluous fields from table output)
    4. Table output: MySQL table
    The schema for the Oracle table kinda looks like this:

    CREATE TABLE XMLDATA (
        RECORD_ID INT,
        XML_STRING CLOB  -- written as TEXT in the original post; CLOB assumed, since Oracle has no TEXT type
    );

    The result always contains an XML string for the attribute "XML_STRING". I'm guessing that some part of the XML has a Unicode 0x0 in it that isn't wrapped in a CDATA. Since it's a critical resource, I can't really run queries on it that are not approved by my company's Oracle department. I could certainly ask the Oracle guys to run a query that I know would confirm what is throwing it off, but I can't really wait another week to have my request fulfilled by them. Even if I find the record and the part of the XML string that is throwing it off, I still need to deal with it in a way that doesn't create a bunch of extra work (such as having to maintain a list of records known to cause problems with the parser and then exclude them from the result that's parsed).

    Since there's a RECORD_ID for each XML_STRING that is handed to the "Get XML data" step, how could I capture the corresponding "RECORD_ID" for the "XML_STRING" that causes the parser to bomb? Also, is there a way I could create a check in Spoon before I send it to the parser, so I know whether it should be parsed or not? This doesn't bomb until 2-3 hours into execution, so it's a bit cumbersome to keep trying to reproduce it.
    Last edited by dyerrington; 11-07-2008 at 06:26 PM. Reason: Updated information
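
    One way to approach the "which RECORD_ID is it, and can I check before parsing" question is to test each string against the character ranges that XML 1.0 allows. The following is a rough sketch in plain JavaScript (for Node.js; the Kettle JavaScript step of that era would need older syntax). The function names and the sample `rows` array are hypothetical stand-ins for the Table input result, not anything Kettle provides.

```javascript
// True when the code point is allowed by the XML 1.0 "Char" production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function isValidXmlCodePoint(cp) {
  return cp === 0x9 || cp === 0xA || cp === 0xD ||
    (cp >= 0x20 && cp <= 0xD7FF) ||
    (cp >= 0xE000 && cp <= 0xFFFD) ||
    (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Returns { index, codePoint } for the first disallowed character,
// or null when the whole string is acceptable to an XML 1.0 parser.
function findInvalidXmlChar(text) {
  for (let i = 0; i < text.length; ) {
    const cp = text.codePointAt(i);
    if (!isValidXmlCodePoint(cp)) return { index: i, codePoint: cp };
    i += cp > 0xFFFF ? 2 : 1; // skip both halves of a surrogate pair
  }
  return null;
}

// Hypothetical rows standing in for the Oracle result set.
const rows = [
  { RECORD_ID: 1, XML_STRING: "<album><title>ok</title></album>" },
  { RECORD_ID: 2, XML_STRING: "<album><title>bad\u0000</title></album>" },
];

// Collect the RECORD_ID and position of every row that would break the parser.
const offenders = rows
  .map(r => ({ id: r.RECORD_ID, hit: findInvalidXmlChar(r.XML_STRING) }))
  .filter(r => r.hit !== null);

console.log(offenders);
```

    The same predicate could drive a filter ahead of the XML step, so that bad rows are logged with their RECORD_ID instead of stopping the run hours in.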

  5. #5
    Join Date
    May 2006
    Posts
    4,882

    In Oracle?

    Something like:

    Code:
    select record_id, xml_string
    from xml_data
    where translate(xml_string, chr(0), 'T') != xml_string
    Regards,
    Sven

  6. #6
    Join Date
    Apr 2008
    Posts
    4,696

    You can do some really powerful things with KETTLE...

    0x0 is defined to not be valid XML, so it shouldn't be in your source data.
    On the other hand you can have KETTLE fix that for you inline.

    See thread http://forums.pentaho.org/showthread.php?t=65006

    In between the "Table Input Step" and "Get XML Data" step, you could put a JS step which does
    Code:
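    // Read the incoming XML field as a string (xml_data is the input field name)
    var xml_string = xml_data.getString();
    // Replace every NUL character (octal escape \000) with a space and keep the result
    xml_string = replace(xml_string, "\000", " ");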
    Then select values to drop the original XML data.

    Not sure how safe that would be.... but you could do it.
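
    On the "how safe" question: NUL is only one of several characters that XML 1.0 forbids, so a broader cleanup may be worth considering. Here is a minimal sketch in plain JavaScript (for Node.js, not the Kettle JavaScript step dialect). `sanitizeXml` and `INVALID_XML_CHARS` are hypothetical names. It replaces every character outside the XML 1.0 allowed ranges, including unpaired surrogates.

```javascript
// Matches any code point outside the XML 1.0 "Char" ranges.
// The "u" flag makes the class operate on code points, so valid
// surrogate pairs (e.g. emoji) are kept and lone surrogates are matched.
const INVALID_XML_CHARS =
  /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/gu;

// Replaces each disallowed character with `replacement` (a space by default,
// mirroring the suggestion above; pass "" to drop the characters instead).
function sanitizeXml(text, replacement = " ") {
  return text.replace(INVALID_XML_CHARS, replacement);
}

console.log(JSON.stringify(sanitizeXml("<t>a\u0000b</t>")));
```

    Either choice silently changes the data, so it is probably wise to also log which records were altered.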

  7. #7
    Join Date
    Nov 1999
    Posts
    9,729

    A variable-like system was also implemented in 3.2.0-M1 and back-ported to version 3.1.1.

    The first thing that came to mind was implemented:

    $[00,01] or $[6F,FF,00,1F]

    In this specific case, you would need to enable variable replacement in the "Table Input" step and then reference value $[00] as binary 0 a.k.a. \u0000 a.k.a. "\000" etc.

  8. #8
    Join Date
    Nov 1999
    Posts
    9,729

    Just as a side note: if you find yourself using these kinds of hacks, try to be a better man/woman and consider what is wrong to make you have to use them in the first place.

  9. #9
    Join Date
    Sep 2008
    Posts
    20

    Quote: Originally posted by MattCasters
    Just as a side note: if you find yourself using these kinds of hacks, try to be a better man/woman and consider what is wrong to make you have to use them in the first place.
    The weight of limited resources and unrealistic deadlines


    gutlez: I tried your example out and I could get the hack to work. This is the literal code I used:

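    var xml_string = xml_data.getString();
    xml_string = replace(xml_string,"\000"," ");
    // "\0x0" is a NUL followed by the literal text "x0"; after the line above it finds nothing to replace
    xml_string = replace(xml_string,"\0x0"," ");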



    Then, I set the field source in my "get data from xml" step to use the new xml_string item that appeared in the select element. This step allowed the transformation to complete with success where it previously failed.

    Thanks for your help guys.
