Hitachi Vantara Pentaho Community Forums

Thread: Data in XML file with wrong format & bad characters

  1. #1
    Join Date
    Jul 2010
    Posts
    9

    Default Data in XML file with wrong format & bad characters

    Hello Everyone!!

    To begin with, I am almost a month old with Pentaho and have done only very small DB data copies and XML file reads so far. So if you explain, I will learn.

    We have a situation which is puzzling me, and I want to find an automated solution to it.
    We have a number of XML files which are big (40 MB to 100 MB). Every time we get these files, they have some weird characters in them. The way we figure it out is to open a file in IE, where a script error points to the line number and column number. That column and line, when opened in a text editor (I use Notepad++), does not show anything.
    What we do in Notepad++ is create a new file (with the format set to UTF-8), copy the contents of the old XML, paste them into the new file, and save it. When I open this new file in IE, everything is good. I guess the problem is with the job which is creating the XML file (surely a wrong format or something). I cannot change that job, so let's not talk about it. We further use these XML files in another job (again not in my control), which fails with the old file but passes with the new file created in Notepad++.

    So what I want to do is something in Kettle which reads these XML files, takes the content, and writes it to another file as XML in UTF-8 format.

    I tried using 'Get Data from XML', but it fails to get the Loop XPath with this error:
    Error on line 41 of document file:///C:/InstalledApps/kettle3.2.0/data-integration/UTF-8 : Invalid byte 1 of 1-byte UTF-8 sequence. Nested exception: Invalid byte 1 of 1-byte UTF-8 sequence.

    So I know that the file has a problem with some characters and its format, and I want to automate what I do manually in Notepad++.

    Any thoughts on making this simple? It is driving me crazy.

    Thanks,

  2. #2
    Join Date
    Mar 2003
    Posts
    8,085

    Default

    To me this sounds as if your XML file is not in UTF-8 but in some sort of 8-bit encoding. You usually see that error when your XML file contains a byte with a value > 127. In my world that happens when users treat XML files as text that can be edited in Notepad. My guess would be that the original author of your XML generator did not understand what encodings are, and thus slapped the <?xml .. encoding="UTF-8"?> header on the file as some sort of voodoo magic without actually writing UTF-8 to the stream.

    You can either replace the XML header with one that states the correct encoding (maybe ISO-8859-1?)

    or

    you can search and replace all erroneous bytes (which is a pain, see http://en.wikipedia.org/wiki/UTF-8#I...byte_sequences )
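
    For the first option, here is a minimal sketch in Java (file names are hypothetical, and the ISO-8859-1 guess has to be verified against your data first). It decodes with a charset that round-trips every byte value, swaps the declaration, and writes the bytes back otherwise unchanged:

        import java.io.IOException;
        import java.nio.file.Files;
        import java.nio.file.Paths;

        public class RelabelXmlHeader {
            public static void main(String[] args) throws IOException {
                // ISO-8859-1 maps all 256 byte values, so decoding and
                // re-encoding with it leaves every payload byte untouched.
                byte[] raw = Files.readAllBytes(Paths.get("broken.xml"));
                String text = new String(raw, "ISO-8859-1");
                // Replace the lying declaration with one that matches the bytes.
                text = text.replaceFirst("<\\?xml[^>]*\\?>",
                        "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>");
                Files.write(Paths.get("relabeled.xml"), text.getBytes("ISO-8859-1"));
            }
        }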
    Get the latest news and tips and tricks for Pentaho Reporting at the Pentaho Reporting Blog.

  3. #3
    Join Date
    Sep 2009
    Posts
    810

    Default

    Hi there,

    Well, to begin with: you are not alone. There are a lot of XML files floating around that simply are not XML files; they just look like they are. XML defines charset handling in a very strict (but not intuitive) way. If you have a file that is not actually an XML file, your luck with different parsers and processors will vary.

    I do not know exactly where the problem is, since I don't have an example file, but I would like to point you to some tools I use regularly to solve this kind of issue :-)

    I use *nix for this kind of processing, and I gather you're on Windows, so you'd have to look for Windows versions or a Cygwin environment for these.

    I usually use iconv to deal with charset conversions:
    http://en.wikipedia.org/wiki/Iconv
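
    For example, a one-shot conversion would look something like this (file names are hypothetical, and the source encoding is a guess you have to confirm first):

        iconv -f ISO-8859-1 -t UTF-8 broken.xml > fixed.xml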

    sed might be helpful for replacing bad or missing characters or XML entities:
    http://en.wikipedia.org/wiki/Sed

    Check out http://sed.sourceforge.net/grabbag/scripts/ , especially the ISO8859-1 -> HTML and HTML -> ISO8859-1 scripts.
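
    If only a handful of stray bytes need fixing, sed can rewrite them directly. A sketch with hypothetical file names, assuming GNU sed's \xNN escapes, that replaces the byte 0xA0 with a plain space:

        sed 's/\xa0/ /g' broken.xml > fixed.xml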

    So long, hope you get it solved :-)

    Cheers

    Slawo

  4. #4
    Join Date
    Jul 2010
    Posts
    9

    Default

    Thanks Slawo & Taqua.
    So I am not alone out here...

    I read the wiki links and about the software. The nearest tool which would help looked like sed, i.e. Perl for Windows. But I would first have to learn Perl to support that... so sad I never seriously invested time in Perl. I was hoping, since I have already spent time in Kettle, that I could do something in Pentaho to read the broken file and write it out as a UTF-8 stream,
    because doing that in Notepad++ seems to work. It's just that opening such huge files is time-consuming and memory-hogging; sometimes it crashes.
    Now the thing is, I cannot show these files here... and I don't know how to create such an error file.
    I need more suggestions on what other people are doing...

  5. #5
    Join Date
    Sep 2009
    Posts
    810

    Default

    Well, I'd recommend using a Windows version of iconv, like this one: http://yukihiro.nakadaira.googlepages.com/win_iconv.zip

    You should be able to use it in Kettle's Shell script job entry to convert your files before trying to read them.
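
    The command line for the Shell script job entry would be something like this (file names are hypothetical; this assumes the executable is called win_iconv.exe and accepts the usual iconv -f/-t flags, and the source encoding is a guess):

        win_iconv -f ISO-8859-1 -t UTF-8 broken.xml > fixed.xml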

  6. #6
    Join Date
    Jul 2010
    Posts
    9

    Default

    Using win_iconv, don't I need to know what the from-encoding is in order to convert to the to-encoding (UTF-8)? I don't know what the XML file was encoded as.

  7. #7
    Join Date
    Sep 2009
    Posts
    810

    Default

    Find out :-)

    Look for special characters / non-Latin characters in the line it is complaining about. Maybe we can guess :-)

    It is likely cp1252, cp1250 or ISO-8859-1, as Taqua said. Your English is perfect, so I don't dare guess your location and likely Windows charset ;-)
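
    If eyeballing does not settle it, a small Java sketch (file name hypothetical) can at least narrow the candidates by strict-decoding the file with each charset in turn. Note that the single-byte charsets accept almost any byte, so a clean decode only means "possible", not "correct"; you still have to look at the characters it produces:

        import java.nio.ByteBuffer;
        import java.nio.charset.CharacterCodingException;
        import java.nio.charset.Charset;
        import java.nio.charset.CharsetDecoder;
        import java.nio.charset.CodingErrorAction;
        import java.nio.file.Files;
        import java.nio.file.Paths;

        public class GuessCharset {
            public static void main(String[] args) throws Exception {
                byte[] raw = Files.readAllBytes(Paths.get("broken.xml"));
                String[] candidates = {"UTF-8", "windows-1252", "windows-1250", "ISO-8859-1"};
                for (String name : candidates) {
                    // REPORT makes the decoder fail loudly instead of
                    // silently substituting replacement characters.
                    CharsetDecoder dec = Charset.forName(name).newDecoder()
                            .onMalformedInput(CodingErrorAction.REPORT)
                            .onUnmappableCharacter(CodingErrorAction.REPORT);
                    try {
                        dec.decode(ByteBuffer.wrap(raw));
                        System.out.println(name + ": decodes cleanly");
                    } catch (CharacterCodingException e) {
                        System.out.println(name + ": invalid (" + e + ")");
                    }
                }
            }
        }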

    Cheers

    Slawo

  8. #8
    Join Date
    Jul 2010
    Posts
    9

    Default



    OK... I will spend time looking for these bad guys, or I will try to look at the code which the XML generator is using... or do trial & error with the obvious ones...

    Thanks much...

  9. #9
    Join Date
    Jul 2010
    Posts
    9

    Default

    I know this is going way off Pentaho now, but please do not remove this thread. I still want to do this in Pentaho... trust me!!

    Slawo,
    I must say, what a cool tool iconv is!! I am running 'iconv -f UTF-8 -t UTF-8 filename', since nothing else seemed to work, and this time it stopped right at the illegal character. I am attaching two small JPEGs which show what those characters are. One looks like A0 and the second F3. When I copy A0 into a UTF-8 format text file, it copies nothing; it looks like it is just a void. When I copy F3, it becomes that little empty square.
    I am thinking that although this tool now helps me find the bad characters, I was able to create a new file in Notepad++ faster than with this tool, since it has to scan the whole file. These are huge files, ~80 MB. What else can I do here?

    Secondly, I found out what is going on with the XML generator. The data for this XML is gathered using the XMLText functionality of SQL Server and then dumped into a file (formatted as UTF-8) using MS SSIS.
    Attached images: two screenshots showing the invalid characters (A0 and F3)

  10. #10

    Default

    Hi all,

    since PDI 4.0 we have a "Change file encoding" step;
    have you already tried it?

    The step receives the source filename and writes the content to a new target file with the specified encoding.

    Samatar

  11. #11
    Join Date
    Sep 2009
    Posts
    810

    Default

    Hi there,

    Good, we're getting closer, I guess. If you can find out what the invalid characters are "supposed" to look like, you can probably work out the charset they are encoded in.

    As for other tools, I am not familiar with any other Windows tool that would do the conversion automatically. You could try your luck with Cygwin and GNU iconv, or write a short Java snippet to convert the file contents on the fly in the Kettle transformation before passing them on to the XML input step. As far as I know, that step accepts its XML from row fields as well.

    An example of doing charset conversion in Java can be found here:
    http://www.exampledepot.com/egs/java...nvertChar.html

    You can use a User Defined Java Class step or a JavaScript step to do this.
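
    A minimal standalone sketch of that conversion (the file names and the windows-1252 source encoding are assumptions; the linked example uses the same Reader/Writer pattern). The same loop would also fit inside a User Defined Java Class step:

        import java.io.FileInputStream;
        import java.io.FileOutputStream;
        import java.io.IOException;
        import java.io.InputStreamReader;
        import java.io.OutputStreamWriter;
        import java.io.Reader;
        import java.io.Writer;

        public class ConvertToUtf8 {
            public static void main(String[] args) throws IOException {
                // Stream in chunks so an 80 MB file never has to fit
                // into memory all at once.
                try (Reader in = new InputStreamReader(
                             new FileInputStream("broken.xml"), "windows-1252");
                     Writer out = new OutputStreamWriter(
                             new FileOutputStream("clean.xml"), "UTF-8")) {
                    char[] buf = new char[8192];
                    int n;
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                }
            }
        }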

    Cheers

    Slawo

    Edit: But of course, *slap*! The new "Change file encoding" step from the post above will likely save you the trouble of doing the conversion manually. :-)

  12. #12
    Join Date
    Jul 2010
    Posts
    9

    Default

    Sounds interesting... let me try the new version. Thanks for your help!

  13. #13
    Join Date
    Jul 2010
    Posts
    9

    Default

    Slawo, I found Charco... and it seems to do the trick much faster than iconv.
    Also, it looks like the data is all in Windows ANSI, and I am going to try PDI 4.0 tomorrow for the new "Change file encoding" step...

  14. #14
    Join Date
    Sep 2009
    Posts
    810

    Default

    Charco looks promising!
    Let us know how you end up solving the issue :-)

    Cheers
    Slawo

  15. #15
    Join Date
    Mar 2003
    Posts
    8,085

    Default

    A0 is the no-break space character, and if that happens to be in a telephone number field, then I bet that your user copied and pasted the number from a Word document or so, and the application that was used to enter the text did not clean out the data. And MS SQL declares UTF-8 but writes Cp125x or ISO-8859-x.

    To recode, I would first try the native encoding/codepage of the system that runs your SQL Server as the source encoding, as lazy programmers always stick to the defaults.
    Get the latest news and tips and tricks for Pentaho Reporting at the Pentaho Reporting Blog.

  16. #16
    Join Date
    Jul 2010
    Posts
    9

    Default

    There is a lot of text being copied from emails and pasted right into a notes field somewhere, and then it trickles down to the XML.

  17. #17
    Join Date
    Jul 2010
    Posts
    9

    Default

    Here is what I have found so far. The new Kettle step which converts the file encoding: I cannot use it, and I am not sure what I am doing wrong. I am using 'Get Data from XML', but this step fails because resolving the XPath nodes itself fails, since the file has that character:
    'Error on line 153 of document file:///C:/InstalledApps/kettle4.0.0/data-integration/UTF-8 : Invalid byte 1 of 1-byte UTF-8 sequence. Nested exception: Invalid byte 1 of 1-byte UTF-8 sequence.'

    And I realize that Charco and iconv are able to convert to UTF-8, but they are adding a BOM to the UTF-8 file, and that is also not acceptable. What I need is to convert the files to UTF-8 without a BOM. I could not figure out how to tell iconv or Charco to convert to UTF-8 without a BOM.
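
    In case it helps others: stripping the BOM afterwards is only three bytes of work. A sketch with hypothetical file names that copies the file, skipping a leading EF BB BF if present:

        import java.io.FileInputStream;
        import java.io.FileOutputStream;
        import java.io.IOException;

        public class StripUtf8Bom {
            public static void main(String[] args) throws IOException {
                try (FileInputStream in = new FileInputStream("with-bom.xml");
                     FileOutputStream out = new FileOutputStream("no-bom.xml")) {
                    byte[] head = new byte[3];
                    int n = in.read(head);
                    // The UTF-8 BOM is the byte sequence EF BB BF at the start.
                    boolean bom = n == 3
                            && (head[0] & 0xFF) == 0xEF
                            && (head[1] & 0xFF) == 0xBB
                            && (head[2] & 0xFF) == 0xBF;
                    if (!bom && n > 0) {
                        out.write(head, 0, n);   // not a BOM: keep the first bytes
                    }
                    byte[] buf = new byte[8192];
                    while ((n = in.read(buf)) != -1) {
                        out.write(buf, 0, n);
                    }
                }
            }
        }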
