Hitachi Vantara Pentaho Community Forums

Thread: How to read unicode with the JSON Input?

  1. #1

    Default How to read unicode with the JSON Input?

    I am using the JSON Input node to read a JSON file. The file is Unicode (UTF-16, so it starts with a 0xFF 0xFE byte order mark), and I am getting a bad-character error, which makes me think this node can't read Unicode; it seems to stumble over the BOM.

    Are there any workarounds?

    And how can I search the forums for two search terms combined with AND rather than OR?

    Thanks
    Martin

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    A JSON parser expects Unicode character encoding.
    According to RFC 4627, the only thing allowed before the JSON object (or array) notation is whitespace.
    A Byte Order Mark (BOM) isn't whitespace.
    A BOM doesn't even have a meaning with UTF-8, so in that case you might be able to enlighten the originator of the file and spare yourself some trouble.
    If you can't receive the file without the BOM, you'll have to remove it on your side before you can parse it.
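
    If you have to do that in code, a minimal standalone Java sketch along these lines should do. It isn't part of Kettle, the file names are placeholders, and it assumes the input really is UTF-16 with a BOM as described above; it writes plain UTF-8 without one.

    Code:
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Re-encodes a BOM-prefixed UTF-16 file as plain UTF-8 without a BOM,
    // so a parser that follows RFC 4627 will accept it.
    public class StripBom {
        public static void main(String[] args) throws Exception {
            byte[] raw = Files.readAllBytes(Paths.get("input-utf16.json"));
            // Java's UTF-16 decoder reads the BOM and picks the right endianness.
            String text = new String(raw, StandardCharsets.UTF_16);
            Files.write(Paths.get("output-utf8.json"), text.getBytes(StandardCharsets.UTF_8));
        }
    }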

    Since the forum search has some shortcomings, we mostly use an external search engine like Google, where multiple terms are combined with AND by default.
    So long, and thanks for all the fish.

  3. #3

    Default

    The byte order mark is a legal Unicode character, so it really should not stumble over it.

  4. #4
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Your implication is wrong.
    The BOM is not insignificant whitespace and thus is not allowed at the beginning of a JSON text.
    Please read section 2 of the RFC again.
    So long, and thanks for all the fish.

  5. #5
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by Martin_K View Post
    The byte order mark is a legal unicode character
    Not within a JSON text, it's not.

    The BOM is there to indicate the endianness of a file. However, RFC 4627 says:
    3. Encoding

    JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

    Since the first two characters of a JSON text will always be ASCII characters [RFC0020], it is possible to determine whether an octet stream is UTF-8, UTF-16 (BE or LE), or UTF-32 (BE or LE) by looking at the pattern of nulls in the first four octets.

    00 00 00 xx UTF-32BE
    00 xx 00 xx UTF-16BE
    xx 00 00 00 UTF-32LE
    xx 00 xx 00 UTF-16LE
    xx xx xx xx UTF-8
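
    Just to illustrate how little code that detection needs (a standalone sketch, not how the JSON Input step works internally):

    Code:
    import java.nio.charset.Charset;

    public class JsonEncodingSniffer {
        // Guesses the encoding of a JSON octet stream from the null pattern of the
        // first four octets (RFC 4627 section 3). Assumes the text starts with two
        // ASCII characters and carries no BOM.
        public static Charset sniff(byte[] b) {
            if (b.length >= 4) {
                if (b[0] == 0 && b[1] == 0 && b[2] == 0 && b[3] != 0) return Charset.forName("UTF-32BE");
                if (b[0] != 0 && b[1] == 0 && b[2] == 0 && b[3] == 0) return Charset.forName("UTF-32LE");
                if (b[0] == 0 && b[1] != 0 && b[2] == 0 && b[3] != 0) return Charset.forName("UTF-16BE");
                if (b[0] != 0 && b[1] == 0 && b[2] != 0 && b[3] == 0) return Charset.forName("UTF-16LE");
            }
            return Charset.forName("UTF-8");
        }

        public static void main(String[] args) {
            byte[] sample = { 0x7B, 0x00, 0x22, 0x00 };   // "{\"" encoded as UTF-16LE
            System.out.println(sniff(sample));            // prints UTF-16LE
        }
    }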

  6. #6

    Default

    How does that help me? It still can't read the file. It's a file reader; it should handle a BOM. Notepad can handle it, and type (in a command prompt) can handle it. Software should solve problems, not create them, or people are just going to move on.

  7. #7

    Default

    So I removed the BOM, and now I get this:

    Code:
    2016/01/08 14:57:05 - Json Input.0 - ERROR (version 6.0.0.0-353, build 1 from 2015-10-07 13.27.43 by buildguy) : Could not open file #1 : file:///C:/data-integration/wos-es.json --> org.pentaho.di.core.exception.KettleException: 
    2016/01/08 14:57:05 - Json Input.0 - java.lang.NullPointerException
    2016/01/08 14:57:05 - Json Input.0 -  at java.lang.Thread.run (null:-1)
    2016/01/08 14:57:05 - Json Input.0 -  at org.pentaho.di.trans.step.RunThread.run (RunThread.java:62)
    2016/01/08 14:57:05 - Json Input.0 -  at org.pentaho.di.trans.steps.jsoninput.JsonInput.processRow (JsonInput.java:344)
    2016/01/08 14:57:05 - Json Input.0 -  at org.pentaho.di.trans.steps.jsoninput.JsonInput.getOneRow (JsonInput.java:390)
    2016/01/08 14:57:05 - Json Input.0 -  at org.pentaho.di.trans.steps.jsoninput.JsonInput.openNextFile (JsonInput.java:206)
    2016/01/08 14:57:05 - Json Input.0 -  at org.pentaho.di.trans.steps.jsoninput.JsonInput.readFileOrString (JsonInput.java:258)
    2016/01/08 14:57:05 - Json Input.0 -  at org.pentaho.di.trans.steps.jsoninput.JsonInput.parseJson (JsonInput.java:281)
    2016/01/08 14:57:05 - Json Input.0 -  at org.pentaho.di.trans.steps.jsoninput.JsonReader.getPath (JsonReader.java:180)
    Guess this node isn't quite ready for prime time.

  8. #8
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    If a file specification specifically says that it's not allowed, then it's not allowed.
    Removing it is the only way to create a file that meets the specifications.

    The error that you are getting usually tells me that the user didn't configure the step fully - often by neglecting to specify a field type.
    If you post a version of your KTR with just your JSONInput step, I'd be happy to look at it and tell you what the most likely cause is.

    I've been able to get the JSONInput step working such that it has a 100% success rate. Not saying that I've had 100% luck with getting it configured, just that once it's configured, it reads the file 100% of the time.

  9. #9
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    While gutlez pinned down the remaining problem accurately, I would like to comment on something else.

    Quote Originally Posted by Martin_K View Post
    It's a file reader, it should handle a BOM. It's like, notepad can handle it. type (in a command prompt) can handle it.
    Each runtime environment has its own concepts and rules. Notepad will assume a text file to be single-byte encoded (e.g. cp1252) if the UTF-8 BOM (which indicates a variable-length encoding) is missing, hence the MS recommendation to "always prefix plain text files with a [UTF-8] BOM". Java assumes UTF-8 when no BOM is found. Java engineers tried more than once to introduce UTF-8 BOM support, but they always found some existing code breaking.
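
    You can see this for yourself with a quick standalone sketch (nothing Kettle-specific):

    Code:
    import java.io.ByteArrayInputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    // Shows that Java's UTF-8 decoder does not consume a BOM:
    // the first character read is U+FEFF, which a JSON parser then chokes on.
    public class BomDemo {
        public static void main(String[] args) throws Exception {
            byte[] utf8WithBom = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '{', '}' };
            InputStreamReader reader =
                    new InputStreamReader(new ByteArrayInputStream(utf8WithBom), StandardCharsets.UTF_8);
            int first = reader.read();
            System.out.printf("first char: U+%04X%n", first);   // prints U+FEFF, not '{'
        }
    }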

    Quote Originally Posted by Martin_K View Post
    Software should solve problems, not create them, or people are just going to move on.
    Sometimes you must make compromises, especially when you find yourself being collateral damage in a clash of titans (MS and Sun, at that time). You might find yourself moving on quite a lot if you're not able to adapt. But hey, open a Kettle feature request and see if you gather allies. Or develop an enhanced JSON-Input plugin that does what you want. Finally, you can try to convince Pentaho to add UTF-8 BOM support to JSON-Input by licensing the Enterprise Edition - money can sometimes be very convincing.

    BTW: I would keep UTF-8 BOM removal separate from my Kettle job as much as possible. GNU recode, for example, is a tool that can hide the unwanted BOM from Kettle at the shell level.
    Last edited by marabu; 01-09-2016 at 06:42 AM. Reason: one more link added
    So long, and thanks for all the fish.

  10. #10

    Default

    @gutlez

    The crash happened when I pressed the "preview fields" button, hoping it would generate the metadata for me. After I configured the fields by hand, it did work. Thanks!

    @marabu

    I think the difficulty in deciding which features are appropriate stems from the fact that this is a combined file reader + JSON parser node. I'd rather there was a separate text reader with its own options; there, you could better argue that it should do its job really well. Thanks to you as well.
