Hitachi Vantara Pentaho Community Forums

Thread: TableOutput and Charset (or Codepage), string length too long

  1. #1
    Join Date
    Apr 2010
    Posts
    3

    TableOutput and Charset (or Codepage), string length too long

    I have a transformation that reads data from a DB2 instance and writes to another DB2 instance. Unfortunately, those DB2 instances have different codepages, 1252 and 1208 (in Java-speak: Charset Cp1252 and UTF-8).

    I have a string column defined as VARCHAR(255). The target DB has the same column definition, and the strings are supposed to move unchanged from source to target.

    However, some strings really are 255 characters long in the source and also contain special characters, so their byte length exceeds 255 once UTF-8 encoding is applied. When I try to insert such a string into the target table, I get an error: SQLSTATE 22001, meaning "value too long".
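    To illustrate the mismatch, here is a minimal Java sketch that compares the character count of a string with its encoded byte length in both codepages. The class name `ByteLengthDemo` and the sample strings are made up for illustration, and it assumes the JVM provides the `windows-1252` charset (common, but not guaranteed by the Java specification).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compares character count with encoded byte length.
class ByteLengthDemo {

    // Number of bytes the string occupies in the given charset.
    static int byteLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 (Cp1252) is available on this JVM.
        Charset cp1252 = Charset.forName("windows-1252");

        // "Groesse" written with o-umlaut and sharp s, plus a euro sign;
        // Unicode escapes keep the source file encoding-independent.
        String[] samples = { "Gr\u00f6\u00dfe", "\u20ac" };

        for (String s : samples) {
            System.out.println(s.length() + " chars, "
                    + byteLength(s, cp1252) + " bytes in Cp1252, "
                    + byteLength(s, StandardCharsets.UTF_8) + " bytes in UTF-8");
        }
    }
}
```

    In a single-byte codepage such as Cp1252 every character takes one byte, so a full 255-character value fits; in UTF-8 any non-ASCII character takes at least two bytes, which is where the overflow comes from.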

    It is acceptable for me to truncate the strings in Kettle so that they fit into the target column. I thought this would be a fairly standard problem in an ETL context, so I wonder if there is a step or something else within Kettle that can truncate the strings for me. So far I have not found anything.

    Any hints?

  2. #2
    Join Date
    Sep 2009
    Posts
    810

    Hi there,

    would the "Strings cut" step from the "Transform" section "cut" it for you?

    Cheers

    Slawo

  3. #3
    Join Date
    Apr 2009
    Posts
    337

    You could even use JavaScript, but check that it does not add a noticeable performance overhead.

  4. #4
    Join Date
    Apr 2010
    Posts
    3

    No, it wouldn't. I cannot determine the cutoff index as a fixed value, because it depends on the data. Some strings don't need to be cut off at all, others might need to be cut off after the 128th character (of 255). It depends on the number of multi-byte characters in the string (i.e. characters which are encoded by two or more bytes in UTF-8).

    We need to do something like this: encode the string into the target encoding and store the byte length of each encoded character in a lookup table. Then determine the difference between the encoded byte length and the maximum allowed byte length. Knowing this number and using the character byte-length table, we can determine how many characters have to be cut off at the end of the string so that it fits into the maximum length.
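    A minimal Java sketch of this idea follows. It is a simplified variant, not a built-in Kettle feature: instead of building a lookup table and trimming from the end, it walks the string from the front, adds up the UTF-8 byte length of each code point, and stops before the limit would be exceeded. The class and method names (`Utf8Truncate`, `truncateToUtf8Bytes`) are hypothetical; the logic could probably be adapted for a scripting step such as "User Defined Java Class".

```java
// Hypothetical helper: truncates a string so its UTF-8 encoding fits a byte limit.
class Utf8Truncate {

    // UTF-8 byte length of one Unicode code point (1 to 4 bytes).
    // An unpaired surrogate is counted as 3 bytes, which can only
    // overestimate the encoded size, so the result still fits.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    // Returns the longest prefix of s whose UTF-8 encoding is at most
    // maxBytes long, never splitting a character or a surrogate pair.
    static String truncateToUtf8Bytes(String s, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int len = utf8Length(cp);
            if (bytes + len > maxBytes) {
                break; // the next character would exceed the limit
            }
            bytes += len;
            i += Character.charCount(cp);
        }
        return s.substring(0, i);
    }

    public static void main(String[] args) {
        // 255 u-umlauts: 255 characters, but 510 bytes in UTF-8.
        StringBuilder sb = new StringBuilder();
        for (int k = 0; k < 255; k++) {
            sb.append('\u00fc');
        }
        String cut = truncateToUtf8Bytes(sb.toString(), 255);
        System.out.println(cut.length() + " characters kept");
    }
}
```

    Walking forward avoids encoding the whole string first, and strings that already fit are returned unchanged, so the same call can be applied to every row.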


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.