Hitachi Vantara Pentaho Community Forums

Thread: Reading from PostgreSQL database with SQL_ASCII encoding

  1. #1
    Join Date
    Feb 2011
    Posts
    12

    Default Reading from PostgreSQL database with SQL_ASCII encoding

    Hello Pentaho Community,

    this is my first post about Pentaho and Kettle, so first of all, please excuse my English.

    My problem is that when reading data from a PostgreSQL 9 database with SQL_ASCII encoding, the Table Input step doesn't recognize special characters (I'm in a Spanish environment with accents and ñ).

    I have found a post talking about something similar: http://forums.pentaho.com/showthread...database-data-from-SQL_ASCII-to-UTF8-through-ISO-8859-1

    but it seems nobody has an idea or workaround.

    I have read the PostgreSQL 9 documentation and found the following note:

    "The SQL_ASCII setting behaves considerably differently from the other settings. When the server character set is SQL_ASCII, the server interprets byte values 0-127 according to the ASCII standard, while byte values 128-255 are taken as uninterpreted characters. No encoding conversion will be done when the setting is SQL_ASCII. Thus, this setting is not so much a declaration that a specific encoding is in use, as a declaration of ignorance about the encoding. In most cases, if you are working with any non-ASCII data, it is unwise to use the SQL_ASCII setting, because PostgreSQL will be unable to help you by converting or validating non-ASCII characters".

    So I think this is a problem between this PostgreSQL restriction and the JDBC driver, which uses the JVM encoding (UTF-8) by default, isn't it?
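
    If I understand it right, something like this happens inside the JVM (just a sketch with invented bytes, not my real data):

    import java.nio.charset.StandardCharsets;

    public class EncodingDemo {
        public static void main(String[] args) {
            // "año" stored as ISO-8859-1 bytes, which is what an SQL_ASCII
            // database typically contains for Spanish text: the server hands
            // the bytes through without any conversion.
            byte[] raw = {(byte) 0x61, (byte) 0xF1, (byte) 0x6F}; // a, ñ, o

            // If the driver/JVM decodes them as UTF-8, 0xF1 starts an invalid
            // sequence and becomes U+FFFD, the replacement character.
            System.out.println(new String(raw, StandardCharsets.UTF_8));      // prints a\uFFFDo

            // Decoding with the encoding the data was really written in
            // recovers the text.
            System.out.println(new String(raw, StandardCharsets.ISO_8859_1)); // prints año
        }
    }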

    Does anybody know any kind of workaround for this problem? Note that it is not possible for me to change the database encoding, because that is outside my responsibility.

    Thanks very much.

    Regards.

  2. #2
    Join Date
    Sep 2009
    Posts
    810

    Default

    Hey robertbi,

    if the server does not convert the charset, it will likely just hand back the raw byte representation of the characters. Some nice setup you are facing there.
    Would you be able to select a couple of rows showing the issue and use the Serialize to file step to attach some demo rows for us?

    I'll look into it and maybe it's possible to do the conversion in the row stream.

    Cheers
    Slawo

  3. #3
    Join Date
    Feb 2011
    Posts
    12

    Default Continue doing tests

    Thanks for the response, Slawomir.

    I have investigated a little more and run some tests.
    - With PostgreSQL 9 on a Windows machine, using the JDBC driver postgresql-9.0-801.jdbc3.jar, and a database configured as:
    CREATE DATABASE test_encoding_sql_ascii
    WITH OWNER = postgres
    ENCODING = 'SQL_ASCII'
    TABLESPACE = pg_default
    LC_COLLATE = 'Spanish, Spain'
    LC_CTYPE = 'Spanish, Spain'
    CONNECTION LIMIT = -1;

    Launching an SQL query of the form 'select cast(strtest as bytea) as strtest_encode from test_table' from a Table Input step works, and it shows me the special characters in the preview window.
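
    In case it is useful, this is more or less how I read the bytea column back and decode it myself from plain JDBC (a sketch; the connection details are invented, and I am assuming the stored bytes are ISO-8859-1):

    import java.nio.charset.StandardCharsets;
    import java.sql.*;

    public class ByteaReadDemo {
        public static void main(String[] args) throws SQLException {
            String url = "jdbc:postgresql://localhost:5432/test_encoding_sql_ascii";
            try (Connection con = DriverManager.getConnection(url, "postgres", "secret");
                 Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                         "select cast(strtest as bytea) as strtest_encode from test_table")) {
                while (rs.next()) {
                    // The bytea cast hands back the raw bytes untouched, so we
                    // can choose the decoding ourselves instead of letting the
                    // driver guess.
                    byte[] raw = rs.getBytes("strtest_encode");
                    System.out.println(new String(raw, StandardCharsets.ISO_8859_1));
                }
            }
        }
    }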

    - But I have discovered that the real target PostgreSQL server is version 7.3.2, and its database configuration is:
    CREATE DATABASE "databasename"
    WITH OWNER = dbowner
    ENCODING = 'SQL_ASCII';

    And it seems that in PostgreSQL 7.3 it is not possible to cast from a character type to bytea (the same query returns an error).

    - I have also tried setting some parameters in the database connection options, but it doesn't work:
    allowEncodingChanges=true
    charset=SQL_ASCII
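
    For reference, this is how the same options look on a raw JDBC URL (a sketch with invented host and credentials; allowEncodingChanges and charSet are documented pgjdbc parameters, but if I read the driver docs right, charSet only applies to servers up to 7.2, so it may not help against 7.3):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class ConnectionOptionsDemo {
        public static void main(String[] args) throws SQLException {
            // Same options passed on the JDBC URL instead of the Kettle
            // connection dialog (host and credentials invented).
            String url = "jdbc:postgresql://dbhost:5432/databasename"
                    + "?allowEncodingChanges=true&charSet=SQL_ASCII";
            try (Connection con = DriverManager.getConnection(url, "dbowner", "secret")) {
                System.out.println("connected: " + !con.isClosed());
            }
        }
    }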

    I don't have much hope.

  4. #4
    Join Date
    Sep 2009
    Posts
    810

    Default

    Hey robertbi,

    if the query works without casting, but the output looks garbled, we may have a chance of correcting the error in the row stream.
    If you'd like me to try that, please serialize a few of these rows and I'll have a look.

    Cheers
    Slawo

  5. #5
    Join Date
    Mar 2009
    Posts
    112

    Default

    I once had a similar problem. Perhaps you can try this: http://wiki.postgresql.org/wiki/File...icit-casts.sql
    Implicit casts were removed from Postgres in version 8.3.
    On the other hand, I've learned that encodings can be a real PITA, so I always try to work with UTF-8. I don't think Postgres is going to handle the encodings correctly, so my best bet would be to script the output.

    Hope it helps in some way.
    Best regards,
    José Ignacio Santa Cruz G.

  6. #6
    Join Date
    Feb 2011
    Posts
    12

    Default Serialized file

    I have attached a serialized file with some rows with special characters...

    thanks
    Attached Files

  7. #7
    Join Date
    Feb 2011
    Posts
    12

    Default

    Thanks, José Ignacio. I have reviewed this option too, but I don't have control over the Postgres source server, so I'm not able to create functions, special conversions, or other objects in the database. I think encodings are a headache, so I also try to always work with UTF-8.

  8. #8
    Join Date
    Mar 2009
    Posts
    112

    Default

    Do you have some way of running a psql -l command? That lists the databases in Postgres and the encoding each one uses. To correctly handle accented characters and ñ, the POSIX or C locale is recommended (for Postgres databases), but it only works if the database cluster used by Postgres was initialized with the POSIX or C locale; otherwise it won't work regardless of the encoding or collation you use when creating the database. POSIX or C allows you to create databases in any encoding you choose.

    I found this: http://www.webtoolkit.info/javascript-utf8.html and perhaps it may help a bit. I suspect the step is getting the data in UTF-8, while Kettle is running on ISO-8859-1 or WIN-1252. If so, the JavaScript in the URL I posted could help: just get the data with the default encoding (no additional options) and apply a JavaScript step to decode it. Another option is to program your own step. You can also try adding this option to the Java call in the batch file: -Dfile.encoding=whatever
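
    The same decoding trick also works in plain Java, for example inside a User Defined Java Class step (a sketch; it assumes the garbling came from UTF-8 bytes being decoded as ISO-8859-1, which is a lossless mistake and therefore reversible):

    import java.nio.charset.StandardCharsets;

    public class ReDecodeDemo {
        // If UTF-8 bytes were wrongly decoded as ISO-8859-1, every byte maps
        // to some character, so nothing was lost: re-encode as ISO-8859-1 to
        // get the original bytes back, then decode them as UTF-8.
        static String fixMojibake(String garbled) {
            byte[] originalBytes = garbled.getBytes(StandardCharsets.ISO_8859_1);
            return new String(originalBytes, StandardCharsets.UTF_8);
        }

        public static void main(String[] args) {
            String garbled = "aÃ±o"; // "año" after the wrong decode
            System.out.println(fixMojibake(garbled)); // prints año
        }
    }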

    Best regards, and if you solve the problem please tell us how you did it.
    José Ignacio Santa Cruz G.

  9. #9
    Join Date
    Sep 2009
    Posts
    810

    Default

    Hi robertbi,

    sorry to report that the chars in question are serialized as "0xEF 0xBF 0xBD" in UTF-8, which is the "REPLACEMENT CHARACTER" or "UNKNOWN CHARACTER" code. It either means that the database is giving that char to the JDBC driver (that would mean the situation is hopeless unless you can change the server settings) or that the chars got lost during the serialization process, in which case I'd suggest looking into jsanta's suggestions.
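
    A quick way to check on your side whether the damage happens before the rows reach Kettle (a sketch; once a character is U+FFFD the original byte is gone, so no later conversion can recover it):

    public class ReplacementCheckDemo {
        // U+FFFD is what a decoder substitutes for bytes it cannot map. If it
        // shows up in a field, the original byte values were already lost
        // during decoding.
        static boolean isDamaged(String value) {
            return value.indexOf('\uFFFD') >= 0;
        }

        public static void main(String[] args) {
            System.out.println(isDamaged("a\uFFFDo")); // true  -> data lost
            System.out.println(isDamaged("año"));      // false -> still intact
        }
    }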

    Another idea: would you be able to access the server and have postgres dump the contents of the table for you using its native tools?

    Cheers
    Slawo

  10. #10
    Join Date
    Feb 2011
    Posts
    12

    Default

    Thank you very much for all your suggestions guys...

    I have not found a solution that can be applied. I think the only way to get a workaround is to change the database or server configuration, but I don't have permission to modify the PostgreSQL server, so there's no solution at the moment.

    I would recommend that everybody designing applications and databases avoid the SQL_ASCII encoding (it may be useful for writing, but it is the worst option for reading data nowadays, when the standard seems to be UTF-8). Don't you think the same?

  11. #11
    Join Date
    Jan 2014
    Posts
    4

    Default Reading SQL_ASCII

    What I did, after searching a lot, was:

    entity_decode(nomyape::text)::character varying(250) AS nomyape

    That worked for me to read the special characters.
    Last edited by romi_ponti; 01-24-2014 at 01:53 PM.

