Hitachi Vantara Pentaho Community Forums

Thread: (beginner) Get base64 LDAP attributes into SQL (truncating, field size) varchar(max)

  1. #1
    Join Date
    Feb 2017
    Posts
    14

    Default (beginner) Get base64 LDAP attributes into SQL (truncating, field size) varchar(max)

    I am trying to transform LDAP input data to an SQL table. A few of my LDAP attributes contain base64 encoded binary data, very large in size.
    I see how I can set the length of the fields in the Fields tab of my LDAP Input. They seem to default to VARCHAR(100) for Strings, since that's what my SQL Create script (also created by Kettle) uses in its definitions of the fields where the Length is not explicitly defined.

    How do I change the field definitions for my large base64 fields to the SQL equivalent of varchar(max)?

    ETA: I'm asking because the SQL INSERT statements currently fail for these large fields with a truncation exception; the default VARCHAR(100) columns aren't long enough.
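    To illustrate (MS SQL Server syntax; the table and column names here are just placeholders), this is roughly what the generated script gives me today versus what I think these fields need:

        -- Kettle's default when no Length is set vs. what the base64 fields require
        CREATE TABLE ldap_users (
            cn              VARCHAR(100),   -- default for String fields
            photo_b64       VARCHAR(MAX),   -- what I'm trying to get Kettle to emit
            fingerprint_b64 VARCHAR(MAX)
        );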
    Last edited by stephanb; 02-02-2017 at 07:49 PM.

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    I have a feeling that you created the target table using the SQL statement the Table-Output step offered you.
    Don't expect Kettle to know the right size for a column - it's just guessing, and 100 is the default for String fields when no educated guess is possible.
    Don't fiddle with length in the field metadata, do a proper table design instead.
    The truncation error will go away as soon as your target field is correctly dimensioned.
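    For example, once you know a sane upper bound, a statement along these lines (MS SQL Server syntax; table and column names are placeholders) makes the truncation go away for good:

        -- widen the column to a properly measured size instead of the guessed default
        ALTER TABLE ldap_users ALTER COLUMN photo_b64 VARCHAR(8000);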

    There's a slight chance you are doing it all wrong, though:
    • Would you mind talking about what's stored in that binary attribute?
    • What's its name? If it's a custom attribute, what's its LDAP syntax?
    • What exactly does the field entry look like in your LDAP-Input step?
    • Why do you want to store binary data as a base64-encoded String in your database?
    So long, and thanks for all the fish.

  3. #3
    Join Date
    Feb 2017
    Posts
    14

    Default

    The base64 data is photo and fingerprint data generated by a third-party piece of software that is integrated into a mesh of COTS products. It's delivered in base64 right now, and I don't have the bandwidth to update that delivery to pure binary in addition to migrating the backend data storage. There is over 200GB of LDAP data that has outgrown its original purpose. Reporting has become an issue, with queries running 7+ hours to get all the data for any given report. The same data in SQL would return results in seconds. LDAP isn't made for that, and I guess you can say that requirements have evolved over the years that this infrastructure has been alive.
    The data is stored in a custom LDAP schema. I've attached the first few lines of the mapping I was able to configure in the LDAP-Input, but I figure it looks just like anyone else's LDAP-Input configuration.

    I used the SQL File-Output to generate the initial table schema for me. I figured I'd use the full functionality of the tool by modifying the field metadata, and I would still like to do that so I don't have to go out-of-bounds and create (and maintain) a table schema separately from the transformation project. Is there no way to set a field up so the result would be a varchar(max) after the SQL File-Output?

    [Attachment: fields.jpg]
    Last edited by stephanb; 02-03-2017 at 12:29 PM.

  4. #4
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Thanks for providing some background.
    To answer your question: field length is part of the field metadata and can be set in almost every step that has a field list.
    You can simply change it in your LDAP-Input step.
    It can even be changed mid-stream using a Select-Values step.

    Your screenshot was aimed too high; I wanted to see the fields containing binary data.
    So long, and thanks for all the fish.

  5. #5
    Join Date
    Feb 2017
    Posts
    14

    Default

    Ah, I see. The binary fields look the same as the others: almost all my fields have been marked as String. Only two or three were (correctly) cast as Integers in the Type column.
    So what do I need to enter into the Length column for these (base64) fields to get them emitted as varchar(max) in the generated SQL?
    I can enter an integer "x" there, and the result will be varchar(x). I tried entering the word "max" to see if it would translate properly, but it didn't.

    [Attachment: fields.jpg]
    Last edited by stephanb; 02-03-2017 at 08:27 PM.

  6. #6
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Kettle doesn't care about the varchar size limit supported by your particular SQL server; you must look it up yourself.

    But isn't a varchar column entirely wrong?

    When it comes to binary data, LDAP-Input gives you a couple of chances to botch things up.
    In order to configure LDAP-Input correctly, you should consult your LDAP schema to find out the syntax of the fingerPrints (eTCustomField63 ?) attribute and whether it's multi-valued.
    Attributes holding binary data typically have the syntax "Octet String" or "JPEG".
    If you want to read such attribute values as-is, you must attach the attribute option "binary", as in "eTCustomField63;binary".
    That's how you tell the LDAP server that you don't want any conversion or translation to happen before the data is sent to the client.
    Of course, you must also tell Kettle to fetch the value as Binary, or Kettle will ruin it by converting on the client side.
    Select field type Binary and you have what you need to store the data in a BLOB column on your SQL server.
    For single-valued attributes, that is.
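    A minimal sketch of such a target table, assuming MS SQL Server (names are placeholders; other servers would use BLOB or LONGBLOB instead of VARBINARY(MAX)):

        -- target for raw binary values read with Kettle field type Binary
        CREATE TABLE person_biometrics (
            user_dn     VARCHAR(512)   NOT NULL,
            fingerprint VARBINARY(MAX) NULL,
            photo       VARBINARY(MAX) NULL
        );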

    If you really know what you're doing, forget what I said.
    If you don't, find somebody to blame when **** hits the fan.
    Kettle isn't a guide dog, it doesn't take you by the hand when it comes to LDAP and SQL.
    You have to know your tools.
    No offense meant.
    Take care.
    So long, and thanks for all the fish.

  7. #7
    Join Date
    Feb 2017
    Posts
    14

    Default

    I get what you're saying, I really do. The binary data in these attributes is stored as a (base64-encoded) single-valued directoryString in the LDAP server; it is not stored in binary format. The only thing I want to do at this point is take that string and store it in SQL. To do that I need a varchar column, and since I don't know the maximum size this data can reach, I need the "varchar(max)" SQL definition. All I wanted to know is whether it's possible to tell Kettle to use that definition verbatim in its SQL code generation for the destination table.
    I don't need Kettle to know what "max" is, and I don't want Kettle to treat "max" as a variable. I just want to be able to tell Kettle to define the target column for these fields as varchar(max). At this point I'm going to assume it can't do that and manually adjust the table definition.
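    Concretely, after letting Kettle generate the table, I'll just patch it by hand along these lines (MS SQL Server syntax; the names are placeholders for my actual table and columns):

        ALTER TABLE ldap_users ALTER COLUMN photo_b64       VARCHAR(MAX);
        ALTER TABLE ldap_users ALTER COLUMN fingerprint_b64 VARCHAR(MAX);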

  8. #8
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    OK.

    Usually, when using the "Get Fields" action, Kettle will analyze the first n rows to determine field metadata; you will be prompted for n, with a default of 100, I think. I seldom use this feature, though. That's guessing, and you have no guarantee that larger values won't come later in your dataset.

    If you decouple extraction (LDAP-Input) and loading (Table-Output) by storing the extracted data (200GB ?) on a fast disk, you could use a throw-away transformation to determine the largest value without putting too much stress on the LDAP server. LDAP-Input will read what is stored, no matter the length of the attribute value. A Calculator step can determine the length of each value, and a Group-By step can find the maximum length via an aggregate function.

    That will give you the exact maximum length you need for your DDL statement, but is it worth the trouble?
    I would just read a sample and add a safety margin.
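    If a sample has already landed in a staging table, the same check can be read off in plain SQL (MS SQL Server syntax; table and column names are placeholders):

        -- observed maximum lengths of the big fields in a staged sample
        SELECT MAX(LEN(photo_b64))       AS max_photo_len,
               MAX(LEN(fingerprint_b64)) AS max_fp_len
        FROM   staging_ldap_users;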

    Let's hope that your database server is able to store all those values in a varchar column.
    MySQL, for example, will most certainly not be able to, given its 64KB row-size limit.

    EDIT: Are you by any chance talking about MS SQL Server? varchar(max) isn't understood by all SQL servers.
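    If it's not MS SQL Server, the equivalent "unbounded string" types are spelled differently; for orientation (table and column names are placeholders, run only the statement for your dialect):

        -- MySQL: TEXT/BLOB values count only a few bytes toward the 64KB row-size limit
        ALTER TABLE ldap_users MODIFY photo_b64 LONGTEXT;

        -- PostgreSQL: TEXT has no declared length limit
        ALTER TABLE ldap_users ALTER COLUMN photo_b64 TYPE TEXT;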
    Last edited by marabu; 02-06-2017 at 11:30 AM.
    So long, and thanks for all the fish.

  9. #9
    Join Date
    Feb 2017
    Posts
    14

    Default

    Is there a way for me to put an intermediate step in my transformation that prints out the field size for each record it processes?
    I have over 120 attributes for each user, and the SQL column size for one or more attributes is too small; Kettle fails with little identifying information as to which attribute is wrongly sized. Any suggestion to help me identify which column is causing the insertion failure would be helpful.
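    On the SQL side, the best I've come up with is to list the declared column sizes and eyeball them against the data (MS SQL Server; the table name is a placeholder, and -1 in the length column means MAX):

        -- declared sizes of all target columns, smallest first
        SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
        FROM   INFORMATION_SCHEMA.COLUMNS
        WHERE  TABLE_NAME = 'ldap_users'
        ORDER  BY CHARACTER_MAXIMUM_LENGTH;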

  10. #10
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Actually, I described a way to determine the maximum value lengths in my previous post.
    So long, and thanks for all the fish.
