Hitachi Vantara Pentaho Community Forums

Thread: Transposition of fields to rows

  1. #1
    Join Date
    Jun 2007
    Posts
    233

    Question Transposition of fields to rows

    Hi Everyone

    I am exploring a data storage scenario that I would like a little guidance on. The scenario is that a file may be received with any number of columns (fields), and this needs to be stored in a single table. Extracting the raw information into columns (fields) is going to produce an unknown number of them. What I need to be able to do is to turn the data from the following format:

    Filenum | Rownum | Field_1 | Field_2 | etc.....
    1 | 1 | ABC | DEF | etc....

    into

    Filenum | Rownum | Fieldname | Fieldvalue
    1 | 1 | Field_1 | ABC
    1 | 1 | Field_2 | DEF

    I am wondering if anyone has done something similar to this or might be so kind as to point me in the right direction. I was thinking of trying to work with the row denormaliser / row normaliser to achieve the desired result, but I keep turning myself around with it. Any help would be greatly appreciated.

    Kind Regards

    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    The most important piece of information is whether the fields (field_1, field_2, ...) are repeating fields carrying the same type of data or not.
    Another question would be whether the creation of that NF2 input is under your control.
    So long, and thanks for all the fish.

  3. #3
    Join Date
    Jun 2007
    Posts
    233

    Question This is a sample of the desired output

    Hi marabu,

    Thanks for getting back to me. The source files (incoming data) are not under my control, hence the need to break them up this way. The 'raw' files are often PDF, which I push through a text extractor, read in as a single column, and then, depending on the source of the data, start to tear apart with REGEX. During this process I need to store the data at various stages as part of the auditing requirements. In the original processing of the PDF into TXT I read everything in as String. When I have had a chance to run a few REGEX against the data I start to classify the data types, but until then they all stay as String. For what it's worth, they can remain as String right up until the last stage, where they are placed into 'operational' tables.

    I have written a small JS routine to pump out the type of output I am looking for. The attached transformation will actually do the job, but I am loath to use JS for transformations with an unknown number of rows, possibly in the many millions. Maybe I need to write the thing as a Java step of some sort. If there is a way to achieve the same with native components I feel that would be better, but I am unable to figure one out. I am using 5.0.1 stable.
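
    For what it's worth, a rough sketch of what that unpivot logic might look like inside a User Defined Java Class step instead of JS. The key field names 'Filenum' and 'Rownum' are taken from the sample in the opening post, and the processRow() skeleton is just the standard UDJC template, so treat this as a sketch rather than a drop-in step:

        import org.pentaho.di.core.row.RowMeta;
        import org.pentaho.di.core.row.RowMetaInterface;
        import org.pentaho.di.core.row.value.ValueMetaString;

        private RowMetaInterface outMeta;

        public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
        {
            Object[] r = getRow();
            if (r == null) { setOutputDone(); return false; }

            if (first) {
                first = false;
                // Fixed key/value output layout: Filenum | Rownum | Fieldname | Fieldvalue
                outMeta = new RowMeta();
                outMeta.addValueMeta(new ValueMetaString("Filenum"));
                outMeta.addValueMeta(new ValueMetaString("Rownum"));
                outMeta.addValueMeta(new ValueMetaString("Fieldname"));
                outMeta.addValueMeta(new ValueMetaString("Fieldvalue"));
            }

            RowMetaInterface inMeta = getInputRowMeta();
            String filenum = inMeta.getString(r, "Filenum", null);
            String rownum  = inMeta.getString(r, "Rownum", null);

            // Emit one key/value row per remaining input field
            for (int i = 0; i < inMeta.size(); i++) {
                String name = inMeta.getValueMeta(i).getName();
                if ("Filenum".equals(name) || "Rownum".equals(name)) continue;
                putRow(outMeta, new Object[] { filenum, rownum, name, inMeta.getString(r, i) });
            }
            return true;
        }

    This ignores the niceties of declaring the output fields on the step's Fields tab, but it shows the shape of the loop.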

    I am interested in your thoughts.

    Cheers

    The Frog
    Attached Files
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  4. #4
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    AFAICS you have a known set of input fields, so you could simply use "Row Normalizer" to convert from row/column to key/value.
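
    For the sample layout in the opening post, the Row Normaliser settings would look roughly like this:

        Type field: Fieldname

        Fieldname | Type    | New field
        Field_1   | Field_1 | Fieldvalue
        Field_2   | Field_2 | Fieldvalue

    Filenum and Rownum are not listed in the mapping, so they are carried over unchanged onto every output row.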
    Attached Files
    So long, and thanks for all the fish.

  5. #5
    Join Date
    Jun 2007
    Posts
    233

    Default

    Hi marabu

    In the first stage of the process I have a single field (the raw text), but later on, after some processing, I will have an unknown number of fields due to the various types of document being fed to me. They all need to be stored in the same target table regardless of the number of fields. As I understand it, the normaliser doesn't quite work this way. Am I misunderstanding something?

    Cheers

    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  6. #6
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    ETL Metadata Injection is a technique to overcome situations where the structure of your input data only becomes known at runtime.
    You still might use some scripting to extract the fieldnames, but you won't have to worry about scalability as much.
    So long, and thanks for all the fish.

  7. #7
    Join Date
    Jun 2007
    Posts
    233

    Default

    Hi Marabu

    Thanks for the pointer. I will have a play with that step over the weekend and come back on this. I have not used this step before, so I need to do a little reading up on it first. I am thinking that I could use a [Metadata structure of stream] step to pull out the fieldnames, filter the results to remove the ones I need to exclude for the normaliser to work properly (in this case section and line), and then inject the results into the normaliser step. If I understand the concept correctly, the normaliser should receive a list of the fields to normalise (and probably the type-field would have to be in there too) and is then able to use this on the data from the source file / step. It's a nice approach.

    I will attempt to get this up and working over the weekend. Thanks for the pointer Marabu.

    Cheers

    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  8. #8
    Join Date
    Jun 2007
    Posts
    233

    Default

    Hi Marabu,

    I have had a chance to play with the ETL Metadata Injection step, but I'm finding myself at a loss for a successful implementation. I have done some further reading on this tonight and seen an example set of transformations written by Matt (PDI-5422) some time ago. I am seeing from this that the injector step doesn't push the data directly into the desired target step in the same transformation, but rather does so for another transformation of your choosing.

    If this is the case, does that mean I need a 'File Reader' transformation for each type of incoming file, where the metadata is extracted and then injected into a second transformation that reads the file and uses the recovered metadata in the normaliser? i.e. it's a two-or-more-transformation process to get the job done? Am I on the right track here?

    Cheers

    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  9. #9
    Join Date
    Jun 2007
    Posts
    233

    Question

    Hi again Marabu,

    Just an add-on thought. Does this also mean that I will have to worry about reading all the sheetnames and columns from an Excel file? Seems a bit chicken-and-egg if this is right. The same goes for delimited files, though I can always handle them as lines of text and break them up later. If I am right about the XL stuff then, in your opinion, would it be worth my while to write a step that simply extracts the metadata from Excel files as sheetname / column-name pairs and then use it to populate the follow-up transformation(s) => each file -> each sheet -> all columns? This will be an almost daily requirement for the project I am working on. Might be one file in a day or two hundred. All need the same treatment.

    Curious for your thoughts. Maybe there is a better way.

    Cheers

    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  10. #10
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Quote Originally Posted by TheFrog View Post
    does that mean I need a 'File Reader' transformation for each type of incoming file
    Not necessarily. If all your input files are delimited text as outlined in your opening post, I'm still confident you can manage with two transformations: one being a processing template, the other providing the metadata.

    Quote Originally Posted by TheFrog View Post
    Does this also mean that I will have to worry about reading all the sheetnames and columns from an Excel file?
    So you want to ingest Excel files, too? In that case you will need a second pair of transformations. I probably would use a User Defined Java Class to retrieve the metadata in a generic way. Can't remember having done this, but it shouldn't be too expensive with all the source code at hand to borrow from.
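
    As a rough sketch of the kind of code such a UDJC could wrap, something along these lines would list every sheetname / column-name pair in a workbook, assuming the column names sit in the first row of each sheet (Apache POI should already be on the PDI classpath for the Excel steps):

        import java.io.File;
        import java.io.FileInputStream;
        import java.io.InputStream;
        import java.util.ArrayList;
        import java.util.List;

        import org.apache.poi.ss.usermodel.Cell;
        import org.apache.poi.ss.usermodel.DataFormatter;
        import org.apache.poi.ss.usermodel.Row;
        import org.apache.poi.ss.usermodel.Sheet;
        import org.apache.poi.ss.usermodel.Workbook;
        import org.apache.poi.ss.usermodel.WorkbookFactory;

        public class ExcelMetadataRipper {

            // Returns one "sheetname|columnname" entry per header cell in the workbook.
            public static List<String> listColumns(File xlsFile) throws Exception {
                List<String> result = new ArrayList<String>();
                DataFormatter formatter = new DataFormatter();
                InputStream in = new FileInputStream(xlsFile);
                try {
                    Workbook workbook = WorkbookFactory.create(in);
                    for (int s = 0; s < workbook.getNumberOfSheets(); s++) {
                        Sheet sheet = workbook.getSheetAt(s);
                        Row header = sheet.getRow(sheet.getFirstRowNum());  // assume the first row holds the column names
                        if (header == null) {
                            continue;  // empty sheet
                        }
                        for (Cell cell : header) {
                            result.add(sheet.getSheetName() + "|" + formatter.formatCellValue(cell));
                        }
                    }
                } finally {
                    in.close();
                }
                return result;
            }
        }

    Inside a UDJC you would emit one row per entry instead of collecting a list, and feed those rows to the injection step, but the POI calls stay the same.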
    So long, and thanks for all the fish.

  11. #11
    Join Date
    Jun 2007
    Posts
    233

    Default

    Hi Marabu

    Thanks again for getting back to me on this. I have experimented some more with Metadata Injection and seem to be getting nowhere. I must be missing something in my understanding of how to do this, because I am going in circles.

    Let me bounce this off you and see how you would do it. It's kind of generic and simple, I hope. Say there is a table on a database connection with an unknown number of fields. We know the tablename, but that is all. This table (for argument's sake) needs to be normalised. We can do a select * from tablename to retrieve some info with the Table Input step (limit the rows to 1, say, for brevity), and use the 'Metadata Structure of Stream' step to get the fieldnames and types. We don't know what they are beforehand, but we know that the transformation will produce this data.

    Now we have to pump the metadata into a second transformation that will normalise the structure. This second transformation would have a Table Input step the same as the first transformation but without the limit on the number of rows, with a second step being the normaliser. In the first transformation we add a metadata injector step and specify the second transformation as the target (specifically the normaliser step in that transformation).

    So how do we specify the three required chunks of data that need to be set by the metadata injector into the normaliser when we don't know what they are yet (since we don't know the table fields)? Do we just arbitrarily add three stream fields with values? I am a little lost here. I have probably missed something really obvious, and when I see it I'll have a facepalm moment, but I am not seeing the way through this one. Can you straighten me out on this?

    Cheers

    The Frog

    PS: Thanks for the heads-up on the XL stuff. I suspected that I might have to craft something. It has started me thinking that a series of steps designed to extract metadata from unknown sources - kind of like the 'Get Fields' button - would be a good idea for a metadata driven tool.
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  12. #12
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    In the sample I substituted a Data Grid for your Table Input - for the sake of portability ...
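
    The shape of the two transformations is roughly this (outline only):

        driver:   Data Grid (sample data) -> Metadata Structure of Stream
                  -> clean up / keep only the fields to be normalised
                  -> ETL Metadata Injection (target: the Row Normaliser in the template)

        template: Data Grid (full data) -> Row Normaliser (left unconfigured, filled in at injection time)
                  -> output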

    Note to myself: Always remember "Metadata Structure of Stream".
    Attached Files
    So long, and thanks for all the fish.

  13. #13
    Join Date
    Jun 2007
    Posts
    233

    Default

    Marabu, thank you. I get it. This working example is simple and perfect. I didn't realise that the ETL Metadata Injection step actually runs the target transformation. I would have been doing the workload twice! I am going to try and put it through its paces today and migrate some 1500 tables I can toy with. Should give me an idea of how long some real work will take a little later on. I'm going to have to put some time into the XL metadata ripper and might pop it up here for others to play with and test. I am not sure how long it will take me to get that done, as it is not the top priority right now, but I like it as a challenge... that's what spare time is for!

    Marabu, a very big thank you to you.

    Cheers and thanks

    A Very happy Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  14. #14
    Join Date
    Jun 2007
    Posts
    233

    Default

    Hi Marabu,

    Just a passing thought given what I now know about the ETL Metadata Injection step: since the step runs the target transformation by default, I can turn this feature off. If I were to then create a 'standard' normalisation transformation that feeds a given target table, and which 'gets its rows from result', do you think it would be possible to feed this from different 'reader' transformations? Or to put it another way: the first transformation reads the metadata of a given filetype / source and pushes the metadata into the normalisation step (final step). A second transformation would then read the file / source in its entirety and put its rows to result. The third and final part is the normalisation transformation that outputs the normalised data.

    If my understanding is correct then the above concept should work but before I go and plough a few too many hours into it I thought I'd ask. It could be very useful to have a Job / Transformation(s) that could swallow any data into a normalised holding structure.

    Cheers

    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  15. #15
    Join Date
    Jun 2007
    Posts
    233

    Question Test deployment works sometimes but not others...

    Hi Marabu,

    I have built a few transformations and jobs to migrate approximately 1500 tables worth of data into a normalised storage table. They were all created with custom code when ripping data from various source documents, so no two tables are the same. In total there are two jobs (one is a sub job) and four transformations. The whole thing works if I fix the tablename and only work with one table (or at least it seems to).

    The issue comes in when I substitute variables for things like tablenames. When I use a Text file output step to see what is coming out of the various steps, I can tell that the 'normaliser' transformation is reading the table that has been passed to it, but an error occurs directly after, with a Select values step complaining that it hasn't received anything to tell it what to do (a step called Control Metadata). This is definitely specified in the previous transformation, 'Extractor', that gets all the field data. I have verified that the values are being correctly produced in the extractor step (same as in your example, basically), and so the metadata injector should be pushing these through to its target step - but it isn't!

    I have attached a copy of all the jobs and transformations. Having triple-checked my work here, I can't seem to find a fault with it, yet it fails. Since I am new to the injector, I suspect I am once again missing something. Would you mind taking a look when you have the time?

    The exact error message is : At least one of select, remove or meta screens should contain data. Error initializing step [Control Metadata]

    Your guidance would be greatly appreciated. It's odd, but if I hard code the table into the Table input steps instead of using a variable, it seems to work. I'm probably missing something obvious.

    Cheers

    The Frog
    Attached Files
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  16. #16
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Well, you may "Copy rows to result", but "Get rows from result" will find the result rowset to be empty.
    Since the template is executed as a kind of subtransformation, control returns to the first transformation, giving it the chance to clear the result rowset.
    At least I believe that's what's happening.

    Keep in mind that the result rowset must catch all the rows, so the scalability of that approach is limited by available memory.
    Persistent storage (filesystem, database, ...) will work, though.
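    One option along those lines: let the reader transformation end in a "Serialize to file" step and have the normalisation template start with "De-serialize from file", so the rows go through a file on disk instead of the in-memory result rowset.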

    Good luck.

    PS: Didn't see your previous post until I saved mine

    Regarding that error message: I can see you configured the meta page via injection. Did you check the rowcount emitted from the Cleanup step? Seems to be zero, else the error wouldn't come up.
    Last edited by marabu; 01-30-2014 at 04:10 AM.
    So long, and thanks for all the fish.

  17. #17
    Join Date
    Jun 2007
    Posts
    233

    Default

    Hi Marabu,

    Tried to get to the bottom of this ETL injection problem. The Metadata Extractor step is pumping out the row(s), after the cleanup step, into the injection step. I verified this with a Text file output to see what it was handing on down the stream. So the injection step is receiving the data. What appears to be happening is that the injection step isn't actually putting the data into the target steps and is unable to initialise them. This in turn causes a crash of the transformation, and in turn the job. I have attached a row level log output where, near the end of the log, you can see what the steps are trying to do. I am receiving an SQL statement error with no apparent origin when data is fed to the normaliser transformation.

    Ah HA! I think I have found the issue. It seems that the use of a variable in the Table input step when selecting the table fails. The SQL statement error that is coming up is actually, more specifically, saying that it can't find the table. The ODBC driver is reporting that the table can't be found. If I hard set the table in the select statement then it works, but as a variable it apparently doesn't exist. The ODBC driver being used is MS Access on a Win7 64 system. It's the 32-bit driver and I'm using 32-bit Java from Sun / Oracle. No wonder no metadata is being injected if the Table input step can't return any rows. No surprise there. How the hell do I fix this one? Might have to do some experimenting over the weekend with the Table input step.

    Oddly confused Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  18. #18
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Quote Originally Posted by TheFrog View Post
    It seems that the use of a variable in the Table input step when selecting the table fails.
    Generally, variable substitution works, but don't expect the job entry "Set Variables" to yield control to the next job entry before all input rows are processed.
    That means only the variable values taken from the last input row will survive.
    So long, and thanks for all the fish.

  19. #19
    Join Date
    Jun 2007
    Posts
    233

    Default

    Hi Marabu,

    Yeah, I get what you're saying, but that isn't the issue here. The sub job is being processed one row at a time. The variables are being set correctly and passed on correctly. What is happening is that the Table input step, upon receiving the variable, is substituting it into the SQL select statement, but the output rows aren't appearing. This means that the lack of rows provides nothing for the metadata injector to work on, and in turn nothing for the normalisation transformation either. The SQL statement that is being produced shows up in the log and appears to be correct, however the ODBC driver is responding with a 'table unknown' when the Table input step tries to execute the query. This might explain why the whole process works for a single table when I set it in the SQL (instead of variable substitution). I am not sure what to make of this.
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  20. #20
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    From what you explain I gather that you see Kettle substituting the tablename variable, but your database doesn't accept the given name.
    Did you try to enclose the tablename in square brackets, or whatever else makes MS Access happy?
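    For example, a Table Input query of the form SELECT * FROM [${TABLE_NAME}] (with the "Replace variables in script?" option ticked) should hand Access a bracket-quoted name; the variable name here is just a placeholder for whatever your job actually sets.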
    So long, and thanks for all the fish.

  21. #21
    Join Date
    Jun 2007
    Posts
    233

    Default

    Hi Marabu,

    Yeah, did all that. All the normal MS Access tricks. I ended up taking a different approach to solve the issue. Given the evidence at hand, I made the assumption that the issue was not in fact with the tablename being passed but with the ODBC driver itself. What I did was to write some code to transfer the tables to MySQL first (DoCmd.TransferDatabase with DAO to an ODBC connection I made to an empty MySQL schema / DB).

    Ran the entire process from the new MySQL schema and everything works as expected. Didn't change the names or anything like that, just a direct 1-1 copy. It works in MySQL, but ODBC to MS Access is a slightly unreliable approach, it seems.

    Marabu you have been a great help in this and I really appreciate your troubleshooting with me and pointing me in the right direction. I'll set the process in motion on the 'real' data tomorrow at work and see how it goes. I expect it to take about three to four hours depending on network traffic (our internal network is rather clogged). I'll post the results.

    Cheers and many many thanks

    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  22. #22
    Join Date
    Jun 2007
    Posts
    233

    Default

    Hi Marabu,

    The process works quite nicely without MS Access. The tables that were migrated across to MySQL have fed the process perfectly and the whole thing ran without a hitch.

    I don't know if the ODBC driver is broken or something else, but the answer is clear: MDB files are not always your friend. The quirks in their behaviour may not be their fault, but the end result is that they can make the things you are trying to do unreliable.

    Thanks for your patience with all this Marabu, I appreciate the help.

    Cheers

    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  23. #23
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    I'm sure glad to hear you're getting along, Frog.

    Come to think of it, your problems don't originate from MS Access or Kettle. Most likely it's the outdated Sun JDBC-ODBC bridge driver you are using. It's not recommended for production use anyway, so you will have to look for a better driver or stay with your staging solution. But as a Kettle veteran (2007!) you sure know all that and more.
    So long, and thanks for all the fish.

  24. #24
    Join Date
    Jun 2007
    Posts
    233

    Default

    Hi Marabu,

    I started with Kettle / PDI back in 2007, or maybe even a little earlier. I used to be quite competent at using the tools, but I haven't practiced for a couple of years due to a change in role. Back in the game, so to speak, and working towards building a full DW and BI solution for my new role, but first I have to build a few products to get the daily operations under control. Lots of scrappy code everywhere and XL hell to boot. I'm still very much in stage 1 -> planning and organisation, with a little putting out of bushfires as needs arise. Should be an interesting next 12 months as the project evolves.

    The JDBC-ODBC bridge might well have contributed to the problem. I'm sure it probably didn't help. I have also managed to detect that the WDAC / MDAC install on the machine I am using is not stable. It's not possible, for example, to connect to an MDB with DAO. ADO sometimes works. In short, the machine I am using hasn't helped the situation. I ended up doing the migration on another computer. This will generally only need to be done once, I think (unless I break something on the way through), so no biggie.

    Thanks for the help, Marabu. I'm looking forward to when I can offer help back to people in this forum. I'll have the skillset back in order shortly. Thanks for your patience with this one too. There were a few moments on the way through when I was tempted to require a new keyboard due to the inexplicable damage it suffered during testing and migration :-) Still, got there in the end. Thank you for all your help. I really do appreciate it.

    Cheers

    The Frog
    Everything should be made as simple as possible, but not simpler - Albert Einstein

  25. #25

    Default Metadata Injection

    Hi,

    I have a requirement where I need to build a generic graph to read a dynamic list of rows from a CSV file. The Data Grid provides a static list of rows. Can you please post an example of how to dynamically load the metadata and inject the column details into the 'CSV File Input' step using the 'ETL Metadata Injection' component? What component should I use in place of the Data Grid in order to make it generic, as the number and type of columns may vary? I can't use a Table output, because if the database goes down then my processing won't be possible either... please suggest if there is any other way. I just need to take a file (columns may vary), perform a sort, merge diff and select values, and save it to another CSV file.

    My other query is about how to pass the Merge Join diff key as a parameter instead of hard-coding it - is there any way to do that?
