Hitachi Vantara Pentaho Community Forums
Results 1 to 7 of 7

Thread: Best way for batch conversion of Excel files with Java code

  1. #1

    Default Best way for batch conversion of Excel files with Java code

    Hello, this is my first post on the forum - I've only been using Spoon for a few weeks. I'm an experienced Java developer working in Germany.

    My question: I have a job that performs ETL on Excel sheets, extracting data and saving it to a PostgreSQL DB. So far so good. I have some Excel sheets that need extensive transformations first, such as deleting columns and empty rows, renaming and inserting header fields, filling in some missing values. This was formerly done via an Excel macro; I've written a Java class that uses Apache POI to do the job (much faster than the macro).

    What's the best way of integrating this with my job? The Java code needs to read in the whole file to process it, so I'm not sure if a Step is the right way to go. Or should I write a custom plugin for the job?

    My envisioned workflow would be giving the job/step a directory, where it then reads all the .xlsx/.xls files and performs the massive transformations, before moving on to the "easier" ETL steps. I've read as much as I could find about UDJC, but am really unsure if that's what I need to do, and if so how I implement processRow when I need to read in all the rows first.

    Thanks in advance,

    John

  2. #2
    Join Date
    Apr 2008
    Posts
    4,684

    Default

    The idea behind PDI is to show your logic, so that maintainability is easier.

    Each of the things that you want to do (dropping some rows, adding values to others) with the exception of adding header fields (why does a DB need them?) is possible with PDI steps.

    If you do it with PDI steps, rather than direct Java code, then when the next person comes along to help you, they can see the logic that you are using to do the transformation.

    It almost sounds like you might want to look at building a plug-in which preprocesses the xlsx/xls files and then provides a datastream for PDI steps to handle. uwegeercken has some experience with building plug-ins and might be able to provide some input.

  3. #3

    Default

    Thank you for the quick response, gutlez. I haven't been able to figure out how to drop, say, the first ten rows of an Excel file (which are not empty in my case). I also need to cut (or ignore) certain rows in the sheet during processing which have superfluous data. With the standard Microsoft Excel Input Step I can't even get the file to read in correctly - it's not organized as a proper Excel sheet should be. Am I missing something here? If I could get the file read in as an Excel sheet I could obviously ignore the columns I don't need. How about filling in missing values? For this I need to copy values from cells a row above to the next row. You're right about renaming the headers, that's purely a convenience for the processing steps so I know what I'm processing.
    Unfortunately I have no control over the format of the Excel. Can I really achieve all the above with standard steps? Otherwise it does look like building my own plugin, I'm a bit hesitant about plunging in there.

  4. #4
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    You can specify the top left corner of a cell block with each Excel Input step on the Sheets tab.
    While you can't skip columns during read, you can drop the fields from the rowset in a subsequent Select Values step by selecting only columns you are interested in.
    Filling in values is easy, too. Just describe and we will guide you.
    So long, and thanks for all the fish.

  5. #5

    Default

    Thanks marabu, here's an exact description of the steps needed (the order is as they're implemented now):

    1. Remove all empty rows.
    2. Remove the first three rows at the top.
    3. Now iterate through the sheet. If we find a row where the first cell is empty, copy the first three cells from the previous row to this row.
    4. Now look for rows where the second cell is empty. Delete these rows.
    5. Remove the last line of the sheet.

    I already know how to name the fields found, and how to ignore those I don't need.

    Thanks!

  6. #6
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Removing empty rows is an option enabled by default - see Content tab.
    Adjust the row offset on the Sheets tab to ignore rows on the top of the sheet.
    Enable Repeat option of the first field to use the previous value not null as a default for an empty cell.
    Use a Filter Rows step to drop rows with no value in the second field.
    With a suitable filter expression you can drop the last row in the sheet as well.
    So long, and thanks for all the fish.

  7. #7

    Default

    That works perfectly for me. I changed my rule 4 to delete the rows where the fourth cell is empty (since we filled the second cell in the previous step), that also takes care of the last line. The result is exactly the same as the sheet I parsed with Java code using Apache POI (except of course for the extra columns, but I can easily ignore those). Many thanks!

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.