Hitachi Vantara Pentaho Community Forums
Results 1 to 7 of 7

Thread: Testing an ETL

  1. #1
    Join Date
    Jan 2017
    Posts
    7

    Default Testing an ETL

    Hello.

    Guys, I want to test my ETL by counting the rows of my initial "Table input" and, after the transformation and loading process are done, count the rows in my "Table output" and compare if the number is equal.

    I guess is pretty easy but I haven't managed to do it by now... I see there is a tool under "Utility" that says Table Compare, but I also haven't managed to use it correctly, apparently.

    Thanks for the replays.

  2. #2
    Join Date
    May 2016
    Posts
    282

    Default

    There are multiple ways to achieve it, executing a SQL script where you perform the comparison, or running a transformation with just a simple SELECT COUNT(1) FROM each table and then stop with an error if the count is different.

  3. #3
    Join Date
    Jan 2017
    Posts
    7

    Default

    Quote Originally Posted by Ana GH View Post
    There are multiple ways to achieve it, executing a SQL script where you perform the comparison, or running a transformation with just a simple SELECT COUNT(1) FROM each table and then stop with an error if the count is different.
    Thanks for replying.

    Yes, I do understand "what" I need to do, but it is in the "how" that I'm having trouble. I suppose I have to put a simple if statement somewhere, but I don't know where or how that would be. Aditionaly, I don't know how to raise an error :/

  4. #4
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by jdg9999 View Post
    Aditionaly, I don't know how to raise an error :/
    I take it that you're a programmer.
    STOP.

    Take a big step back.
    Programming by flow is not like event based programming, or even like procedural programming.

    So now that we're a step back from things...

    What is the goal behind counting the rows coming in from the table input and comparing to table output. They won't necessarily be the same. If you use a statistics step, it could be a lot less. If you use a split rows step, it could be a lot more.

  5. #5
    Join Date
    Jan 2017
    Posts
    7

    Default

    Yes, I am a developer, but two weeks into working with Pentaho DI.

    Ok, here is how the whole thing is supposed to go:

    I have this very simple ETL: Table input, select values, Table output. I have to run it periodically (I believe that is called a Job, something I will have to figure out afterwards).
    Now, 'Big Boss' asked me that I have to verify that whenever the ETL runs, it has run smoothly and check if every single row has moved (I believe Pentaho DI does that by itself; he is not so sure). Yes, the row count is not supposed to be the same, since what I will have to do is some sort of upsert, kind of thing (but anyway, I guess I would still have to count the rows from the data input and the data output and compare them. I think I have figured out the "Table compare" object, but riiiight at this very second I can't execute the ETL since my local server is malfunctioning).

    So yeah, that's it.

    Thanks.

  6. #6
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    I have a bit of (very outdated) programming experience, and usually tell coders to forget (almost) everything they know. Flow based programming doesn't work the same way. But then, I think you've probably discovered that by now.

    Yes, if there is an issue with any of the three steps (Table Input / cannot connect to server, Select Values / Value cannot be converted, Table Output / Duplicate Key violation) then PDI itself will throw an error on that step (you can define error handling, or you can leave them unhandled and the whole transform will fail)

    Untested advise follows:
    Have a look at the "output step metrics" step. You should now be able to collect the input step rows read and the table output steps written and compare them. But then what? Did you want to fail the transformation and try to roll back? What is the end goal of doing this comparison?

  7. #7
    Join Date
    Jan 2017
    Posts
    7

    Default

    I solve the issue in a somewhat unorthodox manner. I created a new table where I would write the number of rows read/written for each step and compare them. I know it' is a strange thing to do, but it was required from me, so... that's how it went.

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.