Hitachi Vantara Pentaho Community Forums

Thread: What's the best way to use Kettle to clean data?

  1. #1
    Join Date
    Jun 2007
    Posts
    112

What's the best way to use Kettle to clean data?

    I'm new to ETL and I'm trying to figure out the best way to use Kettle to flag and/or correct problems in my data.

One method would be to extract the data to memory and then put it through a series of sub-transformations. Each sub-transformation would have a data quality test. Kettle would filter the results, and failing rows would be flagged or corrected before passing them on to the next step. This setup would be like an assembly line, with different data quality conditions being tested at each step along the way.
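    In plain JavaScript, the assembly-line idea could be sketched as a chain of check functions, each flagging or correcting a row before handing it to the next. This is only an illustration of the concept, not actual Kettle configuration; the field names and checks are made up:

    ```javascript
    // Hypothetical sketch of an assembly line of data-quality checks.
    // Each check inspects a row and either flags a problem or corrects
    // the value before the next check runs. Names are illustrative.

    function checkNotBlank(row) {
      if (row.name === null || String(row.name).trim() === "") {
        row.errors.push("name is blank");
      }
      return row;
    }

    function checkAgeIsNumber(row) {
      var n = Number(row.age);
      if (isNaN(n)) {
        row.errors.push("age is not a number: " + row.age);
      } else {
        row.age = n; // correct in place: coerce to a real number
      }
      return row;
    }

    // Run every row through the chain of checks in order.
    function runPipeline(rows, checks) {
      return rows.map(function (row) {
        row.errors = [];
        checks.forEach(function (check) { check(row); });
        return row;
      });
    }

    var cleaned = runPipeline(
      [{ name: "Ann", age: "34" }, { name: "", age: "abc" }],
      [checkNotBlank, checkAgeIsNumber]
    );
    ```

    In Kettle each "check" would be its own step or sub-transformation, with the error fields travelling along in the stream.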

    A different idea would be to copy my extracted data into a staging area in my database, and then run Kettle transformations that query for problems like invalid column values or referential integrity violations. I would only have to process the set of rows that failed each screen, and I could run some screens in parallel.

    Is there a best practice for this? I've been reading The Data Warehouse ETL Toolkit by Ralph Kimball and Joe Caserta, but I'm not sure how to put their ideas into practice with Kettle.

    Thanks for your advice!
    Nathan

  2. #2


    My two coppers, although I haven't read "The Data Warehouse ETL Toolkit" (I do keep meaning to):

    For an actual Kettle implementation, here's how I've approached data cleaning:

    *Text file input step: configure all fields as 'String'. This way you can handle errors in a more controlled manner if you need to.

    *Use ModJS (Modified Java Script Value) step(s) to check individual fields for validity (i.e. take the 'string' field and check whether it is a date, number, etc.). Then you can create separate fields in the datastream to mark individual field warnings, field errors, etc. With this approach, you control what is considered a warning and what is a rejectable error, and you can detail exactly what was wrong (i.e. field was blank, field was poorly formatted, field value is not in the lookup (see next section), field value is not a proper value (expected M/F, got N), etc.).
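    A per-field check along these lines might look like the following sketch in plain JavaScript (the language ModJS uses). The field names, formats, and warning/error split are made up for illustration; in a real ModJS step the extra fields would be added to the stream:

    ```javascript
    // Sketch of per-field validation logic of the kind you might put in a
    // Modified Java Script Value step. Everything arrives as a String; we
    // emit separate warning/error fields describing what was wrong.

    function validateRow(row) {
      var warnings = [];
      var errors = [];

      // Is birth_date a real date in YYYY-MM-DD form?
      if (!/^\d{4}-\d{2}-\d{2}$/.test(row.birth_date) ||
          isNaN(new Date(row.birth_date).getTime())) {
        errors.push("birth_date poorly formatted: '" + row.birth_date + "'");
      }

      // Is amount numeric? A blank is only a warning in this example.
      if (row.amount === "") {
        warnings.push("amount is blank");
      } else if (isNaN(Number(row.amount))) {
        errors.push("amount is not a number: '" + row.amount + "'");
      }

      // Is gender one of the expected codes?
      if (row.gender !== "M" && row.gender !== "F") {
        errors.push("gender not a proper value (expected M/F, got " + row.gender + ")");
      }

      // Mark the row: these would become extra fields in the stream.
      row.field_warnings = warnings.join("; ");
      row.field_errors = errors.join("; ");
      row.rejected = errors.length > 0;
      return row;
    }

    var good = validateRow({ birth_date: "2007-06-01", amount: "12.5", gender: "F" });
    var bad  = validateRow({ birth_date: "01/06/2007", amount: "",     gender: "N" });
    ```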

    *Use Database Lookup (with cache enabled) and save the 'database-lookup-field' (i.e. make sure this value exists) to the stream; then read that field with ModJS and again customize your field warnings, field errors, etc.
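    The follow-up check on the saved lookup field can be very small. A sketch of the idea (field names are made up; the premise is that the lookup step writes null into the stream when no match is found):

    ```javascript
    // Sketch: after a Database Lookup step writes the looked-up key back
    // into the stream, a ModJS check can flag rows where the lookup missed
    // (assumed here to show up as null). Field names are illustrative.

    function flagMissingLookup(row) {
      if (row.lookup_customer_id === null || row.lookup_customer_id === undefined) {
        row.field_errors = "customer_id not found in customer table";
        row.rejected = true;
      } else {
        row.field_errors = "";
        row.rejected = false;
      }
      return row;
    }

    var hit  = flagMissingLookup({ customer_id: "C42", lookup_customer_id: "C42" });
    var miss = flagMissingLookup({ customer_id: "C99", lookup_customer_id: null });
    ```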

    *Dimension Lookup can help identify records that already exist in a database. Sort and Group steps with the line-number field enabled can detect duplicates within the stream; mark them with ModJS.
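    The sort-then-number trick for in-stream duplicates can be mimicked like this (a sketch only; in Kettle the sorting and numbering would be done by the Sort rows and Group by steps, with ModJS just reading the line-number field):

    ```javascript
    // Sketch of in-stream duplicate detection: sort rows by key, number
    // the rows within each key group, and flag any row whose line number
    // within its group is greater than 1 as a duplicate.

    function flagDuplicates(rows, keyField) {
      var sorted = rows.slice().sort(function (a, b) {
        return a[keyField] < b[keyField] ? -1 : a[keyField] > b[keyField] ? 1 : 0;
      });
      var lineNo = 0, prevKey = null;
      sorted.forEach(function (row) {
        lineNo = (row[keyField] === prevKey) ? lineNo + 1 : 1;
        prevKey = row[keyField];
        row.line_nr = lineNo;       // what the line-number field gives you
        row.duplicate = lineNo > 1; // mark duplicates for later handling
      });
      return sorted;
    }

    var out = flagDuplicates([{ id: "A" }, { id: "B" }, { id: "A" }], "id");
    ```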

    *If you have existing 'data cleaning' functionality in Java code, you can call it from ModJS through the Packages.* feature and re-use your existing (or more complicated) cleaning/validation checks.

    Hope this helps,
    -D

  3. #3
    Join Date
    Jun 2007
    Posts
    112


    Thanks D. That's exactly the kind of practical advice I was looking for.


Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.