Hitachi Vantara Pentaho Community Forums
Results 1 to 14 of 14

Thread: Difficulty with data validation - please help?

  1. #1

    Question Difficulty with data validation - please help?

    Hello all!

    I'm a new/jr DBA and I am working on a quick project that I'm having difficulty with. I want to add a step in some of our existing transformations to validate that the input dates are valid and in range (e.g. 07/25/2009 is fine but 07/25/2100 is not). I was looking around on the 'net and found that I can use Regex with the Data Validation step for this aspect so I created the following expression (for dates between 1900 and 2099):
    ^((((0?[13578])|(1[02]))[- \/ .]?(([0-2][0-9])|(3[01])))|(((0?[469])|(11))[- \/ .]?(([0-2][0-9])|(30)))|(0?2[- \/ .]?[0-2][0-9]))[- \/ .]?((19|20)?\d{2})$

    (I tested the expression itself on http://rubular.com/ and it seems to be doing what I need.) Unfortunately, the expression does not fail for any of the Excel dates - even ones that I made to specifically be out of range to ensure a failure. The dates are in the Excel spreadsheet in a numeric mm/dd/yyyy format. For the Excel input, I had it import the fields from the header row and Pentaho did correctly determine the data type for each field. I've attached screenshots from the Data Validation and Excel Input steps. the Data Validation step goes to a "fail" Dummy step and a "pass" Dummy Step.

    Pentaho version is 4.2.1-GA. Files are xlsx files (and this is indicated in the Excel input step, although I did not take a screenshot of that).
    Attached Images Attached Images   
    Last edited by qanx; 06-13-2013 at 12:17 PM. Reason: typo

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Bring a test transformation and we'll make it behave.
    So long, and thanks for all the fish.

  3. #3

    Default

    I made a smaller XLS file that only has two columns (the date and an ID field). Both the KTR and XLS files are attached. I removed the XLS path from the Excel input because I figured you guys would save the XLS wherever. Thank you!
    Attached Files Attached Files
    Last edited by qanx; 06-13-2013 at 12:58 PM. Reason: typo

  4. #4
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Why not use the Max and Min values for your validation?

    I created a validate step as follows (XML pulled from KTR)
    Code:
        <name>Data Validator</name>
        <type>Validator</type>
        <description/>
        <distribute>N</distribute>
        <copies>1</copies>
             <partitioning>
               <method>none</method>
               <schema_name/>
               </partitioning>
    <validate_all>N</validate_all>
    <concat_errors>N</concat_errors>
    <concat_separator>|</concat_separator>
           <validator_field><name>SCANDATE</name>
    <validation_name>Scan Date Range</validation_name>
    <max_length/>
    <min_length/>
    <null_allowed>N</null_allowed>
    <only_null_allowed>N</only_null_allowed>
    <only_numeric_allowed>N</only_numeric_allowed>
    <data_type>Date</data_type>
    <data_type_verified>N</data_type_verified>
    <conversion_mask>MM&#47;dd&#47;yyyy</conversion_mask>
    <decimal_symbol/>
    <grouping_symbol/>
    <max_value>12&#47;31&#47;2099</max_value>
    <min_value>01&#47;01&#47;1950</min_value>
    <start_string/>
    <end_string/>
    <start_string_not_allowed/>
    <end_string_not_allowed/>
    <regular_expression/>
    <regular_expression_not_allowed/>
    <error_code/>
    <error_description/>
    <is_sourcing_values>N</is_sourcing_values>
    <sourcing_step/>
    <sourcing_field/>
    <allowed_value></allowed_value></validator_field>
         <cluster_schema/>
     <remotesteps>   <input>   </input>   <output>   </output> </remotesteps>    <GUI>
          <xloc>161</xloc>
          <yloc>50</yloc>
          <draw>Y</draw>
          </GUI>
    **THIS IS A SIGNATURE - IT GETS POSTED ON (ALMOST) EVERY POST**
    I'm no expert.
    Take my comments at your own risk.

    PDI user since PDI 3.1
    PDI on Windows 7 & Linux

    Please keep in mind (and this may not apply to this thread):
    No forum member is going to do your work for you. We will help you sort out how to do a specific part of the work, as best we can, in the timelines that our work will allow us.
    Signature Updated: 2014-06-30

  5. #5
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Open your Excel Input step on page "Fields" and enter format MM/dd/yyyy for your date field.

    But why don't you just use the maximum and minimum values instead of that regex?


    PS: Hi gutlez, didn't see you coming ...
    Last edited by marabu; 06-13-2013 at 01:13 PM.
    So long, and thanks for all the fish.

  6. #6
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    marabu ... great minds and all that
    **THIS IS A SIGNATURE - IT GETS POSTED ON (ALMOST) EVERY POST**
    I'm no expert.
    Take my comments at your own risk.

    PDI user since PDI 3.1
    PDI on Windows 7 & Linux

    Please keep in mind (and this may not apply to this thread):
    No forum member is going to do your work for you. We will help you sort out how to do a specific part of the work, as best we can, in the timelines that our work will allow us.
    Signature Updated: 2014-06-30

  7. #7

    Default

    Thank you for the reply! My concern is partially ensuring that dates are valid as well - for example if someone enters 02/30/1990 it should still fail as an invalid date even if the year is "in bounds". If I use the min/max value, will it check that the date is valid? Also, when I enter the max/min dates as 12/31/2099 and 01/01/1900 I receive the following error when I try to run it:
    A data conversion error was encountered while doing validation
    SCAN DATE String : couldn't convert string [01/01/1900] to a date using format [yyyy/MM/dd HH:mm:ss.SSS]
    Unparseable date: "01/01/1900"

    (There is a similar error for the upper bound.) Should I be entering the date in a different format (even though the date is MM/dd/yyyy)? I tried entering 1900/01/01 well but encountered the same error.

    (Btw I do have the date format as MM/dd/yyyy in the Fields tab of the Excel input, I think I just didn't save that when I saved the KTR file.)

  8. #8
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    The excel step will fail if you try to enter a date of "02/30/1990" as that date doesn't exist. No rows after it will be processed. After that you then have to actually convert the row.

    Fill in your data conversion mask (it's not in an obvious location unfortunately) on the Data Validator step, and it will take the values in MM/dd/yyyy format.
    Attached Files Attached Files
    Last edited by gutlez; 06-13-2013 at 02:31 PM.
    **THIS IS A SIGNATURE - IT GETS POSTED ON (ALMOST) EVERY POST**
    I'm no expert.
    Take my comments at your own risk.

    PDI user since PDI 3.1
    PDI on Windows 7 & Linux

    Please keep in mind (and this may not apply to this thread):
    No forum member is going to do your work for you. We will help you sort out how to do a specific part of the work, as best we can, in the timelines that our work will allow us.
    Signature Updated: 2014-06-30

  9. #9

    Default

    Thank you!

    For the invalid date that Excel makes a string (02/30/1990) - is there a way to have it continue past the error and move to the next row of data? Or is it the type of error that Pentaho will always stop at?

    A couple more questions:
    - How sensitive is Pentaho to the dividers that are used in the date? For example, if another similar file had the date as 02-02-2009 instead of 02/02/2009 would I need to update the transformation accordingly or will the program recognize any dividers (/, ., -, etc.)?
    - How do I indicate a text month instead of a numerical one? (e.g. Feb-02-2009 instead of 02-02-2009)

  10. #10
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by qanx View Post
    Thank you!

    For the invalid date that Excel makes a string (02/30/1990) - is there a way to have it continue past the error and move to the next row of data? Or is it the type of error that Pentaho will always stop at?
    If you don't handle it in your design, it will always stop the transformation when it finds invalid data and not continue. You should design your transformation to handle it. You can select the "Ignore Errors" and "Skip Error Rows" on the Error Handling tab of the Excel Input, so it will allow you to input the rest of the file. See my prior sample file for an example of ONE way to handle the error rows (there are many ways to handle them)

    Quote Originally Posted by qanx View Post
    A couple more questions:
    - How sensitive is Pentaho to the dividers that are used in the date? For example, if another similar file had the date as 02-02-2009 instead of 02/02/2009 would I need to update the transformation accordingly or will the program recognize any dividers (/, ., -, etc.)?
    VERY sensitive. It will fail to process the data, since it can't find the correct separators. You could file a Jira request (http://jira.pentaho.com/browse/PDI) to ask for a separator wildcard that could be used to allow different formats... But if the format is changing, how do you know if it is Month Day Year, or Day Month Year?

    Quote Originally Posted by qanx View Post
    - How do I indicate a text month instead of a numerical one? (e.g. Feb-02-2009 instead of 02-02-2009)
    Use a conversion mask of MMM-dd-yyyy -- a word of caution, it will use the LOCAL month format, which has caused issues for users in the past. (Feb vs Fev vs 二月 )
    See http://docs.oracle.com/javase/1.4.2/...ateFormat.html for examples.
    Last edited by gutlez; 06-13-2013 at 02:55 PM.
    **THIS IS A SIGNATURE - IT GETS POSTED ON (ALMOST) EVERY POST**
    I'm no expert.
    Take my comments at your own risk.

    PDI user since PDI 3.1
    PDI on Windows 7 & Linux

    Please keep in mind (and this may not apply to this thread):
    No forum member is going to do your work for you. We will help you sort out how to do a specific part of the work, as best we can, in the timelines that our work will allow us.
    Signature Updated: 2014-06-30

  11. #11

    Default

    Thank you!

    Quote Originally Posted by gutlez View Post
    Use a conversion mask of MMM-dd-yyyy -- a word of caution, it will use the LOCAL month format, which has caused issues for users in the past. (Feb vs Fev vs 二月 )
    See http://docs.oracle.com/javase/1.4.2/...ateFormat.html for examples.
    This seems very similar to the Excel formatting - can I assume that if I need the full month (e.g. September instead of Sep) at a later point I would use MMMM?

    Thank you again!

  12. #12
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    According to the link I attached, yes, MMMM will expect a full month name.
    I will reiterate the issue of local language however, as you may provide septiembre (September in Spanish), and have it rejected if your system expects English.
    **THIS IS A SIGNATURE - IT GETS POSTED ON (ALMOST) EVERY POST**
    I'm no expert.
    Take my comments at your own risk.

    PDI user since PDI 3.1
    PDI on Windows 7 & Linux

    Please keep in mind (and this may not apply to this thread):
    No forum member is going to do your work for you. We will help you sort out how to do a specific part of the work, as best we can, in the timelines that our work will allow us.
    Signature Updated: 2014-06-30

  13. #13

    Default

    Sorry I missed the link for some reason - thank you very, very much! I really appreciate all the help

    (PS Thank you for the note about the language for the months - right now the system that we have is English only, but the data we're using is also English only so I don't foresee that being a problem right now. That said, I will keep it in mind in case something changes.)

  14. #14
    Join Date
    Jul 2009
    Posts
    476

    Default

    If the dates in your Excel files are truly atrocious, you might consider loading them from Excel as Strings instead of Dates, and use a JavaScript step (Modified JavaScript Value) to parse the dates from their various formats and flag the invalid ones, and proceed from there. I'm not saying that's the best choice, but it's an option.

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.