Hitachi Vantara Pentaho Community Forums
Results 1 to 10 of 10

Thread: Date Format Auto Detection

  1. #1
    Join Date
    Feb 2014
    Posts
    11

    Default Date Format Auto Detection

    I have a date column (Text file input) whose format is different for each run. Any way auto detection of date format is possible?
    Have seen Matt Casters mention it in a jira case in comments section. Have included the link as well
    http://jira.pentaho.com/browse/PDI-10477
    Or is there a requirement to automatically detect which date format a date is in and then pick the right one out of 2 or more formats?
    This is exactly what i am trying to accomplish.
    Thanks in Advance. I am using PDI-5.0.1

  2. #2
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    They will likely be unable to do that until such time as you can definitively tell me what date this is:
    02/03/04
    or
    12/10/06
    or
    ...
    As long as days, months, and years can overlap, there is no way to know for certain if a date part is a month, a day, or a year.

  3. #3
    Join Date
    Feb 2014
    Posts
    11

    Default

    As long as days, months, and years can overlap, there is no way to know for certain if a date part is a month, a day, or a year.
    True true.. Not even when most dates are clear and its only few records that are ambigous??
    For example maybe my data starts as
    10/12/2006
    11/12/2006
    and then most dates are of clearly discernible format?
    11/25/2006
    11/27/2006
    11/28/2006

    Thanks for your time gutlez.

  4. #4
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    If you have well distributed date values, you might succeed with a separate transformation, the sole purpose being to determine the date format used in a specific rowset according to some simple rules.
    If the value range for a datepart is 1 to 12 you can conclude it's the month, ... you get the picture.
    It all depends on size and distribution of the sample you analyze.
    You never will have absolute certainty as gutlez said.
    As soon as you decided on a date format you can use metadata injection or scripting to adjust your transformation.
    With proper data contracts your problem shouldn't exist in the first place...
    So long, and thanks for all the fish.

  5. #5

    Default

    Don't know if this helps in your case (and don't know either how this functionality is implemented) but there's the option "Date Format Lenient?" in the "Meta-data" tab in the "Select values" step. Works for me as we expect the date to be YYYYMMDD but one customer sends it in as YYYY-MM-DD (CSV-file).

  6. #6
    Join Date
    Feb 2014
    Posts
    11

    Default

    Quote Originally Posted by marabu View Post
    If the value range for a datepart is 1 to 12 you can conclude it's the month
    Yes marabu, I am indeed working on this logic only. Currently using validator to check the values and if all the values pass through the validator(min value-01, max value-12) its a month column. First i need to ascertain if all rows have passed the step. I have to do this using tow steps. Output Step metrics and fliter rows.
    The issue is when I include output step metrics in the main flow, i lose the data i need and instead get metric details (step id, lines read, lines written, time duration)alone. So have to keep it in a separate flow and use the filter row result. (Filter row compares metrics of data validation step to ascertain if input and output row numbers are same indicating validation success). Now the filter rows result gives me whether the column is day or month but now the issue is both are in different flows and i have to resort to setting variables and getting them in a separate transformation. Hope it works.

  7. #7
    Join Date
    Feb 2014
    Posts
    11

    Default

    Quote Originally Posted by d.hommrich View Post
    Don't know if this helps in your case (and don't know either how this functionality is implemented) but there's the option "Date Format Lenient?" in the "Meta-data" tab in the "Select values" step. Works for me as we expect the date to be YYYYMMDD but one customer sends it in as YYYY-MM-DD (CSV-file).
    Yeah hommrich, Thanks for the tip. But the issue here is same transformation for dealing with different library generated sourcefiles, each of which have different formats as in yyyymmdd or yyyyddmm.Thanks again.

  8. #8
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Quote Originally Posted by Prashan View Post
    Hope it works.
    I thought of a much simpler approach: After splitting a date into three parts, I would use a Group By step to determine the value range for each part (min / max) in the sample.
    So long, and thanks for all the fish.

  9. #9
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    I deal with user supplied data most often... Thus I don't trust the formats when they give them to me.

    Too often I've seen two rows (one after another) that look like:
    12/15/2001
    22/01/2002

    If you are dealing with library output, then you might be able to guess it in the way that Marabu suggests.
    **THIS IS A SIGNATURE - IT GETS POSTED ON (ALMOST) EVERY POST**
    I'm no expert.
    Take my comments at your own risk.

    PDI user since PDI 3.1
    PDI on Windows 7 & Linux

    Please keep in mind (and this may not apply to this thread):
    No forum member is going to do your work for you. We will help you sort out how to do a specific part of the work, as best we can, in the timelines that our work will allow us.
    Signature Updated: 2014-06-30

  10. #10
    Join Date
    Feb 2014
    Posts
    11

    Default

    Quote Originally Posted by gutlez View Post

    If you are dealing with library output, then you might be able to guess it in the way that Marabu suggests.
    Thumbs up gutlez and marabu... I had changed tack since i last posted here, but the method you guys suggested is even more simpler. Thanks a ton for the help guys!

Tags for this Thread

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.