Hitachi Vantara Pentaho Community Forums

Thread: Parsing an invalid date or number value in a Text File Input step doesn't fail!!!

  1. #1
    Join Date
    Apr 2011
    Posts
    21

    Exclamation Parsing an invalid date or number value in a Text File Input step doesn't fail!!!

    Hi everybody,

    I’m having big trouble when testing with bad data. For example, I have a text file that looks like this:

    RECORD_DAY; RATIO_ID; RATIO_INDEX; RATIO_NAME; VALUE
    30.06.2011; 1210aaa007420; 515P_I; Some name; 1234.55
    31.22.2011; 1223454545; 515P_I; Some other name; 123.98

    The fields are defined as follows:

    RECORD_DAY(Date, dd.MM.yyyy)
    RATIO_ID(Number(10,5), #)
    RATIO_INDEX(String(10))
    RATIO_NAME(String(100))
    VALUE(Number(15,5))

    When I run the transformation with this data, I expect it to abort. But that isn't the case! The transformation runs successfully, all the data is processed, and the result in the database is completely wrong!!!

    The invalid date 31.22.2011 is transformed to 31.10.2012, and the invalid number value is cut off at the first non-numeric character and inserted as 1210 in the table. Why??? This cannot be the desired result! It should produce a type conversion error. Otherwise there is no way to do any error handling, because the Ignore errors option is of no use here.


    Using a CSV file input step isn't an option either. With that step the invalid values are replaced with NULLs, and turning on error handling doesn't help; the bad data is processed anyway!

    Does anyone have some experience with any of these issues?

    Kind regards
    Attached Files

  2. #2
    Join Date
    Apr 2008
    Posts
    1,771

    Default

    Hi.
    One solution would be to import the data as text and do some filtering.
    Once the bad records are removed, you can use a Select values step to convert the strings into dates/numbers.
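    A minimal plain-Java sketch of that idea (illustrative only - the class and method names are made up, and inside PDI you would of course use Filter rows / Select values steps instead of hand-written code):
    Code:
    import java.text.ParseException;
    import java.text.SimpleDateFormat;

    public class FilterBeforeConvert {

        // Keep every field as a string, reject rows that would not convert cleanly.
        static boolean isCleanRow(String recordDay, String ratioId) {
            SimpleDateFormat df = new SimpleDateFormat("dd.MM.yyyy");
            df.setLenient(false);            // strict: 31.22.2011 must fail
            try {
                df.parse(recordDay);
            } catch (ParseException e) {
                return false;                // bad date -> divert or drop the row
            }
            return ratioId.matches("\\d+");  // reject 1210aaa007420
        }

        public static void main(String[] args) {
            System.out.println(isCleanRow("30.06.2011", "1210aaa007420")); // false
            System.out.println(isCleanRow("31.22.2011", "1223454545"));    // false
            System.out.println(isCleanRow("30.06.2011", "1223454545"));    // true
        }
    }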

    Mick

  3. #3
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    Nothing to get worked up about as there is an option in the text file input step to enable or disable strict date parsing.

  4. #4
    Join Date
    Apr 2011
    Posts
    21

    Default

    Strict date parsing is enabled, but the invalid date value still gets parsed and converted to a valid date! And even if this option worked, there would still be the problem with the numeric fields!

  5. #5
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    When you have Ignore Errors checked in your Text File input, it will... Ignore the errors, and keep going.
    If you uncheck Ignore Errors, the transformation will abort on the invalid date.

    The rest is all misconfiguration of Data Validation.
    Last edited by gutlez; 10-28-2011 at 07:25 PM.

  6. #6
    Join Date
    Apr 2011
    Posts
    21

    Default

    Dear gutlez,

    > When you have Ignore Errors checked in your Text File input, it will... Ignore the errors, and keep going.
    > If you uncheck Ignore Errors, the transformation will abort on the invalid date.

    Yes, the Ignore Errors option is checked since I want to check in the next step if a read error has occurred!
    And no, if I uncheck this option, the transformation doesn't abort!

    > The rest is all misconfiguration of Data Validation.

    Then please be so kind as to let me know how to configure the step correctly, so that the two errors I'm talking about get caught?
    Did you even try running this transformation with the provided data?

    Kind regards

  7. #7
    Join Date
    Apr 2011
    Posts
    21

  8. #8
    Join Date
    Apr 2011
    Posts
    21

    Default

    By the way ... the data preview does behave correctly and throws a conversion exception:

    [...]
    Caused by: org.pentaho.di.core.exception.KettleValueException:
    RECORD_DAY String : couldn't convert string [31.22.2011] to a date using format [dd.MM.yyyy]
    Unparseable date: "31.22.2011"

    at org.pentaho.di.core.row.ValueMeta.convertStringToDate(ValueMeta.java:609)
    at org.pentaho.di.core.row.ValueMeta.getDate(ValueMeta.java:1660)
    at org.pentaho.di.core.row.ValueMeta.convertData(ValueMeta.java:3061)
    at org.pentaho.di.core.row.ValueMeta.convertDataFromString(ValueMeta.java:3256)
    at org.pentaho.di.trans.steps.textfileinput.TextFileInput.convertLineToRow(TextFileInput.java:679)
    ... 3 more
    Caused by: java.text.ParseException: Unparseable date: "31.22.2011"
    at java.text.DateFormat.parse(DateFormat.java:337)
    at org.pentaho.di.core.row.ValueMeta.convertStringToDate(ValueMeta.java:604)
    ... 7 more
    [...]

  9. #9
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by DGeorgieva View Post
    Then please be so kind as to let me know how to configure the step correctly, so that the two errors I'm talking about get caught?
    Change Text File Input:
    KENNZAHLEN_ID as String

    Change Validator:
    Remove Error Code
    Remove Max String Length
    Check box Only Numeric Data Expected
    Change Field to check to KENNZAHLEN_ID

    Define the fields to capture the reason for the error in your error handling.

    That will send your 1210aaa007420 row to the Write to log step.

    At this point the Data Validator doesn't check that dates are valid dates, but that discussion is taking place on your JIRA entry.

    If you turn the Ignore Errors off, then the Text File Input will abort the transform on the bad date.
    If you **REALLY** want to do checking later, you can input your date as a string, then convert it to a date (with a Select values step), and define error handling on the Select values step.
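    In plain Java terms, that "read as string, convert later, handle the error" pattern amounts to something like the following (illustrative only - the class name is made up, setLenient(false) stands in for strict date parsing, and the two branches stand in for the main and error hops):
    Code:
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class ConvertWithErrorHandling {
        public static void main(String[] args) {
            SimpleDateFormat df = new SimpleDateFormat("dd.MM.yyyy");
            df.setLenient(false); // corresponds to unchecking "Be lenient when parsing dates?"

            for (String raw : new String[] {"30.06.2011", "31.22.2011"}) {
                try {
                    Date d = df.parse(raw);
                    System.out.println("main stream:  " + d);
                } catch (ParseException e) {
                    System.out.println("error stream: " + raw); // 31.22.2011 ends up here
                }
            }
        }
    }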
    Last edited by gutlez; 11-01-2011 at 03:06 PM.

  10. #10
    Join Date
    Nov 2008
    Posts
    777

    Default

    Quote Originally Posted by DGeorgieva View Post
    The invalid date 31.22.2011 is transformed to 31.10.2012
    Since there is no month 22, it is being rolled ahead 10 months past the highest valid month of 12 (yes, into the next year!). This is consistent with how most databases handle this situation. Try it. I was burned by this before I knew. The easiest way I know of to catch this is with a regex filter. Something like this:
    Code:
    (0[1-9]|[1-2][0-9]|3[0-1])\.(0[1-9]|1[0-2])\.20[0-9]{2}
    It catches most things but not everything. Dates like 31-NOV-2011 would pass as well as 29-FEB whether it is a leap year or not.
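    To make the rollover concrete, here is a small illustrative Java snippet (class name made up) showing the lenient behaviour, the regex above used as a pre-filter, and strict parsing rejecting what the regex lets through:
    Code:
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import java.util.regex.Pattern;

    public class LenientDateDemo {
        public static void main(String[] args) throws ParseException {
            SimpleDateFormat df = new SimpleDateFormat("dd.MM.yyyy");

            // Lenient (the default): month 22 rolls 10 months past December.
            System.out.println(df.format(df.parse("31.22.2011")));  // 31.10.2012

            // The regex above used as a pre-filter.
            Pattern p = Pattern.compile("(0[1-9]|[1-2][0-9]|3[0-1])\\.(0[1-9]|1[0-2])\\.20[0-9]{2}");
            System.out.println(p.matcher("31.22.2011").matches());  // false - caught
            System.out.println(p.matcher("29.02.2011").matches());  // true  - slips through

            // Strict parsing catches what the regex misses.
            df.setLenient(false);
            try {
                df.parse("29.02.2011");
            } catch (ParseException e) {
                System.out.println("rejected: " + e.getMessage()); // Unparseable date: "29.02.2011"
            }
        }
    }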
    Last edited by darrell.nelson; 11-01-2011 at 04:52 PM.
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  11. #11
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    If you use the Select values method (or let the Text File Input do it), it will abort on other things as well, including times that don't exist (see the 2 AM March 31 threads), almost-valid dates (31-Nov) and leap days in non-leap years (29-Feb-2011).

    This can cause trouble as well, e.g. an incoming date/time in UTC being processed in US/Eastern.

  12. #12
    Join Date
    Apr 2011
    Posts
    21

    Default

    Dear gutlez,

    thank you for your help!

    Quote Originally Posted by gutlez View Post
    Change Text File Input:
    KENNZAHLEN_ID as String

    Change Validator:
    Remove Error Code
    Remove Max String Length
    Check box Only Numeric Data Expected
    Change Field to check to KENNZAHLEN_ID

    Define the fields to capture the reason for the error in your error handling.

    That will send your 1210aaa007420 to the write to log.

    I don't know which version of PDI you have, but in PDI 4.2.0-GA this workaround doesn't work either. If I follow your suggestion, every single record gets output to the log with the following error description:

    2011/11/02 11:06:53 - Write to log.0 - ====================
    2011/11/02 11:06:53 - Write to log.0 -
    2011/11/02 11:06:53 - Write to log.0 - ------------> Linenr 3------------------------------
    2011/11/02 11:06:53 - Write to log.0 - RECORD_DAY = 30.06.2011
    2011/11/02 11:06:53 - Write to log.0 - RATION_ID = 1211155510
    2011/11/02 11:06:53 - Write to log.0 - RATION_INDEX = 555P_I
    2011/11/02 11:06:53 - Write to log.0 - RATION_NAME = Benefit Insurance
    2011/11/02 11:06:53 - Write to log.0 - WERT = 0000022222,33333
    2011/11/02 11:06:53 - Write to log.0 - ERROR_DESCRIPTION =
    2011/11/02 11:06:53 - Write to log.0 - During validation of field 'RATIO_ID' we found that its data type [String(10)] is different from the expected [Number].
    2011/11/02 11:06:53 - Write to log.0 -
    2011/11/02 11:06:53 - Write to log.0 - ERROR_FIELD = KENNZAHLEN_ID
    2011/11/02 11:06:53 - Write to log.0 - ERROR_CODE = KVD004
    2011/11/02 11:06:53 - Write to log.0 -
    2011/11/02 11:06:53 - Write to log.0 - ====================

    Quote Originally Posted by gutlez View Post
    If you turn the Ignore Errors off, then the Text File Input will abort the transform on the bad date.
    It doesn't matter whether the Ignore Errors option in the Text File Input is on or off; even if it's off, the transformation still does not abort, which is exactly the problem!

    Quote Originally Posted by gutlez View Post
    If you **REALLY** want to do checking later, you can input your date as a string, then convert it to a date (with a Select values step), and define error handling on the Select values step.
    Yes, this works fine for the date value, but not for the numeric value. This step behaves the same way as the Text File Input - the number field gets cut off at the first non-numeric character.

    Kind regards

  13. #13
    Join Date
    Nov 1999
    Posts
    9,729

    Default

    The number parsing "problem" is actually a Java one. We're building extra code in 4.3 to catch the situation.
    It's not an error per se, since by definition Java parses numeric data until there are no more valid numeric symbols.
    Obviously, since there are folks relying on this behavior, we need to make the additional checks optional.
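    For anyone curious what that looks like at the java.text level, here is an illustrative snippet (class name made up) showing the partial parse and one way to detect it yourself:
    Code:
    import java.text.DecimalFormat;
    import java.text.ParsePosition;

    public class PartialNumberParse {
        public static void main(String[] args) {
            DecimalFormat nf = new DecimalFormat("#");
            String value = "1210aaa007420";

            // java.text stops at the first character it cannot parse and
            // returns what it has so far - no exception is thrown.
            System.out.println(nf.parse(value, new ParsePosition(0)));  // 1210

            // Detecting the trailing garbage: check that the whole string was consumed.
            ParsePosition pos = new ParsePosition(0);
            Number n = nf.parse(value, pos);
            if (n == null || pos.getIndex() < value.length()) {
                System.out.println("invalid number: " + value);        // printed
            }
        }
    }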

  14. #14
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by DGeorgieva View Post
    2011/11/02 11:06:53 - Write to log.0 - ERROR_DESCRIPTION =
    2011/11/02 11:06:53 - Write to log.0 - During validation of field 'RATIO_ID' we found that its data type [String(10)] is different from the expected [Number].
    This indicates that you have the field type checking turned on, which I don't recall seeing in your earlier version.
    If you turn it off, this error will go away.

    If you have text in a number field, you have bad data. While you might expect the behavior of just dropping the letters, others might not. Since you have bad data, you have to design a way of handling it. To get what you want, you could validate on Only Numeric Data Expected, then use a Replace in string step to get rid of the letters, and return the cleaned value to your data stream.
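    If you do decide to strip the letters, the replacement itself is just a regular expression - roughly like this (illustrative only, class name made up):
    Code:
    public class StripLetters {
        public static void main(String[] args) {
            String raw = "1210aaa007420";
            // Keep digits and numeric punctuation, drop everything else -
            // roughly what a "Replace in string" step with a regex would do.
            String cleaned = raw.replaceAll("[^0-9.,+-]", "");
            System.out.println(cleaned); // 1210007420
        }
    }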

    Quote Originally Posted by DGeorgieva View Post
    It doesn't matter whether the Ignore Errors option in the Text File Input is on or off; even if it's off, the transformation still does not abort, which is exactly the problem!
    I can't help you there. I'm using 4.2.0-GA, and it aborts the run if there's bad data in my text input. There is the possibility of data reaching the output file if the bad data is late enough in the stream. That is expected behavior in PDI
    Last edited by gutlez; 11-02-2011 at 11:44 AM.

  15. #15
    Join Date
    Apr 2011
    Posts
    21

    Default

    Hi gutlez,

    Quote Originally Posted by gutlez View Post
    This indicates that you have the field type checking turned on, which I don't recall seeing in your earlier version.
    If you turn it off, this error will go away.
    You are right, I had turned on the type checking option instead of Only Numeric Data Expected ... the bad data record gets transferred to the log file this way.

    We are going to extend the jobs with some data cleansing and do exactly what you suggest, but the first step was actually to explore what happens when bad data comes in. We expected that a transformation would abort if the metadata doesn't match the input data. This is how many other ETL tools react in such a case.



    Quote Originally Posted by gutlez View Post
    I can't help you there. I'm using 4.2.0-GA, and it aborts the run if there's bad data in my text input. There is the possibility of data reaching the output file if the bad data is late enough in the stream. That is expected behavior in PDI


    And about the invalid date issue... I finally found out where the problem is - it is the Enterprise Repository we use! I have no clue why, but when I disconnect from the ER and run the transformation locally, it works just fine - it aborts on the invalid date as expected. But when I start the transformation from the ER, for some reason it doesn't abort, as described in my previous posts.

    This is very very strange ...
    Last edited by DGeorgieva; 11-03-2011 at 04:53 AM.

  16. #16
    Join Date
    Apr 2011
    Posts
    21

    Default

    The option "Be lenient when parsing dates?" cannot be deactivated when using Enterprise Repository!!!

    Every time I check out a transformation from the ER, this option is on. After I deactivate it, save, and reopen the transformation, the option is activated again.
