Hitachi Vantara Pentaho Community Forums

Thread: Error in GetXMLData.handleMissingFiles

  1. #1

    Default Error in GetXMLData.handleMissingFiles

    Hello-

    I am getting this very strange error when processing XML data from one of our suppliers. My transformation is very simple: it has one Get Data from XML input step connected to a Table Output step.

    When I run it via a serialized job with Kitchen, it fails with "ERROR... - Required files - WARNING: Missing..." and the PDI log lists all the files in the processing directory, like so:

    Code:
    ERROR 28-08 18:57:10,156 - Required files -
    WARNING: Missing file:///home/pdiadm/DOMI/J_OUT/MEM/JGS-Memphis~AT1~Pre-Test~X-BanhT~JGMDTV387~2~P~1100~HR24-500~033057582042~20120827~085243.XML
    ....
    file:///home/pdiadm/DOMI/J_OUT/MEM/JGS-Memphis~PRE~DTV_PREP~x-LathamJ~JGMDTV268~8~P~1100~H23-600~029365619625~20120825~100218.XML
     
                    at org.pentaho.di.trans.steps.getxmldata.GetXMLData.handleMissingFiles(GetXMLData.java:233)
                    at org.pentaho.di.trans.steps.getxmldata.GetXMLData.processRow(GetXMLData.java:547)
                    at org.pentaho.di.trans.step.BaseStep.runStepThread(BaseStep.java:2889)
                    at org.pentaho.di.trans.steps.getxmldata.GetXMLData.run(GetXMLData.java:832)
     
    INFO  28-08 18:57:10,512 - XML.0 - Finished processing (I=0, O=0, R=0, W=0, U=0, E=1)
    ERROR 28-08 18:57:10,512 - XML_SCAUTOMATION_DIRECT_MEM - Errors detected!
    I tried running the transformation directly from within Spoon. Same error. This is very confusing. What error is Kettle referring to? The files are all there, and the Kettle user ID has read/write permissions on all of them.
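
    A quick standalone check like the following (a minimal Java sketch, run as the same pdiadm user; the hard-coded path is simply the first file from the log above) shows whether the JVM can actually see and read one of the "missing" files:

    Code:
    import java.io.File;

    public class CheckFile {
        public static void main(String[] args) {
            // Example path taken from the log above; pass your own path as an argument.
            String path = args.length > 0 ? args[0]
                : "/home/pdiadm/DOMI/J_OUT/MEM/JGS-Memphis~AT1~Pre-Test~X-BanhT~JGMDTV387~2~P~1100~HR24-500~033057582042~20120827~085243.XML";
            File f = new File(path);
            System.out.println("exists:   " + f.exists());
            System.out.println("readable: " + f.canRead());
            System.out.println("writable: " + f.canWrite());
            System.out.println("length:   " + f.length() + " bytes");
        }
    }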

    Has anyone experienced this? Is there a workaround or a fix?

    PDI: Kettle version 3.2
    OS: Red Hat
    Database: IBM DB2
    Last edited by acbonnemaison; 08-28-2012 at 03:27 PM.
    Pentaho Data Integration CE 5.3.0.x
    JDK 1.7
    OS X Yosemite version 10.10.x
    MySQL 5.5.37
    Amazon Redshift
    Pacific Standard Time

  2. #2
    Join Date
    Nov 2008
    Posts
    777

    Default

    I have not experienced that. Have you tried specifying just the directory and then using a regex filename wildcard like "JGS.*\.XML"? Perhaps it has something to do with all the tilde characters. In the Get Data From XML step, do all the filenames show up when you hit the "Show filename(s)..." button?
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  3. #3

    Default

    Quote Originally Posted by darrell.nelson View Post
    I have not experienced that. Have you tried specifying just the directory and then using a regex filename wildcard like "JGS.*\.XML"? Perhaps it has something to do with all the tilde characters.
    In the Get XML Data step, I specified "(.)+.XML" as my RegExp and pointed it at the directory where my files are.

    In the Get Data From XML step, do all the filenames show up when you hit the "Show filename(s)..." button?
    No, not all of them. I get about 45,000 files to process on a daily basis, so when I click on Show Filename(s) it only shows me about 1,000 of them.

    What's weird is that I have an identical job I created for another supplier and it works without errors. I suspect it may have something to do with the XML contents from this one supplier, but I have no DTD to validate the files against...

  4. #4
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Do you have the box "Required" set to yes?
    Is there anything else accessing / moving the files?

  5. #5
    Join Date
    Nov 2008
    Posts
    777

    Default

    Here are a few things I would try:

    1. Try running it with the most recent stable version of Spoon. There have been a lot of bug fixes since 3.2.

    2. Since you suspect the XML contents, try separating the file input from the XML decomposition. You could use a Load File Content in Memory step (it's in version 4.2.x, I know) to read the files without trying to decode the XML. That way you would be able to determine whether the file handling itself works. If you can get that to work, then you could use the Get Data from XML step to read the XML from the stream instead of from files. Make sense?
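
    If you want to narrow it down outside of PDI first, here is a minimal standalone Java sketch (not a PDI step; it only uses the JDK's built-in JAXP parser, and the default directory is just the one from your log) that walks a directory and reports which .XML files are not even well-formed:

    Code:
    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;

    public class FindBadXml {
        public static void main(String[] args) throws Exception {
            // Directory to scan; defaults to the one from the log in post #1.
            File dir = new File(args.length > 0 ? args[0] : "/home/pdiadm/DOMI/J_OUT/MEM");
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            File[] files = dir.listFiles();
            if (files == null) {
                System.out.println("Not a readable directory: " + dir);
                return;
            }
            for (File f : files) {
                if (!f.getName().toUpperCase().endsWith(".XML")) continue;
                try {
                    // Well-formedness check only; no DTD or schema required.
                    factory.newDocumentBuilder().parse(f);
                } catch (Exception e) {
                    System.out.println("BAD: " + f.getName() + " -> " + e.getMessage());
                }
            }
        }
    }

    That should tell you quickly whether the problem really is the XML contents or the file handling.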

  6. #6

    Default

    Quote Originally Posted by darrell.nelson View Post
    Here are a few things I would try:

    1. Try running it with the most recent stable version of Spoon. There have been a lot of bug fixes since 3.2.

    2. Since you suspect the XML contents, try separating the file input from the XML decomposition. You could use a Load File Content in Memory step (it's in version 4.2.x, I know) to read the files without trying to decode the XML. That way you would be able to determine whether the file handling itself works. If you can get that to work, then you could use the Get Data from XML step to read the XML from the stream instead of from files. Make sense?
    Upgrading to 4.2 is not an option.

  7. #7

    Default

    Quote Originally Posted by gutlez View Post
    Do you have the box "Required" set to yes?
    It is set to N on all the transformations. Can you please explain why it matters?

    Is there anything else accessing / moving the files?
    No. I have checked and there is nothing.

  8. #8
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by acbonnemaison View Post
    It is set to N on all the transformations. Can you please explain why it matters?
    Because it's coming up in your log files as if it were required...
    Code:
    
    ERROR 28-08 18:57:10,156 - Required files - WARNING: Missing file:///home/pdiadm/DOMI/J_OUT/MEM/JGS-Memphis~AT1~Pre-Test~X-BanhT~JGMDTV387~2~P~1100~HR24-500~033057582042~20120827~085243.XML ....

  9. #9

    Default

    Quote Originally Posted by gutlez View Post
    Because it's coming up in your log files as if it were required...
    Code:
    
    ERROR 28-08 18:57:10,156 - Required files - WARNING: Missing file:///home/pdiadm/DOMI/J_OUT/MEM/JGS-Memphis~AT1~Pre-Test~X-BanhT~JGMDTV387~2~P~1100~HR24-500~033057582042~20120827~085243.XML ....
    Changed to Y. Same error.

    I need to isolate which XML files are causing this in my job and transformation.

    In my transformation, I tried using a GetFileNames input step connected to my GetXMLData input step but it is not working.

    With text files, I can easily connect a GetFileNames input step to a Text Input step and use the "Step to read filenames from" option. However, GetXMLData does not have such an option.

    Is there a way to do that in my transformation or should I do it in my job instead?

  10. #10
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Quote Originally Posted by acbonnemaison View Post
    In the Get XML Data step, I specified "(.)+.XML" as my RegExp and pointed it at the directory where my files are.
    You might want to use ".+\.XML" as your RegExp.
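
    If it helps to see the difference (a minimal sketch, assuming the wildcard is interpreted as a standard Java regular expression), the tildes are plain literal characters; only the unescaped dot is a problem:

    Code:
    public class RegexCheck {
        public static void main(String[] args) {
            String name = "JGS-Memphis~AT1~Pre-Test~X-BanhT~JGMDTV387~2~P~1100"
                        + "~HR24-500~033057582042~20120827~085243.XML";
            // Unescaped dot before XML matches any character, so names without a real
            // ".XML" extension can slip through.
            System.out.println(name.matches("(.)+.XML"));      // true
            System.out.println("fooXML".matches("(.)+.XML"));  // true (probably not intended)
            // Escaped dot: only names that really end in ".XML" match; tildes need no escaping.
            System.out.println(name.matches(".+\.XML"));       // true
            System.out.println("fooXML".matches(".+\.XML"));   // false
        }
    }

    (In Java source the backslash itself must be doubled, i.e. ".+\\.XML" as a string literal.)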

    As far as I can see in the 3.2 sources, you should be able to pipe the filenames in from a "Get File Names" step.
    Attached Files
    So long, and thanks for all the fish.

  11. #11
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by acbonnemaison View Post
    With text files, I can easily connect a GetFileNames input step to a Text Input step and use the "Step to read filenames from" option. However, GetXMLData does not have such an option.
    In my copy of 3.2-CE, the Get XML Data step has a box at the top of the File tab which says "XML source from field".
    Checking the "XML source is defined in a field" box opens the rest of the boxes up for editing.
    The second box is "XML source is a filename".
    The third box is "Read source as URL".
    The fourth box is a picklist, "get XML source from a field" (with a list of the incoming fields).

    You should be able to wire it up from this.

    Another question: is this the only file in your transform when you run it?
    Last edited by gutlez; 08-29-2012 at 01:55 PM.

  12. #12

    Default

    Quote Originally Posted by gutlez View Post
    In my copy of 3.2-CE, the Get XML Data step has a box at the top of the File tab which says "XML source from field".
    Checking the "XML source is defined in a field" box opens the rest of the boxes up for editing.
    The second box is "XML source is a filename".
    The third box is "Read source as URL".
    The fourth box is a picklist, "get XML source from a field" (with a list of the incoming fields).

    You should be able to wire it up from this.
    I did and it worked. Thank you. But it introduced a challenge.

    Suppose I have 10 XML files to process. The GetXMLData step will not be able to process file number 6 because of some error (a missing element in the XML structure - we get a lot of those). How can I move file 6 to a special location, email myself a copy for debugging, and keep processing files 7 to 10?

    I tried to do it in my transformation (transform.ktr) with a move command in the parent job (job.kjb), but it is not working. The bottom branch of my transformation (see the attached screenshot, Screenshot.jpg) never receives the file name to move and email.

    Did I overlook something when setting up the error handling? How can I move a "bad" file and finish processing whatever is left in the queue?

  13. #13
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Because you are handling the errors in the transform, the transform reports back to the job that it ended happily, so your job never goes down the "Transform Failed" path. You can think of this like a try..catch in Java: you don't need to declare the throws clause if you are catching the exception.
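
    A tiny Java illustration of that analogy (just to make the point, not PDI code): because the method catches its own exception, the caller never sees a failure, exactly like your job only seeing the transform's overall success:

    Code:
    public class CatchDemo {
        // The "transform": the error is handled internally, so no throws clause is needed
        // and nothing propagates to the caller.
        static void processFile(String name) {
            try {
                throw new RuntimeException("missing element in " + name);
            } catch (RuntimeException e) {
                System.out.println("handled internally: " + e.getMessage());
            }
        }

        // The "job": from here, processFile() always looks successful.
        public static void main(String[] args) {
            processFile("file6.xml");
            System.out.println("job continues on the success path");
        }
    }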

    You might consider restructuring a little bit:
    Master Job
    - Transform to build list of XML files to process
    - Processor Job (Execute once for each result)
    - - Processor Transform (Start with get rows from result)
    - -a - If Success, move Success
    - -b - If Fail, move Fail
    - -b - Send Mail

    This may process more slowly, but will give you the results you are looking for.
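
    If it helps to see the shape of that per-file loop outside of PDI, here is a minimal Java sketch of the same idea (not a PDI job; the PROCESSED/FAILED directories and the sendMail stub are made up for illustration):

    Code:
    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;

    public class PerFileProcessor {
        public static void main(String[] args) throws IOException {
            Path inbox   = Paths.get("/home/pdiadm/DOMI/J_OUT/MEM");  // directory from post #1
            Path okDir   = Paths.get("/home/pdiadm/DOMI/PROCESSED");  // hypothetical target dirs
            Path failDir = Paths.get("/home/pdiadm/DOMI/FAILED");
            Files.createDirectories(okDir);
            Files.createDirectories(failDir);

            try (DirectoryStream<Path> files = Files.newDirectoryStream(inbox, "*.XML")) {
                for (Path file : files) {
                    try {
                        process(file);  // one file at a time, like the Processor Transform
                        Files.move(file, okDir.resolve(file.getFileName()), StandardCopyOption.REPLACE_EXISTING);
                    } catch (Exception e) {
                        // a bad file is moved aside and reported, and the loop keeps going
                        Files.move(file, failDir.resolve(file.getFileName()), StandardCopyOption.REPLACE_EXISTING);
                        sendMail(file, e);
                    }
                }
            }
        }

        static void process(Path file) throws Exception { /* parse/load the file here */ }

        static void sendMail(Path file, Exception e) {  // stub for the Send Mail job entry
            System.out.println("MAIL: " + file.getFileName() + " failed: " + e.getMessage());
        }
    }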

  14. #14

    Default

    Quote Originally Posted by gutlez View Post
    Because you are handling the errors in the transform, the transform reports back to the job that it ended happily, so your job never goes down the "Transform Failed" path. You can think of this like a try..catch in Java: you don't need to declare the throws clause if you are catching the exception.

    You might consider restructuring a little bit:
    Master Job
    - Transform to build list of XML files to process
    - Processor Job (Execute once for each result)
    - - Processor Transform (Start with get rows from result)
    - -a - If Success, move Success
    - -b - If Fail, move Fail
    - -b - Send Mail

    This may process more slowly, but will give you the results you are looking for.
    Good grief. You're right. I need to pass the file name as an argument from the parent job to the transformation.

    I'll work on this and update the thread. Thank you.
