Hitachi Vantara Pentaho Community Forums

Thread: Debug hanging job in kitchen.

  1. #1
    Join Date
    Aug 2015
    Posts
    16

    Question Debug hanging job in kitchen.

    I have a job which is hung in kitchen. Is there a good way to inspect which steps it is currently on? I'm logging to tables, so I checked those, but I only have a *very* small output thus far despite the fact that the job had gone a lot further than the logs indicate (perhaps the log output hasn't been flushed / committed yet).

    Is there a method to inspect this better? Java stack traces? Logs I'm unaware of? I'm open to any ideas as I'd like to debug this recurring issue.

    Thank you.

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Could be a deadlock somewhere.
    Try to identify the transformation that stalls and run it in Spoon to see the Step Metrics.
    When it halts, you should find an exhausted buffer: watch for a step whose incoming and outgoing row counts differ by exactly the "Nr of rows in rowset" value (Transformation settings, Miscellaneous tab).
    Now find out why that particular step buffer isn't drained.
    I was bitten twice over the years by a poorly designed flow graph sporting a mesh with a common input and a closing Merge-Join.
    I dimly recall it was the rows piling up on one side while the other side ran dry due to some grouping going on there.
    So long, and thanks for all the fish.

  3. #3
    Join Date
    Oct 2013
    Posts
    216

    Default

    Also, you can enable log level "Rowlevel" to debug it.
    -- NITIN --
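    For reference, Kitchen accepts a log level on the command line, so a sketch of running the job at row level might look like this (the .kjb path and log file path are just example placeholders):

    ```shell
    # Run the job at the most verbose log level; -file and -logfile paths
    # here are made-up examples, not real paths from this thread.
    ./kitchen.sh -file=/path/to/your_job.kjb -level=Rowlevel -logfile=/tmp/job_rowlevel.log
    ```

    Keep in mind Rowlevel logging is extremely verbose, so it is best enabled only while reproducing the hang.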

  4. #4
    Join Date
    Aug 2015
    Posts
    16

    Default

    The problem with this is that the issue is intermittent.

    I have the transformation running every minute of every day via CRON. It will run fine for several days, and then it will just hang.

    I have the log being written to a file. The latest hanging job has a log which looks normal up until the point where I have hours of the following:
    2016/01/21 04:32:59 - _send_scheduled_email - Triggering heartbeat signal for _send_scheduled_email at every 10 seconds
    2016/01/21 04:33:09 - send_scheduled_emails_csv - Triggering heartbeat signal for send_scheduled_emails_csv at every 10 seconds
    2016/01/21 04:33:09 - _send_scheduled_email_csv - Triggering heartbeat signal for _send_scheduled_email_csv at every 10 seconds
    2016/01/21 04:33:09 - _send_scheduled_email - Triggering heartbeat signal for _send_scheduled_email at every 10 seconds
    2016/01/21 04:33:19 - send_scheduled_emails_csv - Triggering heartbeat signal for send_scheduled_emails_csv at every 10 seconds
    2016/01/21 04:33:19 - _send_scheduled_email_csv - Triggering heartbeat signal for _send_scheduled_email_csv at every 10 seconds
    2016/01/21 04:33:19 - _send_scheduled_email - Triggering heartbeat signal for _send_scheduled_email at every 10 seconds
    2016/01/21 04:33:29 - send_scheduled_emails_csv - Triggering heartbeat signal for send_scheduled_emails_csv at every 10 seconds
    2016/01/21 04:33:29 - _send_scheduled_email_csv - Triggering heartbeat signal for _send_scheduled_email_csv at every 10 seconds
    2016/01/21 04:33:29 - _send_scheduled_email - Triggering heartbeat signal for _send_scheduled_email at every 10 seconds
    2016/01/21 04:33:39 - send_scheduled_emails_csv - Triggering heartbeat signal for send_scheduled_emails_csv at every 10 seconds
    2016/01/21 04:33:39 - _send_scheduled_email_csv - Triggering heartbeat signal for _send_scheduled_email_csv at every 10 seconds
    2016/01/21 04:33:39 - _send_scheduled_email - Triggering heartbeat signal for _send_scheduled_email at every 10 seconds

    I cannot seem to find any exhausted buffers, and in fact the majority of my transformations show 0 rows input/output, because in this situation there were no records to process from my Table Input.

    Any other techniques would be appreciated. At this point, I've had to build a job just to report stuck jobs.

  5. #5
    Join Date
    Aug 2011
    Posts
    360

    Default

    This log seems quite explicit: it is sending a heartbeat signal to something every 10 seconds, as if waiting for some answer.
    So... which step or job entry is logging these lines? What is the job doing at this point?

  6. #6
    Join Date
    Aug 2015
    Posts
    16

    Default

    This logging does not originate from anything I added; it is PDI's core logic sending these heartbeats. At this point, my job *appears* to have finished, but it is hung on something that is not letting it complete. I do not know how to easily figure out what is causing the hang. I've spent well over an hour digging through the logs, but they are incredibly verbose, which makes them very difficult to read, particularly when you don't know what to look for.

  7. #7
    Join Date
    Aug 2011
    Posts
    360

    Default

    Which version are you using? I've never seen this before.

  8. #8
    Join Date
    Aug 2015
    Posts
    16

    Default

    6.0

  9. #9
    Join Date
    Feb 2014
    Posts
    22

    Default

    I am experiencing this same behavior. The transformation appears to be done, but it just keeps logging "Triggering heartbeat signal for TRANSNAME at every 10 seconds". There are no more queries running and everything appears to have committed. I did not have this issue in 5.4 and do not recall ever having seen this in the logs before. Any word on how to make it stop?

  10. #10
    Join Date
    Aug 2015
    Posts
    16

    Default

    Good to know that this problem does not occur in 5.4; that suggests it is a regression, which should make it easier to identify.
    Last edited by ryno1234; 03-06-2016 at 11:47 AM.

  11. #11
    Join Date
    Aug 2011
    Posts
    360

    Default

    If it happens in a particular job, it may be caused by a single job entry. You could take a thread dump of your JVM and try to identify the thread that corresponds to that job entry.
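    A minimal sketch of taking such a thread dump on Linux, assuming the JDK is installed (jps and jstack ship with the JDK; the PID shown is a placeholder):

    ```shell
    # Find the PID of the Kitchen JVM; Kitchen's main class is
    # org.pentaho.di.kitchen.Kitchen, so it shows up in a jps listing.
    jps -l | grep -i kitchen

    # Take a thread dump of that PID (replace 12345 with the real PID);
    # -l also includes lock ownership info, useful for deadlock hunting.
    jstack -l 12345 > kitchen-threads.txt

    # Alternative: send SIGQUIT; the dump goes to the process's stdout/log.
    kill -3 12345
    ```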

  12. #12
    Join Date
    Aug 2015
    Posts
    16

    Default

    I have taken a Java stack dump, but I don't know exactly what to look for, since I'm not familiar with the stack traces of Kettle's internals. If a seasoned Kettle contributor could review my thread stacks, that might work. Are you familiar with the source, or do you know someone who is?
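    For what it's worth, here is a rough way to narrow down a dump like that (the file name is a placeholder; each transformation step runs in its own thread, so step names usually appear as thread names):

    ```shell
    # List thread names and their headers from the dump.
    grep -E '^"' kitchen-threads.txt

    # Steps stuck on a full/empty row buffer typically sit in WAITING or
    # TIMED_WAITING inside BlockingRowSet / putRow / getRow frames.
    grep -B1 -A6 -E 'BlockingRowSet|putRow|getRow' kitchen-threads.txt

    # A true Java-level deadlock is reported explicitly at the end of a
    # "jstack -l" dump, if one exists.
    grep -A20 'Found one Java-level deadlock' kitchen-threads.txt
    ```

    The class and method names above (BlockingRowSet, putRow, getRow) are the usual suspects in Kettle row-buffer stalls, but treat the exact frames as assumptions to verify against your own dump.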

  13. #13
    Join Date
    Aug 2015
    Posts
    16

    Default

    UPDATE:

    I originally had this job running every minute by cron. I've spaced the job out to once every 3 minutes and this issue has stopped.

    I don't know if it has to do with memory consumption / starvation, file locks, object locks or what exactly. I'd prefer to run my job every minute as I originally had it as the job is responsible for sending on-demand, custom generated emails. Waiting 3+ minutes is much longer than I would like.

    Any help would be appreciated.
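    One guess, if overlapping runs are what triggers the hang: you could keep the every-minute schedule but serialize runs with flock, so a new run is skipped while the previous one is still alive. A sketch, with all paths made up for illustration:

    ```shell
    # crontab entry: run every minute, but -n makes flock skip this run
    # (exit immediately) if the previous run still holds the lock file.
    # All paths below are hypothetical placeholders.
    * * * * * flock -n /tmp/send_scheduled_email.lock /opt/pdi/kitchen.sh -file=/opt/jobs/send_scheduled_email.kjb -level=Basic >> /var/log/pdi/send_scheduled_email.log 2>&1
    ```

    This doesn't fix the underlying hang, but it would tell you whether overlap between consecutive runs is a factor while keeping the one-minute cadence.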

  14. #14
    Join Date
    Jan 2011
    Posts
    2

    Default

    Hi there,

    I have just been through a similar hair-pulling-going-crazy experience, trying to upgrade from pdi 4.2 to 6.1. I'm not sure if it's related or not, but I see you (ryno1234 and evaleah) mentioned versions 5.4 and 6.0.

    I was trying to run my 4.2 jobs on 6.1 on servers where 4.2 was previously deployed, and my jobs would hang at approximately the same point every time on some servers, but not on others (where they would run normally and complete). When run at the "Rowlevel" log level, I would see the same "Triggering heartbeat signal for **job_name** at every 10 seconds". We have a custom plugin that was previously coded in Java 6 and is now in Java 8. I thought it had something to do with that, and tried to run the job from the installed directory (PDI-5076), but it still didn't work. Removing the plugin didn't resolve the issue either.

    This forum post (here) kind of pointed me to the .kettle folder. I don't have any DB cache file, but I had the regular kettle.properties and shared.xml (since we deploy the same connections to multiple servers). On a whim, I deleted shared.xml and tried again, and it worked!

    My conclusion is that my shared.xml is somehow not 6.1-compatible. I still haven't tried converting it into the 6.1 format (special characters like "$", "{", "}", "/", and ":" are encoded differently).

    I hope this helps...

  15. #15
    Join Date
    Aug 2017
    Posts
    2

    Default

    Did anyone fix this issue? I have the same issue.

    We are trying to upgrade Pentaho from 5.2 to 7.1. After one of the senior members suggested going through the major releases, we tried 6.1 as well. But the job seems to hang after getting arguments from a Get System Data step. The next step is a Database Join, and I don't see its query being fired on the database.

    Bhavya
    Last edited by Bhavya; 09-27-2017 at 08:39 PM.
