Hitachi Vantara Pentaho Community Forums

Thread: Analyzing a transformation "hang"

  1. #1
    DEinspanjer Guest

    Default Analyzing a transformation "hang"

    I'm trying to figure out how a transformation could get into a state where all of the output buffers leading up to a particular step fill up, the step's own input buffer is full as well, and the transformation just sits there doing nothing.

    This is only happening on one machine out of my cluster. It is much faster than all the other machines in the cluster, so it is able to put a lot more rows into the I/O buckets, whereas the other machines get CPU- or I/O-bound much more quickly and start plodding along.

    The odd thing is that, looking at a thread dump, the step is parked waiting for getRow() to return a new row. It seems almost like a deadlock, but it isn't being reported as such in the thread dump.

    Obviously, I have a lot more investigation to do on my own, but I wanted to throw what I have so far out there just on the off chance that someone else might have run into something like this.

    Here is the relevant piece of the transformation. The step that is "stuck" is the Aggregate updates step:

    [transformation screenshot attached]

    Here is a shot of the Carte execution. The speed numbers are so low because it has been sitting there idle for the last hour and a half while I've been researching and finally typing up this post:

    [Carte execution screenshot attached]

    Finally, here is a snippet of the thread dump. I'm running with a non-debug build of 3.2.1, so the core Kettle libraries don't have debugging information in them.
    Code:
    "Aggregate updates.0 (Thread-32)" prio=10 tid=0x0000000048534c00 nid=0x6408 waiting on condition [0x0000000043edc000..0x0000000043edcb10]
       java.lang.Thread.State: WAITING (parking)
            at sun.misc.Unsafe.park(Native Method)
            - parking to wait for  <0x00002aacd3428808> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
            at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:778)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1114)
            at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:186)
            at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:262)
            at java.util.concurrent.ArrayBlockingQueue.size(ArrayBlockingQueue.java:373)
            at org.pentaho.di.core.RowSet.size(Unknown Source)
            at org.pentaho.di.trans.step.BaseStep.getRow(Unknown Source)
            at Aggregateupdates.execute(Aggregateupdates.java:29)
            at plugin.org.pentaho.di.trans.steps.simpleplugin.SimplePluginBase.processRow(SimplePluginBase.java:51)
            at org.pentaho.di.trans.step.BaseStep.runStepThread(Unknown Source)
            at plugin.org.pentaho.di.trans.steps.simpleplugin.SimplePluginBase.run(SimplePluginBase.java:89)
    
    "Stream lookup.0 (Thread-31)" prio=10 tid=0x000000004811c800 nid=0x6407 waiting on condition [0x0000000043ddb000..0x0000000043ddbb90]
       java.lang.Thread.State: WAITING (parking)
            at sun.misc.Unsafe.park(Native Method)
            - parking to wait for  <0x00002aacd3428808> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
            at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
            at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:747)
    Any ideas are welcome. I'm exhausted from trying to figure this out. I actually threw it back over to my IT team to have them run hardware diagnostics on the server because this problem seemed to happen shortly after the server was brought down for a memory upgrade. They ran a full suite of diagnostics and were unable to find any problems, memory or otherwise.

  2. #2
    MattCasters
    Join Date: Nov 1999
    Posts: 9,729

    Default

    There was a thread here a while ago about a bug in the JVM that caused incorrect stalls with the concurrent classes.

    Things like this.

    An update of the JVM should fix that.
    See if there is one for your situation.

  3. #3
    DEinspanjer Guest

    Default

    Hrm. Well, I was running 1.6.0_11 before, which is the latest version available in the RHN for RHEL5. I went ahead and downloaded 1.6.0_14 and manually installed it just to make sure. It is now hanging in a different step, but still behaving the same way.
    I tried running it with Pan instead of Carte; that didn't change anything either.
    I did realize one big difference between this instance and the others though. This one has double the number of files to process, and that means that there is a second rowset pipeline active. I'm going to try changing the transformation to eliminate that second pipeline and see if it makes a difference.

  4. #4
    DEinspanjer Guest

    Default

    It definitely seems to have something to do with having multiple rowset pipelines. If I take all the step copies down to 1, then the problem doesn't happen (of course, the transformation is way too slow, but...).

    I'm trying to distill a test case to see if I can reproduce it outside of the particulars of my transformation's data.

  5. #5
    MattCasters
    Join Date: Nov 1999
    Posts: 9,729

    Default

    Well, you reduce the complexity of the internal JVM locking mechanism to one producer/one consumer instead of more complex situations.

    Of course, if it's somehow a thread-safety issue in Janino, all bets are off...
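
    To make that one-producer/one-consumer picture concrete, here is a minimal sketch of what a single hop boils down to: two threads handing row arrays over a bounded ArrayBlockingQueue, one blocking when the buffer is full and the other when it is empty. This is not the actual RowSet code; the class and variable names are made up for illustration.
    Code:
    import java.util.concurrent.ArrayBlockingQueue;

    // Minimal model of a single Kettle hop: one producer step putting rows into
    // a bounded buffer and one consumer step taking them out.
    public class SingleHopSketch {

        private static final Object[] END_OF_DATA = new Object[0]; // marker row

        public static void main(String[] args) throws InterruptedException {
            // The bounded buffer that plays the role of the hop's RowSet.
            final ArrayBlockingQueue<Object[]> hop = new ArrayBlockingQueue<Object[]>(10000);

            Thread producerStep = new Thread(new Runnable() {
                public void run() {
                    try {
                        for (int i = 0; i < 1000000; i++) {
                            hop.put(new Object[] { Integer.valueOf(i) }); // blocks when the buffer is full
                        }
                        hop.put(END_OF_DATA);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }, "producer-step");

            Thread consumerStep = new Thread(new Runnable() {
                public void run() {
                    try {
                        long rows = 0;
                        while (hop.take() != END_OF_DATA) { // blocks when the buffer is empty
                            rows++;
                        }
                        System.out.println("consumed " + rows + " rows");
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }, "consumer-step");

            producerStep.start();
            consumerStep.start();
            producerStep.join();
            consumerStep.join();
        }
    }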

  6. #6
    DEinspanjer Guest

    Default

    I've managed to reduce this down to a simpler test case that uses Generate Rows and does not use either of the new Janino steps.

    On the machine in question, this test will live-lock on one or more of the pipelines. Which step actually live-locks is arbitrary.

    I'm running the test on a different machine to see if I can reproduce it anywhere else. So far, I have not seen the same behavior on my other ETL machines, which is good because it keeps me in business, but bad because it makes solving the problem so much harder. :/

    To contrast two of the machines, etl01 (the okay one) and etl02 (the borked one): they are both HP blades with dual quad-core Xeons, and both have Sun Java 1.6.0_11 (although I have also tested etl02 with 1.6.0_14 with the same results). etl01 has 32GB of memory, etl02 has 20GB. In both cases, I'm starting Carte with an -Xmx of about 9GB. Both have the exact same binary bits of Kettle installed (a 3.2.1 build) and they are both running the same transformation file. Exhaustive hardware diagnostics have been performed on etl02 to try to rule out any potential memory issues or some such, but it is hard for me to look at this problem and not point my finger at the hardware.

    If you have any ideas, I'd really appreciate hearing them. I'm going to open a support issue and also head over to the #java channel to see if anyone there might have any ideas (of course, they usually get stuck up as soon as they hear that I'm working on a large codebase that is extremely multi-threaded and uses classes that extend Thread). Most people there cannot fathom a legitimate use case for that.
    Attached Files

  7. #7
    MattCasters
    Join Date: Nov 1999
    Posts: 9,729

    Default

    Quote Originally Posted by DEinspanjer View Post
    and uses classes that extend Thread. Most people there cannot fathom a legitimate use case for that.
    Say the word and we'll get rid of it in 4.0.
    Personally, I don't think it makes that much difference, since Java works with the Runnable interface that Thread implements.

    It should be interesting to meet up with the Java zealots at the next Devoxx (JavaPolis) conference in Antwerp. They were obviously fed terribly incorrect advice because they invited me to speak there. Oh well!

    As for the borking instance... I've read anecdotal evidence (for Solaris) that Thread.sleep() can interfere with the concurrent locking system if some OS patches are not applied. You can eliminate all sleep() by disabling the "Manage threads" option in the Misc tab of the transformation settings. Perhaps you can give it a try to see if that helps.

    HTH,
    Matt

  8. #8
    DEinspanjer Guest

    Default

    Quote Originally Posted by MattCasters View Post
    Say the word and we'll get rid of it in 4.0.
    I'll say the word, and the word is Kilim. I've been wanting to take a crack at prototyping a minimal Kettle engine using Kilim instead of threads. I really do think it could provide an order-of-magnitude performance increase on many of the transformations that I run, and possibly on some more widespread Kettle use cases as well.

    http://www.malhar.net/sriram/kilim/

  9. #9
    DEinspanjer Guest

    Default

    Quote Originally Posted by MattCasters View Post
    As for the borking instance... I've read anecdotal evidence (for Solaris) that Thread.sleep() can interfere with the concurrent locking system if some OS patches are not applied. You can eliminate all sleep() by disabling the "Manage threads" option in the Misc tab of the transformation settings. Perhaps you can give it a try to see if that helps.
    I need to try disabling managed thread priority as you suggest here; even though I'm not running Solaris, maybe they missed the possibility that it could affect RHEL too.

    I wanted to comment that I have found a workaround that fixes the issue on this borked computer. If I change the construction of the ArrayBlockingQueue used by RowSet to pass true, enabling the "fairness policy", then the problem goes away. According to the javadoc, the fairness policy ensures that waiting threads are granted access in FIFO order; otherwise, the order is unspecified.
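
    For reference, a paraphrased sketch of that change is below (this is not the actual RowSet source; the class and field names here are illustrative). The only thing that differs between the two constructions is the second constructor argument.
    Code:
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Paraphrased sketch: passing true as the second argument enables the
    // fairness policy, so parked producer/consumer threads acquire the queue's
    // internal lock in FIFO order instead of an unspecified order.
    public class FairRowSetSketch {

        private static final int ROWSET_SIZE = 10000; // the transformation's "rowset size" setting

        // Before the workaround: default, non-fair locking.
        static BlockingQueue<Object[]> defaultBuffer() {
            return new ArrayBlockingQueue<Object[]>(ROWSET_SIZE);
        }

        // After the workaround: fair locking, which makes the stall go away on
        // the problem machine at the cost of extra hand-off overhead.
        static BlockingQueue<Object[]> fairBuffer() {
            return new ArrayBlockingQueue<Object[]>(ROWSET_SIZE, true);
        }
    }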

    What do you think about that change? Also, WHY IN THE HECK IS IT JUST THE ONE MACHINE!? /me stamps off to the datacenter and kicks this machine in the nuts (and bolts).

  10. #10
    MattCasters
    Join Date: Nov 1999
    Posts: 9,729

    Default

    Hi Daniel,

    It could be as simple as an updated shared library somewhere; no idea beyond that. I'll do a few quick benchmarks. The fairness shouldn't matter that much, since we always have a single producer and a single consumer for a single hop (== a RowSet).

    Cheers,
    Matt

  11. #11
    MattCasters
    Join Date: Nov 1999
    Posts: 9,729

    Default

    A simple test (5 generators to a single dummy) is 40% slower with fairness enabled.
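
    For anyone who wants a rough feel for that overhead outside of Kettle, below is a small standalone micro-benchmark in the same spirit: five producer threads feeding one consumer over a bounded queue, run once non-fair and once fair. It is only an approximation of the real generator/dummy test, and the numbers will vary by JVM and OS.
    Code:
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.CountDownLatch;

    // Rough approximation of the "5 generators -> 1 dummy" test: five producer
    // threads push rows into one bounded queue and a single consumer drains it.
    // Run it with fair=false and fair=true to compare elapsed times.
    public class FairnessBench {

        private static final int PRODUCERS = 5;
        private static final int ROWS_PER_PRODUCER = 500000;

        static long run(boolean fair) throws InterruptedException {
            final BlockingQueue<Object[]> hop = new ArrayBlockingQueue<Object[]>(10000, fair);
            final CountDownLatch producersDone = new CountDownLatch(PRODUCERS);
            long start = System.currentTimeMillis();

            for (int p = 0; p < PRODUCERS; p++) {
                new Thread(new Runnable() {
                    public void run() {
                        try {
                            for (int i = 0; i < ROWS_PER_PRODUCER; i++) {
                                hop.put(new Object[] { Integer.valueOf(i) }); // blocks when full
                            }
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        } finally {
                            producersDone.countDown();
                        }
                    }
                }, "generator-" + p).start();
            }

            Thread dummy = new Thread(new Runnable() {
                public void run() {
                    try {
                        long expected = (long) PRODUCERS * ROWS_PER_PRODUCER;
                        for (long seen = 0; seen < expected; seen++) {
                            hop.take(); // blocks when empty
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }, "dummy");
            dummy.start();

            producersDone.await();
            dummy.join();
            return System.currentTimeMillis() - start;
        }

        public static void main(String[] args) throws InterruptedException {
            System.out.println("non-fair: " + run(false) + " ms");
            System.out.println("fair:     " + run(true) + " ms");
        }
    }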

  12. #12
    DEinspanjer Guest

    Default

    Yeah, I was certainly unhappy with that as a long term solution due to the slowdown. Fortunately, we finally found something like an answer.

    We had upgraded the machine (an HP blade) to 20GB in the following configuration: 2x4GB, 6x2GB. Still suspecting the memory as the most likely culprit, since that was what had changed, we swapped the 2x4GB DIMMs out for 2x2GB (unfortunately bringing the machine back down to 16GB), and now my test case no longer causes a hang.

    Since those 2x4GB DIMMs were in the machine before, I don't think they are bad. Rather, I think there is some very odd issue with mixing DIMM sizes in the same bank of slots that caused this problem to surface. BLEH!
