Hitachi Vantara Pentaho Community Forums

Thread: Writing stops after 20,000 records

  1. #1

    Default Writing stops after 20,000 records

    Hello,

    I have a job which basically loads data from a database and joins it with data from another database. There are 5 Table Input steps, and one of them, called "DW Tempo", returns about 54,000 rows; right after it there is a Sort Rows step.

    In this Sort Rows step ("Ordenar Data"), everything works fine until it reaches 20,000 written records. If I limit the database query to 19,999 records it works, but with 20,000 it looks like PDI keeps processing without making any progress...

    This is the only Table Input that is causing a problem; the others work fine. By the way, this is the only Sort Rows ("Ordenar Data") whose result goes to two streams, using the COPY option. It looks like the picture in the attachment.

    Attachment: kettle_print.jpg

    Why would it be limited to 20,000 records? Am I doing something wrong?

    Thanks

  2. #2
    Join Date
    Apr 2012
    Posts
    253

    Default

    You are probably creating a race condition with the parallel joins that have dependencies. I experienced this the other day. The 20k figure has nothing to do with what's actually occurring: the step read the first 10k rows (the default row set size), copied them, then read the next 10k, but nothing is happening elsewhere. I'd try two sorts after the input, one feeding Join Lancamento and one feeding Join Faturamento... it might look like it should work as drawn, but it doesn't.

    i.e. it doesn't seem like you can copy data from one SORT to two JOINs; you need two SORTs, each servicing a separate join.

    Thinking about it a bit more, I think you need to do the COMPLETE sort before copying.
    Last edited by flamierd; 01-08-2014 at 04:19 PM.

  3. #3
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by lucassig
    It looks like the picture in the attachment.

    Attachment: kettle_print.jpg

    Why would it be limited to 20,000 records? Am I doing something wrong?

    Thanks
    Yes, you've done something wrong - you've created a classic deadlock.

    Ordenar Data will stop sending data out when Join Faturamento says its input buffer is full.
    If Ordenar Data is not done, Ordenar AS fat will not complete, so Join Faturamento can't start (nothing will come out of Ordenar AS fat until all of its input rows are received!)

  4. #4

    Default

    Thanks for replying, guys.

    So, what is the best way for me to load the data from the database only once? I will have many date columns that need to be sorted for a join, and I don't want to load the table once for each column.

    Flamierd, I have tried to place a Dummy step between "Ordenar Data" and the two joins, but it gets really strange... I have about 54k records, and the Join Faturamento step produces about 1 million rows (the first "Join Lançamentos" works fine). It is making a cross join, but I double-checked the join settings and they are OK.

    Edit:
    By the way, when I disable "Ordenar AS lanc" and "Join Lancamento", the "Join Faturamento" step works fine. I have also tried sorting the data in two separate steps, and the result is still the same.

    What could cause this? What is the best way to do this?
    Last edited by lucassig; 01-09-2014 at 07:58 AM. Reason: New tests

  5. #5
    Join Date
    Apr 2012
    Posts
    253

    Default

    You need to do the Lancamento join FIRST, then the Faturamento join, as the Fat join is dependent on the Lanc join. Essentially you have to wait for the Lanc join to finish, then do the Fat join. You might want to create a job with two transformations. I'm presuming that the last line exiting the Fat join is your output (to the right and up).

  6. #6

    Default

    But doing it this way I would be duplicating steps, which I am trying to avoid. I can duplicate the DW Tempo step in the same transformation and it works fine, but it seems kind of wasteful to have three steps for every date field in my fact table.

  7. #7
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    You need to put a step that consumes all rows *without using the row buffers* between each leg of the copy and the point where they rejoin.

    For example, put a Sort Rows between Ordenar Data and Join Faturamento, and it *MIGHT* work.

    The best thing is to figure out how to build this flow without having to split the stream and rejoin it. You have one step which requires both legs of the Ordenar Data copy to fully finish before it can do its work. Avoiding the split won't always be possible.

  8. #8
    Join Date
    Nov 2008
    Posts
    271

    Default

    Judging from the picture, Ordenar Data is useless, considering that it sits right after a Table Input. You can retrieve the stream already sorted at the SQL statement level with an ORDER BY.
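    For instance, a minimal sketch of what the Table Input query could look like (dw_tempo and data_completa are placeholder names, not necessarily the ones in your transformation):

    select tempo_sk, data_completa
    from dw_tempo
    order by data_completa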

  9. #9
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by Ato
    Judging from the picture, Ordenar Data is useless, considering that it sits right after a Table Input. You can retrieve the stream already sorted at the SQL statement level with an ORDER BY.
    Only if you know for certain that the DB will use the same collation as PDI, since PDI is still sorting in "Ordenar AS lanc" - that has bitten more than a few people on this forum.

  10. #10
    Join Date
    Nov 2008
    Posts
    271

    Default

    You are right, but it is a matter of a few tests and, besides, I wonder if the hop entering Ordenar AS lanc is coming from another Table Input on the same DB.

    @lucassig: have you already envisaged a workflow with fewer Sort Rows steps? They hurt performance badly, apart from the current deadlock issue.

  11. #11

    Default

    Quote Originally Posted by Ato
    @lucassig: have you already envisaged a workflow with fewer Sort Rows steps? They hurt performance badly, apart from the current deadlock issue.
    This is my first loading job; it's the first idea I had. In the "tips" from PDI I read that I need to ORDER before a JOIN.

    Is there a better way to join these rows? My point is to get the surrogate keys from the DW to join into my fact table.

    Also, I notice that the JOIN takes quite some time. In a stress test with 1 million records, PDI took 4 hours to join 50k records and I gave up...
    Last edited by lucassig; 01-09-2014 at 02:51 PM.

  12. #12
    Join Date
    Nov 2008
    Posts
    271

    Default

    So, if I understand correctly, you have an incoming stream of "facts" from a legacy/source system, and you are trying to get the surrogate keys from the dimensions, perhaps to land the data into your fact tables? Is that the case?

  13. #13

    Default

    Yes.

  14. #14
    Join Date
    Nov 2008
    Posts
    271

    Default

    Then you may want to consider a different approach, based on the Database Lookup step, your lookup table being, for instance, the time dimension (TEMPO).
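    Roughly speaking, for each incoming fact row the Database Lookup step fires (and can cache) a query equivalent to the sketch below; dw_tempo, tempo_sk and data_completa are placeholder names:

    select tempo_sk
    from dw_tempo
    where data_completa = ?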

  15. #15
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Or even (if possible) do as much as you can at the DB level.

    eg.
    select a.*, b.* from facttable a, dimensiontable b where b.id=a.dimid

  16. #16

    Default

    Ato, I took the "Database lookup" path and it works perfectly. Performance is way better now...

    Gutlez, I'm trying to get to know more of PDI's features for now, but I'll keep it in mind.

    Thanks

  17. #17
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quite often you will hear people on the board saying to avoid the Modified Java Script Value, or to push things down to the DB.

    In your Table Input step, if you ask the DB to do some of the work, you'll usually get better performance. DBs are designed for building joins like this, so getting your fact table out of your legacy system with all the dimension fields in place may be a better path.

    On the other hand, DB Lookup caches values, so if you have the same value to look up over and over again, and your DB is on the far side of a thin pipe, then DB Lookup could improve performance.

    There are many ways to do things in PDI, and none of them are the best way in all circumstances.

  18. #18

    Default

    Gutlez,

    I was thinking about that: if I make one big join getting all the information I need in a single Table Input, I would skip many steps. But I would need to do it in a single transformation, which may be more difficult to maintain (especially for other people on my team).

    As far as I've noticed, every step I add takes about the same time as the previous step; the only difference is that they start "simultaneously", so the total time doesn't grow too much.

    Now I am looking into the rows/second rate and related settings, because I get 50 rows/s in each step. That seems too low... Do you have any performance tips to increase this value? I notice that even after I retrieve all the information from the DB, steps like "Dummy" or "Copy rows" (just to add 1 pre-defined row) take too much time and keep to this 50 r/s rate.

    Please tell me if it's time to open a new thread, since this is drifting away from the original subject...

  19. #19
    Join Date
    Apr 2012
    Posts
    253

    Default

    DB Lookup is going to be much slower, but you are only working with a million-record output. On huge loads I just use PDI to do bulk inserts into a work table and perform the necessary joins there, outputting to another table. The DB is always going to blow PDI out of the water with joins and lookups. PDI is a marvelous tool, but it is not a database system.

    Though I can't think of anything that should slow you down to 50 r/s unless your lookup is on a field in a large table without a valid index.
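    For the work-table pattern mentioned above, a rough sketch of the second stage (work_fact, dw_tempo and fact_final are placeholder names; the real columns and join keys depend on your model):

    insert into fact_final (tempo_sk, valor)
    select t.tempo_sk, w.valor
    from work_fact w
    join dw_tempo t on t.data_completa = w.data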
    Last edited by flamierd; 01-13-2014 at 11:00 AM.

  20. #20

    Default

    Flamierd,

    I was just creating indexes on the database and ran a test. The rate increased to 300 r/s just by creating indexes on the foreign keys. This really helps.
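    For reference, a minimal sketch of that kind of index in PostgreSQL (the table and column names are only placeholders):

    create index idx_fact_tempo_sk on fact_table (tempo_sk);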

    Now I've been playing with "Change Number of Copies to Start"; with 4 copies I get 300 r/s in each copy. That is nice.

    Now I'll run some more massive tests. Do you use the "Transactional Database" option (the one that uses only one connection for the DB)? Does it help?

  21. #21
    Join Date
    Nov 2008
    Posts
    271

    Default

    Are you over a network?

  22. #22
    Join Date
    Apr 2012
    Posts
    253

    Default

    Another thing: the default memory allocation for PDI is set to 1024m. If you kick this up to 2048m you should see better performance, especially with large loads. Of course this is dependent on your environment (Windows/Linux).

  23. #23

    Default

    Yes, I'm on a local network.

    I have 4 GB of memory, a Core2Duo 2.93 GHz, and I'm running Windows on my dev station and Linux on the server. Changing the memory size to 2g did not increase the r/s speed.

    By the way, what would be a good r/s speed in my case? I'm loading data from my local Postgres instance and saving it to a Postgres instance on the network.

  24. #24

    Default

    Also, I notice that my PC uses only about 900k of memory, even with 2.5 GB free.

  25. #25
    Join Date
    Nov 2008
    Posts
    271

    Default

    Quote Originally Posted by lucassig
    By the way, what would be a good r/s speed in my case? I'm loading data from my local Postgres instance and saving it to a Postgres instance on the network.
    This is hard to say, because it mostly depends on what you are doing: what kind of steps you are using, whether there are steps that prevent the stream from flowing in a round-robin fashion (like the sort), what kind of query you are firing against the DB, and whether there are indexes on your fact table.

    At your current speed the performance is very bad. At 50-300 r/s, I would say you are short by 1 or 2 orders of magnitude. The most likely bottlenecks are the network and your fact table (try dropping all indexes, if any, before loading).
