Hitachi Vantara Pentaho Community Forums
Results 1 to 3 of 3

Thread: Read and Write to same table

  1. #1
    Join Date
    Dec 2017
    Posts
    4

    Default Read and Write to same table

    I have a scenario where I have a table with a lot of repeating values that happen to have multiple strings as a UK. This means I can't order it to select only the new values. I have organized my queries to use an auxiliary table that stores the hashes of these strings and whenever I want to find new values to process, I select distinct values that doesn't have a matching hash in this auxiliary table. I know this is an horrible scenario, but I was able to optimize it and it actually runs quite fast.

    Table A
    Code:
    +------------+------------+----------------------------------+------------------------+
    |  User_Id   |  Post_Id   |               Hash               |      Observation       |
    +------------+------------+----------------------------------+------------------------+
    | syysuRSxwW | 9hqsletkft | 3c4ad7dcd0dd9cf1a8fb7e66f283f321 | User A, Post A>A       |
    | NghTd3Dynj | kNwl2P0cfp | d07402b56c6a159acf9bf78733bd525b | User B, Post B>A       |
    | thB4MUM11K | XSb42ho3e0 | ebab10a45e6b6f4482c91b38f3b18a1f | User C, Post C>A       |
    | syysuRSxwW | vwq6usikET | 1a2e448f369f11aaef1866c6603f9992 | User A again, Post A>B |
    | tOeoU45UdE | 9w79SLMsS4 | eafe41f415f3d6f1f26054933bd4a809 | User D, Post D>A       |
    +------------+------------+----------------------------------+------------------------+
    Hash Table

    Code:
    +----------------------------------+
    |               Hash               |
    +----------------------------------+
    | 3c4ad7dcd0dd9cf1a8fb7e66f283f321 |
    | d07402b56c6a159acf9bf78733bd525b |
    | eafe41f415f3d6f1f26054933bd4a809 |
    +----------------------------------+
    Query to fetch new values

    Code:
    SELECT * FROM Table_A WHERE Has NOT IN (SELECT Hash FROM Hash_Table_A)
    (This is an oversimplification of my scenario)


    Whenever I process the new rows from Table_A, I must insert the hashes that were processed from Table_A to Hash_Table_A, but it severely hits performance. I tried using a Blocking Step to wait all rows being queried before inserting to hash table, but if it takes too long to process these rows, the insertion to hash table fails due to connection being closed. I also read that using a blocking step is not a great decision.

    Could somebody help me out? What should I do?

    Thanks in advance.

  2. #2
    Join Date
    Jun 2012
    Posts
    5,534

    Default

    Your performance problem seems to be related to the way you store those new hash values. Is it a Table-Output step?

    Executing a statement like INSERT INTO Hash_Table_A (Hash) SELECT Hash FROM Table_A WHERE Hash NOT IN (SELECT Hash FROM Hash_Table_A) could speed up the process.
    So long, and thanks for all the fish.

  3. #3
    Join Date
    Dec 2017
    Posts
    4

    Default

    Quote Originally Posted by marabu View Post
    Your performance problem seems to be related to the way you store those new hash values. Is it a Table-Output step?

    Executing a statement like INSERT INTO Hash_Table_A (Hash) SELECT Hash FROM Table_A WHERE Hash NOT IN (SELECT Hash FROM Hash_Table_A) could speed up the process.
    You make me feel extremely dumb. Thank you for such a simple solution!

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •  
Privacy Policy | Legal Notices | Safe Harbor Privacy Policy

Copyright © 2005 - 2019 Hitachi Vantara Corporation. All Rights Reserved.