Hitachi Vantara Pentaho Community Forums

Thread: What's the biggest bottleneck (time) in writing data to a SQL database?

  1. #1
    Join Date
    Sep 2014
    Posts
    175

    Default What's the biggest bottleneck (time) in writing data to a SQL database?

    I'm trying to optimize the time it takes for pure inserts (I guess updates are a different story).
    I'm wondering what the biggest bottleneck is, or the best way to reduce time, for basic inserts into a SQL database.
    The first thing, probably, is the size of the data, right? That means the number of rows, the number of columns, and the data size in each column. Some of these can't be reduced, but the KB footprint of each row is one thing that can potentially be optimized, right?
    What else can be optimized, or what is the largest factor? Is it the transmission medium? How big is the difference between writing to a database that's on the same computer vs. writing across a web connection (one that is robust, fast, and has a 1 ms ping)?
    Finally, why is it that multiple parallel connections to the database seem to speed up the process, up to a point? When I have 20 connections making inserts round-robin style, it's about 6-7x faster than one connection writing ALL the data. I'm curious why this is.
    Right now I have 2.2 million rows totaling 2.7 GB, which works out to about 1.23 KB per row.
    Inserting 1,000 rows at a time (1.23 MB) using 14 connections takes 6.7 seconds. That's a snail-paced 10.66 rows per second per connection (1,000 rows / 6.7 s / 14 connections). Even assuming one connection would be just as fast (it isn't), that's about 150 rows per second at best, which is not exactly "fast" either. I'm writing over a super-fast, robust web connection because we can't have the ETL process on the same machine as the data warehouse.
    So... how can I optimize for speed here?
    The reason for 1,000 rows at a time is that the data comes in pages of 1,000 records each - but optimizing the parsing is a separate issue for now.
    I believe I have one primary index, but nothing too write-expensive. So far I've simply been doing Monte Carlo-style testing (try it and see what works), but I need something more focused.



    I posted a similar question on StackOverflow and so far the response has been that I need to leverage "bulk inserting" and "minimal logging" (as far as SQL Server goes).


    How exactly would I do a bulk insert using Pentaho? I'm already using a transaction commit size of 1000 rows (one data set per commit), but perhaps I need to make that larger. Do I need to use the "Execute SQL statement" step instead of a table insert step, so I can utilize the "BULK INSERT" syntax?
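
    For reference, the BULK INSERT syntax the StackOverflow answer is pointing at looks roughly like this; the table name, file path, and terminators below are just placeholders, not anything from our actual setup:

        BULK INSERT dbo.StagingTable           -- hypothetical target table
        FROM 'D:\etl\page_0001.csv'            -- hypothetical flat file written beforehand
        WITH (
            FIELDTERMINATOR = ',',             -- must match the file's column delimiter
            ROWTERMINATOR   = '\n',            -- must match the file's line ending
            BATCHSIZE       = 100000,          -- rows per internal transaction
            TABLOCK                            -- table lock; part of what enables minimal logging
        );

    It loads from a file on a drive the SQL Server instance can reach, so the rows would have to be written out to a flat file first.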

  2. #2

    Default

    Which step do you use for inserts? Insert/Update? Or Table Output? If you use Table Output, you can do batched inserts by activating "Use batch update for inserts".

  3. #3
    Join Date
    Jul 2009
    Posts
    476

    Default

    Generally, the fastest way to insert a lot of data is to drop all of the indexes on the table, bulk load the table, and rebuild the indexes. I don't see a Bulk Loader step for SQL Server, so the Table Output step looks like your next-fastest choice (but search the forum first).
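
    Roughly, in SQL Server syntax (the index and table names here are made up; you would typically disable and rebuild the nonclustered indexes rather than the clustered one, since disabling the clustered index makes the table unreadable):

        -- before the load
        ALTER INDEX IX_Fact_CustomerId ON dbo.FactRows DISABLE;

        -- ... bulk load / Table Output runs here ...

        -- after the load
        ALTER INDEX IX_Fact_CustomerId ON dbo.FactRows REBUILD;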

    If your destination table already has millions or tens of millions of rows, rebuilding the indexes might take a long time, and if the table is being used while you run the ETL, dropping and rebuilding the indexes might not be feasible. In that case, if you can partition the table so that you are loading into an empty partition, the drop-index/load-data/rebuild-index method might still work for you on that specific partition while leaving the rest of the table as-is.

    Some DBMSes allow you to exchange a table with a partition, so your ETL could load the data into an ordinary table, where you drop and rebuild the indexes, and then you swap the table with an empty partition and you're done.
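
    On SQL Server that exchange is an ALTER TABLE ... SWITCH; the names and partition number below are placeholders:

        -- load dbo.FactRows_Staging (same columns, same indexes, same filegroup),
        -- then swap it into an empty partition of the big table;
        -- the switch itself is a metadata-only operation
        ALTER TABLE dbo.FactRows_Staging
            SWITCH TO dbo.FactRows PARTITION 5;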

  4. #4
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    Quote Originally Posted by cool_runnings
    How big is the difference between writing to a database that's on the same computer vs. writing across a web connection (one that is robust, fast, and has a 1 ms ping)?
    A web layer will introduce a lag. While the ping may have a round-trip of 1 ms, the DB layer may add a few ms, and then the web layer adds a few ms on top of that.

    The closer you can place PDI to the DB, and the fewer round-trips you can do, the faster the performance.

    Update: added an example.
    Using the exact same tools, running the exact same queries, against the exact same DB:
    - Across a network with a 1 ms ping, the data load can take 15 minutes
    - From the local machine to the DB, the data load can take 1-2 minutes

    The example above is based on a real-world DB configuration, with both setups using TCP directly to the DB. The big difference: one is TCP to localhost, the other is TCP to <RemoteIP>.
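
    As a rough back-of-the-envelope for the numbers in this thread (the 3 ms figure is just an assumed example, not a measurement): if every round trip to the DB costs about 3 ms end to end, then 2.2 million single-row round trips add up to roughly 2,200,000 x 3 ms = 6,600 s, close to two hours of pure waiting, while 1,000-row batches need only about 2,200 round trips, around 6.6 s of network wait in total. The per-round-trip cost multiplies by the number of round trips, not by the number of rows.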
    Last edited by gutlez; 10-20-2015 at 10:46 AM.

  5. #5
    Join Date
    Sep 2014
    Posts
    175

    Default

    Use batch updates in the table insert step
    Double-checked, but I already am, with a commit size of 1000 (I can only insert 1000 rows per cycle, since each JSON doc has 1000 records - perhaps I can aggregate these documents, but still).

    drop all of the indexes on the table, bulk load the table, and rebuild the indexes.
    So in terms of Pentaho Spoon, I would essentially be making heavy use of the 'Execute SQL statement' step, both to (a) drop and rebuild the indexes (not a problem) and (b) do the "bulk load" itself, which I'm not exactly sure how to do.

    Also, the 'Execute SQL statement' step is a bit buggy when it comes to inserts. Some fields require quotes, some cannot have them - I can build that in a JavaScript step, I suppose. The double hyphen "--" STILL comments out the rest of the SQL statement, even when it's contained within quotes (a major bug). Still, I can give it a try.

    Bulk inserts require loading the data from a flat file format though, correct?

    - Across a network with a 1 ms ping, the data load can take 15 minutes
    - From the local machine to the DB, the data load can take 1-2 minutes
    Hmm, this may be something to look into. For some reason my team REALLY doesn't want the ETL scripts on the same machine as the data warehouse. They fear a security issue or just increased risk (maybe a rogue ETL script could crash the machine if it uses enough memory).

  6. #6
    Join Date
    Jul 2009
    Posts
    476

    Default

    There is a "BulkLoad into MSSQL" job step, not a transformation step, which runs a bulk load command to load a file into SQL Server. You might be able to output the rows in your transformation to a text file, then bulk-load the text file in a job with the job step. I don't have experience with this job step, but it's there for your type of situation.

  7. #7
    Join Date
    Apr 2008
    Posts
    4,696

    Default

    I would suggest doing an insert with 1 row, then with 10 rows, then with 100 rows, then with 1000 rows, then with 10000 rows.

    Time each one.

    Setting up the insert will have some time associated with it.
    Sending each row to the DB will have some time associated with it.
    Tearing down the connection will have some time associated with it.

    By layering a web component into the middle, you *HAVE* to do setup, transfer, and tear-down *PER BLOCK*.

    You need to get the web component out of the middle if you want to have decent performance.
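
    One way to read the results of that test: per block of n rows, the time is roughly T(n) = S + n*r + D, where S is the setup cost, r the per-row transfer cost, and D the tear-down cost (these symbols are just for this sketch). Loading N rows in blocks of n then costs about (N/n)*(S + D) + N*r, so comparing the measured times at n = 1, 10, 100, 1000, and 10000 shows how much of the 6.7 seconds is fixed per-block overhead (which shrinks as the blocks grow, or as the web hop is removed) and how much is genuine per-row transfer.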
