Hitachi Vantara Pentaho Community Forums

Thread: Best Practices for Processing Lots of Zipped Files

  1. #1
    Join Date
    Aug 2009
    Posts
    17

    Best Practices for Processing Lots of Zipped Files

    Fellow ETL developers,

    I have a question about processing a large number of gzipped CSV files. An hour's worth of data is about 500 - 700 GB in zipped format. My current process takes around 5 hours to work through all of the files.

    Let me attempt to explain my current process at a high level.

    JOB_1
    1. SFTP files for previous hour

    TRANSFORMATION_1
    1. Get list of files
    2. Check for corrupt zip files (a quick gzip-check sketch follows this outline)
    3. File Input step
    4. Data cleansing
    5. Spit out to flat file

    (intermediate steps between TRANSFORMATION_1 and JOB_2)
    JOB_2 (run once per created flat file)
    1. Check for Database
    2. Check for Table
    3. Bulk Insert into MySQL
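    For step 2 of TRANSFORMATION_1, the corrupt-file check can also be done outside of PDI by streaming each archive through java.util.zip.GZIPInputStream and treating any IOException as "corrupt". A minimal sketch of that idea (the directory name and buffer size are placeholders, not taken from my actual job):

    Code:
    import java.io.IOException;
    import java.nio.file.*;
    import java.util.zip.GZIPInputStream;

    public class GzipCheck {
        public static void main(String[] args) throws IOException {
            // directory holding one hour's worth of downloaded .gz files (placeholder path)
            Path dir = Paths.get(args.length > 0 ? args[0] : "incoming");
            byte[] buf = new byte[64 * 1024];
            try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.gz")) {
                for (Path f : files) {
                    try (GZIPInputStream in = new GZIPInputStream(Files.newInputStream(f))) {
                        while (in.read(buf) != -1) {
                            // read to the end; a truncated or corrupt member throws IOException
                        }
                        System.out.println("OK      " + f.getFileName());
                    } catch (IOException e) {
                        System.out.println("CORRUPT " + f.getFileName() + " (" + e.getMessage() + ")");
                    }
                }
            }
        }
    }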

    I think my question is simple... is there a better way / method to process this type and amount of data?

    I have played with the various performance tweaks and they each seem to make the process faster. At one point I was averaging 11k rows / second.

    That said, my question stands, is there a better / faster method for processing this amount of data?
    01010100011010000110000101101110011010110111001100100001
    Richard
    @nitrohawk.com

  2. #2
    Join Date
    May 2006
    Posts
    4,882


    Basically that's pretty much it... The next step would be to parallelize things. Start up your stream e.g. 2 times, each copy doing half of the files. But this is usually where it gets tricky ... things that need to happen before other things, and so on.
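    Just to picture the partitioning idea outside of PDI (a rough sketch only; inside Spoon you would run the steps in several copies or launch the transformation twice with different file lists, and the file names and processFile() call below are invented, not a PDI API):

    Code:
    import java.util.List;
    import java.util.concurrent.*;

    public class SplitWork {
        // stands in for whatever the transformation does per file
        static void processFile(String file) {
            System.out.println(Thread.currentThread().getName() + " -> " + file);
        }

        public static void main(String[] args) throws InterruptedException {
            List<String> files = List.of("log_00.gz", "log_01.gz", "log_02.gz", "log_03.gz");
            int workers = 2;                     // "starting up your stream 2 times"
            ExecutorService pool = Executors.newFixedThreadPool(workers);
            for (int w = 0; w < workers; w++) {
                final int slot = w;
                pool.submit(() -> {
                    // this worker takes files slot, slot+workers, slot+2*workers, ...
                    for (int i = slot; i < files.size(); i += workers) {
                        processFile(files.get(i));
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }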

    Regards,
    Sven

  3. #3
    DEinspanjer Guest


    Just to make sure I'm not losing my mind, you mean to say that you have about 700 GB of data when uncompressed, right? Because if you are processing 6 TB of data in 5 hours, I'm not qualified to be giving you performance tips!

    Basically, what I've found in this type of processing is to do everything you can to keep the data as pipe-lined as possible. Any place that you have to wait for a file to be copied or written or sorted, that is going to kill your performance.

    Here is my typical data flow:

    1. Cronjobs and log agents move log files onto my NFS log file storage server as soon as they are rolled over. This is netops and outside my domain.
    2. Every hour, my ETL process kicks off looking at logs from three hours earlier (to give data a chance to finish copying / compressing)
    3. ETL gets a list of files on the log file server for the desired hour
    4. Checks each of those files for validity and logs the resulting list of files to process and files that were skipped

    I have one main transformation script that does all the processing. It receives the file list from step 4 as its input.

    1. Distribute the list of file names out to 3 copies of a GCSV step
    2. Every step after the GCSV also runs in three copies; this keeps the data pipelines in sync so that I effectively have three copies of the transformation running at once. I run on a dual quad-core Xeon server.
    3. Pre-filter for uninteresting data and perform some basic cleansing
    4. Regex step to parse out the fields in my NCSA log file format
    5. Create the date field using a JS step that parses the [01/Apr/2009 00:00:00 -0800] format (a rough sketch of steps 4 and 5 follows a little further down)
    6. GeoIP lookup
    7. DB Lookup step to get the FK for the date
    8. DB Lookup step to get the FK for the GeoIP
    9. Switch / Case step to split out the requests to different parser modules.

    These parser modules will do different things like parse out the URL with another Regex step to look up the product key or the addon key, etc. Everything is done inline.
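    For anyone curious what steps 4 and 5 boil down to outside of PDI, here is a rough plain-Java sketch of the same idea. The regex and date pattern are guesses at a typical NCSA combined line, not my actual expressions, and the sample line is made up:

    Code:
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Locale;
    import java.util.regex.*;

    public class NcsaParse {
        // host ident user [timestamp] "request" status bytes "referer" "user-agent"
        private static final Pattern LINE = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+) \"([^\"]*)\" \"([^\"]*)\"$");

        public static void main(String[] args) throws Exception {
            String line = "10.0.0.1 - - [01/Apr/2009 00:00:00 -0800] "
                    + "\"GET /en-US/firefox/ HTTP/1.1\" 200 5120 \"-\" \"Mozilla/5.0\"";
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                // step 5: turn the bracketed field into a real date
                SimpleDateFormat fmt = new SimpleDateFormat("dd/MMM/yyyy HH:mm:ss Z", Locale.US);
                Date when = fmt.parse(m.group(4));
                System.out.println(when + "  " + m.group(5) + "  status=" + m.group(6));
            }
        }
    }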

    The last step of everything is to bulk load stream the data into my staging fact table. At the end of the day, I perform a group by query that inserts the daily summary data into my main fact tables.

    If I want to scale up my performance, then I distribute the files to process out to more servers. For instance, if I'm back-processing a month of data, I'll set up a cluster of servers that will each process two or three days of data.

    Meant to include a link to a post that I made recently where I mentioned the size of my logs: http://blog.mozilla.com/data/2009/07...ipt-analytics/ You are clearly parsing a lot more data than I am, since your hourly volume is about equivalent to my daily volume.
    Last edited by DEinspanjer; 08-17-2009 at 10:48 AM.

  4. #4
    Join Date
    Feb 2009
    Posts
    296


    First of all, I like your post, DEinspanjer!

    I'm doing similar things without PDI at the moment. So... I'll give you a couple of hints from a whole different perspective.

    Quote Originally Posted by DEinspanjer View Post
    Basically, what I've found in this type of processing is to do everything you can to keep the data as pipe-lined as possible. Any place that you have to wait for a file to be copied or written or sorted, that is going to kill your performance.
    I agree. Which is why I *hate* the fact that I have to sort my data ascending to get a proper 'MAX' and 'MIN' from PDI's Group by step. This really pains me.
    So... sometimes you just have to sort your data. No way around it.

    1. Distribute the list of file names out to 3 copies of a GCSV step
    2. Every step after the GCSV also runs in three copies; this keeps the data pipelines in sync so that I effectively have three copies of the transformation running at once. I run on a dual quad-core Xeon server.
    Are you really creating all your steps thrice, or do you work with the mapping step?

    3. Pre-filter for uninteresting data and perform some basic cleansing
    4. regex step to parse out the fields in my NCSA log file format
    5. create the date field using a JS step that parses the [01/Apr/2009 00:00:00 -0800] format
    Most webservers allow you to specify the way they print the date. This is very useful, as the server has to convert from the Unix timestamp to the format you see anyway, so there is no performance impact in having it print a 'proper' format.
    That being said, they also let you put nice clean tabs between the fields and the like. That helps me drop those RegExp steps, which saves a lot of CPU time.
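    Just to illustrate the point (the field order here is completely made up): once the server writes tab-separated fields, a plain split replaces the regex entirely.

    Code:
    public class TabSplit {
        public static void main(String[] args) {
            // invented field order: host, timestamp, request, status, bytes
            String line = "10.0.0.1\t2009-04-01 00:00:00\tGET /en-US/firefox/ HTTP/1.1\t200\t5120";
            String[] f = line.split("\t", -1);   // -1 keeps trailing empty fields
            System.out.println(f[0] + " | " + f[1] + " | status=" + f[3]);
        }
    }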

    7. DB Lookup step to get the FK for the date
    In another post I mentioned the way we handle our date dimension: it has a pre-defined surrogate key. Another fellow mentioned 'days since epoch' for his/her date dimension surrogate key. This might save you a good number of lookups. Well, only if you're not using the lookup cache anyway.
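    A tiny sketch of the 'days since epoch' variant, just to show that the key can be computed right in the stream instead of being looked up (the exact convention is of course up to you):

    Code:
    import java.time.LocalDate;

    public class DateKey {
        public static void main(String[] args) {
            LocalDate day = LocalDate.of(2009, 8, 17);      // date taken from the row
            long dateKey = day.toEpochDay();                // days since 1970-01-01
            LocalDate back = LocalDate.ofEpochDay(dateKey); // and it converts back for free
            System.out.println(dateKey + " -> " + back);    // 14473 -> 2009-08-17
        }
    }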

    The last step of everything is to bulk load stream the data into my staging fact table.
    What database are you loading into? Any Indexes/Constraints on that staging table?
    Fabian,
    doing ETL with his hands bound behind his back

  5. #5
    DEinspanjer Guest


    Quote Originally Posted by fabianS View Post
    Which is why I *hate* the fact that I have to sort my data ascending to get a proper 'MAX' and 'MIN' from PDI's Group by step. This really pains me.
    So... sometimes you just have to sort your data. No way around it.
    I try to defer this type of operation until the end of the day, when I can do it in my aggregation query. Sometimes the data is partially sorted, so you can use custom Java code to store the data in a HashMap or HashSet according to key and then have logic to release the rows once you know there won't be any more that belong to the same aggregation set. Otherwise, you can split the stream, send the data along the rest of the data path to do all the other work, and have the split-off stream do the sort and aggregation. Then, at the very end, you can merge the streams back together to get your min/max values back into the row just before the bulk load.
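    To make the HashMap idea a bit more concrete, here is a little sketch in plain Java of a running MIN/MAX with early release. The keys, values, and the rule for deciding that a group is complete (here: groups never span more than one hour) are all invented for the example.

    Code:
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class StreamingMinMax {
        static final class Agg { long min = Long.MAX_VALUE, max = Long.MIN_VALUE; }

        private final Map<String, Agg> open = new LinkedHashMap<>();
        private int currentHour = -1;

        void accept(int hour, String key, long value) {
            if (hour != currentHour) {          // hour boundary: earlier groups are complete
                flushAll();
                currentHour = hour;
            }
            Agg a = open.computeIfAbsent(key, k -> new Agg());
            a.min = Math.min(a.min, value);
            a.max = Math.max(a.max, value);
        }

        void flushAll() {
            // in a real transformation these rows would be released downstream, not printed
            open.forEach((k, a) -> System.out.println(k + " min=" + a.min + " max=" + a.max));
            open.clear();
        }

        public static void main(String[] args) {
            StreamingMinMax s = new StreamingMinMax();
            s.accept(0, "productA", 120);
            s.accept(0, "productB", 80);
            s.accept(0, "productA", 95);
            s.accept(1, "productA", 300);       // hour rolls over, hour-0 groups are released
            s.flushAll();
        }
    }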
    Quote Originally Posted by fabianS View Post
    Are you really creating all your steps trice or do you work with the mapping step?
    No, I create the step once then right click and select "Run in N copies".
    Quote Originally Posted by fabianS View Post
    Most webservers allow you to specify the way they print the date. This is very useful, as the server has to convert from the Unix timestamp to the format you see anyway, so there is no performance impact in having it print a 'proper' format.
    That being said, they also let you put nice clean tabs between the fields and the like. That helps me drop those RegExp steps, which saves a lot of CPU time.
    This is a very good point. I am not the only consumer of these log files, though, so I can't easily dictate the format. Certainly something to keep in mind.
    Quote Originally Posted by fabianS View Post
    In another post I mentioned the way we handle our date dimension: it has a pre-defined surrogate key. Another fellow mentioned 'days since epoch' for his/her date dimension surrogate key. This might save you a good number of lookups. Well, only if you're not using the lookup cache anyway.
    I am using a pre-loaded cache, so no big penalty here. I gotta say, I've thought about converting my whole data warehouse to use a pre-defined surrogate just for readability's sake. I'm tired of trying to remember what day 3869 translates to.
    Quote Originally Posted by fabianS View Post
    What database are you loading into? Any Indexes/Constraints on that staging table?
    Vertica. I can bulk load several streams into the same table at the same time. I can also distribute the bulk loads out to the different nodes in my Vertica cluster for more throughput.
