We have a transformation that runs a query, does a bunch of lookups and then writes the in-memory stream to text files (CSV) on the PDI server using Text file output step. We are running into performance issues outputting a small volume of ~3 million rows.
The flow is
- Redshift query using table input step - ~ 6 minutes
- Approximately 30 Look up steps - ~ 3 minutes
- Write in memory data in text files (to folder on PDI server or AWS S3) ~ 20 minutes.
After the last look up step finishes around the 9th minute, the Text File output step crawls for another 11 minutes or so - and hence a total of 20 minutes.
Also tried using the No. Of Copies step but that has not helped. There has to be a better way to write output files. Any help or pointers around speeding the Text file output step would be really appreciated. Thanks in advance.