Hitachi Vantara Pentaho Community Forums

Thread: Need assistance creating reports from XML data

  1. #1
    Join Date
    Sep 2013
    Posts
    13

    Default Need assistance creating reports from XML data

    Hello everyone,

    I am new to the Pentaho community. I am very interested in learning Kettle and the other great Pentaho products. I am currently working with XML data and would like to transform the XML file into both graphical and spreadsheet-type output. I have been teaching myself Kettle as well as learning about XML. I am currently trying to use an XML input stream to read in an XML file, and I need some help understanding how to parse the file into usable data to create HTML reports. I have attached an XML example of the data I am working with. This is an example of a file I would receive each day and would need to transform in order to run reports on the different attributes in the XML file. I also have an XML config file that sets string values for each attribute ID found in the attached file. I would need to use this configuration file to convert IDs to names. For example, I would want to display "Item name" rather than "Item ID#".

    Any help would be greatly appreciated. If I can prove the concept that Kettle can meet our needs, I can then build a case for why we should invest in purchasing an EE version. Thank you again for your assistance.

    Regards,

    Joe

    transactions.xml

  2. #2
    Join Date
    Apr 2008
    Posts
    1,771

    Default

    To read XML files you can use two steps:
    1. XML Input Stream (StAX) (http://wiki.pentaho.com/display/EAI/...eam+%28StAX%29)
    2. Get Data From XML (http://wiki.pentaho.com/display/EAI/Get+Data+From+XML)

    In my (very limited) experience, Get Data From XML is much easier to set up and use; I have only tried XML Input Stream with very large XML files (around 1 GB).

    Once you've extracted your data you can use a Select values step to rename fields.
    You can then use a Microsoft Excel Writer (or Microsoft Excel Output) step to create an Excel file.

    Note that there are samples in the data-integration/samples folder of your installation.
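    The three steps above amount to: extract one row per repeating node, rename fields, and write a spreadsheet-style file. Here is a rough Python sketch of the same flow; the element names are hypothetical, not taken from the attached file:

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical element names; the real ones come from the attached transactions.xml.
xml_data = "<TRANSACTIONS><ORDER><Item_ID>42</Item_ID></ORDER></TRANSACTIONS>"

root = ET.fromstring(xml_data)
# 1. Extract one row per repeating ORDER node (the step's "Loop XPath").
rows = [{child.tag: child.text for child in order} for order in root.iter("ORDER")]
# 2. Rename fields (what a Select values step does).
renamed = [{("Item name" if k == "Item_ID" else k): v for k, v in row.items()}
           for row in rows]
# 3. Write a spreadsheet-style file (CSV here, standing in for Microsoft Excel Output).
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(renamed[0].keys()))
writer.writeheader()
writer.writerows(renamed)
print(buf.getvalue())
```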
    -- Mick --

  3. #3
    Join Date
    Nov 2008
    Posts
    777

    Default

    Do you even need Kettle? There are reporting tools, such as Pentaho PRD and BIRT, that can read XML directly and produce reports.
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  4. #4
    Join Date
    Sep 2013
    Posts
    13

    Default

    Quote Originally Posted by darrell.nelson View Post
    Do you even need Kettle? There are reporting tools, such as Pentaho PRD and BIRT, that can read XML directly and produce reports.

    Thank you for your reply. I am not familiar with PRD or BIRT. I will have to educate myself on both. Ultimately, I will be connecting to an MS SQL database to pull data to create the XML file and then upload this file to another server, like Pentaho BI, for reporting purposes. I will also be creating these daily XML files from multiple locations. I would like to create a solution where each location has access to its site data while corporate has access to all site data. I thought I would use the Pentaho BI Platform to create user accounts and dashboards to display data and reports.

    I am open to suggestions. What would you recommend as the best Pentaho solution/s to work with?

    Regards,

    Joe

  5. #5
    Join Date
    Sep 2013
    Posts
    13

    Default

    Quote Originally Posted by Mick_data View Post
    To read XML files you can use two steps:
    1. XML Input Stream (StAX) (http://wiki.pentaho.com/display/EAI/...eam+%28StAX%29)
    2. Get Data From XML (http://wiki.pentaho.com/display/EAI/Get+Data+From+XML)

    In my (very limited) experience, Get Data From XML is much easier to set up and use; I have only tried XML Input Stream with very large XML files (around 1 GB).

    Once you've extracted your data you can use a Select values step to rename fields.
    You can then use a Microsoft Excel Writer (or Microsoft Excel Output) step to create an Excel file.

    Note that there are samples in the data-integration/samples folder of your installation.
    Thank you for your reply. I do have a high-level understanding of both XML Input Stream (StAX) and Get Data From XML. I wanted to stay with the input stream because the Kettle 5.0 documentation says that Get Data From XML is deprecated and XML Input Stream should be used.

    I am able to use the XML Input Stream to read the XML file. I have repeating "Order" nodes that I am not sure how to parse out to create a single data row for each order. I have found one example that uses a script to parse the XML stream and retrieve each repeating node. If you have any experience with this and would be willing to provide an example using XML Input Stream with the file attached to my original post, I would be much indebted to you.

    Regards,

    Joe

  6. #6
    Join Date
    Nov 2008
    Posts
    777

    Default

    Quote Originally Posted by jcoyle View Post
    Ultimately, I will be connecting to an MS SQL database to pull data to create the XML file and then upload this file to another server, like Pentaho BI, for reporting purposes. I will also be creating these daily XML files from multiple locations. I would like to create a solution where each location has access to its site data while corporate has access to all site data.
    Okay then, do you really even need to go through XML files? Is there any reason you can't just use Kettle to pull the transaction data directly from the MS SQL Database(s) and populate a Data Warehouse on a central database server?
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  7. #7
    Join Date
    Sep 2013
    Posts
    13

    Default

    Quote Originally Posted by darrell.nelson View Post
    Okay then, do you really even need to go through XML files? Is there any reason you can't just use Kettle to pull the transaction data directly from the MS SQL Database(s) and populate a Data Warehouse on a central database server?
    Good point..... I would need to do some testing to see which would be more efficient: making an FTP connection to the server and uploading an XML file daily, or making a database connection and updating a remote data warehouse. With the latter there are firewall issues that would have to be considered. I guess the same would be true for FTP as well, but at least your database is not exposed to direct attack on the database ports used for remote connections.

    Regards,

    Joe

  8. #8
    Join Date
    Nov 2008
    Posts
    777

    Default

    Quote Originally Posted by jcoyle View Post
    ...there are firewall issues that would have to be considered.
    Sure, security should be considered. In either case, PDI will work well to populate the data warehouse. In the XML case, I would most likely use the Get data from XML step - just select the file on the "File" tab, hit the "Get XPath nodes" button on the "Content" tab to select the "Loop XPath", and hit the "Get fields" button on the "fields" tab. And there you have it - all your order fields are right there in the preview!
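    What "Get XPath nodes" offers is roughly the set of distinct element paths in the file, from which you pick the repeating one as the Loop XPath. A small Python sketch of that idea, with a made-up document structure:

```python
import xml.etree.ElementTree as ET

# Made-up structure mirroring the thread: orders containing checks containing items.
xml_data = "<TRANSACTIONS><ORDER><CHECK><ITEM/></CHECK></ORDER></TRANSACTIONS>"
root = ET.fromstring(xml_data)

def paths(elem, prefix=""):
    """Collect every distinct element path -- roughly what "Get XPath nodes" lists."""
    here = prefix + "/" + elem.tag
    found = {here}
    for child in elem:
        found |= paths(child, here)
    return found

print(sorted(paths(root)))
# You would then pick the repeating path (here /TRANSACTIONS/ORDER) as the Loop XPath.
```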

    I'm still using 4.4 but is the Get data from XML step really deprecated in 5.0?
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  9. #9
    Join Date
    Sep 2013
    Posts
    13

    Default

    Quote Originally Posted by darrell.nelson View Post
    Sure, security should be considered. In either case, PDI will work well to populate the data warehouse. In the XML case, I would most likely use the Get data from XML step - just select the file on the "File" tab, hit the "Get XPath nodes" button on the "Content" tab to select the "Loop XPath", and hit the "Get fields" button on the "fields" tab. And there you have it - all your order fields are right there in the preview!

    I'm still using 4.4 but is the Get data from XML step really deprecated in 5.0?
    I was mistaken..... it was XML Input that is deprecated, not "Get Data From XML". Here is the link I was thinking of: http://wiki.pentaho.com/display/EAI/XML+Input
    Last edited by jcoyle; 04-03-2014 at 04:59 PM.

  10. #10
    Join Date
    Nov 2008
    Posts
    777

    Default

    Whew! I have dozens of transformations that use that step. So attached is a whopping one-step sample transformation that uses Get data from XML on your data. Doesn't get much easier than that. (Note that I had to fix the closing tag at the end of your sample XML file in order to make it work.)

    If I may be so bold as to offer my opinion, I think your biggest challenge is to design the logical structure of your data warehouse. I recommend using a dimensional model in line with the writings of Ralph Kimball. He has some excellent books on data warehouses and ETL. And Kettle has two powerful steps (Dimension lookup/update & Combination lookup/update) that make updating the model data amazingly easy.
    Attached Files Attached Files
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  11. #11
    Join Date
    Sep 2013
    Posts
    13

    Default

    Thank you for the continued assistance you are providing on my project. I was able to make progress using Get data from XML because of your previous post. Thank you for providing these steps.

    The next challenge I have is converting all the data IDs in transaction.xml. I have a configuration.xml file that stores the values for these IDs. What would be the best approach for creating a transformation step that uses this configuration file to rename the values in the output of the "Get data from XML" step? Some IDs in the configuration.xml file also reference other IDs within the same file. For example, there is a section for Item_Category_ID that also has a Parent_Category_ID, which in turn references another Item_Category_ID. You will notice that an Item_Category that is itself a Parent_Category has a value of "0" for Parent_Category_ID, because a parent category is the highest level. An Item_Category that is a Parent_Category cannot be assigned to an item as its Item_Category; it can only be assigned to another Item_Category as its Parent_Category.

    I will create a configuration file every time I create a transaction.xml file, because the data can change over time (for example, new items, new categories, new employees, etc.). So the configuration file will always be up to date when the transaction.xml file is created. Please find below the configuration.xml file that matches the data in the transaction.xml.

    The configuration file is needed because the transaction data is stored with these IDs in the db table. The configuration file pulls the ID values from other tables in the db (for example, the employee, items, and category tables). This is just how the application was written. I have no control over the application; I can only connect to the db and extract the data from the tables.

    Regards,

    Joe

    configuration.xml
    Last edited by jcoyle; 04-04-2014 at 03:52 PM. Reason: clarification

  12. #12
    Join Date
    Nov 2008
    Posts
    777

    Default

    Another Get data from XML step and then a Stream lookup step should do what you are looking for. The attached sample2.ktr performs a lookup into the configuration data based on an ID in the transaction data. (Although I had to nix an unclosed element in your configuration file first.)

    I don't exactly follow your ID to other ID matching description so take this for what it's worth.
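    The Stream lookup idea boils down to building a key-to-name map from the configuration rows and enriching each transaction row from it. A minimal Python sketch, with hypothetical field names and values:

```python
# Hypothetical rows: one stream from the configuration file, one from transactions.
config_rows = [{"Menu_Item_ID": "10", "NAME": "Coffee"},
               {"Menu_Item_ID": "11", "NAME": "Bagel"}]
transactions = [{"Menu_Item_ID": "11", "Qty": "2"}]

# Stream lookup in miniature: index the lookup stream by key, then enrich each row.
lookup = {c["Menu_Item_ID"]: c["NAME"] for c in config_rows}
enriched = [dict(t, NAME=lookup.get(t["Menu_Item_ID"])) for t in transactions]
print(enriched)
```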
    Attached Files Attached Files
    Last edited by darrell.nelson; 04-04-2014 at 04:33 PM.
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  13. #13
    Join Date
    Sep 2013
    Posts
    13

    Default

    Wow that was fast. Thank you.

    Would I need to create one "get cfg from xml" step for each element I need, or can I just use one? In your example you are just pulling Menu_Item_ID. Each order has several IDs I need to grab, for example Employee_ID, Revenue_ID, and Tender_Type_ID. There does not seem to be a way to add multiple loop paths on the Content tab.

    regards,

    Joe

  14. #14
    Join Date
    Sep 2013
    Posts
    13

    Default

    Please take a look at this example and let me know if I am on the right track. I notice that this method adds a new column to the output rather than replacing the column with the new values. For example, there is a Menu_Item_ID column, and after the transformation there is a new column called NAME that holds the menu item name. Rather than adding a new column, would it be OK to just replace the ID column with the Name column, or would I do that in another step?

    I also need a way to loop through the ITEMS under each CHECK. For clarification, each TRANSACTION can have one or more ORDERs. Each ORDER can have one CHECK, and each CHECK can have one or more ITEMs. In your example I am getting 5 orders but only the first ITEM from each CHECK.

    regards,

    Joe

    same3.ktr
    Last edited by jcoyle; 04-05-2014 at 09:58 AM.

  15. #15
    Join Date
    Nov 2008
    Posts
    777

    Default

    You are on the right track. Here are my comments and recommendations:
    1. The lookups will always add new fields to each row. If you have duplicate names, the lookup will modify the name before adding it. For instance, if you have a NAME field in your incoming stream and then look up a field called NAME, the second one will be renamed to NAME_1 when it shows up in the output stream.
    2. You can always remove and/or rename fields with the Select values step. There are also several other steps to replace fields with other fields, constants, or calculation results.
    3. Your issue with only 5 orders showing up in the output happens because you have two depths of normalization in your transaction file (the TRANSACTIONS element has multiple ORDER elements, and each order's CHECK element has multiple ITEM elements), while Get data from XML supports only one Loop XPath. You can solve this by cascading the <CHECK> XML node down to another Get data from XML step and setting the ITEM element as its Loop XPath. The attached sample3.ktr demonstrates this cascading technique.
    4. Also, one trick of the trade you should learn is to avoid hard-coding the path names of supporting files. I recommend one of two options (my sample uses option #1 in several places):
      1. For files located in the same folder or a sub-folder of the transformation file, use the pre-defined variable ${Internal.Transformation.Filename.Directory} in the path. Just hit CTRL-SPACE for a menu of pre-defined variables.
      2. For a "remote" folder, define one of your own variables for the path. The easiest way to do that is to place a definition in the kettle.properties file located in the .kettle folder of your home directory. These variables will also show up in the CTRL-SPACE menu for easy selection.
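    The cascading technique can be pictured as a nested loop: the outer loop plays the role of the first Get data from XML step (looping on ORDER), and the inner loop plays the cascaded step (looping on ITEM under CHECK). A Python sketch with made-up element names:

```python
import xml.etree.ElementTree as ET

# Made-up element names; the real ones come from the attached transactions.xml.
xml_data = """
<TRANSACTIONS>
  <ORDER><Order_ID>1</Order_ID>
    <CHECK>
      <ITEM><Item_ID>10</Item_ID></ITEM>
      <ITEM><Item_ID>11</Item_ID></ITEM>
    </CHECK>
  </ORDER>
</TRANSACTIONS>"""

root = ET.fromstring(xml_data)
rows = []
# Outer loop = first step (Loop XPath on ORDER);
# inner loop = cascaded step (Loop XPath on ITEM inside the order's CHECK node).
for order in root.findall("ORDER"):
    order_id = order.findtext("Order_ID")
    for item in order.findall("CHECK/ITEM"):
        rows.append({"Order_ID": order_id, "Item_ID": item.findtext("Item_ID")})
print(rows)
```

    Each output row carries the parent order's fields alongside one item, which is exactly the one-row-per-ITEM result the cascade produces.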
    Attached Files Attached Files
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

  16. #16
    Join Date
    Sep 2013
    Posts
    13

    Default

    Another big "Thank You".

    Been real busy at work today so just now had time to respond.

    I knew there had to be a way to loop through multiple depths of normalization. I have now updated my project with pre-defined variables as your last post recommended. I also updated my project so it can process multiple files one at a time, using an example I found by searching this forum. I am not sure if this is the best way; if you have a better method to process multiple XML files in a folder, please share it.

    I still need a step that removes the columns I do not need. From your suggestions it seems straight-forward and I will do that next.

    What would be the best way to write this data out to an Excel file or to a db table? It seems I would add something to the Parent.kjb, but I read somewhere that jobs are not for displaying data and that transformation steps are used for this. So it seems I would need to create another transformation and call it from parent.kjb, or add a hop to process_file.ktr that writes the data to a db table or to a file such as Excel. I could then use this output to create HTML reports.

    Regards,

    Joe

    Sample3.1.zip

  17. #17
    Join Date
    Apr 2008
    Posts
    1,771

    Default

    I still need a step that removes the columns I do not need. From your suggestions it seems straight-forward and I will do that next.
    Select values step - simply delete the fields on its tab
    -- Mick --

  18. #18
    Join Date
    Nov 2008
    Posts
    777

    Default

    You can easily write the data out from within the transformation you read it with; just add a Microsoft Excel Output step or a Table output step as the next step.

    Ultimately, you will probably want to sequence your whole process with a parent job (note that job entries run in series while transformation steps run in parallel). The sequencing could consist of identifying which transaction files to load, loading the transactions, moving the transaction files to a "loaded" folder, and emailing the results of the load.
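    That sequencing can be sketched as a simple series of file operations. In this Python sketch the folder names are hypothetical, the load itself is stubbed out, and the email step is omitted:

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical folders; in PDI these would be job entries running in series.
work = Path(tempfile.mkdtemp())
incoming, loaded = work / "incoming", work / "loaded"
incoming.mkdir()
loaded.mkdir()
(incoming / "day1.xml").write_text("<TRANSACTIONS/>")

processed = []
for xml_file in sorted(incoming.glob("*.xml")):            # 1. identify files to load
    processed.append(xml_file.name)                         # 2. the load itself would run here
    shutil.move(str(xml_file), str(loaded / xml_file.name)) # 3. move to a "loaded" folder
# 4. emailing the results is left out of this sketch.
print(processed)
```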
    pdi-ce-4.4.0-stable
    Java 1.7 (64 bit)
    MySQL 5.6 (64 bit)
    Windows 7 (64 bit)

