Hitachi Vantara Pentaho Community Forums

Thread: insert/update table and incremental extraction

  1. #1


    hi all,

    I want to ask something. I want to read data from a log file as an incremental extraction. For example, I have 10 rows of data and I extract them to my warehouse. After 15 minutes I want to run the transformation again on the same log file, but extract only the new data added to the log since the last extraction. How do I do that?

    The next question: I'm having a problem with the output table. I run my transformation and, for example, add 10 rows to my table. After that I run the transformation again with the same data, and my table ends up with 20 rows, so it duplicates. I use the Insert/Update step. Isn't that supposed to update the data if the row is the same? How do I make it update the rows instead of adding the same data to the table?

    And I'm using Postgres as my database.
    Last edited by rian_wu; 12-18-2007 at 09:54 PM.

  2. #2

    Quote Originally Posted by rian_wu
    I want to read data from a log file as an incremental extraction. For example, I have 10 rows of data and I extract them to my warehouse. After 15 minutes I want to run the transformation again on the same log file, but extract only the new data added to the log since the last extraction. How do I do that?
    DIY ... What's usually done: after an extraction, the cut-off point is selected from the table you write to and stored in a restart table. The original transformation/query then uses that cut-off point to determine where to extract from.
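
    To make that concrete, here is a minimal JDBC sketch of the pattern (just an illustration; the restart_point, source_log and warehouse_table names are invented, and in Kettle you would do the same thing with a Table Input step reading the restart table):

    Code:
    // Sketch of the restart-table pattern: read the last cut-off point,
    // extract only newer rows, then store the new cut-off point.
    import java.sql.*;

    public class RestartTableSketch {
        public static void main(String[] args) throws SQLException {
            try (Connection con = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/dw", "etl", "secret")) {
                // 1. Read the cut-off point stored by the previous run.
                Timestamp cutoff = new Timestamp(0L); // default: far in the past
                try (Statement st = con.createStatement();
                     ResultSet rs = st.executeQuery(
                             "SELECT last_loaded FROM restart_point")) {
                    if (rs.next()) cutoff = rs.getTimestamp(1);
                }
                // 2. Extract only rows newer than the cut-off point.
                try (PreparedStatement ps = con.prepareStatement(
                        "SELECT * FROM source_log WHERE log_ts > ?")) {
                    ps.setTimestamp(1, cutoff);
                    try (ResultSet rows = ps.executeQuery()) {
                        // ... write these rows to the warehouse table ...
                    }
                }
                // 3. Select the new cut-off point from the table just written
                //    to, and store it for the next run.
                try (Statement st = con.createStatement()) {
                    st.executeUpdate("UPDATE restart_point SET last_loaded ="
                            + " (SELECT max(log_ts) FROM warehouse_table)");
                }
            }
        }
    }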

    Quote Originally Posted by rian_wu
    I'm having a problem with the output table. I run my transformation and, for example, add 10 rows to my table. After that I run the transformation again with the same data, and my table ends up with 20 rows, so it duplicates. I use the Insert/Update step. Isn't that supposed to update the data if the row is the same?
    If you fill in the right field names in the right places in the step... yes.

    Regards,
    Sven

  3. #3


    Quote Originally Posted by sboden
    DIY ... What's usually done: after an extraction, the cut-off point is selected from the table you write to and stored in a restart table. The original transformation/query then uses that cut-off point to determine where to extract from.
    This is what I get from your explanation, Sven; correct me if I'm wrong.

    So what you mean is that after I run the first transformation, where I get the first 10 rows of data, I take something unique from one of the fields of the last row and put it in a file or another table. Then I add some steps to the transformation that read that unique value, find the last row loaded by the first transformation in the log file, and start extracting the data from after that row? How do I do that?

    If you fill in the right field names in the right places in the step... yes
    You mean the "key(s) to look up the value(s)" fields? I already put all the fields there, so it should be OK, right? The data in every row is the same, since I got it from the same log file, but it still didn't update all the rows; only some rows were updated.

    Can you explain more, Sven? Thanks a lot.

  4. #4


    Quote Originally Posted by rian_wu
    I'm having a problem with the output table. I run my transformation and, for example, add 10 rows to my table. After that I run the transformation again with the same data, and my table ends up with 20 rows, so it duplicates. I use the Insert/Update step. Isn't that supposed to update the data if the row is the same?
    Figured this out. It duplicates because one of my fields has a null value; if I remove that field from the compared keys, it works fine. (Makes sense: in SQL, NULL never compares equal to NULL, so a key lookup on a null field never matches, and the step inserts instead of updating.)

  5. #5


    I'm still having a problem with the incremental extraction from the log file. Can somebody help me out with how to do that?

    Thanks

  6. #6

    What is your current problem? Your last problem seemed to be solved.

    If you still have a problem with Insert/Update... have a look at the user guide and play a little with the step. If you put the right fields in the right places you will get correct behaviour.

    Regards,
    Sven

  7. #7


    Log file extraction (I think that's the one you still haven't got working):

    In order to allow the ETL to figure out what it has or hasn't imported already, you need to set a marker somewhere, indicating something unique about the latest logfile entry imported on each run. More often than not that would be the timestamp of that latest entry. So here is what you do:

    1.) Initially the marker is set to a date way off in the past, thus ensuring that all of the logfile is imported.

    2.) During each ETL run you only import data from the logfile where the timestamp is greater than the marker. You then figure out the highest timestamp imported and write that to the marker table (that table doesn't need any other columns, just the marker; there won't be any problem, because the marker will be different on each run, and even if it isn't, that simply means your logfile hasn't got any new records).

    3.) Now use the highest marker setting as an input for each subsequent run.

    Of course, if your log file doesn't have a timestamp, you'll need to use something different as a marker.
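
    If it helps, here is a minimal Java sketch of those three steps, assuming each log line starts with an ISO timestamp ("2008-01-08T06:02:00 ...") and the marker lives in a one-line text file (all file names made up):

    Code:
    import java.io.IOException;
    import java.nio.file.*;
    import java.time.LocalDateTime;

    public class LogMarkerSketch {
        public static void main(String[] args) throws IOException {
            Path marker = Paths.get("marker.txt");
            Path log = Paths.get("app.log");

            // 1. Start far in the past so the first run imports everything.
            LocalDateTime cutoff = Files.exists(marker)
                    ? LocalDateTime.parse(Files.readString(marker).trim())
                    : LocalDateTime.MIN;

            LocalDateTime highest = cutoff;
            for (String line : Files.readAllLines(log)) {
                // 2. Only import entries newer than the marker.
                LocalDateTime ts = LocalDateTime.parse(line.substring(0, 19));
                if (ts.isAfter(cutoff)) {
                    // ... send this row to the warehouse ...
                    if (ts.isAfter(highest)) highest = ts;
                }
            }
            // 3. Overwrite the marker with the highest timestamp imported.
            Files.writeString(marker, highest.toString());
        }
    }
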
    Hope this helps.

  8. #8


    Yeah, I'm still having a problem with the log file.

    I have date and time in my log file, so those could be used as unique keys, right?

    So what I need to do is: after I run my transformation, get the last date and time and put it in, let's say, a text file. But then I need another step in the transformation that reads that timestamp file and lets through only the data after the last date/time it contains. The question now is: which step filters the log file so I only get the data after the timestamp in the text file? With Filter Rows I can use the '<' operator to compare the date and time, but how do I get that value into the stream coming from the Text File Input step?

    Generally, the steps in my transformation are:
    1. Read the log file.
    2. Separate the data in the log file and write it to a text file.
    3. Filter the data based on the transactions from the text file.
    4. Insert the filtered text into a database table.

    Thanks a lot.
    Last edited by rian_wu; 01-06-2008 at 11:21 PM.

  9. #9


    Use either CSV File Input or Text File Input. Then sort the input and use the values from the last row (or the first, depending on your sort order). Or, even simpler, overwrite the file each time the transformation runs, so it only ever contains one row.

  10. #10


    That is my intention: I'm going to rewrite the timestamp text file on each run so it only contains one row. But the questions now are:
    1. How do I get the last row from the log? Do I use JavaScript, or is there a simpler step for that?
    2. I know I need to filter the log file based on the timestamp, but how? I'm using Text File Input, and there is no way to filter its output against another step except with Filter Rows. I tried putting in a Filter Rows step and combining the input from the timestamp and the log file so I could compare the fields, but I get an error saying the log step has 3 fields while the timestamp step has only 1.

  11. #11

    1) Blocking Step, and only pass the last row.
    2) You have to make a proper date of it, of course.

    Regards,
    Sven

  12. #12


    Done that, Sven, and it works; thanks a lot. Now that I have the timestamp, how do I feed that value back in so the log file extraction starts from that timestamp? What step do I need to add?
    Last edited by rian_wu; 01-08-2008 at 06:02 AM.

  13. #13

    You can store it in a table, a file... The important thing is that you store it somewhere, so that you can later filter for rows with a later date.

    Regards,
    Sven

  14. #14


    I already put the timestamp in a file. What I'm asking is how to filter the log file against that timestamp when I run my transformation again. What step do I need to add, other than the step that gets the timestamp?

    Thanks

  15. #15

    What you could do, for example:

    - Store the timestamp in a file (1 line, 1 column).
    - Read in the data; read in the timestamp.
    - Do a join (Join Rows step) between the two, so each row carries both your current timestamp and your original timestamp.
    - Use e.g. a Filter step to compare the timestamps and route the rows: before the original, throw away; after the original, process. (A sketch of the same flow in plain code is below.)
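
    In plain code, the join-plus-filter amounts to something like this (a sketch only; the row layout and the process() call are placeholders):

    Code:
    import java.time.LocalDateTime;
    import java.util.List;

    public class JoinFilterSketch {
        static class Row {
            LocalDateTime logTs; // timestamp parsed from the log line
            String payload;      // the remaining fields
        }

        // After the join, every row carries both its own timestamp and the
        // stored marker timestamp; the filter routes it to one of two targets.
        static void route(List<Row> rows, LocalDateTime marker) {
            for (Row r : rows) {
                if (r.logTs.isAfter(marker)) {
                    process(r); // after the original timestamp: process
                }               // before/at the original: throw away
            }
        }

        static void process(Row r) {
            System.out.println("processing " + r.payload);
        }
    }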

    Regards,
    Sven

  16. #16


    Got it, finally.

    What I've done: after I use the Blocking Step to get the timestamp, I put it in a file. For the extraction from the log, I read both the log and the timestamp file and then use Join Rows. I filter on the comparison "time from the log file > time from the timestamp file", and then put the result into a new file for further extraction.

    Thanks a lot, Sven and wolfgangsz.

  17. #17

Doubt regarding Update/Delete

    While going through some examples I found the Filter Rows step; with that we can put in a condition for insert, update or delete. I'm using it for my requirement, but I'm not getting the exact output. Can you guide me on this?

    My requirement:
    - If revid = 1 (the first record): islatest = yes, delete = no.
    - If revid >= 2 and instruction = "rectify": islatest = yes, delete = no; for the previous records: islatest = no, delete = yes.
    - If revid >= 2 and instruction = "update": islatest = yes, delete = no; for the previous records: islatest = no, delete = no.
    - If revid >= 2 and instruction = "delete": islatest = no, delete = yes; for the previous records: islatest = no, delete = yes.

    I have this kind of requirement. Can you show me how to do it?
    Thanks,
    Sree

  18. #18


    Filter Rows basically sends rows to 2 target steps, based on whether the condition is true or false; it cannot itself update, insert or delete. Looking at your requirement, I think it's going to take several filter steps. Example: the 1st filter checks revid = 1. Rows where that is true go to steps that set islatest and delete; rows where it is false go to another Filter Rows step that checks instruction = "rectify", and so on, until you have all the filter steps needed to cover the conditions. A rough sketch of the resulting branching is below.
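
    Expressed as plain code, the chain of Filter Rows steps behaves roughly like this (a sketch using the field names from your post; updating the flags of the previous records would be a separate lookup/update flow):

    Code:
    public class CascadedFilterSketch {
        static class Row {
            int revid;
            String instruction;
            boolean islatest;
            boolean delete;
        }

        // Each "else if" corresponds to the false output of one Filter Rows
        // step feeding the next filter in the chain.
        static void applyFlags(Row r) {
            if (r.revid == 1) {                           // filter 1
                r.islatest = true;
                r.delete = false;
            } else if ("rectify".equals(r.instruction)) { // filter 2
                r.islatest = true;
                r.delete = false;  // previous records: islatest=no, delete=yes
            } else if ("update".equals(r.instruction)) {  // filter 3
                r.islatest = true;
                r.delete = false;  // previous records: islatest=no, delete=no
            } else if ("delete".equals(r.instruction)) {  // filter 4
                r.islatest = false;
                r.delete = true;   // previous records: islatest=no, delete=yes
            }
        }
    }
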
    Thanks,
    -Rian_Wu-

  19. #19


    Thanks for the reply. I tried that, and I got the direct conditions like rectify or update working, but for the previous records I have to go to another step, which I don't know how to do. Can you help me with this?

    Thanks

  20. #20
    DEinspanjer Guest


    Did you look at the examples in the samples directory of your Kettle install?
    Did you read the documentation on the Filter step included in the user guide?

  21. #21


    Yes, I looked at all the examples, and from those I worked out some steps. My requirement is now half done; the other half, I mean the part where the conditions change for the previous records, is still open. I have done it only for newly arriving records; now I have to do it for the previous records too.

    thanks,

  22. #22


    If the data streams come from one source, it shouldn't be a problem to filter, since you have all the fields; you just need to add some steps to modify the data or route it to whatever outputs you want.

    To change the previous record, maybe you can put it in a file or a table, and then, every time new data is processed, update that file or table so it holds your latest data. What do you want to do with that data? Do you just want to save the latest row, or do you want to keep the previous records as well, with changed fields? If you just want to keep the latest row, my suggestion is to use a Text File Output step with "append" turned off; that way, every time the transformation runs, it keeps only the last row processed.
    Thanks,
    -Rian_Wu-

  23. #23


    Hi,

    I have some XML files that contain a body with many paragraphs; for every XML file the number of paragraphs differs. I have to merge all the paragraph elements into one element or field, like bodycontent, and display all the paragraphs at once in that field. The main problem is that different XML files contain different numbers of <p> tags. Can anyone suggest how to do this in Kettle, i.e. which step I should use?

    Thanks in Advance

  24. #24


    Hi Matt, can you help me with the question I posted above? This is my XML file:

    <?xml version="1.0" encoding="ISO-8859-1"?>

    <NewsML>
    <Catalog Href="www.sda.ch/newsml/topics/SDACatalog.xml"/>
    <NewsEnvelope>
    <file_name>20080208074000587102012100000_brz003.xml</file_name>
    <DateAndTime>20080208T074001+0100</DateAndTime>
    <NewsService FormalName="SDA-ATS News Service"/>
    <NewsProduct FormalName="Si-Online D"/>
    <Priority FormalName="3"/>
    </NewsEnvelope>
    <NewsItem>
    <Identification>
    <NewsIdentifier>
    <ProviderId>www.sda-ats.ch</ProviderId>
    <DateId>20080208</DateId>
    <NewsItemId>brz003</NewsItemId>
    <RevisionId PreviousRevision="0" Update="N">1</RevisionId>
    <PublicIdentifier>urn:newsml:www.sda-ats.ch:20080208:brz003:1N</PublicIdentifier>
    </NewsIdentifier>
    </Identification>
    <NewsManagement>
    <NewsItemType FormalName="News"/>
    <FirstCreated>20080208T074001+0200</FirstCreated>
    <ThisRevisionCreated>20080208T074001+0200</ThisRevisionCreated>
    <Status FormalName="Usable"/>
    <Urgency FormalName="3"/>
    <AssociatedWith NewsItem="urn:newsml:www.sda-ats.ch:20080208hd9802"/>
    <AssociatedWith NewsItem="urn:newsml:www.sda-ats.ch:20080208hd9801"/>
    </NewsManagement>
    <NewsComponent>
    <NewsLines>
    <HeadLine>In Lausanne den Turniersieg im Visier</HeadLine>
    <DateLine>Eishockey</DateLine>
    <NewsLine>
    <NewsLineType FormalName="CatchLine"/>
    <NewsLineText>Den Turniersieg im Visier</NewsLineText>
    </NewsLine>
    <NewsLine>
    <NewsLineType FormalName="CatchWord"/>
    <NewsLineText>Vierländerturnier</NewsLineText>
    </NewsLine>
    </NewsLines>
    <AdministrativeMetadata>
    <Provider>
    <Party FormalName="si"/>
    </Provider>
    <Source>
    <Party FormalName="si"/>
    </Source>
    <Property FormalName="author" Value="voe"/>
    </AdministrativeMetadata>
    <DescriptiveMetadata>
    <Language FormalName="DE"/>
    <Genre FormalName="Current"/>
    <SubjectCode>
    <Subject FormalName="15000000"/>
    <SubjectMatter FormalName="15031000"/>
    </SubjectCode>
    <Location>
    <Property FormalName="Country" Value="CH"/>
    </Location>
    </DescriptiveMetadata>
    <NewsComponent>
    <ContentItem>
    <Format Scheme="IptcFormat" FormalName="NITF"/>
    <Characteristics>
    <Property FormalName="FormatVersion" Value="3.0"/>
    </Characteristics>
    <DataContent>
    <nitf>
    <body>
    <body.content>
    <p lede="true">Nach dem Turniersieg im November am Deutschland-Cup und einer erfolgreichen Länderspielreise im Dezember (zwei Siege aus drei Spielen) will das Eishockey-Nationalteam am Heimturnier in Lausanne die positiven Eindrücke der letzten Wochen bestätigen.</p>
    <p>"Natürlich wollen wir unser Heimturnier gewinnen", sagt Ralph Krueger. Anderseits spielen beim Februar-Termin, unmittelbar vor Beginn der Playoffs, auch andere Aspekte eine wichtige Rolle. Weil in der Meisterschaft Ende Monat die entscheidende Phase beginnt, liess der Nationalcoach beispielsweise am Mittwochabend den leicht angeschlagenen Berner Goalie Marco Bührer zum Klub zurückkehren. Statt Bührer stehen nun gegen Deutschland (Freitag), die Slowakei (Samstag) und Frankreich (Sonntag) Ronnie Rüeger und Thomas Bäumle vor dem Schweizer Tor.</p>
    <p>Ralph Krueger wird das Abschneiden in Lausanne aber nicht nur an den Ergebnissen messen. Seine Mannschaft, die zuletzt im Ausland äusserst kompakt auftrat, soll die vielen guten Eindrücke vor eigenem Publikum bestätigen. Die Spieler und die Zuschauer sollen Spass haben. Es gehe in den kommenden Tagen auch darum, das Schweizer Hockey zu zelebrieren. Krueger: "Wir fesseln die Spieler im Moment nicht mit taktischen Handschellen."</p>
    <p>Beim Zelebrieren in Lausanne fehlen einige Titulare. Torhüter David Aebischer wurde auch nach Bührers Forfait so wenig aufgeboten wie Martin Plüss, Paul Di Pietro und Patrick Fischer. Zu bedeuten habe das nichts, so Krueger, denn "wir mussten Platz schaffen für Spieler wie Christen, Wick und Déruns". Diese Spieler gefielen im November oder Dezember und erhalten nun eine Möglichkeit, sich nochmals für die WM-Teilnahme im Mai zu empfehlen. Wer aber mit an die Weltmeisterschaft nach Kanada darf, entscheidet sich erst im April.</p>
    <p>Die Probleme, mit denen sich Krueger vor einem Jahr konfrontiert sah, stellen sich im Moment dem deutschen Coach Uwe Krupp. Die Deutschen sind am Freitag der erste Gegner der Schweizer. Vor dem Turnier kritisierten die Kölner Haie die "fehlende Kommunikation" des Bundestrainers, ausserdem provozierte die Nichtberücksichtigung des DEL-Topskorers Hock eine öffentliche Polemik. Krupp nominierte für das Lausanner Turnier jene Leute, mit denen er auch die WM plant.</p>
    <p>Im slowakischen Team figurieren zahlreiche Söldner. Aus der NLA sind Stanislav Hudec (Basel), Juraj Kolnik (Servette) und Peter Sejna (ZSC Lions) mit von der Partie. Krueger: "Die Deutschen und die Slowaken werden sich ähnlich stark präsentieren wie im November und im Dezember. Sie sind ideale Gegner für uns. Die Spiele verlaufen immer äusserst umstritten."</p>
    <p>Für Krueger und seine Assistenten stellen indes die Franzosen den interessantesten Gegner dar. Von den Franzosen wisse er im Moment noch nicht viel, gibt Krueger zu. Am 3. Mai werden aber die Schweizer gegen Frankreich in Québec City das erste WM-Spiel bestreiten. Krueger: "In den nächsten drei Monaten wollen wir über die Franzosen möglichst viel erfahren, wir werden ihre drei Spiele in Lausanne genauestens analysieren." Die Franzosen treten mit einer erfahrenen Equipe an. Zehn Akteure bestritten schon über 100 Länderspiele. Aus der NLA figuriert Laurent Meunier (Servette) im Team, nicht aber Sébastien Bordeleau (Bern).</p>
    <p>Nach drei (stimmungslosen) Austragungen in Basel verlegte der Schweizer Verband sein Vierländerturnier erstmals an den Genfersee. In Lausanne bestritt die Schweiz vor fast fünf Jahren das letzte Länderspiel (1:1 gegen Tschechien im April 2003). Die Organisatoren hoffen, dass die Fans den Anlass mittragen. Im Vorverkauf wurden 500 Dauerkarten und 2000 Tickets abgesetzt. Damit wurden die Basler Zahlen aus dem Vorjahr bereits dreimal übertroffen. Für Gérard Scheidegger, den Geschäftsführer des HC Lausanne, stellt diese Zahl eine gute Basis dar. Scheidegger: "Wir hoffen, dass am Freitag und am Samstag jeweils vier- bis fünftausend Leute in die Malley-Halle kommen."</p>
    <p>Programm. Lausanne. Vierländerturnier (8.-10. Februar). Heute Freitag: Slowakei - Frankreich (16.00 Uhr). Schweiz - Deutschland (20.10). - Samstag: Deutschland - Frankreich (16.00 Uhr). Schweiz - Slowakei (20.00). - Sonntag: Slowakei - Deutschland (12.00 Uhr). Schweiz - Frankreich (16.00).</p>
    </body.content>
    </body>
    </nitf>
    </DataContent>
    </ContentItem>
    </NewsComponent>
    </NewsComponent>
    </NewsItem>
    </NewsML>

    In there I have many paragraphs, and the paragraph count differs for every XML file, so I have to parse the entire XML file and get the paragraph part into one field like bodycontent. For this I'm using a Java program to collect the paragraphs into a single field, and I'm calling that Java program from JavaScript. But it only takes the first XML file and displays that file's body content in all the rows for the other XML files. Can you help me with this? How do I do it in JavaScript? I'm using the XML Input step, then JavaScript, then Table Output.
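
    For what it's worth, here is a minimal Java sketch of joining every <p> of one file into a single bodycontent string. The important part is to parse each incoming file separately (a fresh parse per row) rather than caching one parsed document, which would produce exactly the symptom described above:

    Code:
    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;

    public class BodyContentSketch {
        // Call this once per XML file (i.e. once per row in the stream).
        static String bodyContent(File xmlFile) throws Exception {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(xmlFile);
            // Collect every <p>, however many the file contains.
            NodeList ps = doc.getElementsByTagName("p");
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < ps.getLength(); i++) {
                if (i > 0) sb.append("\n\n");
                sb.append(ps.item(i).getTextContent().trim());
            }
            return sb.toString();
        }
    }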

    thanks,
    sree

  25. #25


    Hi,

    Can anyone help me with the question I posted above? I want to call a non-static Java class from JavaScript. Can anyone tell me how to do this?
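
    Not sure if this is still needed, but for reference: the JavaScript step in Kettle runs on Rhino, which can instantiate a Java class on Kettle's classpath through the Packages object. A hypothetical helper (class and method names are made up):

    Code:
    // Compile this, put the jar on Kettle's classpath, then instantiate it
    // from the JavaScript step via Rhino, e.g.:
    //
    //   var helper = new Packages.com.example.BodyHelper();
    //   var body   = helper.extract(xmlString);
    //
    package com.example;

    public class BodyHelper {
        // Non-static: each "new" in the script creates a fresh instance, so
        // state is not accidentally shared between rows/files.
        public String extract(String xml) {
            // ... parse the XML and return the concatenated paragraphs ...
            return xml;
        }
    }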


    Thanks,
    Sree
