Hitachi Vantara Pentaho Community Forums

Thread: Preprocessing: first label removed

  1. #1
    Join Date
    Jul 2016
    Posts
    2

    Preprocessing: first label removed

    Hi,
    I'm having a strange issue with the StringToWordVector filter (package weka.filters.unsupervised.attribute) when preprocessing datasets.


    No matter which dataset I use, when I complete the preprocessing and save the dataset, the first label of the class attribute is removed from the instances.


    For example, using the SMS Spam Collection:
    Code:
    http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

    With this header:


    Code:
    @attribute Text string
    @attribute class-att {ham,spam}
    Taking two instances, before preprocessing:


    Code:
    'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',ham
    
    'FreeMsg Hey there darling it\'s been 3 week\'s now and no word back! I\'d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv',spam
    After preprocessing:


    Code:
    {82 1,380 1,423 1,501 1,504 1,557 1,668 1,703 1,736 1,873 1,919 1,945 1,987 1}
    
    
    {12 1,137 1,253 1,557 1,748 1,769 1,894 1,974 1,1092 1,1160 1,1235 1,1259 1,1271 1,1435 1,1453 1,1522 1,1559 1,1595 1,1602 1,1716 1,1720 1,1756 1,1765 1,1772 1,1803 1,1832 spam}

    I would expect both instances to carry their respective label in the last position ("ham" and "spam"), but the "ham" label is missing from the first instance; the same happens for every instance labelled "ham".
    Instances labelled "spam" are not affected.


    Is this normal?

  2. #2
    Join Date
    Aug 2006
    Posts
    1,741

    It hasn't been removed :-) StringToWordVector outputs instances in Weka's *sparse* data representation. Zeros are not represented explicitly in this format. This applies to nominal attributes too, simply because Weka stores the index of the corresponding label in the instance data. Since ham is the first label, it has index 0 and is therefore left out of the sparse output.

    Cheers,
    Mark.
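
The reply above can be illustrated with a small decoder for sparse data rows. This is only a sketch, not Weka's own code: the four-attribute header, the two rows, and the name `decode_sparse_row` are made up for illustration, and quoted values containing commas or spaces are not handled. It shows that any attribute missing from a sparse row falls back to internal value 0, which for a nominal attribute means its first declared label.

```python
# Hypothetical toy header: (name, labels); labels is None for numeric attributes.
ATTRIBUTES = [
    ("free", None),
    ("world", None),
    ("msg", None),
    ("class-att", ["ham", "spam"]),
]


def decode_sparse_row(row, attributes):
    """Expand one sparse data row into a dense list of values."""
    # Anything not listed defaults to internal value 0:
    # 0.0 for numeric attributes, the first declared label for nominal ones.
    dense = [labels[0] if labels else 0.0 for _, labels in attributes]
    body = row.strip().strip("{}").strip()
    if not body:
        return dense
    # Simplification: assumes no quoted values containing commas or spaces.
    for pair in body.split(","):
        index_text, value = pair.strip().split(None, 1)
        index = int(index_text)
        labels = attributes[index][1]
        dense[index] = value if labels else float(value)
    return dense


ham_row = decode_sparse_row("{1 1}", ATTRIBUTES)
spam_row = decode_sparse_row("{0 1,2 1,3 spam}", ATTRIBUTES)
print(ham_row)   # [0.0, 1.0, 0.0, 'ham']
print(spam_row)  # [1.0, 0.0, 1.0, 'spam']
```

The first row never mentions the class attribute, yet it decodes to "ham", which matches the behaviour seen in the saved dataset.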

  3. #3
    Join Date
    Jul 2016
    Posts
    2

    Hi Mark,
    so my intuition was correct, since classification works fine. It is simply how Weka represents sparse data.

    Thank you for the answer
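
For the writing side, here is a rough sketch of how a dense instance could be turned into a sparse row. It assumes the same rule discussed in this thread (values whose internal representation is 0 are skipped); the header and the name `encode_sparse_row` are hypothetical and this is not how Weka itself is implemented.

```python
# Hypothetical toy header: (name, labels); labels is None for numeric attributes.
ATTRIBUTES = [
    ("free", None),
    ("world", None),
    ("msg", None),
    ("class-att", ["ham", "spam"]),
]


def encode_sparse_row(dense, attributes):
    """Write a dense instance as a sparse row, skipping internal zeros."""
    parts = []
    for index, (value, (_, labels)) in enumerate(zip(dense, attributes)):
        # Nominal values are stored as the index of their label.
        internal = labels.index(value) if labels else value
        if internal != 0:
            shown = value if labels else format(value, "g")
            parts.append(f"{index} {shown}")
    return "{" + ",".join(parts) + "}"


ham_sparse = encode_sparse_row([0, 1, 0, "ham"], ATTRIBUTES)
spam_sparse = encode_sparse_row([1, 0, 1, "spam"], ATTRIBUTES)
print(ham_sparse)   # {1 1}
print(spam_sparse)  # {0 1,2 1,3 spam}
```

Because "ham" has index 0 it is skipped on output, while "spam" (index 1) is written, so no information is lost and classifiers see the same labels either way.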
