How to compute a similarity score for pairs of records in Weka?

03-27-2009, 12:32 PM
Hello, hope you can help me.

I have a data set containing the purchases of customers at a web site. Each record represents a purchase and includes information such as the customer ID, the day of the purchase, the product purchased, etc.
I want to compare each record with all the other records in the data set and obtain a similarity score for each pair of records (transactions).

As an example, if my data set includes the following records:

1. Customer1, Monday 13th, mobile phone
2. Customer1, Sunday 19th, fridge
3. Customer2, Saturday 18th, mobile phone
4. Customer3, Monday 13th, pc
5. Customer3, Saturday 18th, mobile phone

I would like to have a measure of similarity for each pair of comparisons, as follows:

(Customer1, Monday 13th, mobile phone VS Customer1, Sunday 19th, fridge)
(Customer1, Monday 13th, mobile phone VS Customer2, Saturday 18th, mobile phone)
(Customer1, Monday 13th, mobile phone VS Customer3, Monday 13th, pc)
(Customer1, Monday 13th, mobile phone VS Customer3, Saturday 18th, mobile phone)
(Customer1, Sunday 19th, fridge VS Customer2, Saturday 18th, mobile phone)

Do you know of an algorithm in Weka that outputs a measure of similarity?
How can I "automate" the procedure since I have many records?

Thank you very much.

03-27-2009, 04:15 PM

Weka has a number of distance functions that are used by nearest neighbor methods (take a look in weka/core). To get the output that you want, you'd need to write a small program to use a selected distance function on pairs of instances and then format the output as per your requirements.
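
To give an idea of what such a small program could look like, here is a minimal sketch in plain Java. It does not call Weka at all: the class PairwiseSimilarity and its methods are hypothetical names, and the score is a simple matching coefficient (the fraction of attributes on which two records agree), used as a stand-in for whichever distance function you select. The records are the five from the example above, hard-coded as strings; in practice you would load them from your own file. With Weka on the classpath, the similarity method could instead delegate to one of the distance functions in weka.core.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Weka: scores every pair of records.
final class PairwiseSimilarity {

    // Simple matching coefficient: the fraction of attribute positions
    // where the two records hold exactly the same value (0.0 to 1.0).
    static double similarity(String[] a, String[] b) {
        if (a.length != b.length || a.length == 0) {
            throw new IllegalArgumentException(
                "records must be non-empty and have the same number of attributes");
        }
        int matches = 0;
        for (int i = 0; i < a.length; i++) {
            if (a[i].equals(b[i])) {
                matches++;
            }
        }
        return (double) matches / a.length;
    }

    // Compares each record with every later record, so each unordered pair
    // is scored exactly once. Returns one formatted line per pair,
    // using 1-based record numbers as in the example list.
    static List<String> scoreAllPairs(List<String[]> records) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            for (int j = i + 1; j < records.size(); j++) {
                double s = similarity(records.get(i), records.get(j));
                lines.add(String.format(Locale.ROOT, "%d vs %d: %.2f", i + 1, j + 1, s));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // The five example purchases: customer, day, product.
        List<String[]> records = new ArrayList<>();
        records.add(new String[] {"Customer1", "Monday 13th", "mobile phone"});
        records.add(new String[] {"Customer1", "Sunday 19th", "fridge"});
        records.add(new String[] {"Customer2", "Saturday 18th", "mobile phone"});
        records.add(new String[] {"Customer3", "Monday 13th", "pc"});
        records.add(new String[] {"Customer3", "Saturday 18th", "mobile phone"});

        for (String line : scoreAllPairs(records)) {
            System.out.println(line);
        }
    }
}
```

Note that n records give n*(n-1)/2 pairs, so the output grows quickly on a large data set. Exact string matching also treats "Saturday 18th" and "Sunday 19th" as completely different; if you need closer dates to count as more similar, you would convert the day to a number and use a numeric distance for that attribute.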


03-30-2009, 05:18 PM
So it looks like you are interested in data matching?

04-02-2009, 02:54 PM
Yes, I'm interested in data matching. Is there a program I can download and use to perform the matching? (I have Windows Vista.)

Thanks a lot for your help.