I have PDI 4.4.0 set up with Weka 3.6.10 and have been using the Weka Scoring plugin to run quick k-means clustering on data in a PDI flow: the step returns the predicted cluster number for each row and adds it to the flow.

Things have been working well for the past couple of months, but recently I began to wonder whether the results are actually correct. My scenario is the following:

- Weka Scoring takes a single field (column) from the flow, performs k-means clustering on it with two output clusters, and returns the cluster membership for each row.

Simple enough. Everything checks out, and the field mapping is okay too. The problem is that the input field contains some 750 rows, and ALL of them end up in the same cluster. In effect, I get a single-cluster result. However, if I feed the same data column to Weka directly (outside PDI), it returns proper results, with some 80% of the cases in cluster 1 and 20% in cluster 2.

This leads me to believe there is something wrong with the model file. It is true that I originally built the model file from an extremely small random sample of 10 cases; but then again, I would not expect k-means to need training. Give it the number of required output clusters, a maximum iteration count and a dataset to work on, and it should be good to go. My guess is that some sort of "training" is going on that biases the results, because I can see the results of the original 10-case training set classification under Weka Scoring's "Model" tab.
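To illustrate what I suspect is happening, here is a minimal sketch in plain Python. It is not Weka or PDI code, and the centroids and values are made up. It assumes the scoring step only assigns each row to the nearest centroid stored in the model file, without recomputing anything:

```python
def assign_to_fixed_centroids(values, centroids):
    """Return, for each value, the index of the nearest stored centroid."""
    return [
        min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        for v in values
    ]

# Hypothetical centroids learned from a tiny 10-row sample.
sample_centroids = [1.0, 2.0]

# Hypothetical new dataset on a very different scale.
new_values = [50.0, 55.0, 60.0, 200.0, 210.0]

labels = assign_to_fixed_centroids(new_values, sample_centroids)
print(labels)  # [1, 1, 1, 1, 1]
```

If that assumption holds, every row of a dataset that lies far from the sample would land in the same cluster, which matches what I see.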

Can you help me out with this? I can't fall back on an "a priori" model, because I run this transformation on over a hundred datasets (and the number keeps growing), each with a different sample size (number of rows), distribution and mean. How can I run a simple k-means clustering on whatever data Weka Scoring receives and have it return the cluster membership?
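For reference, this is the behaviour I am after, sketched in plain Python rather than Weka. The function name `kmeans_1d` and the sample values are hypothetical, and the min/max initialisation is a simplification of mine (it requires k >= 2), not how Weka picks its starting centroids. The point is that the centroids are recomputed from each dataset that arrives:

```python
def kmeans_1d(values, k=2, max_iter=100):
    """Fit k-means on a single numeric column and return (labels, centroids)."""
    lo, hi = min(values), max(values)
    # Simplified deterministic start: spread centroids between min and max.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every value.
        new_labels = [
            min(range(k), key=lambda i: abs(v - centroids[i])) for v in values
        ]
        if new_labels == labels:
            break  # assignments are stable, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [v for v, lab in zip(values, labels) if lab == i]
            if members:
                centroids[i] = sum(members) / len(members)
    return labels, centroids

# Made-up column values, refit from scratch for this dataset.
labels, centroids = kmeans_1d([50.0, 55.0, 60.0, 200.0, 210.0])
print(labels)     # [0, 0, 0, 1, 1]
print(centroids)  # [55.0, 205.0]
```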

Thanks in advance!