
View Full Version : Best place to start?



gvanto
05-13-2008, 01:29 AM
Hey guys,

I'm really enjoying the first edition of the book "Data Mining" - cheers to the authors for an awesome book!

I work in the historical stock-price analysis industry (and am a programmer), and was wondering whether there are any good tutorials, and what the best place to start is, for getting to know data-mining algorithms that extract and analyze rules for making trading decisions?

I am thinking a particular algorithm for deducing rules, rather than decision trees, may be better? I quite like the 1-Rule algorithm.
I've only covered the 1-Rule algorithm so far in the book, and while I understand the general concept, I don't quite understand how Table 4.1 (p. 79) was created - in particular the temperature categories { hot, mild, cool }.

Any help or advice much appreciated,
Gert
Weka newbie

Mark
05-13-2008, 06:38 PM
Hi Gert,

There are probably more books and papers on predicting the stock market than you've had hot dinners :-) Best just to google around for information. Here is one blog entry that has some discussion and references:

http://aplawrence.com/Blog/B1034.html

A lot of approaches to predicting stock prices, currency exchange rates and the like are based on time-series methods. Time series prediction is something that Weka is not well suited to. However, there are some projects that have integrated time series prediction/classification methods into Weka. Try googling "Weka time series".

Table 4.1 in the book shows all the possible single-attribute rules that can be generated for the weather data, along with the predicted class value that makes the fewest errors on the training data.
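In case it helps to see the mechanics, here is a minimal sketch of the 1R procedure as described: for each attribute, map every value to its majority class, then keep the attribute whose rules make the fewest training errors. The toy data and attribute names below are made up for illustration and are not the book's actual weather data:

```python
from collections import Counter

def one_r(instances, class_key):
    """1R: for each attribute, predict the majority class per value,
    then keep the single attribute whose rules make the fewest errors."""
    best = None
    for attr in (k for k in instances[0] if k != class_key):
        # class frequencies for each value of this attribute
        counts = {}
        for inst in instances:
            counts.setdefault(inst[attr], Counter())[inst[class_key]] += 1
        # each value predicts its majority class
        rules = {val: c.most_common(1)[0][0] for val, c in counts.items()}
        # errors = instances not covered by the majority class
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# toy data, loosely in the style of the discretized weather data
data = [
    {"outlook": "sunny",    "temp": "hot",  "play": "no"},
    {"outlook": "sunny",    "temp": "hot",  "play": "no"},
    {"outlook": "overcast", "temp": "hot",  "play": "yes"},
    {"outlook": "rainy",    "temp": "mild", "play": "yes"},
    {"outlook": "rainy",    "temp": "cool", "play": "yes"},
    {"outlook": "rainy",    "temp": "cool", "play": "no"},
]
attr, rules, errors = one_r(data, "play")
print(attr, rules, errors)  # best attribute is "outlook" with 1 error here
```

Each row of Table 4.1 corresponds to one attribute's full rule set plus its total error count, exactly as computed inside the loop above.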

Cheers,
Mark.

gvanto
05-13-2008, 08:15 PM
Hi Mark,

Thanks for the advice!

I am still wondering which algorithm is best to use, but I am not sure I necessarily need time-series prediction:
The dataset I am analyzing is simply one large spreadsheet, with a hypothetical historic trade on each row (yes, each row has a date value, but it's not used in the analysis at all).

What I am basically trying to do is establish a relationship between some of the columns (attributes, right?), labelled things such as 'Opening % Change', 'Prev. Close', Volume, etc., WHERE the output 'Profit' column [class, right?] is above zero. I think this is association learning, but I'm not sure, since the class is involved?

Thanks for your help on Table 4.1. What I don't quite get is how the categories hot, mild & cool were derived for temperature? Pages 80 & 81 discuss how the numeric data is categorized (partitioned), but I can't see from that how the hot, mild, cool rules were derived.

Do you perhaps know if an online step-by-step tutorial exists showing how Table 4.1 (or a similar problem) was derived from the instance data?

Many thanks for your help again,
Gert

Mark
05-15-2008, 05:11 PM
What I am basically trying to do is establish a relationship between some of the columns (attributes, right?), labelled things such as 'Opening % Change', 'Prev. Close', Volume, etc., WHERE the output 'Profit' column [class, right?] is above zero. I think this is association learning, but I'm not sure, since the class is involved?


OK. This is a standard propositional learning problem, so many of Weka's classifiers should be applicable. Is the class attribute numeric or discrete? (This will dictate what sort of learner to apply.)



Thanks for your help on Table 4.1. What I don't quite get is how the categories hot, mild & cool were derived for temperature? Pages 80 & 81 discuss how the numeric data is categorized (partitioned), but I can't see from that how the hot, mild, cool rules were derived.


Oh, I think I understand what you are getting at :-) There are two versions of the weather data - one with numeric attributes and the other with discrete attributes. Some kind soul at some stage in the past converted the numeric values into discrete ones for us; the algorithm didn't do that :-)

Cheers,
Mark.

gvanto
05-27-2008, 12:06 AM
Hi Mark,

Sorry for the tardy reply! I am still looking at this problem.

In answer to your question, the class is simply Buy/Sell (based on whether a particular instance (trade) made a profit or not). So the class, I guess, is nominal.

I have been reading a bit further (up to chapter 6 now) and am slowly seeing how decision trees and the like work. Quite a lot to it.

Can I ask what sort of problems you apply the learning schemes to usually?

Best regards,
G

Mark
05-27-2008, 04:13 AM
Can I ask what sort of problems you apply the learning schemes to usually?
G

There are many applications (the book discusses some examples). At present, I'm working on an application in healthcare that involves building a model to predict the likelihood of patients needing emergency hospital admission at some time over the coming year, given historical data summarized over a three year period.

Cheers,
Mark.

gvanto
05-28-2008, 01:23 AM
Nice one, Mark, sounds like it's being applied to a worthy cause - great stuff! Do you do mining research for a living then?

My particular mining problem contains a variety of nominal and numeric attributes.

Security Open, High, Low and Close data along with others are numeric, whereas certain att's like Gap {Up, Down} and the class: Action { Buy, Sell } are nominal.

I was thinking a covering classifier algorithm could be suitable for this problem, but I'm not sure which one would accommodate both numeric and nominal att's?

Any advice here?

Cheers
G

Mark
05-28-2008, 05:34 AM
Nice one, Mark, sounds like it's being applied to a worthy cause - great stuff! Do you do mining research for a living then?
G

I'm Pentaho's data mining guy :-) (I was an academic up until late last year.) I'm also one of the three original core developers of the Java version of Weka (it existed as a set of C programs with a TCL/TK user interface prior to 1998). Two other (at the time) PhD students (Eibe Frank and Len Trigg) and I wrote the bulk of the code over a couple of years in the late 90s.



My particular mining problem contains a variety of nominal and numeric attributes.

Security Open, High, Low and Close data along with others are numeric, whereas certain att's like Gap {Up, Down} and the class: Action { Buy, Sell } are nominal.

I was thinking a covering classifier algorithm could be suitable for this problem, but I'm not sure which one would accommodate both numeric and nominal att's?
G

If you are looking for knowledge discovery (and reasonable to good predictive performance), then divide-and-conquer tree learners and covering rule learners are a good choice. Weka's major algorithms in this class can all handle both numeric and nominal attributes. So, I'd suggest looking at J48, REPTree, JRip and PART. The first two are tree learners (J48 is a Java implementation of Ross Quinlan's famous C4.5 decision tree learner, release 8), and the latter two are rule learners. For improved predictive performance, these methods can be bagged and boosted (meaning that an ensemble of trees/rules is learned), but interpretability of the model is sacrificed.

Cheers,
Mark.

gvanto
06-04-2008, 01:30 AM
Hey Mark,

I have read a bit further - making progress on things, I think. Among other things, I've discovered that my problem actually uses no nominal attributes; they're all numeric.

I am using a few input attributes to try and determine the class (also numeric).

What I was wondering was this:
a) How do I go about selecting the best numeric-only or mixed algorithm for learning rules? [I have now set aside an out-of-sample test set to test the rules discovered by training on the training set: is this the right track?]

b) Does the number of attributes determine which numeric-only algorithm will perform best?

Cheerio
Gert

Mark
06-04-2008, 06:57 PM
Hi Gert,

When the class is numeric, I'd suggest trying linear regression, M5P (decision trees with linear regression functions at the leaves) and perhaps SMOreg (support vector regression) to begin with.

The number of attributes does have an impact on the algorithms, particularly in terms of runtime. It depends on how many instances there are as well. A rule of thumb is that you need at least twice as many instances as attributes.

Using a hold-out set for testing is fine. If your data set is small, or if the algorithms you are trying are not taking too long to run, you can also try cross-validation for evaluation.
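As a sketch of what cross-validation does (each fold is held out once for testing while the model is fit on the remaining folds), here is a toy example; the "model" is just a training-set mean predictor, and the data and fold count are made up for illustration:

```python
def k_fold_mae(ys, k=3):
    """Cross-validate a mean predictor: split the target values into k
    folds, hold each fold out once, and fit (here: average) on the rest.
    Returns the mean absolute error over all held-out predictions."""
    folds = [ys[i::k] for i in range(k)]  # simple round-robin split
    errors = []
    for i in range(k):
        test = folds[i]
        train = [y for j in range(k) if j != i for y in folds[j]]
        pred = sum(train) / len(train)  # the "model": training mean
        errors.extend(abs(y - pred) for y in test)
    return sum(errors) / len(errors)

ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(k_fold_mae(ys, k=3))  # → 1.5
```

Every instance gets used for testing exactly once, which is why cross-validation is attractive when the data set is too small to spare a separate hold-out set.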

Cheers,
Mark.

gvanto
06-04-2008, 09:39 PM
Hi Mark,

Awesome stuff on writing the M5Rules algo.

I can't seem to find the (command-line runnable) SMOreg file, though?

Ok, when running my problem I get the following output.
Looking at the cross-validation results, I am assuming this is not a very good result for the rules generated. Does this suggest that there simply isn't a strong relationship present in the data?




ltsmm@gertlx:~/eclipse/weka-3-4-12$ java -classpath $CLASSPATH:weka.jar weka.classifiers.rules.M5Rules -t gap2.arff

M5 pruned model rules
(using smoothed linear models) :
Number of Rules : 6

Rule: 1
IF
rel_perc_chg <= 1.495
open > 3.355
THEN

p11_chg =
0.0054 * open
- 0.2248 * volat
- 0.2489 * rel_perc_chg
- 0.4965 [3764/75.19%]

Rule: 2
IF
rel_perc_chg <= 2.275
rel_perc_chg > -1.805
THEN

p11_chg =
-0.0001 * open
- 0.002 * volat
- 0.3842 * rel_perc_chg
- 0.6767 [2540/82.357%]

Rule: 3
IF
open <= 4.505
rel_perc_chg <= 2.285
THEN

p11_chg =
0.3698 * open
- 11.1231 * volat
- 0.0504 * rel_perc_chg
+ 0.2266 [748/89.726%]

Rule: 4
IF
open > 4.595
THEN

p11_chg =
-0.4559 * rel_perc_chg
- 0.6658 [463/60.78%]

Rule: 5
IF
volat <= 0.105
THEN

p11_chg =
0.3815 * volat
- 0.0143 * rel_perc_chg
- 2.6761 [373/76.37%]

Rule: 6

p11_chg =
27.0544 * volat
- 0.549 * rel_perc_chg
- 5.4218 [103/93.709%]



Time taken to build model: 22.85 seconds
Time taken to test model on training data: 1.01 seconds

=== Error on training data ===

Correlation coefficient 0.2805
Mean absolute error 2.2457
Root mean squared error 3.3675
Relative absolute error 94.7937 %
Root relative squared error 95.9851 %
Total Number of Instances 7991



=== Cross-validation ===

Correlation coefficient 0.1091
Mean absolute error 2.2812
Root mean squared error 3.801
Relative absolute error 96.2681 %
Root relative squared error 108.319 %
Total Number of Instances 7991





I have included gap2.arff for reference if it helps make sense of the output ...

Regards,
Gert

Mark
06-05-2008, 12:24 AM
Hmm, those results don't look good. The relative measures (root relative squared error and relative absolute error) are close to, or greater than, 100% - indicating that the scheme doesn't do any better than just predicting the mean of the target values :-)
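For reference, relative absolute error compares the model's total absolute error to that of always predicting the mean of the actual target values, expressed as a percentage - so a predictor that just guesses the mean scores exactly 100%. A toy sketch with made-up numbers:

```python
def relative_absolute_error(actual, predicted):
    """RAE: the model's total |error| relative to the |error| of always
    predicting the mean of the actual values, as a percentage."""
    mean = sum(actual) / len(actual)
    model_err = sum(abs(p - a) for p, a in zip(predicted, actual))
    baseline_err = sum(abs(mean - a) for a in actual)
    return 100.0 * model_err / baseline_err

actual = [1.0, 2.0, 3.0, 4.0]
print(relative_absolute_error(actual, [2.5] * 4))  # mean predictor → 100.0
print(relative_absolute_error(actual, actual))     # perfect predictor → 0.0
```

A value above 100%, like the 108% root relative squared error in the output above, means the learned model is actually worse on held-out data than the constant mean baseline.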

Did you try weka.classifiers.trees.M5P as well? Sometimes this works better than the rules version. How about plain old linear regression?

You can find SMOreg in the functions package:

weka.classifiers.functions.SMOreg

There are lots of parameters to play with though.

Cheers,
Mark.

gvanto
06-09-2008, 08:26 PM
Thanks, Mark, for the explanation of the error measures (I was wondering what those were, but such high numbers didn't seem good).

I am still struggling a bit to get these algos to spit out some nice rules (though I'm thinking I should in fact have prior knowledge of what I'm looking for to start with, which kind of defeats the purpose).

Going to keep playing with these schemes, though. Looking into neural networks now too - quite intrigued by them.

Cheers
Gert