View Full Version : Mining web applications

08-03-2008, 07:44 AM
Hi there...

I'm a mining neophyte, but am interested in the topic for some research I'm doing. I would like to mine data from users' interactions with web applications, and I wonder whether Weka, combined with Pentaho's other functionality (especially Kettle), would help me with this.

This is my current concept, I'll apologize in advance for my poor vocabulary in this context:

The central element of my mining exercise will be a 'Session'. A Session is essentially the execution of a use case with the application(s) under study.

The Session will have particular properties:
- context - Ad-hoc information about the application, its owner, etc.
- user - the identity of the authenticated user
- client - properties of the browser/client used
- previous Sessions - some use cases may change the state of the application (e.g. user requests, manager approves, user accesses)

The user will also have particular properties:
- credentials
- ad-hoc business properties (department, role, profile info)

The Session itself will combine two streams of information. One stream will be user-interaction events recorded through something like the Selenium test tool; the other will be the associated HTTP traffic. This will be done primarily so I can distinguish user-supplied input from input that the application maintains.
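One way to combine the two capture streams is to normalize both into timestamped events and interleave them into a single Session timeline. A minimal sketch (the `SessionEvent` structure and field names are my own illustration, not anything Selenium produces):

```python
from dataclasses import dataclass

@dataclass
class SessionEvent:
    timestamp: float        # seconds since session start
    source: str             # "user" (Selenium-style event) or "http" (traffic capture)
    detail: str             # e.g. "click #login" or "POST /app/login"

def merge_streams(user_events, http_events):
    """Interleave the two capture streams into one ordered Session timeline."""
    return sorted(user_events + http_events, key=lambda e: e.timestamp)

# Example: a click followed by the HTTP request it triggered
timeline = merge_streams(
    [SessionEvent(1.0, "user", "click #login")],
    [SessionEvent(1.2, "http", "POST /app/login")],
)
```

With both sources on one clock, a user event and the HTTP traffic it triggered sit next to each other in the timeline, which is what makes the user-supplied/application-maintained distinction tractable.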

The HTTP traffic itself will be fully parsed. All request and response headers will be enumerated (probably just using the J2EE interfaces, for familiarity). All request parameters (GET or POST) will be extracted into name/value pairs, and I will apply some basic heuristics to attempt to identify the data type of each value.
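The data-type heuristics could be as simple as a cascade of regular-expression checks. A rough sketch (the categories and patterns are illustrative assumptions; check order matters, e.g. an all-digit value is claimed as an integer before the hex-token rule sees it):

```python
import re

def guess_type(value: str) -> str:
    """Very rough heuristics for classifying a request parameter value."""
    if re.fullmatch(r"-?\d+", value):
        return "integer"
    if re.fullmatch(r"-?\d+\.\d+", value):
        return "decimal"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return "date"
    if re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", value):
        return "email"
    if re.fullmatch(r"[0-9a-fA-F]{16,}", value):
        return "token"          # long hex blobs are often session ids
    if value.lower() in ("true", "false", "on", "off", "yes", "no"):
        return "boolean"
    return "text"
```

Tagging each name/value pair with a guessed type gives the miner an extra attribute to work with when grouping parameters later.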

The application responses (at least the HTML/XML portion) will be parsed into a DOM or similar structure, again with any discrete elements and their attributes stored (e.g. <div id=login>myusername</div>) so that I can correlate input with application response. I'll probably also index the response data so I can go back through and search for any and all input parameter values in the data.
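The input-to-response correlation can be sketched with a streaming HTML parser that records which element encloses each echo of a known input value. A minimal illustration using Python's standard-library `html.parser` (the class name is mine; a real DOM or index would be richer):

```python
from html.parser import HTMLParser

class InputEchoFinder(HTMLParser):
    """Collect (tag, attrs) of elements whose text contains a given input value."""
    def __init__(self, needle):
        super().__init__()
        self.needle = needle
        self.stack = []     # open elements, innermost last
        self.hits = []      # elements that echoed the input value

    def handle_starttag(self, tag, attrs):
        self.stack.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if self.needle in data and self.stack:
            self.hits.append(self.stack[-1])

finder = InputEchoFinder("myusername")
finder.feed("<html><body><div id='login'>myusername</div></body></html>")
```

Running this over each response for each captured input value yields exactly the kind of "this parameter reappears inside `<div id=login>`" fact the correlation step needs.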

The requests and responses will be chained to the extent possible, linked by the HTTP Referer header, sequence, and any user events.
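The Referer-based chaining is a best-effort lookup: each request's parent is the most recent earlier request whose URL matches its Referer header. A sketch (the dict-based request shape is an assumption for illustration):

```python
def chain_requests(requests):
    """Map each request index to the index of its Referer parent, or None.

    Heuristic: the parent is the most recently seen request whose URL
    equals this request's Referer header.
    """
    by_url = {}     # url -> index of last request for that url
    chains = {}
    for i, req in enumerate(requests):
        referer = req.get("headers", {}).get("Referer")
        chains[i] = by_url.get(referer)
        by_url[req["url"]] = i
    return chains

reqs = [
    {"url": "http://app/login", "headers": {}},
    {"url": "http://app/home", "headers": {"Referer": "http://app/login"}},
    {"url": "http://app/menu?menuItem=44", "headers": {"Referer": "http://app/home"}},
]
```

Sequence numbers and the merged user events can then break ties where the Referer is missing or ambiguous (redirects, bookmarked URLs, privacy-stripped headers).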

I'm envisioning the Session data to resemble a fishbone diagram, with the 'spine' reflecting the user's execution of the use case and the bones representing the resulting avalanche of browser activity.

Now, what I'm hoping to do with this bucket of bones is mine information about the relationships between application parameters, user input, and the Session 'metadata'. The ability to differentiate values that are Session-specific, user-specific, context-specific, and static would be beneficial. So, for example, I'd like to distinguish a navigation parameter (menuItem=44) from a user-supplied parameter (city=Oz), and also identify information that changes based on input from previous Sessions.
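Before reaching for a full mining tool, the static vs. user-specific vs. session-specific distinction can be approximated by looking at how a parameter's value varies across observations. A sketch of that idea (the labels and decision order are my own assumptions, not an established algorithm):

```python
def classify_parameter(observations):
    """Classify one parameter from (session_id, user_id, value) triples.

    Rough rules: one value everywhere -> static; constant per user ->
    user-specific; constant per session -> session-specific; else
    treat it as freely varying user input.
    """
    values = {v for _, _, v in observations}
    if len(values) == 1:
        return "static"                      # e.g. menuItem=44 every time
    per_user = {}
    for _, user, value in observations:
        per_user.setdefault(user, set()).add(value)
    if all(len(vs) == 1 for vs in per_user.values()):
        return "user-specific"               # constant per user, varies across users
    per_session = {}
    for session, _, value in observations:
        per_session.setdefault(session, set()).add(value)
    if all(len(vs) == 1 for vs in per_session.values()):
        return "session-specific"            # constant within each session
    return "user-supplied"                   # varies freely, e.g. city=Oz
```

A real miner would of course do better (and handle noise), but even this crude variability test separates menuItem=44 from city=Oz given a few captured Sessions.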

For storage, I'm thinking of just dumping it all into a database.

The problem is that the above is going to be a mess, which is why I think a mining tool is the right approach. But I might be off base, which is why I wanted to float this question before I tear into it at full speed.

Sorry for the long post; your thoughts and suggestions are certainly appreciated!

08-03-2008, 05:37 PM

If you can formulate your problem to fit one of the common data mining tasks - i.e. supervised learning (classification/regression), clustering, or association rule learning - and you can map your data to a single de-normalized table, then you can apply Weka's algorithms. Pentaho Data Integration (Kettle) can help with getting the data into the right format.
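That de-normalized table is typically handed to Weka as an ARFF file: a `@relation` name, one `@attribute` line per column (numeric, string, or a nominal value set in braces), then the rows under `@data`. A minimal generator sketch (the helper name and its argument shapes are illustrative; real exports would need quoting and missing-value handling):

```python
def to_arff(relation, attributes, rows):
    """Render a de-normalized table as ARFF text for Weka.

    attributes: list of (name, type) pairs, where type is 'numeric',
    'string', or a list of nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, typ in attributes:
        if isinstance(typ, list):                  # nominal attribute
            typ = "{" + ",".join(typ) + "}"
        lines.append(f"@attribute {name} {typ}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines)
```

A Kettle transformation could do the flattening and then feed a table like this straight into Weka's Explorer or its Java API.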


08-04-2008, 12:01 PM
Hi Mark,

Thanks for your reply. Sounds like I'll have to educate myself a bit... time to raid Safari.