Data mining:
Knowledge Discovery in Databases

LAPP-Top Computer Science, Pre University College Leiden
Peter van der Putten (putten-at-liacs.nl), February 2005

Lab Session I

Assignment 1: Animal Trees

In this assignment we use a data set of animals and their attributes. Using a decision tree classifier, the computer learns to classify animals into different categories (mammals, fish, reptiles, etc.).

1.1 The data set can be found here. Without using the data mining tool, draw a decision tree three to five levels deep that classifies each animal as a mammal, bird, reptile, fish, amphibian, insect or invertebrate.

1.2 Now we are going to let the computer discover a decision tree itself. First download this zip file with data sets to your desktop and unzip it. Open the zoo.arff data set in WEKA (choose start menu – weka – weka-3-4 – Weka Explorer – Open file).

1.2.1 How many attributes are known of each animal?

1.2.2 How many animals are there in the data set?
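
As a cross-check on what WEKA reports for 1.2.1 and 1.2.2: an ARFF file is plain text, with one @attribute line per attribute and one line per instance after @data. The Python sketch below counts both under that assumption. The three-animal sample and the function name count_arff are made up for illustration; this is not the real zoo.arff.

```python
# Hypothetical miniature ARFF file, made up for illustration (not zoo.arff).
SAMPLE_ARFF = """\
% A comment line
@relation toy_zoo
@attribute animal {cat,trout,sparrow}
@attribute hair {false,true}
@attribute legs numeric
@attribute type {mammal,fish,bird}
@data
cat,true,4,mammal
trout,false,0,fish
sparrow,false,2,bird
"""


def count_arff(text):
    """Return (number of attributes, number of instances) in ARFF text."""
    n_attributes = 0
    n_instances = 0
    in_data = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lowered = line.lower()
        if in_data:
            n_instances += 1  # every remaining line is one instance
        elif lowered.startswith("@attribute"):
            n_attributes += 1
        elif lowered.startswith("@data"):
            in_data = True
    return n_attributes, n_instances


print(count_arff(SAMPLE_ARFF))
```

Note that the attribute count includes the class attribute (here: type), just like the other columns.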

1.3 Let us build some classifiers. Go to the Classify tab. We will use 66% of the animals to build the models and the remaining 34% to evaluate their quality, so select Percentage split – 66%. First we will build a ‘naïve’ model that simply predicts the most frequent class in the data set for every animal. This corresponds to a decision tree of depth 0. Click Start to build the model.

1.3.1 What % of animals is correctly classified?

1.3.2 Into what category are all these animals classified and why?
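
To make the idea behind 1.3 concrete, here is a minimal Python sketch of a percentage split combined with a majority-class model. The ten labels are made up for illustration and are not the zoo data. As a simplification, the rows are kept in their original order, whereas WEKA may shuffle them before splitting.

```python
from collections import Counter

# Hypothetical class labels, made up for illustration (not the zoo data).
labels = ["mammal", "bird", "mammal", "fish", "mammal",
          "bird", "mammal", "insect", "mammal", "fish"]

# Percentage split: the first 66% of the rows train the model,
# the remaining rows are held out for evaluation.
split_point = round(len(labels) * 0.66)
train, test = labels[:split_point], labels[split_point:]

# "Training" a depth-0 tree just means finding the most frequent class.
majority_class = Counter(train).most_common(1)[0][0]

# Every held-out animal receives the same prediction.
correct = sum(1 for actual in test if actual == majority_class)
accuracy = correct / len(test)

print(majority_class, accuracy)
```

The accuracy of such a model is the share of held-out animals that happen to belong to the majority class, which makes it a useful baseline for the trees built in the next steps.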

1.4 Now build a decision tree of depth 1 (a.k.a. a decision stump; select Choose – trees – DecisionStump). Draw the discovered decision tree.

1.4.1 What % of animals is correctly classified?

1.4.2 Give an example of an animal that would not be classified correctly by this model.
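
A decision stump tests a single attribute and predicts one class per branch. The Python sketch below picks the attribute whose split makes the fewest mistakes on the training rows. This is a simplification chosen for readability and not necessarily the criterion WEKA's DecisionStump uses. The six animals and the function name fit_stump are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical animals, made up for illustration (not the zoo data).
animals = [
    {"milk": True,  "feathers": False, "type": "mammal"},
    {"milk": True,  "feathers": False, "type": "mammal"},
    {"milk": False, "feathers": True,  "type": "bird"},
    {"milk": False, "feathers": True,  "type": "bird"},
    {"milk": False, "feathers": False, "type": "fish"},
    {"milk": True,  "feathers": False, "type": "mammal"},
]


def fit_stump(rows, attributes, target):
    """Pick the attribute whose one-level split makes the fewest errors.

    Returns (attribute, {attribute value: predicted class}, training errors).
    """
    best = None
    for attribute in attributes:
        # Count the classes seen for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attribute]][row[target]] += 1
        # Each branch predicts its own majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for row in rows if rule[row[attribute]] != row[target])
        # Keep the first attribute that reaches the lowest error count.
        if best is None or errors < best[2]:
            best = (attribute, rule, errors)
    return best


attribute, rule, errors = fit_stump(animals, ["milk", "feathers"], "type")
print(attribute, rule, errors)
```

With only two branches, the stump can never predict more than two classes, so in this toy data the fish is always misclassified.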

1.5 Now build a decision tree of any depth (a.k.a. a J48 tree). Draw the discovered decision tree.

1.5.1 What % of animals is correctly classified?

1.5.2 Give an example of an animal that would not be classified correctly by this model.
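
J48 is WEKA's implementation of the C4.5 algorithm, which chooses each split using an entropy-based measure. The Python sketch below computes plain information gain for a single attribute; C4.5 itself uses a normalized variant (gain ratio), so treat this as an illustration of the idea rather than of J48's exact computation. The four rows are made up for illustration.

```python
from collections import Counter
from math import log2

# Hypothetical rows of (has_feathers, class), made up for illustration.
rows = [(True, "bird"), (True, "bird"), (False, "mammal"), (False, "mammal")]


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows):
    """Entropy of the classes minus the weighted entropy after the split."""
    all_labels = [label for _, label in rows]
    gain = entropy(all_labels)
    for value in {value for value, _ in rows}:
        subset = [label for v, label in rows if v == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain


print(information_gain(rows))
```

An attribute that separates the classes perfectly, as in this toy example, removes all uncertainty about the class; an attribute that is unrelated to the class has a gain close to zero.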

Assignment 2: Animal Rules

In this exercise you will use the association rule algorithm to discover interesting regularities in the zoo data set.

2.1 The association rule algorithm to be used can only cope with non-numerical (‘nominal’) attributes, so you first have to transform the numerical attribute ‘legs’ into discrete bins (e.g. 0, 2, 4, >4 legs). This type of data preprocessing can be performed in the Preprocess tab by applying the right filter (select Discretize or PKIDiscretize and then Apply). Check the results before and after application of the filter. Now run the association rule algorithm. You can change the numRules option to get more rules if needed.

2.1.1 List at least three interesting rules.

2.1.2 Give at least one example of a rule that is always true according to the algorithm (hint: see the confidence).

2.1.3 Give a counterexample for a specific rule (an animal for which the rule is not correct).
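
The confidence of a rule "IF a THEN b" is the fraction of instances matching a that also match b; a confidence of 1 means the rule has no counterexamples in the data. The Python sketch below computes support, confidence and counterexamples for a rule. The four animals, their attribute values and the function names are made up for illustration and are not taken from the zoo data.

```python
# Hypothetical animals, made up for illustration (not the zoo data).
animals = [
    {"name": "cat",     "hair": True,  "milk": True,  "aquatic": False},
    {"name": "dolphin", "hair": False, "milk": True,  "aquatic": True},
    {"name": "trout",   "hair": False, "milk": False, "aquatic": True},
    {"name": "seal",    "hair": True,  "milk": True,  "aquatic": True},
]


def matches(row, conditions):
    """True if the row has the required value for every listed attribute."""
    return all(row[attribute] == value for attribute, value in conditions.items())


def evaluate_rule(rows, if_part, then_part):
    """Return (support, confidence, names of counterexamples) for a rule."""
    covered = [row for row in rows if matches(row, if_part)]
    correct = [row for row in covered if matches(row, then_part)]
    counterexamples = [row["name"] for row in covered
                       if not matches(row, then_part)]
    # Support: share of all rows for which both parts of the rule hold.
    support = len(correct) / len(rows)
    # Confidence: share of the covered rows for which the rule is correct.
    confidence = len(correct) / len(covered) if covered else 0.0
    return support, confidence, counterexamples


# IF milk THEN hair: holds for cat and seal, fails for dolphin.
print(evaluate_rule(animals, {"milk": True}, {"hair": True}))

# IF hair THEN milk: holds for every covered animal in this toy data.
print(evaluate_rule(animals, {"hair": True}, {"milk": True}))
```

The two rules involve the same attributes but differ in confidence, which shows that a rule and its reverse have to be judged separately.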

Assignment 3: Mine Yourself

At the beginning of this lab session you answered some questions about yourselves. In this exercise we will mine this survey of all Lapp-toppers to discover interesting, surprising and counterintuitive patterns in the data.

3.1 Build a decision tree to predict whether someone watches RTL Boulevard or the Journaal. What is the predictive power of the model? What are important distinguishing characteristics?

3.2 Build classifiers for a selection of the other attributes. For each attribute note the classification accuracy and some distinguishing characteristics. Which attribute is easiest to predict and which one is hardest to predict?

3.3 Use the association rule algorithm to derive interesting rules from this data set. Pick the three rules that you find most interesting (funniest, most trivial, most counterintuitive).

We will discuss some of the patterns found with the group.

Lab Session II

Assignment 4: Recommenders

4.1 List the top recommendations belonging to your favourite book(s), movie(s) or music using two out of the following list of recommenders (or any other recommender you know):

Amazon, BOL, Proxis, Romandvies@bibliotheek.nl, Centrale Discotheek Rotterdam, GNOD (Music, Books, Movies, People), Internet Movie Database, Reel.com

Assignment 5: Data Mining Case Projects

The zip file from Assignment 1 contains a number of data sets from a variety of areas. Most data sets contain a short description in the header; to read it, open the file in a text editor such as Notepad. This exercise should be done in pairs.

Pick a data set that looks interesting and write it on the blackboard so that we don’t get two teams working on the same data set.

For your data set / data mining case note:

The practical problem that is being solved here
The goal of the classifier: what needs to be predicted
A high level description of the data: kind of attributes available, number of attributes / instances etc.
Examples of interesting patterns found by just analyzing individual attributes
The classification accuracy for each classifier type – a decision stump, a decision tree and optionally another type of classifier
The patterns discovered by at least one of the classifiers
One or more interesting association rules
A suggestion of how such a prediction can be used in practice

Create a short PowerPoint presentation discussing your most interesting results. One of you should act as the domain expert and present the beginning and the end; the other should act as the data mining expert and present the data mining approach and results. The rest of the group will ask questions after the presentation. The presentations should be short: no more than 5 minutes.

The presentations will be posted to this website.