Data mining:
Knowledge Discovery in Databases
LAPP-Top
Computer Science,
Peter van der Putten (putten-at-liacs.nl), February 2005
Lab Session I
Assignment 1: Animal Trees
In this assignment we use a
data set of animals and their attributes. Using a decision tree classifier the
computer learns to classify animals into different categories (mammals, fish,
reptiles etc).
1.1 The data set can be found here. Without using
the data mining tool, draw a decision tree three to five levels deep that
classifies animals as mammal, bird, reptile, fish, amphibian, insect or invertebrate.
1.2 Now we are going to let the computer discover a
decision tree itself. First download this zip file
with data sets to your desktop and unzip it. Open the zoo.arff data set in WEKA
(choose start menu – weka –
weka-3-4 – Weka Explorer – Open file).
1.2.1 How many attributes are known for each animal?
1.2.2 How many animals are there in the data set?
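As a cross-check on what the Explorer reports, both counts can also be read from the ARFF text itself: each @attribute line declares one attribute (the class counts as one too), and each non-empty, non-comment line after @data is one instance. A minimal Python sketch, assuming a tiny made-up ARFF text (not the real zoo.arff) and a hypothetical helper count_arff; it ignores sparse ARFF and other format details:

```python
def count_arff(text):
    """Return (number of attributes, number of instances) for ARFF text."""
    n_attributes = 0
    n_instances = 0
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            n_instances += 1                  # every data row is one instance
        elif lower.startswith("@attribute"):
            n_attributes += 1
        elif lower.startswith("@data"):
            in_data = True                    # everything after this is data
    return n_attributes, n_instances


# Made-up toy file, for illustration only.
TOY_ARFF = """% toy example
@relation toy_animals
@attribute hair {true,false}
@attribute legs numeric
@attribute type {mammal,bird,fish}
@data
true,4,mammal
false,2,bird
false,0,fish
true,4,mammal
"""

print(count_arff(TOY_ARFF))  # (3, 4)
```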
1.3 Let us build some
classifiers. Go to the Classify tab. We will use 66% of the animals to build the models and the remaining
34% to evaluate their quality, so select percentage split – 66%. First we will build a ‘naïve’ model that just
predicts the most frequently occurring class in the data set for each animal. This
corresponds to a decision tree of depth 0. Click start to build a model.
1.3.1 What % of animals is correctly classified?
1.3.2 Into what category are all these animals classified
and why?
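The idea behind the naïve model and the percentage split can be sketched in a few lines of Python. This is an illustration under stated assumptions, not WEKA's implementation: the labels are made up, the helper names are hypothetical, and the split is taken in file order (WEKA may shuffle the data before splitting).

```python
from collections import Counter


def percentage_split(rows, train_fraction=0.66):
    """Split rows in order: the first part builds the model, the rest evaluates it."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def majority_class(labels):
    """Return the most frequent label (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up class labels, for illustration only.
labels = ["mammal", "bird", "mammal", "fish", "mammal",
          "bird", "mammal", "fish", "mammal", "bird"]

train, test = percentage_split(labels)
guess = majority_class(train)        # the 'depth 0 tree': one fixed answer
predictions = [guess] * len(test)    # every test animal gets the same class
print(guess, accuracy(predictions, test))
```

Because the model ignores all attributes, its accuracy on the evaluation part equals the share of evaluation animals that happen to belong to the majority class of the training part.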
1.4 Now build a decision tree of depth 1 (a.k.a. a decision stump - select choose
– trees – decision stump). Draw the discovered decision tree.
1.4.1 What % of animals is correctly classified?
1.4.2 Give an example of an animal that would not be
classified correctly by this model.
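To make the notion of a depth-1 tree concrete, here is a minimal Python sketch of a stump on nominal attributes. It is an assumption-laden illustration: the toy animals are made up, the helper names are hypothetical, and the attribute is chosen by training accuracy, which is not necessarily the criterion WEKA's decision stump uses.

```python
from collections import Counter, defaultdict


def fit_stump(rows, attributes, target):
    """Pick the single attribute whose per-value majority vote
    classifies the most training rows correctly."""
    best = None
    for attr in attributes:
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[target]] += 1
        # One leaf per attribute value, labelled with that value's majority class.
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        correct = sum(c[rule[v]] for v, c in by_value.items())
        if best is None or correct > best[2]:
            best = (attr, rule, correct)
    attr, rule, _ = best
    # Fallback class for attribute values never seen during training.
    default = Counter(row[target] for row in rows).most_common(1)[0][0]
    return attr, rule, default


def predict(stump, row):
    attr, rule, default = stump
    return rule.get(row[attr], default)


# Made-up toy data, for illustration only.
toy = [
    {"milk": "yes", "feathers": "no", "type": "mammal"},
    {"milk": "yes", "feathers": "no", "type": "mammal"},
    {"milk": "no", "feathers": "yes", "type": "bird"},
    {"milk": "no", "feathers": "yes", "type": "bird"},
    {"milk": "no", "feathers": "yes", "type": "bird"},
    {"milk": "no", "feathers": "no", "type": "fish"},
]

stump = fit_stump(toy, ["milk", "feathers"], "type")
print(stump[0], stump[1])
print(predict(stump, toy[-1]))  # the fish row ends up in the 'bird' leaf
```

A stump has only as many leaves as the chosen attribute has values, so with more classes than leaves some classes can never be predicted; the fish in the toy data is such a case.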
1.5 Now build a decision tree of any depth (a.k.a. a J48 tree). Draw the discovered decision tree.
1.5.1 What % of animals is correctly classified?
1.5.2 Give an example of an animal that would not be
classified correctly by this model.
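J48 is WEKA's implementation of the C4.5 decision tree learner, which chooses splits using entropy-based measures. The following Python sketch shows plain information gain on made-up data; it is a simplified illustration with hypothetical helper names, not the exact criterion J48 applies (C4.5 also normalises the gain and prunes the tree).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attr, target):
    """Entropy of the class before the split minus the weighted
    entropy of the class within each attribute value."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


# Made-up toy data, for illustration only.
toy = [
    {"milk": "yes", "aquatic": "no", "type": "mammal"},
    {"milk": "yes", "aquatic": "yes", "type": "mammal"},
    {"milk": "no", "aquatic": "no", "type": "bird"},
    {"milk": "no", "aquatic": "yes", "type": "bird"},
]

for attr in ("milk", "aquatic"):
    print(attr, information_gain(toy, attr, "type"))
```

In this toy set 'milk' separates the two classes perfectly while 'aquatic' tells nothing about the class, so a tree learner of this kind would split on 'milk' first.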
Assignment 2: Animal Rules
In this exercise you will
use the association rule algorithm to discover interesting regularities in the
zoo data set.
2.1 The association rule algorithm to be used can only
cope with non-numerical (‘nominal’) attributes, so you first have to transform
the numerical attribute ‘legs’ into discrete bins (so 0, 2, 4, >4 legs
etc). This type of data preprocessing can be performed in the preprocess tab by applying the right filter (select Discretize or PKIDiscretize and then Apply). Check the results before and after application of
the filter. Now run the association rule algorithm. You can change the numRules
option to get more rules if needed.
2.1.1 List at least three interesting rules.
2.1.2 Give at least one example of a rule that is always
true according to the algorithm (hint: see the confidence).
2.1.3 Give a counterexample for a specific rule
(an example for which the rule is not correct).
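The notions of binning, confidence and counterexample used above can be sketched in Python as follows. This is an illustration only: the five animals are a made-up toy set, the helper names are hypothetical, and the bins are hand-picked rather than what WEKA's filters would compute.

```python
def legs_bin(legs):
    """Map a numeric leg count to a nominal bin (bin edges are an assumption)."""
    return ">4" if legs > 4 else str(legs)


def matches(row, condition):
    """True if the row has every attribute value the condition asks for."""
    return all(row.get(key) == value for key, value in condition.items())


def support_and_confidence(rows, antecedent, consequent):
    """Support: share of all rows where the whole rule holds.
    Confidence: share of rows matching the antecedent where the rule holds."""
    n_ante = sum(matches(r, antecedent) for r in rows)
    n_both = sum(matches(r, antecedent) and matches(r, consequent) for r in rows)
    confidence = n_both / n_ante if n_ante else 0.0
    return n_both / len(rows), confidence


def counterexamples(rows, antecedent, consequent):
    """Rows that satisfy the antecedent but violate the consequent."""
    return [r["name"] for r in rows
            if matches(r, antecedent) and not matches(r, consequent)]


# Made-up toy data, for illustration only.
raw = [("dog", 4, "yes"), ("cat", 4, "yes"), ("frog", 4, "no"),
       ("sparrow", 2, "no"), ("spider", 8, "no")]
animals = [{"name": n, "legs": legs_bin(l), "milk": m} for n, l, m in raw]

rule = ({"legs": "4"}, {"milk": "yes"})       # legs=4 ==> milk=yes
print(support_and_confidence(animals, *rule))
print(counterexamples(animals, *rule))        # ['frog']
```

A rule with confidence 1.0 has no counterexamples in the data it was mined from; in the toy set the reverse rule, milk=yes ==> legs=4, is such a rule.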
Assignment 3: Mine Yourself
At the beginning of this lab
session you answered some questions about yourselves. In this exercise we
will mine this survey of all Lapp-toppers to discover interesting, surprising
and counterintuitive patterns in the data.
3.1 Build a decision tree
to predict whether someone watches
3.2 Build classifiers for a
selection of the other attributes. For each attribute note the classification
accuracy and some distinguishing characteristics. Which attribute is easiest to
predict and which one is hardest to predict?
3.3 Use the association
rules algorithm to derive interesting rules from this data set. Pick the three rules
that you find most interesting (funniest, most trivial, most counterintuitive).
We will discuss some of the
patterns found with the group.
Lab session II
Assignment 4: Recommenders
4.1 List the top
recommendations belonging to your favourite book(s), movie(s) or music using
two out of the following list of recommenders (or any other recommender you
know):
Assignment 5: Data Mining Case Projects
The zip file from assignment
1 contains a number of data sets from a variety of areas. Most data sets
contain a small description in the header – to read this open the file in a
text editor like Notepad. This exercise should be done in pairs.
Pick a data set that looks
interesting and write it on the blackboard so that we don’t get two teams
working on the same data set.
For your data set / data
mining case note:
Create a small PowerPoint
presentation discussing your most interesting results. One of you should act
as the domain expert and present the beginning and the end; the other one
should act as the data mining expert and present the data mining approach and
results. The rest of the group will ask questions after the presentation. The
presentations should be short – no more than 5 minutes.
The presentations will be
posted to this website.