Assignment 1: Introduction to the WEKA Data Mining Software
Decision Trees
1.1 Study the animals in the Excel document (zoo.xls). Without using a data mining tool, draw a decision tree three to five levels deep that classifies each animal as a mammal, bird, reptile, fish, amphibian, insect or invertebrate.
1.2 Read about the ARFF format here. Construct the header for the animal file.
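As a rough illustration of the layout only, the sketch below assembles the text of a very small ARFF file for a made-up fruit dataset (not the zoo data). The relation name, attribute names and values are all hypothetical; only the @relation, @attribute and @data keywords come from the ARFF format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch: builds the text of a tiny ARFF file for a made-up dataset.
class ArffHeaderSketch {

    // Each map entry pairs an attribute name with its ARFF type declaration,
    // e.g. "numeric" or a list of nominal values such as "{red,green}".
    static String buildHeader(String relation, Map<String, String> attributes) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        for (Map.Entry<String, String> a : attributes.entrySet()) {
            sb.append("@attribute ").append(a.getKey())
              .append(' ').append(a.getValue()).append('\n');
        }
        sb.append("\n@data\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the attributes in insertion order.
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("colour", "{red,yellow,green}"); // nominal attribute (hypothetical)
        attrs.put("weight", "numeric");            // numeric attribute (hypothetical)
        attrs.put("type", "{apple,banana}");       // class attribute, listed last

        // Two hypothetical data rows follow the header.
        String arff = buildHeader("fruit", attrs)
                + "red,150,apple\n"
                + "yellow,120,banana\n";
        System.out.print(arff);
    }
}
```

The header for the animal file follows the same pattern, with one @attribute line per column of zoo.xls.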
1.3 Download datasets.zip and unzip it. Start Weka, choose the Explorer and open zoo.arff.
1.4 Find out in WEKA how many animals this dataset contains.
1.5 Go to the Classify tab and select the decision tree classifier J48. Click on the line next to the 'Choose' button. This shows you the parameters you can set and a button called 'More'. Which algorithm is implemented by J48?
1.6 What percentage of instances is correctly classified by J48? Which families are mistaken for each other?
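Weka's classifier output includes a confusion matrix, from which both answers can be read. As a minimal sketch of how that matrix is interpreted, the code below computes the accuracy and lists the confused class pairs. The counts and the class labels "a", "b" and "c" are invented for illustration and are not the zoo results.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch: accuracy and confused class pairs from a confusion matrix.
class ConfusionMatrixSketch {

    // Rows are actual classes, columns are predicted classes.
    // Correct predictions sit on the diagonal.
    static double accuracy(int[][] m) {
        int correct = 0;
        int total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) {
                    correct += m[i][j];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    // Describes every off-diagonal cell with a non-zero count.
    static List<String> confusions(int[][] m, String[] labels) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                if (i != j && m[i][j] > 0) {
                    out.add(m[i][j] + " x " + labels[i] + " classified as " + labels[j]);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical counts for three classes.
        int[][] matrix = {
            {8, 0, 0},
            {0, 5, 1},
            {1, 0, 5}
        };
        String[] labels = {"a", "b", "c"};

        System.out.println("Accuracy: " + accuracy(matrix));
        for (String line : confusions(matrix, labels)) {
            System.out.println(line);
        }
    }
}
```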
1.7 Go to the parameter settings again by clicking on the box next to the 'Choose' button. Now change binarySplits to true and build a new decision tree. What is the difference?
1.8 Experiment with some of the other classifiers until you get a better classification performance. Write down the classifier and its performance.
1.9 Compile the following source code in Java:
import java.io.*;
import weka.classifiers.trees.J48;
import weka.core.*;

public class MyDecisionTree {

    MyDecisionTree() {
        try {
            FileReader reader = new FileReader("zoo.arff");
            Instances instances = new Instances(reader);
            // Make the last attribute be the class
            instances.setClassIndex(instances.numAttributes() - 1);
            // Build a J48 decision tree from all instances
            J48 tree = new J48();
            tree.buildClassifier(instances);
            // classifyInstance returns the index of the predicted class value as a double
            System.out.println("The third animal is classified as: " +
                    tree.classifyInstance(instances.instance(2)));
            reader.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    public static void main(String[] args) {
        new MyDecisionTree();
    }
}
To compile: javac -classpath weka.jar MyDecisionTree.java (first copy weka.jar and zoo.arff to the same directory as MyDecisionTree.java).
To execute: java -classpath .;weka.jar MyDecisionTree
(the classpath separator depends on the operating system: ; on Windows, : on Linux and macOS, which gives -classpath .:weka.jar; please e-mail me if it doesn't work)
Use the Weka API documentation. How can you make a decision tree with a binary split?
Association rules
Next we will search for association rules.
2.1 The association algorithm requires nominal variables. In order to make all variables nominal we need to discretize the data.
This pre-processing can be done with filtering. Find the filter button on the Preprocess tab and select the right unsupervised method to convert the attributes to nominal attributes. After selecting a filter you can set its properties by clicking on it. Press the 'Apply' button and watch how the attributes change.
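To give an idea of what discretization does to a numeric attribute, here is a minimal sketch of one common unsupervised technique, equal-width binning. It is a plain-Java illustration under my own assumptions (the class name, method name and sample values are hypothetical) and is not the code of any Weka filter.

```java
import java.util.Arrays;

// Minimal sketch: equal-width binning of a numeric attribute.
class EqualWidthBinningSketch {

    // Returns, for each value, the index (0 .. bins-1) of the
    // equal-width interval between min and max that it falls into.
    static int[] discretize(double[] values, int bins) {
        double min = Arrays.stream(values).min().orElse(0.0);
        double max = Arrays.stream(values).max().orElse(0.0);
        double width = (max - min) / bins;

        int[] result = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            // If all values are equal, everything goes into bin 0.
            int bin = width == 0 ? 0 : (int) ((values[i] - min) / width);
            // The maximum value would land in bin 'bins'; keep it in the last bin.
            result[i] = Math.min(bin, bins - 1);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical numeric attribute with five values, split into two bins.
        double[] values = {0, 2, 4, 6, 8};
        System.out.println(Arrays.toString(discretize(values, 2)));
    }
}
```

Each bin index can then be treated as a nominal value, which is the form the association algorithm needs.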
2.2 Now run the association rule algorithm. Which rules are always true? Write them down.
2.3 Write down a couple of interesting rules and a couple of trivial rules.
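Association rules are usually judged by their support and confidence; a rule that is always true has a confidence of 1. The sketch below computes both measures for a rule over a handful of made-up shopping baskets. The data, class name and method names are hypothetical, and the code assumes Java 9 or later for List.of and Set.of.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch: support and confidence of an association rule.
class RuleMetricsSketch {

    // Fraction of transactions that contain every item in 'items'.
    static double support(List<Set<String>> data, Set<String> items) {
        if (data.isEmpty()) {
            return 0.0;
        }
        long hits = data.stream().filter(t -> t.containsAll(items)).count();
        return (double) hits / data.size();
    }

    // Confidence of the rule premise => consequence:
    // support(premise and consequence) / support(premise).
    static double confidence(List<Set<String>> data,
                             Set<String> premise, Set<String> consequence) {
        Set<String> both = new HashSet<>(premise);
        both.addAll(consequence);
        double premiseSupport = support(data, premise);
        return premiseSupport == 0 ? 0.0 : support(data, both) / premiseSupport;
    }

    public static void main(String[] args) {
        // Four hypothetical transactions.
        List<Set<String>> data = List.of(
                Set.of("bread", "butter"),
                Set.of("bread", "butter", "milk"),
                Set.of("bread"),
                Set.of("milk"));

        // Every basket with butter also contains bread, so this rule always holds.
        System.out.println("butter => bread: "
                + confidence(data, Set.of("butter"), Set.of("bread")));
        // The reverse rule holds in only two of the three baskets with bread.
        System.out.println("bread => butter: "
                + confidence(data, Set.of("bread"), Set.of("butter")));
    }
}
```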
Pima Indians, mushrooms and politicians
3.1 The datasets.zip file contains several data sets, ranging from predicting diabetes in a Pima Indian population and distinguishing edible mushrooms from poisonous ones to separating Republicans from Democrats. Most data sets contain a short description in the 'header'. Choose at least one data set and answer the following questions:
Other datasets
On the internet you can find many more data sets. Not all of them are in the ARFF format. Choose one of the data sets from http://kdd.ics.uci.edu/, convert it to the ARFF format and try different data mining techniques.