CMPT459: Data Mining Assignment 1

CMPT459: Data Mining, Spring 2021

Assignment 1 [total marks: 100]

The goal of this assignment is to implement a Decision Tree and to test it on a dataset to classify

people into those who earn less than 50k and more than 50k based on their attributes. The adults

dataset consists of 14 features (6 continuous and 8 categorical) and one class label. Provided

data.zip file includes three files:

• data.summary.txt: information about the features

• adult.data.csv: training data

• adult.test.csv: testing data

a) [25 marks] Present a pseudo-code for a simple Decision Tree with error reduction pruning:

The information gain is used as a split criterion. [5 marks]

ii)

The tree is grown deep, i.e. it is grown until all training examples corresponding to

a leaf node belong to the same class. [5 marks]

iii)

Works on categorical data. [5 marks]

iv)

Works on numerical data. [5 marks]

Error reduction pruning using validation data. [5 marks]

b) [75 marks] Implement your pseudo-code using Python. Your implementation should

include the following functions (method signatures might be different based on the specific

needs of your implementation):

● grow(dataset) -> tree: grows a deep tree on the given dataset and returns the tree

object. [20 marks]

● prune(dataset, tree) -> tree: accepts a tree object and prunes it using the validation

dataset. Returns the pruned tree. [15 marks]

● test(dataset, tree) -> accuracy: returns the accuracy of the given tree on the given

dataset. [10 marks]

You should use the above methods to complete the following tasks:

1. Train and evaluate on the dataset adult.data.csv using 5-fold-cross-validation. Each

time you grow a tree, you need to prune it before evaluation. You can leave 10% of

the training data as validation data. Report the average accuracy. [15 marks]

2. Train one final tree on the dataset adult.data.csv and use it to predict samples in

adult.test.csv and save outputs in a csv file. [10 marks]

3. You need to properly handle missing values. Explain briefly how you did that. [5

marks]

[IMPORTANT] Submit a file [student-id].zip which includes the following:

● A file report.pdf with the pseudocode and your answers for tasks 1. and 3.

● One and only one .py file which includes all your implemented functions and classes.

● A file predictions.csv with the predictions for the test data.

● A requirements.txt file, including all the required packages to run your code.

● data/ directory with the datasets.

Running the python command on your .py file should reproduce all your reported results. You

need to use relative paths for accessing your data through your code to make it runnable on all

machines.

Deadline: The deadline is 23:59 pm on Feb 11th. We accept late submissions up to 24 hours late

but deduct 10% of the marks. You will lose all the marks for submissions after that.

Libraries: You can use libraries including math, numpy, scipy, random, etc. You MUST provide

YOUR OWN implementation for the information gain calculation, decision tree growing, pruning

and the methods to perform 5-fold-cross validation and predictions. These MUST be implemented

from scratch i.e. not using scikit-learn libraries. You will be marked on the correctness of your

implementation.

QQ：99515681
WeChat：codinghelp
Email：99515681@qq.com
Work Time：8:00-23:00

Hots

Ghostwriter Cs1b Spring 2024 Tth Hw08h... 2024-04-19
Help With Managing Financial Risk Prob... 2024-04-19
Ghostwriter Cs 0449 – Project 5: /Dev/ 2024-04-19
Ghostwriter Elec 2141 Digital Circuit ... 2024-04-19
Help With Csc171 — Videogame Projecthe 2024-04-19
Help With Comp3411 Artificial Intellig 2024-04-19
Help With Stat3061: Random Processes &... 2024-04-19
Ghostwriter Accounting 452, Spring 202... 2024-04-19
Ghostwriter Finc5001 Foundations In Fi... 2024-04-19
Ghostwriter 7Ssmm712 – Topics In Appli 2024-04-19
Help With Com 337 - Film Studies For T... 2024-04-19
Ghostwriter Mes202tc - Digital Vlsi Sy... 2024-04-19
Ghostwriter Geography 2041B Distance S... 2024-04-19
Ghostwriter Ecos3006 International Tra... 2024-04-19
Help With Fit5225 2024 Sm1 Creating An... 2024-04-19
Help With Cit 593: Introduction To Com... 2024-04-19
Help With Math 4931: Take Home Examgho... 2024-04-19
Ghostwriter Csci 547|Info 533: Systems... 2024-04-19
Ghostwriter Cs536-S24 Intro To Pls And... 2024-04-19
Help With Fit5212 - Assignment 1Ghostw... 2024-04-19

Programming Assignment Help！