CSE4DMI Data Mining Sem 2 2014, Assignment One 20 Marks (Due Thursday 4 September 2014, 9:30am) Copying, Plagiarism:


Plagiarism is the submission of somebody else’s work in a manner that gives the impression that the work is your own. The Department of Computer Science and Computer Engineering at La Trobe University treats plagiarism very seriously. When it is detected, penalties are strictly imposed.

This is an INDIVIDUAL assignment.

Part I (10 marks)

In this part, we are going to build a decision tree classifier to predict the ages of abalones from their measurements. The dataset can be found in the CSV file abalone.csv.

1. Before creating the classifier, pre-process the abalone dataset according to the following criteria:

   1) The sex attribute has three possible values (M, F and I). Encode them as integers: M = 1, F = 2 and I = 3. (1.5 marks)

   2) The rings attribute is the number of rings an abalone has. The age can be estimated as the number of rings + 1.5. From this estimated age, assign each sample to an age group according to the table below:

      Age                     Class label
      <= 5.5                  <= 5.5
      > 5.5 and <= 8.5        (5.5, 8.5]
      > 8.5 and <= 11.5       (8.5, 11.5]
      > 11.5 and <= 14.5      (11.5, 14.5]
      > 14.5 and <= 17.5      (14.5, 17.5]
      > 17.5 and <= 20.5      (17.5, 20.5]
      > 20.5                  > 20.5

      Replace the rings column with the age group column. The age group attribute is the class label. (Hint: You can use any method to set the age group for each sample, including a formula in Excel or a MATLAB script.) (1.5 marks)

   After pre-processing, the dataset is divided into a training dataset and a testing dataset. Download the program “DataSplit.exe” and execute it. Enter your student ID and specify the locations of the dataset file and the destination folder. Click the “OK” button to split the dataset. Note that your training and testing datasets are unique to you, so make sure you enter your student ID correctly.

   Show only your pre-processed training and testing datasets.
   Only the first 20 rows of each dataset are required in your answer. Also, please submit your MATLAB source code (as a MATLAB script file) with the assignment answer. (2 marks) No marks will be given to your answer unless the relevant source code is submitted.

2. Load both the training and testing datasets from Q1 into the MATLAB workspace. It is recommended to separate the class label (i.e. the age group attribute) from the other attributes, so that all the class labels of a dataset are stored in one matrix. As a result, there are four matrices after the import process: two for the attribute values of the two datasets, and two for their class labels.

   a. Build a decision tree classifier (using the age group attribute as the class label). Show the decision tree. (1 mark)
   b. Use the built classifier to predict the age groups for the samples in the testing dataset. Show the predicted class labels for the first 20 rows of the testing dataset. (1 mark)
   c. Using the testing dataset, evaluate the error rate, sensitivity, specificity, and the confusion matrix. (1 mark)

   Please submit your MATLAB source code with the assignment answer. (2 marks) No marks will be given to your answer unless the relevant source code is submitted.

Part II

3. The table below shows the responses of interviewees about their current status of continuing education:

   ID   Education level   Annual income   Continuing education?
   1    Tertiary          35000           Y
   2    Secondary         28000           N
   3    Secondary         40000           Y
   4    Tertiary          52000           N
   5    Postgrad          31000           Y
   6    Secondary         47000           N
   7    Secondary         22000           Y
   8    Secondary         19000           N
   9    Postgrad          22000           Y
   10   Tertiary          44000           Y
   11   Tertiary          20000           Y
   12   Postgrad          32000           N
   13   Secondary         62000           Y
   14   Postgrad          30000           Y
   15   Tertiary          55000           Y

   a. Calculate the Gini index for the education level attribute, with a multi-way split. Show your steps. (1 mark)
   b. Calculate the Gini index for the annual income attribute, for each of the following split points:
      i. ≤ 25000 and > 25000
      ii. ≤ 35000 and > 35000
      iii. ≤ 45000 and > 45000
      iv. ≤ 55000 and > 55000
      Show your steps. (1 mark)
   c. Calculate the entropy for the education level attribute, with a multi-way split. Show your steps. (1 mark)
   d. Using the results from (c), calculate the information gain for the education level attribute, with a multi-way split. (1 mark)
   e. Explain why the attribute with the maximum information gain is selected as the splitting attribute, in terms of the physical meaning of information gain. (1 mark)

4. a. Plot the receiver operating characteristic (ROC) curves for classifiers 1 and 2 using the following information:

      Instance   Classifier 1   Classifier 2   True class
                 P1(1|A)        P2(1|A)
      1          0.28           0.67           0
      2          0.63           0.81           1
      3          0.44           0.25           0
      4          0.26           0.6            1
      5          0.36           0.45           0
      6          0.62           0.39           0
      7          0.71           0.78           1
      8          0.66           0.17           0
      9          0.94           0.88           1
      10         0.49           0.73           1

      Px(1|A) denotes the probability that the instance belongs to class 1, based on its attribute A, as computed by classifier x. (2 marks)
   b. What does it mean when a segment of the ROC curve is below the diagonal? (1 mark)
   c. Calculate the area under the curve (AUC) for both curves. Which classifier is better? Why? (2 marks)

Dataset Reference: Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
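
The impurity measures asked for in Question 3 can be sketched in a few lines. The following is a minimal illustration in Python rather than the MATLAB the assignment requires for Part I; it assumes the usual textbook definitions (Gini = 1 - sum of squared class proportions, entropy in bits, and a split's impurity as the size-weighted average over its partitions). The function names are hypothetical helpers introduced only for this sketch, and the rows are copied from the interviewee table in Question 3.

```python
from collections import Counter
from math import log2

# (education level, annual income, continuing education?) copied from the Question 3 table
ROWS = [
    ("Tertiary", 35000, "Y"), ("Secondary", 28000, "N"), ("Secondary", 40000, "Y"),
    ("Tertiary", 52000, "N"), ("Postgrad", 31000, "Y"), ("Secondary", 47000, "N"),
    ("Secondary", 22000, "Y"), ("Secondary", 19000, "N"), ("Postgrad", 22000, "Y"),
    ("Tertiary", 44000, "Y"), ("Tertiary", 20000, "Y"), ("Postgrad", 32000, "N"),
    ("Secondary", 62000, "Y"), ("Postgrad", 30000, "Y"), ("Tertiary", 55000, "Y"),
]


def gini(labels):
    """Gini impurity of one partition: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy of one partition in bits."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def weighted_impurity(groups, impurity):
    """Impurity of a split: average over partitions, weighted by partition size."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * impurity(group) for group in groups if group)


def split_by_education(rows):
    """Multi-way split: one partition of class labels per education level."""
    groups = {}
    for education, _income, label in rows:
        groups.setdefault(education, []).append(label)
    return list(groups.values())


def split_by_income(rows, threshold):
    """Binary split: income <= threshold versus income > threshold."""
    low = [label for _edu, income, label in rows if income <= threshold]
    high = [label for _edu, income, label in rows if income > threshold]
    return [low, high]


all_labels = [label for _edu, _income, label in ROWS]
education_groups = split_by_education(ROWS)

gini_education = weighted_impurity(education_groups, gini)
entropy_education = weighted_impurity(education_groups, entropy)
# Information gain = entropy before the split minus weighted entropy after it
info_gain_education = entropy(all_labels) - entropy_education
gini_income = {
    t: weighted_impurity(split_by_income(ROWS, t), gini)
    for t in (25000, 35000, 45000, 55000)
}

print(f"Gini (education, multi-way):    {gini_education:.4f}")
print(f"Entropy (education, multi-way): {entropy_education:.4f}")
print(f"Information gain (education):   {info_gain_education:.4f}")
for threshold, value in gini_income.items():
    print(f"Gini (income <= {threshold}): {value:.4f}")
```

The script only checks the arithmetic; the assignment still asks for the intermediate steps to be shown by hand.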

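For Question 4, one common way to build an ROC curve is to sort the instances by decreasing score and step up for each true positive and right for each false positive; the AUC then follows from the trapezoidal rule. The sketch below assumes that construction. It is written in Python for illustration only, the helper names `roc_points` and `auc` are hypothetical, and the scores are copied from the Question 4 table.

```python
# Scores P1(1|A), P2(1|A) and true classes copied from the Question 4 table
CLASSIFIER_1 = [0.28, 0.63, 0.44, 0.26, 0.36, 0.62, 0.71, 0.66, 0.94, 0.49]
CLASSIFIER_2 = [0.67, 0.81, 0.25, 0.60, 0.45, 0.39, 0.78, 0.17, 0.88, 0.73]
TRUE_CLASS = [0, 1, 0, 1, 0, 0, 1, 0, 1, 1]


def roc_points(scores, labels):
    """Return ROC points as (false positive rate, true positive rate) pairs.

    The threshold is lowered past one distinct score at a time, so instances
    that share a score are added together before a point is emitted.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            true_pos += 1
        else:
            false_pos += 1
        # Emit a point only after the last instance with this score
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


roc_1 = roc_points(CLASSIFIER_1, TRUE_CLASS)
roc_2 = roc_points(CLASSIFIER_2, TRUE_CLASS)
auc_1 = auc(roc_1)
auc_2 = auc(roc_2)

print("Classifier 1 ROC points:", roc_1)
print("Classifier 2 ROC points:", roc_2)
print(f"AUC classifier 1: {auc_1:.2f}")
print(f"AUC classifier 2: {auc_2:.2f}")
```

The printed points can be plotted directly, with the false positive rate on the x-axis and the true positive rate on the y-axis.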