CSE4DMI Data Mining Sem 2 2014, Assignment One
20 Marks (Due Thursday 4 September 2014, 9:30am)

Copying, Plagiarism:

Plagiarism is the submission of somebody else’s work in a manner that gives the impression that the work is your own. The Department of Computer Science and Computer Engineering at La Trobe University treats plagiarism very seriously. When it is detected, penalties are strictly imposed.

This is an INDIVIDUAL assignment.

Part I (10 marks)

In this part, we are going to construct a decision tree classifier to predict the ages of abalones from their physical measurements. The dataset can be found in the CSV file abalone.csv.

1. Before creating the classifier, the abalone dataset has to be pre-processed according to the following criteria:

1) The sex attribute has three possible values (M, F and I). Encode them as integers: M = 1, F = 2 and I = 3. (1.5 marks)

2) The rings attribute is the number of rings an abalone has. The age can be estimated as number of rings + 1.5. From this estimated age, assign each sample to an age group according to the table below:

   Age                      Class label
   <= 5.5                   <= 5.5
   > 5.5 and <= 8.5         (5.5, 8.5]
   > 8.5 and <= 11.5        (8.5, 11.5]
   > 11.5 and <= 14.5       (11.5, 14.5]
   > 14.5 and <= 17.5       (14.5, 17.5]
   > 17.5 and <= 20.5       (17.5, 20.5]
   > 20.5                   > 20.5

Replace the rings column with the age group column. The age group attribute is the class label. (Hint: You can use any method to set the age group for each sample, including a formula in Excel or a MATLAB script.) (1.5 marks)

After pre-processing, the dataset is divided into the training dataset and the testing dataset. Download the program “DataSplit.exe” and run it. Enter your student ID and specify the locations of the dataset file and the destination folder. Click the “OK” button to split the dataset.
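The two pre-processing criteria in question 1 can be sketched as follows. This is a minimal Python illustration of the encoding and age-grouping rules only, not the MATLAB script the assignment asks you to submit. The helper names (encode_sex, age_group) are hypothetical, and the sketch assumes the rings value is already available as a number.

```python
import bisect

# Sex encoding stated in criterion 1): M = 1, F = 2, I = 3.
SEX_CODES = {"M": 1, "F": 2, "I": 3}

# Upper bounds of the bounded age groups from the table in criterion 2).
AGE_UPPER_BOUNDS = [5.5, 8.5, 11.5, 14.5, 17.5, 20.5]
AGE_LABELS = [
    "<= 5.5", "(5.5, 8.5]", "(8.5, 11.5]", "(11.5, 14.5]",
    "(14.5, 17.5]", "(17.5, 20.5]", "> 20.5",
]


def encode_sex(sex):
    """Map the sex attribute (hypothetical helper) to its integer code."""
    return SEX_CODES[sex]


def age_group(rings):
    """Return the class label for a sample with the given number of rings."""
    age = rings + 1.5  # estimated age = number of rings + 1.5
    # bisect_left finds the first upper bound that is >= age, so an age
    # exactly on a boundary (e.g. 8.5) stays in the lower, closed interval.
    return AGE_LABELS[bisect.bisect_left(AGE_UPPER_BOUNDS, age)]


print(encode_sex("F"), age_group(7), age_group(20))
```

Because every estimated age ends in .5, boundary cases occur often, which is why the sketch uses bisect_left so that each interval is closed on the right, as in the table.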
Note that your training and testing datasets are different from everyone else’s. Make sure you enter your student ID correctly. Show only your pre-processed training and testing datasets. Only the first 20 rows of each dataset are required in your answer. Also, please submit your MATLAB source code (as a MATLAB script file) with the assignment answer (2 marks). No marks will be given to your answer unless the appropriate source code is submitted.

2. Load both the training and testing datasets from Q1 into the MATLAB workspace. It is recommended to separate the class label (i.e. the age group attribute) from the other attributes, so that all the class labels of a dataset are stored in one matrix. As a result, there are four matrices after the import process: two for the attribute values from the two datasets, and two for the class labels from these datasets.

a. Construct a decision tree classifier (using the age group attribute as the class label). Show the decision tree. (1 mark)

b. Use the built classifier to predict the age groups for the samples in the testing dataset. Show the predicted class labels for the first 20 rows of the testing dataset. (1 mark)

c. Using the testing dataset, evaluate the error rate, sensitivity, specificity, and the confusion matrix. (1 mark)

Please submit your MATLAB source code with the assignment answer. (2 marks) No marks will be given to your answer unless the appropriate source code is submitted.

Part II

3. The table below shows the statistics of interviewees about their current status of continuing education:

   ID   Education level   Annual Income   Continuing Education?
   1    Tertiary          35000           Y
   2    Secondary         28000           N
   3    Secondary         40000           Y
   4    Tertiary          52000           N
   5    Postgrad          31000           Y
   6    Secondary         47000           N
   7    Secondary         22000           Y
   8    Secondary         19000           N
   9    Postgrad          22000           Y
   10   Tertiary          44000           Y
   11   Tertiary          20000           Y
   12   Postgrad          32000           N
   13   Secondary         62000           Y
   14   Postgrad          30000           Y
   15   Tertiary          55000           Y

a. Calculate the Gini index for the education level attribute, with a multi-way split. Show your steps. (1 mark)

b. Calculate the Gini index for the annual income attribute, for each of the following split points:
   i. ≤ 25000 and > 25000
   ii. ≤ 35000 and > 35000
   iii. ≤ 45000 and > 45000
   iv. ≤ 55000 and > 55000
Show your steps. (1 mark)

c. Calculate the entropy for the education level attribute, with a multi-way split. Show your steps. (1 mark)

d. From the results in (c), calculate the information gain for the education level attribute, with a multi-way split. (1 mark)

e. Explain why the attribute with the highest information gain is chosen as the splitting attribute, in terms of the physical meaning of information gain. (1 mark)

4. a. Plot the receiver operating characteristic (ROC) curves for classifiers 1 and 2 using the following information:

   Example   Classifier 1 P1(1|A)   Classifier 2 P2(1|A)   True class
   1         0.28                   0.67                   0
   2         0.63                   0.81                   1
   3         0.44                   0.25                   0
   4         0.26                   0.6                    1
   5         0.36                   0.45                   0
   6         0.62                   0.39                   0
   7         0.71                   0.78                   1
   8         0.66                   0.17                   0
   9         0.94                   0.88                   1
   10        0.49                   0.73                   1

Px(1|A) denotes the probability that the example belongs to class 1, based on its attribute A. It is computed by classifier x. (2 marks)

b. What does it mean when a section of the ROC curve is below the diagonal? (1 mark)

c. Calculate the area under the curve (AUC) for both curves. Which classifier is better? Why? (2 marks)

Dataset Reference: Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
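As a reminder of the quantities used in question 3, the Gini index, the entropy and the information gain of a split can be sketched as below. This is a Python illustration under stated assumptions: the function names are hypothetical, every branch is assumed to be non-empty, and the counts in toy_split are made-up illustration values, not the interview data above.

```python
from math import log2


def gini(counts):
    """Gini index of one node, given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy (in bits) of one node; empty classes contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def weighted_impurity(branches, impurity):
    """Impurity of a multi-way split: branch impurities weighted by size."""
    n = sum(sum(branch) for branch in branches)
    return sum(sum(branch) / n * impurity(branch) for branch in branches)


def information_gain(branches):
    """Entropy of the parent node minus the weighted entropy of the split."""
    parent = [sum(column) for column in zip(*branches)]
    return entropy(parent) - weighted_impurity(branches, entropy)


# Hypothetical two-branch split; each branch lists [class 1 count, class 2 count].
toy_split = [[4, 0], [2, 2]]
print(weighted_impurity(toy_split, gini))
print(weighted_impurity(toy_split, entropy))
print(information_gain(toy_split))
```

Passing the impurity function as an argument keeps the weighting step identical for Gini and entropy, which mirrors how the two measures are compared by hand.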

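For question 4, one common way to build an ROC curve is to lower the decision threshold past one example at a time and record the false positive rate and true positive rate after each step. The AUC then follows from the trapezoidal rule. The Python sketch below illustrates that procedure; it assumes all scores are distinct and that both classes are present, the function names are hypothetical, and the toy scores and labels are made-up values, not the table from the question.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR), starting at (0, 0) and ending at (1, 1).

    Assumes distinct scores and labels in {0, 1} with both classes present.
    """
    # Rank examples from the highest to the lowest predicted probability.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        # Lowering the threshold past this example predicts it as class 1.
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical scores and true classes for illustration only.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 1, 0, 1]
curve = roc_points(toy_scores, toy_labels)
print(curve)
print(auc(curve))
```

If two examples shared the same score, they would have to be moved past the threshold together, which this sketch does not handle.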