
Assignment 3

Due Date: Sunday, February 19, 2023

The total number of points for this assignment is 60 points. Please submit your assignment in a Word file. Use this assignment file as a template to enter and copy-paste your answers for your assignment submission. Keep the problem descriptions and insert your answers after each question. Please name your assignment with this format: Lastname.Firstname.Assignment3.

 

1.   (15 points) Download the BostonHousing2.xls file (which has been used in Assignment 2). The target attribute in this dataset is CATMEDV (which is a binary attribute converted from MEDV in the BostonHousing.xls file).

a.   Within Excel, save the FullData sheet (with 506 records) as a CSV file, as you did for Assignment 2. Run Weka’s support vector machine algorithm (SMO) on this data file with 10-fold cross-validation. First, use the default parameter C = 1. Then change the C value to 10, and then to 100. Show the output screens that display the 10-fold cross-validation error rates in these three cases. How does the error rate change as the C value increases?

b.   Based on the results with C = 100, which two attributes are the most important predictors? Explain the impact of these two predictors on classification, that is, how the classification result changes when the value of a predictor increases or decreases.
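For reference when reading the SMO output in part (b): with the default linear kernel, Weka typically prints one weight per (normalized) attribute. The size of a weight suggests how strongly the attribute moves the decision value, and its sign suggests the direction. The Python sketch below illustrates the idea only. The attribute names (attr_a, attr_b, attr_c), the weights, and the bias are all hypothetical and are not Weka results for this dataset. Which class the positive side of the decision value corresponds to depends on the class ordering in your own output.

```python
# Hypothetical linear SVM weights on normalized attributes (illustration only).
weights = {"attr_a": 1.8, "attr_b": -2.4, "attr_c": 0.3}
bias = -0.5  # hypothetical intercept


def decision_value(record):
    """Weighted sum of attribute values plus the bias; its sign picks the class."""
    return sum(weights[name] * record[name] for name in weights) + bias


def rank_by_magnitude(w):
    """Order attribute names from largest to smallest absolute weight."""
    return sorted(w, key=lambda name: abs(w[name]), reverse=True)


top_two = rank_by_magnitude(weights)[:2]

# Raising an attribute with a negative weight lowers the decision value,
# pushing the record toward the class on the negative side.
base = {"attr_a": 0.5, "attr_b": 0.5, "attr_c": 0.5}
raised_b = dict(base, attr_b=0.9)

print(top_two)
print(decision_value(base), decision_value(raised_b))
```

Ranking by absolute weight is only meaningful when the attributes are on a comparable scale, which is why the sketch assumes normalized inputs.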

 

 

2.   (25 points) Apply (i) decision trees (J48), (ii) Naïve Bayes, (iii) k-NN (k = 1), and (iv) SVM (SMO) in Weka to classify the BostonHousing2 data used in Problem 1. Evaluate the performance of these four classification models based on (1) the overall classification accuracy, and (2) the ROC curve and AUC value, treating homes with ‘high’ value as the positive class. The specific steps and questions for this problem are:

a.   Run the four classification models in Weka on the data using the default settings (10-fold cross-validation, etc.). For each model, show two output screens: the first displays the 10-fold cross-validation error rates and the confusion matrix; the second displays the ROC curve (for your reference, see the output screens shown in the “Plotting ROC Curve in Weka” section of the lecture notes titled “Model and Performance Evaluation”). In sum, there are eight output screens, two for each classification model.

b.   Based on the overall classification accuracy, rank the four models from the best to the worst.

c.   Suppose you are only interested in accurately predicting/identifying high-value homes (so that the ‘high’ class is the positive class). In this case, how do you rank the four models from the best to the worst? Justify your answers with the relevant results from the Weka output.
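For reference, the measures used in parts (b) and (c) can all be derived from the confusion matrix and the predicted scores. The Python sketch below shows one way to compute overall accuracy, the true-positive and false-positive rates for the positive class, precision, and AUC. The counts, labels, and scores are made-up toy values, not results for the BostonHousing2 data, and the function names are hypothetical. The AUC here is computed as the fraction of (positive, negative) pairs in which the positive record receives the higher score, which may differ slightly from how Weka computes its ROC Area.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Summary measures for the positive class from confusion matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,   # overall fraction classified correctly
        "tp_rate": tp / (tp + fn),       # share of actual positives found (recall)
        "fp_rate": fp / (fp + tn),       # share of actual negatives flagged as positive
        "precision": tp / (tp + fp),     # share of predicted positives that are correct
    }


def auc_from_scores(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values for illustration only (1 = 'high', 0 = not 'high').
metrics = confusion_metrics(tp=40, fn=10, fp=5, tn=45)
auc = auc_from_scores([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.7, 0.3, 0.2])

print(metrics)
print(auc)
```

A model can have high overall accuracy but a low true-positive rate for the ‘high’ class, which is why the rankings in parts (b) and (c) need not agree.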

 

 

3.   (20 points) Download the BostonHousing.xls file (which has been used in Assignment 1). The target attribute in this dataset is MEDV (numeric). In Excel, delete the CAT.MEDV attribute (which is a binary attribute converted from MEDV) and save the data to a CSV file, as you did for Assignment 1.

a.   Run Weka’s LinearRegression algorithm with the default parameters and 10-fold cross-validation. Show the output screen with the Linear Regression Model and the Cross-Validation Summary section with the error results.

b.   Run Weka’s SVR algorithm (SMOreg) on this data. Set the error margin parameter (epsilonParameter) to e = 0.1. Keep the other default parameters unchanged. Show the output screen with the SVR model and the Cross-Validation Summary section with the error results. Compare the performance of SVR with that of linear regression in part (a), and comment on the difference, based on the ‘Mean absolute error’ and ‘Root mean squared error’ results.

c.   What two attributes are the most important predictors based on the SVR model? Are they consistent with those identified in Problem 1b based on the SVM model?
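For reference, the two error measures compared in part (b), and the role of the error margin e, can be sketched in Python as follows. The actual and predicted values are made-up toy numbers, not MEDV results, and the function names are hypothetical. The epsilon-insensitive loss is shown conceptually only; Weka may apply the margin to normalized values, so the scale in your output can differ.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference; large errors weigh more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def epsilon_insensitive_loss(actual, predicted, e=0.1):
    """SVR-style loss: errors within the margin e cost nothing."""
    return max(0.0, abs(actual - predicted) - e)


# Toy values for illustration only.
actual = [20.0, 25.0, 30.0, 35.0]
predicted = [22.0, 24.0, 30.0, 27.0]

mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)

print(mae, rmse)
print(epsilon_insensitive_loss(20.0, 20.05), epsilon_insensitive_loss(20.0, 22.0))
```

Because RMSE squares each error before averaging, a single large miss raises it more than it raises MAE, so the two measures can rank models differently.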