Chapter 5 Support Vector Machine
To fit the next two models, the data had to be preprocessed differently to account for categories that appear in the training set but not in the test set, or vice versa.
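One common way to handle this mismatch is sketched below in base R. It is not necessarily the preprocessing used for this project: the toy data frames and the `Medium` column are hypothetical. The idea is to give each categorical column the same set of levels in both splits before one-hot encoding, so that both design matrices end up with identical columns.

```r
# Hypothetical toy data: "Medium" stands in for any categorical column.
train <- data.frame(Medium = c("Oil", "Bronze", "Oil"), stringsAsFactors = FALSE)
test  <- data.frame(Medium = c("Oil", "Marble"), stringsAsFactors = FALSE)

# Use the union of categories from both splits as the shared level set.
shared_levels <- sort(union(train$Medium, test$Medium))
train$Medium <- factor(train$Medium, levels = shared_levels)
test$Medium  <- factor(test$Medium,  levels = shared_levels)

# One-hot encode; "- 1" drops the intercept so every level gets its own column.
x_train <- model.matrix(~ Medium - 1, data = train)
x_test  <- model.matrix(~ Medium - 1, data = test)

# Both matrices now share the same columns; a category missing from one split
# simply shows up there as an all-zero column.
identical(colnames(x_train), colnames(x_test))
```

An alternative design is to map any category not seen during training to a single "other" level, which avoids using the test set's category names at all.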
5.1 The Model
## [1] "The accuracy on the training dataset is: 0.931405514458642"
## [1] "The accuracy on the testing dataset is: 0.926075268817204"
The support vector machine is a major improvement in accuracy over the previous models. It reaches an accuracy of about .93 on both the training set (.931) and the test set (.926). Because the two figures are so close, the model is probably not overfitting or underfitting heavily, and its overall accuracy is high.
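As a rough illustration of the comparison behind that claim, the sketch below computes accuracy as the share of correct predictions and then takes the gap between the two reported figures. The label vectors are hypothetical and only show the calculation; the two accuracy values are the ones printed above.

```r
# Accuracy is the proportion of predictions that match the true labels.
accuracy <- function(actual, predicted) mean(actual == predicted)

# Hypothetical labels and predictions, only to illustrate the calculation.
actual_test    <- c(1, 0, 0, 1, 0, 0, 0, 1)
predicted_test <- c(1, 0, 0, 0, 0, 0, 0, 1)
accuracy(actual_test, predicted_test)  # 7 of 8 correct, i.e. 0.875

# Gap between the reported training and testing accuracies.
train_acc <- 0.931405514458642
test_acc  <- 0.926075268817204
gap <- train_acc - test_acc
round(gap, 4)  # about 0.0053
```

A gap this small is consistent with a model that is not heavily overfitting, although it does not by itself say anything about how the model reaches its predictions.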
5.2 Visualization
While SVM is the most accurate model by far, it is very difficult to apply interpretable ML methods to it. Even the basic visualizations suggested in many tutorials are hard to produce with this dataset, as the graph above shows: plotting two binary variables against each other in this manner conveys almost no information. An SVM separates the classes with a hyperplane, so visualization is inherently difficult, because it is unclear how one would draw a space with more than three dimensions. The graph above shows the “decision boundary” that the SVM uses in purple, yet it is hard to understand what that boundary means in two dimensions, because the plot does not appear to be truly split by the line the purple dots create. In simpler terms, red and blue dots appear on both sides of it.
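The effect can be reproduced with a small sketch in base R. It uses a hypothetical linear decision function over three made-up binary features, not the fitted model, and assumes a linear boundary purely for illustration. The classes are perfectly separated in the full feature space by construction, yet once the points are projected onto two of the features, most plotted positions contain both classes.

```r
# All combinations of three hypothetical binary features.
grid <- expand.grid(x1 = 0:1, x2 = 0:1, x3 = 0:1)

# A hypothetical linear decision function f(x) = w . x + b (not the fitted SVM).
w <- c(1, 1, 2)
b <- -1.5
grid$score <- as.vector(as.matrix(grid[, c("x1", "x2", "x3")]) %*% w + b)
grid$label <- ifelse(grid$score > 0, "highlight", "not highlight")

# A 2-D plot of x1 against x2 has only four distinct positions.
# Count how many different classes land on each of them.
classes_per_point <- tapply(
  grid$label,
  list(x1 = grid$x1, x2 = grid$x2),
  function(z) length(unique(z))
)
classes_per_point
```

Three of the four positions hold points from both classes, because the feature that separates them (`x3`) is not on either axis. With many more features, as in this dataset, the same thing happens on a larger scale.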
By using the SVM, I aim to demonstrate that accuracy as a metric can often be misleading. It is unclear how good this model is if we cannot figure out why it makes the decisions it makes. For the purposes of this project, which aims to determine the decision-making process behind the selection of the IsHighlight variable, this model is useless: it does not tell us which features contributed to its predictions or how important those features were.
Note:
What are grp1, grp2, and grp3?
The pre-processing package I used collapses perfectly correlated features into groups, and grp1, grp2, and grp3 are those groups. grp1 contains the index column and the object number. grp2 contains the artist's title and their name. grp3 contains the indicator that public domain is true and the indicator that it is false. The original columns are removed and replaced by these groups. For all intents and purposes, I deemed these columns unnecessary and removed them.
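To make the grouping idea concrete, here is a minimal sketch in base R. It is not the package's implementation, and the feature names and values are hypothetical. Features whose absolute pairwise correlation is 1 carry the same information, so they are placed in the same group.

```r
# Hypothetical toy features: is_public_true and is_public_false are perfect
# complements (correlation -1), much like the grp3 pair described above.
feats <- data.frame(
  is_public_true  = c(1, 0, 1, 1, 0),
  is_public_false = c(0, 1, 0, 0, 1),
  width_cm        = c(10, 25, 31, 12, 40)
)

# Absolute pairwise correlations; 1 means one column fully determines the other.
abs_cor <- abs(cor(feats))

# Allow a small tolerance for floating-point error.
is_perfect <- abs_cor > 1 - 1e-8

# Assign each feature to the first feature it is perfectly correlated with.
group_id <- sapply(seq_len(ncol(feats)), function(i) which(is_perfect[i, ])[1])
groups <- unname(split(names(feats), group_id))
groups
```

Here the two public-domain indicators end up in one group and `width_cm` stays on its own, which mirrors how a pair of complementary columns collapses into a single grp feature.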
References:
- https://rpubs.com/cliex159/865583
- https://web.mit.edu/6.034/wwwbob/svm.pdf
- https://github.com/SchlossLab/mikRopML/issues/156