A random forest classifier that achieves 96% test accuracy when predicting tumor diagnosis (benign or malignant) on the Breast Cancer (Wisconsin) Data Set. Built for Python 2.7, pandas 0.22.0, matplotlib 2.1.1, scikit-learn 0.19.1, and seaborn 0.8.1.
Data is taken from the Breast Cancer (Wisconsin) Data Set.
$ python tumor-classification.py

Percentage of training data that is benign: 0.63
Percentage of training data that is malignant: 0.37
Random Forest Classifier Accuracy: 0.96
Random Forest Classifier Confusion Matrix:
benign classified as benign: 106
malignant classified as benign: 4
benign classified as malignant: 2
malignant classified as malignant: 59
             precision  recall  f1-score  support
benign            0.96    0.98      0.97      108
malignant         0.97    0.94      0.95       63
avg / total       0.96    0.96      0.96      171
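The report above can be cross-checked by hand. The sketch below is not part of tumor-classification.py; it assumes four counts chosen so that each class total matches the support column (108 benign and 63 malignant test samples) and recomputes precision, recall, and F1 from them. The helper name `precision_recall_f1` is hypothetical.

```python
# Assumed counts: each class total matches the report's support column
# (106 + 2 = 108 benign, 4 + 59 = 63 malignant).
benign_as_benign = 106
benign_as_malignant = 2
malignant_as_benign = 4
malignant_as_malignant = 59


def precision_recall_f1(hit, missed, false_alarm):
    """Return (precision, recall, f1) for one class.

    hit:         samples of this class predicted as this class
    missed:      samples of this class predicted as the other class
    false_alarm: samples of the other class predicted as this class
    """
    precision = hit / float(hit + false_alarm)
    recall = hit / float(hit + missed)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


benign = precision_recall_f1(
    benign_as_benign, benign_as_malignant, malignant_as_benign)
malignant = precision_recall_f1(
    malignant_as_malignant, malignant_as_benign, benign_as_malignant)

total = (benign_as_benign + benign_as_malignant
         + malignant_as_benign + malignant_as_malignant)
accuracy = (benign_as_benign + malignant_as_malignant) / float(total)

print("benign:    {:.2f} {:.2f} {:.2f}".format(*benign))
print("malignant: {:.2f} {:.2f} {:.2f}".format(*malignant))
print("accuracy:  {:.2f}".format(accuracy))
```

Rounded to two decimals, these values agree with the precision, recall, f1-score, and accuracy figures printed by the script.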
This classifier assumes that the true prevalence of malignant tumors is approximately equal to their prevalence in the training sample (37%). 70% of the Breast Cancer (Wisconsin) Data Set was used for training and the remaining 30% was used for testing.
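A minimal sketch of a 70/30 split and the class-balance check follows. It is an illustration rather than the code in tumor-classification.py: it assumes the copy of the data set bundled with scikit-learn (`load_breast_cancer`, where 0 means malignant and 1 means benign) instead of the CSV file, and the `random_state` value is arbitrary, so the exact shares may differ slightly from the output above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Assumption: use scikit-learn's bundled copy of the data set.
# In this copy the target is encoded as 0 = malignant, 1 = benign.
data = load_breast_cancer()
X, y = data.data, data.target

# Hold out 30% of the samples for testing; random_state is arbitrary.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Share of each class in the training data.
benign_share = (y_train == 1).mean()
malignant_share = (y_train == 0).mean()

print("Training samples: {}".format(len(y_train)))
print("Test samples: {}".format(len(y_test)))
print("Share of training data that is benign: {:.2f}".format(benign_share))
print("Share of training data that is malignant: {:.2f}".format(malignant_share))
```

If the prevalence of malignant tumors in the population being screened differs much from the training share, the reported precision in particular should not be expected to carry over.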
Six features (texture_mean, perimeter_mean, smoothness_mean, compactness_mean, symmetry_mean, and fractal_dimension_mean) were used to predict tumor diagnosis. These features were chosen because they show low pairwise correlation with one another in the correlation heatmap, so each contributes information the others do not.
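The pairwise correlations behind such a heatmap can be computed as sketched below. This is an assumption-laden illustration: it loads scikit-learn's bundled copy of the data set, whose columns are named "mean texture" and so on, and renames them to the names used in this README. The plotting call is left as a comment so the sketch runs without a display.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy, with its own column names.
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=list(data.feature_names))

sklearn_names = ["mean texture", "mean perimeter", "mean smoothness",
                 "mean compactness", "mean symmetry", "mean fractal dimension"]
readme_names = ["texture_mean", "perimeter_mean", "smoothness_mean",
                "compactness_mean", "symmetry_mean", "fractal_dimension_mean"]

# Keep only the six selected features, renamed to match this README.
features = df[sklearn_names].copy()
features.columns = readme_names

# Pearson correlation between every pair of selected features.
corr = features.corr()
print(corr.round(2))

# To draw the heatmap itself (requires seaborn and matplotlib):
#   import seaborn as sns
#   sns.heatmap(corr, annot=True)
```

Running the same computation on all of the mean features, rather than just these six, gives a way to see which candidates were left out because they track one of the selected features closely.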
A random forest classifier with 100 estimators correctly classifies 96% of the test data. Precision and recall for both benign and malignant tumors are at least 0.94. In tumor classification, it is important to minimize the number of malignant tumors classified as benign (false negatives). Our classifier's malignant recall of 0.94 corresponds to a false negative rate of about 6%.
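The following sketch shows one way to train a 100-estimator random forest on the six features and derive the false negative rate from the confusion matrix. It is not the project's script: it assumes scikit-learn's bundled copy of the data set, recodes the target so that malignant is the positive class, and uses arbitrary `random_state` values, so its numbers will not exactly reproduce the output above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled copy of the data set.
data = load_breast_cancer()
names = list(data.feature_names)
wanted = ["mean texture", "mean perimeter", "mean smoothness",
          "mean compactness", "mean symmetry", "mean fractal dimension"]
X = data.data[:, [names.index(n) for n in wanted]]

# The bundled target is 0 = malignant, 1 = benign.
# Flip it so that malignant is the positive class (1).
y = 1 - data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Rows are true classes and columns are predicted classes, in the
# order given by labels, so ravel() yields tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()

accuracy = (tp + tn) / float(tn + fp + fn + tp)
# False negative rate: malignant tumors predicted benign,
# as a fraction of all malignant tumors (equals 1 - recall).
false_negative_rate = fn / float(fn + tp)

print("Accuracy: {:.2f}".format(accuracy))
print("Malignant classified as benign: {}".format(fn))
print("False negative rate: {:.2f}".format(false_negative_rate))
```

Because the false negative rate is one minus the malignant recall, it can also be read directly off the malignant row of the classification report.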