Four methods (NB, SVM, LR, RF) were initially tested with 63 features, and later with 49 features after removing the “Basic clinical test” features, in order to find the most suitable classifier for the task. They were implemented using Weka with five-fold cross-validation for both the skewed and unskewed datasets. For kernelized SVM we used Libsvm from Weka. However, we could not find an implementation of kernel logistic regression in either Weka or Libsvm/Liblinear. Hence, we generated all degree-$k$ monomial features from the original features and then trained a linear LR on them. For example, if $x_i$, $i = 1, \dots, d$ are the original features and $k = 3$, the derived features are $x_i x_j x_k$, $\forall (i, j, k) \in \{1, \dots, d\}^3$ such that $i \neq j \neq k$. Code for generating the features and searching for an appropriate threshold was written in Python. Linear logistic regression (LR) was implemented using Liblinear for scalability.
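The monomial feature map described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `monomial_features` is hypothetical, and it assumes that the condition on the indices means all indices are distinct and that each unordered index set is used once (the product is symmetric, so reordering the indices gives the same feature).

```python
from itertools import combinations
from math import prod


def monomial_features(x, k):
    """Return the products of every set of k distinct features of one sample.

    Assumption: each unordered index set is used once, so the output has
    C(d, k) entries for a sample with d original features.
    """
    return [prod(x[i] for i in idx) for idx in combinations(range(len(x)), k)]


# Toy sample with d = 4 features (values are illustrative only).
sample = [2.0, 3.0, 5.0, 7.0]
derived = monomial_features(sample, 3)
print(derived)  # [30.0, 42.0, 70.0, 105.0]
```

Under this assumption, 49 original features with $k = 3$ give $\binom{49}{3} = 18424$ derived features, which is why a scalable linear solver matters for the subsequent LR step.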
For testing the prediction model with kernel LR, we set aside a test set, which is a random 20% sample of the entire dataset, and report metrics (accuracy and sensitivity) on it. For validating the threshold selection algorithm, we further split the remaining data into training (70%) and validation (10%) sets and set the threshold using accuracy on the validation set. After threshold selection, we re-train on the union of the training and validation sets (80% of the data) and test on the remaining 20% using the threshold selected earlier. An operational flow chart is given in Fig. 2.
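The 70/10/20 protocol can be sketched as follows. The helper names (`split_indices`, `accuracy`, `select_threshold`) are hypothetical, and the selection rule shown here, picking the candidate threshold with the highest validation accuracy, is an assumption made for illustration; it stands in for the paper's own threshold selection algorithm, which may differ.

```python
import random


def split_indices(n, seed=0):
    """Shuffle 0..n-1 and split into 70% train, 10% validation, 20% test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(0.2 * n)
    n_val = round(0.1 * n)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test


def accuracy(scores, labels, threshold):
    """Fraction of samples where (score >= threshold) matches the 0/1 label."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)


def select_threshold(val_scores, val_labels):
    """Assumed rule: return the observed score that maximizes validation accuracy."""
    candidates = sorted(set(val_scores))
    return max(candidates, key=lambda t: accuracy(val_scores, val_labels, t))


# Illustrative validation scores and labels (not taken from the paper).
val_scores = [0.1, 0.2, 0.35, 0.4, 0.8]
val_labels = [0, 0, 1, 1, 1]
print(select_threshold(val_scores, val_labels))  # 0.35
```

In the full procedure the model would then be re-trained on the train and validation indices together and evaluated once on the held-out test indices with the chosen threshold kept fixed.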
4.1. Evaluation of classifiers and threshold selection algorithm
In this section, we report our comparison of different classifiers, and empirically evaluate our threshold selection algorithm on the dataset described in Section 2.
Data-preprocessing and comparison of classifiers
First, we report performance for the four basic methods (LR, SVM, NB, RF) and the kernel methods using SVM and LR on the entire dataset with 63 features (Table 2). As mentioned above, we use 5-fold cross-validation and report accuracy and sensitivity. We notice that, due to the extreme skew (∼2.5 positive for every 100 negative), accuracy is relatively unaffected by classifier performance. This can be seen from the accuracy in the “skewed” column, where all classifiers show comparable performance. However, sensitivity in the skewed column is low for all classifiers barring SVM, indicating that they are not able to correctly classify the abnormal class instances, which are few in number. We attribute this problem to the skew in the training dataset.
Artificial Intelligence In Medicine 95 (2019) 16–26
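A short worked example shows why accuracy says little at this level of skew. The counts below are hypothetical, chosen only to match the ratio of about 2.5 positives per 100 negatives; they are not figures from the dataset. A degenerate classifier that labels every instance as normal already reaches roughly 97.6% accuracy while detecting no abnormal case at all.

```python
def accuracy_and_sensitivity(tp, fn, tn, fp):
    """Compute accuracy and sensitivity (recall on the positive class)."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)
    return accuracy, sensitivity


# Hypothetical counts: 25 abnormal and 1000 normal instances (2.5 per 100),
# with every instance predicted as normal.
acc, sens = accuracy_and_sensitivity(tp=0, fn=25, tn=1000, fp=0)
print(round(acc, 3), sens)  # 0.976 0.0
```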
We solve this problem by unskewing the training dataset. We oversample the datapoints belonging to the abnormal class 10 times using SMOTE, resulting in a total of 4609 instances with a class ratio (abnormal to normal) of 0.28. While performing 5-fold cross-validation, for each fold we unskew the training set (formed by leaving out the test fold), test on the skewed test fold, and report the average metrics across all folds. These results are reported in the “unskewed” columns of Table 2. For all classifiers the accuracy remains the same, but the sensitivity of NB and RF is not high in comparison with LR and SVM, so we do not consider NB and RF for further experiments. We also see a sharp increase in sensitivity relative to the results obtained with the skewed dataset. Further, the accuracy and sensitivity are of the same order for LR and SVM. Hence, we conclude that these classifiers now perform equally well on both classes.
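The key point of this protocol is that oversampling is applied inside each fold, to the training part only, so that no synthetic point leaks into the test fold. The sketch below illustrates that idea under several assumptions: `smote_like` is a simplified stand-in for SMOTE (interpolation towards one of the nearest minority neighbours), "10 times" is read as ten synthetic points per original abnormal point, and the folds are not stratified. All function names are hypothetical.

```python
import random


def smote_like(minority, n_new_per_point, k=5, rng=None):
    """Simplified SMOTE: interpolate each minority point towards one of its
    k nearest minority neighbours (squared Euclidean distance)."""
    rng = rng or random.Random(0)
    synthetic = []
    for p in minority:
        others = [q for q in minority if q is not p]
        others.sort(key=lambda q: sum((a - b) ** 2 for a, b in zip(p, q)))
        neighbours = others[:k]
        if not neighbours:  # fewer than two minority points: nothing to do
            continue
        for _ in range(n_new_per_point):
            q = rng.choice(neighbours)
            gap = rng.random()
            synthetic.append([a + gap * (b - a) for a, b in zip(p, q)])
    return synthetic


def folds_with_unskewed_training(X, y, n_folds=5, n_new_per_point=10, seed=0):
    """Yield (X_train, y_train, X_test, y_test) for each fold.

    Only the training part is oversampled; the test fold keeps its skew.
    """
    rng = random.Random(seed)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    for f in range(n_folds):
        test_idx = set(idx[f::n_folds])
        train_idx = [i for i in idx if i not in test_idx]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        minority = [x for x, label in zip(X_tr, y_tr) if label == 1]
        new_points = smote_like(minority, n_new_per_point, rng=rng)
        X_te = [X[i] for i in sorted(test_idx)]
        y_te = [y[i] for i in sorted(test_idx)]
        yield X_tr + new_points, y_tr + [1] * len(new_points), X_te, y_te
```

A classifier would be trained on the first two returned lists and scored on the last two, with accuracy and sensitivity averaged over the five folds.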
Continuing towards the objective of diagnosing a patient with as few tests as possible, we experiment with the extreme case, i.e. with no clinical tests (equivalent to testing with the 49 features only), and report cross-validation results in Table 3. We do not include NB and RF here since they do not show encouraging results on the full-feature dataset. We report the results for training with the skewed datasets for completeness. However, it is clear that training on skewed data does not produce a balanced classifier. We observe that polynomial kernels with both SVM and LR (feature maps in the case of LR) perform better than the linear kernel when training with unskewed datasets, in terms of both accuracy and sensitivity. The best performance in terms of accuracy is achieved when a polynomial kernel of order 3 is used with either SVM or LR, reaching a very high accuracy of ∼95%. We also note that the sensitivity of both kernel LR and kernel SVM is very low compared to the case when the features corresponding to the basic clinical tests were included. Hence, we conclude that distinguishing patients from non-patients becomes difficult once we remove the basic clinical tests. Next, we address this problem of low sensitivity using threshold selection.