Background
Methods
Data description
Site | [x,y,z] voxel dimensions (mm) | Magnetic field strength (T) | TR/TE (ms) | Number of datasets |
---|---|---|---|---|
D1 | [0.27, 0.27, 2.20] | 3 | 4216–8266 / 155–165 | 16 |
D2 | [0.41, 0.41, 3.00] | 3 | 2840–7500 / 107–135 | 13 |
D3 | [0.27, 0.27, 2.96] | 3 | 4754 / 115 | 56 |
Expert annotation of central gland, PZ, and PCa extent on T2w MRI
Post-processing of T2w MRIs to account for intensity-based artifacts
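As a concrete illustration of this step, the Python sketch below combines N4 bias field correction with a simple landmark-based intensity standardization, assuming SimpleITK and NumPy are available. The helper names (`correct_bias_field`, `standardize_intensities`) and the percentile landmarks are hypothetical; the study's original MATLAB implementation and exact parameters may differ.

```python
# Minimal sketch of T2w MRI intensity post-processing. Assumes SimpleITK;
# the original pipeline's exact bias field and non-standardness
# corrections may differ.
import numpy as np
import SimpleITK as sitk

def correct_bias_field(t2w_path):
    """N4 bias field correction of a T2w volume (hypothetical helper)."""
    image = sitk.ReadImage(t2w_path, sitk.sitkFloat32)
    mask = sitk.OtsuThreshold(image, 0, 1)  # rough foreground mask
    corrector = sitk.N4BiasFieldCorrectionImageFilter()
    return corrector.Execute(image, mask)

def standardize_intensities(image, ref_landmarks, pcts=(1, 10, 50, 90, 99)):
    """Piecewise-linear mapping of intensity landmarks onto reference
    landmarks, in the spirit of Nyul-Udupa standardization."""
    arr = sitk.GetArrayFromImage(image)
    landmarks = np.percentile(arr, pcts)
    mapped = np.interp(arr.ravel(), landmarks, ref_landmarks).reshape(arr.shape)
    out = sitk.GetImageFromArray(mapped.astype(np.float32))
    out.CopyInformation(image)  # preserve spacing, origin, direction
    return out
```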
Notation | Description |
---|---|
\(c\) | Samples in set \(C\) |
\(n\) | Number of samples in \(C\) |
\(F(c)\) | N-dimensional (texture) feature vector |
\(\mathcal{F}\) | Set of all feature vectors |
\(l(c)\) | Class label of sample \(c\) |
\(\omega_{+1}, \omega_{-1}\) | Classes associated with \(l(c)=1\), \(l(c)=0\) |
\(h^{\beta}\) | Classifier, \(\beta \in \{QDA, SVM, Bay, DT\}\) |
\(h^{\beta}_{t}\) | Component classifier within \(h^{Bag,\beta}\), \(h^{Boost,\beta}\) |
\(h^{Bag,\beta}\) | Bagged classifier |
\(h^{Boost,\beta}\) | Boosted classifier |
Extracting PZ tumor specific radiomic texture features from T2w MRI
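For illustration, the sketch below computes a small vector of Haralick-style gray-level co-occurrence features for a single 2D image patch using scikit-image; it echoes \(F(c)\) from the notation table. The helper name `texture_feature_vector`, the gray-level quantization, and the choice of four properties are assumptions for this sketch; the study's actual T2w texture feature set and its implementation may differ.

```python
# Patch-wise Haralick-style texture features (illustrative only); assumes
# scikit-image >= 0.19 for the graycomatrix/graycoprops names.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def texture_feature_vector(patch, levels=32):
    """Contrast/energy/homogeneity/correlation for one 2D patch."""
    # Quantize intensities to a small number of gray levels for the GLCM
    bins = np.linspace(patch.min(), patch.max() + 1e-6, levels)
    q = (np.digitize(patch, bins) - 1).astype(np.uint8)
    glcm = graycomatrix(q, distances=[1], angles=[0, np.pi / 2],
                        levels=levels, symmetric=True, normed=True)
    props = ["contrast", "energy", "homogeneity", "correlation"]
    # Average each property over the sampled distances and angles
    return np.array([graycoprops(glcm, p).mean() for p in props])
```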
Classifier construction
Classifier | Notation | Parameters | Implementation |
---|---|---|---|
QDA [23] | \(h^{QDA}\) | - | MATLAB |
 | \(h^{Bag,QDA}\), \(h^{Boost,QDA}\) | \(T=50\) | MATLAB |
SVM | \(h^{SVM}\) | \(\Omega, \lambda\) | LIBSVM |
 | \(h^{Bag,SVM}\), \(h^{Boost,SVM}\) | \(\Omega, \lambda, T=50\) | LIBSVM [49], MATLAB |
Naïve Bayes [50] | \(h^{Bay}\) | - | MATLAB |
 | \(h^{Bag,Bay}\), \(h^{Boost,Bay}\) | \(T=50\) | MATLAB |
Decision Trees [26] | \(h^{DT}\) | - | C4.5 |
 | \(h^{Bag,DT}\), \(h^{Boost,DT}\) | \(T=50\) | MATLAB TreeBagger, PBTs [51] |
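The snippet below mirrors these configurations as scikit-learn analogues (assuming scikit-learn ≥ 1.2 for the `estimator` keyword); the study itself used MATLAB, LIBSVM, and C4.5 implementations, so this is an illustrative stand-in rather than the original code.

```python
# Python analogues of the classifier configurations above; illustrative
# stand-ins only (the study used MATLAB / LIBSVM / C4.5).
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

T = 50  # ensemble size for all bagged/boosted variants, as in the table

h_qda = QuadraticDiscriminantAnalysis()
h_bag_qda = BaggingClassifier(estimator=QuadraticDiscriminantAnalysis(),
                              n_estimators=T)
# scikit-learn's AdaBoost requires a base learner that accepts sample
# weights, so a decision tree is shown here; boosting QDA (as in the
# paper) would need a resampling-based boosting implementation instead.
h_boost_dt = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                                n_estimators=T)
h_svm = SVC(kernel="rbf", probability=True)  # Omega, lambda tuned separately
h_dt = DecisionTreeClassifier()
```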
Feature normalization
Class balancing
Classifier training
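A hypothetical end-to-end helper covering the three steps above is sketched below: z-score feature normalization, random undersampling of the majority class for balancing, and fitting of the chosen classifier. The function name `balance_and_train` and the undersampling scheme are assumptions; the study's exact normalization and balancing procedures may differ.

```python
# Hypothetical wrapper for the three training steps above; the paper's
# exact normalization and class-balancing schemes may differ.
import numpy as np
from sklearn.preprocessing import StandardScaler

def balance_and_train(clf, F, labels, seed=0):
    """Normalize features, undersample the majority class, fit clf.
    Returns the fitted classifier and the scaler (reuse on test data)."""
    scaler = StandardScaler().fit(F)   # z-score feature normalization
    Fn = scaler.transform(F)
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n = min(len(pos), len(neg))        # equal samples from each class
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    clf.fit(Fn[keep], labels[keep])
    return clf, scaler
```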
Evaluation of voxel-wise PCa classifiers
Classifier accuracy
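As a sketch, voxel-wise accuracy can be summarized by the area under the ROC curve of the per-voxel classifier scores; the snippet below assumes the scikit-learn analogues above, with `F_test` and `y_test` as placeholders for held-out voxel features and labels.

```python
# ROC AUC of voxel-wise tumor-likelihood scores (illustrative; F_test and
# y_test are placeholders for held-out data).
from sklearn.metrics import roc_auc_score

scores = clf.predict_proba(scaler.transform(F_test))[:, 1]  # P(omega_+1)
auc = roc_auc_score(y_test, scores)
```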
Statistical testing
Computation time
Results
Classification accuracy
Comparing single classifier strategies
Comparing bagged classifier strategies
Comparing boosted classifier strategies
Classifier execution time
Discussion
- We identified the boosted QDA classifier as the most consistently performing method across all 3 sites. It had a relatively high AUC in the training cohort (0.735) as well as in both validation cohorts (average AUCs of 0.683 and 0.768, respectively). Coupled with its relatively quick execution time (second lowest among all methods), we believe this makes it the best classifier overall. Our second choice would be the single QDA classifier, which did not perform significantly worse than the boosted QDA classifier (average AUCs of 0.730, 0.686, and 0.713 across the sites).
- The performance of all variants of the decision tree classifier (single, bagged, boosted) was overestimated by ≈10% in the training cohort relative to the validation cohorts. In fact, the top-performing classifier in the training cohort was the boosted decision tree classifier (AUC = 0.744), but it performed more variably when evaluated on multi-site data. This clearly indicates the need for independent validation when building CAD models, as otherwise these less generalizable models would have been identified as the top performers.
- The popular SVM classifier achieved reasonable classification performance in the training cohort alone (similar to previous SVM-based PCa detection schemes for prostate T2w MRI [22]). However, SVM classifiers took the longest to train and test, and did not achieve convergence in multi-site validation. This may become an important consideration as prostate CAD schemes undergo larger-scale multi-site validation.
- We could not reach a clear conclusion as to whether boosting or bagging yielded better performance across the classifier strategies, as there were no significant differences between their performances in multi-site validation.
- Satisfying the bias and variance conditions was crucial when constructing classifier ensembles. While SVMs and DTs showed significant improvements within both bagging and boosting frameworks, Bayesian and QDA classifiers provided more mixed performance, as they suffered from low variance and/or high bias. However, not all of these trends generalized in multi-site validation.
- For all the classifiers considered, performance in the 2 validation cohorts D2 and D3 fell within the confidence bounds of their performance in the discovery cohort D1. Thus, despite heterogeneous acquisition and imaging characteristics across the 3 sites, our post-processing steps (correcting for bias field and non-standardness) appear to have enabled some degree of harmonization of the radiomic features and associated classifier models. Appropriate post-processing of multi-site imaging data may therefore be critical when evaluating radiomic classifiers in this fashion.
- In terms of site-specific performance trends, it is interesting to note that all classifiers performed worse in D2 than in D3. While all 3 sites used a 3 T magnet, D2 had a lower voxel resolution than D1 and D3 (which were similar to each other). This suggests that voxel resolution may have a marked effect on classifier performance, a result also observed in previous phantom studies of texture analysis in medical imaging [46].