Background
Breast cancer is the most common cancer and the highest incidence of all cancers in women with an incidence rate of over 1.6 million cases per year [
1,
2]. The mortality rate is high over 90% when cancer cells spread systemically and colonize at distant organs from their tumors of origin [
3]. Identification of both intra- and inter-tumor heterogeneity in breast cancer poses a significant challenge due to its genomic evolution that occurs during breast cancer progression. In depth characterization of the molecular heterogeneity is important to improve diagnosis, define prognostic biomarkers and for designing therapeutic strategies [
4‐
6]. Understanding the cellular and molecular heterogeneity of tissue samples continues to be a challenge for high-throughput genomic technologies [
7,
8]. Phenotypic markers are widely used to study cell heterogeneity with methods such as flow cytometry or immunohistochemistry [
9,
10]. Bulk transcriptome analysis is used to study cell populations providing average expression levels for genes across large cell populations. However, these methods provide limited information about heterogeneous systems, including complex tissues consisting of various cell types or for cell types that are rare in a tissue. The molecular identities of diverse cells are lost during transcriptome analysis of bulk samples. Single-cell sequencing methods identify such subpopulations, which is important to study the intratumor heterogeneity that fosters tumor evolution. These methods need segregation of cells that disrupt the spatial context of cells from the tissue [
11‐
16]. The physical location or coordinates of cells are important to understand tissue functionality and corresponding pathological changes. In the past, several methods have been developed to generate high-quality transcriptome maintaining the spatial information of cell localization [
17‐
19].
Spatial transcriptomics (ST) is an in situ capturing technique, which uses a glass slide containing oligonucleotides to capture mRNAs, while maintaing the spatial information of histological tissue sections. Following cDNA synthesis, the resulting barcoded cDNA libraries are sequenced using standard RNA-seq technology [
20]. Specific sequence barcodes allow to assign expression data to the positions on the slide. The efficiency of the method to capture transcriptomes from tissues with maintained positional information has been shown in several studies [
21,
22]. In one of these studies, spatial gene expression profiles from breast cancer tissues were analyzed [
20]. The role of the microenvironment in promoting tumor growth has proven important. The tumor microenvironment encompasses inflammatory cells, extracellular matrix, blood vessels, and stromal cells interacting with tumor cells for cancer growth and progression [
23].
ST allows for a wide range of applications. The transcriptome is measured for the whole tissue section by sequencing that allows to compare different areas within that tissue section [
20,
21]. For example, intratumor heterogeneity can be addressed by contrasting data from ST spots within one tumor or between different tumors [
24]. Another application can be the identification of cancer subtypes and the simultaneous identification of the corresponding RNA biomarkers. Furthermore, known reference data such as genomic variation, for example SNPs from GWAS studies, can be cross-referenced against RNA biomarkers with the potential to assign possible functions for variants in poorly annotated genomic regions. In this concept study, we have applied computational machine learning algorithms to four ST datasets to characterize cancer regions in histological breast cancer tissue sections.
Methods
Data processing
For the study, we used four publicly available ST breast cancer datasets [
20].
Read alignment, annotation, and quantification
The ST sequencing results in paired-end reads. Read one (R1) contains the spatial barcode and the unique molecular identifier (UMI). Read two (R2) contains the transcript sequence information. All the datasets were processed using the ST Pipeline version 1.3.5 with default settings [
20,
25]. The human genome hg38 and its corresponding annotation file were used for mapping and for assigning sequence reads to genes (annotation) [
25]. The general statistics for the breast cancer datasets are shown in Additional file
1: Table S1.
The ST Pipeline generates a BED file containing the sequence reads mapped to genomic positions together with the spatial locations (the ST spots) of all the reads. All the transcripts annotated to the gene Malat1 were removed due to its overexpression and internal priming. We developed an open-source computational pipeline that uses the BED file to compute ST tag clusters (ST-TCs) for all the ST spots. Based on the transcription termination site (TTS) profiling, we computed TTS regions (ST-TCs) by peak calling of the transcripts genomic positions using the parametric clustering method paraclu [
26], which was widely used for similar data in the FANTOM projects [
26‐
28]. The ST-TC peaks and their corresponding expressions were visualized using the ZENBU interactive visualization tool [
29]. The pipeline then computes the expression count matrices for each dataset, where the rows represent the ST spots (including X and Y coordinates on the ST slide) and the columns represent the ST-TCs with their expression values [
20,
25].
We developed a Python open-source tool for unsupervised classification of ST spots based on the ST-TC expression profiles using the sklearn framework [
30]. Datasets were filtered removing spots and ST-TCs with low expression (expressed in very few spots or very few ST-TCs, respectively). Expression count matrices were normalized with the size factors computed using DESeq2 [
31]. A pseudocount of 1 was added prior to log2 transformation of the counts.
Unsupervised classification of ST spots based on ST-TC expression patterns
Dimensionality reduction was performed using principal component analysis (PCA). PCA is a dimensionality reduction technique that simplifies the complexity in high-dimensional data (such as gene expression data) by transforming the data into fewer dimensions while retaining important trends and patterns [
32]. Further, ST spots with close proximity in the principal component space were grouped into three groups (clustered with Ward2 method) based on euclidean distance. The clustered ST spots were plotted onto the hematoxylin and eosin (H&E)-stained images with specific colors representing the respective clusters. This grouping of the ST spots by the PCA does not take advantage of any expert knowledge of the histology samples and is therefore referred to as unsupervised classification.
Manual identification of cancer subtype-specific ST spots
We classified (annotated) ST spots based on the morphology observed in H&E-stained histological tissue sections. The ST spots covering a minimum of 20 cells in the tissue section were selected. We identified subtype-specific ST-TCs (gene expression signatures) for non-malignant, ductal carcinoma in situ (DCIS), or invasive ductal carcinoma (IDC) regions (the three classes in machine learning terminology) using differential expression analysis (DESeq2, log2 fold-change > 2, false discovery rate < 0.01).
Supervised classification of ST spots with machine learning
We used machine learning technique to learn expression signatures (train a model) for non-malignant, DCIS, or IDC regions. The regions were identified in two ways, by supervised expert annotation of the H&E images and independently in an unsupervised way by PCA. The identified expression signatures were used to characterize ST spots in ST experiments (testing of the model) which were withheld during model training.
For this, we developed a Python open-source tool (see references to tools) for supervised classification of ST spots using a multi-class support vector machine with the sklearn framework [
30]. The tool requires one or multiple datasets for training and one dataset for testing. To assess the accuracy of the testing, the annotation of the test dataset needs to be known (so called
ground truth) and provided as separate files. The training and test datasets are pre-processed and normalized as described in the unsupervised classification. The training datasets together with the ST spot labels were used to train a model using multi-class support vector machine (MC-SVM) method [
33]. We chose default values for the hyper-parameters and used a linear kernel due to its simplicity with three classes. The trained model was then used to predict ST spot classes in the ST data (data not used in the training model) withheld during model training. The ST spots in the test dataset were plotted back onto the H&E-stained tissue image. The result of one test consists of associations of ST spots to classes. This association is given in terms of probabilities for each ST spot to each class. The probabilities might be interpreted as how much a cell type was represented in an ST spot in a mixed population of cells. The performance of the test result can be assessed by comparing to the known results using the F1-score, which is calculated based on true positive (TP) and true negative (TN) results. The F1-score ranges from 1.0 for perfect prediction to zero and can also be expressed in percent.
Discussion
In this concept study, we have derived cancer expression signatures from spatial transcriptomics (ST) data, trained with one machine learning algorithm (MC-SVM) and evaluated the performance of the model to identify cancer regions. Four independent ST datasets were available and used in this study. Expert pathologist annotation of the H&E-stained tissue images provided classification of 194 ST spots of ductal carcinoma in situ (DCIS), invasive ductal carcinoma (IDC), and non-malignant tissue regions for all four experiments. We derived ST expression signatures for each of the three classes consisting of distinct sets of transcripts (798 ST-TCs) (Fig.
2a). In addition to protein coding gene-associated ST-TCs, ST-TCs for non-coding genes (23 out of 798, 2.9%), and non-annotated transcripts (79 out of 798, 9.9%) contributed to the expression signature. The extent of the contribution of non-coding transcripts was not evaluated in this study. The most differentially expressed non-coding or non-annotated TCs were predominantly expressed in DCIS and IDC regions emphasizing their potential role in cell proliferation and differentiation [
36,
45‐
47]. The ST expression signature was able to classify ST spots with very high accuracy (190 out of 194, 97.9%) to non-malignant, DCIS, and IDC tissue regions.
In the past, various supervised machine learning algorithms were used in clinical studies to predict clinical outcomes based on expression signatures from bulk cancer samples, distinguishing DCIS from IDC or for extracting cell type-specific information from gene expression profiling from heterogeneous samples using deconvolution techniques [
40,
41,
48‐
58]. These methods used a minimal set of differentially expressed and cell type-specific genes requiring specific analyses to obtain this subset. In this context, we continued with 194 ST spots from manual expert annotation and assessed the classification ability of a support vector machine (MC-SVM) to learn the three annotated classes based on the expression signature (small set of 798 transcripts/ ST-TCs) as well as on the complete set of all transcripts (ST-TCs). Classification results from both approaches gave comparable results of 97.7% and 98.3%, respectively (Table
2A, B). These results overall demonstrated the power of ST-TCs to classify cancer regions in breast cancer tissue sections. The smaller ST-TCs signature set gave comparable results to using the full set of ST-TCs. Classification on a smaller number of selected features might appear advantageous in terms of reduced model complexity. On the other hand, inclusion of all features would simplify the overall workflow by removing the selection process and might make it applicable in more complex workflows with fewer steps. The model which used all ST-TCs can also take advantage of features of the expression profiles that are not among the signature features identified by the differential expression analysis of the ST-TCs.
Our further analysis employed unsupervised classification of ST spots for identifying cancer regions. The resulting three distinct classes corresponded to the three expert annotated regions. The classification performance of the machine learning model (93.7%) was less accurate compared to the expert selected ST spots (97.7% and 98.3%, Table
2). The larger number of ST spots used by the unsupervised classification (979 unsupervised) might add variances in the data and complicate the classification task. Interestingly, for DCIS, the overall prediction accuracy is higher compared to IDC (Table
2). IDC regions might be more heterogeneous and might possibly contain other cell types such as fibroblasts.
We envision that in the coming years we will see simplification, further standardization, and reduced pricing for the ST protocol leading to extensive ST sequencing of samples of various cancer types. Here, the automated classification might become a powerful tool to support clinical pathologists in identifying cancer signatures. Moreover, the routine expert annotation of tissue sections might be used by an expert system to improve cancer signatures with increasing amounts of available data as well as in parallel to identify cancer subtypes with improved resolution. While the dataset employed is comprised of 979 sequencing libraries from four breast cancer ST experiments and constitutes the largest available ST breast cancer dataset so far, inclusion of additional individuals and breast cancer samples might be required to arrive at a cancer classifier for clinical usability.
The ST datasets are composed of sequencing data and corresponding detailed morphology of the stained tissue slides. In this work, we focused on the application of machine learning methods to the sequencing data. Machine learning has very successfully been applied to the classification of image data. We see great potential for a strategy applying machine learning to ST image data for detecting cancer regions while at the same time using machine learning to maximize the power of a corresponding expression signature.
Breast cancer manifests with subtypes that have different treatment responses and clinical outcomes. Identifying tumor heterogeneity in breast cancer regions is crucial for determining specific disease states and for starting suitable treatments early. Our application of a standard machine learning method to ST data clearly distinguished healthy and diseased areas in the tissue and most importantly identified regions containing both DCIS and IDC regions. We believe that detailed characterization of these regions might give us an insight into gene expression changes during the progression of breast cancer. More ST datasets containing such transitions might allow us to obtain detailed expression signatures and possibly a more detailed understanding of breast cancer progression. A more fine-grained resolution of the ST spots will enable higher resolution and allow detection of the transition between DCIS and IDC. This might pave the way towards the identification of new biomarkers specific to disease subtypes and hence cancer therapies for more personalized medicine. Histology is an efficient, effective, and relatively inexpensive diagnosis for breast cancer. We see the potential that ST technology might become a clinically usable complement to histology as the clinical gold standard.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.