Introduction
Juvenile idiopathic arthritis (JIA) is a general term for arthritis of unknown etiology that lasts more than 6 weeks and develops before the age of 16 years. It is one of the most common chronic inflammatory rheumatic diseases of childhood [
1‐
3]. Systemic juvenile idiopathic arthritis (SJIA) is currently classified as a subtype of JIA [
4]. SJIA is a form of childhood arthritis with clinical features [
5,
6] such as fever, lymphadenopathy, arthritis, rash and serosal disease, and even in the absence of ongoing disease activity, the consequences are not limited to joint destruction, but can also affect vision, impact overall growth, osteoporosis and underlying cardiovascular disease, leading to high rates of disability and mortality in children [
1,
5,
7‐
9]. Although SJIA has characteristic clinical signs such as fever, a rash associated with fever and joint involvement, the early clinical manifestations are varied and some of the symptoms can be mild or atypical, making recognition and diagnosis difficult. Our treatment choices are scarce for this pediatric condition. Effective identification and intervention can reduce and prevent treatment difficulties after the disease has become severe.
Due to the complex pathogenesis of SJIA, it has been reported that SJIA can be better diagnosed through the study of biomarkers [
10,
11]. In order to effectively identify high-risk patients and prevent or lessen the development of SJIA, biomarkers can be used to diagnose and treat SJIA rather than relying solely on its clinical aspects [
12]. Significant progress has been achieved in the pathophysiology and treatment of SJIA in recent years, and big data research that combine genomic data and immunophenotypes may reveal new information about SJIA [
13]. Just in the last few years, biologics have emerged as a promising new approach to treating SJIA [
14‐
17]. Therefore, there is an urgent need for more accurate risk identification and treatment targeting for SJIA.
The rapid development of transcriptome sequencing technology has provided a reliable method to decipher the genetic information of diseases. Besides, it also provides a strong basis for the diagnosis and targeted treatment of various diseases [
18‐
20]. Although some genes have been identified as genetic risk markers for SJIA [
21,
22], their individual identification is limited due to the complexity and association of the genetic architecture. Importantly, the combination of multiple biomarkers to identify disease has been shown to significantly improve identification. Mathematical models suggest that genetic risk genes play important roles in identifying high-risk individuals, as well as improving the diagnostic accuracy and targeted drug development [
23‐
25].
Peripheral blood mononuclear cell (PBMC) plays a critical role in the pathogenesis of SJIA by producing pro-inflammatory cytokines, recruiting other immune cells to the sites of inflammation, and activating the complement system. So our study is based on PBMC from SJIA children and the blood samples are easy to obtain and manipulate.
Based on 4 key genes (ALDH1A1, CEACAM1, YBX3 and SLC6A8) screened in the Gene Expression Omnibus (GEO) database, we developed a novel SJIA diagnostic model at the transcriptome level. Three methods, including least absolute shrinkage and selection operator (Lasso), random forest (RF)-recursive feature elimination (RFE) and RF classifier scores, were used together to determine the most important genes for SJIA classification. The SJIA diagnostic model based on 4 key genes was then built by selecting the RF of the optimal mtry on account of the 10-fold cross-validation with 5 repetitions and grid search. We evaluated the performance of the diagnostic model using 5-fold cross-validation as well as using an external independent validation dataset to confirm its accuracy and discriminatory power. We also investigated the expression of 4 key genes in three other isoforms of JIA (enthesitis-related arthritis, persistent oligoarthritis and rheumatoid factor negative polyarthritis) to find similarities and differences. Therefore, our study provides new insights into potential biomarkers for follow-up studies of SJIA, as well as a new predictive aid in terms of predicting SJIA at the genetic level in clinical practice.
Materials and methods
Data selection and processing
All datasets for this study were obtained from the GEO database, which stores information on the expression of genes using high-throughput methods. It was created by the National Center for Biotechnology Information (NCBI). The study was extensively searched through the NCBI database platform using the keyword “systemic juvenile idiopathic arthritis”. The type of datasets we chose was expression profiling by array, the type of organisms was homo sapiens, and the tissue selection of peripheral blood mononuclear cells (PBMCs). We used the MAS5.0 signal intensity as the expression level of the gene in order to unify the probe signal intensity. Datasets of MAS5.0 signal strength (GSE11907-GPL96 and GSE8650-GPL96) were downloaded in the Series Matrix File(s) of the respective GEO data. For the remaining datasets (GSE7753, GSE13501, GSE20307 and GSE21521), we downloaded the original CEL files generated by Affymetrix from GEO. The raw CEL files were processed separately using the MAS 5.0 algorithm implemented in the R package affy (version 1.72.0). Then, we converted the probe IDs to gene symbols based on the corresponding platform annotation files for the respective datasets. If multiple probes correspond to the same gene symbol, the signal average of multiple probes corresponding to the same gene symbol is calculated as the expression level of the corresponding gene. Batch correction was performed by applying the ComBat function in the R package sva (version 3.42.0) as a way to eliminate potential multicenter batch effects between different experiments and to output the correction results.
We merged and batch corrected GSE11907-GPL96, GSE8650-GPL96 and GSE13501 as the training dataset. GSE7753 was used as the external validation dataset. Details about six datasets are shown in Supplementary Table
S1.
Screening for differentially expressed genes (DEGs)
To screen out DEGs, we performed differential expression analysis in the training dataset using the traditional Bayesian approach of the R package limma (version 3.50.1). False discovery rate (FDR) < 0.05 and log2 fold change (log2FC) with absolute value > 1 were used as significance criteria for DEGs. We plotted the volcano map and the heat map to represent the expression levels of DEGs, which were created using the R packages pheatmap (version 1.0.12) and ggplot2 (version 3.3.6), respectively.
Analysis of gene ontology and pathway enrichment
DEGs were interpreted using Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG). In this study, we used the enrichGO and enrichKEGG functions of the R package clusterProfiler (version 4.4.4) to perform GO and KEGG analyses, respectively, where GO analyses include biological processes (BP), cellular components (CC), and molecular functions (MF). We used bar graphs to show them. In addition, we again performed a gene set enrichment analysis (GSEA) using the gseKEGG function of the R package clusterProfiler to identify pathway differences between the SJIA and healthy groups. Pathways with p < 0.05 were considered significantly enriched.
Assessing the role of immunomodulation
Since SJIA has historically been associated with the immune system, we aimed to explore the potential immune relationship between SJIA and healthy samples. We used the CIBERSORT algorithm of the R package cibersort (version 0.1.0) and its self-contained LM22 gene signature to quantify the fraction of each immune cell in each sample. CIBERSORT, a deconvolution algorithm, allows a reliable quantification of sensitivity and specificity distinctions and abundance of 22 human immune cell phenotypes based on transcriptomic data. The abundance of immune cells was used to explore whether there were immune differences between SJIA and healthy groups.
Feature selection
To screen for key biomarkers, we applied machine learning in biology to perform reliable gene-based feature selection. First, we used the R package glmnet (version 4.1-3) for Lasso regression. Lasso regression provides a new feature selection algorithm that can solve the covariance problem and filter out representative variable features. A refined model is constructed using the Lambda value at the minimum binomial deviation as a criterion. Second, the RFE method combined with the RF classifier was used for feature selection using the R package caret (version 6.0–91), where the control parameters were set to a 10-fold cross-validation with 5 repetitions as well as accuracy was used as a screening criterion. The recursive feature elimination method is to train the model on the original features, and each feature gets a weight. After that, those features that have the smallest absolute value weights are kicked out of the feature set. This is recursively done until the number of remaining features reaches the desired number of features. Finally, we constructed RF models with optimal mtry parameters based on candidate genes using the R package caret and the R package randomForest (version 4.7–1.1) with control parameters set to 10-fold cross-validation with 5 repetitions. The parameter mtry denotes the number of variables randomly sampled in constructing decision tree branches in random forest modeling. Using a RF classifier, we assigned importance scores to feature genes, and we designated genes with importance scores > 80 as the key genes we need.
Establishing an optimal SJIA classification model using random forest
Because the simple construction of random forest with default parameters could not determine whether the model was the best model, we needed to adjust the parameters of the random forest model to find the best model. The feature genes tested during feature selection were initially included to the SJIA random forest prediction model. The best parameter mtry for fitting the dataset was then found using a grid search strategy with the caret and randomForest packages. Immediately afterwards, the model was optimized on the basis of accuracy and overfitting was reduced through multiple rounds of training with 10-fold cross-validation with 5 repetitions. Finally, the SJIA random forest prediction model with optimal parameter mtry was constructed. Finally, the SJIA random forest prediction model was built with the optimal parameter mtry.
Validating the robustness of the model using cross validation
We applied the best model to the training dataset for 5-fold cross-validation to determine the robustness of the model. Using the R package pROC (version 1.18.0), we calculated the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. The accuracy, Kappa, sensitivity and specificity of the results were calculated by the confusionMatrix function of R package caret. Accuracy is the ratio of the sample size that was correctly predicted over all the sample sizes that participated in the prediction. Kappa is a metric used for consistency testing and can also be used to measure the effectiveness of classification. In the range of -1 to 1, higher values indicate better model performance. Sensitivity, also known as true-positive rate, is the percentage of actual disease correctly diagnosed by that diagnostic criterion. Specificity, also known as true-negative rate, is the percentage of actual absence of disease correctly diagnosed by that diagnostic criterion.
Verification using validation datasets
The validity of the SJIA random forest diagnostic model was verified in an external independent validation dataset (GSE7753). Since there are multiple types of JIA, we also explored whether there are dissimilarities among the three subtypes of enthesitis-related arthritis, persistent oligoarthritis and rheumatoid factor negative polyarthritis for differential analysis of these 4 key genes. The above disease datasets were obtained from GSE13501. Details of the above three subtypes are shown in Supplementary Table
S2.
To further explore the breadth of the model. We examined the predictive power of the model for other diseases, such as Systemic Lupus Erythematosus (SLE), S. aureus, S. pneumoniae, E. coli and Influenza A. We collected the GSE8650-GPL96 dataset for model performance checking of SLE and the GSE6269-GPL96 dataset for model performance checking of S. aureus, S. pneumoniae, E. coli and Influenza A. Details of the above diseases are shown in Supplementary Table
S3.
Single-cell RNA-Seq data processing
Single-cell RNA-seq data were obtained from the GSE207633 database, from which we obtained patient clinical information (from the file “GSE207633_Patient_Clinical-MetaData.xlsx”), publisher Single-cell clustering information containing cell name, patient ID, cell type, UMAP clustering information, etc. (from the file “GSE207633_ScRNASeq-MetaData.xlsx”) and single-cell RNA-seq sequencing data (from the file “GSE207633_RAW.tar”). First, we used R package hdf5r to input the single-cell RNA-seq sequencing data extracted from “GSE207633_RAW.tar” of 26 patients, and created them into Seurat objects using the CreateSeuratObject function of R package Seurat, and then merged them. CreateSeuratObject function parameters were selected by default. Next, we used the NormalizeData function of the R package Seurat to divide the read counts for each cell by the total counts for that cell, multiply by a scaling factor (10,000), and then perform the natural logarithm transformation. We then extracted the expression matrix of YBX3 in individual cells using the FetchData function of the R package Seurat and filtered out the cells that possessed the single-cell clustering information. Finally, we used single-cell clustering information for single-cell map visualization as well as mapping YBX3 expression levels in individual cells to single-cell map.
Statistical analysis
All statistical analyses were performed with R software (version 4.1.3). The two continuous variables were compared using the Wilcoxon two-sample test. In order to assess the correlation between two continuous variables, the Spearman correlation coefficient was used. The AUC is used to test the predictive ability of the model, 0.5<AUC<0.7 for poor model performance, 0.7 ≤ AUC<0.8 for fair model performance, 0.8 ≤ AUC<0.9 for good model performance, and AUC ≥ 0.9 for excellent model performance. All tests were two-sided, and p < 0.05 was considered significant.
Discussion
Arthritis in SJIA is accompanied by clinical features such as fever, lymphadenopathy, rash, and serosal disease, and it develops rapidly and is prone to disability [
1]. In order to enhance and protect the quality of survival for children with SJIA, good prediction and identification are essential [
9]. However, the exact mechanism by which SJIA occurs is still unknown [
26], and it is not a classic autoimmune disease, it has systemic inflammatory properties [
6]. Since there are currently no clear diagnostic criteria for SJIA, the diagnosis is typically made using a combination of clinical symptoms, serological tests, and imaging studies. As a rare disease [
27,
28], timely judgments are difficult to made in clinical diagnosis. On the genetic side, exploratory studies on biomarkers of SJIA are still of great value to go deeper. Therefore, it is essential to identify biomarkers that have a strong correlation with SJIA.
In our study, we used transcriptomic data from PBMC and built a SJIA diagnostic model based on the random forest algorithm as a way to distinguish between children with SJIA and healthy children. With the rapid development of bioinformatics, the evidence for the classification of diseases such as SJIA can be supported by strong evidence. To identify differential genes in SJIA, we first combined three GEO datasets (GSE11907-GPL96, GSE8650-GPL96 and GSE13501) as a training dataset, and then performed differential gene expression analysis. GO, KEGG and GSEA enrichment analyses were then performed. Based on GO and KEGG analyses, DEGs were associated with a large number of biological processes and pathways, reflecting the dynamic and complex nature of the pathogenesis. Many previous studies have supported our findings. Although children with SJIA and other JIA have long-term joint damage, the main difference lies in the fact that it can show autoinflammatory characteristics, which is closely related to autoimmunity. Brown et al. [
29] found that neutrophils in children with SJIA exhibited sustained pro-inflammatory activation regardless of disease activity.IL-17, a pro-inflammatory cytokine secreted by Th17 cells, is also critical to study in SJIA. Feng et al. [
30] study found that IL-17 levels were increased in both the active and inactive phases of SJIA compared to controls and were significantly higher in the active phase compared to the inactive phase. Omoyinmi et al. [
31] reported a significant increase in the proportion of IL-17-producing CD4 + T cells in the peripheral blood of SJIA patients. Our study also found that differential genes associated with SJIA are linked to several metabolism-related pathways, including “Porphyrin metabolism”, “Arginine and proline metabolism” and “Glycine, serine and threonine metabolism”. GSEA analysis is superior to KEGG analysis and provides a better understanding of the biological changes between groups. The aggregation of many immune-related pathways in SJIA patients suggests that the pathogenesis of SJIA patients is associated with immune abnormalities, which has become a consensus among SJIA researchers. Regarding metabolism-related pathways, these pathways were positively correlated with SJIA, again suggesting that the pathogenesis of SJIA is likely to be closely related to metabolic dysregulation. We also found two cell death modalities, mitophagy and ferroptosis, to be significantly enriched in SJIA. Mitophagy is an important form of autophagy used to selectively remove dysfunctional or redundant mitochondria, and there is growing evidence of a close relationship between mitophagy and inflammation and autoimmunity [
32‐
34]. Because most of the relevant studies are from autophagy-mitophagy-inhibition models, it should be noted that excessive activation of mitophagy may be detrimental to immune cell homeostasis, as Jin et al. [
35] found that excessive activation of mitophagy performed in a mouse model impairs progenitor-phase red lineage differentiation and leads to depletion of red lineage cells. Therefore, the relationship between regulation of mitophagy and inflammation and immunity cannot be conclusively established, and multiple methods to assess activity of mitophagy are still needed to gain a more thorough understanding of pathways of mitophagy in autoimmune diseases [
32]. Our study found that mitophagy was enriched in children with SJIA, which may provide new clues to future researchers about the relationship between SJIA and mitophagy. Ferroptosis is a regulated mode of cell death triggered by a combination of iron toxicity, lipid peroxidation, and plasma membrane damage [
36]. Recent studies have identified an important role for this mode of cell death in metabolism and immunity [
37,
38], while it has been less studied in SJIA.
We finally screened 4 key genes, namely ALDH1A1, CEACAM1, YBX3 and SLC6A8, by three feature screening methods (Lasso, RF-RFE and RF classifier scoring). Acetaldehyde dehydrogenase 1A1 (ALDH1A1) has a key role in tumor immunity. Liu et al. [
39] found that increased ALDH1A1 activity in breast cancer led to expansion of myeloid-derived suppressor cells and immunosuppression, and also found that the ALDH1A1 inhibitor disulfiram and the chemotherapeutic agent gemcitabine synergistically inhibited breast tumor growth and tumorigenesis by clearing ALDH + tumor-initiating cells and activating T-cell immunity. Cui et al. [
40] reported that ALDH1A1 expression in thyroid cancer was negatively correlated with immunostimulatory genes, major histocompatibility complexes, chemokines and receptors. The above two studies indicate that ALDH1A1 has a close relationship with immunity, but ALDH1A1 has been less studied in autoimmune diseases. In our study, we found that ALDH1A1 expression was low in children with SJIA compared to healthy children and that ALDH1A1 was positively correlated with T cell CD4 memory resting status and negatively correlated with T cell CD4 memory activation, consistent with previous reports that ALDH1A1 has an immunosuppressive function. A recent study reported that ALDH1A1 is associated with systemic sclerosis and affects the pentose phosphate pathway, oxidative stress and lipolysis [
41], suggesting a relationship between ALDH1A1 and autoimmune diseases. Carcinoembryonic antigen cell adhesion molecule 1 (CEACAM1) plays an important role in regulating immune responses associated with infection, inflammation, and cancer. Previous studies have shown that CEACAM1 is a heterophile ligand of TIM-3, mediating T cell inhibition. The interaction between CEACAM1 and TIM-3 plays a crucial role in regulating autoimmunity and anti-tumor immunity [
42,
43]. Our study found that CEACAM1 was highly expressed in children with SJIA and that patients with high CEACAM1 expression were associated with many immune-related pathways. CEACAM1 expression was positively correlated with plasma cells, T cells CD4 memory activation and neutrophils, suggesting that CEACAM1 may play a role in immune activation in SJIA. Y-box binding protein 3 (YBX3) is an RNA-binding protein that regulates different sets of mRNAs, including mRNA abundance, through different mechanisms, thus becoming a key regulator at the amino acid level [
44]. It is considered a key molecule that links the genetic risk factors for Behcet syndrome to the pathogenesis of the disease [
45]. Sun et al. [
46] identified lncRNA-HEIH/YBX3 as a pan-cancer immunogenic system that can be used as a diagnostic and prognostic biomarker and therapeutic target. YBX3 is less studied at present, especially in autoimmune diseases. We found that YBX3 was significantly expressed in the vast majority of immune cells by single-cell level analysis of PMBC from 21 SJIA patients and 5 healthy individuals (Supplementary Figure
S3). Alteration of macrophage-mediated immune response in vivo by depletion of intracellular creatine through ablation of creatine transporter protein (SLC6A8) [
47]. SLC6A8 regulates the energy balance of intestinal epithelial cells, thereby regulating intestinal epithelial integrity and barrier function, which leads to intestinal barrier dysfunction in patients with inflammatory bowel disease [
48]. We found that YBX3 and SLC6A8 were highly expressed in SJIA and that there was a strong correlation between YBX3 and SLC6A8. The results of GSEA analysis, clinical phenotype analysis and correlation analysis with immune cells for these two genes showed strong similarity. In the GSEA study, high expression was found to be closely associated with Complement and coagulation cascades. In a correlation analysis, they were found to be negatively correlated with T cells CD4 memory resting state and positively correlated with T cells CD4 memory activation state, and these two genes may play a key role in the regulation of the immune system in SJIA. Currently, these 4 genes have not been studied in SJIA. In this study, we performed a key gene screen using machine learning techniques with public genetic databases, and we suggest that these 4 genes may be key biomarkers linking genetic risk factors for SJIA to disease pathogenesis.
In the study, we analysed three other subtypes of JIA (enthesitis-related arthritis, persistent oligoarthritis and rheumatoid factor negative polyarthritis) in relation to the expression of 4 key genes. ALDH1A1 showed a significant trend of low expression in all four subtypes of patients compared to healthy individuals, suggesting that ALDH1A1 plays a key role in JIA. In contrast, CEACAM1 showed high expression only in SJIA and persistent oligoarthritis patients, suggesting a potentially identical pathway between SJIA and persistent oligoarthritis. In contrast, YBX3 and SLC6A8 were only highly expressed in SJIA, with no difference in any of the other three subtypes, indicating the specificity of these two genes in SJIA compared to the other three subtypes.The highlights of our study are the innovative combination and application of Lasso, RF-RFE and Random Forest classifier scoring methods to SJIA and the excellent results generated in terms of predictive power in SJIA. Three feature selection methods, such as Lasso, RF-RFE and random forest classifier scoring, are each capable of being used as methods to identify key biomarkers and have been widely used in biology [
49,
50]. Previously, no studies have used machine learning to construct SJIA diagnostic models based on gene sequencing.
We performed a 5-fold cross-validation to prevent the model from overfitting, and obtained an AUC greater than 0.95 and an accuracy greater than 0.9 in each case, indicating the robustness of the model. Following immediately, we performed another validation of the robustness and generalization ability of the model using an external independent dataset (GSE7753), and the results showed an AUC of 0.990, again demonstrating the strong robustness of our model. In particular, the sample in GSE7753 included untreated children with new-onset SJIA as well as healthy children. The above results indicate that the model is well suited for the identification and prediction of SJIA.
In addition, we explored whether there were similarities and differences among the 4 subtypes (SJIA, enthesitis-related arthritis, persistent oligoarthritis and rheumatoid factor negative polyarthritis) of these 4 key genes. The results revealed that ALDH1A1 showed downregulation in all 4 subtypes, and ALDH1A1 may have a key role as a biomarker of disease onset in the same 4 subtypes. YBX3 and SLC6A8 were only upregulated in SJIA, with no significant up and down expression trends in the remaining three subtypes, which may indicate that these two genes are the key genes that cause SJIA to have different clinical characteristics from the three subtypes. And at the same time YBX3 and SLC6A8 have a very strong correlation in SJIA.
Using the transcriptome level, we investigated the validity and reliability of machine learning in SJIA diagnostic prediction. In light of this, we have successfully developed a new SJIA disease risk prediction model that holds promise for use as a new SJIA diagnostic prediction tool to aid in the diagnosis of SJIA. Furthermore, the 4 key genes we finally obtained provided some new clues for the subsequent SJIA study.
Nevertheless, our study has some limitations. (1) Although we merged multiple datasets into a larger dataset to build the model, it still did not satisfy the number of data samples required for machine learning. If conditions allow, we can include more research data in future training datasets. (2) Overfitting in model construction is objective and difficult to eliminate, but we minimized the overfitting problem by using a 5-fold cross-validation approach in the modeling process. Checking for overfitting is not a complete solution, but it is still very helpful. However, this means that even if we get good model results on the validation dataset, there is no shortage of different data with noise in reality, and the actual generalization ability may not be good. (3) The model has not been tested in practical applications in predicting SJIA patients. Therefore, more research data will be needed in the future to test the robustness and generalization ability of the model.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.