Background
Autism spectrum disorders (ASD) [OMIM 209850] are defined as a group of neurobehavioral syndromes characterized by deficits in social interaction, impaired communication skills and restricted, stereotypical and ritualized patterns of interests and behavior, typically appearing before the age of 3. Throughout the last decades, the prevalence of ASD has risen from the historically estimated proportion of 4/10,000 to approximately 1/110 children (2008) [1-3] with a ratio four times higher in males than females [4]. It is still debated how much of this increase is related to diagnostic improvements, increased awareness of ASD, or emerging environmental factors [5,6]. ASD are among the most heritable neuropsychiatric disorders, given that concordance rates in monozygotic twins are 90% and siblings have an approximately 50-fold increased risk of ASD. ASD are found in association with comorbid genetic conditions in 10% of cases and are considered complex multifactorial disorders involving multiple genes [7,8]. Currently, the etiology can be established in only 30% of the cases and remains unknown for most patients.
The technological improvements of the last decade have led to tremendous advances in understanding the genetic basis of ASD, revealing a high degree of genetic heterogeneity. Clinical application of molecular karyotyping has shown that 5% to 10% of patients carry chromosomal rearrangements and that the burden of rare and de novo smaller copy number variants (CNVs) is higher among ASD patients than controls. However, since many of these variants show incomplete penetrance and variable phenotypic expression, the best model to explain most ASD cases would be oligogenic with a probable environmental contribution. Until now, most genomic studies on ASD using next-generation sequencing (NGS) have focused on coding regions and have analyzed trios in an effort to identify de novo mutations [9-14]. Only a few studies investigated rare inherited variation [15-18]. The reported data suggest a contribution of de novo disruptive mutations to the genetic etiology of ASD, with hundreds of genes implicated and only a few of them recurrently mutated in unrelated cases (CHD8 [MIM 610528], DYRK1A [MIM 600855], GRIN2B [MIM 138252], KATNAL2 [MIM 614697], POGZ [MIM 614787], and SCN2A [MIM 182390]). Besides de novo disruptive mutations, comparison of rates of rare variation across the whole genome in cases versus controls has yielded no significant associations. These findings support previous hypotheses suggesting that a large number of genes confer risk to ASD and reinforce the idea that much larger cohorts will be necessary to carry out this type of analysis [19]. The identification of new genes involved in ASD will eventually lead to the definition of common effects of genetic variants and possibly ASD biomarkers and biological signatures. Systems biology tools such as interaction networks are important to detect common deregulated pathways and expression networks implicated in the disease.
An additional approach to identify genetic variants associated with a phenotype and to understand the biological effects resulting from rare genetic variation could be derived from observing the transcriptomic consequences of genetic variation [20]. To this end, we have analyzed 36 Spanish male patients with idiopathic ASD by whole-exome sequencing (WES) to define causative or susceptibility variants for ASD and their transcriptomic consequences by RNAseq. In addition to the identification of likely monogenic cases, we also studied the accumulation of rare genetic variation which could result in putatively common functional consequences.
Methods
Sample selection
We studied 36 unrelated males with a diagnosis of idiopathic ASD selected from a Spanish cohort of 324 patients. All cases except two were sporadic. All patients had a confirmed diagnosis of one of the categories of ASD listed in the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV), categorized according to the Spanish version of ADI-R (Autism Diagnostic Interview-Revised), and the Wechsler Intelligence Scale for Children or Wechsler Adult Intelligence Scale. All patients had an extensive clinical and molecular evaluation including fragile X testing and molecular karyotype (either BAC, oligo, or SNP array) with normal results. The study was approved by the Clinical Research Ethics Committee of the centers involved (CEIC-Parc Salut Mar), and informed consent for participation was obtained from the parents or legal caregivers. Blood samples were obtained, and genomic DNA was extracted by the salting out method using the Puregene® DNA Purification Kit (Gentra Systems, Big Lake, MN, USA). Parental and familial samples were obtained from the available relatives who gave informed consent.
Whole-exome capture and sequencing
The exome portion of the genome was enriched using NimbleGen EZ Exome V2.0 capture kit (Roche Applied Science, Madison, WI, USA). Gene and exon annotations for SeqCap EZ Human Exome Library came from RefSeq (Jan 2010), CCDS (Sept 2009), and miRBase (v.14, September 2009). A total of approximately 30,000 coding genes (approximately 300,000 exons, total size 36.5 Mb) were targeted by the design, and a total of 44.1 Mb were covered by the probes. Final libraries were then sequenced on an ABI Solid 4 platform (Life Technologies, Carlsbad, CA, USA). Single-end sequences were obtained with a read length of 50 bp.
Variant calling, annotation, and prioritization
A pipeline for data alignment using BFAST [
21] and GATK [
22] algorithms was applied to the sequencing data following standard parameters. Briefly, sequences were aligned to the latest version of the human genome (hg19), PCR duplicates were marked and removed, and quality scores of alignments were recalibrated. Single nucleotide variants (SNV) and indel calls were only considered if positions had a depth of coverage of at least 10×, and heterozygous positions were only called when a minimum of 20% of the reads showed the variant (AB between 0.2 and 0.8). In order to minimize technical artifacts, we removed variants that appeared in more than two samples, even if they were present in a single read or had an AB ratio lower than 0.2. Annotation of variants was performed using ANNOVAR (
http://www.openbioinformatics.org/annovar/), taking into account the variant frequency in control databases: dbSNP135 (
http://www.ncbi.nlm.nih.gov/SNP/), Exome Variant Server (EVS) (
http://evs.gs.washington.edu/EVS/), and an in-house database of 90 Spanish controls. The nature of the changes was assessed by PolyPhen and Condel (
http://bg.upf.edu/fannsdb/) protein effect prediction algorithms [
23]. To distinguish the putative disease-causing variants, we established the following criteria: (1) we selected only non-synonymous variants; (2) under a dominant model, we excluded variants previously described in the general population (dbSNP135, EVS, 1000 Genomes (
http://browser.1000genomes.org) and Spanish controls); (3) under a recessive model, we removed variants with a minor allele frequency (MAF) >0.002 and only considered genes with homozygous or compound heterozygous mutations; (4) we discarded variants present in loss of function tolerant genes as previously described [
24]; and (5) we manually inspected recurrent variants and indel calls to exclude false positives using Integrative Genomics Viewer (IGV) [
25].
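The quality thresholds and the dominant and recessive prioritization rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Variant` record and its field names are hypothetical, `max_control_maf` stands for the highest frequency seen in any control database (0.0 when absent), and compound heterozygosity is approximated as two or more heterozygous calls in the same gene because no phasing information is modeled. The loss-of-function tolerant gene filter and the manual IGV inspection are not represented.

```python
from dataclasses import dataclass

MIN_DEPTH = 10             # minimum depth of coverage stated in the text
AB_RANGE = (0.2, 0.8)      # allele balance window for heterozygous calls
RECESSIVE_MAX_MAF = 0.002  # variants above this MAF are removed (recessive model)


@dataclass
class Variant:
    # Hypothetical record; field names are illustrative only.
    gene: str
    depth: int
    alt_reads: int
    nonsynonymous: bool
    max_control_maf: float  # 0.0 means absent from all control databases
    genotype: str           # "het" or "hom"


def allele_balance(v):
    """Fraction of reads supporting the alternative allele."""
    return v.alt_reads / v.depth if v.depth else 0.0


def passes_quality(v):
    """Depth >= 10x, and AB within 0.2-0.8 for heterozygous calls."""
    if v.depth < MIN_DEPTH:
        return False
    if v.genotype == "het":
        return AB_RANGE[0] <= allele_balance(v) <= AB_RANGE[1]
    return True


def dominant_candidates(variants):
    """Non-synonymous variants never seen in control databases."""
    return [v for v in variants
            if passes_quality(v) and v.nonsynonymous and v.max_control_maf == 0.0]


def recessive_candidates(variants):
    """Rare non-synonymous variants in genes with one homozygous call
    or at least two heterozygous calls (a proxy for compound heterozygosity)."""
    rare = [v for v in variants
            if passes_quality(v) and v.nonsynonymous
            and v.max_control_maf <= RECESSIVE_MAX_MAF]
    by_gene = {}
    for v in rare:
        by_gene.setdefault(v.gene, []).append(v)
    kept = []
    for calls in by_gene.values():
        has_hom = any(v.genotype == "hom" for v in calls)
        n_het = sum(v.genotype == "het" for v in calls)
        if has_hom or n_het >= 2:
            kept.extend(calls)
    return kept
```

In practice, two heterozygous variants in one gene would still need segregation analysis in the parents to confirm that they lie on different alleles.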
We used the XHMM algorithm to call CNVs, based on measurement of the read depth per target region (GATK). We followed the standard steps as described in the online tutorial. We applied the same filters previously described [26]: XHMM quality score (SQ) ≥65, exons spanned ≥3, and estimated CNV length ≥1 kb. We focused our analysis on rare CNVs, so we excluded CNVs overlapping with polymorphic variants reported in Database of Genomic Variants (DGV) (http://dgv.tcag.ca/dgv/app/home).
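As an illustration of the post-calling filters listed above, the following Python sketch keeps calls with SQ ≥65, at least three exons, and a length of at least 1 kb, and then drops calls that overlap a known polymorphic region. The `CnvCall` record and the region tuples are hypothetical, and treating any overlap with a DGV region as grounds for exclusion is an assumption, since the text does not state a minimum overlap fraction.

```python
from dataclasses import dataclass


@dataclass
class CnvCall:
    # Hypothetical record for one exome-derived CNV call.
    sample: str
    chrom: str
    start: int
    end: int
    kind: str     # "DEL" or "DUP"
    sq: float     # XHMM quality score
    n_exons: int  # number of exons (targets) spanned


def passes_filters(call, min_sq=65, min_exons=3, min_length=1000):
    """Apply the SQ, exon-count, and length thresholds from the text."""
    length = call.end - call.start + 1
    return call.sq >= min_sq and call.n_exons >= min_exons and length >= min_length


def overlaps(call, region):
    """True when the call shares at least one base with (chrom, start, end)."""
    chrom, start, end = region
    return call.chrom == chrom and call.start <= end and start <= call.end


def rare_cnvs(calls, polymorphic_regions):
    """Keep filtered calls that do not overlap any polymorphic region."""
    return [c for c in calls
            if passes_filters(c)
            and not any(overlaps(c, r) for r in polymorphic_regions)]
```

A reciprocal-overlap criterion (for example, requiring a minimum shared fraction) would be a stricter alternative to the any-overlap rule assumed here.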
Validation
We used Sequenom genotyping (iPLEX Gold platform, San Diego, CA, USA) and Sanger sequencing by capillary electrophoresis (ABI PRISM 7900HT, Applied Biosystems, Foster City, CA, USA) to perform validation and segregation studies. To genotype the selected variants, we designed primers (PRIMER 3 application) (http://www.bioinformatics.nl/cgi-bin/primer3plus/primer3plus.cgi/) and used standard PCR conditions. For CNV validation, we used multiplex ligation-dependent probe amplification (MLPA) with custom probes in the target region. MLPA reactions were carried out under standard conditions. The relative peak height method was used to determine the copy number status. We analyzed samples from the proband and both parents as well as from other relatives when available.
Paternity testing
We performed microsatellite genotyping of trios to corroborate paternity in patients with de novo mutations. We selected highly heterozygous microsatellite markers randomly distributed across different autosomal chromosomes. PCR products were amplified under standard conditions, and fragments were separated and analyzed by high-resolution electrophoresis using GeneMapper software (ABI 3100, Applied Biosystems, Foster City, CA, USA).
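The logic behind a trio check of this kind can be sketched as a Mendelian consistency test per marker: at each autosomal microsatellite, one of the child's alleles must be present in the mother and the other in the father. The marker names and allele sizes below are hypothetical, and this sketch ignores genotyping error and microsatellite mutation, which a real analysis would need to tolerate.

```python
def mendelian_consistent(child, mother, father):
    """True when one child allele can come from each parent.

    Each genotype is a pair of allele sizes, e.g. (12, 14).
    """
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def inconsistent_markers(trio_genotypes):
    """Return the markers at which the trio violates Mendelian inheritance.

    trio_genotypes maps a marker name to (child, mother, father) genotypes.
    """
    return [marker
            for marker, (child, mother, father) in trio_genotypes.items()
            if not mendelian_consistent(child, mother, father)]


# Hypothetical example with two illustrative markers.
trio = {
    "marker_A": ((12, 14), (12, 15), (14, 16)),
    "marker_B": ((20, 22), (20, 21), (23, 24)),
}
```

With the example above, `inconsistent_markers(trio)` flags only `marker_B`, because neither paternal allele matches the child's non-maternal allele.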
X-chromosome inactivation analysis
To determine the X-chromosome inactivation (XCI) pattern, we examined the differential methylation state of HpaII sites near the polymorphic CAG repeat in exon 1 of the human androgen receptor gene (AR [MIM 313700]) located at Xq13. Following digestion with the methylation-sensitive restriction endonuclease HpaII, the region was amplified by PCR with a FAM-labeled forward primer [27,28]. The digested and undigested PCR products were analyzed in an ABI PRISM 3100 Genetic Analyzer. For quantitative analysis, trace data were retrieved using the accompanying software (GeneMapper, Applied Biosystems, Foster City, CA, USA). The degree of XCI skewing was calculated as the fractional peak height ratio (expressed as %) for the more strongly amplified allele. XCI was considered significantly skewed if the ratio exceeded 90:10.
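The skewing calculation described above reduces to a simple ratio, sketched here in Python. The function names are hypothetical, the inputs are assumed to be the two allele peak heights measured in the HpaII-digested sample, and no normalization against the undigested sample is applied because the text does not describe one.

```python
def xci_skewing(peak_a, peak_b):
    """Percentage of signal from the more strongly amplified allele."""
    total = peak_a + peak_b
    if total <= 0:
        raise ValueError("peak heights must sum to a positive value")
    return 100.0 * max(peak_a, peak_b) / total


def is_skewed(peak_a, peak_b, threshold=90.0):
    """True when the ratio exceeds the 90:10 cutoff used in the text."""
    return xci_skewing(peak_a, peak_b) > threshold
```

For example, peak heights of 950 and 50 give 95% and count as skewed, whereas 900 and 100 give exactly 90% and do not exceed the cutoff.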
Transcriptome sequencing
Peripheral blood mononuclear cells (PBMCs) from whole blood of the 36 studied ASD patients were isolated using a ficoll density gradient (Lymphoprep™, STEMCELL Technologies, Vancouver, British Columbia, Canada). Total RNA was extracted using Trizol (Life Technologies, Carlsbad, CA, USA) following a standard protocol. The quality and yield of the isolated RNA were assessed using a NanoDrop8000 Spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA) and Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). Transcriptome sequencing was performed on a HiSeq 2000. Paired-end sequences were obtained at a read length of 100 bp with 57,792,576 mean read pairs per sample. Sequences were aligned to the NCBI build 37 human genome reference using TopHat [29] and Bowtie [30] to map the inter-exon splice junctions. Cufflinks [31] and htseq-count [32] were used to estimate the expression of the transcripts (FPKM, fragments per kilobase of transcript per million fragments mapped, and read counts). We used the ComBat algorithm (http://www.bu.edu/jlab/wp-assets/ComBat/Abstract.html) (R package sva) to remove batch effects and to obtain the z-score of expressed genes.
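To make the two expression measures concrete, the sketch below computes FPKM from a fragment count and a per-gene z-score across samples. It does not reproduce Cufflinks, which estimates FPKM with a statistical model of transcript abundance, nor the ComBat batch correction; the function names are hypothetical, and the use of the sample standard deviation for the z-score is an assumption.

```python
import statistics


def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def z_scores(values):
    """Standardize one gene's expression values across samples."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("z-scores are undefined for constant values")
    return [(v - mean) / sd for v in values]
```

As a worked case, 100 fragments on a 2,000 bp transcript in a library of 50 million mapped fragments correspond to an FPKM of 1.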
Allele-specific expression analysis
In order to study extreme imbalances of allelic expression, either allele-specific or preferential expression, we selected heterozygous SNPs (dbSNP135) identified by WES in each patient with a minimum depth of coverage of 15 and an AB ratio between 0.3 and 0.7. SNPs in known segmental duplications or pseudogenes according to UCSC hg19 (http://genome.ucsc.edu/cgi-bin/hgGateway) ‘Segmental Dups’ and ‘Retroposed Genes’ tracks were excluded from the analyses. We then extracted the number of RNAseq reads mapped to each position and selected only highly covered positions (at least 20×). We classified the expression of each SNP according to its AB ratio: biallelic when the AB ratio was between 0.1 and 0.9 and monoallelic when predominantly the reference or the alternative allele was expressed (AB >0.9 or <0.1). When all SNPs of a gene were monoallelic, we classified the gene as having monoallelic expression, whereas genes with biallelic SNPs were considered to have biallelic expression.
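The SNP-level and gene-level classification rules above can be written as a short Python sketch. The function names and the (reference reads, alternative reads) input format are hypothetical; the thresholds (RNAseq coverage of at least 20×, AB between 0.1 and 0.9 for biallelic expression) are those given in the text.

```python
def classify_snp(ref_reads, alt_reads, min_coverage=20):
    """Classify one heterozygous SNP from its RNAseq read counts.

    Returns None when coverage is below the 20x threshold.
    """
    total = ref_reads + alt_reads
    if total < min_coverage:
        return None
    ab = alt_reads / total
    return "biallelic" if 0.1 <= ab <= 0.9 else "monoallelic"


def classify_gene(snp_counts):
    """Monoallelic only when every sufficiently covered SNP is monoallelic.

    snp_counts is a list of (ref_reads, alt_reads) pairs for one gene.
    """
    calls = [classify_snp(ref, alt) for ref, alt in snp_counts]
    calls = [c for c in calls if c is not None]
    if not calls:
        return None
    if all(c == "monoallelic" for c in calls):
        return "monoallelic"
    return "biallelic"
```

A gene with SNP counts `[(0, 30), (40, 1)]` would be called monoallelic, while `[(0, 30), (20, 20)]` would be called biallelic because one SNP shows balanced expression.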
The aligned reads were processed by Cufflinks, using a supplied reference annotation (Homo_sapiens.GRCh37.68.gtf) to guide RABT assembly. Assembled transcripts were then analyzed by Cuffcompare to compare isoforms across all samples. We then selected novel isoforms (defined by Cufflinks by class code j) with an expression >2 FPKM and matched them to rare variants found by exome sequencing in the same patient.
RNA editing analysis
We first applied stringent filtering criteria to remove RNAseq false positive calls (DP > 10, SB ≤ 0.1, HRun < 8, ReadPosRankSum ≥ 2.0, BaseQRankSum ≥ 2) and then annotated depth of coverage of the same positions according to exome sequencing. We selected only positions with a depth of coverage of at least 15× and that were not called by exome sequencing, excluding all variants described in control databases (dbSNP135, Exome Variant Server) and those present in another sample. We manually reviewed the remaining variants to discard false positive calls.
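The second selection step, comparing RNAseq calls with the exome data, amounts to a set of membership tests, sketched below. The sketch assumes the RNAseq calls have already passed the quality filters and represents each site as a hypothetical `(chrom, pos, ref, alt)` tuple; the function and parameter names are illustrative only, and the final manual review is not modeled.

```python
def rna_editing_candidates(rna_calls, exome_calls, exome_depth,
                           known_sites, other_sample_calls,
                           min_exome_depth=15):
    """Select RNA-only variants at positions well covered by exome sequencing.

    rna_calls, exome_calls, known_sites, other_sample_calls: sets of
    (chrom, pos, ref, alt) tuples. exome_depth maps a site to its exome
    depth of coverage in the same patient.
    """
    return {
        site for site in rna_calls
        if exome_depth.get(site, 0) >= min_exome_depth  # exome could have seen it
        and site not in exome_calls                     # but did not call it
        and site not in known_sites                     # absent from control databases
        and site not in other_sample_calls              # private to this sample
    }
```

Requiring adequate exome coverage is what distinguishes a candidate editing site from a genomic variant that the exome simply failed to capture.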
Pathway enrichment analysis
To identify common deregulated mechanisms affected by rare genetic variants, we performed pathway enrichment analyses using the publicly available ConsensusPathDB database (CPDB) (http://cpdb.molgen.mpg.de/). CPDB incorporates interaction data from different categories including metabolic and signaling reactions, physical protein and genetic interactions, and gene regulatory interactions. Statistical analyses were performed using the CPDB overrepresentation analysis option, with four categories of predefined gene sets (network neighborhood-based, pathway-based, Gene Ontology-based, and protein complex-based). For each of the predefined sets, a P-value was calculated according to the hypergeometric test based on the number of physical entities present in both the predefined set and user-specified list of physical entities. For pathway-based sets, we used the default P-value threshold of 0.01. We used the default gene background defined by CPDB as the number of entities that are annotated within the category of the provided gene. We then compared the pathways overrepresented among rare WES variants in ASD samples with those found in 55 Spanish non-ASD samples.
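For readers unfamiliar with the overrepresentation test mentioned above, the following Python sketch computes a one-sided hypergeometric P-value: the probability of observing at least `overlap` genes from a predefined set when drawing the user list at random from the background. It is a generic illustration, not the CPDB implementation; the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_pvalue(overlap, list_size, set_size, background_size):
    """P(X >= overlap) for X ~ Hypergeometric(background, set, list).

    overlap: genes shared by the user list and the predefined set
    list_size: number of genes in the user list
    set_size: number of genes in the predefined set
    background_size: number of annotated genes in the background
    """
    max_overlap = min(list_size, set_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(set_size, i) * comb(background_size - set_size, list_size - i)
        for i in range(overlap, max_overlap + 1)
    )
    return tail / total


def overrepresented(overlap, list_size, set_size, background_size, alpha=0.01):
    """Apply the 0.01 threshold used for pathway-based sets in the text."""
    return hypergeom_pvalue(overlap, list_size, set_size, background_size) < alpha
```

Because many predefined sets are tested at once, a multiple-testing correction is commonly applied on top of the raw P-values; the text does not specify one, so none is shown here.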
Discussion
Autism spectrum disorders are a group of heterogeneous disorders with a strong genetic component but a complex genetic architecture. This complexity makes genetic diagnosis challenging, with a current diagnostic yield ranging from 15% to 30% [
60,
61]. Unbiased genome-wide molecular tools such as NGS, with steadily decreasing cost, have proven efficacy, although they produce genetic and genomic information that cannot yet be fully interpreted. Here, we used WES and blood transcriptome by RNAseq in a selected group of males with idiopathic ASD to detect putative causal genetic variants of this complex disease. Segregation and recurrence analyses along with expression studies were used to better discriminate putatively pathogenic variants from innocuous rare variation.
WES identified several cases with likely monogenic forms of ASD, including four patients with
de novo variants in strong candidate autosomal genes (11%) and two patients with inherited X-linked mutations (5.6%). Since our study did not include parental exome sequencing, our detection rate of monogenic cases may be underestimated. Autosomal LoF mutations were identified in
SCN2A,
MED13L, and
KCNV1. SCN2A is one of the few genes found recurrently mutated in unrelated patients with ASD and intellectual disability, which is unlikely to occur by chance [
10,
36,
37,
62].
MED13L was previously associated with intellectual disability and heart defects, and a
de novo splicing mutation was described in an autistic patient [
13,
63]. Recently,
de novo deletions affecting coding regions of
MED13L were found in two girls presenting a phenotype very similar to the patient we report here, including facial dysmorphism, hypotonia, and developmental delay, along with ASD [
64]. Our findings are consistent with a role of
MED13L in neurodevelopmental disorders. The third
de novo and likely pathogenic variant affected
KCNV1, coding for a potassium channel subunit mainly expressed in the brain and involved in the regulation of two other potassium channels (
KCNB1 and
KCNB2). Defects in voltage-gated potassium channels have been associated with a variety of neuropsychiatric disorders, including bipolar disorder, schizophrenia, and ASD [
12,
65-
67]. Therefore, our data suggest a role for potassium voltage-gated channels in the etiology of ASD. Finally, we detected a
de novo missense mutation in
CUL3, which is another of the few recurrently
de novo mutated genes in ASD patients [
11,
39].
Regarding X-linked mutations, we identified alterations in two genes (
MAOA and
CDKL5) previously associated with ASD and intellectual disability. We found a splicing mutation in
MAOA in a multiplex family with an X-linked pattern of inheritance, with two affected male siblings and a maternal history of psychiatric disease.
MAOA encodes the protein monoamine oxidase A, which degrades amine neurotransmitters such as dopamine, norepinephrine, and serotonin [
68]. Both affected siblings had consistent biochemical alterations of the catecholamine pathway, which were very mild in their carrier mother, thus supporting the pathogenicity of the mutation. A
MAOA truncating mutation was first described in a Dutch family in 1993, and recently, a second loss of function mutation was found in a family segregating ASD and behavioral problems [
69,
70]. Our work further strengthens the relation between
MAOA and ASD. The other X-linked mutation affected a highly conserved amino acid in
CDKL5 and is predicted to be deleterious by different algorithms [
71,
72]. It was inherited from the healthy mother and also detected in the unaffected sister, who preferentially inactivated the mutated allele. Mutations in
CDKL5 were reported in X-linked infantile spasms syndrome (ISSX) [MIM 308350], atypical Rett syndrome (RTT) [MIM 312750], and Angelman syndrome-like [MIM 105830] [
40-
42,
73,
74]. Since the phenotype of our patient is less severe than the one described in males with
CDKL5 mutations, it is possible that the missense variant found in this patient is a hypomorphic allele with a milder effect. The biased X inactivation documented in the proband’s sister could act as a protective factor in females, explaining their unaffected status.
While the definition of the potential pathogenicity for ASD mutations could be relatively straightforward in
de novo and Mendelian cases assuming full penetrance, a greater challenge is the classification of putatively pathogenic heterozygous mutations and rearrangements inherited from unaffected parents. These variants, presumably with incomplete penetrance and variable expression, are thought to contribute to disease risk in an oligogenic model with probable environmental contribution. One of the most common criteria to define potential pathogenicity is the recurrence of mutations in the same gene in unrelated patients. In our small cohort, we detected two additional patients with rare inherited missense mutations in genes with
de novo LoF mutations in other cases (
SCN2A and
MED13L), suggesting that the inherited mutations could also be contributing to the disease. While amorphic alleles might be highly penetrant, hypomorphic missense mutations might have a milder effect, merely increasing the global burden of risk for ASD. Moreover, the study of common functional consequences of rare variation pointed towards a set of relevant pathways overrepresented only in ASD patients. Among these were PI3K/Akt signaling and axon guidance, which were previously associated with ASD by linkage studies, rare CNVs, genomic mutations, and comorbid ASD conditions [
45-
48].
Using CNV detection tools on WES data, we identified small rearrangements that were missed by previous molecular karyotyping in eight cases, all of them inherited from unaffected parents. Some of these rare CNVs could also contribute to the disease in a multiple-hit model, such as the duplications found in
ASMT as previously proposed [
50-
52]. Although we did not observe a significant effect on expression in the individual analysis of these rare CNVs, a global analysis showed a tendency for higher expression of genes in duplication type CNVs compared to those in deletions, as expected and described for other ASD-related rearrangements [
20,
75].
The incorporation of the peripheral blood transcriptomic data provided important added value to the identification of molecular biomarkers of ASD, leading to the identification of additional mutations and transcriptional consequences. Although blood is not the ideal target tissue to study ASD, it is commonly used in neurodevelopmental disorders since it is easily available for diagnostic testing [
76]. In our study, approximately 30% of the rare variants were sufficiently expressed in blood, and 88% of these had a concordant calling in both techniques. By isoform analysis, we found an aberrant transcript in
PTEN, due to a
de novo intronic mutation that activates a cryptic splice site. The patient (ASD_36) presents macrocephaly, a feature that is consistently found in ASD patients with mutations in this gene (MIM 605309). Thus, transcriptome sequencing of blood cells was essential to achieve a diagnosis in an additional patient, reaching a final diagnostic yield of 19%. Moreover, the integrative approach also enabled the identification of rare inherited variants with functional consequences that could contribute to the phenotype. As previously suggested, the joint study of genomic and transcriptomic data can be crucial to unravel the mechanism of complex diseases [
77,
78]. We detected alteration in expression levels in 1.7% of expressed rare variants, inherited in all cases. Overexpression of
MECP2 was found in a patient who had a rare SNP variant in the same gene. Duplication of
MECP2 causes a known duplication syndrome almost exclusively in males with moderate to severe intellectual disability. Overexpression of the gene in peripheral leucocytes was previously described in ASD patients [
79] and was also related to aggressive social behavior in schizophrenia [
80]. We also identified three rare mutations with concurring overexpression of candidate genes
ANK3 [
81,
82],
CREBBP [
83,
84], and
SEMA6B [
85] in a single patient. They were inherited from both progenitors and could contribute to ASD in an additive manner. Additionally, we detected monoallelic expression of the wild-type allele associated with ten rare inherited truncating mutations, suggesting nonsense-mediated decay (NMD); the resulting functional haploinsufficiency could contribute to the phenotype, as for the alterations in
ALG9 [MIM 606941] and
RIT1 [MIM 609591], previously implicated in neuropsychiatric conditions [
86-
88]. Finally, allele-specific expression analysis revealed the alteration of 68 genes (or specific transcript variants) with monoallelic expression but no
cis-element responsible for it. This phenomenon was found to be more common for autosomal and X-linked genes in ASD patients than in controls in the brain and other tissues [
89,
90]. Allele-specific expression can be caused by unidentified
cis-acting elements, including genetic or genomic mutations in the promoter or regulatory regions and epigenetic marks. Thus, some of the identified genes with allele-specific expression might contribute to ASD. In fact, two of them have been previously associated (
MTOR) [
91] and/or are great candidates (
FUS and
TAF1C) [
92,
93]. Ongoing efforts to define the extent of expression variation in large numbers of healthy controls such as Geuvadis and GTEx will help to better clarify the deregulated genes found in individual patients that are related to disease.
Competing interests
Benjamín Rodríguez-Santiago and Luis A. Pérez-Jurado are currently employee and scientific advisor, respectively, of qGenomics SL. The authors declare no other competing financial or non-financial interests.
Authors’ contributions
MCS participated in the performance of molecular genetic assays, carried out data interpretation, and drafted the manuscript. BRS performed the bioinformatic analyses and helped to draft the manuscript. AH participated in genetic studies and revised the manuscript. JS participated in the bioinformatic analyses. MR participated in the molecular studies. GAL, MdC, BG, EG, and MPB carried out the clinical evaluation of the patients. AG participated in the bioinformatic analyses. GA participated in the whole-exome sequencing process. LPJ conceived the study and participated in the design and data interpretation, and helped in drafting the manuscript. IC conceived the study and participated in the design, coordination, and data interpretation and drafted the manuscript. All authors read and approved the manuscript.