Recurrent somatic mutations reveal new insights into consequences of mutagenic processes in cancer

Abstract

The sheer size of the human genome makes it improbable that identical somatic mutations at the exact same position are observed in multiple tumours solely by chance. The scarcity of cancer driver mutations also precludes positive selection as the sole explanation. Therefore, recurrent mutations may be highly informative of characteristics of mutational processes. To explore this potential, we use recurrence as a starting point to cluster >2,500 whole genomes of a pan-cancer cohort. We describe each genome with 13 recurrence-based and 29 general mutational features. Using principal component analysis, we reduce the dimensionality and create independent features. We apply hierarchical clustering to the first 18 principal components followed by k-means clustering. We show that the resulting 16 clusters capture clinically relevant cancer phenotypes. High levels of recurrent substitutions separate the clusters that we link to UV-light exposure and deregulated activity of POLE from the one representing defective mismatch repair, which shows high levels of recurrent insertions/deletions. Recurrence of both mutation types characterizes cancer genomes with somatic hypermutation of immunoglobulin genes and the cluster of genomes exposed to gastric acid. Low levels of recurrence are observed for the cluster where tobacco-smoke exposure induces mutagenesis and the one linked to increased activity of cytidine deaminases. Notably, the majority of substitutions are recurrent in a single tumour type, while recurrent insertions/deletions point to shared processes between tumour types. Recurrence also reveals susceptible sequence motifs, including TT[C>A]TTT and AAC[T>G]T for the POLE and ‘gastric acid-exposure’ clusters, respectively. Moreover, we refine knowledge of mutagenesis, including increased C/G deletion levels in general for lung tumours and specifically in midsize homopolymer sequence contexts for microsatellite instable tumours.
Our findings are an important step towards the development of a generic cancer diagnostic test for clinical practice based on whole-genome sequencing that could replace multiple diagnostics currently in use.

Main Figures and Table

Fig 1. Recurrence within each tumour type in absolute numbers and percentages

DATA: TXT table
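
Note: The following Python sketch is an illustration only and is not part of the repository (whose scripts are written in R). It shows one way to count recurrence in absolute numbers and percentages, assuming a mutation is recurrent when the identical change (same chromosome, position, reference and alternative allele) is seen in at least two samples. The tuple layout and the function name `count_recurrence` are hypothetical.

```python
from collections import defaultdict


def count_recurrence(mutations):
    """Count recurrent mutation calls.

    mutations: iterable of (sample_id, chrom, pos, ref, alt) tuples
               (hypothetical layout, chosen for this illustration).
    Returns (number of recurrent calls, percentage of all calls).
    """
    # Collect the set of samples in which each distinct mutation occurs.
    samples_per_mutation = defaultdict(set)
    for sample_id, chrom, pos, ref, alt in mutations:
        samples_per_mutation[(chrom, pos, ref, alt)].add(sample_id)

    # One call = one (sample, mutation) pair.
    total_calls = sum(len(s) for s in samples_per_mutation.values())
    # A call is recurrent when its mutation is present in >= 2 samples.
    recurrent_calls = sum(
        len(s) for s in samples_per_mutation.values() if len(s) >= 2
    )
    percentage = 100.0 * recurrent_calls / total_calls if total_calls else 0.0
    return recurrent_calls, percentage


example = [
    ("S1", "chr1", 100, "C", "T"),
    ("S2", "chr1", 100, "C", "T"),  # same change as in S1: recurrent
    ("S3", "chr1", 100, "C", "A"),  # same position, different change
    ("S1", "chr2", 50, "T", "G"),
]
print(count_recurrence(example))
```

To obtain per-tumour-type numbers, the same function could be applied to the mutations of each tumour type separately.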

Fig 2. Spearman’s rank correlation between the 42 mutational features

DATA: The statistics used to compute the correlations in the manuscript, including those in S2 Text. (RData | TXT file)

CODE: Correct p-values for multiple testing and plot correlations for the 42 features (RScript)
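
Note: The sketch below illustrates what correcting p-values for multiple testing can look like. The correction method used by the RScript is not stated in this README; the Benjamini-Hochberg procedure shown here is an assumption made for illustration, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```

With 42 features there are 861 pairwise correlations, so some form of correction is needed before reading significance off the correlation plot.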

Fig 3. Workflow of the recurrence-based approach to group cancer genomes

Note: An interactive 3D version of the PCA plot shown in Step 3 of the workflow is available here.

DATA:

Output of Step 1: 42 features (RData | TXT file)
Output of Step 3 and 4: PCA object (RData)
Output of Step 5 and 6: HCPC object (RData) and Samples linked to cluster and tumour type (RData | TXT file)
Output of Step 7: Samples annotated with metadata (RData | TXT file)

CODE:

Main workflow, step 1 through 7 (RScript)
- Annotation of mutations (RScript)
- Annotation of samples (RScript)
Generic functions (RScript)
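
Note: The abstract describes hierarchical clustering on the principal components followed by k-means clustering. The Python sketch below illustrates only the second part, under the assumption that k-means is used to refine (consolidate) the partition obtained by cutting the tree. It is not the repository's R code; the function name and the toy data are hypothetical.

```python
def kmeans_consolidate(points, labels, max_iter=10):
    """Refine an initial partition with k-means iterations.

    points: list of equal-length numeric tuples (e.g. PCA coordinates).
    labels: initial cluster label per point (e.g. from a tree cut).
    """
    labels = list(labels)
    for _ in range(max_iter):
        # Group the points by their current cluster label.
        groups = {}
        for point, label in zip(points, labels):
            groups.setdefault(label, []).append(point)
        # Centroid = coordinate-wise mean of the points in each cluster.
        centroids = {
            label: [sum(coord) / len(members) for coord in zip(*members)]
            for label, members in groups.items()
        }
        # Reassign every point to its nearest centroid (squared distance).
        new_labels = [
            min(
                centroids,
                key=lambda c: sum(
                    (a - b) ** 2 for a, b in zip(point, centroids[c])
                ),
            )
            for point in points
        ]
        if new_labels == labels:  # converged
            break
        labels = new_labels
    return labels


points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (2.0, 2.0)]
initial = [0, 0, 1, 1, 1]  # toy partition standing in for a tree cut
print(kmeans_consolidate(points, initial))
```

In this toy example the last point starts in cluster 1 but lies closer to the centroid of cluster 0, so the consolidation step moves it.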

Fig 4. Key characteristics of the 16 clusters

DATA:

Pie chart (RData | TXT file)
Number of SSMs and SIMs (RData | TXT file)
Key characteristics and overall association to recurrence: HCPC object (RData)

Fig 5. Overview of the 42 features and their association with each cluster

DATA:

PCA object (RData)
All significant associations between features and clusters (RData)

CODE:
Correct the p-values of the v-tests (features versus clusters) for multiple testing and plot the associations (RScript)

Fig 6. Enriched sequence motifs

METHOD: Description

DATA:

Cluster A (C>A mutations):

Nucleotide counts per position in the sequence:
- Recurrent mutations (RData | TXT file)
- Unique mutations (RData | TXT file)
Relative Entropies per nucleotide and Total Entropy per position in the sequence:
- Recurrent mutations (TXT file)
- Unique mutations (TXT file)

Cluster E (C>G mutations):

Nucleotide counts per position in the sequence:
- Recurrent mutations (RData | TXT file)
- Unique mutations (RData | TXT file)
Relative Entropies per nucleotide and Total Entropy per position in the sequence:
- Recurrent mutations (TXT file)
- Unique mutations (TXT file)

Cluster G (C>T mutations):

Nucleotide counts per position in the sequence:
- Recurrent mutations (RData | TXT file)
- Unique mutations (RData | TXT file)
Relative Entropies per nucleotide and Total Entropy per position in the sequence:
- Recurrent mutations (TXT file)
- Unique mutations (TXT file)

Cluster H (C>A mutations):

Nucleotide counts per position in the sequence:
- Recurrent mutations (RData | TXT file)
- Unique mutations (RData | TXT file)
Relative Entropies per nucleotide and Total Entropy per position in the sequence:
- Recurrent mutations (TXT file)
- Unique mutations (TXT file)

Cluster L (T>G mutations):

Nucleotide counts per position in the sequence:
- Recurrent mutations (RData | TXT file)
- Unique mutations (RData | TXT file)
Relative Entropies per nucleotide and Total Entropy per position in the sequence:
- Recurrent mutations (TXT file)
- Unique mutations (TXT file)

CODE:

Add sequence context to SSMs and collect mutations of a specific subtype for a specific group of samples (RScript)
Compute and plot sequence logos (RScript)
Compute statistics for sequence motifs (RScript)
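
Note: The data files above list a relative entropy per nucleotide and a total entropy per position. The Python sketch below shows one common way to derive such values from nucleotide counts at a single position (the Kullback-Leibler divergence against a background distribution). The uniform background of 0.25 per nucleotide is an assumption made for illustration; the background actually used by the RScripts is not stated here, and the function name is hypothetical.

```python
import math


def relative_entropy(counts, background=None):
    """Relative entropy (in bits) at one sequence position.

    counts: dict mapping 'A', 'C', 'G', 'T' to observed counts.
    background: dict of background frequencies (assumed uniform if None).
    Returns (per-nucleotide contributions, total for the position).
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}  # assumption
    total = sum(counts.get(base, 0) for base in "ACGT")
    if total == 0:
        raise ValueError("no observations at this position")
    per_base = {}
    for base in "ACGT":
        p = counts.get(base, 0) / total
        # A nucleotide that is never observed contributes nothing.
        per_base[base] = p * math.log2(p / background[base]) if p > 0 else 0.0
    return per_base, sum(per_base.values())


per_base, total_entropy = relative_entropy({"A": 0, "C": 0, "G": 0, "T": 10})
print(per_base, total_entropy)
```

A position where only one nucleotide occurs reaches the maximum of 2 bits under a uniform background, while a position matching the background scores 0.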

Fig 7. Factors impacting on recurrence in the context of the clusters

Supporting Information

S1 Figure. Clustering tree showing tumour type distribution for 2 to 20 clusters.

DATA:
Table with the samples and the clusters they are assigned to at each cluster resolution (from 2 to 20 clusters) (TXT file)
HCPC objects containing the association of the features to the clusters at a specific cluster resolution (Folder)

S2 Figure. PCA and clustering with and without the recurrence-related features.

DATA:
Table with the samples and the clusters they are assigned to after clustering without and with the recurrence-related features (C and D) (TXT file)

S3 Figure. Enriched sequence motifs for C>G SSMs in cluster M.

METHOD: Description

DATA:

Cluster M (C>G mutations):

Nucleotide counts per position in the sequence:
- Recurrent mutations (RData | TXT file)
- Unique mutations (RData | TXT file)
Relative Entropies per nucleotide and Total Entropy per position in the sequence:
- Recurrent mutations (TXT file)
- Unique mutations (TXT file)

CODE:

Add sequence context to SSMs and collect mutations of a specific subtype for a specific group of samples (RScript)
Compute and plot sequence logos (RScript)
Compute statistics for sequence motifs (RScript)

S4 Figure. Enriched sequence motifs for T>G SSMs in cluster H.

METHOD: Description

DATA:

Cluster H (T>G mutations):

Nucleotide counts per position in the sequence:
- Recurrent mutations (RData | TXT file)
- Unique mutations (RData | TXT file)
Relative Entropies per nucleotide and Total Entropy per position in the sequence:
- Recurrent mutations (TXT file)
- Unique mutations (TXT file)

CODE:

Add sequence context to SSMs and collect mutations of a specific subtype for a specific group of samples (RScript)
Compute and plot sequence logos (RScript)
Compute statistics for sequence motifs (RScript)

S1 Table. Tumour type abbreviation, full name and number of samples.

S2 Table. Recurrence in pan-cancer context and within tumour type(s).

CODE:
Determine the type of recurrence for each recurrent mutation ('pan-cancer only', 'single tumour type', 'multiple tumour types') (RScript)
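
Note: The Python sketch below illustrates how the three labels could be assigned. The definitions are assumptions made for this illustration and may differ from those in the RScript: a mutation is taken to be recurrent within a tumour type when at least two samples of that type carry it, and 'pan-cancer only' when it is seen in at least two samples overall but never twice within the same tumour type. The function name is hypothetical.

```python
def recurrence_type(samples_per_tumour_type):
    """Classify one mutation by where it is recurrent.

    samples_per_tumour_type: dict mapping tumour type to the number of
    samples of that type carrying the mutation.
    Returns a label, or None if the mutation is not recurrent at all.
    """
    total_samples = sum(samples_per_tumour_type.values())
    if total_samples < 2:
        return None  # seen in a single sample: not recurrent
    # Number of tumour types in which the mutation is recurrent by itself.
    recurrent_types = sum(
        1 for n in samples_per_tumour_type.values() if n >= 2
    )
    if recurrent_types == 0:
        return "pan-cancer only"
    if recurrent_types == 1:
        return "single tumour type"
    return "multiple tumour types"


print(recurrence_type({"Skin-Melanoma": 3}))
print(recurrence_type({"Lymph-BNHL": 1, "Lymph-CLL": 1}))
print(recurrence_type({"Lymph-BNHL": 2, "Lymph-CLL": 2}))
```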

S1 Text. Estimation of the levels of recurrence when purely driven by chance.

METHOD: Description

DATA:

Summary tables of 5000 simulations:
- Recurrence pan-cancer per tumour type (RData | TXT file)
- Recurrence within tumour type (RData | TXT file)
- Recurrence per substitution type (RData | TXT file)
- Overall recurrence (RData | TXT file)
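
Note: The Python sketch below illustrates the general idea of estimating recurrence that arises purely by chance. It is a strongly simplified assumption and not the simulation procedure described in S1 Text: mutations are placed uniformly at random on a genome of a given size, and mutation type, sequence context and tumour type are ignored. All names and parameters are hypothetical.

```python
import random


def simulate_chance_recurrence(mutations_per_sample, genome_size,
                               n_simulations=100, seed=0):
    """Mean number of positions hit in >= 2 samples under uniform placement.

    mutations_per_sample: list with the mutation count of each sample.
    genome_size: number of positions available (toy value in the example).
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    recurrent_counts = []
    for _ in range(n_simulations):
        seen = set()       # positions hit by at least one sample
        recurrent = set()  # positions hit by at least two samples
        for n_mutations in mutations_per_sample:
            # Positions are distinct within a sample.
            for pos in rng.sample(range(genome_size), n_mutations):
                if pos in seen:
                    recurrent.add(pos)
                else:
                    seen.add(pos)
        recurrent_counts.append(len(recurrent))
    return sum(recurrent_counts) / len(recurrent_counts)


# Toy numbers: three samples with 50 mutations each on 10,000 positions.
print(simulate_chance_recurrence([50, 50, 50], genome_size=10_000))
```

Repeating the simulation many times (the summary tables above are based on 5000 simulations) gives a distribution against which the observed recurrence can be compared.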

CODE:

S2 Text. Recurrence versus general mutational characteristics.

DATA: Features in absolute terms and percentages (RData | TXT file)

CODE: Estimate for frequency of homopolymers in the genome (RScript)
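
Note: The Python sketch below illustrates how homopolymer runs can be counted in a DNA sequence. It is an illustration only and not the repository's RScript; the minimum run length of 2 and the function name are assumptions.

```python
from collections import Counter
from itertools import groupby


def homopolymer_counts(sequence, min_length=2):
    """Count homopolymer runs per (nucleotide, run length).

    Runs of characters other than A, C, G and T (e.g. N) are skipped.
    """
    counts = Counter()
    # groupby yields one group per run of consecutive identical characters.
    for base, run in groupby(sequence.upper()):
        length = sum(1 for _ in run)
        if base in "ACGT" and length >= min_length:
            counts[(base, length)] += 1
    return counts


print(homopolymer_counts("AAACGTTTTNNA"))
```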

S3 Text. Detailed cluster-specific descriptions.

DATA: Features in absolute terms and percentages plus metadata (RData | TXT file)

CODE:

Density plots of the replication time scores (RScript)
Plot the number of recurrent mutations along the genome for Lymph-BNHL and Lymph-CLL in cluster M (RScript)

S4 Text. Smoking history and related mutational subtypes.

S1 File. Characteristic plots summarising each of the 42 features.

S2 File. Sample distribution per tumour type across the 16 clusters.

biomedicalGenomicsCNAG/RecurrentMutations
