Harmonized single-cell perturbation data 📊 New protein folds in the virome 🦠 Reproducible genome assembly in Galaxy 🌌 AI's impact on genomics 🤖

Bioinformer Weekly Roundup

Stay Updated with the Latest in Bioinformatics!

Issue: 22 | Date: 2 February 2024

👋 Welcome to the Bioinformer Weekly Roundup!

In this newsletter, we curate and bring you the most captivating stories, developments, and breakthroughs from the world of bioinformatics. Whether you're a seasoned researcher, a student, or simply curious about the intersection of biology and data science, we've got you covered. Subscribe now to stay ahead in the exciting realm of bioinformatics!

🔬 Featured Research

Accurate quantification of single-cell and single-nucleus RNA-seq transcripts using distinguishing flanking k-mers | bioRxiv

This article introduces a method that improves read mapping in single-cell and single-nucleus RNA sequencing by expanding the 'region of interest' to include both nascent and mature mRNA. It uses distinguishing flanking k-mers (DFKs) to improve mRNA quantification accuracy when transcripts at different processing stages coexist in the same sample.

🧵 Lior Pachter has posted a Twitter thread describing the motivation and results from the article.

scPerturb: harmonized single-cell perturbation data | Nature Methods

This article discusses the creation of scPerturb, an information resource containing 44 publicly available single-cell perturbation-response datasets with various molecular readouts. These datasets undergo uniform quality control and feature annotation harmonization to enhance data interoperability. The paper introduces energy statistics (E-statistics) for quantifying perturbation effects and significance testing, along with E-distance as a general distance measure for single-cell expression profiles.

🧵 Tessa Green has posted a Twitter thread introducing the resource and describing its development.
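
For readers who want a feel for the E-distance mentioned above, here is a minimal sketch of the classical energy distance between two groups of cells, with a permutation test for significance. It assumes plain Euclidean distance on toy two-dimensional profiles; the function names, the simulated data, and the number of permutations are illustrative choices and are not taken from the scPerturb code, whose exact distance and preprocessing may differ.

```python
import math
import random


def mean_distance(a, b):
    """Mean Euclidean distance over all pairs (one point from a, one from b)."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))


def e_distance(x, y):
    """Energy distance: twice the between-group mean distance minus both within-group means."""
    return 2 * mean_distance(x, y) - mean_distance(x, x) - mean_distance(y, y)


def e_test(x, y, n_permutations=200, seed=0):
    """Permutation test: how often does a random relabelling reach the observed E-distance?"""
    rng = random.Random(seed)
    observed = e_distance(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if e_distance(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    # Add-one correction so the p-value is never exactly zero.
    return observed, (hits + 1) / (n_permutations + 1)


# Toy data (assumption): 20 control cells and 20 perturbed cells shifted by 3 units.
data_rng = random.Random(1)
control = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(20)]
perturbed = [(data_rng.gauss(3, 1), data_rng.gauss(3, 1)) for _ in range(20)]

observed, p_value = e_test(control, perturbed)
print(f"E-distance: {observed:.3f}, permutation p-value: {p_value:.4f}")
```

A group compared with itself gives an E-distance of zero, and a larger value indicates a stronger shift relative to the spread within each group.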

Randomizing the human genome by engineering recombination between repeat elements | bioRxiv

This study addresses the challenge of understanding the non-coding portion of the human genome by developing a toolbox for creating structural variants at scale using CRISPR prime editing. The toolbox enables the generation of deletions, inversions, translocations, and extrachromosomal circular DNA in human cell lines, resulting in thousands of clonal insertions and rearrangements. Analysis of these structural changes reveals selection pressures favoring shorter variants that often delete growth-inhibiting genes, while translocations are depleted.

An atlas of cells in the human tonsil | Immunity

This article describes the creation of a comprehensive human tonsil atlas from over 556,000 cells using five data modalities, identifying 121 cell types and states, and providing insights into tonsil development, functional units, and cell maturation. It also highlights the atlas's application in annotating cells from B cell-derived mantle cell lymphomas, connecting their transcriptional diversity to normal tonsil B cell differentiation states.

Aggregation of recount3 RNA-seq data improves inference of consensus and tissue-specific gene co-expression networks | bioRxiv

This study outlines best practices for inferring and evaluating gene co-expression networks (GCNs) through data aggregation. The recommendations include estimating and regressing confounders in each dataset before aggregation and prioritizing larger sample size studies for GCN reconstruction. Increased statistical power in inferring context-specific networks allows for deriving variant annotations enriched for concordant trait heritability, independent of context-agnostic functional genomic annotations.

SHIFTR enables the unbiased identification of proteins bound to specific RNA regions in live cells | Nucleic Acids Research

This research presents SHIFTR (Selective RNase H-mediated interactome framing for target RNA regions), a novel method for identifying proteins that interact with specific regions within endogenous RNAs in live cells using mass spectrometry. SHIFTR is demonstrated to be highly accurate, with minimal background interactions, and it requires significantly lower input material compared to existing techniques.

🧵 Mathias Munschauer has posted a Twitter thread describing the method and its applications.

Birth of new protein folds and functions in the virome | bioRxiv

This study leverages a database of newly predicted protein structures from various eukaryotic viral species to uncover insights into the evolution and function of viral proteins. It reveals that a significant portion (62%) of viral proteins are evolutionarily young and lack homologs in the AlphaFold database, while some ancient viral proteins have structural homologs in non-viral proteins, indicating similarities between human pathogens and their eukaryotic hosts.

A Biophysical Model for ATAC-seq Data Analysis | bioRxiv

This study proposes a biophysical model for chromatin dynamics and transcription to assess the advantages of multiome data over unregistered single-cell RNA-seq and single-cell ATAC-seq data. Additionally, the model offers a biophysically grounded approach to integrate open chromatin data with other modalities in a comprehensive manner.

Gene regulatory patterning codes in early cell fate specification of the C. elegans embryo | eLife Developmental Biology

This study explores the mechanisms of pattern formation in C. elegans embryogenesis, diverging from the syncytium-based model seen in Drosophila. By analyzing single-cell RNA-Seq data from 1- to 102-cell stage embryos, researchers identified 119 embryonic cell-states and observed modular gene expression programs along sub-cell lineages. The study highlights the pivotal role of homeodomain genes in establishing lineage-specific positioning from the 28-cell stage and finds Drosophila segmentation gene orthologs in C. elegans with sub-lineage specific expression, suggesting a deep homology in cell fate specification programs across different species.

Isoform-level profiling of m6A epitranscriptomic signatures in human brain | bioRxiv

This study uses Oxford Nanopore direct RNA sequencing to map N6-methyladenosine (m6A) modifications at the isoform level across three human brain regions. It reports over 57,000 m6A sites within 15,000 isoforms, and shows that modification patterns and polyA tail lengths vary significantly with brain region and neuronal cell type, offering new insights into the epitranscriptomic landscape of the human brain.

The Scalable Variant Call Representation: Enabling Genetic Analysis Beyond One Million Genomes | bioRxiv

This article introduces the Scalable Variant Call Representation (SVCR), a solution to the scalability issues of the Variant Call Format (VCF) in genome sequencing. By incorporating reference blocks and local allele indices, SVCR reduces file sizes and ensures linear scalability, offering a lossless, mergeable format suitable for large-scale genetic datasets, with implementations in SVCR-VCF and Hail's VDS format demonstrating efficiency and the potential for rapid data analysis.
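
The two ideas named above, local allele indices and reference blocks, can be illustrated with a toy sketch. This is a simplified illustration under stated assumptions, not the SVCR specification: the function names and input shapes are hypothetical, and real reference blocks typically also carry quality information that is omitted here.

```python
def to_local_alleles(global_genotype):
    """Re-index a genotype against only the alleles this sample carries.

    global_genotype holds indices into the site-wide allele list, where 0 is
    the reference allele. Returns (local_alleles, local_genotype), where
    local_genotype indexes into local_alleles.
    """
    local_alleles = sorted(set(global_genotype) | {0})  # always keep the reference
    position = {allele: i for i, allele in enumerate(local_alleles)}
    return local_alleles, tuple(position[a] for a in global_genotype)


def reference_blocks(calls):
    """Collapse runs of adjacent homozygous-reference positions into blocks.

    calls is a position-sorted list of (position, is_hom_ref) pairs.
    Returns inclusive (start, end) intervals.
    """
    blocks = []
    start = prev = None
    for pos, is_hom_ref in calls:
        if is_hom_ref and start is not None and pos == prev + 1:
            prev = pos  # extend the current block
            continue
        if start is not None:
            blocks.append((start, prev))  # close the current block
            start = prev = None
        if is_hom_ref:
            start = prev = pos  # open a new block
    if start is not None:
        blocks.append((start, prev))
    return blocks


# A sample carrying the reference and allele 7 at a site with many alleles
# only needs to store two alleles, however many exist across the cohort.
local_alleles, local_genotype = to_local_alleles((0, 7))

# Six positions collapse to two blocks around the single variant at position 3.
blocks = reference_blocks(
    [(1, True), (2, True), (3, False), (4, True), (5, True), (6, True)]
)
print(local_alleles, local_genotype, blocks)
```

The point of the sketch is that per-sample storage depends on what that sample carries rather than on the total number of alleles or samples in the dataset, which is what makes merging many genomes tractable.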

🛠️ Latest Tools

Semi-supervised integration of single-cell transcriptomics data | Nature Communications

This article introduces STACAS, a semi-supervised batch correction method for single-cell RNA-seq data designed to preserve biological variability during data integration by incorporating prior knowledge of cell types. Through benchmarking, STACAS demonstrates superior performance over leading unsupervised and supervised methods, offering scalability for large datasets and robustness against inaccuracies in cell type labels. The study advocates for the routine inclusion of cell type information in single-cell data integration and offers a flexible framework for achieving effective batch effect correction.

🧵 Santiago Carmona has posted a Twitter thread providing an overview of the article and results.

CHOIR improves significance-based detection of cell types and states from single-cell data | bioRxiv

This research paper introduces CHOIR (clustering hierarchy optimization by iterative random forests), a clustering tool designed to address the issue of statistical inference testing in single-cell data analysis. CHOIR utilizes random forest classifiers and permutation tests within a hierarchical clustering framework to determine distinct cell populations accurately. Extensive benchmarking against 14 existing clustering methods across various datasets demonstrates CHOIR's enhanced performance and its applicability to diverse single-cell data types, offering a flexible and robust solution for identifying biologically relevant cell groupings.

scDisInFact: disentangled learning for integration and prediction of multi-batch multi-condition single-cell RNA-sequencing data | Nature Communications

This article introduces scDisInFact, a deep learning framework designed to distinguish between technical batch effects and biological condition effects in single-cell RNA-sequencing (scRNA-seq) data, facilitating batch effect removal, identification of condition-associated key genes, and perturbation prediction. Tested on simulated and real datasets, scDisInFact demonstrates superior performance compared to existing methods, offering a holistic and precise tool for analyzing scRNA-seq data across various batches and conditions.

🧵 Ziqi Zhang has posted a Twitter thread summarising the deep learning framework and its applications.

scCensus: Off-target scRNA-seq reads reveal meaningful biology | bioRxiv

This article introduces scCensus, a workflow for evaluating off-target reads in single-cell RNA-sequencing data, revealing that these reads provide valuable insights into chromatin structure, antisense transcription, and cell clustering, thereby suggesting the potential to enhance gene detection and the understanding of transcriptional activities by integrating data from spliced and unspliced reads.

From Planning Stage To FAIR Data: A Practical Metadatasheet For Biomedical Scientists

This article introduces Metadatasheet, a new metadata standard developed from interviews with biomedical consortia members and data repository screenings, designed to integrate with the data lifecycle and facilitate metadata recording in Microsoft Excel. It offers features like automation, dynamic adaptation, integrity checks, and export options to various metadata standards, aiming to streamline metadata management for biomedical researchers, enhance collaboration, and prevent data loss, thus potentially accelerating scientific progress.

Scalable, accessible and reproducible reference genome assembly and evaluation in Galaxy | Nature Biotechnology

This article discusses the Earth BioGenome Project's objective to create reference genomes for approximately 1.8 million eukaryotic species and the need to significantly increase genome production speed. The paper highlights three key areas of focus: optimizing genome assembly and best practices, providing computational infrastructure, and disseminating knowledge and training. The authors present a pipeline within the Galaxy ecosystem that combines PacBio HiFi reads and long-distance information from Hi-C maps to generate comprehensive assemblies, emphasizing the importance of quality control and recommended coverage levels for accuracy.

Benchmarking of computational methods for m6A profiling with Nanopore direct RNA sequencing | Briefings in Bioinformatics

This research presents NanOlympicsMod, a containerized Nextflow pipeline that compares the performance of 14 tools for detecting N6-methyladenosine (m6A) modifications in direct RNA sequencing (dRNA-seq) data. The pipeline was tested on synthetic oligos with known m6A positions as well as dRNA-seq datasets from yeast, mouse, and human samples, comparing the tools' predictions to reference m6A sets generated by orthogonal methods.

Scalable and efficient DNA sequencing analysis on different compute infrastructures aiding variant discovery | bioRxiv

This article presents nf-core/sarek version 3, an updated variant calling and annotation pipeline for analyzing both germline and somatic samples. It highlights the pipeline's adaptability to any genome with a known reference, along with improved storage and runtime efficiency from using the CRAM format and increased intra-sample parallelization, which together yield a 70% cost reduction on commercial clouds. This enables large-scale, cross-platform data analysis with lower costs and CO2 emissions. The updated code is accessible online.

📰 Community News

Can Artificial Intelligence accelerate the impact of genomics? | Genomics England

This podcast focuses on the significant impact of artificial intelligence (AI) on genomics. It explores AI's potential to transform patient care, addresses public opinions on the integration of AI in genomics, and delves into the ethical issues emerging from advancements in this field. Experts share their insights on the future interplay between genomics and AI, offering predictions about developments to anticipate.

Is the Future of Medicine Hidden in Ancient DNA? | The Daily

This podcast highlights a significant breakthrough where DNA analysis of Bronze Age skeletons is offering insights into contemporary medical conditions, with Carl Zimmer detailing how this emerging research area is revolutionizing our approach to developing treatments for severe diseases.

If I leave academia, can I come back? | EMBL Career Webinar

This webinar is designed for early-career researchers who are contemplating leaving academia or have already left, and focuses on the potential for returning. It features insights from speakers Jennifer Rohn and Algirdas Toleikis, who discuss their career paths and the opportunities and challenges associated with moving between academic and non-academic roles.

The pharma industry from Paul Janssen to today: why drugs got harder to develop and what we can do about it | Alex Telford

This blog explores Paul Janssen's successful career in pharmaceuticals, highlighting how he founded his company in 1953 and developed over 70 new medicines from the 1950s to the 1990s. It contrasts his prolific output with the challenges faced by modern drug discovery scientists, who struggle with high costs and low success rates. The article argues that the pharmaceutical industry's productivity crisis is not due to simple factors but stems from the increasing complexity and cost of drug discovery and development over time. It raises questions about how Janssen's start-up achieved such success in a different pharmaceutical landscape.

Making bioinformatics training FAIR: the EMBL-EBI training portal | Frontiers in Bioinformatics

This article describes the redesign and restructuring of the EMBL-EBI Training website to improve the findability and usability of its data-driven life sciences training materials, using FAIR principles and Agile Scrum methodology, resulting in increased user engagement and the ability to track learning progress.

📅 Upcoming Events

Single Cell Genomics Day: A (Virtual) Practical Workshop | Satija Lab

Explore the latest advances in molecular biology, multiplexed imaging, and computational biology. Learn about cutting-edge technologies and computational methods to empower your single cell genomics research. This free workshop welcomes beginners and experts alike, featuring keynote presentations from renowned experts and live streaming of all talks.

📚 Educational Corner

Organizing scRNA-seq data with LaminDB | Valentine Svensson

This blog discusses LaminDB, a tool designed to handle the organization and management of large-scale scRNA-seq experiments and other biological data. It offers plugins for managing ontologies specific to scRNA-seq data and supports the AnnData format. LaminDB tracks metadata and data locations, making it useful for both local and cloud-based storage of biological data.

SNK (Snek) | Wytamma Wirth

Snk, pronounced "snek," is a workflow management system based on Snakemake. It enables the installation of Snakemake workflows as dynamically generated Command Line Interfaces (CLIs). This approach enhances interoperability and allows for the integration of complex workflows as modular components within a larger system.

Six not-so-basic base R functions | Isabella Velásquez

This blog post highlights the versatility of base R functions, emphasizing the possibility of accomplishing impressive tasks without the need for additional packages. It discusses several lesser-known base R functions and provides examples of their use.

How Difficult Is It To Start Your Single-cell Analysis As A Beginner | Xi Chen

This blog delves into the challenges associated with initiating single-cell analysis, primarily stemming from the intricacies involved in setting up the necessary software. It provides a comprehensive guide to different installation approaches, serving as a valuable resource for individuals grappling with this aspect of their analysis journey.

🔗 Connect with Us

Stay connected and engage with us on social media for daily updates, discussions, and more!

📬 Subscribe

Don't miss an issue! Subscribe to the Bioinformer Weekly Roundup and receive the latest insights directly in your inbox.

Subscribe Now

We hope you enjoyed this week's edition of the Bioinformer Weekly Roundup. Feel free to share it with your colleagues and friends who share your passion for bioinformatics!


Disclaimer: The information provided in this newsletter is for educational and informational purposes only and does not constitute professional advice.

Editor: James Ashmore | Contact: bioinformatics@zifornd.com

Copyright © 2024, Bioinformer Weekly Roundup. All rights reserved.

