Important considerations for using UCE and potentially other #foundationmodels like Geneformer and scGPT 💡
In a standard scRNA-seq analysis pipeline, you select the top ~2000 variable genes for downstream analysis (eg. clustering). However, my recent experiment suggests that you should not do this for foundation models. Here is what I did... The Universal Cell Embeddings (UCE) foundation model, part of a bigger "virtual cell" initiative, takes a raw cells x counts matrix as input and outputs a 1280 dimensional vector that contains biological meaning as output. This is then used for downstream analysis. The power here is that you get the same vectors every time. There is no fine-tuning of the model. So you can make comparisons with any datasets that have never been run through the model, and therefore do things like annotate, given metadata cells from other datasets. As I said in a previous post, this can take a long time if you're running it locally. One hypothesis, inspired by one of the comments, was that I could put in an abbreviated dataset of only variable genes, and get a faster result without sacrificing accuracy - a good thing when computational resources are limited. Experimental design: I ran the following 3 datasets through UCE. 1. The full dataset (positive control). 2. The dataset containing the most variable genes (experimental). 3. The dataset containing a random selection of genes (negative control). My results: I found that the dataset containing the most variable genes did not have the same level of cell type separation compared to the full dataset, with the negative control performing worse than both of them. This can be seen by assessing PCA space of the concatenated data (image below). Further quantification via Shannon entropy (to measure diversity) confirms this (see my jupyter notebook in the comments). What this means for you: This suggests that for UCE, and perhaps for other foundation models (geneformer, scGPT), you should run the full dataset through it to get the best results, and the typical practice of only selecting variable genes may not apply to the use of foundation models. Zooming out: There has been an uptick in people asking me questions around AI as it relates to single-cell in the past few weeks (perhaps because I'm posting about it). Even if you're a natural skeptic (like me), you should at least be familiar with them, because like the black boxes before it (eg. t-SNE/UMAP), these tools don't appear to be going anywhere. And they do indeed have potential to accelerate our workflows. If you are doing work in this space, or interested in doing work in this space, please let me know. A jupyter notebook showing my work is linked in the comments. I hope you all have a great day.