Gene Set Foundation Model

Daniel J. B. Clarke1, Giacomo Marino1, Avi Ma’ayan1

1Department of Pharmacological Sciences, Department of Artificial Intelligence and Human Health, Mount Sinai Center for Bioinformatics, Icahn School of Medicine at Mount Sinai, One Gustave L. Levy Place, Box 1603, New York, NY USA

Abstract

Foundation models have proven highly effective in several scientific and commercial domains. Trained on massive datasets, they capture complex patterns within the data and encode these patterns as “embeddings” that can be used for a variety of downstream applications. Here we created a gene-centric foundation model trained on millions of unlabeled gene sets automatically extracted from tables within the supplemental materials of published research articles listed on PubMed Central (PMC). Several model architectures and training methodologies were compared with other state-of-the-art gene function prediction methods and models. The models were evaluated on their ability to predict gene function by holding out labeled genes from gene sets created from the Gene Ontology Biological Processes (GO-BP), Kyoto Encyclopedia of Genes and Genomes (KEGG), Genome-wide Association Study (GWAS) Catalog, and ChIP-X Enrichment Analysis (ChEA). One of the foundation models that we developed achieves superior performance in predicting gene function for genes that are members of sets created from both literature-based and data-driven sources. Next, to apply the model to a practical task, we systematically augmented gene sets from a variety of gene set libraries. The gene function predictions and the model are accessible for download, search, and visualization through an easy-to-use website.

Results

Figure 1. GSFM benchmark results. Median area under the receiver operating characteristic curve (AUROC) across all terms in each benchmarking library. GSFM Rummagene is the model used for predictions on this site.
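To make the summary statistic concrete, the following is a minimal sketch of how a median AUROC across terms could be computed from held-out genes. It is an illustration under stated assumptions, not the authors' benchmarking code: the function names, the dictionary layout, and the toy scores are all hypothetical, and the AUROC is computed with a simple pairwise (Mann-Whitney) formulation that is only practical for small inputs.

```python
from statistics import median


def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative gene")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def median_auroc(term_scores, term_held_out):
    """Median of per-term AUROCs.

    term_scores:   {term: {gene: predicted score}} (hypothetical layout)
    term_held_out: {term: set of held-out genes that belong to the term}
    """
    per_term = []
    for term, scores in term_scores.items():
        genes = list(scores)
        labels = [g in term_held_out[term] for g in genes]
        per_term.append(auroc([scores[g] for g in genes], labels))
    return median(per_term)


# Toy data: made-up scores for four genes and two terms (not real results).
toy_scores = {
    "term_A": {"G1": 0.9, "G2": 0.2, "G3": 0.7, "G4": 0.1},
    "term_B": {"G1": 0.3, "G2": 0.8, "G3": 0.4, "G4": 0.6},
}
toy_held_out = {"term_A": {"G1", "G3"}, "term_B": {"G2", "G3"}}

print(median_auroc(toy_scores, toy_held_out))  # 0.875
```

In this toy case the per-term AUROCs are 1.0 and 0.75, so the median is 0.875. For libraries with thousands of genes per term, a rank-based implementation would be a more efficient choice than the pairwise loop used here.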

Figure 2. GSFM predictions. Median area under the receiver operating characteristic curve (AUROC) distributions across all terms in all libraries used for prediction on this site.

Data Availability

The foundation model source code and weights, along with instructions for accessing them locally, are available on Hugging Face at: https://huggingface.co/maayanlab/gsfm
The source code for the website is available at: https://github.com/maayanlab/gsfm-predictions
All benchmark results are available at: https://gsfm.maayanlab.cloud/downloads