scRAPID-web Documentation

scRAPID-web implements the scRAPID pipeline to predict protein-RNA interactions from single-cell RNA-sequencing (scRNA-seq) data (Fiorentino et al. [Nucleic Acids Res. 2024]). scRAPID integrates gene regulatory network (GRN) inference algorithms with predictions from catRAPID (Bellucci et al. [Nat Methods. 2011]), a tool that predicts the interaction propensity of protein-RNA pairs based on their secondary structure and physicochemical properties. Beyond predicting binary protein-RNA interactions, scRAPID identifies hub RBPs and RNAs (Stock, Popp et al. [Bioinformatics 2024]) as well as RBP-RBP interactions.

How it works

The main input of scRAPID is a gene expression matrix obtained from single-cell or single-nucleus RNA-seq experiments, containing Unique Molecular Identifier (UMI) counts or read counts. The first steps of the pipeline are gene filtering, normalization of the expression matrix, mapping to Ensembl gene IDs, and selection of highly variable RBPs and genes according to the user's settings. GRN inference is then performed on the expression data of the highly variable genes. The supported GRN inference algorithms are GRNBoost2, TENET, and DeepSEM. GRNBoost2 employs gradient-boosting regression models to uncover relationships between target genes and transcription factors, and is optimized for handling large datasets efficiently (Moerman et al. [Bioinformatics 2019]). TENET reconstructs GRNs from scRNA-seq data by ordering cells along pseudotime and using transfer entropy to identify directed causal relationships (Kim et al. [Nucleic Acids Res. 2021]). DeepSEM employs a neural network version of the Structural Equation Model (SEM) to reconstruct a GRN and to embed and simulate scRNA-seq data (Shu et al. [Nat Comput Sci 2021]).

After the selected GRN inference method is applied, the resulting network is filtered to include only directed RBP-target interactions. Indirect interactions are then removed using catRAPID interaction propensities precomputed on the RBP and RNA libraries. The cutoff applied to the catRAPID interaction propensity is an important parameter of the scRAPID pipeline. We use a cutoff of 30 for mRNAs, lncRNAs and pseudogenes, originally calibrated on multiple scRNA-seq datasets and GRN inference algorithms (Fiorentino et al. [Nucleic Acids Res. 2024]). For short non-coding RNAs we recalculated the cutoff and obtained an optimal value of 8.
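The catRAPID-based filtering step can be illustrated with a minimal sketch. It is not the scRAPID implementation: the function name, the edge layout, the lookup dictionaries, the choice to keep edges scoring at or above the cutoff, and the choice to drop pairs missing from the library are all assumptions made for illustration. Only the cutoff values (30 and 8) come from the text.

```python
# Cutoffs from the text: 30 for mRNAs, lncRNAs and pseudogenes; 8 for short non-coding RNAs.
CUTOFFS = {
    "protein-coding": 30.0,
    "long non-coding": 30.0,
    "pseudogene": 30.0,
    "small non-coding": 8.0,
}

def filter_edges(edges, propensity, category):
    """Keep RBP-target edges whose catRAPID propensity reaches the category cutoff.

    edges      : iterable of (rbp, target, weight) tuples (hypothetical layout)
    propensity : dict mapping (rbp, target) -> catRAPID interaction propensity
    category   : dict mapping target -> one of the keys of CUTOFFS
    """
    kept = []
    for rbp, target, weight in edges:
        score = propensity.get((rbp, target))
        if score is None:
            continue  # assumption: pairs absent from the library are dropped
        if score >= CUTOFFS[category[target]]:  # assumption: ">=" rather than ">"
            kept.append((rbp, target, weight))
    return kept

# Toy data with made-up identifiers and scores.
edges = [("RBP1", "geneA", 0.9), ("RBP1", "geneB", 0.7), ("RBP2", "snoX", 0.5)]
propensity = {("RBP1", "geneA"): 45.2, ("RBP1", "geneB"): 12.0, ("RBP2", "snoX"): 9.1}
category = {"geneA": "protein-coding", "geneB": "long non-coding", "snoX": "small non-coding"}
filtered = filter_edges(edges, propensity, category)
```

In this toy example the RBP1-geneB edge is discarded because its propensity (12.0) is below 30, while the RBP2-snoX edge survives because small non-coding RNAs use the lower cutoff of 8.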

Hub RBPs, RNAs, and lncRNAs (if present) are identified as described in (Fiorentino et al. [Nucleic Acids Res. 2024], Stock, Popp et al. [Bioinformatics 2024]). RBP-RBP interactions are computed based on the target overlap of RBP pairs, quantified via the Jaccard coefficient.
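As an illustration of the target-overlap measure, the Jaccard coefficient of two target sets A and B is |A ∩ B| / |A ∪ B|. The sketch below computes it for every RBP pair; the function names and the dictionary layout are hypothetical, and returning 0.0 for two empty target sets is an assumption.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0  # assumption: 0.0 if both sets are empty

def rbp_pair_overlap(targets):
    """targets: dict mapping RBP -> set of target RNAs (hypothetical layout)."""
    return {
        (p, q): jaccard(targets[p], targets[q])
        for p, q in combinations(sorted(targets), 2)
    }

# Toy data with made-up identifiers.
targets = {
    "RBP1": {"g1", "g2", "g3"},
    "RBP2": {"g2", "g3", "g4"},
    "RBP3": {"g5"},
}
scores = rbp_pair_overlap(targets)
```

Here RBP1 and RBP2 share two of four distinct targets, so their coefficient is 0.5, while pairs with no common targets score 0.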

Precompiled libraries

Human and mouse RBP libraries were obtained by expanding the RBP libraries from the catRAPID omics v2.0 web server (Armaos, Colantoni et al. [Nucleic Acids Res. 2021]) using orthology information and integrating them with proteins scoring ≥ 10 in the RBP2GO database (Armaos, Colantoni et al. [Nucleic Acids Res. 2021]). For Arabidopsis thaliana, Caenorhabditis elegans, Danio rerio, and Drosophila melanogaster, the RBPs were taken directly from the catRAPID omics v2.0 libraries and supplemented with proteins with an RBP2GO score ≥ 10. For Rattus norvegicus and Xenopus tropicalis, which are not covered by the RBP2GO database, protein sequences were retrieved from UniProt (release 2024_03) (UniProt Consortium [Nucleic Acids Res. 2023]).

RNA libraries were obtained from Ensembl 107 (Cunningham et al. [Nucleic Acids Res. 2022]), except for those of Arabidopsis thaliana, which were gathered from Ensembl Plants 56 (Yates et al. [Nucleic Acids Res. 2022]).

Transcripts were divided into 4 categories, based on the gene biotype:

protein-coding: protein_coding
long non-coding: lincRNA, lncRNA, antisense, antisense_RNA, sense_intronic, sense_overlapping, processed_transcript, TEC
small non-coding: ribozyme, snRNA, sRNA, snoRNA, miRNA, pre-miRNA, scaRNA, misc_RNA, vault_RNA, scRNA, Y_RNA, piRNA
pseudogene: pseudogene, polymorphic_pseudogene, processed_pseudogene, unprocessed_pseudogene, unitary_pseudogene, transcribed_processed_pseudogene, transcribed_unprocessed_pseudogene, transcribed_unitary_pseudogene, translated_processed_pseudogene, translated_unprocessed_pseudogene, translated_unitary_pseudogene, nontranslating_CDS
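The four categories above amount to a lookup from gene biotype to category, which can be sketched as follows. The biotype lists are copied from the text; the variable and function names are hypothetical, and returning None for biotypes outside the four categories is an assumption.

```python
# Biotype lists as given in the text; names of the variables are illustrative.
CATEGORIES = {
    "protein-coding": ["protein_coding"],
    "long non-coding": [
        "lincRNA", "lncRNA", "antisense", "antisense_RNA", "sense_intronic",
        "sense_overlapping", "processed_transcript", "TEC",
    ],
    "small non-coding": [
        "ribozyme", "snRNA", "sRNA", "snoRNA", "miRNA", "pre-miRNA",
        "scaRNA", "misc_RNA", "vault_RNA", "scRNA", "Y_RNA", "piRNA",
    ],
    "pseudogene": [
        "pseudogene", "polymorphic_pseudogene", "processed_pseudogene",
        "unprocessed_pseudogene", "unitary_pseudogene",
        "transcribed_processed_pseudogene", "transcribed_unprocessed_pseudogene",
        "transcribed_unitary_pseudogene", "translated_processed_pseudogene",
        "translated_unprocessed_pseudogene", "translated_unitary_pseudogene",
        "nontranslating_CDS",
    ],
}

# Invert the table so that each biotype points to its category.
BIOTYPE_TO_CATEGORY = {
    biotype: category
    for category, biotypes in CATEGORIES.items()
    for biotype in biotypes
}

def categorize(biotype):
    """Return the transcript category, or None for biotypes outside the four categories."""
    return BIOTYPE_TO_CATEGORY.get(biotype)
```

For example, categorize("snoRNA") gives "small non-coding" and categorize("TEC") gives "long non-coding".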

For Drosophila melanogaster, transcripts at least 200 nucleotides long, with the ncRNA gene biotype, and containing the term "lncRNA", "asRNA", "sisRNA", "Su(Ste)", "Uhg", "hpRNA", or "RNaseMRP" in the gene name were classified as long non-coding RNAs. All other ncRNAs were included in the small non-coding RNA library.
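The Drosophila rule can be written as a small predicate. This is a sketch under stated assumptions: the function name is hypothetical, the terms are matched as case-sensitive substrings of the gene name, and the input is assumed to be a transcript whose gene biotype is already known to be ncRNA.

```python
# Terms listed in the text that mark a long non-coding RNA gene name.
LNC_TERMS = ("lncRNA", "asRNA", "sisRNA", "Su(Ste)", "Uhg", "hpRNA", "RNaseMRP")

def classify_fly_ncrna(gene_name, length):
    """Classify a D. melanogaster transcript with the ncRNA gene biotype.

    Assumption: terms are matched as case-sensitive substrings of the gene name.
    """
    if length >= 200 and any(term in gene_name for term in LNC_TERMS):
        return "long non-coding"
    return "small non-coding"

# Made-up gene names and lengths, for illustration only.
examples = [
    ("lncRNA:example1", 1500),  # long enough and contains "lncRNA"
    ("asRNA:example2", 150),    # contains "asRNA" but shorter than 200 nt
    ("snoRNA:example3", 300),   # long enough but no listed term
]
labels = [classify_fly_ncrna(name, length) for name, length in examples]
```

Only the first example satisfies both the length and the name criterion, so the other two fall into the small non-coding RNA library.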

Transcripts are identified internally by their Ensembl Gene ID. For genes with several transcript isoforms, only the canonical isoform annotated by Ensembl was kept; if no canonical isoform was annotated, the longest isoform was selected. For genes with a protein-coding biotype, we ensured that the chosen transcript also had a protein-coding transcript biotype. Transcript sequences were searched for instances of RNA-binding motifs using the same database and criteria as the catRAPID omics v2.0 web server. For RBPs without known motifs, we used the motifs of the most similar RBPs with at least 70% sequence identity, when available; similar sequences were identified with MMseqs2 (Steinegger et al. [Nat Biotechnol. 2017]).
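The isoform selection rule can be sketched as below. The record layout (keys "id", "length", "biotype", "is_canonical") and the function name are hypothetical, and restricting protein-coding genes to protein-coding isoforms before applying the canonical/longest rule is only one possible way to satisfy the constraint described above.

```python
def pick_transcript(isoforms, gene_biotype):
    """Pick one representative transcript for a gene.

    isoforms: non-empty list of dicts with keys "id", "length", "biotype",
              "is_canonical" (hypothetical record layout).
    """
    candidates = isoforms
    if gene_biotype == "protein_coding":
        # Assumption: prefer protein-coding isoforms when the gene is protein-coding.
        coding = [t for t in isoforms if t["biotype"] == "protein_coding"]
        candidates = coding or isoforms
    canonical = [t for t in candidates if t["is_canonical"]]
    if canonical:
        return canonical[0]
    # No canonical isoform among the candidates: fall back to the longest one.
    return max(candidates, key=lambda t: t["length"])

# Made-up isoforms for illustration.
isoforms = [
    {"id": "T1", "length": 900, "biotype": "retained_intron", "is_canonical": False},
    {"id": "T2", "length": 500, "biotype": "protein_coding", "is_canonical": False},
    {"id": "T3", "length": 700, "biotype": "protein_coding", "is_canonical": False},
]
chosen = pick_transcript(isoforms, "protein_coding")
```

With no canonical isoform flagged, the longest protein-coding isoform (T3) is chosen even though a longer non-coding isoform (T1) exists.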

To compute catRAPID interaction propensities of protein-RNA pairs, we employed the fragmentation-based method used by the RNAct database (Lang et al. [Nucleic Acids Res. 2019]), the catRAPID omics v2.0 web server (Armaos, Colantoni et al. [Nucleic Acids Res. 2021]), and the "catRAPID fragments" module (Cirillo et al. [RNA. 2013]). The interaction propensity for a protein-RNA pair is determined by taking the largest value from the distribution of interaction propensities of their fragments, as in RNAct and the scRAPID internal database. For each organism, we created a SQL database of protein-RNA interaction propensities using the sqlite3 Python package. This database can be queried quickly during the catRAPID-based filtering step of the scRAPID pipeline.
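The two ideas in this paragraph, taking the maximum over fragment propensities and storing pair-level values in an SQLite database, can be sketched with the standard sqlite3 module. The table name, column names, helper functions, and all values below are hypothetical; the actual schema of the scRAPID databases is not described in this documentation.

```python
import sqlite3

def pair_propensity(fragment_scores):
    """Pair-level propensity: the largest value among the fragment propensities."""
    return max(fragment_scores)

# In-memory database with a hypothetical schema; the composite primary key
# gives an index on (protein, rna), which keeps point lookups fast.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE propensity ("
    "protein TEXT, rna TEXT, score REAL, PRIMARY KEY (protein, rna))"
)

# Made-up fragment propensities for two protein-RNA pairs.
fragment_table = {
    ("P1", "GENE_A"): [12.5, 48.1, 30.0],
    ("P1", "GENE_B"): [3.2, 7.9],
}
conn.executemany(
    "INSERT INTO propensity VALUES (?, ?, ?)",
    [(p, r, pair_propensity(s)) for (p, r), s in fragment_table.items()],
)
conn.commit()

def lookup(conn, protein, rna):
    """Return the stored propensity of a pair, or None if the pair is absent."""
    row = conn.execute(
        "SELECT score FROM propensity WHERE protein = ? AND rna = ?",
        (protein, rna),
    ).fetchone()
    return row[0] if row else None
```

A call such as lookup(conn, "P1", "GENE_A") returns the maximum fragment propensity stored for that pair, which is the value a filtering step would compare against the cutoff.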

Output interpretation

At the top of the output page, an interactive RBP-RBP interaction network is displayed, where nodes represent RBPs and edges indicate interactions with a Jaccard coefficient exceeding a user-defined threshold. Users can adjust the threshold with a slider, which updates the network layout in real time. The network can be downloaded as a Cytoscape-compatible JSON file or as a PNG file.

scRAPID protein-RNA interactions: list of inferred protein-RNA interactions that passed the catRAPID-based filter, including target RNA biotype, edge weights returned by the selected GRN inference method and RBP-specific motifs identified within the target sequences.
Hub RBPs: list of RBPs and their out-centrality values (fraction of outgoing edges in the GRN).
Hub RNAs: list of target RNAs and their in-centrality values (fraction of incoming edges in the GRN).
Hub lncRNAs: list of target lncRNAs, if present in the dataset, and their in-centrality values.
RBP Co-interactions: list of RBP-RBP pairs including the number of target RNAs of both RBPs from the inferred GRN and the Jaccard coefficient quantifying target overlap. For GRN inference methods that return more than 5% of the possible edges, the ranking is trimmed to retain only the top 5% of edges; otherwise, all edges are kept. Then, for each pair of RBPs, the Jaccard coefficient between their sets of targets is computed. The RBP-RBP interaction analysis is not performed if TENET is selected, due to its poor performance in this task.
Highly Variable Genes: list of the highly variable genes which are used to infer the GRN.
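The edge trimming and the centrality values reported in the tables above can be sketched as follows. The function names and edge layout are hypothetical, the number of possible edges is passed in rather than derived because its definition is not given here, and centrality is interpreted as the share of the network's edges that leave (or enter) a node, which is an assumption.

```python
from collections import Counter

def trim_top_fraction(edges, n_possible, fraction=0.05):
    """Keep only the top-weighted edges if the network holds more than
    `fraction` of the possible edges; otherwise keep all edges.

    edges: list of (rbp, target, weight) tuples (hypothetical layout).
    """
    limit = int(fraction * n_possible)
    if len(edges) <= limit:
        return list(edges)
    return sorted(edges, key=lambda e: e[2], reverse=True)[:limit]

def out_centrality(edges):
    """Share of the network's edges leaving each RBP (assumed definition)."""
    counts = Counter(rbp for rbp, _, _ in edges)
    return {rbp: n / len(edges) for rbp, n in counts.items()}

def in_centrality(edges):
    """Share of the network's edges entering each target (assumed definition)."""
    counts = Counter(target for _, target, _ in edges)
    return {target: n / len(edges) for target, n in counts.items()}

# Toy network with made-up identifiers: 4 edges out of 40 possible (10%).
edges = [("R1", "a", 0.9), ("R1", "b", 0.8), ("R2", "a", 0.7), ("R2", "c", 0.1)]
trimmed = trim_top_fraction(edges, n_possible=40)
hub_rbps = out_centrality(trimmed)
hub_rnas = in_centrality(trimmed)
```

Because the toy network contains 10% of the possible edges, it is trimmed to the two highest-weighted edges (5% of 40), both of which leave R1.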

Performance

The performance of scRAPID in predicting protein-RNA interactions from single-cell transcriptomic data has been systematically evaluated across various datasets and methods (Fiorentino et al. [Nucleic Acids Res. 2024]). The prediction of RBP-target interactions consistently achieves a higher Early Precision Ratio (EPR) than the inference of transcription factor-target interactions, the standard task in GRN inference from scRNA-seq data, with DeepSEM and TENET showing the best performance.

The integration with catRAPID predictions (Bellucci et al. [Nat Methods. 2011]) significantly improves the ability to predict interactions, particularly for long non-coding RNAs (lncRNAs). In particular, the application of catRAPID filtering led to an average relative improvement in EPR of 17.6% across datasets and methods, indicating its effectiveness in refining the inferred networks by eliminating indirect interactions.

Additionally, scRAPID identifies hub RBPs and hub RNAs, with the catRAPID filter preserving or increasing the Jaccard coefficient ratio (JCR) in 95% of cases for datasets with 500 highly variable genes (HVGs).

Finally, scRAPID predicts direct RBP-RBP interactions by leveraging the overlap of RNA targets inferred from the scRNA-seq data.

The scRAPID pipeline can be used with any GRN inference algorithm, making it a valuable tool for exploring complex protein-RNA regulatory networks in single-cell transcriptomic studies.

Execution times

Execution times vary with the algorithm, the organism, and the number of HVGs. The table below summarizes the time (in minutes) required to complete a run starting from a matrix with 2448 cells from a retinoic acid-driven differentiation time course of mouse embryonic stem cells (Semrau et al. [Nat Commun. 2017]), using various algorithms and selecting different numbers of HVGs. “# HVGs Network” refers to the genes used for network inference, which are the union of the selected HVGs and all the highly variable RBPs (reported in the Highly Variable Genes output table). For the TENET algorithm, the table reports the time required to generate the diffusion map plus the time required for the subsequent analysis.