RIME Web Service Documentation

RIME (RNA Interactions Model with Embeddings) is a deep learning model that predicts RNA-RNA interactions (RRIs) by leveraging embeddings from the Nucleotide Transformer language model. The model was trained to infer whether two input RNA sequences, represented as a contact matrix, interact (positive class) or not (negative class). To ensure the identification of direct RNA-RNA interactions, independent of protein mediation, the positive set consisted of contact matrix windows encompassing duplexes (PP, Paired regions in interacting transcript Pairs) detected by Psoralen-based high-throughput RRI detection methodologies (PARIS, PARIS2 and SPLASH). The negative set was constructed from three distinct classes of non-interacting windows: randomly selected Non-paired regions in interacting transcript Pairs (NP), randomly selected Non-paired regions in Non-interacting transcript pairs (NN), and Paired regions already seen in the positive set, combined in Non-interacting transcript pairs (PN). This setup allowed us to approach RRI prediction from two angles:

  • Discriminating RNA-RNA Patches (DRP) focuses on distinguishing interacting regions (PP) from non-interacting ones (NN, NP).
  • Discriminating RNA-RNA Interactors (DRI) focuses on distinguishing interacting regions within transcript pairs that truly interact (PP) from the same regions placed in randomly combined transcript pairs that do not interact with each other (PN).
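
The two tasks share the same positives and differ only in which negative classes they use. A minimal sketch of that selection follows; the `Window` record, the `TASK_NEGATIVES` table and the `build_task` function are hypothetical names introduced for illustration and are not part of the RIME code base.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical record: (window_id, category), category in {"PP", "NP", "NN", "PN"}.
Window = Tuple[str, str]

# Negative classes used by each task, as described in the text above.
TASK_NEGATIVES: Dict[str, Set[str]] = {
    "DRP": {"NN", "NP"},  # non-interacting regions
    "DRI": {"PN"},        # paired regions in non-interacting transcript pairs
}


def build_task(windows: List[Window], task: str) -> List[Tuple[str, int]]:
    """Return (window_id, label) pairs: 1 for PP, 0 for the task's negatives."""
    negatives = TASK_NEGATIVES[task]
    labelled = []
    for window_id, category in windows:
        if category == "PP":
            labelled.append((window_id, 1))
        elif category in negatives:
            labelled.append((window_id, 0))
        # Windows from the other task's negative classes are left out.
    return labelled


windows = [("w1", "PP"), ("w2", "NN"), ("w3", "NP"), ("w4", "PN")]
drp_set = build_task(windows, "DRP")
dri_set = build_task(windows, "DRI")
```

Keeping the task definition in a small lookup table makes it explicit that PP windows are reused unchanged in both evaluations.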

Figure 1: Preparation of datasets for training and evaluation of RIME. a, Schematic representation of the negative data generation process. Non-interacting pairs were generated by randomly combining interacting transcript pairs. The interacting sites are shown as white rectangles along transcript sequences, represented with lines. b, Illustration of Interacting Transcript Pairs (left panel) and Non-interacting Transcript Pairs (right panel), represented with contact matrices. Training and evaluation data were generated by sampling windows of varying sizes within these matrices, classified into four categories (PP, NP, NN, or PN) based on their positions. Interacting sites, depicted as white rectangles along the transcript sequence axes, are included in PP and PN windows but excluded from NN and NP windows. c, Visual representation of the DRP and DRI model evaluation tasks. In both cases, PPs serve as the positive set, while non-interacting regions (NN and NP) are used as the negative set for DRP, and randomly permuted interacting regions (PN) for DRI.
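
The window categories in panel b can be sketched as a simple geometric rule on the contact matrix. The code below is only an illustration under stated assumptions: rectangles use half-open coordinates, a window "includes" an interacting site when the site lies fully inside it, and windows that only partially overlap a site are left unlabelled. The actual sampling criteria used for RIME may differ, and the names `Rect`, `contains`, `overlaps` and `classify_window` are hypothetical.

```python
from typing import Optional, Tuple

# Rectangle in the contact matrix: (x_start, x_end, y_start, y_end), half-open.
# x indexes the first transcript, y the second one.
Rect = Tuple[int, int, int, int]


def contains(outer: Rect, inner: Rect) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return (outer[0] <= inner[0] and inner[1] <= outer[1]
            and outer[2] <= inner[2] and inner[3] <= outer[3])


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share at least one matrix cell."""
    return a[0] < b[1] and b[0] < a[1] and a[2] < b[3] and b[2] < a[3]


def classify_window(window: Rect, site: Rect, pair_interacts: bool) -> Optional[str]:
    """Assign PP, PN, NP or NN to a window; None if it partially overlaps the site."""
    if contains(window, site):
        return "PP" if pair_interacts else "PN"
    if not overlaps(window, site):
        return "NP" if pair_interacts else "NN"
    return None  # ambiguous under this sketch's assumptions


site = (100, 130, 200, 240)
print(classify_window((50, 250, 150, 350), site, pair_interacts=True))
print(classify_window((300, 500, 400, 600), site, pair_interacts=False))
```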

The extensive RRI set we compiled allowed us to thoroughly evaluate RIME's performance and to benchmark it against the best-known thermodynamics-based prediction tools: IntaRNA 2, ASSA, RNAplex, RNAhybrid, RIsearch2, pRIblast, RNAup, and RNAcofold. To evaluate performance across different tools and account for errors in both positive and negative predictions, we used the Area Under the ROC Curve (AUC) as a global evaluation metric. The tools were tested on pairs of 200-nucleotide-long RNA regions.


Figure 2: RIME overcomes the limitations of thermodynamics-based tools. a, Heatmap of Pearson correlation coefficients between RRI prediction tools, calculated using scores assigned to positive interactions in the Psoralen-based test set. For comparability with RIME scores, ∆G values from thermodynamics-based tools were converted by inverting their sign. Tools with similar scoring patterns were grouped using average linkage hierarchical clustering based on Euclidean distances. b and c, ROC curves describing the performances of the RRI prediction tools in the DRP (b) and DRI (c) tasks, shown for the Psoralen HQ Human (left) and the RIC-seq HQ (right) datasets. To address class imbalance, negatives were randomly undersampled 100 times to match the number of positives, and ROC curves were averaged across iterations.
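
The balanced evaluation described in the caption can be sketched in a few lines. This is a minimal illustration, not the evaluation code behind the figure: it averages AUC values rather than whole ROC curves, it computes AUC with a simple quadratic pairwise comparison, and it assumes negatives outnumber positives. The function names `roc_auc`, `balanced_auc` and `delta_g_to_score` are hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence


def delta_g_to_score(delta_g: float) -> float:
    """Invert the sign of a ∆G value so that higher means 'more likely to interact'."""
    return -delta_g


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def balanced_auc(pos_scores: List[float], neg_scores: List[float],
                 n_iter: int = 100, seed: int = 0) -> float:
    """Undersample negatives to the number of positives n_iter times; average the AUCs."""
    rng = random.Random(seed)
    k = len(pos_scores)  # assumes len(neg_scores) >= k
    aucs = [roc_auc(pos_scores, rng.sample(neg_scores, k)) for _ in range(n_iter)]
    return mean(aucs)


# Toy scores: every positive outranks every negative.
pos = [0.9, 0.8, 0.7]
neg = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
print(balanced_auc(pos, neg))
```

Fixing the random seed keeps the undersampling reproducible, which matters when comparing several tools on the same negative draws.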

In the DRP task, RIME consistently outperformed thermodynamics-based tools on human Psoralen-based RNA-RNA interactions, particularly when assessed against higher-quality sets. For the subset with at least 3 supporting reads or an interacting region length greater than 35 nucleotides (hereafter named Psoralen HQ Human), RIME reached an AUC of 78%, while the top-performing thermodynamics-based tool (RIsearch2) reached 63%. Notably, we observed a similar trend when evaluating the methods against interactions detected via RIC-seq: RIME achieved an AUC of 77% on interactions supported by at least 4 reads (RIC-seq HQ), whereas on the same set the thermodynamics-based tools peaked at an AUC of 66% with RIsearch2. Although the overall performance metrics for the DRI task were comparatively lower, RIME still stood out as the top-performing tool, with its AUC improving alongside dataset quality. On the Psoralen HQ Human dataset, RIME reached an AUC of 72%, while the leading thermodynamics-based tool (RIsearch2) achieved an AUC of 58%. Consistently, on the RIC-seq HQ dataset, RIME's AUC (70%) was significantly higher than that of RIsearch2 (61%).
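
The two high-quality subsets are defined by simple thresholds, which can be written as filters. The sketch below is illustrative only: the `Interaction` record and its fields `n_reads` and `region_length` are hypothetical, and it assumes a single length value per interaction, whereas the source does not specify how the length of an interacting region pair is measured.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interaction:
    n_reads: int        # number of supporting reads (hypothetical field)
    region_length: int  # interacting region length in nucleotides (hypothetical field)


def is_psoralen_hq(x: Interaction) -> bool:
    """Psoralen HQ: at least 3 supporting reads OR region length above 35 nt."""
    return x.n_reads >= 3 or x.region_length > 35


def is_ricseq_hq(x: Interaction) -> bool:
    """RIC-seq HQ: at least 4 supporting reads."""
    return x.n_reads >= 4


def filter_hq(interactions: List[Interaction], psoralen: bool = True) -> List[Interaction]:
    """Keep only the interactions that meet the chosen quality criterion."""
    keep = is_psoralen_hq if psoralen else is_ricseq_hq
    return [x for x in interactions if keep(x)]


data = [Interaction(1, 20), Interaction(3, 20), Interaction(1, 36), Interaction(4, 10)]
print(len(filter_hq(data, psoralen=True)))
print(len(filter_hq(data, psoralen=False)))
```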

The web service uses the RIMEfull model, which was trained on an expanded dataset that incorporates the validation set into the training data. This expanded training set improved the model's performance when evaluated on the high-quality RIC-seq data.


Figure 3: Evaluation of the RIMEfull model. Line plots illustrating the performance (ROC-AUC) of the RIME and RIMEfull models on RIC-seq test sets of progressively higher quality, measured for the DRP (left panel) and DRI (right panel) tasks. The bottom x-axis represents the minimum interacting region length thresholds. At each threshold value, the positive set was subsampled to match the quality criteria and, to address class imbalance, negative samples were randomly undersampled 100 times to match the number of positives. ROC-AUC scores were averaged across iterations. The number of interactions retained at each threshold value is reported on the top x-axis of each line plot.