RNA plays a fundamental role inside the cell. The experimental determination of ribonucleoprotein (RNP) complex structures is slow and difficult, and the number of experimentally determined RNP structures is still small. In this context, computational predictions of RNP complex structures would greatly help in studying protein-RNA interactions and investigating their molecular function. To this end, we developed catRAPID, an algorithm to facilitate the identification of protein-RNA interactions. Through the calculation of secondary structure, hydrogen bonding and van der Waals contributions, catRAPID is able to predict protein-RNA interaction propensities with high accuracy (up to 89% on the ncRNA-protein interaction database, NPinter).
In catRAPID, the contributions of secondary structure, hydrogen bonding and van der Waals interactions are combined into the interaction profile:
$$|\Phi_x\rangle = \alpha_S\,|S_x\rangle + \alpha_H\,|H_x\rangle + \alpha_W\,|W_x\rangle$$
where $|X_x\rangle$ indicates the physico-chemical profile of a property $X$ calculated for each amino acid (nucleotide) starting from the N-terminus (5'). The hydrogen bonding profile, denoted by $|H_x\rangle$, is the hydrogen bonding ability of each amino acid (nucleotide) in the sequence:
$$|H_x\rangle = (H_1, H_2, \ldots, H_{\mathrm{length}})$$
Similarly, $|S_x\rangle$ represents the secondary structure occupancy profile and $|W_x\rangle$
the van der Waals profile. The variable x indicates RNA (x = r) or protein (x = p) profiles. Secondary structure, hydrogen bonding and van der Waals contributions are calculated as described in the catRAPID manuscript [Bellucci et al., 2011]. In particular, the RNA secondary structure is predicted from sequence using the Vienna package [ref?] including the algorithms RNAfold, RNAsubopt and RNAplot. Model structures, ranked by energy, are used as input for catRAPID. For each model structure, the RNAplot algorithm is employed to generate secondary structure coordinates. Using the coordinates we define the secondary structure occupancy by counting the number of contacts within the nucleotide chain. High values of secondary structure occupancy indicate that base pairing occurs in regions with high propensity to form stems, while low values are associated with junctions or multi-loops.
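The two steps described above (counting contacts from layout coordinates, then combining the three property profiles) can be sketched as follows. This is a minimal illustration, not the catRAPID implementation: the distance cutoff, the contact definition and the weights are hypothetical placeholders, and the coordinates are assumed to be 2D points such as those produced by a layout tool like RNAplot.

```python
import math


def occupancy_profile(coords, cutoff):
    """Count, for each nucleotide, the other nucleotides within `cutoff`.

    `coords` is a list of (x, y) layout positions, one per nucleotide.
    The cutoff-based contact definition is an assumption of this sketch.
    """
    profile = []
    for i, (xi, yi) in enumerate(coords):
        contacts = sum(
            1
            for j, (xj, yj) in enumerate(coords)
            if j != i and math.hypot(xi - xj, yi - yj) <= cutoff
        )
        profile.append(contacts)
    return profile


def interaction_profile(secondary, hbond, vdw, weights=(1.0, 1.0, 1.0)):
    """Weighted per-residue sum of three property profiles of equal length.

    `weights` are placeholder values, not the fitted catRAPID parameters.
    """
    if not (len(secondary) == len(hbond) == len(vdw)):
        raise ValueError("each profile needs one value per residue")
    w_s, w_h, w_w = weights
    return [w_s * s + w_h * h + w_w * w for s, h, w in zip(secondary, hbond, vdw)]
```

In this sketch, nucleotides packed into a stem-like region receive higher counts than isolated ones, which mirrors the qualitative behaviour described for the occupancy profile.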
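We employ the discrete Fourier transform to compare interaction profiles of different lengths: each interaction profile is mapped to a propensity profile with a fixed number of coefficients, so that protein and RNA profiles can be compared regardless of sequence length.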
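To illustrate how a transform can bring profiles of different lengths to a common size, the sketch below uses a cosine-based discrete transform truncated to a fixed number of coefficients. Both the specific transform (a DCT-IV form) and the parameter `n_coeff` are assumptions made for illustration; the exact formula and number of coefficients used by catRAPID are not reproduced here.

```python
import math


def fourier_coefficients(profile, n_coeff):
    """Return the first `n_coeff` cosine-transform coefficients of `profile`.

    Assumed form (DCT-IV): c_k = sqrt(2/N) * sum_n p_n * cos(pi/N * (n+1/2) * (k+1/2)).
    """
    n = len(profile)
    if n == 0:
        raise ValueError("profile must not be empty")
    scale = math.sqrt(2.0 / n)
    return [
        scale
        * sum(
            value * math.cos(math.pi / n * (i + 0.5) * (k + 0.5))
            for i, value in enumerate(profile)
        )
        for k in range(n_coeff)
    ]
```

Whatever the input length, the output has `n_coeff` entries, so a 300-residue protein and a 150-nt RNA end up as vectors of the same size.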
The interaction propensity $\pi$ is defined as the inner product between the protein propensity profile $|\Psi_p\rangle$ and the RNA propensity profile $|\Psi_r\rangle$, weighted by the interaction matrix $I$:
$$\pi = \langle \Psi_r |\, I \,| \Psi_p \rangle$$
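The interaction matrix $I$, as well as the parameters $\alpha_S$, $\alpha_H$ and $\alpha_W$ (the weights of the secondary structure, hydrogen bonding and van der Waals contributions), are derived under the condition that interaction propensities $\pi$ take maximal values for associations present in the positive training set (and minimal values for those in the negative training set):
$$\max \pi \;\; \forall\, \{r, p\} \in \text{positive training set}, \qquad \min \pi \;\; \forall\, \{r, p\} \in \text{negative training set}$$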
The discriminative power measures the interaction propensity of a protein-RNA pair with respect to the training sets.
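The matrix-weighted inner product can be written out explicitly. The sketch below assumes the RNA and protein propensity profiles are plain lists of coefficients and the interaction matrix is a list of rows; the function name and the toy matrices are hypothetical, and the fitted catRAPID matrix is not reproduced.

```python
def interaction_propensity(rna_coeffs, protein_coeffs, matrix):
    """Return sum over i, j of rna_coeffs[i] * matrix[i][j] * protein_coeffs[j]."""
    if len(matrix) != len(rna_coeffs) or any(
        len(row) != len(protein_coeffs) for row in matrix
    ):
        raise ValueError("matrix shape must be len(rna_coeffs) x len(protein_coeffs)")
    return sum(
        r * sum(m * p for m, p in zip(row, protein_coeffs))
        for r, row in zip(rna_coeffs, matrix)
    )
```

With an identity matrix this reduces to the ordinary dot product; off-diagonal entries let one RNA coefficient couple to a different protein coefficient.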
For each protein-RNA pair under investigation, we use a reference set of $10^2$ protein and $10^2$ RNA molecules (the number of sequences is chosen to guarantee sufficient statistical sampling). To assess the strength of a particular association, we compute the interaction propensity $\pi$ and compare it with the interaction propensities $\tilde{\pi}$ of the reference set (a total of $10^4$ protein-RNA pairs). Using the interaction propensity distribution of the reference set, we generate the interaction score:
$$\text{Interaction score} = \frac{\pi - \mu}{\sigma}, \qquad \mu = \frac{1}{L}\sum_{i=1}^{L}\tilde{\pi}_i, \qquad \sigma^2 = \frac{1}{L}\sum_{i=1}^{L}\left(\tilde{\pi}_i - \mu\right)^2$$
The number of interactions is $L = 10^4$. From the distribution of interaction propensities we compute the interaction strength:
$$\text{Interaction strength} = P(\tilde{\pi}_i \le \pi)$$
In our calculations we employ random associations between polypeptide and nucleotide sequences. Since low interaction propensities are expected from these random associations, the reference set represents a negative control. Reference sequences have the same lengths as the pair of interest to guarantee that the interaction strength is independent of protein and RNA lengths. The interaction strength ranges from 0% (non-interacting) to 100% (interacting). Interaction strengths above 50% indicate propensity to bind. The RNA interaction strength and the protein interaction strength are special cases of the interaction strength in which the reference set is generated using only RNA or only protein sequences.
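One way to read the comparison against the reference set is as a z-score (for the interaction score) and an empirical cumulative fraction (for the interaction strength). The sketch below assumes exactly those two forms; the function names are hypothetical, and the small reference list in the usage note stands in for the much larger set of random pairs.

```python
import statistics


def interaction_score(propensity, reference):
    """Z-score of `propensity` against the reference propensities."""
    mu = statistics.fmean(reference)
    sigma = statistics.pstdev(reference)  # population standard deviation
    if sigma == 0:
        raise ValueError("reference propensities are all identical")
    return (propensity - mu) / sigma


def interaction_strength(propensity, reference):
    """Percentage of reference propensities that do not exceed `propensity`."""
    if not reference:
        raise ValueError("reference set must not be empty")
    below = sum(1 for value in reference if value <= propensity)
    return 100.0 * below / len(reference)
```

For a reference of `[1, 2, 3, 4, 5]`, a propensity of 4 is at least as large as four of the five reference values, giving a strength of 80%, which is above the 50% threshold mentioned in the text.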
When protein or RNA sequences exceed the size compatible with our computational requirements, catRAPID cannot be used directly to calculate the interaction propensity. To overcome this limitation, we developed a procedure called fragmentation, which involves dividing polypeptide and nucleotide sequences into fragments and then predicting the interaction propensities of the fragments.
The "long RNA" option: RNA fragments are used to identify the RNA regions involved in protein binding. The RNALfold algorithm from the Vienna package is employed to select RNA fragments in the range 100-200 nt with predicted stable secondary structure. Secondary structure stability is estimated from the RNA free energy predicted by RNALfold. Because longer RNA segments have lower free energies, owing to the larger number of bases that can be paired, the choice of segments in the range of 100-200 nt is optimal: it allows simultaneously a) selection of secondary structures with comparable free energies and b) high sequence coverage (>90%) for long transcripts such as Xist. Once the RNA fragments are selected, catRAPID is employed to predict their ability to bind to polypeptide chains. Conceptually, the interaction fragments algorithm is a variant of the RNA interaction strength algorithm that allows identification of putative binding areas in long sequences. If the exact protein and/or RNA domains are known, we recommend the use of the interaction strength method to predict the binding specificity.
The "protein-RNA" option: The analysis of fragments is particularly useful for identifying the regions involved in binding. The fragmentation approach is based on the division of protein and RNA sequences into overlapping segments, whose positions are determined by the fragment length f and the overall sequence length l. The total number of fragments is limited by catRAPID sequence restrictions. The maximum number of protein-RNA interactions is $10^4$, which implies that the probability of identifying an experimentally validated interaction by chance is $10^{-4}$. The list of all the protein-RNA fragment associations is called the interaction map. Protein and RNA interaction profiles are two-dimensional projections of the interaction map onto the protein or RNA positions, respectively.
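The fragmentation idea can be sketched as a sliding window followed by an all-against-all scoring of fragment pairs. The window placement rule, the `step` parameter and the scoring callback are assumptions of this sketch, not the positions or scoring used by catRAPID.

```python
def overlapping_fragments(sequence, fragment_length, step):
    """Split `sequence` into overlapping windows, returned as (start, fragment).

    A final window is added when needed so that the tail is always covered.
    """
    if fragment_length <= 0 or step <= 0:
        raise ValueError("fragment_length and step must be positive")
    if len(sequence) <= fragment_length:
        return [(0, sequence)]
    last_start = len(sequence) - fragment_length
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [(s, sequence[s:s + fragment_length]) for s in starts]


def interaction_map(protein_fragments, rna_fragments, propensity):
    """Score every protein/RNA fragment pair with the callable `propensity`.

    Returns a dict keyed by (protein_start, rna_start).
    """
    return {
        (p_start, r_start): propensity(p_seq, r_seq)
        for p_start, p_seq in protein_fragments
        for r_start, r_seq in rna_fragments
    }
```

Summing or averaging the map over one of the two keys gives a per-position profile for the other molecule, which is one simple way to obtain the projections mentioned in the text.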