Some frequently asked questions:

1. Can I predict lncRNA/DNA binding for lncRNA and DNA sequences that are not in the LongMan database?

Yes. If one lncRNA and one genomic region (one DNA sequence) are involved, the prediction can be made in the LongTarget web page by pasting the two sequences into the two input boxes or by uploading two sequence files. If the input is multiple lncRNAs and one DNA sequence, or one lncRNA and multiple DNA sequences, the prediction can be made in the newly developed LongTarget-BE web page (BE means “Batch prediction for External data”), in which two sequence files should be uploaded. Because such jobs are too time-consuming, predictions for N lncRNAs against M DNA sequences are currently not allowed.
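The rule above can be summarized as a small decision function. This is only a sketch of the stated rule, not code used by the server; the function name `choose_prediction_page` is hypothetical.

```python
def choose_prediction_page(n_lncrnas: int, n_dna: int) -> str:
    """Return the web page that accepts this combination of inputs."""
    if n_lncrnas < 1 or n_dna < 1:
        raise ValueError("need at least one lncRNA and one DNA sequence")
    if n_lncrnas == 1 and n_dna == 1:
        return "LongTarget"        # one lncRNA against one DNA sequence
    if n_lncrnas == 1 or n_dna == 1:
        return "LongTarget-BE"     # N x 1 or 1 x M batch prediction
    # N x M jobs are currently not allowed.
    raise ValueError("N lncRNAs x M DNA sequences is currently not allowed")


print(choose_prediction_page(1, 1))
print(choose_prediction_page(5, 1))
```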


2. What is the usefulness of cross-species lncRNA/DNA binding analysis?

Increasing evidence indicates that epigenetic regulation of gene expression is highly species-specific. By checking whether a human lncRNA has DNA binding sites in a macaque genomic region, or whether a macaque lncRNA has DNA binding sites in a human genomic region, one can examine, for example, which element (the TFO or the TTS) causes species-specific gains and losses of sites of epigenetic regulation.


3. Should I perform a permutation test if I doubt the reliability of a prediction?

The current permutation test is too time-consuming, and a revised version will be available soon. It is preferable to re-run the job with a larger Nt parameter (and a slightly larger offset parameter) to see whether the same binding sites still occur at the same positions. A long TTS (about 100 bp, with offset = 15-20) strongly indicates a reliable binding site.
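Checking whether binding sites recur at the same positions can be done by comparing the TTS intervals of the two runs. The sketch below is one way to do so, assuming each TTS has already been reduced to a (start, end) pair; the function names and the coordinates are made up for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive; this representation is an assumption


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def persistent_sites(first_run: List[Interval], rerun: List[Interval]) -> List[Interval]:
    """TTSs from the first run that still overlap a TTS in the re-run."""
    return [site for site in first_run
            if any(overlaps(site, other) for other in rerun)]


# Made-up coordinates for illustration only.
run_small_nt = [(1000, 1060), (5000, 5055), (9000, 9110)]
run_large_nt = [(1005, 1090), (9000, 9110)]
print(persistent_sites(run_small_nt, run_large_nt))
```

Sites that disappear in the re-run (here the one at 5000-5055) are the ones to treat with more caution.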


4. Can TFOs/TTSs generated by different parameters be compared?

As long as the TTS distribution patterns generated by different parameters are the same, the results are essentially equivalent. Example 4 in the Examples web page is a case in point. So, instead of comparing the number, height, and area of TTSs, one should compare TTS distribution patterns.


5. Why are both hg19 and hg38 included in the database for lncRNA/DNA binding prediction?

To examine whether predicted DNA binding sites are at reasonable genomic positions, it is advisable to display TTS distributions together with the ENCODE DNA Methylation and ENCODE Histone Modification tracks in the UCSC Genome Browser. These tracks work only with hg19. If these tracks are not needed, hg38 may be the better choice.


6. How can I judge whether predicted DNA binding motifs and binding sites are reliable?

DNA binding motifs in an lncRNA and binding sites in a genomic region determine each other. Normally, more is known about lncRNAs’ DNA binding sites (for example, the enrichment of epigenomic modifying enzymes and signals of epigenomic modification at many genomic sites in some cell lines indicates lncRNAs’ DNA binding sites). Also, many lncRNAs’ DNA binding sites are in promoter regions of genes. So, DNA binding sites at these positions are rather reasonable and reliable. In contrast, DNA binding motifs have been identified experimentally in only a few lncRNAs; an important and impressive example is PTENP1-asRNA, which uses a DNA binding motif to bind to the PTEN promoter (Example 5 in the Examples web page). Therefore, any TFO1 that generates reasonable TTSs is itself reasonable.


7. What are the differences between TFO1 and other TFOs, and should TFO2 and TFO3 be seriously considered?

LongTarget, currently based on the 24 Hoogsteen and reverse Hoogsteen rulesets, usually predicts multiple TFOs and TTSs, with TFO1 being the best TFO and TFO2 the second best. In many situations only TFO1 may be the true DNA binding motif. But our analyses of Kcnq1ot1 and some other very long lncRNAs indicate that some lncRNAs may have multiple DNA binding motifs and may use different motifs to bind to DNA sequences in different genomic regions. So, if TFO2 and TFO3 have TTSs at reasonable genomic positions, they may also be DNA binding motifs. In addition to using different DNA binding motifs to bind to different genomic regions, lncRNAs may also use different DNA binding motifs to bind to DNA sequences in different cells.


8. Will the database LongMan contain more sequences and offer more services?

More sequences and more services will be available soon, and we are doing our best to upgrade the server.


9. How is the database updated?

The database is updated in several ways: to include more lncRNAs, to include identified TFOs and TTSs, to provide links to more external data sources, and to improve lncRNA/DNA binding prediction protocols. Currently, LongMan is dedicated to orthologous mammalian lncRNAs that we identified based on the GENCODE-annotated human and mouse lncRNAs. We are searching for orthologues of the 10481 GENCODE-annotated mouse lncRNAs in the 16 mammals and expect to include the identified orthologues soon. In addition, the database will include more GENCODE-annotated human and mouse lncRNAs.


10. What are the main steps to perform a first-form genome-scale prediction?

Steps to perform a chromosome-wide prediction are as follows.
(1) Choose “Species=human” and “Chr=Y” in Search orthologues upon multi-conditions; the database search then reports 83 lncRNAs on the human Y chromosome.
(2) Press the top-left button “Batch TFO/TTS prediction”; LongMan then goes to the batch prediction web page.
(3) Use “Or define a genomic region” to define the CDKN2A/2B region hg19|chr9|21992500-22012500.
(4) Press the “Retrieve” button; LongMan then retrieves the sequence of hg19|chr9|21992500-22012500 into the DNA input box.
(5) Choose the default LongTarget parameter setting.
(6) Do not choose any filter conditions, so that all results are reported.
(7) Enter an email address; a pdf file displaying all TTS distributions is then sent to the user automatically when the prediction is finished, and the email also contains the links for downloading other files.
(8) Press the “submit” button to submit the prediction.
(9) Check emails: the first email confirms the submission of the prediction, and the second returns the pdf file and the links.
(10) To view the TTS distributions against the ENCODE DNA Methylation and ENCODE Histone Modification tracks, download the class1 file from our website and upload it to the UCSC Genome Browser (hg19) as a custom track.
A genome-wide prediction follows the same steps, beginning with choosing “Species=human” and “Chr=All” in Search orthologues upon multi-conditions.
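Before defining a region such as hg19|chr9|21992500-22012500, it can help to confirm that the string is well formed. The following is a minimal sketch, assuming the region always takes the form assembly|chromosome|start-end as in the example above; the `Region` type and the `parse_region` function are hypothetical helpers, not part of LongMan.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    assembly: str
    chrom: str
    start: int
    end: int


# Pattern inferred from the single example above; other forms are not handled.
_REGION = re.compile(r"^(\w+)\|(chr\w+)\|(\d+)-(\d+)$")


def parse_region(text: str) -> Region:
    """Split 'assembly|chrom|start-end' into its parts and sanity-check them."""
    match = _REGION.match(text.strip())
    if match is None:
        raise ValueError(f"not in assembly|chrom|start-end form: {text!r}")
    start, end = int(match.group(3)), int(match.group(4))
    if start > end:
        raise ValueError("start must not exceed end")
    return Region(match.group(1), match.group(2), start, end)


region = parse_region("hg19|chr9|21992500-22012500")
print(region.assembly, region.chrom, region.end - region.start)
```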


11. What are the main steps to perform a second-form genome-wide prediction?

Steps to perform a second-form genome-wide prediction (to predict an lncRNA’s DNA binding sites in the promoter regions of all transcripts in a species) are simpler. First, in the LongTarget web page, under 1. Input DNA Sequence, choose “Or choose a species for genome-scale prediction for the inputted lncRNA”. Second, input the lncRNA. Third, choose the LongTarget parameters (the default setting suits most situations). Fourth, since this genome-wide prediction takes about 10 days to two weeks for the human genome hg38, it is advisable to leave an email address so that the results are sent automatically when the prediction is finished, rather than checking repeatedly whether it has finished.


12. What causes many TTSs?

A small Nt and/or low filter conditions cause many TTSs, but such TTSs may not be strong. If the genomic region has many transposons and other repeats (especially Simple Repeats), many strong TTSs may be reported.


13. What causes very few TTSs to be predicted?

Either there are few or no binding sites, or the parameters and filter conditions are set too high.


14. What files are returned by a batch prediction (genome-wide or chromosome-wide)?

If an email address is provided, an email is sent to the user. It includes the pdf file (preview-all.pdf), which displays the TTS distribution generated by TFO1 in the genomic region, and the link for downloading the compressed file (BatchLncRNA-BindingSites-prediction-result.zip). After the zip file is uncompressed, the folder contains (1) multiple gene folders (such as human_hg19-ENSG0000012345.1-RP11-2345-chr1-1234567-2345678), (2) a defaultFilter folder, and (3) the preview-all.pdf. Each gene folder contains a class1 file and a sorted file: the class1 file contains the distribution of TTSs generated by TFO1 in the genomic region, and the sorted file contains details of all triplexes (including TFO and TTS sequences). An example of a class1 file is shown below, in which numbers in the fourth column indicate the height of triplexes at genomic positions.

Part of a sorted file is shown below, in which numbers in the Nt (bp) field indicate the lengths of triplexes, numbers (all 1s) in the Class field indicate that these triplexes are generated by TFO1, sequences in the TFO_sequence field are the TFO1 sequences, and sequences in the TTS_sequence field are the TTS sequences. Note that triplexes generated by TFO1 at different positions may have different lengths; therefore these TFO sequences differ in length but have a shared region, and these TTS sequences differ in both length and sequence but share a consensus sequence.

If an lncRNA has one or more TTSs whose length exceeds the Nt parameter, it is reported in a gene folder, and if it has one or more TTSs matching the filter conditions, it also appears in the defaultFilter folder.
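For readers who prefer to inspect a class1 file programmatically, the sketch below finds the position with the tallest triplex stack. It assumes a whitespace-separated, bedGraph-like layout (chromosome, start, end, height); only the meaning of the fourth column comes from the description above, so the other columns, the function names, and the sample values are assumptions.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, int, int, float]  # (chrom, start, end, height); layout is assumed


def read_heights(lines: Iterable[str]) -> List[Record]:
    """Parse data lines, skipping blank lines and non-numeric header lines."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        try:
            start, end, height = int(fields[1]), int(fields[2]), float(fields[3])
        except ValueError:
            continue  # e.g. a track or browser header line
        records.append((fields[0], start, end, height))
    return records


def tallest(records: List[Record]) -> Record:
    """Record with the largest value in the fourth (height) column."""
    return max(records, key=lambda record: record[3])


# Made-up lines for illustration; real class1 files may differ.
sample = [
    "track type=bedGraph name=example",
    "chr9 21992500 21992510 3",
    "chr9 21992510 21992530 12",
    "chr9 21992530 21992540 5",
]
print(tallest(read_heights(sample)))
```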


15. How can I get and locate TFO1 and TFO2?

(1) Use Microsoft Excel to open the sorted file. To get the longest possible TFO1 sequence, choose and copy the TFO1 sequence of the longest triplex (for example, the one whose length is 113 in FAQ 14); otherwise, if the consensus sequence of TFO1 is conserved, choose and copy a short triplex (for example, the one whose length is 54 or 61 in FAQ 14). (2) Open the UCSC Genome Browser (genome.UCSC.edu) and go to Tools/BLAT. (3) Paste the TFO1 sequence into the BLAT input box and submit a genome search. (4) Click the “browser” link of the first search result (normally the first is the one at the target genomic region) to enter the genome browser web page. (5) Use the left mouse button to mark this region. (6) Zoom out 100x to show the whole PTENP1/PTENP1-AS promoter region. Now, it is clear that the TFO1 is in exon 1 of PTENP1-AS and the TTS is in the promoter of PTENP1. Moreover, the TFO1 exists only in humans and chimps. The following figures show the above steps and results. The same steps can be used to get and locate TFO2.
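Step (1) can also be done without Excel. The sketch below picks the longest triplex of a given class from a sorted file, assuming the file is tab-delimited with a header row containing the Nt (bp), Class, TFO_sequence, and TTS_sequence fields named in FAQ 14. The delimiter, the use of Class 2 for TFO2, the function name `longest_tfo`, and the placeholder sequences are assumptions.

```python
import csv
import io


def longest_tfo(sorted_text: str, tfo_class: str = "1") -> dict:
    """Row of the longest triplex whose Class field equals tfo_class."""
    reader = csv.DictReader(io.StringIO(sorted_text), delimiter="\t")
    rows = [row for row in reader if row["Class"].strip() == tfo_class]
    if not rows:
        raise ValueError(f"no triplexes with Class {tfo_class}")
    return max(rows, key=lambda row: int(row["Nt (bp)"]))


# Made-up rows; the sequences are short placeholders, not real TFOs or TTSs.
SAMPLE = "\n".join([
    "Nt (bp)\tClass\tTFO_sequence\tTTS_sequence",
    "54\t1\tAGGAGA\tTCCTCT",
    "113\t1\tAGGAGAGG\tTCCTCTCC",
    "61\t1\tGGAGAG\tCCTCTC",
    "40\t2\tGAAGGA\tCTTCCT",
])

best = longest_tfo(SAMPLE)
print(best["Nt (bp)"], best["TFO_sequence"])
```

The returned TFO_sequence is what would be pasted into BLAT in step (3); calling `longest_tfo(SAMPLE, "2")` would do the same for TFO2 under the stated assumption about the Class field.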





Some frequently encountered problems (troubleshooting):

1. Why does a prediction not generate any result?

First, check how long the prediction has run; it is common for the first-form genome-wide prediction to take several days and for the second-form genome-wide prediction to take more than 10 days. Second, check the format of both inputs by pressing the “Sample-A”, “Sample-B”, and “format” buttons; in most cases the problem is caused by wrong input formats. Third, check whether the "Nt" parameter is set too high (for example, >100 bp) and whether the "Identity" parameter is set too high (for example, >80). Fourth, check whether the filter conditions are set too high (for example, TTS height < 80, Area < 500). If all these have been checked and there is still no result, please write to longtarget@smu.edu.cn instead of resubmitting the prediction; we normally respond to emails in 1-2 days.
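The third check can be expressed as a small pre-submission helper. This is a sketch that uses only the two example thresholds quoted above (Nt > 100 bp, Identity > 80); the function name `parameter_warnings` is hypothetical, and the thresholds are illustrative rather than hard limits of LongTarget.

```python
from typing import List


def parameter_warnings(nt: int, identity: float) -> List[str]:
    """List the parameters that exceed the example thresholds in the text."""
    warnings = []
    if nt > 100:
        warnings.append("Nt is above 100 bp and may be too high")
    if identity > 80:
        warnings.append("Identity is above 80 and may be too high")
    return warnings


print(parameter_warnings(nt=120, identity=60))
print(parameter_warnings(nt=50, identity=60))
```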


2. Why is no email received after an email address is provided?

If an email address is provided, two emails will be sent: one confirming the submission of the prediction and the other returning the results. If the first email has been received but the results have not arrived after 2 days for a genomic-region prediction, 5 days for a first-form genome-wide prediction, or 10 days for a second-form genome-wide prediction, please write to longtarget@smu.edu.cn.


3. Why is an lncRNA that is assumed to bind to the genomic region not identified in a genome-wide prediction?

First, check if the "Nt" parameter is set too high and if the "Identity" parameter is set too high. Second, check if filter conditions are set too high.


4. How can I handle too many TTSs?

Either open the sorted file in Microsoft Excel and manually filter out short TTSs, or resubmit the prediction with a higher Nt and higher filter conditions.
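The manual filtering can also be scripted. The sketch below drops triplexes shorter than a chosen length from the text of a sorted file, assuming the file is tab-delimited with an Nt (bp) header field as described in FAQ 14; the delimiter, the function name `drop_short_triplexes`, and the sample rows are assumptions.

```python
import csv
import io


def drop_short_triplexes(sorted_text: str, min_nt: int) -> str:
    """Return the sorted-file text with only rows whose Nt (bp) >= min_nt."""
    reader = csv.DictReader(io.StringIO(sorted_text), delimiter="\t")
    kept = [row for row in reader if int(row["Nt (bp)"]) >= min_nt]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(kept)
    return out.getvalue()


# Made-up rows with placeholder sequences, for illustration only.
ROWS = "\n".join([
    "Nt (bp)\tClass\tTFO_sequence\tTTS_sequence",
    "35\t1\tAGGA\tTCCT",
    "61\t1\tGGAG\tCCTC",
    "113\t1\tGAGG\tCTCC",
])

filtered = drop_short_triplexes(ROWS, min_nt=60)
print(filtered)
```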


5. What should I do if there are too few results?

If the defaultFilter folder has very few lncRNAs, (1) check all reported lncRNAs (all gene folders), because if the filter conditions are too stringent, many reported lncRNAs may not be in the defaultFilter folder; (2) if there are very few gene folders, resubmit the prediction with a smaller Nt (together with lower filter conditions); (3) reconsider whether the DNA sequence is a reasonable target region of the lncRNAs.


6. How can I identify false-positive TTSs?

Many TTSs at Simple Repeats may be false positives, because the highly enriched nucleotides in some Simple Repeats may match some base-pairing rulesets very well by chance. Also, TTSs generated by multiple rulesets may be less convincing than TTSs generated by one dominant ruleset. Finally, TTSs with a low height and small area may be false positives. All of these TTSs can be filtered out using LongTarget parameters and/or filter conditions.


7. Given that the Nt parameter is important, how should I choose a proper Nt?

Normally Nt = 30-50; try 50 first. If too many TTSs are reported or there are many long triplexes in the sorted file, try a larger Nt; otherwise, try a smaller one. When Nt is changed, offset may need to be changed as well. Nt = 30/50/80 normally works with offset = 10/15/20, respectively. Be careful: many TTSs may be false positives if Nt is shorter than 30.
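The Nt/offset pairing above can be kept in a small lookup. The sketch below returns the offset paired with the nearest listed Nt; using the nearest listed value for intermediate Nt settings is an assumption made here for illustration, not a documented rule, and the function names are hypothetical.

```python
# Pairs taken from the text: Nt = 30/50/80 with offset = 10/15/20.
NT_TO_OFFSET = {30: 10, 50: 15, 80: 20}


def suggest_offset(nt: int) -> int:
    """Offset paired with the listed Nt value closest to nt (an assumption)."""
    nearest = min(NT_TO_OFFSET, key=lambda known: abs(known - nt))
    return NT_TO_OFFSET[nearest]


def is_risky(nt: int) -> bool:
    """True when Nt is below 30, where many TTSs may be false positives."""
    return nt < 30


print(suggest_offset(50), suggest_offset(70), is_risky(25))
```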