In order to retrieve predictions, please prepare your file as a FASTA file (a list of sequences) or BED file (a list of genomic coordinates). If you are interested in predictions for protein-coding genes, which additionally account for the effects of predicted mRNA half-life, please retrieve the pre-computed predictions from the Downloads section. Otherwise, the script we provide will predict transcriptional activity without considering half-life.
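FASTA files obey the following convention: each record begins with a header line that starts with ">" followed by the sequence ID, and the sequence itself follows on one or more subsequent lines.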
In our case, the sequences should be up to 10,500 nucleotides long, with the presumed TSS oriented appropriately and positioned so that there are 7,000nt upstream of it and 3,500nt downstream. If your sequences are longer, please trim them down to 10,500 nucleotides. If they are shorter, please pad them with Ns, keeping your TSS positioned 7,000nt into the sequence.
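The trimming and padding rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the provided bundle: the function name `fit_to_window` and the `tss_index` argument are hypothetical, and it assumes the sequence is already oriented 5' to 3' and that "7,000nt into the sequence" means the TSS sits at 0-based index 7,000.

```python
TARGET_LEN = 10_500  # total length described in the text
UPSTREAM = 7_000     # nucleotides that should precede the TSS


def fit_to_window(seq: str, tss_index: int) -> str:
    """Trim or N-pad seq to 10,500 nt with the TSS at 0-based index 7,000.

    tss_index is the 0-based position of the TSS within seq (an assumed
    input; the source text does not say how the TSS position is recorded).
    """
    if not 0 <= tss_index < len(seq):
        raise ValueError("tss_index must fall inside the sequence")
    left_pad = max(0, UPSTREAM - tss_index)   # Ns needed if upstream is short
    left_trim = max(0, tss_index - UPSTREAM)  # bases dropped if upstream is long
    fitted = "N" * left_pad + seq[left_trim:]
    fitted = fitted[:TARGET_LEN]              # trim excess downstream sequence
    return fitted + "N" * (TARGET_LEN - len(fitted))  # pad short downstream
```

For example, `fit_to_window(my_sequence, 250)` would add 6,750 Ns on the left so that the base at position 250 lands at index 7,000, then pad or trim the right side to reach 10,500 nucleotides.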
Read more information about the BED file specification. The first 6 columns of the BED file are required: chrom, chromStart, chromEnd, name, score, and strand.
Using your BED file, you must generate a new BED file that spans the 7,000nt upstream and 3,500nt downstream of each TSS for which you want predictions. The following simple script assumes that the center of each entry is the location of your TSS and prepares your new BED file accordingly. Please ensure the TSS positions you specify are not close to the very edge of a chromosome, otherwise the FASTA extraction will fail.
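As an illustration of this step, the window calculation could look like the Python sketch below. It is not the script distributed with the tool. The names `tss_window` and `rewrite_bed` are hypothetical, the TSS is taken as the integer midpoint of each entry in 0-based BED coordinates, and the windows are chosen so that, after strand-aware extraction, the TSS falls at 0-based index 7,000 of each 10,500nt sequence. Entries whose window would start before position 0 are skipped; the chromosome end is not checked here because that requires chromosome sizes.

```python
UPSTREAM = 7_000
DOWNSTREAM = 3_500


def tss_window(fields):
    """Return a 6-column BED record covering the TSS window, or None.

    fields holds the first 6 BED columns as strings. Any strand other
    than "-" is treated as the positive strand (an assumption).
    """
    chrom, start, end, name, score, strand = fields[:6]
    tss = (int(start) + int(end)) // 2  # 0-based midpoint of the entry
    if strand == "-":
        # Upstream lies at higher coordinates on the negative strand.
        win_start = tss - DOWNSTREAM + 1
        win_end = tss + UPSTREAM + 1
    else:
        win_start = tss - UPSTREAM
        win_end = tss + DOWNSTREAM
    if win_start < 0:
        return None  # too close to the chromosome start
    return [chrom, str(win_start), str(win_end), name, score, strand]


def rewrite_bed(in_path, out_path):
    """Write TSS windows for every usable entry; return how many were kept."""
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue  # skip blank lines, comments, and browser headers
            window = tss_window(line.rstrip("\n").split("\t"))
            if window is not None:
                dst.write("\t".join(window) + "\n")
                kept += 1
    return kept
```

Comparing the returned count with the number of input entries shows how many TSSs were dropped for lying too close to a chromosome start.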
The simplest way to prepare your FASTA file in the expected format is to install bedtools and use its getfasta function to extract the sequences. The "-s" parameter reverse-complements sequences on the negative strand. You must also download your genome (for example, the human hg38 or mouse mm10 genome assemblies). Please ensure that the first column of your BED file matches the chromosome names of your genome assembly (for example, you may have to specify '1' instead of 'chr1' in your BED file for the entries to match).
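One way to drive this extraction from Python is sketched below. The flags `-fi`, `-bed`, `-fo`, `-s`, and `-name` are standard bedtools getfasta options, but the file names in the usage note are placeholders, the helper names are hypothetical, and the optional `-name` flag is an addition for convenience rather than something the text above prescribes. The exact header format that `-name` produces differs between bedtools versions.

```python
import shutil
import subprocess


def getfasta_command(genome_fa, bed_path, out_fa, use_names=False):
    """Build the bedtools getfasta argument list for strand-aware extraction."""
    cmd = ["bedtools", "getfasta", "-s"]  # -s reverse-complements "-" entries
    if use_names:
        cmd.append("-name")  # label records using the BED name column
    cmd += ["-fi", genome_fa, "-bed", bed_path, "-fo", out_fa]
    return cmd


def run_getfasta(genome_fa, bed_path, out_fa, use_names=False):
    """Run bedtools getfasta, raising if bedtools is missing or fails."""
    if shutil.which("bedtools") is None:
        raise RuntimeError("bedtools was not found on PATH")
    subprocess.run(
        getfasta_command(genome_fa, bed_path, out_fa, use_names), check=True
    )
```

A call such as `run_getfasta("hg38.fa", "tss_windows.bed", "tss_windows.fa")` would then write one sequence per BED entry, assuming those files exist locally.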
You can now use your FASTA file to generate predictions.
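Before running predictions, it can help to confirm that every sequence in the FASTA file has the expected length. The following is a small sanity check using only the Python standard library. The function names are hypothetical, and the check assumes every sequence should be exactly 10,500 nucleotides after trimming or padding.

```python
import gzip


def open_text(path):
    """Open a plain or gzipped text file for reading."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def read_fasta(lines):
    """Yield (sequence ID, sequence) pairs from an iterable of FASTA lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            seq_id, chunks = (header[0] if header else ""), []
        elif seq_id is not None:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def wrong_length_ids(lines, expected=10_500):
    """Return the IDs of sequences whose length differs from expected."""
    return [sid for sid, seq in read_fasta(lines) if len(seq) != expected]
```

For a file on disk, `with open_text("my_sequences.fa.gz") as handle: print(wrong_length_ids(handle))` lists any records that still need trimming or padding (the file name here is a placeholder).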
Please install the required dependencies to run the script. The dependencies are listed in dependencies.txt and can be installed using pip install -r dependencies.txt.
Python modules:
tensorflow
numpy
pandas
keras
biopython
The prediction script is bundled with the test files in Xpresso-predict.zip. For large FASTA files, the predictions can be sped up significantly using GPUs, but CPUs are generally sufficient for predicting a small number of sequences.
Note that this script relies on being in the same directory as the other test files provided in the folder. Documentation for the script arguments is provided below for reference:
Argument | Description |
---|---|
<trained_model.h5> | Pre-trained model file. Please choose this according to the species and cell type of interest. |
<input_file.fa> | Input FASTA file. This file can also be gzipped to save space. |
<output_file.txt> | A tab-separated file with columns ID and SCORE, which hold information for each FASTA ID and its predicted transcriptional activity score. |
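Once predictions have been written, the output file can be loaded for downstream analysis. The sketch below uses Python's standard csv module and assumes the file begins with a header row naming the ID and SCORE columns, which the description above suggests but does not state outright. The function names are hypothetical.

```python
import csv


def parse_scores(lines):
    """Map each FASTA ID to its predicted score from tab-separated lines.

    Assumes the first line is a header containing the columns ID and SCORE.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return {row["ID"]: float(row["SCORE"]) for row in reader}


def load_scores(path):
    """Read a prediction output file from disk into an ID -> score dict."""
    with open(path, newline="") as handle:
        return parse_scores(handle)
```

If the file turns out to have no header row, `csv.DictReader` accepts a `fieldnames=["ID", "SCORE"]` argument that supplies the column names explicitly.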
Pre-trained models for the following species and cell types are provided in the bundle:
Species | Cell type | File |
---|---|---|
Human | Median among many cell types | humanMedian_trainepoch.11-0.426.h5 |
Human | K562 erythroleukemia cells | K562_trainepoch.11-0.4917.h5 |
Human | GM12878 lymphoblastoid cells | GM12878_trainepoch.06-0.5062.h5 |
Mouse | Median among many cell types | mouseMedian_trainepoch.05-0.278.h5 |
Mouse | Embryonic stem cells | mESC_trainepoch.10-0.3867.h5 |
You can run the predictions on the provided test files in the zip folder with the pre-trained human model using this sample command:
Also provided are prepared human (human_promoters.fa.gz) and mouse (mouse_promoters.fa.gz) gene promoters.