Xpresso

Expected input formats

To generate predictions, please prepare your input as a FASTA file (a list of sequences) or a BED file (a list of genomic coordinates). If you are interested in predictions for protein-coding genes, which additionally account for the effects of predicted mRNA half-life, please retrieve the pre-computed predictions from the Downloads section. Otherwise, the script we provide will predict transcriptional activity without considering half-life.

FASTA format

FASTA files obey the following convention:

    >name1
    ATAGACAGACGAGACAGCAGACGACGCAAGATGCACGATTTA.....GATAGACAGATATGGAGAGGCCGATATGACCGAGATAGA
    >name2
    GACACGACAGGACCCACGGACGAGTATTTTGCACGTTTTGAC.....GATAGACAGACCTTTAGAGAAAACCCGAGAGACGTAGCA

In our case, the sequences should be 10,500 nucleotides long and oriented so that the presumed TSS has 7,000nt upstream of it and 3,500nt downstream. If your sequences are longer, please trim them down to 10,500 nucleotides. If they are shorter, please pad the sequences with Ns, keeping your TSS positioned 7,000nt into the sequence.
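The trimming and padding rule can be sketched in Python as follows. This is a minimal illustration, not part of the Xpresso bundle: `fit_to_window` and its `tss_index` argument (the 0-based position of the TSS within your sequence) are hypothetical names, and the sketch assumes the TSS base itself counts as the first of the 3,500 downstream nucleotides.

```python
UPSTREAM = 7000     # nucleotides expected before the TSS
DOWNSTREAM = 3500   # nucleotides expected from the TSS onward (assumption: TSS included)
WINDOW = UPSTREAM + DOWNSTREAM  # 10,500 nt in total


def fit_to_window(seq, tss_index):
    """Trim or N-pad `seq` so that the base at `tss_index` lands at index 7,000."""
    if not 0 <= tss_index < len(seq):
        raise ValueError("tss_index must fall inside the sequence")
    # Keep at most 7,000 nt before the TSS and 3,500 nt from the TSS onward.
    upstream = seq[max(0, tss_index - UPSTREAM):tss_index]
    downstream = seq[tss_index:tss_index + DOWNSTREAM]
    # Pad with Ns on the left and right wherever the sequence falls short.
    return upstream.rjust(UPSTREAM, "N") + downstream.ljust(DOWNSTREAM, "N")
```

Calling `fit_to_window(seq, tss_index)` on every record before writing the FASTA file yields sequences of a uniform 10,500 nt, whether the input was too long, too short, or off-center.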

BED format

See the BED file specification for more information. The first 6 columns of the BED file are required:

chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671)
chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0
chromEnd - The ending position of the feature in the chromosome or scaffold
name - Defines the name of the BED line, such as gene name or genomic position
score - A score between 0 and 1000. The value is optional and can be filled with a ".", but the column itself must be present
strand - Defines the strand. Either "+" or "-"
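A quick way to catch malformed lines before going further is to check each record against the six columns listed above. The following is a minimal sketch under that description only; `parse_bed6` is a hypothetical helper, not part of the Xpresso bundle.

```python
def parse_bed6(line):
    """Split one BED line and check the six required columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 tab-separated columns, got {len(fields)}")
    chrom, start, end, name, score, strand = fields[:6]
    start, end = int(start), int(end)  # raises ValueError on non-numeric coordinates
    if start < 0 or end < start:
        raise ValueError(f"{name}: need 0 <= chromStart <= chromEnd")
    if strand not in ("+", "-"):
        raise ValueError(f"{name}: strand must be '+' or '-', got {strand!r}")
    # The score is returned untouched, since "." is an accepted placeholder.
    return chrom, start, end, name, score, strand
```

Running this over every line of the BED file surfaces missing columns or an unspecified strand early, before any sequence is extracted.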

Preparing FASTA from BED

Using your BED file, you must generate a new BED file that spans the 7,000nt upstream and 3,500nt downstream of each TSS for which you want predictions. The following simple script assumes that the center of each entry is the location of your TSS and prepares your new BED file accordingly. Please ensure the TSS positions you specify are not close to the very edge of a chromosome, otherwise the FASTA extraction will fail.

# -l strips the trailing newline from each line, so the strand in column 6 compares cleanly
perl -lne '@a=split /\t/; $mid=int(($a[1]+$a[2])/2);
        if($a[5] eq "+"){ $a[1]=$mid-7000; $a[2] = $mid+3500; }
        else { $a[1]=$mid-3500; $a[2] = $mid+7000; }
        print join("\t", @a);' \
        < testfile.bed > testfile.prepared.bed
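For readers who prefer Python, the same window arithmetic can be sketched as below. `tss_window` and `prepare_bed` are hypothetical names, not part of the Xpresso bundle. Unlike the Perl script, this sketch rejects strands other than "+" or "-" and raises an error when a window would start before position 0. It cannot detect windows that run past the end of a chromosome, since that requires chromosome sizes.

```python
UPSTREAM, DOWNSTREAM = 7000, 3500


def tss_window(fields):
    """Return a copy of a 6-column BED record resized around its midpoint."""
    out = list(fields)
    start, end, strand = int(out[1]), int(out[2]), out[5]
    mid = (start + end) // 2  # the midpoint is taken as the TSS
    if strand == "+":
        new_start, new_end = mid - UPSTREAM, mid + DOWNSTREAM
    elif strand == "-":
        new_start, new_end = mid - DOWNSTREAM, mid + UPSTREAM
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    if new_start < 0:
        raise ValueError(f"{out[3]}: window starts before the chromosome start")
    out[1], out[2] = str(new_start), str(new_end)
    return out


def prepare_bed(lines):
    """Yield resized BED lines (without newlines), skipping blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            yield "\t".join(tss_window(line.split("\t")))
```

Passing an open file handle to `prepare_bed` and writing each yielded line followed by a newline produces the prepared BED file.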

The simplest way to prepare your FASTA file in the expected format is to install bedtools. You can then use the bedtools getfasta function to extract the sequences as shown below. The "-s" parameter reverse-complements sequences on the negative strand. You must download your genome (for example, the human hg38 or mouse mm10 genome assemblies). Please ensure that your first column matches the chromosome names of your genome assembly (for example, you may have to specify something like '1' instead of 'chr1' in your BED file to allow the entries to match).

 bedtools getfasta -s -name -fi YOURGENOME.fa -bed testfile.prepared.bed -fo testfile.fa
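Mismatched chromosome names are easy to check for ahead of time. The sketch below compares the first column of a BED file with the sequence names in a FASTA index (a `.fai` file, the tab-separated format written by `samtools faidx`, whose first column is the sequence name). The function names `fai_names` and `unmatched_chroms` are hypothetical, and the presence of a `.fai` file next to your genome is an assumption.

```python
def fai_names(fai_lines):
    """Return the set of sequence names from the lines of a .fai index."""
    return {line.split("\t", 1)[0].strip() for line in fai_lines if line.strip()}


def unmatched_chroms(bed_lines, genome_names):
    """Return the BED chromosome names that are absent from the genome, sorted."""
    seen = {line.split("\t", 1)[0].strip() for line in bed_lines if line.strip()}
    return sorted(seen - set(genome_names))
```

Both functions accept any iterable of lines, so open file handles can be passed directly. An empty result from `unmatched_chroms` means every BED entry refers to a sequence present in the genome.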

You can now use your FASTA file to generate predictions.
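Before running predictions, it can be worth confirming that every extracted sequence has the expected length of 10,500 nucleotides. The following sketch uses only the Python standard library; `read_fasta` and `wrong_length_records` are hypothetical helpers, not part of the Xpresso bundle.

```python
import gzip

EXPECTED_LENGTH = 10500  # 7,000 nt upstream + 3,500 nt downstream


def read_fasta(path):
    """Yield (name, sequence) pairs from a plain or gzipped FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    name, chunks = None, []
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)  # emit the final record


def wrong_length_records(path, expected=EXPECTED_LENGTH):
    """Return (name, length) for each record whose length differs from `expected`."""
    return [(n, len(s)) for n, s in read_fasta(path) if len(s) != expected]
```

An empty list from `wrong_length_records("testfile.fa")` indicates that all records have the expected length. Shorter records typically point to windows that ran off the edge of a chromosome.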

Required dependencies

Please install the required dependencies to run the script. The dependencies are listed in dependencies.txt and can be installed using pip install -r dependencies.txt.

Requirements to run the Python script

Python modules:
tensorflow
numpy
pandas
keras
biopython
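To confirm that the modules above are importable in your environment, a small check along these lines may help. It is a sketch using the standard library only; `missing_packages` is a hypothetical helper. The mapping reflects the fact that biopython is imported under the name `Bio`.

```python
from importlib.util import find_spec

# Package name -> top-level import name (biopython installs the "Bio" module).
REQUIRED = {
    "tensorflow": "tensorflow",
    "numpy": "numpy",
    "pandas": "pandas",
    "keras": "keras",
    "biopython": "Bio",
}


def missing_packages(required=REQUIRED):
    """Return the package names whose import module cannot be found, sorted."""
    return sorted(pkg for pkg, module in required.items() if find_spec(module) is None)
```

An empty list from `missing_packages()` means all five modules were found. It does not verify that their versions match those in dependencies.txt.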

Generating predictions

The prediction script is bundled with the test files in Xpresso-predict.zip. For large FASTA files, the predictions can be sped up significantly using GPUs, but CPUs are generally sufficient for predicting a small number of sequences.

python xpresso_predict.py <trained_model.h5> <input_file.fa> <output_file.txt>

Note that this script must be located in the same directory as the other test files provided in the folder. Documentation for the script arguments is provided below for reference:

Argument	Description
`<trained_model.h5>`	Pre-trained model file. Please choose this according to the species and cell type of interest.
`<input_file.fa>`	Input FASTA file. This file can also be gzipped to save space.
`<output_file.txt>`	A tab-separated file with columns `ID` and `SCORE`, which hold information for each FASTA ID and its predicted transcriptional activity score.

Pre-trained models for the following species and cell types are provided in the bundle:

Species	Cell type	File
Human	Median among many cell types	humanMedian_trainepoch.11-0.426.h5
Human	K562 erythroleukemia cells	K562_trainepoch.11-0.4917.h5
Human	GM12878 lymphoblastoid cells	GM12878_trainepoch.06-0.5062.h5
Mouse	Median among many cell types	mouseMedian_trainepoch.05-0.278.h5
Mouse	Embryonic stem cells	mESC_trainepoch.10-0.3867.h5

You can run the predictions on the provided test files in the zip folder with the pre-trained human model using this sample command:

python xpresso_predict.py pretrained_models/humanMedian_trainepoch.11-0.426.h5 input_fasta/testinput.fa.gz predictions.txt
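Once the command finishes, the output can be loaded for downstream analysis. The sketch below reads the tab-separated `ID` and `SCORE` columns with the standard library and ranks the entries by score. It assumes the file begins with a header row naming those two columns. `load_predictions` and `top_predictions` are hypothetical helpers, not part of the bundle.

```python
import csv


def load_predictions(path):
    """Read the ID and SCORE columns into a list of (id, score) tuples."""
    with open(path, newline="") as handle:
        # Assumption: the first row is a header containing "ID" and "SCORE".
        reader = csv.DictReader(handle, delimiter="\t")
        return [(row["ID"], float(row["SCORE"])) for row in reader]


def top_predictions(predictions, n=10):
    """Return the n entries with the highest predicted score."""
    return sorted(predictions, key=lambda item: item[1], reverse=True)[:n]
```

For example, `top_predictions(load_predictions("predictions.txt"), n=5)` would list the five sequences with the highest predicted transcriptional activity.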

Also provided are prepared human (human_promoters.fa.gz) and mouse (mouse_promoters.fa.gz) gene promoters.