Welcome! Puffin-D is a deep learning model that predicts basepair-resolution transcription initiation signals from 100kb of sequence context. The model was trained to predict human transcription initiation signals for five datasets: three from techniques that measure mostly mature transcripts by precisely capturing their 5' ends (two variants of CAGE, and RAMPAGE) and two from techniques that measure only nascent transcripts (GRO-cap and PRO-cap). The model predictions correspond to basepair-resolution count profiles averaged across samples after applying a \(\log_{10}(x + 1)\) transformation, where \(x\) is the read count.
The model takes any 100kb-long genomic sequence and outputs the five types of transcription initiation signals at basepair resolution for the entire 100kb region. It can be used to study mutation effects on transcription initiation and to analyze the sequence basis of any transcription start site. To use Puffin-D from the command line, visit our GitHub.
Puffin-D is the prediction-focused model of two complementary models that we developed for decoding the sequence basis of transcription initiation. The interpretation-focused model, Puffin, is an explainable sequence model that provides basepair-level and motif-level interpretation and is available interactively at the Puffin webserver. Puffin is completely free for any non-commercial and academic use; please contact us for commercial applications.
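Because the predicted signals are on a \(\log_{10}(x + 1)\) scale, it can help to see the transform and its inverse side by side. The following is a minimal sketch in plain Python; the function names are illustrative and are not part of Puffin-D, and since the training targets were averaged across samples, the inverse should be read as an approximate count-like value rather than an exact read count.

```python
import math


def to_log_signal(count):
    """Forward transform described above: log10(x + 1)."""
    return math.log10(count + 1.0)


def to_count_scale(signal):
    """Inverse transform: 10**y - 1 (illustrative helper, not a Puffin-D API)."""
    return 10.0 ** signal - 1.0


# A read count of 0 maps to a signal of 0; 9 reads map to 1; 99 reads map to 2.
counts = [0, 9, 99]
signals = [to_log_signal(c) for c in counts]
recovered = [to_count_scale(s) for s in signals]
```

On this scale, a difference of 1 between two predictions corresponds to roughly a tenfold change in \(x + 1\).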
CAGE (Cap Analysis of Gene Expression) - CAGE is a high-throughput sequencing technique that captures the starting points (5' ends) of capped RNA molecules in a sample. CAGE provides quantitative data on the abundance of total transcripts, which are mostly mature transcripts. Here we predict CAGE data from the FANTOM and ENCODE projects, which used two different variants of the CAGE technique.
RAMPAGE (RNA Annotation and Mapping of Promoters for the Analysis of Gene Expression) - RAMPAGE is a variation of the CAGE technique. RAMPAGE involves capturing the 5' ends of capped RNA molecules and generating paired-end sequencing reads to accurately map transcription start sites (TSSs) across the genome. RAMPAGE also captures and quantifies the abundance of total transcripts.
GRO-cap (Global Run-On coupled with cap analysis of gene expression) - GRO-cap captures nascent transcripts by combining two complementary methods: Global Run-On Sequencing (GRO-seq) and CAGE. GRO-cap begins with the labeling and isolation of nascent RNA transcripts that are actively transcribed in a given cell or tissue. This is achieved by performing a "run-on" reaction, in which the cellular transcription machinery is allowed to continue transcription from the active genes. Next, the isolated capped nascent RNA is ligated to 3' and 5' RNA adapters. The ligated RNA is then reverse transcribed and amplified for sequencing.
PRO-cap (Precision nuclear run-on with Capture) - PRO-cap is a genomic technique used for capturing and sequencing the 5' ends of nascent RNA molecules. Similarly to GRO-cap, PRO-cap captures nascent transcripts; it uses four biotin-NTPs to provide basepair-resolution information about the early stages of gene expression.
On the home page, input sequences are specified in a simple SeqStr format and then submitted to our job queue. The model can make predictions for up to 10 input sequences at once. Every sequence should start with a name on a new line. To predict on any reference genome region, it is enough to specify the sequence coordinates and the name of the genome assembly (hg38 or hg19). Sequences carrying mutations can also be specified in SeqStr format. Note that for input sequences longer than 100kb, the prediction will be made only for the central 100kb region. Here we list the SeqStr patterns that we currently support:
genomic interval - [reference_genome]chromosome:coordinate-coordinate strand (for example, [hg38]chr7:5480600-5580600 -)
multiple subsequences - [reference_genome]chromosome:coordinate-coordinate strand,@chromosome coordinate reference_alleles alternative_alleles;raw sequence; (for example, [hg38]chr7:5530575-5630625 -,@chr7 5530575 C T,@chr7 5530576 GC A;TTAAccggGGNaa;[hg38]chrX:1000000-1000017 +;TTAA;)
multiple sequences - <sequence_name>sequence\n<sequence_name>sequence (for example, <s1>[hg38]chr7:5480600-5580600 -)
[hg38]chr7:44746680-44846680 +
<wildtype>[hg38]chr7:5480600-5580600 -
<mutant>[hg38]chr7:5480600-5580600 -, @chr7 5530626 TATA GCGC
<wildtype>, <mutant> are sequence names
[hg38] specifies the reference genome that should be used to retrieve the sequence
chr7:5480600-5580600 - gives the chromosome, coordinates and strand of the sequence
chr7:5480600-5580600 -, @chr7 5530626 TATA GCGC specifies the mutant sequence, where chr7:5480600-5580600 - is the wild-type sequence to be modified and @chr7 5530626 TATA GCGC instructs to replace the TATA sequence at position chr7:5530626 with the GCGC sequence
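The substitution described above (replace a reference allele at a given position with an alternative allele) can be sketched as plain string manipulation. This is an illustration only, not the SeqStr implementation: the function name is hypothetical, and it assumes a plus-strand sequence whose first base sits at a known genomic coordinate, with the mutation position given in the same coordinate system.

```python
def apply_substitution(seq, seq_start, pos, ref, alt):
    """Replace `ref` with `alt` at genomic position `pos` (hypothetical helper).

    Assumes `seq` is a plus-strand sequence whose first base has genomic
    coordinate `seq_start`, and that `pos` uses the same coordinate system.
    """
    offset = pos - seq_start
    if offset < 0 or offset + len(ref) > len(seq):
        raise ValueError("mutation lies outside the sequence")
    found = seq[offset:offset + len(ref)]
    # Check the reference allele so a wrong position fails loudly.
    if found.upper() != ref.upper():
        raise ValueError(f"expected {ref!r} at {pos}, found {found!r}")
    return seq[:offset] + alt + seq[offset + len(ref):]


# Toy sequence standing in for a genomic region that starts at coordinate 100.
toy = "GGCCTATAAAGG"
mutant = apply_substitution(toy, 100, 104, "TATA", "GCGC")
```

Because the reference and alternative alleles may differ in length (as in GC to A), the mutated sequence can be shorter or longer than the original.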
The input sequence can also be constructed from separate sequence chunks that are concatenated together:
[hg38]chr7:5530575-5630625 -,@chr7 5530575 C T,@chr7 5530576 GC A;TTAAccggGGNaa;[hg38]chrX:1000000-1000017 +;TTAA;
; separates the sequence chunks that will be concatenated
[hg38]chr7:5530575-5630625 -,@chr7 5530575 C T,@chr7 5530576 GC A specifies a mutant sequence where C at position chr7:5530575 is replaced with T and GC at position chr7:5530576 is replaced with A
TTAAccggGGNaa, TTAA are raw sequences
[hg38]chrX:1000000-1000017 + specifies another sequence by its chromosome, coordinates and strand
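To make the chunk grammar above concrete, here is a small sketch that splits a SeqStr string on `;` and classifies each chunk as either a genomic interval (with optional `@` mutations) or a raw sequence. It is written only from the patterns listed on this page and is not the parser used by the server; the regular expressions, function names and returned dictionary layout are all assumptions, and sequence names in angle brackets are not handled.

```python
import re

# Patterns follow the SeqStr forms listed above; this is not the official parser.
INTERVAL = re.compile(r"^\[(\w+)\](\w+):(\d+)-(\d+)\s*([+-])$")
MUTATION = re.compile(r"^@(\w+)\s+(\d+)\s+([ACGTNacgtn]+)\s+([ACGTNacgtn]+)$")
RAW = re.compile(r"^[ACGTNacgtn]+$")


def parse_chunk(chunk):
    """Classify one ';'-delimited chunk as a genomic interval or a raw sequence."""
    parts = [p.strip() for p in chunk.split(",")]
    interval = INTERVAL.match(parts[0])
    if interval is None:
        # Raw sequences carry no mutation specifications.
        if len(parts) > 1 or RAW.match(parts[0]) is None:
            raise ValueError(f"unrecognized chunk: {chunk!r}")
        return {"type": "raw", "sequence": parts[0]}
    genome, chrom, start, end, strand = interval.groups()
    mutations = []
    for part in parts[1:]:
        mut = MUTATION.match(part)
        if mut is None:
            raise ValueError(f"unrecognized mutation: {part!r}")
        mut_chrom, pos, ref, alt = mut.groups()
        mutations.append((mut_chrom, int(pos), ref, alt))
    return {
        "type": "interval",
        "genome": genome,
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "mutations": mutations,
    }


def parse_seqstr(text):
    """Split on ';' and parse every non-empty chunk, preserving order."""
    return [parse_chunk(c.strip()) for c in text.split(";") if c.strip()]


example = (
    "[hg38]chr7:5530575-5630625 -,@chr7 5530575 C T,@chr7 5530576 GC A;"
    "TTAAccggGGNaa;[hg38]chrX:1000000-1000017 +;TTAA;"
)
chunks = parse_seqstr(example)
```

For the example string, `chunks` holds four entries in concatenation order: a mutated chr7 interval, a raw sequence, a chrX interval, and a second raw sequence.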
Puffin-D predicts basepair-resolution transcription initiation signals using only genomic sequence as input and, more importantly, can be used to analyze the sequence basis of any transcription start site (TSS) at motif and basepair levels. The predictions are visualized as a plot of the signals for each target: FANTOM_CAGE, ENCODE_CAGE, ENCODE_RAMPAGE, GRO_CAP and PRO_CAP.
You may also download the numerical predictions in JSON format.
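Once predictions for a wild-type and a mutant sequence are available as numbers, a mutation effect can be summarized as the per-base difference between the two profiles. The sketch below shows the idea; note that the layout of the downloaded file is not described on this page, so the nested `{sequence_name: {track_name: [values]}}` structure, the example values and the function name are all hypothetical and must be adapted to the actual file.

```python
import json

# Hypothetical layout: {sequence_name: {track_name: [per-base values]}}.
# The real download may be organized differently; adapt the lookups below.
example = json.loads("""
{
  "wildtype": {"FANTOM_CAGE": [0.0, 0.5, 2.0, 0.5]},
  "mutant":   {"FANTOM_CAGE": [0.0, 0.25, 1.0, 0.5]}
}
""")


def track_difference(preds, ref_name, alt_name, track):
    """Per-base difference (alt minus ref) for one track; hypothetical helper."""
    ref = preds[ref_name][track]
    alt = preds[alt_name][track]
    if len(ref) != len(alt):
        raise ValueError("profiles must have the same length to be compared")
    return [a - r for r, a in zip(ref, alt)]


diff = track_difference(example, "wildtype", "mutant", "FANTOM_CAGE")
# Index of the base with the largest absolute change in predicted signal.
strongest = max(range(len(diff)), key=lambda i: abs(diff[i]))
```

A direct position-by-position comparison like this is only meaningful when the mutation preserves sequence length (for example, TATA replaced with GCGC); since the signals are on a \(\log_{10}(x + 1)\) scale, the differences are log-scale changes.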
Thank you for using Puffin-D. If you have any questions or feedback, you can let us know at https://github.com/jzhoulab/puffin/discussions.