Puffin-D Help

Puffin-D

Welcome! Puffin-D is a deep learning model that predicts basepair-resolution transcription initiation signals from 100 kbp of sequence context. The model was trained to predict human transcription initiation signals for five datasets: three that measure mostly mature transcripts by precisely capturing their 5' ends (two variants of CAGE, and RAMPAGE) and two from techniques that measure only nascent transcripts (GRO-cap and PRO-cap). The model predictions correspond to basepair-resolution count profiles averaged across samples after applying the \(\log_{10}(x + 1)\) transformation, where \(x\) is the read count.

The model takes any 100 kbp genomic sequence and outputs all five types of transcription initiation signal at basepair resolution for the entire 100 kbp region. It can be used to study the effects of mutations on transcription initiation and to analyze the sequence basis of any transcription start site. To use Puffin-D from the command line, visit our GitHub.

Puffin-D is the prediction-focused model of two complementary models that we developed for decoding the sequence basis of transcription initiation. The interpretation-focused model, Puffin, is an explainable sequence model that provides basepair-level and motif-level interpretation and is available interactively at the Puffin webserver. Puffin is completely free for any non-commercial and academic use; please contact us for commercial applications.
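Because the predictions are on the log10(x + 1) scale described above, it can help to convert between that scale and the count scale. The following is a minimal Python sketch of the transformation and its inverse; the function names and the sample values are hypothetical and are not part of Puffin-D.

```python
import math


def to_log_signal(count):
    """Apply the log10(x + 1) transformation to a read count."""
    return math.log10(count + 1.0)


def to_count(signal):
    """Invert log10(x + 1) to bring a predicted value back to the count scale."""
    return 10.0 ** signal - 1.0


# Hypothetical predicted values at three consecutive positions.
predicted = [0.0, 1.0, 2.0]
counts = [to_count(v) for v in predicted]
print(counts)  # [0.0, 9.0, 99.0]
```

A predicted value of 0 therefore corresponds to no signal, and each unit increase corresponds to roughly a tenfold increase on the count scale.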

Prediction Target

CAGE (Cap Analysis of Gene Expression) - CAGE is a high-throughput sequencing technique that captures the starting points (5' ends) of capped RNA molecules in a sample. CAGE provides quantitative data on the abundance of total transcripts, which are mostly mature transcripts. Here we predict CAGE data from the FANTOM and ENCODE projects, which used two different variations of the CAGE technique.
RAMPAGE (RNA Annotation and Mapping of Promoters for the Analysis of Gene Expression) - RAMPAGE is a variation of the CAGE technique. RAMPAGE involves capturing the 5' ends of capped RNA molecules and generating paired-end sequencing reads to accurately map transcription start sites (TSSs) across the genome. RAMPAGE also captures and quantifies the abundance of total transcripts.
GRO-cap (Global Run-On coupled with cap analysis of gene expression) - GRO-cap captures nascent transcripts by combining two complementary methods: Global Run-On Sequencing (GRO-seq) and CAGE. GRO-cap begins with the labeling and isolation of nascent RNA transcripts that are actively being transcribed in a given cell or tissue. This is achieved by performing a "run-on" reaction, in which the cellular transcription machinery is allowed to continue transcription from active genes. Next, the isolated capped nascent RNA is ligated to 3' and 5' RNA adapters. The ligated RNA is then reverse transcribed and amplified for sequencing.
PRO-cap (Precision nuclear run-on with Capture) - PRO-cap is a genomic technique for capturing and sequencing the 5' ends of nascent RNA molecules. Like GRO-cap, it relies on a run-on reaction; PRO-cap applies four biotin-NTPs to provide base-pair-resolution information about the early stages of gene expression.

Prediction Input

On the home page, sequence input is expressed in a simple SeqStr format and then submitted to our job queue. The model can make predictions for up to 10 input sequences at once. Every sequence should start with a name on a new line. To predict for any reference genome region, it is enough to specify the sequence coordinates and the name of the genome assembly (hg38 or hg19). Sequences carrying mutations can also be specified in SeqStr format. Note that for input sequences longer than 100 kbp, the prediction will be made only for the central 100 kbp region. Here we list the SeqStr patterns that we currently support:

genomic interval - [reference_genome]chromosome:coordinate-coordinate strand (for example, [hg38]chr7:5480600-5580600 -)
multiple subsequences - [reference_genome]chromosome:coordinate-coordinate strand,@chromosome coordinate reference_alleles alternative_alleles;raw sequence; (for example, [hg38]chr7:5530575-5630625 -,@chr7 5530575 C T,@chr7 5530576 GC A;TTAAccggGGNaa;[hg38]chrX:1000000-1000017 +;TTAA;)
multiple sequences - <sequence_name>sequence\n<sequence_name>sequence (for example,
<s1>[hg38]chr7:5480600-5580600 -
<s2>[hg38]chr7:44746680-44846680 +)
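The patterns above can also be assembled programmatically before pasting them into the input box. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that the interval length equals end minus start, as the 100,000 bp example intervals suggest.

```python
WINDOW = 100_000  # input length the model expects, per the text above


def interval_seqstr(genome, chrom, start, end, strand):
    """Format a genomic interval as '[genome]chrom:start-end strand'."""
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    return f"[{genome}]{chrom}:{start}-{end} {strand}"


def centered_window(center, size=WINDOW):
    """Return (start, end) of a window of `size` bp around `center`.

    Assumption: the interval length is end - start.
    """
    start = center - size // 2
    return start, start + size


def multi_sequence_input(named_seqstrs):
    """Join (name, seqstr) pairs, one '<name>seqstr' entry per line."""
    if len(named_seqstrs) > 10:
        raise ValueError("at most 10 sequences per submission")
    return "\n".join(f"<{name}>{s}" for name, s in named_seqstrs)


start, end = centered_window(5_530_600)
text = multi_sequence_input([
    ("s1", interval_seqstr("hg38", "chr7", start, end, "-")),
])
print(text)  # <s1>[hg38]chr7:5480600-5580600 -
```

Centering the window on a position of interest keeps that position in the middle of the 100 kbp region that the model predicts.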

Example 1:

<wildtype>[hg38]chr7:5480600-5580600 -
<mutant>[hg38]chr7:5480600-5580600 -, @chr7 5530626 TATA GCGC

<wildtype> and <mutant> are sequence names
[hg38] specifies the reference genome from which the sequence is retrieved
chr7:5480600-5580600 - gives the chromosome, coordinates, and strand of the sequence
chr7:5480600-5580600 -, @chr7 5530626 TATA GCGC specifies the mutant sequence: chr7:5480600-5580600 - is the wild-type sequence to be modified, and @chr7 5530626 TATA GCGC replaces the TATA sequence at position chr7:5530626 with GCGC
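To make the effect of a substitution such as @chr7 5530626 TATA GCGC concrete, here is a minimal sketch that applies one substitution to a short toy sequence. It is not the server's implementation. It assumes the sequence is given on the forward strand and that the mutation position uses the same coordinate convention as the sequence start; the function name and the toy sequence are hypothetical.

```python
def apply_substitution(seq, seq_start, pos, ref, alt):
    """Replace `ref` with `alt` at genomic position `pos` in `seq`.

    Assumption: `seq` is forward-strand sequence and `pos` shares the
    coordinate convention of `seq_start`, so the offset is pos - seq_start.
    """
    offset = pos - seq_start
    if offset < 0 or offset + len(ref) > len(seq):
        raise ValueError("mutation lies outside the sequence")
    found = seq[offset:offset + len(ref)]
    if found.upper() != ref.upper():
        raise ValueError(f"reference mismatch: expected {ref}, found {found}")
    return seq[:offset] + alt + seq[offset + len(ref):]


# Toy stand-in for a genomic region (not real hg38 sequence).
wildtype = "GGCCTATAAAGGCC"
mutant = apply_substitution(wildtype, seq_start=100, pos=104, ref="TATA", alt="GCGC")
print(mutant)  # GGCCGCGCAAGGCC
```

Checking the reference allele before substituting guards against coordinates that point at the wrong position or the wrong assembly.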

Example 2:

The input sequence is constructed from separate sequence chunks that are concatenated together:
[hg38]chr7:5530575-5630625 -,@chr7 5530575 C T,@chr7 5530576 GC A;TTAAccggGGNaa;[hg38]chrX:1000000-1000017 +;TTAA;

; separates the sequence chunks that will be concatenated
[hg38]chr7:5530575-5630625 -,@chr7 5530575 C T,@chr7 5530576 GC A specifies a mutant sequence in which C at position chr7:5530575 is replaced with T and GC at position chr7:5530576 is replaced with A
TTAAccggGGNaa and TTAA are raw sequences
[hg38]chrX:1000000-1000017 + specifies another sequence by chromosome, coordinates, and strand
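The chunk structure described above can be illustrated with a small splitter that breaks a SeqStr on ";" and labels each piece as a genomic interval (with any attached mutations) or a raw sequence. This is a sketch for checking your own input, not the parser used by the server; the function name and the dictionary layout are hypothetical.

```python
def split_chunks(seqstr):
    """Split a SeqStr on ';' and classify each non-empty chunk."""
    chunks = []
    for part in seqstr.split(";"):
        part = part.strip()
        if not part:
            continue  # a trailing ';' leaves an empty piece
        if part.startswith("["):
            # Genomic interval, optionally followed by ',@...' mutations.
            interval, *mutations = [p.strip() for p in part.split(",")]
            chunks.append({
                "type": "interval",
                "interval": interval,
                "mutations": [m.lstrip("@") for m in mutations],
            })
        else:
            chunks.append({"type": "raw", "sequence": part})
    return chunks


example = ("[hg38]chr7:5530575-5630625 -,@chr7 5530575 C T,"
           "@chr7 5530576 GC A;TTAAccggGGNaa;[hg38]chrX:1000000-1000017 +;TTAA;")
for chunk in split_chunks(example):
    print(chunk)
```

For the Example 2 string this yields four chunks in order: the mutated chr7 interval, a raw sequence, the chrX interval, and a second raw sequence.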

Prediction Output

Puffin-D predicts basepair-resolution transcription initiation signals using only genomic sequence as input and, more importantly, can be used to analyze the sequence basis of any transcription start site (TSS) at motif and basepair levels. The predictions are visualized as a plot of the predicted signal for each of the five targets: FANTOM_CAGE, ENCODE_CAGE, ENCODE_RAMPAGE, GRO_CAP, and PRO_CAP.

You may also download the numerical predictions in JSON format.
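The layout of the downloaded JSON file is not documented here, so a reasonable first step is to inspect its structure before writing any analysis code. The sketch below summarizes whatever structure a parsed JSON value has without assuming specific keys; the `describe` helper and the sample object are hypothetical stand-ins, and the real file may look different.

```python
import json


def describe(value, depth=0, max_depth=3):
    """Return indented lines summarizing the structure of parsed JSON."""
    pad = "  " * depth
    if isinstance(value, dict):
        lines = [f"{pad}object with {len(value)} keys"]
        if depth < max_depth:
            for key, item in value.items():
                lines.append(f"{pad}  {key}:")
                lines.extend(describe(item, depth + 2, max_depth))
        return lines
    if isinstance(value, list):
        lines = [f"{pad}array of length {len(value)}"]
        if value and depth < max_depth:
            # Describe only the first item as a representative.
            lines.extend(describe(value[0], depth + 1, max_depth))
        return lines
    return [f"{pad}{type(value).__name__}"]


# Hypothetical stand-in for a downloaded file; the real layout may differ.
sample = json.loads('{"example_target": [[0.0, 0.5], [0.1, 0.2]]}')
print("\n".join(describe(sample)))
```

For a downloaded file, the same helper can be applied to the result of `json.load` on the opened file.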

Questions and feedback?

Thank you for using Puffin-D. If you have any questions or feedback, you can let us know at https://github.com/jzhoulab/puffin/discussions.
