> Matt v. 1.1.0
=====

A toolkit for analyzing genomic sequences
with focus on down-stream analysis of alternative splicing events
  Versions
       1.2.1
       1.2.0
       1.1.0
       1.0.0

       All & change log
Commands for:

Data analysis
get_efeatures
get_ifeatures
cmpr_exons
cmpr_introns
cmpr_motif
get_kmers_from_pwm
get_pwm_hits
get_pwm_prof
get_regexp_hits
get_regexp_prof
perm_test
plot_motif
rna_maps
rna_maps_cisbp
test_cisbp_enrich
test_pwm_enrich
test_regexp_enrich


Data preparation
add_cols
add_rows
add_val
chg_labs
chk_tab
col_calc
col_uniq
def_cats
extr_geneids
extr_scafids
extr_seqs
get_colnms
get_cols
get_match
get_gcc
get_ovl
get_pwm
get_rows
get_sj
conc
get_seql
get_seqs
get_sss
get_vast
grp_calc
mk_uniq
prnt_tab
rand_rows
retr_rnaseq
rm_cols
rn_cols
row_calc
sort_tab
stratify

Contents

  1. Introduction
  2. Download and installation
  3. Tables: the central data structure
  4. Documentation of all commands
  5. Examples
  6. Extending Matt: adding new commands
  7. FAQ / Trouble shooting
  8. How to cite


Introduction

Matt is a Linux command-line tool-kit for analyzing genomic sequences with focus on the down-stream analysis of alternative splicing events. Being a POSIX-style command-line tool-kit, Matt is run on the terminal of any Linux-like system.

Matt is modular. That means, it comprises many commands, each of which tailored for one specific task. The combination of these distinct commands allows the user to solve complex tasks: The whole is more than the sum of its parts.

Usually, the user of Matt goes through two phases.

  1. Data preparation: fit the output from other programs, e.g., for the estimation of inclusion levels of alternative splicing events from RNA-seq data, to input requirements of Matt analyses. This might include:
  2. Data analysis: run Matt analyses, e.g.:
Though the main focus of working with Matt is the final analyses, Matt offers commands for data preparation within the same framework, too. The time needed for data preparation, though not being the main focus of working with Matt, should not be under-estimated. Some analysis commands produce reports as PDF documents, PDF graphics, or web pages summarizing results of analysis. To make the work with Matt more efficient, many Matt commands are designed to be combined by piping. Exploiting this capability improves the work experience with Matt.

Matt provides comprehensive help messages on the terminal. To get an overview of all available commands, call

> matt
For getting the help message for COMMAND, call
> matt COMMAND
Each help message describes all arguments, e.g.,
matt COMMAND <ARGUMENT_1> [<ARGUMENT_2>] <ARGUMENT_3> ...
where <ARGUMENT> is a necessary argument, but [<ARGUMENT>] optional and can be omitted. The order of arguments always matters; changing the order might cause errors or false results.

This web page describes all commands as well and gives example calls. Both, this online documentation and the terminal help messages complement each other.

Download and installation

  1. Create folder for installation
    > mkdir matt
    > cd matt
  2. Download archive
    > wget matt.crg.eu/v1_1_0/matt.tar.gz
    or
    > curl -OL matt.crg.eu/v1_1_0/matt.tar.gz
  3. Decompress
    > tar -xvzf matt.tar.gz
  4. Run INSTALL script
    > ./INSTALL
    or
    > source ./INSTALL
  5. Add link to any PATH folder
    > cd /path/to/PATH/folder
    > ln -s /path/to/folder/matt/matt
  6. Call Matt to get an overview of commands
    > matt

Tables

Tab-separated plain text tables are the main data structure Matt works on. The motivations for making tables the basic data-structure are:

  1. Compatibility: with the output of many programs for the estimation of inclusion levels of alternative splicing events that often output their results in form of tables, e.g., ASpli, IPSA MISO, rMATS, SANJUAN, SUPPA, VAST-TOOLS,
  2. User-friendliness: most users are familiar with tables in form of spread sheets (Libre Office, Excel) and therefore can inspect tables easily with any spread sheet program
  3. Clarity: combining all important pieces of information in a table with precise column names documents well the data
  4. Exploiting piping: many Matt commands can be combined efficiently using Piping as their main input and output are tables
  5. Synergy: other powerful command-line tools like Sed or Awk can be applied to tables and, therefore, be used in combination with Matt

Tables must have headers with column names and column names must be unique, e.g., a table describing micro exons could look like this:

         EXON_ID    START   END     SCAFF  STRAND  SEQ      ENSEMBL_GID      GROUP
         ex1        123127  123132  chr1   +       GCTAGT   ENSG00000223972  up
         ex2        903861  903868  chr3   +       GCTCGATT ENSG00038737632  up
         ex3        4291861 4291863 chr5   -       GTC      ENSG00009875223  down
         ex4        9128712 9128717 chr12  -       CGCCCT   ENSG00002652352  down
         ex5        714614  714620  chrM   +       CGCTCGC  ENSG00083273821  up
		
It should be emphasized that Matt does not assume any specific column structure, rather does the user choose specific columns with data necessary for the command the user wants to apply. As a consequence, when calling Matt commands the user needs to provide as arguments the names of these columns. For example, a user might want to do a differential motif enrichment analysis with the micro-exons table above and would list columns SEQ and GROUP, or, when the user wants to retrieve splice-site sequences of these micro exons, the user would use the same table but list columns START, END, SCAFF, STRAND. The order of columns in tables is never important, as the user always need to refer Matt to specific columns by their names.

Not assuming specific columns in a specific order makes Matt flexible and applicable to almost any tab-separated table. For adding column names to tables yet without column names, e.g., like BED files, Matt offers the command add_header.

Users are free to choose any column names they prefer. Though, it is recommended to use capital letters for all COLUMN NAMES. Further, column names should contain only alpha-numeric characters, A-Z, 0-9, and '_'.

Generally, tables might contain any type of data representable as text, including long sequences. For improving the work experience with Matt, it is recommended that each column, whenever possible, should contain only one piece of information / value. Combining different values into one cell by concatenations, e.g., chr1:876123-877155, makes the single values, e.g., chromosome, start, end, inaccessible for Matt.

Tables might contain comments; comment lines must start with a the character sharp '#'. It's recommended to have comment lines before the header.

Matt can process tables with Linux new-lines and Windows new-lines. Tables with old-style Mac new-lines or Dos new-lines need to be adapted first; See command chk_nls. When working with MS Excel, please save tables in format Windows Text.


Documentation of Matt commands

This online documentation of Matt commands introduces all commands and provides examples and explanations. For more details on commands you might want to checkout the terminal help-messages which Matt offers for all commands.

For displaying an overview of all available commands, call
> matt
For getting a help message for COMMAND, call
> matt COMMAND
Each help message defines all arguments, e.g.,
matt COMMAND <ARGUMENT_1> [<ARGUMENT_2>] <ARGUMENT_3> ...
where <ARGUMENT> is a necessary argument, but [<ARGUMENT>] optional and can be omitted.

Generally, one might categorize all Matt commands into commands for

  1. altering tables / preparing the data for later analysis (top of right panel)
  2. performing analyses (top of right panel)
If you are primarily interested in performing analyses, you might want to make yourself familiar with the commands for analyses first, and later learn about the commands for altering tables.

Table categorizing all Matt commands into eight groups

1. Import data / check table
     chk_nlsvisualize and check newlines in table
     chk_tabcheck if table is Matt conform, i.e., all rows have same number of columns
     get_pwmread-in position weight matrix model from text file
     get_sjread-in output table with skipped exons from SANJUAN program
     get_vastread-in output table with alternative splicing events from VAST-TOOLS program

2. Retrieve data from table
     col_uniqlist all distinct values of a column
     get_colnmsget column names of table
     get_colsextract entire columns from table
     get_matchmatch entries across two tables via IDs
     get_ovlget overlap, i.e., events (rows) which are present in several tables
     concconstruct artificial row ids by concatenating specific values across rows
     get_rowsextract entire rows from table, which fulfill constraints
     rand_rowsrandomly down-sample rows from table

3. Manipulate table
     add_colsadd columns to table
     add_headeradd header to table yet without header
     add_rowsadd rows to table
     add_valadd one new column with constant value to table
     chg_labsin a column with categorical values (group labels), change these labels
     def_catsdefine groups of item (rows) and add a new column with group labels to a table
     rm_colsremove columns from a table
     rn_colsrename column names in header of table

4. Retrieve sequences
     extr_seqsoutput in FASTA format sequences from a column of a table
     get_seqsretrieve biological sequences from FASTA file for coordinates given by a table

5. Analyze sequences
     cmpr_motifgenerate a comparative motif plot (PDF) for comparing two motifs of the same length
     get_efeaturesretrieve 75 features for exon investigations described by a table
     get_gccdetermine the GC content for sequences in a column of a table
     get_ifeaturesretrieve 50 features for intron investigations for introns described by a table
     get_pwm_hitsin sequences search for hits with a motif position weight matrix model
     get_pwm_profvisualize positional distributions of PWM hits across sequences
     get_regexp_hitsin sequences search for hits with Perl regular expressions
     get_regexp_profvisualize positional distributions of REGEXP hits across sequences
     get_seqlget length of sequences given in a column of a table
     get_sssget splice-site scores (MaxEnt models) for splice sites given in a column of a table
     plot_motifgenerate a motif plot as PDF (similar to sequence logos)
     test_pwm_enrichtest motif enrichment in two sets of sequences with motif position weight matrix
test_regexp_enrichtest motif enrichment in two sets of sequences with motif regular expression

6. Maths and statistics
     row_calcapply mathematical computations with values along rows
     col_calcapply mathematical computations with values along columns
     perm_testgeneral-purpose statistical permutation test

7. General purpose
     extr_geneidsextract automatically gene IDs from GTF for genomic events (e.g. genes, exon, introns)
     extr_scafidsget overview of chromosome/scaffold IDs from FASTA, FASTQ, GTF files
get_kmers_from_pwmget top k-mers most likely under a given position weight matrix model
     mk_uniqmake unique a table, i.e., neglect rows with replicated values
     prnt_tabprint a table nicely to screen
     retr_rnaseqretrieve RNA-seq data seqs from GEO with GEO accession numbers
     sort_tabsort a table according to values along column
     stratifygeneral-purpose stratification of two data sets according to any numerical feature

8. High-level analyses
     cmpr_exonscomparing exon groups: extract 75 exon features and determine discriminating features
     cmpr_intronscomparing intron groups: extract 50 intron features and determine discriminating features
     rna_mapsfor groups of exons / introns: generate PDF plots with motif RNA maps
test_cisbp_enrichfor groups of sequences: check for motif enrichment with all 297 CISBP-RNA binding motifs



> matt chk_nls
Description: Use this command to print a table to screen making visible new-lines. This helps to investigate which new-line characters are present in the table and how the table would need to be adapted to be processable by Matt. This might be necessary for tables with old-style Mac or Dos new-lines.

Example: With t1.tab being the example table from above
> matt chk_nls t1.tab
prints this table making visible new-lines.



> matt chk_tab
Description: Use this command to check if a given table is processable by Matt. To be so, all rows of a table must have the same number of fields. This command lists row-number of rows with a wrong number of fields or reports that the table is correctly formatted.

Example: With t1.tab being the example table from above
> matt chk_tab t1.tab
reports that this table is correctly formatted and, hence, processable by Matt.



> matt get_pwm
Description: Use this command to
  1. read-in a position weight matrix (PWM) given by a table
  2. learn a PWM from aligned sequences of identical length.
and turn it into the standard Matt format for PWMs. The first case is helpful when you have down-loaded a PWM or when you have extracted it from some document and want to use it with Matt. Matt tries to understand the structure of the table, e.g., if rows or columns correspond to positions. The second case is helpful when you have short sequences of identical length, e.g., splice sites or branch points, from which you want to learn a PWM (get the relative frequencies of letters at each position) and use it with Matt.

Example: With M037_0.6.txt being the human MBNL1 PWM from CISBP RNA
> matt get_pwm M037_0.6.txt 0.001 A,C,G,T
prints this MBNL1 PWM with minimal probability 0.001 instead of 0 and alphabet A,C,G,T instead of A,C,G,U. For re-defining the alphabet, you could also directly edit the text file M037_0.6.txt. To save this PWM in Matt format for later use with Matt, re-direct the output into a file.
> matt get_pwm M037_0.6.txt 0.001 A,C,G,T > hsa_mbnl1_pwm.tab



> matt get_sj
Description: This command is specifically tailored to the result table of SANJUAN (a program for estimating inclusion levels of alternative splicing events from RNA-seq data). These tables describe exons and already contain their start and end-coordinate, chromosome, and strand. With this command, you'll add Ensembl gene ids to the table, which are necessary for subsequent Matt analyses.

Example: SANJUAN outputs differentially spliced exons in a table called CEs_clean.txt.
> matt get_sj CEs_clean.txt hg19 > exons_from_sj.tab
will output a table like CEs_clean.txt with an additional column GENEID_ENSEMBL. It will duplicate rows (exons) for which SANJUAN did output several GENE_SYMBOLs; one row for each of the reported genes with its gene name and Ensembl gene id.



> matt get_vast
Description: This command is specifically tailored to the result tables of VAST-TOOLS (a program for estimating inclusion levels of alternative splicing events from RNA-seq data). Use this command to transform the output table from VAST-TOOLS into Matt format and extract sub-sets of reported alternative splicing events. VAST-TOOLs reports exons, introns, alt3, alt5 events in one final output table together with estimated PSI values across several data sets. Matt allows you to retrieve form this table only exons, or only intron, or only alt3 or only alt5 events. In addition, you might retrieve only events which have a minimum/maximum dPSI between data sets/conditions. This means, you can, e.g., define a set of differentially spliced exons and a reference set of non-differentially spliced exons which you later can use as input to high-level functions of Matt, e.g., to compare these sets wrt. their features. You might need to familiarize yourself with the format of VAST-TOOLS result tables as described in VAST-TOOLS documentation.

Example: With vts_out.tab being a final results table from VAST-TOOLS combine command
> matt get_vast vts_out.tab -minqab LOW -minqglob N -complex IR,IR-S,IR-C 
     -a kd1,kd2 -b ctr1,ctr2 -dplim 25,100,DPLIM > ir_events.tab
extracts all intron retention events from vts_out.tab whose PIR values could be estimated with minimal quality flag LOW across all samples (kd1, kd2, ctr1, ctr2) and with dPSI≥25 (more retained in kd than in ctr) between the two groups of samples kd1,kd2 and ctr1,ctr2.

The user can agument the result table (e.g. ir_events.tab) with gene IDs extracted from any given gene annotation (e.g. GTF file Hsa19.gtf with gene ids in field gene_id) by

> matt get_vast vts_out.tab -minqab LOW -minqglob N -complex IR,IR-S,IR-C -gtf Hsa19.gtf -f gene_id > ir_events.tab
The output table ir_events.tab will have one additional column GENEID with extracted gene IDs.

The command get_vast is one of the most complex commands, as it allows the user to perform complex event selection from VAST-TOOLS output tables. Generally, VAST-TOOLS allows to gather inclusion levels of many samples in one table. These samples might be event unrelated to each other. Though, by default, get_vast outputs only events whose inclusion level could be estimated with minimum quality VLOW across all samples. The more samples are gathered in one table, the less like it is that events have good estimates across all samples. Use -minqglob N and -ignoreNA to deactivate this default behavior.

> matt get_vast vts_out.tab -minqglob N -ignoreNA > all_events.tab
will output all events without any restriction.



> matt col_uniq
Description: Use this command to get all distinct values of a column together with their number of occurrences. This command should be applied only to columns with categorical data.

Example: With ir_events.tab containing the previously extracted intron-retention events (with command get_vast) and a column COMPLEX which contains the sub-types (IR, IR-S, IR-C) of intron-retention events as defined by VAST-TOOLS
> matt col_uniq ir_events.tab COMPLEX
prints the three sub-types and their numbers of occurrences in table ir_events.tab.



> matt get_colnms
Description: Use this command to get all column names and their indices from a table. This command is useful to investigate the structure of tables and the kind of data in their columns.

Example: With t1.tab being the table from above
> matt get_colnms t1.tab
prints the column headers EXON_ID, START, END, SCAFF, STRAND, SEQ, ENSEMBL_GID, GROUP and their indices.



> matt get_cols
Description: Use this command to retrieve one or several columns from a table.

Example: With t1.tab being the table from above
> matt get_cols t1.tab START END
prints the entire columns START and END from table t1.tab to screen.

You might want to redirect the output into a file to construct a new table consisting of the extracted columns

> matt get_cols t1.tab START END > new_table.tab




> matt get_match
Description: For a given table with entities, find matching values from a second table. E.g., assume you have a table 1 with a column with gene names and another table 2 with gene names and gene ids, then you can extract gene ids from table 2 which match to the genes of table 1.

Example: With t1.tab being the table from above and t2.tab describing a mapping EXON_ID→GENE_ID with columns
  • EXON_ID
  • GENE_ID
> matt get_match t1.tab EXON_ID t2.tab EXON_ID GENE_ID
prints a table with one column GENE_ID. Each row of this table corresponds to a row of t1.tab and it contains the GENE_ID for the corresponding EXON_ID of t1.tab.



> matt get_ovl
Description: Use this command to retrieve IDs of rows which occur in several tables. This command might be helpful if you want to determine the overlap of different data sets.

Example: With ir_events.tab containing the previously extracted intron-retention events and in column EVENT event IDs, and ir_events_reference.tab containing reference intron-retention events with column EVENT containing their IDs
> matt get_ovl ir_events.tab EVENT ir_events_reference.tab EVENT > common_ids.tab
writes a table with one column OVERLAP with event ids from table ir_events.tab which occur as well in table ir_events_reference.tab into file common_ids.tab.
To extract all rows from table ir_events.tab for these events, you can apply get_rows as follows
> matt get_rows ir_events.tab EVENT]common_ids.tab,OVERLAP[ > common_ir_events.tab
If tables do not have columns with a kind of event ID, you might use conc to construct artificial IDs.



> matt conc
Description: Use this command for concatenating entries of several columns. This command might be useful for constructing artificial row IDs, i.e., event IDs which might be necessary, e.g., for get_ovl if the table not yet contains event IDs, or for concatenating sequences from several columns.

Example: With t1.tab being the table from above
> matt conc t1.tab EVENT_ID '_' START END SCAFF STRAND
prints a table with one column EVENT_ID which contains in each row a event ID by concatenating START_END_SCAFF_STRAND. With command add_cols you might add this column with event IDs to the table t1.tab.

Other separation symbols like the empty string '' or white space ' ' can be specified.



> matt get_rows
Description: Use this command to retrieve rows from a table. Filters allow to specifically select rows, i.e., rows which have specific values (numeric or categorical) in some of the columns, or whose vales are empty (empty string).

Example: With t1.tab being the table from above
> matt get_rows t1.tab START[1,5000000] '!STRAND]+['
prints the rows from t1.tab with START in 1 to 5 Mio and STRAND is NOT +.

Constraints can be:

  • LENGTH[0,5] : numeric values in column LENGTH must be from 0 to 5 (inclusive)
  • LENGTH[0,5,10,15] : numeric values in column LENGTH must be from 0 to 5 or from 10 to 15
  • GROUP]g1,g2[ : categorical values in column GROUP must be either g1 or g2
  • GROUP]table.tab,column_name[ : categorical values in column GROUP must be in column column_name of table table.tab (useful if you want to specify many admissible categorical values)
  • GROUP][ : categorical value in column GROUP must be empty string
  • COLUMNA==COLUMNB: entries in both columns must be equal
  • '!LENGTH[0,5]' : negation; numerical value in column LENGTH not from 0 to 5
  • '!COLUMNA==COLUMNB' : negation; entries in both columns must not be equal
  • constraints can be combined




> matt rand_rows
Description: Use this command to randomly draw rows from a table. This command might be useful when you want to sub-sample data sets.

Example: With ir_events.tab containing the previously extracted intron-retention events (with command get_vast)
> matt rand_rows ir_events.tab 500 3628929065
prints 500 randomly drawn (seed 3628929065) rows (without replacement) from ir_events.tab. You might want to save these events into a file for later use.
> matt rand_rows ir_events.tab 500 3628929065 > 500_ir_events.tab



> matt add_cols
Description: Use this command to add columns to existing tables. You might want to retrieve one column from a table and add it to another table.

Example: With t1.tab being the table from above and having constructed row IDs with conc like
> matt conc t1.tab EVENT_ID _ START END SCAFF STRAND > rowids.tab
doing
> matt add_cols t1.tab rowids.tab
will add the column ROW_ID from rowids.tab to the right end of table t1.tab.
Combining both commands using piping is more efficient.
> matt conc t1.tab EVENT_ID _ START END SCAFF STRAND | matt add_cols t1.tab -



> matt add_header
Description: Use this command to add column names (entire table header) to a table without header to make it processable with Matt.

Example: With exons.bed being the BED file with the first three mandatory fields only
> matt add_header exons.bed chrom chromStart chromEnd
adds the three necessary column headers directly to the file exons.bed without printing it to screen.



> matt add_rows
Description: Use this command to add rows to existing tables. You might want to add rows of one table to another table, e.g., to join data sets into a single table.

Example: With t1.tab being the table from above
> matt add_rows t1.tab t1.tab
will add all rows from t1.tab to the bottom of t1.tab (which is not very meaningful).



> matt add_val
Description: Use this command to add one column with a constant value across all rows to existing tables. You might use this command to add a column containing some kind of group IDs (e.g., up_regulated) to tables which you later join into one table.

Example: With t1.tab being the table from above
> matt add_val t1.tab DATE 22.02.2017
will print table t1.tab to screen with a additional column DATE containing the value 22.02.2017 in all rows.



> matt chg_labs
Description: Use this command to change labels which are already in a column of a table. You might use this command if you have a column with group IDs and want to change these IDs / labels.

Example: With t1.tab being the table from above
> matt chg_labs t1.tab EXON_ID ex1:ex11 ex2:ex22
will print table t1.tab to screen where in column EXON_ID ex1 is changed for ex11, and ex2 for ex22.



> matt def_cats
Description: Use this command (define categories) for creating a column with group IDs for a table. Rows of this table corresponds to events and columns to event features. You can define groups wrt. to ranges of their features using the same syntax as you use with command get_rows.

This Matt command is very important as for many analysis you need to specific groups of items (exons, introns, or sequences, or more) to be compared.

Example: With t1.tab being the table from above

> matt def_cats t1.tab GROUP2 'g1=START[0,5000000] STRAND]+[' 
     'g2=START[0,5000000] !STRAND]+[' 'g3=START[5000001,10000000]'
will output a table with column GROUP2 with group IDs g1, g1, g2, g3, g1 categorizing each micro exon in table t1.tab.



> matt rm_cols
Description: Use this command to remove columns from a table changing the original table.

Example: With t1.tab being the table from above where we add a column with
> matt add_val t1.tab DATE 22.02.2017 > t2.tab
doing
> matt rm_cols t2.tab DATE STRAND
will print a table with columns DATE and STRAND to screen and remove these columns from the original t2.tab.



> matt rn_cols
Description: Use this command to rename columns of tables. You might use this command before you want to join by-column two tables which have some column names in common. Tables must not contain column names several times. Each column needs to have a unique name.

Example: With t1.tab being the table from above
> matt rn_cols t1.tab SCAFF:SCAFFOLD STRAND:STRAND2
will print t1.tab to screen where column SCAFF got re-named into SCAFFOLD and column STRAND got STRAND2.



> matt extr_seqs
Description: Use this command to retrieve sequences from a table in FASTA format. You might use this command if you want to run other software on sequences which you have obtained with Matt. Often, software for analyzing sequences expect FASTA input format.

Example: With t1.tab being the table from above
> matt extr_seqs t1.tab SEQ EXON_ID
will print the sequences from column SEQ into FASTA format where values in column EXON_ID will be used as sequence IDs.

If the table not yet contains a column with kind of sequence IDs, you might want to use conc to construct them.



> matt get_seqs
Description: Use this command to retrieve genomic sub-sequences from FASTA files. Usually, you'll have a table specifying genomic coordinates (e.g. START, END, SCAFFOLD/CHROMOSOME, STRAND) and you want to retrieve genomic sequences with these coordinates (e.g., sequences of exons, introns, genes, splice sites or any other genomic entity) from a FASTA file (which contains all sequences of the entire genome). The command get_seqs has three modes making it easy to retrieve different genomic sequences wrt. to the coordinates.


Schematic view of positions of START and END coordinates and the four length arguments for mode 3.


The three modes make it easy to retrieve:
  1. sequences from START to END, including START and END position
  2. two sequences per START-END coordinate pair (START and END exclusive)
    1. up-stream sequence of length L1 of region START-END, i.e., from START-L1 to START-1 (on + strand)
    2. down-stream sequence of length L2 of region START-END, i.e., from END+1 to END+L2 (on + strand)
  3. two sequences per START-END coordinate pair (START and END inclusive)
    1. up-stream sequence around START, i.e., from START-L1 to START+L2-1 (on + strand)
    2. down-stream sequence around END, i.e., from END-L3+1 to END+L4 (on + strand)
Sequences from - strand get reverse-complemented automatically and up/down-stream sequences come from the up/down-stream regions wrt. START end END coordinates.

Example: With t1.tab being the table from above
> matt get_seqs t1.tab START END 20 10 SCAFF STRAND Hsa_genome.fa
will extract for all micro exons in t1.tab the up-stream intronic sequences of length 20 and the down-stream intronic sequences of length 10 and print these sequences as a table with columns SEQ_UP and SEQ_DOWN. The scaffold/chromosome IDs in column SCAFF of table t1.tab must be the same as used in Hsa_genome.fa.

Using piping, you can add the retrieved sequences directly to the table t1.tab as get_seqs outputs the sequences as a two-column table.
> matt get_seqs t1.tab START END 20 10 SCAFF STRAND ...
       ... Hsa_genome.fa | matt add_cols t1.tab -




> matt cmpr_motif
Description: Use this command to plot a PDF for visually comparing two motifs (position weight matrices) of the same length. Motifs can be given either by tables describing position weight matrices (get_pwm) or by tables containing sequences (aligned, all of same length) from which previously to plotting Matt will estimate weight matrices.

Example: With hsa_mbnl1_pwm.tab and hsa_a2bp1_pwm.tab being two PWMs retrieved from CISBP-RNA, both being of length 7,
> matt cmpr_motif hsa_mbnl1_pwm.tab hsa_a2bp1_pwm.tab mbnl1_vs_a2bp1.pdf
generates the PDF mbnl1_vs_a2bp1.pdf which shows the motif plots (plot_motif) for both PWMs and in the center a diagram showing at each position the log2-ratio of probabilities of each nucleotide.





> matt get_efeatures
Description: Use this command to retrieve 75 features of interest when investigating and comparing exon sets If you have a table with basic information on exons like their genomic coordinates and the Ensembl gene ID of genes the exons belong to, you will get a table with extracted exon features including but not limited to:
  • lengths of: exon and flanking introns
  • strength of its splice sites using maximum-entropy models (publication, DOI: 10.1089/1066527041410418)
  • GC content of exon and flanking introns
  • strength, score, position, sequence of predicted branch points in up-stream intron using the SVN-BPfinder (publication, DOI: 10.1371/journal.pcbi.1001016)
  • and many more

Overview of features

For detecting features discriminating sets of exons, the command cmpr_exons is more suited.

Examples: To get an overview and a short description of all features
> matt get_efeatures explain

With t1.tab and assuming it describes human micro exons
> matt get_efeatures t1.tab START END SCAFF STRAND ENSEMBL_GID ...
      ... Hsa19.gtf Hsa19.fa Hsap
prints a table with columns containing exon feature and extracted sequences of the exons, up/down-stream introns, and splice sites.

It is recommended to use GTF files with genome annotation from Ensembl, as Matt is tailored to their structure.



> matt get_gcc
Description: Use this command to determine the GC content of genomic sequences.

Example: With t1.tab being the table from above
> matt get_gcc t1.tab SEQ
will print a table with one column SEQ_GCC containing the determined values of the GC content.

Any letter not A, C, G, T, a, c, g, t will be completely neglected from the computation of the GC content.



> matt get_ifeatures
Description: Use this command to retrieve 50 features of interest when investigating and comparing intron sets. If you have a table with basic information on introns like their genomic coordinates and the Ensembl gene ID of genes the introns belong to, you will get a table with extracted intron features including but not limited to:
  • lengths of: intron and flanking exons
  • strength of its splice sites and next splice sites of flanking introns using maximum-entropy models (publication, DOI: 10.1089/1066527041410418)
  • GC content introns and flanking exons
  • strength, score, position, sequence of predicted branch points using the SVN-BPfinder (publication, DOI: 10.1371/journal.pcbi.1001016)
  • and many more

Overview of features

For detecting features discriminating sets of introns, the command cmpr_introns is more suited.

Examples: To get an overview and a short description of all features
> matt get_ifeatures explain

With table introns.tab describing human introns with columns START, END, SCAFF, STRAND, ENSEMBL_GID
> matt get_ifeatures introns.tab START END SCAFF STRAND ENSEMBL_GID ...
      ... Hsa19.gtf Hsa19.fa Hsap
prints a table with columns containing the feature values and extracted sequence of intron, up/down-stream exon, splice sites.



> matt get_pwm_hits
Description: Use this command to predict hits in sequences with a motif position weight matrix model (PWM).

Example: With seqs.tab being a table containing sequences (containing A,C,G,T) in column SEQ and some kind of sequence IDs in column ID and hsa_mbnl1_pwm.tab the MBNL1 PWM from CISBP-RNA whose alphabet we had redefined from A,C,G,U to A,C,G,T, the command
> matt get_pwm_hits seqs.tab SEQ ID hsa_mbml1_pwm.tab 10 single INFER_BGPWM
will estimate a simple background PWM from all sequences (being identical to a homogeneous Markov model of order 0), and predict hits with the MBLN1 motif PWM at all position where the likelihood ratio MBLN1 PWM vs. background PWM is at least 10 or higher, i.e., where it is at least 10 times more likely to actually have a MBNL1 binding site than having no MBNL1 binding site (background). The command prints a table with columns SEQ_ID, START, END, STRAND, HIT, LOGSCORE describing the hits.



> matt get_pwm_prof
Description: Use this command to plot positional distributions of hits of motif position weight matrices (PWMs) across sequences. This command involves predicting hits of PWMs as well. Hence, you need to specify the input sequences, the PWMs, and information on how hits should be predicted.

Example: Assume table exons.tab describes exons and contains beside others these columns:
  1. SEQ_UP: up-stream intronic sequences (150 nt)
  2. SEQ_DOWN: down-stream intronic sequences (150 nt)
  3. DATASET: all exons are classified into non-differentially spliced (ndiff) and differentially spliced (down) and this column contains the group IDs: ndiff and down
Assume table pwms.tab contains information about two motif PWMs of RNA binding proteins from CISBP-RNA and looks like this:
            NAME    FILE        BG            THRESH
            A2BP1   a2bp1.tab   INFER_BGPWM   5
            MBNL1   mbnl1.tab   INFER_BGPWM   5
		
Both files as1bp1.tab and mbnl1.tab were obtained by applying get_pwm to the original PWM files from CISBP-RNA to re-define the alphabet to A,C,G,T, fit it to the alphabet of the sequences. The command:
> matt get_pwm_prof exons.tab SEQ_UP RL SEQ_DOWN LR ...
         ... pwms.tab NAME FILE THRESH BG DATASET[down,ndiff]
will go through the following steps for each foreground PWM (MBNL1 and A2BP1 also known as RBFOX1) and each column (SEQ_UP and SEQ_DOWN) and each group (ndiff and down):
  1. estimate a background PWM by taking the frequencies of the four nucleotides A,C,G,T across all sequences in column and group
  2. predict hits at all sequence positions with P_fgPWM(seq)/P_bgPWM(seq)≥5
  3. estimate a positional density of predicted hits with a Normal kernel considering all hits
  4. estimate a positional cumulative density considering all hits
  5. estimate a positional density with Normal kernel considering only the first hit per sequence
  6. estimate a positional cumulative density considering only the first hit per sequence
The parameter LR and RL defines how the cumulative distributions will be plotted (LR: increasing from left to right, RL: increasing from right to left) and which are the first positions per sequence (LR: first position = left most position, RL: = right-most position).


The command (from above)
> matt get_pwm_prof exons.tab SEQ_UP RL SEQ_DOWN LR ...
         ... pwms.tab NAME FILE THRESH BG DATASET[down,ndiff]
creates a folder pwm_profiles and therein a website 0_all_results.html with all results and a table ydiff_scores_of_dens_and_cdens.tab which contains differences (between the densities down vs. ndiff) for identifying the PWMs for which the densities are most different.

Example website with results:

Matt: Positional distributions of hits of motif PWMs

Data file: exons.tab
Categories (number of sequences): down ( 868 ), ndiff ( 661 )
Sequence columns selected (plotting parameter): SEQ_UP ( RL ), SEQ_DOWN ( LR )
File with information on motif PWMs and background models: pwms.tab
Date: Feb 27, 2017


Ordered Results

Ordering from top to bottom in each column: corresponds to most negative
to most positive differences of densities / cumulative densities to
densities / cumulative densities of reference group
Reference group: ndiff
If reference group is none then differences are determined wrt. zero line / diagonal with slope 1.

Densities all positions Cumulative densities all positions Densities first positions Cumulative densities first positions
MBNL1
MBNL1
MBNL1
MBNL1
A2BP1
A2BP1
A2BP1
A2BP1


Plots

A2BP1 - a2bp1.tab

Back to Ordered Results



MBNL1 - mbnl1.tab

Back to Ordered Results







> matt get_regexp_hits
Description: Use this command to predict hits in sequences with Perl regular expressions.

Example: With seqs.tab being a table containing sequences (containing A,C,G,T) in column SEQ and some kind of sequence IDs in column ID the command
> matt get_regexp_hits seqs.tab SEQ ID 'ACCG|G[GC]C|GCCT' single
will search for possibly overlapping hits of ACCG or GGC or GCC or GCCT (case insensitive) and outputs them as a table with columns SEQ_ID, START, END, STRAND, HIT. The argument single states that Matt should not search in the reverse-complements. If several hits start at the same position, only the first gets reported according to their order in the call. If motifs are of variable length, the longest match will be reported only.



> matt get_regexp_prof
Description: Use this command to plot positional distributions of hits of Perl regular expressions across sequences. You need to specify the input sequences, and the regular expressions.

Example: Assume table exons.tab describes exons and contains beside others these columns:
  1. SEQ_UP: up-stream intronic sequences (150 nt)
  2. SEQ_DOWN: down-stream intronic sequences (150 nt)
  3. DATASET: all exons are classified into non-differentially spliced (ndiff) and differentially spliced (down) and this column contains the group IDs: ndiff and down
Assume table regexps.tab contains information about two Perl regular expressions and looks like this:
                      NAME    REGEXP
                      MBNL1   [GC][CT]TT[GA]C  
                      A2BP1   [TA]GCATG[CA]|TACATA
		
The command:
> matt get_regexp_prof exons.tab SEQ_UP RL SEQ_DOWN LR ...
         ... regexps.tab DATASET[down,ndiff]
will go through the following steps for each regular expression (MBNL1 and A2BP1) and each column (SEQ_UP and SEQ_DOWN) and each group (ndiff and down):
  1. predict hits (exact matches) for regular expression
  2. estimate a positional density of hits with Normal kernel considering all hits
  3. estimate a positional cumulative density of hits considering all hits
  4. estimate a positional density with Normal kernel considering only the first hit per sequence
  5. estimate a positional cumulative density considering only the first hit per sequence
The parameter LR and RL defines how the cumulative distributions will be plotted (LR: increasing from left to right, RL: increasing from right to left). This parameter also defines the first position per sequence (LR: first position = left most position, RL: = right-most position).


The command (from above)
> matt get_regexp_prof exons.tab SEQ_UP RL SEQ_DOWN LR ...
         ... regexps.tab DATASET[down,ndiff]
creates the folder regexp_profiles and therein a website 0_all_results.html with all results and a table ydiff_scores_of_dens_and_cdens.tab which contains differences (between the densities down vs. ndiff) for identifying the regular expressions (patterns) for which the densities are most different.

Example website with results:

Matt: Positional distributions of hits of REGEXPs

Data file: exons.tab
Categories (number of sequences): down ( 868 ), ndiff ( 661 )
Sequence columns selected (plotting parameter): SEQ_UP ( RL ), SEQ_DOWN ( LR )
K or file with information on REGEXPS: regexps.tab
Date: Feb 27, 2017


Ordered Results

Ordering from top to bottom in each column: corresponds to most negative
to most positive differences of densities / cumulative densities to
densities / cumulative densities of reference group
Reference group: ndiff
If reference group is none then differences are determined wrt. zero line / diagonal with slope 1.

Densities all positions Cumulative densities all positions Densities first positions Cumulative densities first positions
A2BP1
A2BP1
A2BP1
A2BP1
MBNL1
MBNL1
MBNL1
MBNL1


Plots

MBNL1 - [GC][CT]TT[GA]C

Back to Ordered Results



A2BP1 - [TA]GCATG[CA]|TACATA

Back to Ordered Results







> matt get_seql
Description: Use this command to determine lengths of sequences.

Example: With t1.tab being the table from above
> matt get_seql t1.tab SEQ
will print a table with one column SEQ_LENGTH with the lengths of the sequences in table t1.tab in column SEQ.



> matt get_sss
Description: Use this command to estimate the splice site strength of 3' and 5' splice sites using maximum-entropy models (DOI: 10.1089/1066527041410418). Often you'll start with a table of exons or introns. First, you need to retrieve the sequences of the 3' and 5' splice sites. For this, you might use the command get_seqs (mode 3).

Example: Assuming table introns.tab contains sequences (xxxGTxxxx, 3 exon and 6 intron positions) of the 5' splice sites in column 5SS_SEQ and sequences (xxxxxxxxxxxxxxxxxxAGxxx, 20 intron and 3 exon positions) of the 3' splice sites in column 3SS_SEQ, then
> matt get_sss introns.tab 3SS_SEQ 5SS_SEQ
will estimate the strength of the splice sites and output them as a table with two columns STRENGTH_3SS and STRENGTH_5SS.

Using piping allows the user to extract the sequences and estimate the splice site strength at once. E.g., assume table introns.tab has columns START, END, SCAFF, STRAND describing human introns:

> matt get_seqs introns.tab START END 3 6 20 3 SCAFF STRAND Hsa19.fa | matt ... 
   ... get_sss - SEQ_UP SEQ_DOWN
		




> matt plot_motif
Description: Use this command to create a motif plot as PDF similar to sequence logos for a given position weight matrix model (PWM) or from a set of aligned sequences of identical length. For obtaining PWMs, re-defining their alphabet or setting a minimum probability, you might use the command get_pwm.

Example: With human_nova2_pwm.tab being the human NOVA2 PWM from CISBP-RNA with re-defined alphabet (A,C,G,T instead of A,C,G,T) and minimal probability 0.001
> matt plot_motif human_nova2_pwm.tab
will create a PDF human_nova2_pwm_motifplot.pdf visualizing the probabilities of the nucleotides at each position, the consensus sequence (letters at the top of the bars), and the information content at each position.





> matt test_pwm_enrich
Description: Use this command to statistically test for enrichment/depletion of motifs comparing two sequence sets. The motifs are defined by position weight matrices (PWM). The sequences and the information on the PWMs need to be given as tables. This command first predicts PWM hits in the two sets of sequences and then applies a permutation test for estimating p-values.

Example:
Sequence data: Assume table exons.tab contains sequences and group IDs specifying to which group (cntr, up, down) each sequence belongs. Table exons.tab has these columns (and potentially many more):
  • SEQ: containing sequences
  • GROUP_ID: containing group ids (cntr, up, down)

PWM specification: Assume you have a 3 PWMs (pwm1, pwm2, pwm3) which you want to use for this enrichment analysis. Table pwms.tab contains information about these PWMs in the following columns
  • NAME: containing short names for the motifs
  • FILE: containing the path to file with table defining the PWM (in Matt format: see get_pwm)
  • THRESH: containing the threshold to be applied for predicting hits with this PWM, e.g., 10 (more details get_pwm_hits)
                NAME   FILE                THRESH
                pwm1   /path/to/pwm1.tab   10
                pwm2   /path/to/pwm2.tab   10
                pwm3   /path/to/pwm3.tab   10
Running
> matt test_pwm_enrich exons.tab SEQ GROUP_ID[up,cntr,down,cntr] single ...
      ...  pwms.tab NAME FILE THRESH INFER_BGMODEL quant 10000 > results.tab
will perform 6 comparisons: (pwm1, pwm2, pwm3) x (up vs. cntr, down vs. cntr), and will output a table with all information on these comparison. This table gets written into file results.tab as we re-direct the output.

Each comparison (say, for pwm1 and up vs. cntr) consists of these steps:
  1. estimate background model bm1 (homogeneous Markov model of order 0) from sequences of the foreground data set, here group up
  2. estimate background model bm2 from sequences of the background data set, here group cntr
  3. predict hits with pwm1 in sequences of group up (ratio pmw1/bm1≥10)
  4. predict hits with pwm1 in sequences of group cntr (ratio pmw1/bm2≥10)
  5. apply a permutation test with 10000 iterations (repeats) and determine p-value
Instead of inferring the background model (argument INFER_BGMODEL), you can provide a background PWM or predict hits without background model only using the foreground PWM. In the latter case, the thresholds defined in table pwms.tab must be defined wrt. the log-score of the PWM, e.g, -10 or -8.

Test for positional differences of hits: The above described procedure considers the number of PWM hits across entire sequences. Imagine we predict 14 hits in the sequences of group up close to their left end, and 14 in sequences of group cntr close to their right end. Considering entire sequences, we won't find this positional difference; though this difference might be interesting from a scientific point of view.
For detecting positional differences, Matt predicts the PWM hits in both sets of sequences once (as seen above), but then tests for significant enrichment/depletion 4 times considering:
  1. the entire sequences as given
  2. only the first 1/3 part of the sequences
  3. only the internal 1/3 part of the sequences
  4. only the last 1/3 part of the sequences
and reports p-values for all four tests.



Output: a table containing detailed information (number of data points, test statistics, p values). The most important columns might be SIGNIFICANT_RESULT_ALL, SIGNIFICANT_RESULT_FIRST, SIGNIFICANT_RESULT_MIDDEL, SIGNIFICANT_RESULT_END which highlight found significant enrichment differences between groups.

Number and length of sequences: Sequences can be of variable length and the two sequence sets might consist of a different number of sequences. Matt normalizes the number of found PWM hits in each sequence wrt. its length and within each data set wrt. its size (number of sequences).

Avoid false positives: Normalizing the number of PWM hits might result in false positive findings. For example, if hits actually occur only in a specific sub-region of the sequences, e.g, only in the 5'-ends of introns, and if the sequence in one set have a very different length than sequences in the other set.


Testing these sequences which are of very different length but contain the same number of PWM hits, after normalization it will look if there were more hits in group cntr (per sequence position) than in group up. In this or similar cases, it is recommended to cut the sequences to the regions (highlighted) where you actually expect hits.

Change test statistic of the enrichment analysis: number of hits vs. number of sequence with hits
For testing enrichment, one might either consider
  1. the number of hits across sequences per group of sequences
  2. the number of positive sequences containing hits per group
All examples above consider the number of hits. Depending on the research question, it might be more reasonable to consider the fraction of positive sequences in each group of sequences. Positive sequences are those which contain a minimum number of predicted hits. For selecting the applied test statistic, use argument <MODE>.



> matt test_regexp_enrich
Description: Description: Use this command to statistically test for enrichment/depletion of motifs comparing two sequence sets. The motifs are defined by Perl regular expressions. The sequences and the information on the regular expressions need to be given as tables. This command first predicts exact matches of the regular expressions in the two sets of sequences and then applies a permutation test for estimating p-values. This command is very similar to test_pwm_enrich but predicts motif hits with Perl regular expressions get_regexp_hits.

Example:
Sequence data: Assume table exons.tab contains sequences and group IDs specifying to which group (cntr, up, down) each sequence belongs. Table exons.tab has these columns (and potentially many more):
  • SEQ: containing sequences
  • GROUP_ID: containing group ids (cntr, up, down)

Specification of Perl regular expressions / patterns: Assume you want to do the enrichment analysis with two regular expressions: ACCGT|ACCGA|ACCAA and AC[CGT]..C{2,5}GT. Table regexps.tab contains information about these regular expressions in the following columns
  • NAME: containing short names for the motifs/patterns, e.g, the corresponding binding protein
  • REGEXP: containing the Perl regular expressions
                NAME   REGEXP
                protA  ACCGT|ACCGA|ACCAA
                protB  AC[CGT]..C{2,5}GT
Running
> matt test_regexp_enrich exons.tab SEQ GROUP_ID[up,cntr,down,cntr] single ...
      ...  regexps.tab NAME REGEXP quant 10000 > results.tab
will perform 4 comparisons: (protA and protB) x (up vs. cntr and down vs. cntr), and will output a table with all information on these comparison. This table gets written into file results.tab as we re-direct the output.

Each comparison (say, for protA and up vs. cntr) consists of these steps:
  1. search hits (exact matches) with regular expression protA in sequences of group up
  2. search hits (exact matches) with regular expression protA in sequences of group cntr
  3. apply a permutation test with 10000 iterations (repeats) and determine p-value

Test for positional differences of hits: The above described procedure considers the number of hits across entire sequences. Imagine we predict 14 hits in the sequences of group up close to their left end, and 14 in sequences of group cntr close to their right end. Considering entire sequences, we won't find this positional difference; though this difference might be interesting from a scientific point of view.
For detecting positional differences, Matt searches for hits in both sets of sequences once (as seen above), but then tests for significant enrichment/depletion 4 times considering:
  1. the entire sequences as given
  2. only the first 1/3 part of the sequences
  3. only the internal 1/3 part of the sequences
  4. only the last 1/3 part of the sequences
and reports p-values for all four tests.



Output: a table containing detailed information (number of data points, test statistics, p values). The most important columns might be SIGNIFICANT_RESULT_ALL, SIGNIFICANT_RESULT_FIRST, SIGNIFICANT_RESULT_MIDDEL, SIGNIFICANT_RESULT_END which highlight found significant enrichment differences between groups.

Number and length of sequences: Sequences can be of variable length and the two sequence sets might consist of a different number of sequences. Matt normalizes the number of found hits in each sequence wrt. its length and within each data set wrt. its size (number of sequences).

Avoid false positives: Normalizing the number of hits might result in false positive findings. For example, if hits actually occur only in a specific sub-region of the sequences, e.g, only in the 5'-ends of introns, and if the sequence in one set have a very different length than sequences in the other set.


Testing these sequences which are of very different length but contain the same number of hits, after normalization it will look if there were more hits in group cntr (per sequence position) than in group up. In this or similar cases, it is recommended to cut the sequences to the regions (highlighted) where you actually expect hits.

Change test statistic of the enrichment analysis: number of hits vs. number of sequence with hits
For testing enrichment, one might either consider
  1. the number of hits across sequences per group of sequences
  2. the number of positive sequences containing hits per group
All examples above consider the number of hits. Depending on the research question, it might be more reasonable to consider the fraction of positive sequences in each group of sequences. Positive sequences are those which contain a minimum number of predicted hits. For selecting the applied test statistic, use argument <MODE>.



> matt row_calc
Description: Use this command to apply calculations along rows.

Example: With t1.tab being the table from above
> matt row_calc t1.tab 'LENGTH=END-START+1'
		
will determine the length of the exons and output it as a table with column LENGTH. Use piping to add this new column directly to t1.tab
> matt row_calc t1.tab 'LENGTH=END-START+1' | matt add_cols t1.tab -
Computing the log of the average of START, END, LENGTH (no very meaningful) with
> matt row_calc t.tab 'MEAN=log((START+END+LENGTH)/3)'
which creates a table with column MEAN.

You can use any mathematical expression which is available within Perl. Column names in expressions are used like variables. Mathematical expressions must always be put inside ' '.



> matt col_calc
Description: Use this command to apply some pre-defined calculations, like summation, along columns. If the table contains columns with group labels, this commands allows to apply calculations per group.

Example: With t2.tab being this table
         A    B    V1    V2
         a    1    1     2
         a    2    1     3
         b    2    1     4
         a    1    1     5
		
> matt col_calc t2.tab sum V1,V2
sums over the entire columns V1 and V2 and outputs
         V1_SUM    V2_SUM
         4         14
		
> matt col_calc t2.tab sum V1,V2 A
sums over columns V1 and V2, considering entries in column A being group labels and outputs sums per group.
         A    V1_SUM    V2_SUM
         a    3         10
         b    1         4
		
> matt col_calc t2.tab sum V1,V2 A,B
sums over columns V1 and V2, considering entries in columns A and B being group labels and outputs sums per group.
         A    B     V1_SUM    V2_SUM
         a    1     2         7
         a    2     1         3
         b    2     1         4
		


> matt perm_test
Description: This command provides the user with the possibility to statistically testing for differences of features between groups. Because it is a permutation test and does not make assumption about the feature distributions, it is a flexible statistical test. This test can be applied to these kinds of features:
  1. numerical
  2. categorical
  3. co-occurrences

Example: With exons.tab describing exons (see also get_efeatures) with these (and more) columns
  1. LENGTH: length of exon
  2. GCC: GC content of exon
  3. GENE_TYPE: coding, non-coding
  4. IS_IN_TRS: colon-separated list of transcript ID each exon occurs in
  5. DATASET with group IDs down, ndiff
The command
> matt perm_test exons.tab DATASET[down,ndiff] 100000 ...
           ... -num LENGTH,GCC -card GENE_TYPE -cooc TR_IDS
will apply a permutation test (100000 repeats) and test for significant differences between the two exon groups down vs. ndiff wrt. to the features exon length, exon GC content, and gene type. It also will test whether the exons of the two groups (down vs. ndiff) co-occur differently, e.g., whether they co-occur heavily in one group in a few transcripts, and much less (spread among many transcripts) in the other group. The command will print a text report (NOT a table) with all details and results.

Co-occurrences: each row of the table corresponds to one element (in our case to an exon). Exons might be clustered, e.g., all exons occurring in the same transcript belong to one cluster. Exons which occur in the same cluster co-occur with each other. You might identify the clusters by giving them unique cluster IDs. In the example, cluster IDs are transcript IDs. Then, there need to be a column in the table which contains information about the clusters each element (exon) occurs in, e.g., IS_IN_TRS. This information must be given by a colon-separated list of cluster IDs, e.g., tr1:tr10,tr155 for exon5 encodes that exon5 actually occurs in these three clusters (transcripts). If a exon does not occur in any cluster, this list is empty.



> matt extr_geneids
Description: This is a utility command for automatically extracting gene IDs from gene annotations (GTF files) for genomic events like genes, exons, introns, etc.

Example: With exons.tab being a table describing human exons (columns START, END, SCAFFOLD, STRAN) and Hsa19.gtf being the Ensembl gene annotation for the human genome (assembly hsa19)
> matt extr_geneids exons.tab START END SCAFFOLD STRAND Hsa19.gtf -f gene_id
extracts for all exons in exons.tab gene ids from Hsa19.gtf, where extracted gene ids are taken from field gene_id of Hsa19.gtf.



> matt extr_scafids
Description: This is a utility command for extracting/checking scaffold/sequence/chromosome IDs in FASTA, FASTQ, GFF, GTF files.

Example: With Hsa19_gDNA.fasta containing all human chromosomes
> matt extr_scafids Hsa19_gDNA.fasta FASTA
will print all IDs of human chromosomes.



> matt get_kmers_from_pwm
Description: Use this command to determine the most likely kmers under a position weight matrix model (PWM). Either obtain the top N (integer ≥1) kmers or the most likely k-mer which together account for P (0<P<1) percent of the total binding specificity (probability) as encoded by the PWM.

This command might help if you want to search for PWM hits but don't want to specify a threshold on the PWM log-score. In this case, you could determine the most likely kmers from a PWM and use these as a Perl regular expression and search for exact hits.

Example: With mbnl1.tab being the PWM for human MBNL1 from CISBP-RNA (with re-defined alphabet A,C,G,T and minimum probability 0.001; see get_pwm)
> matt get_kmers_from_pwm mbnl1.tab 5 5
prints the 5 most likely kmers under the MBNL1 PWM.
          ID  KMER     PWM_LOGSCORE       CUMULATIVE_PERCENT_BINDING  PWM_NAME
          1   AGCTTGC  -1.98607349885094  0.137233215315936           mbnl1
          2   TGCTTGC  -2.09775518556902  0.259964844297944           mbnl1
          3   GGCTTGC  -2.38376620889964  0.352167512560293           mbnl1
          4   CGCTTGC  -2.49208691693447  0.434904633342322           mbnl1
          5   AGCTTGG  -3.67966981232818  0.460135937867991           mbnl1
		
> matt get_kmers_from_pwm mbnl1.tab 0.5 5
prints the most likely kmers which together account for 50% of all binding specificity.
          ID  KMER     PWM_LOGSCORE       CUMULATIVE_PERCENT_BINDING  PWM_NAME
          1   AGCTTGC  -1.98607349885094  0.137233215315936           mbnl1
          2   TGCTTGC  -2.09775518556902  0.259964844297944           mbnl1
          3   GGCTTGC  -2.38376620889964  0.352167512560293           mbnl1
          4   CGCTTGC  -2.49208691693447  0.434904633342322           mbnl1
          5   AGCTTGG  -3.67966981232818  0.460135937867991           mbnl1
          6   TGCTTGG  -3.79135149904627  0.482701022411463           mbnl1
          7   AGCTTAC  -3.9437269446445   0.502076889638399           mbnl1
 		



> matt mk_uniq
Description: This command allows the user to make-unique a table, i.e., omitting rows with replicated entries in specific columns.

Example: With t2.tab being this table
         A    B    V1    V2
         a    1    1     2
         a    2    1     3
         b    2    1     4
         a    1    1     5
		
> matt mk_uniq t2.tab A B
outputs
         A    B    V1    V2
         a    1    1     2
         a    2    1     3
         b    2    1     4
		



> matt prnt_tab
Description: Use this command to print a tab-separated table nicely aligned to screen. This command is especially useful for getting an overview of the contents of tables which contain long strings with different lengths.

Example: With t1.tab being the table from above
> matt prnt_tab t1.tab -w 5 -s 3
prints t1.tab where all cells get extended or truncated to 5 characters and with 3 blanks between columns. Use piping and less to see large table.
> matt prnt_tab t1.tab -w 5 -s 3 | less -S 



> matt retr_rnaseq
Description: Allows the user to retrieve RNA-seq data from the Gene Expression Omnibus (GEO), a public repository where many researchers store their RNA-seq data related to their publications. Having at hand the GEO accession numbers of RNA-seq data sets, this command downloads the SRA archives, extracts RNA-seq reads as FASTQ or FASTA files, and, if wanted, renames the extracted FASTA/FASTQ files as specified by the user.

Example: Specifying accession numbers and data set names in file accession_numbers.txt:
SRR3532927 nova2kd3
SRR3532926 nova2kd2
SRR3532925 nova2kd1
SRR3532924 nova2wt3
SRR3532923 nova2wt2
SRR3532922 nova2wt1
		
> matt retr_rnaseq accession_numbers.txt -keepsra -fasta -o rnaseq_data -p 6
downloads these six data archives, extracts FASTA files, splits them as RNA-seq data are paired-end reads, and renames final files. This is done in parallel with six processes. The output folder rnaseq_data looks then like:
dataset_info.tab
nova2kd1_1.fasta.gz
nova2kd1_2.fasta.gz
nova2kd2_1.fasta.gz
nova2kd2_2.fasta.gz
nova2kd3_1.fasta.gz
nova2kd3_2.fasta.gz
nova2wt1_1.fasta.gz
nova2wt1_2.fasta.gz
nova2wt2_1.fasta.gz
nova2wt2_2.fasta.gz
nova2wt3_1.fasta.gz
nova2wt3_2.fasta.gz
SRR3532922.sra
SRR3532923.sra
SRR3532924.sra
SRR3532925.sra
SRR3532926.sra
SRR3532927.sra



> matt sort_tab
Description: Use this command to sort rows of a table according to values (numeric or alpha-numeric) in a distinct column.

Example: With t1.tab being the table from above
> matt sort_tab t1.tab START
will print t1.tab to screen where lines are sorted wrt. START.
> matt sort_tab t1.tab START -d
will do the same but sort decreasingly.



> matt stratify
Description: Use this command to stratify data sets, e.g., if you want to compare two sets of exons and want to make sure their length distributions are similar, you might first stratify these two sets of exons and then apply perm_test. Stratification can be applied to any numerical feature.

Example: With exons.tab being a table describing exons with these columns (and maybe more)
  1. DATASET: containing ndiff and down
  2. GCC: containing the GC content of the exons
the command
> matt stratify exons.tab GCC DATASET ndiff 0.05
will stratify both groups such that their GC content distributions are similar. The reference group is ndiff. The algorithm searches for each ndiff exon one exon from the group down which a similar value in column GCC (± 0.05). Each pair of matched exons will get into the final result table. All exons from both groups without matching pair will be neglected. The command outputs a table like exons.tab reduced to the stratified exons.



> matt cmpr_exons
Description: Use this command if you want to compare groups of exons. This command goes through these steps:
  1. run get_efeatures on the input table to determine exon-related features
  2. apply Mann-Whitney U test to the comparison of feature distributions across groups
  3. output:
    1. PDF report summarizing results with box plots
    2. table with all events and extracted features
    3. table with details on performed statistical tests including p values
    4. all box plots as PDF graphics
Extracted features include but are not limited to (see get_efeatures for details)
  • GC content
  • length: exon and up-/down-stream introns
  • splice site strength
  • prediction of branch points and their strength
  • strength of potential human SF1 binding sites in the 3'-ends of the up-stream introns
  • many more
The input table describing exons needs to contain these pieces of information about the exons in separate columns
  1. start coordinate
  2. end coordinate
  3. chromosome ID; these chromosome IDs must match with the chromosome IDs in the GTF and FASTA which you use
  4. strand
  5. gene ID of gene where exon occurs in; these gene IDs must match with the gene IDs in the GTF which you use
Recommendation: use the GTF files from Ensembl as they also contain information on the gene types (coding, non-coding, etc.) as these information will get used as well.

Example: With exons.tab being a table describing human exons with these columns (and maybe more)
  1. START
  2. END
  3. SCAFFOLD
  4. STRAND
  5. GENEID_ENSEMBL
  6. DATASET: containing group IDs of exons (down, up, ndiff)
and with Hsa19.gtf being a gene description file from Ensembl and Hsa19.fa the corresponding FASTA file with all genome sequences
> matt cmpr_exons exons.tab START END SCAFFOLD STRAND GENEID_ENSEMBL ...
    ... Hsa19.gtf Hsa19.fa Hsap 150 DATASET[down,ndiff] down_vs_ndiff
will compare exons down vs. ndiff and leave untouched exons from group up. The argument 150 specifies the length of the up-stream and down-stream intronic regions which should be searched for SF1 hits. It will generate the output folder down_vs_ndiff and place therein a table with all exons (down and ndiff) and the extracted exon features. It will also place therein a PDF document summary.pdf containing all details of the comparison.

Hint 1: You might first stratify the sets of exons with stratify and then apply cmpr_exons.
Hint 2: You might chose colors for the box plots of the PDF report. See help message of cmpr_exons for getting help on this.
Hint 3: DATASET[down,ndiff]: the last group specified should be the reference group (like non-differentially spliced).
Hint 4: DATASET[down,ndiff] compares set down vs. ndiff, DATASET[down,up,ndiff] compares down vs. up, down vs. ndiff, up vs. ndiff. Be careful not to specify too many groups as this might lead to a large number of comparisons.




> matt cmpr_introns
Description: Use this command if you want to compare groups of introns. This command goes through these steps:
  1. run get_efeatures on the input table to determine exon-related features
  2. apply Mann-Whitney U test to the comparison of feature distributions across groups
  3. output:
    1. PDF report summarizing results with box plots
    2. table with all events and extracted features
    3. table with details on performed statistical tests including p values
    4. all box plots as PDF graphics
Extracted features include but are not limited to (see get_ifeatures for details)
  • GC content
  • length: intron and up-/down-stream exons
  • splice site strength
  • prediction of branch points and their strength
  • strength of potential human SF1 binding sites in the 3'-ends of introns
  • many more
The input table describing introns needs to contain these pieces of information about the introns in separate columns
  1. start coordinate
  2. end coordinate
  3. chromosome ID; these chromosome IDs must match with the chromosome IDs in the GTF and FASTA which you use
  4. strand
  5. gene ID of gene where intron occurs in; these gene IDs must match with the gene IDs in the GTF which you use
Recommendation: use the GTF files from Ensembl as they also contain information on the gene types (coding, non-coding, etc.) as these information will get used as well.

Example: With introns.tab being a table describing human introns with these columns (and maybe more)
  1. START
  2. END
  3. SCAFFOLD
  4. STRAND
  5. GENEID_ENSEMBL
  6. DATASET: containing group IDs of introns (down, up, ndiff)
and with Hsa19.gtf being a gene description file from Ensembl and Hsa19.fa the corresponding FASTA file with all genome sequences
> matt cmpr_introns introns.tab START END SCAFFOLD STRAND GENEID_ENSEMBL ...
    ... Hsa19.gtf Hsa19.fa Hsap 150 DATASET[down,ndiff] down_vs_ndiff
will compare introns down vs. ndiff and leave untouched introns from group up. The argument 150 specifies the length of the 3'-end of the introns which should be searched for SF1 hits. It will generate the output folder down_vs_ndiff and place therein a table with all introns (down and ndiff) and the extracted intron features. It will also place therein a PDF document summary.pdf containing all details of the comparison.

Hint 1: You might first stratify the sets of introns with stratify and then apply cmpr_introns.
Hint 2: You might chose colors for the box plots of the PDF report. See help message of cmpr_introns for getting help on this.
Hint 3: DATASET[down,ndiff]: the last group specified should be the reference group (like non-differentially spliced).
Hint 4: DATASET[down,ndiff] compares group down vs. ndiff, DATASET[down,up,ndiff] compares down vs. up, down vs. ndiff, up vs. ndiff. Be careful not to specify too many groups as this might lead to a large number of comparisons.




> matt rna_maps
Description: Allows the user to produce motif RNA-maps for comparing enrichment of binding motifs of RNA/DNA binding proteins for different groups of exons or introns. Motifs can be defined as position weight matrix models (PWMs) or Perl regular expressions. This command will extract all necessary sequences from the genomic proximity of given exons/introns, predict hits of the specified motifs therein, and compute motif enrichment scores. If chosen by the user, a permutation test is applied for estimating p values and regions with significant differences in motif enrichment will be highlighted in the motif RNA-map.
Output:
  1. PDF report containing all motif RNA-maps
  2. For each map, there are two versions, one with a data coverage section, and one without. The data coverage section shows the data overage along the map, which might be less then 100% if the data set contains exons/introns shorter than the specified exon/intron sections to be visualized.
  3. The motif RNA-maps are ordered according to largest differences of motif enrichment scores; Maps with largest positive differences come first, maps with largest negative differences last. The reference group for computing differences is always the last group specified in the program call.
  4. all motif RNA-maps as PDF graphics

Example: assume file nova2_regexpr.tab describes a Perl regular expression of the NOVA2 binding motif in a table with columns
  1. TYPE: the type of motif model, either REGEXP or PWM
  2. NAME: the name of the motif
  3. REGEXP: the Perl regular expression of a link to a file (table) containing the PWM
  4. THRESH: only for PWM (set to NA for REGEXPs): threshold for hit prediction (see get_pwm-hits)
  5. BGMODEL: only for PWM (set to NA for REGEXPs): if and which background model should be used
Further, 3 groups of human exons are described in table exons.tab with columns
  1. START: start of exon
  2. END: end of exon
  3. SCAFFOLD: chromosome
  4. STRAND
  5. UPSTRM_EX_BORDER: border of next up-stream exon
  6. DOSTRM_EX_BORDER: border of next down-stream exon
  7. GROUP: defining exon groups (down, up, ndiff); last group is reference group
The following call produces a NOVA2 motif RNA map with a sliding window of length 31 which slides up to position 35 into exons and up to position 135 into introns. Regions with significant enrichment as compared to the reference group ndiff are highlighted.
matt rna_maps exons.tab UPSTRM_EX_BORDER START END DOSTRM_EX_BORDER SCAFFOLD
    ... STRAND GROUP[up,down,ndiff] 31 35 135 Mmu09.fasta nova2_regexp.tab TYPE
    ... NAME REGEXP THRESH BGMODEL -d nova2_rnamap -p 0.0001 10000
The following graphic as PDF will be put into output folder nova2_rnamap





> matt rna_maps_cisbp
Description: Allows users to produce motif RNA maps for
  1. all ACGT k-mers (length 2-5)
  2. all approx. 340 CISBP-RNA BPS with IUPAC binding motif
  3. motif RNA-maps can be generated for exons or introns
at once without needing to specify these k-mer motifs or download the CISBP-RNA binding motifs.

Output:
  1. PDF report containing all motif RNA-maps
  2. For each map, there are two versions, one with a data coverage section, and one without. The data coverage section shows the data overage along the map, which might be less then 100% if the data set contains exons/introns shorter than the specified exon/intron sections to be visualized.
  3. The motif RNA-maps are ordered according to largest differences of motif enrichment scores; Maps with largest positive differences come first, maps with largest negative differences last. The reference group for computing differences is always the last group specified in the program call.
  4. all motif RNA-maps as PDF graphics

Assume 3 groups of human exons are described in table exons.tab with columns
  1. START: start of exon
  2. END: end of exon
  3. SCAFFOLD: chromosome
  4. STRAND
  5. UPSTRM_EX_BORDER: border of next up-stream exon
  6. DOSTRM_EX_BORDER: border of next down-stream exon
  7. GROUP: defining exon groups (down, up, ndiff); last is reference group
The following call produces motif RNA maps for all CISBP-RNA IUPAC binding motifs with a sliding window of length 31 which slides up to position 35 into exons and up to position 135 into introns.
matt rna_maps_cisbp exons.tab UPSTRM_EX_BORDER START END DOSTRM_EX_BORDER SCAFFOLD
    ...  STRAND GROUP[up,down,ndiff]  31 35 135 Hsa19.fasta cisbprna_regexps
    ...  -d cisbp_rnamaps
See command rna_maps for an example motif RNA map.



> matt test_cisbp_enrich
Description: Use this command to test two groups of sequences for statistical differences wrt. to the number of predicted binding sites of RNA-binding proteins from the database CISBP-RNA. The CISBP-RNA database contains approx. 300 position weight matrices (PWM) of binding motifs of RNA-binding proteins. All these are available through Matt and will be used for testing.

Example: With table t.tab having these columns (and maybe many more)
  1. SEQ_3P: containing 3'-ends (150 nt) of introns
  2. DATASET: containing group IDs of introns (e.g., down, up, ndiff)
the command
> matt test_cisbp_enrich t.tab SEQ_3P DATASET[down,ndiff,up,ndiff]
will compare down vs. ndiff and up vs. ndiff. During each comparison and for each CISBP-RNA binding PWM, it will predict binding sites in all sequences and apply a permutation test to determine p-values for the differences wrt. the number of predicted hits in both data sets. It will generate a table containing all results (including number of predicted hits per group, test statistics, p-values, information about the binding motifs (PWMs) like associated RNA-binding protein). For PWMs for which significant enrichment/depletion of hits was found, test_cisbp_enrich also generates PWM profiles invoking get_pwm_prof.

Internally, test_cisbp_enrich applies the command test_pwm_enrich. Hence per binding motif (PWM), test_cisbp_enrich tests for differences in the number of predicted hits between two groups of sequences as well as for positional differences of the predicted hist. In detail, test_cisbp_enrich searches for hits in both sets of sequences once, but then tests for significant enrichment/depletion 4 times considering:
  1. the entire sequences as given
  2. only the first 1/3 part of the sequences
  3. only the internal 1/3 part of the sequences
  4. only the last 1/3 part of the sequences
and reports p values for all four tests.




Output: a table containing detailed information (number of data points, test statistics, p values). The most important columns might be SIGNIFICANT_RESULT_ALL, SIGNIFICANT_RESULT_FIRST, SIGNIFICANT_RESULT_MIDDEL, SIGNIFICANT_RESULT_END which highlight found significant enrichment differences between groups.

Number and length of sequences: Sequences can be of variable length and the two sequence sets might consist of a different number of sequences. Matt normalizes the number of found hits in each sequence wrt. its length and within each data set wrt. its size (number of sequences).

Avoid false positives: Normalizing the number of hits might result in false positive findings. For example, if hits actually occur only in a specific sub-region of the sequences, e.g, only in the 5'-ends of introns, and if the sequence in one set have a very different length than sequences in the other set.


Testing these sequences which are of very different length but contain the same number of hits, after normalization it will look if there were more hits in group cntr (per sequence position) than in group up. In this or similar cases, it is recommended to cut the sequences to the regions (highlighted) where you actually expect hits.

Hint 1: You might run test_cisbp_enrich with a sub-set of CISBP-RNA PWMs. See help message for details on this.
Hint 2: Change test statistic of the enrichment analysis: number of hits vs. number of sequence with hits
For testing enrichment, one might either consider
  1. the number of hits across sequences per group of sequences
  2. the number of positive sequences containing hits per group
All examples above consider the number of hits. Depending on the research question, it might be more reasonable to consider the fraction of positive sequences in each group of sequences. Positive sequences are those which contain a minimum number of predicted hits. For selecting the applied test statistic, use argument -m.






Examples

This section shows applications of Matt to these tasks:
  1. Comparing sets of exons wrt. their exon features
  2. Testing enrichment of TGC in up-stream regions of neural micro exons
  3. Analyzing exons regulated by splicing factor Nova2

The examples contain many explanations and therefore appear lengthy. The more familiar the user is with Matt, the easier and faster the work experience will become.



Comparing sets of exons wrt. their exon features

The data table data.tab contains alternative splicing events published in Irimia et. al, Cell, 2014. It was assembled from output of different VAST-TOOLS runs. Each event belongs to one group of
  1. AS_noNeural
  2. NEURAL-DOWN
  3. NEURAL-UP
and we want to extract exon features for all exons of this data set and compare the three sets of exons wrt. these features.

1. Making yourself familiar with data

by checking, e.g., column names of the given table.
> matt get_colnms data.tab

   ID    COL_NAMES
   1     GENE
   2     EVENT
   3     COORD
   4     LENGTH
   5     FullCO
   6     TYPE
   7     GROUP
The table data.tab has 7 columns containing information on alternative splicing events as output from Vast-Tools.
Print the first rows nicely aligned:
> matt prnt_tab data.tab -W 30 | less -S -

   GENE EVENT         COORD                     LENGTH FullCO TYPE GROUP      
        HsaEX0000026  chr12:120648733-120648909 177    ...    S    AS_noNeural
        HsaEX0000032  chr12:10749491-10749580   90     ...    S    AS_noNeural
        HsaEX0000033  chr12:10748290-10748526   237    ...    C2   AS_noNeural
        HsaEX0000044  chr14:58760484-58760644   161    ...    C1   AS_noNeural
        HsaEX0000048  chr14:58752309-58752474   166    ...    C2   AS_noNeural
        HsaEX0000053  chr12:70636846-70636984   139    ...    S    AS_noNeural
        HsaEX0000062  chr14:101376570-101376641 72     ...    S    AS_noNeural
        HsaEX0000087  chr15:84850734-84850835   102    ...    S    AS_noNeural
        HsaEX0000091  chr15:57209263-57209410   148    ...    S    AS_noNeural
   A1BG HsaINT0000005 chr19:58858396-58858718   323    ...    IR-S AS_noNeural
   A2M  HsaINT0000025 chr12:9256997-9258831     1835   ...    IR-S AS_noNeural
   A2M  HsaINT0000026 cr12:9254271-9256834      2564   ...    IR-S AS_noNeural
   A2M  HsaINT0000027 chr12:9253804-9254042     239    ...    IR-S AS_noNeural
Column FullCO was not put on this website to save space. Column TYPE encodes the type of alternative splicing events.
Check all different types of splicing events:
> matt col_uniq data.tab TYPE | matt prnt_tab - -w 9 | less -S -

   TYPE_UNIQ   FREQ     
   Alt3        3444     
   Alt5        2449     
   C1          1604     
   C1*         312      
   C2          1287     
   C2*         259      
   C3          1134     
   C3*         237      
   IR-C        3483     
   IR-S        15494
   ...
2. Adapting labels in column TYPE

We need to use command get_vast later which works with Vast-Tools standard types of alternative splicing events which are: S, C1, C2, C3, MIC, IR-S, IR-C, Alt3, Alt5. Table data.tab contains these standard types but also other types like S*, C1*. Hence, we adapt all labels to the standard types with command chg_labs.
> matt chg_labs data.tab TYPE S*:S C1*:C1 C2*:C2 C3*:C3 MIC_M:MIC MIC_S:MIC > tmp.tab
> mv tmp.tab data.tab
3. Extracting exons and their start/end-coordinate, chromosome, strand

The commands get_efeatures and cmpr_exons expect information on exons like their start and end coordinate but table data.tab does not contain this information, yet. The command get_vast is specifically designed for extracting this information from Vast-Tools output tables. To select from the table only exon events, we use command get_rows to select rows with have in column TYPE S, C1, C2, C3, or MIC. The output table gets re-directed into exons.tab.
> matt get_vast data.tab COORD FullCO TYPE LENGTH | ...
   ... matt get_rows - TYPE]S,C1,C2,C3,MIC[ > exons.tab
Checking how table exons.tab looks like by listing its column names.
> matt get_colnms exons.tab

   ID      COL_NAMES
   1       GENE
   2       EVENT
   3       COORD
   4       LENGTH
   5       FullCO
   6       TYPE
   7       GROUP
   8       SCAFFOLD
   9       START
   10      END
   11      STRAND
   12      UPSTRM_EX_BORDER
   13      DOSTRM_EX_BORDER
To verify that exons.tab contains only exons, you could also apply col_uniq to column TYPE again.

4.1 Adding Ensembl gene IDs

which are necessary for commands get_efeatures and cmpr_exons. Here, we use command get_match which takes as input a second table containing a mapping EVENT_ID to ENSEMBL_GENEID (mapping_EVENT2GENEID.tab) and outputs for each EVENT_ID in column EVENT of table exons.tab the corresponding GENEID. The resulting table (one column) gets added immediately to table exons.tab exploiting piping.
> matt get_match exons.tab EVENT mapping_EVENT2GENEID.tab EVENT ENSEMBL_GENEID ...
       ... | matt add_cols exons.tab -

4.2 Adding Ensembl gene IDs

Alternatively, we could use the command extr_geneids to extract gene IDs from a given GTF file for splicing events. The following commands will append to the table exons.tab a column ENSEMBL_GENEID.
> matt extr_geneids exons.tab START END SCAFFOLD STRAND Hsa19_ensembl.gtf -f gene_id -n ENSEMBL_GENEID ...
       ... | matt add_cols exons.tab -

4.3 Adding Ensembl gene IDs

Yet another option is to use the command get_vast with arguments -gtf and -f. In this case, it would automatically extract gene IDs from the given GTF and append a column GENEID.
> matt get_vast data.tab COORD FullCO TYPE LENGTH -gtf Hsa19_ensembl.gtf -f gene_id | ...
   ... matt get_rows - TYPE]S,C1,C2,C3,MIC[ > exons.tab

5. Checking number of exons in final table

to get an idea about the size of the exon sets.
> matt col_uniq exons.tab GROUP

   GROUP_UNIQ     FREQ
   AS_noNeural    10465
   NEURAL-DOWN    631
   NEURAL-UP      1046
6. Extracting exon features for all exons

You might want to download the USCS HG19 assembly sequences and the Ensembl gene annotation for HG19. For decompressing the file hg19.2bit you need to download as well twoBitToFa and run
> ./twoBitToFa -noMask hg19.2bit Hsa19_gDNA.fasta
Then extracting exon features and adding these immediately to the table exons.tab
> fasta=/..path..to../Hsa19_gDNA.fasta
> gtf=/..path..to../ensembl/Hsa19.gtf
> matt get_efeatures exons.tab START END SCAFFOLD STRAND ENSEMBL_GENEID ...
    ... $gtf $fasta Hsap 150 | matt add_cols exons.tab -
Table exons.tab now contains for all exons information about their genomic coordinate, gene, and their exon features including sequences like exon sequence, splice site sequences, sequences of up/down-stream introns. All these information can be used as input for further analyses, e.g., with R. The extracted sequences might be of interest for further analyses like de-novo motif search etc, which often expect their input as FASTA files. Using the command extr_seqs, you might extract sequences from specific columns of exons.tab in FASTA format.

7. Down-sample exon set AS_noNeural

The sizes of the exon sets are: AS_noNeural=10465, NEURAL-DOWN=631, NEURAL-UP=1046. Hugh groups like AS_noNeural are unnecessary for comparing exon sets and would slow down analysis. Hence, we sub-sample 2000 AS_noNeural exons with rand_rows and create a new table exons_testsets.tab.
> matt get_rows exons.tab GROUP]AS_noNeural[  ... 
       ... | matt rand_rows - 2000 132927352 > exons_testsets.tab
> matt get_rows exons.tab GROUP]NEURAL-DOWN,NEURAL-UP[ ...
  ... | matt add_rows exons_testsets.tab -
The table exons_testsets.tab should be used as input for command cmpr_exons. It is important to know that cmpr_exons will extract features for all exons of its input table and append them to the input table. As a consequence, the input table must not contain already columns with identical column names because column names in a table must be unique. Hence, from exons_testsets.tab we select only the important columns and neglect the already added columns with exon features.
> matt get_cols exons_testsets.tab START END SCAFFOLD STRAND ...
   ... ENSEMBL_GENEID GROUP > tmp.tab
> mv tmp.tab exons_testsets.tab
Checking the number of exons in the final table exons_testsets.tab confirms that the sampling worked as expected.
> matt col_uniq exons_testsets.tab GROUP

   GROUP_UNIQ     FREQ
   AS_noNeural    2000
   NEURAL-DOWN    631
   NEURAL-UP      1046
8. Statistically comparing exon sets and generating PDF report
> fasta=/..path..to../Hsa19_gDNA.fasta
> gtf=/..path..to../ensembl/Hsa19.gtf
> matt cmpr_exons exons_testsets.tab START END SCAFFOLD STRAND ENSEMBL_GENEID ...
  ... $gtf $fasta Hsap 150 GROUP[NEURAL-UP,NEURAL-DOWN,AS_noNeural] ...
  ... 100 -colors[brown2,darkgoldenrod2,azure4] -addseqplots cmpr_1
With GROUP[NEURAL-UP,NEURAL-DOWN,AS_noNeural] we select the groups which should be compared, i.e., all pair-wise comparisons are done: NEURAL-UP vs. NEURAL-DOWN, NEURAL-UP vs. AS_noNeural, and NEURAL-DOWN vs. AS_noNeural. Putting AS_noNeural in the last place will give us box plots which contain the group AS_noNeural as reference group. The permutation test for detecting features statistically different in each comparison will perform only 100 iterations for speed. To get good estimates of p-values it is recommended to choose 50000 or 100000. All output gets written into folder cmpr_1, where you find a summary in form of a PDF document with all details on the comparisons and all graphics in sub-folder summary_graphics for later use.

The PDF report contains all details and plots and is organized in four main parts:
  1. Overview of results of permutation tests across all comparisons and features
  2. Box-plots for all numerical features with all exon groups tested
  3. Comparative sequence plots for all groups of splice sites, branch points for all comparisons
  4. Details of results of permutation tests




Testing enrichment of TGC in up-stream regions of neural micro exons

We want to statistically test if the 3-mer TGC occurs more frequent in the 200 nt up-stream regions of neural micro exons. We start again with the table data.tab, which we already have used for comparing different sets of neural and non-neural AS events. Events are grouped according to their origin
  1. AS_noNeural
  2. NEURAL-DOWN
  3. NEURAL-UP
For doing the enrichment analysis, we actually need the up-stream sequence of the exons. For extracting these sequences, we need the start/end coordinate, chromosome, and strand information which are not yet in table data.tab. Hence, we go through the same first three steps as above:
  1. Making yourself familiar with data
  2. Adapting labels in column TYPE
  3. Extracting exons and their start/end-coordinate, chromosome, strand

4. Down-sample exon set AS_noNeural

The set of AS_noNeural is much larger than the other sets.
> matt col_uniq exons.tab GROUP

   GROUP_UNIQ   FREQ
   AS_noNeural  10465
   NEURAL-DOWN  631
   NEURAL-UP    1046
Hence, we randomly sample 2000 AS_noNeural exons and construct a new table with them and all NEURAL-DOWN and NEURAL-UP exons.
> matt get_rows exons.tab GROUP]AS_noNeural[ | ...
   ... matt rand_rows - 2000 923681819 > exons_testsets.tab
> matt get_rows exons.tab GROUP]NEURAL-DOWN,NEURAL-UP[ | ...
   ... matt add_rows exons_testsets.tab -
Verifying the sizes of the exon sets in exons_testsets.tab:
> matt col_uniq exons_testsets.tab GROUP

   GROUP_UNIQ   FREQ
   AS_noNeural  2000
   NEURAL-DOWN  631
   NEURAL-UP    1046

5. Retrieving up-stream sequences of exons

The command get_seqs is the utility command for retrieving different types of sequences. For a given genomic region, e.g. an exon in this case, it allows to retrieve the up-stream and down-stream sequences (*Mode 2 in its help message) and will output these as a table with two columns SEQ_UP and SEQ_DOWN. Here we define as length for the up- and down-stream region 200 nt and add the columns with sequences directly to table exons_testsets.tab.
> fasta=/..path..to../Hsa19_gDNA.fasta
> matt get_seqs exons_testsets.tab START END 200 200 SCAFFOLD STRAND $fasta | ...
   ... matt add_cols exons_testsets.tab -
Verifying the result:
> matt get_colnms exons_testsets.tab

   ID  COL_NAMES
   1   START
   2   END
   3   SCAFFOLD
   4   STRAND
   5   TYPE
   6   LENGTH
   7   GROUP
   8   SEQ_UP
   9   SEQ_DOWN
Looking at the retrieved sequences nicely shows/confirms that (almost) all up-stream sequences end in AG, as expected,
> matt get_cols exons_testsets.tab SEQ_UP | less -S -

   SEQ_UP
   ...CAAGGCTGTCTTTTCTCCATAG
   ...TTAAATTGTCTTCTCTCCACAG
   ...ACTTGAATTTCTACTCCCACAG
   ...CAACTTTCTAATCTATCCCTAG
   ...CCCCCTTGTGTTCCTCCTGCAG
   ...ATTTTGTTATTTTTTTTTTTAG
   ...GCTGTTCGTTTTTCTTTCTTAG
   ...CTTCGCCTCCTTGCTGTTTCAG
   ...
and (almost) all down-stream sequences start with GT:
> matt get_cols exons_testsets.tab SEQ_DOWN | less -S -

   SEQ_DOWN
   GTGAGCTGGCAGAAATGAAGGGACT
   GTAAGACCAGTTTACCTTTTGAAGA
   GTAATACTTTGGGGAAGGAAGTTTT
   GTAAGAGTATTTTCCTTAATCAGTT
   GTATGTGGCTGTGGTGAACGGGCAC
   GCATGTAGCACCACACCTAACTGGA
   GTAAGCCTGCCAGGAGCCACTCAGC
   GTGAGCTCTGAGCCTCTGGCATTTC
   ...

7. Defining sets of micro exons

Lastly, before we apply the enrichment analysis, we need to define the groups of exons we want to compare:
  1. nmic_s: neural, short micro exons with length 3-15
  2. nmic: neural, micro exons with length 16-27
  3. nex: neural exons with length > 27
  4. ex: non-neural exons
In order to define these groups, we add a new column GROUP2 to the data. We add this column with command add_val which we apply to sub-selected rows from exons_testsets.tab fulfilling the constraints shown. By this, we construct step-by-step a new table tmp.tab which we finally rename to exons_testsets.tab.
> matt get_rows exons_testsets.tab LENGTH[3,15] GROUP]NEURAL-UP,NEURAL-DOWN[ |
   ... matt add_val - GROUP2 nmic_s > tmp.tab

> matt get_rows exons_testsets.tab LENGTH[16,27] GROUP]NEURAL-UP,NEURAL-DOWN[ |
   ... matt add_val - GROUP2 nmic | matt add_rows tmp.tab -

> matt get_rows exons_testsets.tab LENGTH[28,100000000] GROUP]NEURAL-UP,NEURAL-DOWN[
   ... | matt add_val - GROUP2 nex | matt add_rows tmp.tab -

> matt get_rows exons_testsets.tab GROUP]AS_noNeural[ | matt add_val - GROUP2 ex |
   ... matt add_rows tmp.tab -

> mv tmp.tab exons_testsets.tab
Verifying results
> matt col_uniq exons_testsets.tab GROUP2

   GROUP2_UNIQ	FREQ
   ex           2000
   nex          1371
   nmic         155
   nmic_s       151

7. Statistically testing enrichment of TGC in up-stream sequences

The data are ready prepared for applying the command test_regexp_enrich. We use this command because the 3-mer TGC can be described by a simple regular expression TGC. We create a simple tab-separated table describing this regular expression and save it in tgc.tab which looks like
REGEXP   NAME
TGC      TGC
Then we run the enrichment analysis comparing nmic_s vs. ex, nmic vs. ex, and nex vs. ex using the sequences as given (single) and for computing p-values running a permutation test 100000 times:
> matt test_regexp_enrich exons_testsets.tab SEQ_UP ...
   ... GROUP2[nmic_s,ex,nmic,ex,nex,ex] single tgc.tab NAME ...
   ... REGEXP quant 1000 > res.tab
The permutation test performs 1000 iterations for speed. To obtain reliable p-values, 50000 or 100000 iterations are recommended. A comprehensive summary of the enrichment analyses done is written into res.tab (via re-direction of the output). This file is best studied as a spread sheet (using, e.g., Libre Office Calc). The most interesting columns are (please see explanation of test_regexp_enrich for more details):
  1. SIGNIFICANT_RESULT_ALL: enrichment considering entire sequences with results: nmic_s>ex ***, nmic>ex ***, nex>ex ***
  2. SIGNIFICANT_RESULT_FIRST: enrichment considering only the first 1/3 of the sequences with results: none
  3. SIGNIFICANT_RESULT_MIDDEL: enrichment considering only the internal/middle 1/3 of the sequences with results: nmic>ex ***, nex>ex ***
  4. SIGNIFICANT_RESULT_END: enrichment considering only the last/end 1/3 of the sequences with results: nmic_s>ex ***, nmic>ex ***, nex>ex ***
The stars encode the level of significance, i.e., * p-value ≤0.05, ** p-value ≤ 0.01; *** p-value ≤ 0.001. From the results we find that neural exons (regardless of their size are highly enriched in TGC.

8. Visualizing positional distributions of TGC in up-stream regions of exons

The command get_regexp_prof allows to generate positional distribution of hits of regular expressions across sequences. As we have the 200 nt up-stream but also the 200 nt down-stream regions, we visualize both together running:
> matt get_regexp_prof exons_testsets.tab SEQ_UP RL SEQ_DOWN LR tgc.tab ...
   ... GROUP2[nmic_s,nmic,nex,ex]
This will create a web page in folder regexp_profiles which you can study using any web browser. This web page looks as follows.

Matt: Positional distributions of hits of REGEXPs

Data file: exons_testsets.tab
Categories (number of sequences): ex ( 2000 ), nex ( 1371 ), nmic ( 155 ), nmic_s ( 151 )
Sequence columns selected (plotting parameter): SEQ_UP ( RL ), SEQ_DOWN ( LR )
K or file with information on REGEXPS: tgc.tab
Date: Mar 09, 2017


Ordered Results

Ordering from top to bottom in each column: corresponds to most negative
to most positive differences of densities / cumulative densities to
densities / cumulative densities of reference group
Reference group: ex
If reference group is none then differences are determined wrt. zero line / diagonal with slope 1.

Densities all positions Cumulative densities all positions Densities first positions Cumulative densities first positions
TGC
TGC
TGC
TGC


Plots

TGC - TGC

Back to Ordered Results



It should be mentioned that these plots show densities and cumulative densities. Hence, they give information about the position of TGC hits across sequences; but they do not give information about the absolute number of hits. The plots show that TGC hits in the up-stream intronic regions of neural micro exons (red and blue lines) are clearly concentrated closer to the 3' splice site as in case of neural exons and non-neural exons.



Downstream analysis of Nova2 regulated exons

A Shell script with all commands and explanations is available. Commands therein can be copy'n'pasted to a Shell where Matt is available.

After having downloaded the RNA-seq data from GEO with help of Matt, and after having estimated inclusion levels of alternative-splicing events genome-wide with help of the program vast-tools, this downstream analysis consists of the following steps

  1. Extraction of skipped exons from vast-tools output table, which contains skipped exons, retained introns, and alternative 3' and 5' splice sites.
  2. Definition of three groups of skipped exons according to their delta inclusion levels (∆PSI) for Nova2 KD vs. WT: silenced (∆PSI≥25), enhanced (∆PSI≤-25), unregulated (-1≤∆PSI≤1).
  3. Extraction of exon-related features, and statistical comparison of feature distributions with the Mann-Whitney U test for the detection of discriminative features.
    The output encompasses:
    1. a summary report (PDF) with box plots and p values
    2. all box plots as PDF graphics
    3. a table with extracted features for the exons
    4. a table with details (test statistics, group sizes, p values) on performed statistical tests
  4. Generation of motif RNA-maps for all approx. 340 RNA binding proteins with IUPAC binding motif from CISBP-RNA.
    The output encompasses:
    1. a summary report (PDF) with all motif RNA-maps. Each map is generated in two version, one containing a data coverage section and one which does not contain such a section.
    2. all motif RNA-maps as PDF graphics
  5. Generation of a specific motif RNA-map with a Nova2 binding motif derived from known Nova2 binding motifs, with highlighted regions of statistically significant (FDR≤0.01) different Nova2-motif enrichment.
    The output encompasses:
    1. a summary report (PDF) with all (in this case only one) motif RNA-maps. Each map is generated in two version, one containing a data coverage section and one which does not contain such a section.
    2. all (in this case only one) motif RNA-maps as PDF graphics




  6. Extraction of 50 nt long sequences upstream of exons
  7. Statistical enrichment analysis in these extracted sequences with all CISBP-RNA RNA binding proteins with PWM binding motifs.
    The output encompasses:
    1. a table with detailed information on performed statistical tests (test statistics, p values, group sizes)
    2. for all motifs their positional distribution along the 50 nt long sequences in all three groups as PDF and PNG graphics






Extending Matt: add new commands/functionality

Matt can be extended by new commands/functions. Extensions can be implemented in any scripting/programming language. Though, the easiest way might be to use Perl as Matt is mainly implemented in Perl and already includes a Perl module with utility functions for working with tables.

A template script new_command.pl showing how Matt commands are usually implemented in Perl is available in sub-directory abilities of the Matt installation directory. This Perl script implements already the usual structure and can be extended quickly to a full implementation of a new Matt command.

This template script new_command.pl contains detailed comments/help explaining how it needs to be extended. It also contains instructions for implementing Matt commands in other scripting/programming languages.





FAQ / Trouble shooting

1. Which letters do genomic sequences contain?
Usually, they contain A, C, G, T. All CISBP-RNA motifs are defined wrt. these letters. Hence, when you run test_cisbp_enrich, your sequences must contain A, C, G, T, and maybe N but all potential binding positions containing at least one N will be neglected.
Whenever you extract sequences with, e.g., get_seqs and your FASTA file contains A, C, G, U, then the extracted sequences will contain these letter as well.
Whenever you use your own position weight matrices (PWMs), these might be defined wrt. A, C, G, U. In this case, when you want to use these PWMs with command test_pwm_enrich then your sequences should be defined wrt. A, C, G, U as well.



2. Which column names do my tables need to have?
You are free to choose any column names you prefer, but they should be alpha-numeric and, if possible, upper-case: A-Z, 0-9, and '_', and they should be meaningful. When applying Matt commands to your tables, often you need to specify the column names of your tables to which you want to apply Matt. E.g., get_seqs expects your table to contain a column with start coordinates and another with end coordinates. You can call these columns START and END, but any other name is possible.



3. I don't get a PDF report from cmpr_exons or cmpr_introns.
These PDF reports are generated with pdflatex using the following packages. Please make sure all of them are available on your system.
  1. KOMA-Script
  2. babel
  3. graphicx
  4. tocbibind




How to cite

Matt builds on the great work of others. Hence, when publishing results which you have obtained with Matt, please follow these rules for citation.

  1. Splice site strengths

    As Matt internally uses Yeo & Burge's maximum-entropy models for computing splice site strengths, please cite

    Yeo et al., Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals, 2003, DOI: 10.1089/1066527041410418

    whenever you publish results related to splice site strength. This includes the following splice site features which you might have obtained with Matt's commands get_efeatures, get_ifeatures, cmpr_exons, cmpr_introns:

    • MAXENTSCORE_BURGEHUMANMODEL_UPSTREAM_5SS : maximum entropy score of 5ss of up-stream exon using a model trained with human splice sites
    • MAXENTSCORE_BURGEHUMANMODEL_3SS : maximum entropy score of 3ss using a model trained with human splice sites
    • MAXENTSCORE_BURGEHUMANMODEL_5SS : maximum entropy score of 5ss using a model trained with human splice sites
    • MAXENTSCORE_BURGEHUMANMODEL_DOWNSTREAM_3SS : maximum entropy score of 3ss of down-stream exon using a model trained with human splice sites


  2. Binding motifs of RNA binding proteins from CISBP-RNA

    As Matt uses binding motifs of RNA binding proteins as provided by the database CISBP-RNA, please cite

    Ray et al., A compendium of RNA-binding motifs for decoding gene regulation, 2013, DOI: 10.1371/journal.pcbi.1001016

    whenever you publish results obtained with Matt's command test_cisbp_enrich or when you have used any of the CISBP-RNA binding motifs in sub-folder data_external/cisbp_rna of Matt installation.


  3. Predicted branch points

    As Matt internally uses SVM_BPfinder for the prediction of branch points, please cite

    Corvelo et al., Genome-Wide Association between Branch Point Properties and Alternative Splicing, 2010, DOI: 10.1038/nature12311

    whenever you publish results related to predicted branch points. This includes the following features which you might have obtained with Matt's commands get_efeatures, get_ifeatures, cmpr_exons, cmpr_introns:

    • DIST_FROM_MAXBP_TO_3SS_UPINTRON : distance to 3ss of best precited BP
    • SEQ_MAXBP_UPINTRON : sequence of best predicted BP
    • SCORE_FOR_MAXBP_SEQ_UPINTRON : BP sequence score of best predicted BP
    • PYRIMIDINECONT_MAXBP_UPINTRON : Pyrimidine content between the BP adenine and the 3 prime splice site for best BP
    • POLYPYRITRAC_OFFSET_MAXBP_UPINTRON : Polypyrimidine track offset relative to the BP adenine for best BP
    • POLYPYRITRAC_LEN_MAXBP_UPINTRON : Polypyrimidine track length for best BP
    • POLYPYRITRAC_SCORE_MAXBP_UPINTRON : Polypyrimidine track score for best BP
    • BPSCORE_MAXBP_UPINTRON : SVM classification score of best BP
    • NUM_PREDICTED_BPS_UPINTRON : number of all predicted BPs which have a positive BP score
    • MEDIAN_DIST_FROM_BP_TO_3SS_UPINTRON : like DIST_FROM_MAXBP_TO_3SS but median over top-3 predicted BPs
    • SEQ_BPS_UPINTRON : comma:separated list of sequences of top-3 predicted BPs
    • MEDIAN_SCORE_FOR_BPSEQ_UPINTRON : like SCORE_FOR_MAXBP_SEQ but median over top-3 predicted BPs
    • MEDIAN_PYRIMIDINECONT_UPINTRON : like PYRIMIDINECONT_MAXBP but median over top-3 predicted BPs
    • MEDIAN_POLYPYRITRAC_OFFSET_UPINTRON : like POLYPYRITRAC_OFFSET_MAXBP but median over top-3 predicted BPs
    • MEDIAN_POLYPYRITRAC_LEN_UPINTRON : like POLYPYRITRAC_LEN_MAXBP but median over top-3 predicted BPs
    • MEDIAN_POLYPYRITRAC_SCORE_UPINTRON : like POLYPYRITRAC_SCORE_MAXBP but median over top-3 predicted BPs
    • MEDIAN_BPSCORE_UPINTRON : like BPSCORE_MAXBP but median over top-3 predicted BPs
    • DIST_FROM_MAXBP_TO_3SS_DOINTRON : distance to 3ss of best precited BP
    • SEQ_MAXBP_DOINTRON : sequence of best predicted BP
    • SCORE_FOR_MAXBP_SEQ_DOINTRON : BP sequence score of best predicted BP
    • PYRIMIDINECONT_MAXBP_DOINTRON : Pyrimidine content between the BP adenine and the 3 prime splice site for best BP
    • POLYPYRITRAC_OFFSET_MAXBP_DOINTRON : Polypyrimidine track offset relative to the BP adenine for best BP
    • POLYPYRITRAC_LEN_MAXBP_DOINTRON : Polypyrimidine track length for best BP
    • POLYPYRITRAC_SCORE_MAXBP_DOINTRON : Polypyrimidine track score for best BP
    • BPSCORE_MAXBP_DOINTRON : SVM classification score of best BP
    • NUM_PREDICTED_BPS_DOINTRON : number of all predicted BPs which have a positive BP score
    • MEDIAN_DIST_FROM_BP_TO_3SS_DOINTRON : like DIST_FROM_MAXBP_TO_3SS but median over top-3 predicted BPs
    • SEQ_BPS_DOINTRON : comma:separated list of sequences of top-3 predicted BPs
    • MEDIAN_SCORE_FOR_BPSEQ_DOINTRON : like SCORE_FOR_MAXBP_SEQ but median over top-3 predicted BPs
    • MEDIAN_PYRIMIDINECONT_DOINTRON : like PYRIMIDINECONT_MAXBP but median over top-3 predicted BPs
    • MEDIAN_POLYPYRITRAC_OFFSET_DOINTRON : like POLYPYRITRAC_OFFSET_MAXBP but median over top-3 predicted BPs
    • MEDIAN_POLYPYRITRAC_LEN_DOINTRON : like POLYPYRITRAC_LEN_MAXBP but median over top-3 predicted BPs
    • MEDIAN_POLYPYRITRAC_SCORE_DOINTRON : like POLYPYRITRAC_SCORE_MAXBP but median over top-3 predicted BPs
    • MEDIAN_BPSCORE_DOINTRON : like BPSCORE_MAXBP but median over top-3 predicted BPs


  4. Predicted branch points

    As SVM_BPfinder internally uses SVMlight, please cite

    Joachims, Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf and C. Burges and A. Smola (ed.), MIT-Press, 1999. (link)

    whenever you need to cite SVM_BPfinder (point 3)


  5. Binding score of human SF1

    As Matt uses the human SF1 binding motif (position weight matrix) published by Corioni et al., please cite

    Corioni et al., Analysis of in situ pre-mRNA targets of human splicing factor SF1 reveals a function in alternative splicing, 2011, DOI: 10.1093/nar/gkq1042

    whenever you publish results related to this human SF1 binding motif. This includes the following features which you might have obtained with Matt's commands get_efeatures, get_ifeatures, cmpr_exons, cmpr_introns:

    • HUMANSF1_HIGHESTSCORE_3SS_UPINTRON : highest score of a SF1 position weight matrix trained with human data in the last 150 nt 3 prime intron positions of up-stream intron
    • HUMANSF1_HIGHESTSCORE_3SS_DOINTRON : highest score of a SF1 position weight matrix trained with human data in the last 150 nt 3 prime intron positions of down-stream intron
System requirements
   Linux/Mac OS
   Perl
   R, Rscript
   pdflatex
   any PDF viewer
   any Web browser


License
 Matt is open-source and available
 under GNU LGPLv3

Additional material
   Intro to rna_maps



Publications / posters


Please cite
 Matt builds on the great work
 of others.