> Matt v. 1.2.1
=====

A Unix toolkit for analyzing genomic sequences
with focus on down-stream analysis of alternative splicing events
  Versions
       1.2.1
       1.2.0
       1.1.0
       1.0.0

       All & change log
Commands for:

Data analysis
get_efeatures
get_ifeatures
cmpr_exons
cmpr_introns
cmpr_motif
get_kmers_from_pwm
get_pwm_hits
get_pwm_prof
get_regexp_hits
get_regexp_prof
perm_test
plot_motif
rna_maps
rna_maps_cisbp
test_cisbp_enrich
test_pwm_enrich
test_regexp_enrich


Data preparation
add_cols
add_rows
add_val
add_range
chg_labs
chk_tab
col_calc
col_uniq
def_cats
retr_geneids
extr_scafids
extr_seqs
get_colnms
get_cols
get_match
get_gcc
get_ovl
get_pwm
get_rows
get_sj
conc
get_seql
get_seqs
get_sss
get_vast
grp_calc
mk_uniq
perm_seqs
prnt_tab
rand_rows
retr_rnaseq
rm_cols
rn_cols
row_calc
sort_tab
stratify

Contents

  1. Introduction
  2. Download and installation
  3. Tables: the central data structure
  4. Documentation of all commands
  5. Examples
  6. Extending Matt: adding new commands
  7. FAQ / Trouble shooting
  8. How to cite


Introduction

Matt is a Unix command-line toolkit for analyzing genomic sequences with focus on the down-stream analysis of alternative splicing events. Being a POSIX-style command-line toolkit, Matt runs on the terminal of any Unix-like system.

Matt is modular: It comprises many commands, each of which tailored to one specific task. Combining commands allows the user to solve complex tasks: The whole is more than the sum of its parts.

The central data structure Matt works on is tables, where rows correspond to objects, e.g., exons, genes, etc., and columns correspond to features of these objects.

For performing a high-level analysis, the user of Matt usually goes through two phases.

  1. Data preparation: Fit the output from other programs, e.g., for the estimation of inclusion levels of alternative splicing events from RNA-seq data, to the input format of Matt analyses. This might include:
  2. Data analysis: run Matt analyses, e.g.:
Though the main focus of working with Matt are the high-level analyses, Matt offers commands for data preparation/table manipulation within the same framework, too. The motivation for this is to prevent users from implementing many highly similar scripts by themselves for standard preprocessing steps like defining groups of exons/introns to be compared with each other, extracting genomic sequences or stratifying data sets. In addition, users can use and combine with Matt any other tool for manipulating tables, e.g., other scripting languages like Awk, Sed, etc. or even LibreOffice Calc or Excel. Some analysis commands produce reports as PDF documents, PDF graphics, or web pages summarizing the results.

To make the work with Matt more efficient, many Matt commands are designed to be combined by piping. Exploiting this capability improves the work experience with Matt.

Matt provides comprehensive help messages on the terminal. To get an overview of all available commands, call

> matt
For getting the help message for COMMAND, call
> matt COMMAND
Each help message describes all arguments, e.g.,
matt COMMAND <ARGUMENT_1> [<ARGUMENT_2>] <ARGUMENT_3> ...
where <ARGUMENT> is a necessary argument, but [<ARGUMENT>] optional and can be omitted. The order of arguments always matters; changing the order might cause errors or false results.

This web page describes all commands and gives example calls. Both, this online documentation and the terminal help messages complement each other.

Download and installation

  1. Create folder for installation
    > mkdir matt
    > cd matt
  2. Download archive
    > wget matt.crg.eu/matt.tar.gz
    or
    > curl -OL matt.crg.eu/matt.tar.gz
  3. Decompress
    > tar -xvzf matt.tar.gz
  4. Run INSTALL script
    > ./INSTALL
    or
    > source ./INSTALL
  5. Add link to any PATH folder
    > cd /path/to/PATH/folder
    > ln -s /path/to/folder/matt/matt
  6. Call Matt to get an overview of commands
    > matt

Tables

Tab-separated plain text tables are the main data structure Matt works on. The motivations for making tables the basic data-structure are:

  1. Compatibility: with the output of many programs for the estimation of inclusion levels of alternative splicing events that often output their results in form of tables, e.g., ASpli, IPSA MISO, rMATS, SANJUAN, SUPPA, VAST-TOOLS,
  2. User-friendliness: most users are familiar with tables like spread sheets (LibreOffice Calc, Excel) and therefore can inspect tables easily with any spread sheet program
  3. Clarity: combining all important pieces of information in a table with precise column names documents well the data
  4. Flexibility: the order of columns in tables is never important; Having specified the names of columns to be used for a specific analysis, Matt will search for these columns in the table automatically. This allows the user to apply Matt to very different tables without the need of re-ordering their columns.
  5. Exploiting piping: many Matt commands can be combined efficiently using piping as their main input and output are tables
  6. Synergy: other powerful command-line tools like Sed or Awk can be applied to tables and, therefore, be used in combination with Matt

Tables must have headers with column names and column names must be unique, e.g., a table describing micro exons could look like this table t1.tab:

         EXON_ID    START   END     SCAFF  STRAND  SEQ      ENSEMBL_GID      GROUP
         ex1        123127  123132  chr1   +       GCTAGT   ENSG00000223972  up
         ex2        903861  903868  chr3   +       GCTCGATT ENSG00038737632  up
         ex3        4291861 4291863 chr5   -       GTC      ENSG00009875223  down
         ex4        9128712 9128717 chr12  -       CGCCCT   ENSG00002652352  down
         ex5        714614  714620  chrM   +       CGCTCGC  ENSG00083273821  up
		
It should be emphasized that Matt does not assume any specific column structure, rather does the user choose specific columns with data necessary for the command the user wants to apply. As a consequence, when calling Matt commands the user needs to provide as arguments the names of these columns. For example, a user might want to do a differential motif enrichment analysis with the micro-exons table above and would list columns SEQ and GROUP, or, when the user wants to retrieve splice-site sequences of these micro exons, the user would use the same table but list columns START, END, SCAFF, STRAND. The order of columns in tables is never important, as the user always need to refer Matt to specific columns by their names.

Not assuming specific columns in a specific order makes Matt flexible and applicable to almost any tab-separated table. For adding column names to tables yet without column names, e.g., like BED files, Matt offers the command add_header.

Users are free to choose any column names they prefer. Though, it is recommended to use capital letters for all COLUMN NAMES. Further, column names should contain only alpha-numeric characters, A-Z, 0-9, and '_'.

Generally, tables might contain any type of data representable as text, including long sequences. For improving the work experience with Matt, it is recommended that each column, whenever possible, should contain only one piece of information / value. Combining different values into one cell by concatenations, e.g., chr1:876123-877155, makes the single values, e.g., chromosome, start, end, inaccessible for Matt.

Matt can process tables with Unix new-lines and Windows new-lines. Tables with old-style Mac new-lines or Dos new-lines need to be adapted first; See command chk_nls. When working with Excel, please save tables in format Windows Text.


Documentation of Matt commands

This online documentation of Matt commands introduces all commands and provides examples and explanations. For more details on commands you might want to checkout the terminal help-messages which Matt offers for all commands.

For displaying an overview of all available commands, call
> matt
For getting a help message for COMMAND, call
> matt COMMAND
Each help message defines all arguments, e.g.,
matt COMMAND <ARGUMENT_1> [<ARGUMENT_2>] <ARGUMENT_3> ...
where <ARGUMENT> is a necessary argument, but [<ARGUMENT>] optional and can be omitted.

Generally, one might categorize all Matt commands into commands for

  1. Data preparation/Table manipulation for analysis (top of left panel)
  2. Data analysis (top of left panel)
If you are primarily interested in performing analyses, you might want to make yourself familiar with the commands for analyses first, and later learn about the commands for data preparation.

Table categorizing all Matt commands into eight groups

1. Import data / check table
     chk_nlsvisualize and check newlines in table
     chk_tabcheck if table is Matt conform, i.e., all rows have same number of columns
     get_pwmread-in position weight matrix model from text file
     get_sjread-in output table with skipped exons from SANJUAN program
     get_vastread-in output table with alternative splicing events from VAST-TOOLS program

2. Retrieve data from table
     col_uniqlist all distinct values of a column
     get_colnmsget column names of table
     get_colsextract entire columns from table
     get_matchmatch entries across two tables via IDs
     get_ovlget overlap, i.e., events (rows) which are present in several tables
     concconstruct artificial row ids by concatenating specific values across rows
     get_rowsextract entire rows from table, which fulfill constraints
     rand_rowsrandomly down-sample rows from table

3. Manipulate table
     add_colsadd columns to table
     add_headeradd header to table yet without header
     add_rangeadd column with range of values increasing for each row
     add_rowsadd rows to table
     add_valadd one new column with constant value to table
     chg_labsin a column with categorical values (group labels), change these labels
     def_catsdefine group labels for items (rows) wrt. their features
     rm_colsremove columns from a table
     rn_colsrename column names in header of table

4. Retrieve sequences
     extr_seqsoutput in FASTA format sequences from a column of a table
     get_seqsretrieve sub-sequences (e.g., exons, genes) from a FASTA file

5. Analyze sequences
     cmpr_motifgenerate a comparative motif plot (PDF) for comparing two motifs of the same length
     get_efeaturesretrieve 75 features for exons
     get_gccdetermine the GC content for sequences in a column of a table
     get_ifeaturesretrieve 50 features for introns
     get_pwm_hitssearch in sequences for hits with a motif position weight matrix model
     get_pwm_profvisualize positional distributions of PWM hits across sequences
     get_regexp_hitssearch in sequences for hits with Perl regular expressions
     get_regexp_profvisualize positional distributions of REGEXP hits across sequences
     get_seqlget length of sequences given in a column of a table
     get_sssget splice-site scores (MaxEnt models) for splice sites
     plot_motifgenerate a motif plot as PDF (similar to sequence logos)
     test_pwm_enrichtest motif enrichment in two sets of sequences with motif position weight matrix
test_regexp_enrichtest motif enrichment in two sets of sequences with motif regular expression

6. Maths and statistics
     row_calcapply mathematical computations along rows
     col_calcapply mathematical computations along columns
     perm_testpermutation test for numeric, categorical, co-occurrence features

7. General purpose
     retr_geneidsretrieve gene IDs from GTF for, e.g., genes, exon, introns
     extr_scafidsget overview of chromosome/scaffold IDs from FASTA, FASTQ, GTF files
get_kmers_from_pwmget top k-mers most likely under a given position weight matrix model
     mk_uniqmake unique a table, i.e., neglect rows with replicated values
     perm_seqsrandomly permute nucleotides within sequences
     prnt_tabprint a table nicely to screen
     retr_rnaseqretrieve RNA-seq data seqs from GEO with GEO accession numbers
     sort_tabsort a table according to values along column
     stratifystratification of two data sets wrt. a numerical feature

8. High-level analyses
     cmpr_exonscomparing exon groups: extract 75 exon features and determine discriminating features
     cmpr_intronscomparing intron groups: extract 50 intron features and determine discriminating features
     rna_mapsfor groups of exons / introns: generate PDF plots with motif RNA maps
     rna_maps_cisbpmotif RNA maps for all CISBP-RNA motifs, or 2 to 5-mers
test_cisbp_enrichfor groups of sequences: check for motif enrichment with all 297 CISBP-RNA binding motifs



> matt chk_nls
Description: Use this command to print a table to screen making visible new-lines. This helps to investigate which new-line characters are present in the table and how the table would need to be adapted to be processable by Matt. This might be necessary for tables with old-style Mac or Dos new-lines.

Example: With t1.tab being the example table from above
> matt chk_nls t1.tab
prints this table making visible new-lines.



> matt chk_tab
Description: Use this command to check if a given table is processable by Matt. To be so, all rows of a table must have the same number of fields. This command lists row-number of rows with a wrong number of fields or reports that the table is correctly formatted.

Example: With t1.tab being the example table from above
> matt chk_tab t1.tab
reports that this table is correctly formatted and, hence, processable by Matt.



> matt get_pwm
Description: Use this command to
  1. read-in a position weight matrix (PWM) given by a table
  2. learn a PWM from aligned sequences of identical length.
and turn it into the standard Matt format for PWMs. The first case is helpful when you have down-loaded a PWM or when you have extracted it from some document and want to use it with Matt. Matt tries to understand the structure of the table, e.g., if rows or columns correspond to positions. The second case is helpful when you have short sequences of identical length, e.g., splice sites or branch points, from which you want to learn a PWM (get the relative frequencies of letters at each position) and use it with Matt.

Example: With M037_0.6.txt being the human MBNL1 PWM from CISBP RNA
> matt get_pwm M037_0.6.txt 0.001 A,C,G,T
prints this MBNL1 PWM with minimal probability 0.001 instead of 0 and alphabet A,C,G,T instead of A,C,G,U. For re-defining the alphabet, you could also directly edit the text file M037_0.6.txt. To save this PWM in Matt format for later use with Matt, re-direct the output into a file.
> matt get_pwm M037_0.6.txt 0.001 A,C,G,T > hsa_mbnl1_pwm.tab



> matt get_sj
Description: This command is tailored to the result table of SANJUAN (a program for estimating inclusion levels of alternative splicing events from RNA-seq data). These tables describe exons and already contain their start and end-coordinate, chromosome, and strand. With this command, you'll add Ensembl gene ids to the table, which are necessary for subsequent Matt analyses.

Example: SANJUAN outputs differentially spliced exons in a table called CEs_clean.txt.
> matt get_sj CEs_clean.txt hg19 > exons_from_sj.tab
will output a table like CEs_clean.txt with an additional column GENEID_ENSEMBL. It will duplicate rows (exons) for which SANJUAN did output several GENE_SYMBOLs; one row for each of the reported genes with its gene name and Ensembl gene id.



> matt get_vast
Description: This command is tailored to the result tables of VAST-TOOLS. Use this command to transform the output table from VAST-TOOLS into Matt format and extract sub-sets of reported alternative splicing events. VAST-TOOLs reports skipped exons, retained introns, alt3, alt5 events in one output table together with estimated PSI values across several data sets. Matt allows you to retrieve form this table events of specific types (skipped exons, retained introns, Alt3, Alt5) or events that have a PSI value within a user specified range across all samples. Before you run a high-level analysis you might need to defined groups of events (e.g. up-regulated vs down-regulated skipped exons) with Matt function def_cats applied to the result table of get_vast. More information on the format of the VAST-TOOLs output table can be found here.

Example: With vts_out.tab being a final results table from VAST-TOOLS combine command
> matt get_vast vts_out.tab -minqab LOW -minqglob N -complex IR,IR-S,IR-C 
     -a kd1,kd2 -b ctr1,ctr2 > ir_events.tab
extracts all intron retention events from vts_out.tab together with detailed information on PSI values (min, max, mean) across all samples, across samples -a kd1,kd2, across samples -b ctr1,ctr2, and dPSIs for the comparison of samples -a vs samples -b. In order to be considered in these calculations, PSI values need to have a minimum quality flag of LOW in samples -a and -b and N across all other samples. If for certain events no PSI values are left, NA is output.

The user can augment the result table (e.g. ir_events.tab) with gene IDs extracted from any given gene annotation (e.g. GTF file Hsa19.gtf with gene ids in field gene_id) by

> matt get_vast vts_out.tab -minqab LOW -minqglob N -complex IR,IR-S,IR-C 
     -a kd1,kd2 -b ctr1,ctr2 -gtf Hsa19.gtf -f gene_id > ir_events.tab
The output table ir_events.tab will have one additional column GENEID with the extracted gene IDs. Of course do the chromosome annotations in the VAST-TOOLs output need to match to those in the GTF file.

Generally, the workflow is as follows: pre-process the output table of VAST-TOOLS for extracting PSI values with get_vast, define groups of events to be analyzed with def_cats, and eventually run a high-level analysis.



> matt col_uniq
Description: Use this command to get all distinct values of a column together with their number of occurrences. This command should be applied only to columns with categorical data.

Example: The ir_events.tab containing intron-retention events from some VAST-TOOLs output obtained with command get_vast. This table has a column COMPLEX which contains for each event the sub-types (IR, IR-S, IR-C) as defined by VAST-TOOLS:
> matt col_uniq ir_events.tab COMPLEX
prints the three sub-types and their numbers of occurrences in table ir_events.tab.



> matt get_colnms
Description: Use this command to get all column names and their indices from a table. This command is useful to investigate the structure of tables and the kind of data in their columns.

Example: With t1.tab being the table from above
> matt get_colnms t1.tab
prints the column headers EXON_ID, START, END, SCAFF, STRAND, SEQ, ENSEMBL_GID, GROUP and their indices.



> matt get_cols
Description: Use this command to retrieve one or several columns from a table.

Example: With t1.tab being the table from above
> matt get_cols t1.tab START END
prints the entire columns START and END from table t1.tab.

You might want to redirect the output into a file to construct a new table consisting of the extracted columns

> matt get_cols t1.tab START END > new_table.tab




> matt get_match
Description: For a given table with entities, find matching values from a second table. E.g., assume you have a table 1 with a column with gene ids and another table 2 with gene ids and gene names, then you can extract gene names from table 2 which match to the gene ids in table 1.

Example: With t1.tab being the table from above and t2.tab describing a mapping GENE_ID→GENE_NAME with columns
  • GENE_ID
  • GENE_NAME
> matt get_match t1.tab GENE_ID t2.tab GENE_ID GENE_NAME
prints a table with one column GENE_NAME. Each row of this table corresponds to a row of t1.tab and it contains the GENE_NAME for the corresponding GENE_ID of t1.tab. Piping allows the user to concatenate both with command add_cols and augment table t1.tab.
> matt get_match t1.tab GENE_ID t2.tab GENE_ID GENE_NAME | matt add_cols t1.tab -



> matt get_ovl
Description: Use this command to retrieve IDs of rows which occur in several tables. This command might be helpful if you want to determine the overlap of different data sets.

Example: Table ir_events.tab contains intron-retention events and has a column EVENT with event IDs. Table ir_events_reference.tab contains reference intron-retention events with column EVENT containing their event IDs. Then
> matt get_ovl ir_events.tab EVENT ir_events_reference.tab EVENT > common_ids.tab
writes a table with column OVERLAP with event ids that occur in table ir_events.tab and ir_events_reference.tab into file common_ids.tab. To extract all rows from table ir_events.tab for these events, you can apply get_rows as follows
> matt get_rows ir_events.tab EVENT]common_ids.tab,OVERLAP[ > common_ir_events.tab



> matt conc
Description: Use this command for concatenating entries of several columns. This command might be useful for constructing artificial row IDs, i.e., event IDs which might be necessary, e.g., for get_ovl if the table not yet contains event IDs.

Example: With t1.tab being the table from above
> matt conc t1.tab EVENT_ID '_' START END SCAFF STRAND
prints a table with one column EVENT_ID which contains in each row an event ID by concatenating START_END_SCAFF_STRAND. With command add_cols you might add this column with event IDs to the table t1.tab.
> matt conc t1.tab EVENT_ID '_' START END SCAFF STRAND | matt add_cols t1.tab -
Other separation symbols like the empty string '' or white space ' ' can be specified.



> matt get_rows
Description: Use this command to retrieve rows (items) from a table. Filters allow the user to select specific rows, i.e., rows which have specific values (numeric or categorical) in some of the columns, or whose vales are empty (empty string).

Example: With t1.tab being the table from above
> matt get_rows t1.tab START[1,5000000] '!STRAND]+['
prints the rows from t1.tab with START in 1 to 5 Mio and STRAND is NOT +.

Constraints can be:

  • LENGTH[0,5] : numeric values in column LENGTH must be in 0 to 5 (inclusive)
  • LENGTH[0,5,10,15] : numeric values in column LENGTH must be in 0 to 5 or in 10 to 15
  • GROUP]g1,g2[ : categorical values in column GROUP must be either g1 or g2
  • GROUP]table.tab,column_name[ : categorical values in column GROUP must be in column column_name of table table.tab (useful if you want to specify many admissible categorical values stored in a table)
  • GROUP][ : categorical value in column GROUP must be empty string
  • COLUMNA==COLUMNB: entries in both columns must be equal
  • '!LENGTH[0,5]' : negation; numerical value in column LENGTH not in 0 to 5
  • '!COLUMNA==COLUMNB' : negation; entries in both columns must not be equal
  • constraints can be combined



> matt rand_rows
Description: Use this command to randomly draw rows from a table. This command might be useful when you want to sub-sample data sets.

Example: The table ir_events.tab contains intron-retention events:
> matt rand_rows ir_events.tab 500 3628929065
prints 500 randomly drawn (seed 3628929065) rows (without replacement) from ir_events.tab. You might want to save these events into a file for later use.
> matt rand_rows ir_events.tab 500 3628929065 > 500_ir_events.tab



> matt add_cols
Description: Use this command to add columns to existing tables. Columns to be added might be newly constructed or retrieved from another table with get_cols.

Example: With t1.tab being the table from above and having constructed row IDs with conc like
> matt conc t1.tab EVENT_ID _ START END SCAFF STRAND > event_ids.tab
doing
> matt add_cols t1.tab event_ids.tab
will add the column EVENT_ID from event_ids.tab to the right end of table t1.tab.
Combining both commands using piping is more efficient.
> matt conc t1.tab EVENT_ID _ START END SCAFF STRAND | matt add_cols t1.tab -
Important: Column names must be unique, i.e., you cannot join a column to a table if this table has already a column with the same column name. To rename columns, you might use rn_cols. It can be efficient to use piping here, e.g.:
> matt rn_cols a.tab COLA:RENAMED_COLA | matt add_cols b.tab -



> matt add_header
Description: Use this command to add column names (entire table header) to a table yet without header to make it processable with Matt.

Example: exons.bed being a BED file with the first three mandatory fields only
> matt add_header exons.bed chrom chromStart chromEnd
adds three column names as header to the file exons.bed directly without printing it to screen.



> matt add_range
Description: Use this command to add a column with a range index increasing for each row.

Example: With t1.tab from above
> matt add_range t1.tab ID
prints t1.tab with extra column ID with index starting at 0 and increasing in each row.



> matt add_rows
Description: Use this command to add rows to existing tables. You might want to add rows of one table to another table, e.g., to join data sets into a single table.

Example: With t1.tab being the table from above
> matt add_rows t1.tab t1.tab
will add all rows from t1.tab to the bottom of t1.tab, which does not look very meaningful.
Important: Both tables must have the same number of columns with the same column names, but their order can vary as Matt will re-order columns automatically if necessary.



> matt add_val
Description: Use this command to add one column with a constant value across all rows to a existing table. You might use this command to add a column containing some kind of group IDs (e.g., up_regulated) to tables which you later join into one table.

Example: With t1.tab being the table from above
> matt add_val t1.tab DATE 22.02.2017
will print table t1.tab to screen with a additional column DATE containing the value 22.02.2017 in all rows.



> matt chg_labs
Description: Use this command to change labels (categorical entries) in a column of a table. You might use this command if you have a column with group IDs and want to change these IDs / labels.

Example: With t1.tab being the table from above
> matt chg_labs t1.tab EXON_ID ex1:ex11 ex2:ex22
will print table t1.tab to screen where in column EXON_ID ex1 is changed for ex11, and ex2 for ex22.



> matt def_cats
Description: Use this command (define categories) for creating a new column with some kind of group IDs. Generally, rows of table corresponds to events, while columns correspond to their features. Hence, you define group IDs (one for each row) wrt. entries in the feature columns using the same constraint syntax as used with command get_rows.

This command is important as many analyses essentially compare some groups of items (e.g. groups of exons, introns, sequences, or more), hence, the user needs to defined the groups to be compared.

Example: With t1.tab being the table from above

> matt def_cats t1.tab GROUP2 'g1=START[0,5000000] STRAND]+[' 
     'g2=START[0,5000000] !STRAND]+[' 'g3=START[5000001,10000000]'
will output a table with column GROUP2 with group IDs g1, g1, g2, g3, g1 categorizing each micro exon in table t1.tab wrt. the defined constraints.



> matt rm_cols
Description: Use this command to remove columns from a table changing the original table.

Example: With t1.tab being the table from above where we add a column with
> matt add_val t1.tab DATE 22.02.2017 > t2.tab
doing
> matt rm_cols t2.tab DATE STRAND
will print a table with columns DATE and STRAND to screen and remove these columns from the original t2.tab.
Be careful: The original table will be changed by removing columns. This might lead to undesired data loss.



> matt rn_cols
Description: Use this command to rename columns of tables. You might use this command before you want to join by-column two tables which have some column names in common. Tables must not contain column names several times. Each column needs to have a unique name.

Example: With t1.tab being the table from above
> matt rn_cols t1.tab SCAFF:SCAFFOLD STRAND:STRAND2
will print t1.tab to screen with column names SCAFFOLD instead of SCAFF and STRAND2 instead of STRAND.



> matt extr_seqs
Description: Use this command to output in FASTA format sequences from a table. You might use this command if you want to run other software on sequences which you have obtained with Matt. Often, software for analyzing sequences expect FASTA input format.

Example: With t1.tab being the table from above
> matt extr_seqs t1.tab SEQ EXON_ID
will print the sequences from column SEQ into FASTA format where values in column EXON_ID will be used as sequence IDs.

If the table not yet contains a column with kind of sequence IDs, you might want to use conc to construct them.



> matt get_seqs
Description: Use this command to retrieve genomic sub-sequences from FASTA files. Usually, you'll have a table specifying genomic coordinates (e.g. START, END, SCAFFOLD/CHROMOSOME, STRAND) and you want to retrieve genomic sequences with these coordinates (e.g., sequences of exons, introns, genes, splice sites or any other genomic entity) from a FASTA file (which contains all sequences of the entire genome). The command get_seqs has three modes making it easy to retrieve different genomic sequences wrt. to the coordinates.


Schematic view of positions of START and END coordinates and the four length arguments for mode 3.


The three modes make it easy to retrieve:
  1. sequences from START to END, including START and END position
  2. two sequences per START-END coordinate pair (START and END exclusive)
    1. up-stream sequence of length L1 of region START-END, i.e., from START-L1 to START-1 (on + strand)
    2. down-stream sequence of length L2 of region START-END, i.e., from END+1 to END+L2 (on + strand)
  3. two sequences per START-END coordinate pair (START and END inclusive)
    1. up-stream sequence around START, i.e., from START-L1 to START+L2-1 (on + strand)
    2. down-stream sequence around END, i.e., from END-L3+1 to END+L4 (on + strand)
Sequences from - strand get reverse-complemented automatically and up/down-stream sequences come from the up/down-stream regions wrt. START end END coordinates.

Example: With t1.tab being the table from above
> matt get_seqs t1.tab START END 20 10 SCAFF STRAND Hsa_genome.fa
will extract for all micro exons in t1.tab the up-stream intronic sequences of length 20 and the down-stream intronic sequences of length 10 and print these sequences as a table with columns SEQ_UP and SEQ_DOWN. The scaffold/chromosome IDs in column SCAFF of table t1.tab must be the same as used in Hsa_genome.fa.

Using piping, you can add the retrieved sequences directly to the table t1.tab as get_seqs outputs the sequences as a two-column table.
> matt get_seqs t1.tab START END 20 10 SCAFF STRAND ...
       ... Hsa_genome.fa | matt add_cols t1.tab -



> matt cmpr_motif
Description: Use this command to plot a PDF for visually comparing two motifs (position weight matrices) of the same length. Motifs can be given either by tables describing position weight matrices (get_pwm) or by tables containing sequences (aligned, all of same length) from which Matt will estimate position weight matrices.

Example: With hsa_mbnl1_pwm.tab and hsa_a2bp1_pwm.tab being two PWMs retrieved from CISBP-RNA, both being of length 7,
> matt cmpr_motif hsa_mbnl1_pwm.tab hsa_a2bp1_pwm.tab mbnl1_vs_a2bp1.pdf
generates the PDF mbnl1_vs_a2bp1.pdf which shows the motif plots (plot_motif) for both PWMs and in the center a diagram showing at each position the log2-ratio of probabilities of each nucleotide.





> matt get_efeatures
Description: Use this command for retrieving 75 features of interest for exons. Exons need to be described by a table with basic information like their genomic coordinates and a gene ID of genes the exons belong to. If the table does not yet contain gene IDs, you might use the command retr_geneids for extracting gene IDs from any GTF file for given genomic events, like exons, introns, genes.

Overview of features


Motivation for including features
  • GENE BIOTYPE: Splicing of exons and introns in coding vs non-coding genes might underlie different mechanisms.
  • LENGTH: Exon and intron length, as well as the relation of exon length and the lengths of its neighbor introns, and intron length to the lengths of its neighbor exons might indicate different splicing mechanisms (e.g., exon vs intron definition, Amit et al., DOI: 10.1016/j.celrep.2012.03.013), or categories of exons like micro-exons or long introns.
  • GC CONTENT of exons, introns, neighboring exons and introns and their ratios, and GC contents of splice sites: The GC content together with the length and lengths ratios might indicate different splicing mechanisms (e.g., exon vs intron definition, Amit et al., DOI: 10.1016/j.celrep.2012.03.013)
  • The following BP and splice-site features: They are proxies for the binding specificity respect to the U2 snRNPs and other core spliceosomal components. The lower this binding specificity is, the less the intron/exon will be recognized.
  • SF1 BINDING SCORES: Is an approximation of the branch-point strength thought of impacting the efficiency of splicing.
  • MAXENT SCORES: of splice sites are an approximation of the efficiency of splice sites (Yeo et al., DOI: 10.1089/1066527041410418)
  • DISTANCE PREDICTED BRANCH POINTS TO 3SS, NUMBER OF PREDICTED BRANCH POINTS, SEQUENCE SCORE OF BRANCH POINT SEQUENCE, PYRIMIDINE CONTENT, LENGTH OF PYRIMIDINE TRACT, OFFSET OF PYRIMIDINE TRACT FROM 3SS, PYRIMIDINE TRACT SCORE: are used for the classification of branch points by Corvelo et al., DOI: 10.1371/journal.pcbi.1001016; Thus they are potentially interesting features to be considered when studying and comparing introns or neighbor exons.
  • MEDIAN EXON/INTRON NUMBER: of transcripts a specific exon/intron occurs in; Might indicate if exons/introns preferentially occur in transcripts with few or many exons/introns.
  • (RELATIVE) RANK: of exon or intron. Might reveal if exons/introns preferentially occur at the start, interior, or end of transcripts.
  • PROPORTION OF TRANSCRIPTS: where exon/intron occurs as first, internal, or last exon/intron, or within UTRs. Again, might indicate bias of the location of exons/introns in transcripts.
  • CO-OCCURRENCES: of exons/introns in same transcripts. Might indicate of studied exon/introns tend to co-occur in transcripts.

  • For detecting features discriminating sets of exons, the command cmpr_exons is more suited.

    Examples: To get an overview and a short description of all features
    > matt get_efeatures explain

    With t1.tab and assuming it describes human micro exons
    > matt get_efeatures t1.tab START END SCAFF STRAND ENSEMBL_GID ...
          ... Hsa19.gtf Hsa19.fa Hsap  > efeatures.tab
    writes a table with columns containing exon feature and extracted sequences of the exons, up/down-stream introns, and splice sites.

    If only one or a few features are of interest, users can apply get_cols and extract specific feature columns only.

    > matt get_efeatures t1.tab START END SCAFF STRAND ENSEMBL_GID ...
          ... Hsa19.gtf Hsa19.fa Hsap  | matt get_cols - EXON_LENGTH EXON_GCC > efeatures.tab

    It is recommended to use GTF files with genome annotation from Ensembl, as Matt is tailored to their structure.



    > matt get_gcc
    Description: Use this command to determine the GC content of genomic sequences.

    Example: With t1.tab being the table from above
    > matt get_gcc t1.tab SEQ
    will print a table with one column SEQ_GCC containing the determined values of the GC content.

    Any letter not A, C, G, T, U, a, c, g, t, u will be completely neglected from the computation of the GC content.



    > matt get_ifeatures
    Description: Use this command to retrieve 50 features of interest for introns. Introns need to be described by a table with basic information like their genomic coordinates and a gene ID of genes the introns belong to. If the table does not yet contain gene IDs, you might use the command retr_geneids for extracting gene IDs from any GTF file for given genomic events, like exons, introns, genes.

    Overview of features

    Motivations and references for these features are listed in the description of command get_ifeatures. For detecting features discriminating sets of introns, the command cmpr_introns is more suited.

    Examples: To get an overview and a short description of all features
    > matt get_ifeatures explain

    With table introns.tab describing human introns with columns START, END, SCAFF, STRAND, ENSEMBL_GID
    > matt get_ifeatures introns.tab START END SCAFF STRAND ENSEMBL_GID ...
          ... Hsa19.gtf Hsa19.fa Hsap > ifeatures.tab
    writes a table with columns containing the feature values and extracted sequence of intron, up/down-stream exon, splice sites.

    If only one or a few features are of interest, users can apply get_cols and extract specific feature columns only.

    > matt get_ifeatures t1.tab START END SCAFF STRAND ENSEMBL_GID ...
          ... Hsa19.gtf Hsa19.fa Hsap  | matt get_cols - INTRON_LENGTH INTRON_GCC > ifeatures.tab




    > matt get_pwm_hits
    Description: Use this command to predict hits in sequences with a motif position weight matrix model (PWM).

    Example: With seqs.tab being a table containing sequences (containing A,C,G,T) in column SEQ and some kind of sequence IDs in column ID and hsa_mbnl1_pwm.tab the MBNL1 PWM from CISBP-RNA whose alphabet we had redefined from A,C,G,U to A,C,G,T, the command
    > matt get_pwm_hits seqs.tab SEQ ID hsa_mbml1_pwm.tab 10 single INFER_BGPWM
    will estimate a simple background PWM from all sequences (being identical to a homogeneous Markov model of order 0), and predict MBLN1 hits at positions where the likelihood ratio MBLN1 PWM vs. background PWM is at least 10 or higher, i.e., where it is at least 10 times more likely to actually have a MBNL1 binding site than having no MBNL1 binding site (background). The command prints a table with columns SEQ_ID, START, END, STRAND, HIT, LOGSCORE describing the hits.



    > matt get_pwm_prof
    Description: Use this command to plot positional distributions of hits of motif position weight matrices (PWMs) across sequences. This command involves predicting hits of PWMs as well. Hence, you need to specify the input sequences, the PWMs, and information on how hits should be predicted.

    Example: Assume table exons.tab describes exons and contains beside others these columns:
    1. SEQ_UP: up-stream intronic sequences (150 nt)
    2. SEQ_DOWN: down-stream intronic sequences (150 nt)
    3. DATASET: all exons are classified into non-differentially spliced (ndiff) and differentially spliced (down) and this column contains the group IDs: ndiff and down
    Assume table pwms.tab contains information about two motif PWMs of RNA binding proteins from CISBP-RNA and looks like this:
                NAME    FILE        BG            THRESH
                A2BP1   a2bp1.tab   INFER_BGPWM   5
                MBNL1   mbnl1.tab   INFER_BGPWM   5
    		
    Both files as1bp1.tab and mbnl1.tab were obtained by applying get_pwm to the original PWM files from CISBP-RNA to re-define the alphabet to A,C,G,T, fit it to the alphabet of the sequences. The command:
    > matt get_pwm_prof exons.tab SEQ_UP RL SEQ_DOWN LR ...
             ... pwms.tab NAME FILE THRESH BG DATASET[down,ndiff]
    will go through the following steps for each foreground PWM (MBNL1 and A2BP1 also known as RBFOX1) and each column (SEQ_UP and SEQ_DOWN) and each group (ndiff and down):
    1. estimate a background PWM by taking the frequencies of the four nucleotides A,C,G,T across all sequences in column and group
    2. predict hits at all sequence positions with P_fgPWM(seq)/P_bgPWM(seq)≥5
    3. estimate a positional density of predicted hits with a Normal kernel considering all hits
    4. estimate a positional cumulative density considering all hits
    5. estimate a positional density with Normal kernel considering only the first hit per sequence
    6. estimate a positional cumulative density considering only the first hit per sequence
    The parameter LR and RL defines how the cumulative distributions will be plotted (LR: increasing from left to right, RL: increasing from right to left) and which are the first positions per sequence (LR: first position = left most position, RL: = right-most position).


    The command (from above)
    > matt get_pwm_prof exons.tab SEQ_UP RL SEQ_DOWN LR ...
             ... pwms.tab NAME FILE THRESH BG DATASET[down,ndiff]
    creates a folder pwm_profiles and therein a website 0_all_results.html with all results and a table ydiff_scores_of_dens_and_cdens.tab which contains differences (between the densities down vs. ndiff) for identifying the PWMs for which the densities are most different.

    Example website with results:

    Matt: Positional distributions of hits of motif PWMs

    Data file: exons.tab
    Categories (number of sequences): down ( 868 ), ndiff ( 661 )
    Sequence columns selected (plotting parameter): SEQ_UP ( RL ), SEQ_DOWN ( LR )
    File with information on motif PWMs and background models: pwms.tab
    Date: Feb 27, 2017


    Ordered Results

    Ordering from top to bottom in each column: corresponds to most negative
    to most positive differences of densities / cumulative densities to
    densities / cumulative densities of reference group
    Reference group: ndiff
    If reference group is none then differences are determined wrt. zero line / diagonal with slope 1.

    Densities all positions Cumulative densities all positions Densities first positions Cumulative densities first positions
    MBNL1
    MBNL1
    MBNL1
    MBNL1
    A2BP1
    A2BP1
    A2BP1
    A2BP1


    Plots

    A2BP1 - a2bp1.tab

    Back to Ordered Results



    MBNL1 - mbnl1.tab

    Back to Ordered Results







    > matt get_regexp_hits
    Description: Use this command to predict hits in sequences with Perl regular expressions.

    Example: With seqs.tab being a table containing sequences in column SEQ and some kind of sequence IDs in column ID the command
    > matt get_regexp_hits seqs.tab SEQ ID 'ACCG|G[GC]C|GCCT' single
    will search for possibly overlapping hits of ACCG or GGC or GCC or GCCT (case insensitive) and outputs them as a table with columns SEQ_ID, START, END, STRAND, HIT. The argument single states that Matt should not search in the reverse-complements. If several hits start at the same position, only the first gets reported according to their order in the call. If motifs are of variable length, the longest match will be reported only.



    > matt get_regexp_prof
    Description: Use this command to plot positional distributions of hits of Perl regular expressions across sequences. You need to specify the input sequences, and the regular expressions.

    Example: Assume table exons.tab describes exons and contains beside others these columns:
    1. SEQ_UP: up-stream intronic sequences (150 nt)
    2. SEQ_DOWN: down-stream intronic sequences (150 nt)
    3. DATASET: all exons are classified into non-differentially spliced (ndiff) and differentially spliced (down) and this column contains the group IDs: ndiff and down
    Assume table regexps.tab contains information about two Perl regular expressions and looks like this:
                          NAME    REGEXP
                          MBNL1   [GC][CT]TT[GA]C  
                          A2BP1   [TA]GCATG[CA]|TACATA
    		
    The command:
    > matt get_regexp_prof exons.tab SEQ_UP RL SEQ_DOWN LR ...
             ... regexps.tab DATASET[down,ndiff]
    will go through the following steps for each regular expression (MBNL1 and A2BP1) and each column (SEQ_UP and SEQ_DOWN) and each group (ndiff and down):
    1. predict hits (exact matches) for regular expression
    2. estimate a positional density of hits with Normal kernel considering all hits
    3. estimate a positional cumulative density of hits considering all hits
    4. estimate a positional density with Normal kernel considering only the first hit per sequence
    5. estimate a positional cumulative density considering only the first hit per sequence
    The parameter LR and RL defines how the cumulative distributions will be plotted (LR: increasing from left to right, RL: increasing from right to left). This parameter also defines the first position per sequence (LR: first position = left most position, RL: = right-most position).


    The command (from above)
    > matt get_regexp_prof exons.tab SEQ_UP RL SEQ_DOWN LR ...
             ... regexps.tab DATASET[down,ndiff]
    creates the folder regexp_profiles and therein a website 0_all_results.html with all results and a table ydiff_scores_of_dens_and_cdens.tab which contains differences (between the densities down vs. ndiff) for identifying the regular expressions (patterns) for which the densities are most different.

    Example website with results:

    Matt: Positional distributions of hits of REGEXPs

    Data file: exons.tab
    Categories (number of sequences): down ( 868 ), ndiff ( 661 )
    Sequence columns selected (plotting parameter): SEQ_UP ( RL ), SEQ_DOWN ( LR )
    K or file with information on REGEXPS: regexps.tab
    Date: Feb 27, 2017


    Ordered Results

    Ordering from top to bottom in each column: corresponds to most negative
    to most positive differences of densities / cumulative densities to
    densities / cumulative densities of reference group
    Reference group: ndiff
    If reference group is none then differences are determined wrt. zero line / diagonal with slope 1.

    Densities all positions Cumulative densities all positions Densities first positions Cumulative densities first positions
    A2BP1
    A2BP1
    A2BP1
    A2BP1
    MBNL1
    MBNL1
    MBNL1
    MBNL1


    Plots

    MBNL1 - [GC][CT]TT[GA]C

    Back to Ordered Results



    A2BP1 - [TA]GCATG[CA]|TACATA

    Back to Ordered Results







    > matt get_seql
    Description: Use this command to determine lengths of sequences.

    Example: With t1.tab being the table from above
    > matt get_seql t1.tab SEQ
    will print a table with one column SEQ_LENGTH with the lengths of the sequences in table t1.tab in column SEQ.



    > matt get_sss
    Description: Use this command to estimate the splice site strength of 3' and 5' splice sites using maximum-entropy models (DOI: 10.1089/1066527041410418). Often you'll start with a table of exons or introns. First, you need to retrieve the sequences of the 3' and 5' splice sites. For this, you might use the command get_seqs (mode 3).

    Example: Assuming table introns.tab contains sequences (xxxGTxxxx, 3 exon and 6 intron positions) of the 5' splice sites in column 5SS_SEQ and sequences (xxxxxxxxxxxxxxxxxxAGxxx, 20 intron and 3 exon positions) of the 3' splice sites in column 3SS_SEQ, then
    > matt get_sss introns.tab 3SS_SEQ 5SS_SEQ
    will estimate the strength of the splice sites and output them as a table with two columns STRENGTH_3SS and STRENGTH_5SS.

    Using piping allows the user to extract the sequences and estimate the splice site strength at once. E.g., assume table introns.tab has columns START, END, SCAFF, STRAND describing human introns:

    > matt get_seqs introns.tab START END 3 6 20 3 SCAFF STRAND Hsa19.fa | matt ... 
       ... get_sss - SEQ_UP SEQ_DOWN
    		




    > matt plot_motif
    Description: Use this command to create a motif plot as PDF similar to sequence logos for a given position weight matrix model (PWM) or from a set of aligned sequences of identical length. For obtaining PWMs, re-defining their alphabet or setting a minimum probability, you might use the command get_pwm.

    Example: With human_nova2_pwm.tab being the human NOVA2 PWM from CISBP-RNA with re-defined alphabet (A,C,G,T instead of A,C,G,U) and minimal probability 0.001
    > matt plot_motif human_nova2_pwm.tab
    will create a PDF human_nova2_pwm_motifplot.pdf visualizing the probabilities of the nucleotides at each position, the consensus sequence (letters at the top of the bars), and the information content at each position.





    > matt test_pwm_enrich
    Description: Use this command to statistically test for enrichment/depletion of motifs comparing two sequence sets. The motifs are defined by position weight matrices (PWM). The sequences and the information on the PWMs need to be given as tables. This command first predicts PWM hits in the two sets of sequences and then applies a permutation test for estimating p-values for the enrichment/depletion.

    Example:
    Sequence data: Assume table exons.tab contains sequences and group IDs specifying to which group (cntr, up, down) each sequence belongs. Table exons.tab has these columns (and potentially many more):
    • SEQ: containing sequences
    • GROUP_ID: containing group ids (cntr, up, down)

    PWM specification: Assume you have a 3 PWMs (pwm1, pwm2, pwm3) which you want to use for this enrichment analysis. Table pwms.tab contains information about these PWMs in the following columns
    • NAME: containing short names for the motifs
    • FILE: containing the path to file with table defining the PWM (in Matt format: see get_pwm)
    • THRESH: containing the threshold to be applied for predicting hits with this PWM, e.g., 10 (more details get_pwm_hits)
                    NAME   FILE                THRESH
                    pwm1   /path/to/pwm1.tab   10
                    pwm2   /path/to/pwm2.tab   10
                    pwm3   /path/to/pwm3.tab   10
    
    Running
    > matt test_pwm_enrich exons.tab SEQ GROUP_ID[up,cntr,down,cntr] single ...
          ...  pwms.tab NAME FILE THRESH INFER_BGMODEL quant 10000 > results.tab
    will perform 6 comparisons: (pwm1, pwm2, pwm3) x (up vs. cntr, down vs. cntr), and will output a table with all information on these comparison. This table gets written into file results.tab by re-directing the output.

    Each comparison (say, for pwm1 and up vs. cntr) consists of these steps:
    1. estimate background model bm1 (homogeneous Markov model of order 0) from sequences of the foreground data set, here group up
    2. estimate background model bm2 from sequences of the background data set, here group cntr
    3. predict hits with pwm1 in sequences of group up (ratio pmw1/bm1≥10)
    4. predict hits with pwm1 in sequences of group cntr (ratio pmw1/bm2≥10)
    5. apply a permutation test with 10000 iterations (repeats) and determine p-value
    Instead of inferring the background model (argument INFER_BGMODEL), you can provide a background PWM or predict hits without background model only using the foreground PWM. In the latter case, the thresholds defined in table pwms.tab must be defined wrt. the log-score of the PWM, e.g, -10 or -8.

    Test for positional differences of hits: The above described procedure considers the number of PWM hits across entire sequences. Imagine we predict 14 hits in the sequences of group up close to their left end, and 14 in sequences of group cntr close to their right end. Considering entire sequences, we won't find this positional difference; though this difference might be interesting from a scientific point of view.
    For detecting positional differences, Matt predicts the PWM hits in both sets of sequences once (as seen above), but then tests for significant enrichment/depletion 4 times considering:
    1. the entire sequences as given
    2. only the first 1/3 part of the sequences
    3. only the internal 1/3 part of the sequences
    4. only the last 1/3 part of the sequences
    and reports p-values for all four tests.



    Output: a table containing detailed information (number of data points, test statistics, p values) for all four tests considering different parts of the sequences. The most important columns highlighting found significant enrichment are:
    1. SIGNIFICANT_RESULT_ALL: enrichment considering entire sequences
    2. SIGNIFICANT_RESULT_FIRST: enrichment considering only the first 1/3 of the sequences
    3. SIGNIFICANT_RESULT_INTERNAL: enrichment considering only the internal 1/3 of the sequences
    4. SIGNIFICANT_RESULT_END: enrichment considering only the last 1/3 of the sequences


    Number and length of sequences: Sequences can be of variable length and the two sequence sets might consist of a different number of sequences. Matt normalizes the number of found PWM hits in each sequence wrt. its length and within each data set wrt. its size (number of sequences).

    Avoid false positives: Normalizing the number of PWM hits might result in false positive findings. For example, if hits actually occur only in a specific sub-region of the sequences, e.g, only in the 5'-ends of introns, and if the sequence in one set have a very different length than sequences in the other set.


    Testing these sequences which are of very different length but contain the same number of PWM hits, after normalization it will look if there were more hits in group cntr (per sequence position) than in group up. In this or similar cases, it is recommended to cut the sequences to the regions (highlighted) where you actually expect hits.

    Change test statistic of the enrichment analysis: number of hits vs. number of sequence with hits
    For testing enrichment, one might either consider
    1. the number of hits across sequences per group of sequences
    2. the number of positive sequences containing hits per group
    All examples above consider the number of hits. Depending on the research question, it might be more reasonable to consider the fraction of positive sequences in each group of sequences. Positive sequences are those which contain a minimum number of predicted hits. For selecting the applied test statistic, use argument <MODE>.



    > matt test_regexp_enrich
    Description: Description: Use this command to statistically test for enrichment/depletion of motifs comparing two sequence sets. The motifs are defined by Perl regular expressions. The sequences and the information on the regular expressions need to be given as tables. This command first predicts exact matches of the regular expressions in the two sets of sequences and then applies a permutation test for estimating p-values. This command is very similar to test_pwm_enrich but predicts motif hits with Perl regular expressions get_regexp_hits.

    Example:
    Sequence data: Assume table exons.tab contains sequences and group IDs specifying to which group (cntr, up, down) each sequence belongs. Table exons.tab has these columns (and potentially many more):
    • SEQ: containing sequences
    • GROUP_ID: containing group ids (cntr, up, down)

    Specification of Perl regular expressions / patterns: Assume you want to do the enrichment analysis with two regular expressions: ACCGT|ACCGA|ACCAA and AC[CGT]..C{2,5}GT. Table regexps.tab contains information about these regular expressions in the following columns
    • NAME: containing short names for the motifs/patterns, e.g, the corresponding binding protein
    • REGEXP: containing the Perl regular expressions
                    NAME   REGEXP
                    protA  ACCGT|ACCGA|ACCAA
                    protB  AC[CGT]..C{2,5}GT
    
    Running
    > matt test_regexp_enrich exons.tab SEQ GROUP_ID[up,cntr,down,cntr] single ...
          ...  regexps.tab NAME REGEXP quant 10000 > results.tab
    will perform 4 comparisons: (protA and protB) x (up vs. cntr and down vs. cntr), and will output a table with all information on these comparison. This table gets written into file results.tab as we re-direct the output.

    Each comparison (say, for protA and up vs. cntr) consists of these steps:
    1. search hits (exact matches) with regular expression protA in sequences of group up
    2. search hits (exact matches) with regular expression protA in sequences of group cntr
    3. apply a permutation test with 10000 iterations (repeats) and determine p-value

    Test for positional differences of hits: The above described procedure considers the number of hits across entire sequences. Imagine we predict 14 hits in the sequences of group up close to their left end, and 14 in sequences of group cntr close to their right end. Considering entire sequences, we won't find this positional difference; though this difference might be interesting from a scientific point of view.
    For detecting positional differences, Matt searches for hits in both sets of sequences once (as seen above), but then tests for significant enrichment/depletion 4 times considering:
    1. the entire sequences as given
    2. only the first 1/3 part of the sequences
    3. only the internal 1/3 part of the sequences
    4. only the last 1/3 part of the sequences
    and reports p-values for all four tests.





    Output: a table containing detailed information (number of data points, test statistics, p values) for all four tests considering different parts of the sequences. The most important columns highlighting found significant enrichment are:
    1. SIGNIFICANT_RESULT_ALL: enrichment considering entire sequences
    2. SIGNIFICANT_RESULT_FIRST: enrichment considering only the first 1/3 of the sequences
    3. SIGNIFICANT_RESULT_INTERNAL: enrichment considering only the internal 1/3 of the sequences
    4. SIGNIFICANT_RESULT_END: enrichment considering only the last 1/3 of the sequences


    Number and length of sequences: Sequences can be of variable length and the two sequence sets might consist of a different number of sequences. Matt normalizes the number of found hits in each sequence wrt. its length and within each data set wrt. its size (number of sequences).

    Avoid false positives: Normalizing the number of hits might result in false positive findings. For example, if hits actually occur only in a specific sub-region of the sequences, e.g, only in the 5'-ends of introns, and if the sequence in one set have a very different length than sequences in the other set.


    Testing these sequences which are of very different length but contain the same number of hits, after normalization it will look if there were more hits in group cntr (per sequence position) than in group up. In this or similar cases, it is recommended to cut the sequences to the regions (highlighted) where you actually expect hits.

    Change test statistic of the enrichment analysis: number of hits vs. number of sequence with hits
    For testing enrichment, one might either consider
    1. the number of hits across sequences per group of sequences
    2. the number of positive sequences containing hits per group
    All examples above consider the number of hits. Depending on the research question, it might be more reasonable to consider the fraction of positive sequences in each group of sequences. Positive sequences are those which contain a minimum number of predicted hits. For selecting the applied test statistic, use argument <MODE>.



    > matt row_calc
    Description: Use this command to apply calculations along rows.

    Example: With t1.tab being the table from above
    > matt row_calc t1.tab 'LENGTH=END-START+1'
    		
    will determine the length of the exons and output it as a table with column LENGTH. Use piping to add this new column directly to t1.tab
    > matt row_calc t1.tab 'LENGTH=END-START+1' | matt add_cols t1.tab -
    Computing the log of the average of START, END, LENGTH (not very meaningful) with
    > matt row_calc t.tab 'MEAN=log((START+END+LENGTH)/3)'
    which creates a table with column MEAN.

    You can use any mathematical expression which is available within Perl. Column names in expressions are used like variables. Mathematical expressions must always be put inside ' '.



    > matt col_calc
    Description: Use this command to apply some pre-defined calculations, like summation, along columns. If the table contains columns with group labels, this commands allows to apply calculations per group.

    Example: With t2.tab being this table
             A    B    V1    V2
             a    1    1     2
             a    2    1     3
             b    2    1     4
             a    1    1     5
    		
    > matt col_calc t2.tab sum V1,V2
    sums over the entire columns V1 and V2 and outputs
             V1_SUM    V2_SUM
             4         14
    		
    > matt col_calc t2.tab sum V1,V2 A
    sums over columns V1 and V2, considering entries in column A being group labels and outputs sums per group.
             A    V1_SUM    V2_SUM
             a    3         10
             b    1         4
    		
    > matt col_calc t2.tab sum V1,V2 A,B
    sums over columns V1 and V2, considering entries in columns A and B being group labels and outputs sums per group.
             A    B     V1_SUM    V2_SUM
             a    1     2         7
             a    2     1         3
             b    2     1         4
    		


    > matt perm_test
    Description: This command provides a flexible, statistical test for differences of features between groups. Because it is a permutation test and does not make assumption about the feature distributions. This test can be applied to these kinds of features:
    1. numerical
    2. categorical
    3. co-occurrences

    Example: With exons.tab describing exons (see also get_efeatures) with these (and more) columns
    1. LENGTH: length of exon
    2. GCC: GC content of exon
    3. GENE_TYPE: coding, non-coding
    4. IS_IN_TRS: colon-separated list of transcript ID each exon occurs in
    5. DATASET with group IDs down, ndiff
    The command
    > matt perm_test exons.tab DATASET[down,ndiff] 100000 ...
               ... -num LENGTH,GCC -card GENE_TYPE -cooc IS_IN_TRS
    will apply a permutation test (100000 repeats) and test for significant differences between the two exon groups down vs. ndiff wrt. to the features exon length, exon GC content, and gene type. It also will test whether the exons of the two groups (down vs. ndiff) co-occur differently, e.g., whether they co-occur heavily in one group in a few transcripts, and much less (spread among many transcripts) in the other group. The command will print a text report (NOT a table) with all details and results.

    Co-occurrences: each row of the table corresponds to one element (in our case to an exon). Exons might be clustered, e.g., all exons occurring in the same transcript belong to one cluster. Exons which occur in the same cluster co-occur with each other. You might identify the clusters by giving them unique cluster IDs. In the example, cluster IDs are transcript IDs. Then, there need to be a column in the table which contains information about the clusters each element (exon) occurs in, e.g., IS_IN_TRS. This information must be given by a colon-separated list of cluster IDs, e.g., tr1:tr10:tr155 for exon5 encodes that exon5 actually occurs in these three clusters (transcripts). If a exon does not occur in any cluster, this list is empty.



    > matt retr_geneids
    Description: This is a utility command for automatically extracting gene IDs from gene annotations (GTF files) for genomic events like genes, exons, introns, etc.

    Example: With exons.tab being a table describing human exons (columns START, END, SCAFFOLD, STRAN) and Hsa19.gtf being the Ensembl gene annotation for the human genome (assembly hsa19)
    > matt retr_geneids exons.tab START END SCAFFOLD STRAND Hsa19.gtf -f gene_id
    retrieves for all exons in exons.tab gene ids from Hsa19.gtf, where gene ids are taken from field gene_id of Hsa19.gtf.



    > matt extr_scafids
    Description: This is a utility command for extracting/checking scaffold/sequence/chromosome IDs in FASTA, FASTQ, GFF, GTF files.

    Example: With Hsa19_gDNA.fasta containing all human chromosomes
    > matt extr_scafids Hsa19_gDNA.fasta FASTA
    will print all IDs of human chromosomes.



    > matt get_kmers_from_pwm
    Description: Use this command to determine the most likely kmers under a position weight matrix model (PWM). Either obtain the top N (integer ≥1) kmers or the most likely k-mer which together account for P (0<P<1) percent of the total binding specificity (probability) as encoded by the PWM.

    This command might help if you want to search for PWM hits but don't want to specify a threshold on the PWM log-score. In this case, you could determine the most likely kmers from a PWM and use these as a Perl regular expression and search for exact hits.

    Example: With mbnl1.tab being the PWM for human MBNL1 from CISBP-RNA (with re-defined alphabet A,C,G,T and minimum probability 0.001; see get_pwm)
    > matt get_kmers_from_pwm mbnl1.tab 5 5
    prints the 5 most likely kmers under the MBNL1 PWM.
              ID  KMER     PWM_LOGSCORE       CUMULATIVE_PERCENT_BINDING  PWM_NAME
              1   AGCTTGC  -1.98607349885094  0.137233215315936           mbnl1
              2   TGCTTGC  -2.09775518556902  0.259964844297944           mbnl1
              3   GGCTTGC  -2.38376620889964  0.352167512560293           mbnl1
              4   CGCTTGC  -2.49208691693447  0.434904633342322           mbnl1
              5   AGCTTGG  -3.67966981232818  0.460135937867991           mbnl1
    		
    > matt get_kmers_from_pwm mbnl1.tab 0.5 5
    prints the most likely kmers which together account for 50% of all binding specificity.
              ID  KMER     PWM_LOGSCORE       CUMULATIVE_PERCENT_BINDING  PWM_NAME
              1   AGCTTGC  -1.98607349885094  0.137233215315936           mbnl1
              2   TGCTTGC  -2.09775518556902  0.259964844297944           mbnl1
              3   GGCTTGC  -2.38376620889964  0.352167512560293           mbnl1
              4   CGCTTGC  -2.49208691693447  0.434904633342322           mbnl1
              5   AGCTTGG  -3.67966981232818  0.460135937867991           mbnl1
              6   TGCTTGG  -3.79135149904627  0.482701022411463           mbnl1
              7   AGCTTAC  -3.9437269446445   0.502076889638399           mbnl1
     		



    > matt mk_uniq
    Description: This command allows the user to make-unique a table, i.e., omitting rows with replicated entries in specific columns.

    Example: With t2.tab being this table
             A    B    V1    V2
             a    1    1     2
             a    2    1     3
             b    2    1     4
             a    1    1     5
    		
    > matt mk_uniq t2.tab A B
    outputs
             A    B    V1    V2
             a    1    1     2
             a    2    1     3
             b    2    1     4
    		



    > matt perm_seqs
    Description: This command allows the user to permute randomly nucleotides within sequences.
    With table seqs.tab containing sequences in column SEQ:
    > matt perm_seqs seqs.tab SEQ
    outputs a table with column SEQ_PERM with sequences from SEQ with randomly permuted nucleotides.



    > matt prnt_tab
    Description: Use this command to print a tab-separated table nicely aligned to screen. This command is especially useful for getting an overview of the contents of tables which contain long strings with different lengths.

    Example: With t1.tab being the table from above
    > matt prnt_tab t1.tab -w 5 -s 3
    prints t1.tab where all cells get extended or truncated to 5 characters and with 3 blanks between columns. Use piping and less to see large table.
    > matt prnt_tab t1.tab -w 5 -s 3 | less -S 



    > matt retr_rnaseq
    Description: Allows the user to retrieve RNA-seq data from the Gene Expression Omnibus (GEO), a public repository where many researchers store their RNA-seq data related to their publications. Having at hand the GEO accession numbers of RNA-seq data sets, this command downloads the SRA archives, extracts RNA-seq reads as FASTQ or FASTA files, and, if wanted, renames the extracted FASTA/FASTQ files as specified by the user.

    Example: Specifying accession numbers and data set names in file accession_numbers.txt:
    SRR3532927 nova2kd3
    SRR3532926 nova2kd2
    SRR3532925 nova2kd1
    SRR3532924 nova2wt3
    SRR3532923 nova2wt2
    SRR3532922 nova2wt1
    		
    > matt retr_rnaseq accession_numbers.txt -keepsra -fasta -o rnaseq_data -p 6
    downloads these six data archives, extracts FASTA files, splits them as RNA-seq data are paired-end reads, and renames final files. This is done in parallel with six processes. The output folder rnaseq_data looks then like:
    dataset_info.tab
    nova2kd1_1.fasta.gz
    nova2kd1_2.fasta.gz
    nova2kd2_1.fasta.gz
    nova2kd2_2.fasta.gz
    nova2kd3_1.fasta.gz
    nova2kd3_2.fasta.gz
    nova2wt1_1.fasta.gz
    nova2wt1_2.fasta.gz
    nova2wt2_1.fasta.gz
    nova2wt2_2.fasta.gz
    nova2wt3_1.fasta.gz
    nova2wt3_2.fasta.gz
    SRR3532922.sra
    SRR3532923.sra
    SRR3532924.sra
    SRR3532925.sra
    SRR3532926.sra
    SRR3532927.sra
    



    > matt sort_tab
    Description: Use this command to sort rows of a table according to values (numeric or alpha-numeric) in a distinct column.

    Example: With t1.tab being the table from above
    > matt sort_tab t1.tab START
    will print t1.tab to screen where lines are sorted wrt. START.
    > matt sort_tab t1.tab START -d
    will do the same but sort decreasingly.



    > matt stratify
    Description: Use this command to stratify data sets, e.g., if you want to compare two sets of exons and want to make sure their length distributions are similar, you might first stratify these two sets of exons and then apply perm_test. Stratification can be performed wrt. any numerical feature.

    Example: With exons.tab being a table describing exons with these columns (and maybe more)
    1. DATASET: containing ndiff and down
    2. GCC: containing the GC content of the exons
    the command
    > matt stratify exons.tab GCC DATASET ndiff 0.05
    will stratify both groups such that their GC content distributions are similar. The reference group is ndiff. The algorithm searches for each ndiff exon one exon from the group down which a similar value in column GCC (± 0.05). Each pair of matched exons will get into the final result table. All exons from both groups without matching pair will be neglected. The command outputs a table like exons.tab reduced to the stratified exons.
    With col_calc you may check how well the means or medians of the groups agree after stratification.



    > matt cmpr_exons
    Description: Use this command if you want to compare groups of exons. This command goes through these steps:
    1. run get_efeatures on the input table to determine exon-related features
    2. apply Mann-Whitney U test to the comparison of feature distributions across groups
    3. output:
      1. PDF report summarizing results with box plots
      2. table with all events and extracted features
      3. table with details on performed statistical tests including p values
      4. all box plots as PDF graphics
    Information on exon related features are available under get_efeatures. The input table describing exons must contain these pieces of information about the exons in separate columns
    1. start coordinate
    2. end coordinate
    3. chromosome ID (must match with chromosome IDs in GTF and FASTA; use extr_scafids to check IDs in GTFs and FASTAs)
    4. strand
    5. gene ID of gene where exon occurs in; these gene IDs must match with the gene IDs in the GTF which you use
    Recommendation: use the GTF files from Ensembl as they also contain information on the gene types (coding, non-coding, etc.) as these information will get used as well.

    Example: With exons.tab being a table describing human exons with these columns (and maybe more)
    1. START
    2. END
    3. SCAFFOLD
    4. STRAND
    5. GENEID_ENSEMBL
    6. DATASET: containing group IDs of exons (down, up, ndiff)
    and with Hsa19.gtf being a gene description file from Ensembl and Hsa19.fa the corresponding FASTA file with all genome sequences
    > matt cmpr_exons exons.tab START END SCAFFOLD STRAND GENEID_ENSEMBL ...
        ... Hsa19.gtf Hsa19.fa Hsap 150 DATASET[down,ndiff] down_vs_ndiff
    will compare exons down vs. ndiff and leave untouched exons from group up. The argument 150 specifies the length of the up-stream and down-stream intronic regions which should be searched for SF1 hits. It will generate the output folder down_vs_ndiff and place therein a table with all exons (down and ndiff) and the extracted exon features. It will also place therein a PDF document summary.pdf containing all details of the comparison.

    Hint 1: You might first stratify the sets of exons with stratify and then apply cmpr_exons.
    Hint 2: You might chose colors for the box plots of the PDF report. See help message of cmpr_exons for getting help on this.
    Hint 3: DATASET[down,ndiff]: the last group specified should be the reference group (like non-differentially spliced).
    Hint 4: DATASET[down,ndiff] compares set down vs. ndiff, DATASET[down,up,ndiff] compares down vs. up, down vs. ndiff, up vs. ndiff.




    > matt cmpr_introns
    Description: Use this command if you want to compare groups of introns. This command goes through these steps:
    1. run get_ifeatures on the input table to determine exon-related features
    2. apply Mann-Whitney U test to the comparison of feature distributions across groups
    3. output:
      1. PDF report summarizing results with box plots
      2. table with all events and extracted features
      3. table with details on performed statistical tests including p values
      4. all box plots as PDF graphics
    Information on features under get_ifeatures. The input table describing introns needs to contain these pieces of information about the introns in separate columns
    1. start coordinate
    2. end coordinate
    3. chromosome ID; these chromosome IDs must match with the chromosome IDs in the GTF and FASTA which you use
    4. strand
    5. gene ID of gene where intron occurs in; these gene IDs must match with the gene IDs in the GTF which you use
    Recommendation: use the GTF files from Ensembl as they also contain information on the gene types (coding, non-coding, etc.) as these information will get used as well.

    Example: With introns.tab being a table describing human introns with these columns (and maybe more)
    1. START
    2. END
    3. SCAFFOLD
    4. STRAND
    5. GENEID_ENSEMBL
    6. DATASET: containing group IDs of introns (down, up, ndiff)
    and with Hsa19.gtf being a gene description file from Ensembl and Hsa19.fa the corresponding FASTA file with all genome sequences
    > matt cmpr_introns introns.tab START END SCAFFOLD STRAND GENEID_ENSEMBL ...
        ... Hsa19.gtf Hsa19.fa Hsap 150 DATASET[down,ndiff] down_vs_ndiff
    will compare introns down vs. ndiff and leave untouched introns from group up. The argument 150 specifies the length of the 3'-end of the introns which should be searched for SF1 hits. It will generate the output folder down_vs_ndiff and place therein a table with all introns (down and ndiff) and the extracted intron features. It will also place therein a PDF document summary.pdf containing all details of the comparison.

    Hint 1: You might first stratify the sets of introns with stratify and then apply cmpr_introns.
    Hint 2: You might chose colors for the box plots of the PDF report. See help message of cmpr_introns for getting help on this.
    Hint 3: DATASET[down,ndiff]: the last group specified should be the reference group (like non-differentially spliced).
    Hint 4: DATASET[down,ndiff] compares group down vs. ndiff, DATASET[down,up,ndiff] compares down vs. up, down vs. ndiff, up vs. ndiff.




    > matt rna_maps
    Description: Allows the user to produce motif RNA-maps for comparing enrichment of binding motifs of RNA/DNA binding proteins for different groups of exons or introns. Motifs can be defined as position weight matrix models (PWMs) or Perl regular expressions. This command will extract all necessary sequences from the genomic proximity of given exons/introns, predict hits of the specified motifs therein, and compute motif enrichment scores. If chosen by the user, a permutation test is applied for estimating p values and regions with significant differences in motif enrichment will be highlighted in the motif RNA-map.
    Output:
    1. PDF report containing all motif RNA-maps
    2. PDF report with motif RNA-maps with significant enrichment only (with arguments -p, -fdr)
    3. For each map, there are two versions, one with a data coverage section, and one without. The data coverage section shows the data overage along the map, which might be less then 100% if the data set contains exons/introns shorter than the specified exon/intron sections to be visualized.
    4. The motif RNA-maps are ordered according to largest differences of motif enrichment scores; Maps with largest positive differences come first, maps with largest negative differences last. The reference group for computing differences is always the last group specified in the program call.
    5. all motif RNA-maps as PDF graphics

    Example: Assume file nova2_regexpr.tab describes a Perl regular expression of the NOVA2 binding motif in a table with columns
    1. TYPE: the type of motif model, either REGEXP or PWM
    2. NAME: the name of the motif
    3. REGEXP: the Perl regular expression of a link to a file (table) containing the PWM
    4. THRESH: only for PWM (set to NA for REGEXPs): threshold for hit prediction (see get_pwm_hits)
    5. BGMODEL: only for PWM (set to NA for REGEXPs): if and which background model should be used
    Further, 3 groups of human exons are described in table exons.tab with columns
    1. START: start of exon
    2. END: end of exon
    3. SCAFFOLD: chromosome
    4. STRAND
    5. UPSTRM_EX_BORDER: border of next up-stream exon
    6. DOSTRM_EX_BORDER: border of next down-stream exon
    7. GROUP: defining exon groups (down, up, ndiff); last group is reference group
    The following call produces a NOVA2 motif RNA map with a sliding window of length 31 which slides up to position 35 into exons and up to position 135 into introns. Regions with significant enrichment as compared to the reference group ndiff are highlighted.
    matt rna_maps exons.tab UPSTRM_EX_BORDER START END DOSTRM_EX_BORDER SCAFFOLD
        ... STRAND GROUP[up,down,ndiff] 31 35 135 Mmu09.fasta nova2_regexp.tab TYPE
        ... NAME REGEXP THRESH BGMODEL -d nova2_rnamap -p 0.0001 10000
    The following graphic as PDF will be put into output folder nova2_rnamap

    Hint: When you use the arguments -p or -fdr then statistical testing is performed and regions with significant enrichment/depletion are highlighted. The last group in GROUP[up,down,ndiff] is the reference group for testing, e.g., here up vs ndiff and down vs ndiff. Hence, highlighting will be plotted only for groups up and down, never for ndiff.



    > matt rna_maps_cisbp
    Description: Allows users to produce motif RNA maps for
    1. all ACGT k-mers (length 2-5)
    2. all approx. 340 CISBP-RNA BPS with IUPAC binding motif
    3. motif RNA-maps can be generated for exons or introns
    at once without needing to specify these k-mer motifs or download the CISBP-RNA binding motifs.

    Output:
    1. PDF report containing all motif RNA-maps
    2. PDF report with motif RNA-maps with significant enrichment only (with arguments -p, -fdr)
    3. For each map, there are two versions, one with a data coverage section, and one without. The data coverage section shows the data overage along the map, which might be less then 100% if the data set contains exons/introns shorter than the specified exon/intron sections to be visualized.
    4. The motif RNA-maps are ordered according to largest differences of motif enrichment scores; Maps with largest positive differences come first, maps with largest negative differences last. The reference group for computing differences is always the last group specified in the program call.
    5. all motif RNA-maps as PDF graphics

    Assume 3 groups of human exons are described in table exons.tab with columns
    1. START: start of exon
    2. END: end of exon
    3. SCAFFOLD: chromosome
    4. STRAND
    5. UPSTRM_EX_BORDER: border of next up-stream exon
    6. DOSTRM_EX_BORDER: border of next down-stream exon
    7. GROUP: defining exon groups (down, up, ndiff); last is reference group
    The following call produces motif RNA maps for all CISBP-RNA IUPAC binding motifs with a sliding window of length 31 which slides up to position 35 into exons and up to position 135 into introns.
    matt rna_maps_cisbp exons.tab UPSTRM_EX_BORDER START END DOSTRM_EX_BORDER SCAFFOLD
        ...  STRAND GROUP[up,down,ndiff]  31 35 135 Hsa19.fasta cisbprna_regexps
        ...  -d cisbp_rnamaps
    Hint: When you use the arguments -p or -fdr then statistical testing is performed and regions with significant enrichment/depletetion are highlighted. The last group in GROUP[up,down,ndiff] is the reference group for testing, e.g., here up vs ndiff and down vs ndiff. Hence, highlighting will be plotted only for groups up and down, never for ndiff.
    See command rna_maps for an example motif RNA map.



    > matt test_cisbp_enrich
    Description: Use this command to test for enrichment/depletion with all approx. 300 PWM motifs from the CISBP-RNA database.

    Example: With table t.tab having these columns (and maybe many more)
    1. SEQ_3P: containing 3'-ends (150 nt) of introns
    2. DATASET: containing group IDs of introns (e.g., down, up, ndiff)
    the command
    > matt test_cisbp_enrich t.tab SEQ_3P DATASET[down,ndiff,up,ndiff]
    will compare down vs. ndiff and up vs. ndiff. During each comparison and for each CISBP-RNA binding PWM, it will predict binding sites in all sequences and apply a permutation test to determine p-values for the differences wrt. the number of predicted hits in both data sets. It will generate a table containing all results (including number of predicted hits per group, test statistics, p-values, information about the binding motifs (PWMs) like associated RNA-binding protein). For PWMs for which significant enrichment/depletion of hits was found, test_cisbp_enrich also generates PWM profiles invoking get_pwm_prof.

    Internally, test_cisbp_enrich applies the command test_pwm_enrich. Hence per binding motif (PWM), test_cisbp_enrich tests for differences in the number of predicted hits between two groups of sequences as well as for positional differences of the predicted hist. In detail, test_cisbp_enrich searches for hits in both sets of sequences once, but then tests for significant enrichment/depletion 4 times considering:
    1. the entire sequences as given
    2. only the first 1/3 part of the sequences
    3. only the internal 1/3 part of the sequences
    4. only the last 1/3 part of the sequences
    and reports p values for all four tests.




    Output: a table containing detailed information (number of data points, test statistics, p values) for all four tests considering different parts of the sequences. The most important columns highlighting found significant enrichment are:
    1. SIGNIFICANT_RESULT_ALL: enrichment considering entire sequences
    2. SIGNIFICANT_RESULT_FIRST: enrichment considering only the first 1/3 of the sequences
    3. SIGNIFICANT_RESULT_INTERNAL: enrichment considering only the internal 1/3 of the sequences
    4. SIGNIFICANT_RESULT_END: enrichment considering only the last 1/3 of the sequences


    Number and length of sequences: Sequences can be of variable length and the two sequence sets might consist of a different number of sequences. Matt normalizes the number of found hits in each sequence wrt. its length and within each data set wrt. its size (number of sequences).

    Avoid false positives: Normalizing the number of hits might result in false positive findings. For example, if hits actually occur only in a specific sub-region of the sequences, e.g, only in the 5'-ends of introns, and if the sequence in one set have a very different length than sequences in the other set.


    Testing these sequences which are of very different length but contain the same number of hits, after normalization it will look if there were more hits in group cntr (per sequence position) than in group up. In this or similar cases, it is recommended to cut the sequences to the regions (highlighted) where you actually expect hits.

    Hint 1: You might run test_cisbp_enrich with a sub-set of CISBP-RNA PWMs. See help message for details on this.
    Hint 2: Change test statistic of the enrichment analysis: number of hits vs. number of sequence with hits
    For testing enrichment, one might either consider
    1. the number of hits across sequences per group of sequences
    2. the number of positive sequences containing hits per group
    All examples above consider the number of hits. Depending on the research question, it might be more reasonable to consider the fraction of positive sequences in each group of sequences. Positive sequences are those which contain a minimum number of predicted hits. For selecting the applied test statistic, use argument -m.






    Examples

    This section shows applications of Matt to these tasks:
    1. Comparing sets of exons wrt. their exon features
    2. Testing enrichment of TGC in up-stream regions of neural micro exons
    3. Analyzing exons regulated by splicing factor Nova2

    The examples contain many explanations and therefore appear lengthy. The more familiar the user is with Matt, the easier and faster the work experience will become.



    Comparing sets of exons wrt. their exon features

    The data table data.tab contains alternative splicing events published in Irimia et. al, Cell, 2014. It was assembled from output of different VAST-TOOLS runs. Each event belongs to one group of
    1. AS_noNeural
    2. NEURAL-DOWN
    3. NEURAL-UP
    and we want to extract exon features for all exons of this data set and compare the three sets of exons wrt. these features.

    1. Making yourself familiar with data

    by checking, e.g., column names of the given table.
    > matt get_colnms data.tab
    
       ID    COL_NAMES
       1     GENE
       2     EVENT
       3     COORD
       4     LENGTH
       5     FullCO
       6     TYPE
       7     GROUP
    The table data.tab has 7 columns containing information on alternative splicing events as output from Vast-Tools.
    Print the first rows nicely aligned:
    > matt prnt_tab data.tab -W 30 | less -S -
    
       GENE EVENT         COORD                     LENGTH FullCO TYPE GROUP      
            HsaEX0000026  chr12:120648733-120648909 177    ...    S    AS_noNeural
            HsaEX0000032  chr12:10749491-10749580   90     ...    S    AS_noNeural
            HsaEX0000033  chr12:10748290-10748526   237    ...    C2   AS_noNeural
            HsaEX0000044  chr14:58760484-58760644   161    ...    C1   AS_noNeural
            HsaEX0000048  chr14:58752309-58752474   166    ...    C2   AS_noNeural
            HsaEX0000053  chr12:70636846-70636984   139    ...    S    AS_noNeural
            HsaEX0000062  chr14:101376570-101376641 72     ...    S    AS_noNeural
            HsaEX0000087  chr15:84850734-84850835   102    ...    S    AS_noNeural
            HsaEX0000091  chr15:57209263-57209410   148    ...    S    AS_noNeural
       A1BG HsaINT0000005 chr19:58858396-58858718   323    ...    IR-S AS_noNeural
       A2M  HsaINT0000025 chr12:9256997-9258831     1835   ...    IR-S AS_noNeural
       A2M  HsaINT0000026 cr12:9254271-9256834      2564   ...    IR-S AS_noNeural
       A2M  HsaINT0000027 chr12:9253804-9254042     239    ...    IR-S AS_noNeural
    Column FullCO was not put on this website to save space. Column TYPE encodes the type of alternative splicing events.
    Check all different types of splicing events:
    > matt col_uniq data.tab TYPE | matt prnt_tab - -w 9 | less -S -
    
       TYPE_UNIQ   FREQ     
       Alt3        3444     
       Alt5        2449     
       C1          1604     
       C1*         312      
       C2          1287     
       C2*         259      
       C3          1134     
       C3*         237      
       IR-C        3483     
       IR-S        15494
       ...
    
    2. Adapting labels in column TYPE

    We need to use command get_vast later which works with Vast-Tools standard types of alternative splicing events which are: S, C1, C2, C3, MIC, IR-S, IR-C, Alt3, Alt5. Table data.tab contains these standard types but also other types like S*, C1*. Hence, we adapt all labels to the standard types with command chg_labs.
    > matt chg_labs data.tab TYPE S*:S C1*:C1 C2*:C2 C3*:C3 MIC_M:MIC MIC_S:MIC > tmp.tab
    > mv tmp.tab data.tab
    
    3. Extracting exons and their start/end-coordinate, chromosome, strand

    The commands get_efeatures and cmpr_exons expect information on exons like their start and end coordinate but table data.tab does not contain this information, yet. The command get_vast is specifically designed for extracting this information from VAST-TOOLs output tables. It outputs an augmented table, from which we select only exon events by re-directing this table to the command get_rows. The final output table gets re-directed into file exons.tab.
    > matt get_vast data.tab COORD FullCO TYPE LENGTH | ...
       ... matt get_rows - TYPE]S,C1,C2,C3,MIC[ > exons.tab
    
    Checking how table exons.tab looks like by listing its column names.
    > matt get_colnms exons.tab
    
       ID      COL_NAMES
       1       GENE
       2       EVENT
       3       COORD
       4       LENGTH
       5       FullCO
       6       TYPE
       7       GROUP
       8       SCAFFOLD
       9       START
       10      END
       11      STRAND
       12      UPSTRM_EX_BORDER
       13      DOSTRM_EX_BORDER
    
    To verify that exons.tab contains only exons, you could also apply col_uniq to column TYPE again.

    4.1 Adding Ensembl gene IDs

    which are necessary for commands get_efeatures and cmpr_exons. Here, we use command get_match which takes as input a second table containing a mapping EVENT_ID to ENSEMBL_GENEID (mapping_EVENT2GENEID.tab) and outputs for each EVENT_ID in column EVENT of table exons.tab the corresponding GENEID. The resulting table (one column) gets added immediately to table exons.tab exploiting piping.
    > matt get_match exons.tab EVENT mapping_EVENT2GENEID.tab EVENT ENSEMBL_GENEID ...
           ... | matt add_cols exons.tab -

    4.2 Adding Ensembl gene IDs

    Alternatively, we could use the command retr_geneids to extract gene IDs from a given GTF file for splicing events. The following commands will append to the table exons.tab a column ENSEMBL_GENEID.
    > matt retr_geneids exons.tab START END SCAFFOLD STRAND Hsa19_ensembl.gtf -f gene_id -n ENSEMBL_GENEID ...
           ... | matt add_cols exons.tab -

    4.3 Adding Ensembl gene IDs

    Yet another option is to use the command get_vast with arguments -gtf and -f. In this case, it automatically extracts gene IDs from the given GTF and appends a column GENEID, which we rename to ENSEMBL_GENEID.
    > matt get_vast data.tab COORD FullCO TYPE LENGTH -gtf Hsa19_ensembl.gtf -f gene_id | ...
       ... matt get_rows - TYPE]S,C1,C2,C3,MIC[ | matt rn_cols - GENEID:ENSEMBL_GENEID > exons.tab
    

    5. Checking number of exons in final table

    to get an idea about the size of the exon sets.
    > matt col_uniq exons.tab GROUP
    
       GROUP_UNIQ     FREQ
       AS_noNeural    10465
       NEURAL-DOWN    631
       NEURAL-UP      1046
    
    6. Extracting exon features for all exons

    You might want to download the USCS HG19 assembly sequences and the Ensembl gene annotation for HG19. For decompressing the file hg19.2bit you need to download as well twoBitToFa and run
    > ./twoBitToFa -noMask hg19.2bit Hsa19_gDNA.fasta
    
    Then extracting exon features and adding these immediately to the table exons.tab
    > fasta=/..path..to../Hsa19_gDNA.fasta
    > gtf=/..path..to../ensembl/Hsa19.gtf
    > matt get_efeatures exons.tab START END SCAFFOLD STRAND ENSEMBL_GENEID ...
        ... $gtf $fasta Hsap 150 | matt add_cols exons.tab -
    
    Table exons.tab now contains for all exons information about their genomic coordinate, gene, and their exon features including sequences like exon sequence, splice site sequences, sequences of up/down-stream introns. All these information can be used as input for further analyses, e.g., with R. The extracted sequences might be of interest for further analyses like de-novo motif search etc, which often expect their input as FASTA files. Using the command extr_seqs, you might extract sequences from specific columns of exons.tab in FASTA format.

    7. Down-sample exon set AS_noNeural

    The sizes of the exon sets are: AS_noNeural=10465, NEURAL-DOWN=631, NEURAL-UP=1046. Hugh groups like AS_noNeural are unnecessary for comparing exon sets and would slow down analysis. Hence, we sub-sample 2000 AS_noNeural exons with rand_rows and create a new table exons_testsets.tab.
    > matt get_rows exons.tab GROUP]AS_noNeural[  ... 
           ... | matt rand_rows - 2000 132927352 > exons_testsets.tab
    > matt get_rows exons.tab GROUP]NEURAL-DOWN,NEURAL-UP[ ...
      ... | matt add_rows exons_testsets.tab -
    
    The table exons_testsets.tab should be used as input for command cmpr_exons. It is important to know that cmpr_exons will extract features for all exons of its input table and append them to the input table. As a consequence, the input table must not contain already columns with identical column names because column names in a table must be unique. Hence, from exons_testsets.tab we select only the important columns and neglect the already added columns with exon features.
    > matt get_cols exons_testsets.tab START END SCAFFOLD STRAND ...
       ... ENSEMBL_GENEID GROUP > tmp.tab
    > mv tmp.tab exons_testsets.tab
    
    Checking the number of exons in the final table exons_testsets.tab confirms that the sampling worked as expected.
    > matt col_uniq exons_testsets.tab GROUP
    
       GROUP_UNIQ     FREQ
       AS_noNeural    2000
       NEURAL-DOWN    631
       NEURAL-UP      1046
    
    8. Statistically comparing exon sets and generating PDF report
    > fasta=/..path..to../Hsa19_gDNA.fasta
    > gtf=/..path..to../ensembl/Hsa19.gtf
    > matt cmpr_exons exons_testsets.tab START END SCAFFOLD STRAND ENSEMBL_GENEID ...
       ... $gtf $fasta Hsap 150 GROUP[NEURAL-UP,NEURAL-DOWN,AS_noNeural] ...
       ... cmpr_1 -colors:brown2,darkgoldenrod2,azure4
    
    With GROUP[NEURAL-UP,NEURAL-DOWN,AS_noNeural] we select the groups which should be compared, i.e., all pair-wise comparisons are done: NEURAL-UP vs. NEURAL-DOWN, NEURAL-UP vs. AS_noNeural, and NEURAL-DOWN vs. AS_noNeural. Putting AS_noNeural in the last place will give us box plots which contain the group AS_noNeural as reference group. All output gets written into folder cmpr_1, where you find a summary in form of a PDF document with all details on the comparisons and all graphics in sub-folder summary_graphics for later use.

    The PDF report contains all details and plots and is organized in four main parts:
    1. Overview of results of permutation tests across all comparisons and features
    2. Box-plots for all numerical features with all exon groups tested
    3. Comparative sequence plots for all groups of splice sites, branch points for all comparisons
    4. Details of results of permutation tests




    Testing enrichment of TGC in up-stream regions of neural micro exons

    We want to statistically test if the 3-mer TGC occurs more frequent in the 200 nt up-stream regions of neural micro exons. We start again with the table data.tab, which we already have used for comparing different sets of neural and non-neural AS events. Events are grouped according to their origin
    1. AS_noNeural
    2. NEURAL-DOWN
    3. NEURAL-UP
    For doing the enrichment analysis, we actually need the up-stream sequence of the exons. For extracting these sequences, we need the start/end coordinate, chromosome, and strand information which are not yet in table data.tab. Hence, we go through the same first three steps as above:
    1. Making yourself familiar with data
    2. Adapting labels in column TYPE
    3. Extracting exons and their start/end-coordinate, chromosome, strand

    4. Down-sample exon set AS_noNeural

    The set of AS_noNeural is much larger than the other sets.
    > matt col_uniq exons.tab GROUP
    
       GROUP_UNIQ   FREQ
       AS_noNeural  10465
       NEURAL-DOWN  631
       NEURAL-UP    1046
    
    Hence, we randomly sample 2000 AS_noNeural exons and construct a new table with them and all NEURAL-DOWN and NEURAL-UP exons.
    > matt get_rows exons.tab GROUP]AS_noNeural[ | ...
       ... matt rand_rows - 2000 923681819 > exons_testsets.tab
    > matt get_rows exons.tab GROUP]NEURAL-DOWN,NEURAL-UP[ | ...
       ... matt add_rows exons_testsets.tab -
    
    Verifying the sizes of the exon sets in exons_testsets.tab:
    > matt col_uniq exons_testsets.tab GROUP
    
       GROUP_UNIQ   FREQ
       AS_noNeural  2000
       NEURAL-DOWN  631
       NEURAL-UP    1046
    

    5. Retrieving up-stream sequences of exons

    The command get_seqs is the utility command for retrieving different types of sequences. For a given genomic region, e.g. an exon in this case, it allows to retrieve the up-stream and down-stream sequences (*Mode 2 in its help message) and will output these as a table with two columns SEQ_UP and SEQ_DOWN. Here we define as length for the up- and down-stream region 200 nt and add the columns with sequences directly to table exons_testsets.tab.
    > fasta=/..path..to../Hsa19_gDNA.fasta
    > matt get_seqs exons_testsets.tab START END 200 200 SCAFFOLD STRAND $fasta | ...
       ... matt add_cols exons_testsets.tab -
    
    Verifying the result:
    > matt get_colnms exons_testsets.tab
    
       ID  COL_NAMES
       1   START
       2   END
       3   SCAFFOLD
       4   STRAND
       5   TYPE
       6   LENGTH
       7   GROUP
       8   SEQ_UP
       9   SEQ_DOWN
    
    Looking at the retrieved sequences nicely shows/confirms that (almost) all up-stream sequences end in AG, as expected,
    > matt get_cols exons_testsets.tab SEQ_UP | less -S -
    
       SEQ_UP
       ...CAAGGCTGTCTTTTCTCCATAG
       ...TTAAATTGTCTTCTCTCCACAG
       ...ACTTGAATTTCTACTCCCACAG
       ...CAACTTTCTAATCTATCCCTAG
       ...CCCCCTTGTGTTCCTCCTGCAG
       ...ATTTTGTTATTTTTTTTTTTAG
       ...GCTGTTCGTTTTTCTTTCTTAG
       ...CTTCGCCTCCTTGCTGTTTCAG
       ...
    
    and (almost) all down-stream sequences start with GT:
    > matt get_cols exons_testsets.tab SEQ_DOWN | less -S -
    
       SEQ_DOWN
       GTGAGCTGGCAGAAATGAAGGGACT
       GTAAGACCAGTTTACCTTTTGAAGA
       GTAATACTTTGGGGAAGGAAGTTTT
       GTAAGAGTATTTTCCTTAATCAGTT
       GTATGTGGCTGTGGTGAACGGGCAC
       GCATGTAGCACCACACCTAACTGGA
       GTAAGCCTGCCAGGAGCCACTCAGC
       GTGAGCTCTGAGCCTCTGGCATTTC
       ...
    

    7. Defining sets of micro exons

    Lastly, before we apply the enrichment analysis, we need to define the groups of exons we want to compare:
    1. nmic_s: neural, short micro exons with length 3-15
    2. nmic: neural, micro exons with length 16-27
    3. nex: neural exons with length > 27
    4. ex: non-neural exons
    The command add_val allows the user to define groups and generate a column with group labels.
    > matt def_cats exons_testsets.tab GROUP2 ...
       ... 'nmic_s=LENGTH[3,15] GROUP]NEURAL-UP,NEURAL-DOWN[' ...
       ... 'nmic=LENGTH[16,27] GROUP]NEURAL-UP,NEURAL-DOWN[' ...
       ... 'nex=LENGTH[28,1000000000] GROUP]NEURAL-UP,NEURAL-DOWN[' ...
       ... 'ex=GROUP]AS_noNeural[' | matt add_cols exons_testsets.tab -
    
    Verifying results
    > matt col_uniq exons_testsets.tab GROUP2
    
       GROUP2_UNIQ	FREQ
       ex           2000
       nex          1371
       nmic         155
       nmic_s       151
    

    7. Statistically testing enrichment of TGC in up-stream sequences

    The data are ready prepared for applying the command test_regexp_enrich. We use this command because the 3-mer TGC can be described by a simple regular expression TGC. We create a simple tab-separated table describing this regular expression and save it in tgc.tab which looks like
    REGEXP   NAME
    TGC      TGC
    
    When you copy'n'paste this into tgc.tab, please make sure that columns are tab-separated. Then we run the enrichment analysis comparing nmic_s vs. ex, nmic vs. ex, and nex vs. ex using the sequences as given (single) and for computing p-values running a permutation test 100000 times:
    > matt test_regexp_enrich exons_testsets.tab SEQ_UP ...
       ... GROUP2[nmic_s,ex,nmic,ex,nex,ex] single tgc.tab NAME ...
       ... REGEXP quant 1000 > res.tab
    
    The permutation test performs 1000 iterations for speed. To obtain reliable p-values, 50000 or 100000 iterations are recommended. A comprehensive summary of the enrichment analyses done is written into res.tab (via re-direction of the output). This file is best studied as a spread sheet (using, e.g., Libre Office Calc). The most interesting columns are (please see explanation of test_regexp_enrich for more details):
    1. SIGNIFICANT_RESULT_ALL: enrichment considering entire sequences with results: nmic_s>ex ***, nmic>ex ***, nex>ex ***
    2. SIGNIFICANT_RESULT_FIRST: enrichment considering only the first 1/3 of the sequences with results: none
    3. SIGNIFICANT_RESULT_INTERNAL: enrichment considering only the internal/middle 1/3 of the sequences with results: nmic>ex ***, nex>ex ***
    4. SIGNIFICANT_RESULT_END: enrichment considering only the last/end 1/3 of the sequences with results: nmic_s>ex ***, nmic>ex ***, nex>ex ***
    The stars encode the level of significance, i.e., * p-value ≤0.05, ** p-value ≤ 0.01; *** p-value ≤ 0.001. From the results we find that neural exons (regardless of their size are highly enriched in TGC.

    8. Visualizing positional distributions of TGC in up-stream regions of exons

    The command get_regexp_prof allows to generate positional distribution of hits of regular expressions across sequences. As we have the 200 nt up-stream but also the 200 nt down-stream regions, we visualize both together running:
    > matt get_regexp_prof exons_testsets.tab SEQ_UP RL SEQ_DOWN LR tgc.tab ...
       ... GROUP2[nmic_s,nmic,nex,ex]
    
    This will create a web page in folder regexp_profiles which you can study using any web browser. This web page looks as follows.

    Matt: Positional distributions of hits of REGEXPs

    Data file: exons_testsets.tab
    Categories (number of sequences): ex ( 2000 ), nex ( 1371 ), nmic ( 155 ), nmic_s ( 151 )
    Sequence columns selected (plotting parameter): SEQ_UP ( RL ), SEQ_DOWN ( LR )
    K or file with information on REGEXPS: tgc.tab
    Date: Mar 09, 2017


    Ordered Results

    Ordering from top to bottom in each column: corresponds to most negative
    to most positive differences of densities / cumulative densities to
    densities / cumulative densities of reference group
    Reference group: ex
    If reference group is none then differences are determined wrt. zero line / diagonal with slope 1.

    Densities all positions Cumulative densities all positions Densities first positions Cumulative densities first positions
    TGC
    TGC
    TGC
    TGC


    Plots

    TGC - TGC

    Back to Ordered Results



    It should be mentioned that these plots show densities and cumulative densities. Hence, they give information about the position of TGC hits across sequences; but they do not give information about the absolute number of hits. The plots show that TGC hits in the up-stream intronic regions of neural micro exons (red and blue lines) are clearly concentrated closer to the 3' splice site as in case of neural exons and non-neural exons.



    Downstream analysis of Nova2 regulated exons

    A Shell script with all commands and explanations is available. Commands therein can be copy'n'pasted to a Shell where Matt is available.

    After having downloaded the RNA-seq data from GEO with help of Matt, and after having estimated inclusion levels of alternative-splicing events genome-wide with help of the program VAST-TOOLs, this downstream analysis consists of the following steps

    1. Extraction of skipped exons from vast-tools output table, which contains skipped exons, retained introns, and alternative 3' and 5' splice sites.
    2. Definition of three groups of skipped exons according to their delta inclusion levels (∆PSI) for Nova2 KD vs. WT: silenced (∆PSI≥25), enhanced (∆PSI≤-25), unregulated (-1≤∆PSI≤1).
    3. Extraction of exon-related features, and statistical comparison of feature distributions with the Mann-Whitney U test for the detection of discriminative features.
      The output encompasses:
      1. a summary report (PDF) with box plots and p values
      2. all box plots as PDF graphics
      3. a table with extracted features for the exons
      4. a table with details (test statistics, group sizes, p values) on performed statistical tests
    4. Generation of motif RNA-maps for all approx. 340 RNA binding proteins with IUPAC binding motif from CISBP-RNA.
      The output encompasses:
      1. a summary report (PDF) with all motif RNA-maps. Each map is generated in two version, one containing a data coverage section and one which does not contain such a section.
      2. all motif RNA-maps as PDF graphics
    5. Generation of a specific motif RNA-map with a Nova2 binding motif derived from known Nova2 binding motifs, with highlighted regions of statistically significant (FDR≤0.01) different Nova2-motif enrichment.
      The output encompasses:
      1. a summary report (PDF) with all (in this case only one) motif RNA-maps. Each map is generated in two version, one containing a data coverage section and one which does not contain such a section.
      2. all (in this case only one) motif RNA-maps as PDF graphics




    6. Extraction of 50 nt long sequences upstream of exons
    7. Statistical enrichment analysis in these extracted sequences with all CISBP-RNA RNA binding proteins with PWM binding motifs.
      The output encompasses:
      1. a table with detailed information on performed statistical tests (test statistics, p values, group sizes)
      2. for all motifs their positional distribution along the 50 nt long sequences in all three groups as PDF and PNG graphics






    Extending Matt: add new commands/functionality

    Matt can be extended by new commands/functions. Extensions can be implemented in any scripting/programming language. Though, the easiest way might be to use Perl as Matt is mainly implemented in Perl and already includes a Perl module with utility functions for working with tables.

    A template script new_command.pl showing how Matt commands are usually implemented in Perl is available in sub-directory abilities of the Matt installation directory. This Perl script implements already the usual structure and can be extended quickly to a full implementation of a new Matt command.

    This template script new_command.pl contains detailed comments/help explaining how it needs to be extended. It also contains instructions for implementing Matt commands in other scripting/programming languages.





    FAQ / Trouble shooting

    1. Which letters do genomic sequences contain?
    Usually, they contain A, C, G, T. All CISBP-RNA motifs are defined wrt. these letters. Hence, when you run test_cisbp_enrich, your sequences must contain A, C, G, T, and maybe N but all potential binding positions containing at least one N will be neglected.
    Whenever you extract sequences with, e.g., get_seqs and your FASTA file contains A, C, G, U, then the extracted sequences will contain these letter as well.
    Whenever you use your own position weight matrices (PWMs), these might be defined wrt. A, C, G, U. In this case, when you want to use these PWMs with command test_pwm_enrich then your sequences should be defined wrt. A, C, G, U as well.



    2. Which column names do my tables need to have?
    You are free to choose any column names you prefer, but they should be alpha-numeric and, if possible, upper-case: A-Z, 0-9, and '_', and they should be meaningful. When applying Matt commands to your tables, often you need to specify the column names of your tables to which you want to apply Matt. E.g., get_seqs expects your table to contain a column with start coordinates and another with end coordinates. You can call these columns START and END, but any other name is possible.



    3. I don't get a PDF report from cmpr_exons or cmpr_introns.
    These PDF reports are generated with pdflatex using the following packages. Please make sure all of them are available on your system.
    1. KOMA-Script
    2. babel
    3. graphicx
    4. tocbibind



    4. I get strange errors when extracting sequences from a FASTA.
    ... uninitialized value $chr in substitution (s///) at get_seqs.pl line 118, <$fh> chunk 954.
    ... uninitialized value $seq in substitution (s///) at get_seqs.pl line 119, <$fh> chunk 954.
    ... uninitialized value $seq in transliteration (tr///) at get_seqs.pl line 120, <$fh> chunk 954.
    ... uninitialized value $chr in hash element at get_seqs.pl line 121, <$fh> chunk 954.

    We saw these errors when the FASTA file was stored and read from a pen drive/USB stick. Copying the FASTA to the local hard drive solved this issue.



    How to cite

    Matt builds on the great work of others. Hence, when publishing results which you have obtained with Matt, please follow these rules for citation.

    1. Splice site strengths

      As Matt internally uses Yeo & Burge's maximum-entropy models for computing splice site strengths, please cite

      Yeo et al., Maximum entropy modeling of short sequence motifs with applications to RNA splicing signals, 2003, DOI: 10.1089/1066527041410418

      whenever you publish results related to splice site strength. This includes the following splice site features which you might have obtained with Matt's commands get_efeatures, get_ifeatures, cmpr_exons, cmpr_introns:

      • MAXENTSCR_HSAMODEL_UPSTRM_5SS - maximum entropy score of 5ss of up-stream exon using a model trained with human splice sites
      • MAXENTSCR_HSAMODEL_3SS - maximum entropy score of 3ss using a model trained with human splice sites
      • MAXENTSCR_HSAMODEL_5SS - maximum entropy score of 5ss using a model trained with human splice sites
      • MAXENTSCR_HSAMODEL_DOWNSTRM_3SS - maximum entropy score of 3ss of down-stream exon using a model trained with human splice sites


    2. Binding motifs of RNA binding proteins from CISBP-RNA

      As Matt uses binding motifs of RNA binding proteins as provided by the database CISBP-RNA, please cite

      Ray et al., A compendium of RNA-binding motifs for decoding gene regulation, 2013, DOI: 10.1371/journal.pcbi.1001016

      whenever you publish results obtained with Matt's command test_cisbp_enrich or when you have used any of the CISBP-RNA binding motifs in sub-folder data_external/cisbp_rna of Matt installation.


    3. Predicted branch points

      As Matt internally uses SVM_BPfinder for the prediction of branch points, please cite

      Corvelo et al., Genome-Wide Association between Branch Point Properties and Alternative Splicing, 2010, DOI: 10.1038/nature12311

      whenever you publish results related to predicted branch points. This includes the following features which you might have obtained with Matt's commands get_efeatures, get_ifeatures, cmpr_exons, cmpr_introns:

      • DIST_FROM_MAXBP_TO_3SS_UPINTRON - distance to 3ss of best precited BP
      • SEQ_MAXBP_UPINTRON - sequence of best predicted BP
      • SCORE_FOR_MAXBP_SEQ_UPINTRON - BP sequence score of best predicted BP
      • PYRIMIDINECONT_MAXBP_UPINTRON - Pyrimidine content between the BP adenine and the 3 prime splice site for best BP
      • POLYPYRITRAC_OFFSET_MAXBP_UPINTRON - Polypyrimidine track offset relative to the BP adenine for best BP
      • POLYPYRITRAC_LEN_MAXBP_UPINTRON - Polypyrimidine track length for best BP
      • POLYPYRITRAC_SCORE_MAXBP_UPINTRON - Polypyrimidine track score for best BP
      • BPSCORE_MAXBP_UPINTRON - SVM classification score of best BP
      • NUM_PREDICTED_BPS_UPINTRON - number of all predicted BPs which have a positive BP score
      • MEDIAN_DIST_FROM_BP_TO_3SS_UPINTRON - like DIST_FROM_MAXBP_TO_3SS but median over top-3 predicted BPs
      • SEQ_BPS_UPINTRON - comma-separated list of sequences of top-3 predicted BPs
      • MEDIAN_SCORE_FOR_BPSEQ_UPINTRON - like SCORE_FOR_MAXBP_SEQ but median over top-3 predicted BPs
      • MEDIAN_PYRIMIDINECONT_UPINTRON - like PYRIMIDINECONT_MAXBP but median over top-3 predicted BPs
      • MEDIAN_POLYPYRITRAC_OFFSET_UPINTRON - like POLYPYRITRAC_OFFSET_MAXBP but median over top-3 predicted BPs
      • MEDIAN_POLYPYRITRAC_LEN_UPINTRON - like POLYPYRITRAC_LEN_MAXBP but median over top-3 predicted BPs
      • MEDIAN_POLYPYRITRAC_SCORE_UPINTRON - like POLYPYRITRAC_SCORE_MAXBP but median over top-3 predicted BPs
      • MEDIAN_BPSCORE_UPINTRON - like BPSCORE_MAXBP but median over top-3 predicted BPs
      • DIST_FROM_MAXBP_TO_3SS_DOINTRON - distance to 3ss of best precited BP
      • SEQ_MAXBP_DOINTRON - sequence of best predicted BP
      • SCORE_FOR_MAXBP_SEQ_DOINTRON - BP sequence score of best predicted BP
      • PYRIMIDINECONT_MAXBP_DOINTRON - Pyrimidine content between the BP adenine and the 3 prime splice site for best BP
      • POLYPYRITRAC_OFFSET_MAXBP_DOINTRON - Polypyrimidine track offset relative to the BP adenine for best BP
      • POLYPYRITRAC_LEN_MAXBP_DOINTRON - Polypyrimidine track length for best BP
      • POLYPYRITRAC_SCORE_MAXBP_DOINTRON - Polypyrimidine track score for best BP
      • BPSCORE_MAXBP_DOINTRON - SVM classification score of best BP
      • NUM_PREDICTED_BPS_DOINTRON - number of all predicted BPs which have a positive BP score
      • MEDIAN_DIST_FROM_BP_TO_3SS_DOINTRON - like DIST_FROM_MAXBP_TO_3SS but median over top-3 predicted BPs
      • SEQ_BPS_DOINTRON - comma-separated list of sequences of top-3 predicted BPs
      • MEDIAN_SCORE_FOR_BPSEQ_DOINTRON - like SCORE_FOR_MAXBP_SEQ but median over top-3 predicted BPs
      • MEDIAN_PYRIMIDINECONT_DOINTRON - like PYRIMIDINECONT_MAXBP but median over top-3 predicted BPs
      • MEDIAN_POLYPYRITRAC_OFFSET_DOINTRON - like POLYPYRITRAC_OFFSET_MAXBP but median over top-3 predicted BPs
      • MEDIAN_POLYPYRITRAC_LEN_DOINTRON - like POLYPYRITRAC_LEN_MAXBP but median over top-3 predicted BPs
      • MEDIAN_POLYPYRITRAC_SCORE_DOINTRON - like POLYPYRITRAC_SCORE_MAXBP but median over top-3 predicted BPs
      • MEDIAN_BPSCORE_DOINTRON - like BPSCORE_MAXBP but median over top-3 predicted BPs


    4. Predicted branch points

      As SVM_BPfinder internally uses SVMlight, please cite

      Joachims, Making large-Scale SVM Learning Practical. Advances in Kernel Methods - Support Vector Learning, B. Schölkopf and C. Burges and A. Smola (ed.), MIT-Press, 1999. (link)

      whenever you need to cite SVM_BPfinder (point 3)


    5. Binding score of human SF1

      As Matt uses the human SF1 binding motif (position weight matrix) published by Corioni et al., please cite

      Corioni et al., Analysis of in situ pre-mRNA targets of human splicing factor SF1 reveals a function in alternative splicing, 2011, DOI: 10.1093/nar/gkq1042

      whenever you publish results related to this human SF1 binding motif. This includes the following features which you might have obtained with Matt's commands get_efeatures, get_ifeatures, cmpr_exons, cmpr_introns:

      • SF1_HIGHESTSCORE_3SS_UPINTRON - highest score of a SF1 positon weight matrix trained with human data in the last 150 nt 3 prime intron positons of up-stream intron
      • SF1_HIGHESTSCORE_3SS_DOINTRON - highest score of a SF1 positon weight matrix trained with human data in the last 150 nt 3 prime intron positons of down-stream intron
System requirements
   Linux/Mac OS
   Perl
   R, Rscript
   pdflatex
   any PDF viewer
   any Web browser


License
 Matt is open-source and available
 under GNU LGPLv3

Additional material
   Intro to rna_maps



Matt paper
   A. Gohr, M. Irimia,
   Matt: Unix tools for
   alternative splicing analysis,
   Bioinformatics, 2018, DOI:
10.1093/bioinformatics/bty606

Matt poster
   At NGS 2017


Please cite as well
 Matt uses great tools of others.