> Matt v. 1.3.0 ===== with focus on down-stream analysis of alternative splicing events |
Versions 1.3.0 1.2.1 1.2.0 1.1.0 1.0.0 All & change log |
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Commands for: Data analysis • get_efeatures • get_ifeatures • cmpr_exons • cmpr_introns • cmpr_features • cmpr_motif • get_kmers_from_pwm • get_pwm_hits • get_pwm_prof • get_regexp_hits • get_regexp_prof • perm_test • plot_motif • rna_maps • rna_maps_cisbp • test_cisbp_enrich • test_pwm_enrich • test_regexp_enrich Data preparation • add_cols • add_rows • add_val • add_range • chg_labs • chk_tab • col_calc • col_uniq • def_cats • retr_geneids • extr_scafids • extr_seqs • get_colnms • get_cols • get_match • get_gcc • get_ovl • get_pwm • get_rows • get_sj • conc • get_seql • get_seqs • get_sss • get_vast • grp_calc • mk_uniq • perm_seqs • prnt_tab • rand_rows • retr_rnaseq • rm_cols • rn_cols • row_calc • sort_tab • stratify |
Last update: May 27, 2019Contents
IntroductionMatt is a Unix command-line toolkit for analyzing genomic sequences with focus on the down-stream analysis of alternative splicing events. Being a POSIX-style command-line toolkit, Matt runs on the terminal of any Unix-like system.Matt is modular: It comprises many commands, each of which tailored to one specific task. Combining commands allows the user to solve complex tasks: The whole is more than the sum of its parts. The central data structure Matt works on is tables, where rows correspond to objects, e.g., exons, genes, etc., and columns correspond to features of these objects. For performing a high-level analysis, the user of Matt usually goes through two phases.
To make the work with Matt more efficient, many Matt commands are designed to be combined by piping. Exploiting this capability improves the work experience with Matt. Matt provides comprehensive help messages on the terminal. To get an overview of all available commands, call > mattFor getting the help message for COMMAND, call > matt COMMANDEach help message describes all arguments, e.g., matt COMMAND <ARGUMENT_1> [<ARGUMENT_2>] <ARGUMENT_3> ...where <ARGUMENT> is a necessary argument, but [<ARGUMENT>] optional and can be omitted. The order of arguments always matters; changing the order might cause errors or false results. This web page describes all commands and gives example calls. Both, this online documentation and the terminal help messages complement each other. Download and installation
TablesTab-separated plain text tables are the main data structure Matt works on. The motivations for making tables the basic data-structure are:
Tables must have headers with column names and column names must be unique, e.g., a table describing micro exons could look like this table t1.tab: EXON_ID START END SCAFF STRAND SEQ ENSEMBL_GID GROUP ex1 123127 123132 chr1 + GCTAGT ENSG00000223972 up ex2 903861 903868 chr3 + GCTCGATT ENSG00038737632 up ex3 4291861 4291863 chr5 - GTC ENSG00009875223 down ex4 9128712 9128717 chr12 - CGCCCT ENSG00002652352 down ex5 714614 714620 chrM + CGCTCGC ENSG00083273821 upIt should be emphasized that Matt does not assume any specific column structure, rather does the user choose specific columns with data necessary for the command the user wants to apply. As a consequence, when calling Matt commands the user needs to provide as arguments the names of these columns. For example, a user might want to do a differential motif enrichment analysis with the micro-exons table above and would list columns SEQ and GROUP, or, when the user wants to retrieve splice-site sequences of these micro exons, the user would use the same table but list columns START, END, SCAFF, STRAND. The order of columns in tables is never important, as the user always need to refer Matt to specific columns by their names. Not assuming specific columns in a specific order makes Matt flexible and applicable to almost any tab-separated table. For adding column names to tables yet without column names, e.g., like BED files, Matt offers the command add_header. Users are free to choose any column names they prefer. Though, it is recommended to use capital letters for all COLUMN NAMES. Further, column names should contain only alpha-numeric characters, A-Z, 0-9, and '_'. Generally, tables might contain any type of data representable as text, including long sequences. For improving the work experience with Matt, it is recommended that each column, whenever possible, should contain only one piece of information / value. Combining different values into one cell by concatenations, e.g., chr1:876123-877155, makes the single values, e.g., chromosome, start, end, inaccessible for Matt.
Matt can process tables with Unix new-lines and Windows new-lines. Tables with old-style Mac new-lines or Dos new-lines need to be adapted first; See command chk_nls.
When working with Excel, please save tables in format Windows Text.
Documentation of Matt commandsThis online documentation of Matt commands introduces all commands and provides examples and explanations. For more details on commands you might want to checkout the terminal help-messages which Matt offers for all commands.For displaying an overview of all available commands, call > mattFor getting a help message for COMMAND, call > matt COMMANDEach help message defines all arguments, e.g., matt COMMAND <ARGUMENT_1> [<ARGUMENT_2>] <ARGUMENT_3> ...where <ARGUMENT> is a necessary argument, but [<ARGUMENT>] optional and can be omitted. Generally, one might categorize all Matt commands into commands for
Table categorizing all Matt commands into eight groups
> matt chk_nlsDescription: Use this command to print a table to screen making visible new-lines. This helps to investigate which new-line characters are present in the table and how the table would need to be adapted to be processable by Matt. This might be necessary for tables with old-style Mac or Dos new-lines. Example: With t1.tab being the example table from above > matt chk_nls t1.tabprints this table making visible new-lines. > matt chk_tabDescription: Use this command to check if a given table is processable by Matt. To be so, all rows of a table must have the same number of fields. This command lists row-number of rows with a wrong number of fields or reports that the table is correctly formatted. Example: With t1.tab being the example table from above > matt chk_tab t1.tabreports that this table is correctly formatted and, hence, processable by Matt. > matt get_pwmDescription: Use this command to
Example: With M037_0.6.txt being the human MBNL1 PWM from CISBP RNA > matt get_pwm M037_0.6.txt 0.001 A,C,G,Tprints this MBNL1 PWM with minimal probability 0.001 instead of 0 and alphabet A,C,G,T instead of A,C,G,U. For re-defining the alphabet, you could also directly edit the text file M037_0.6.txt. To save this PWM in Matt format for later use with Matt, re-direct the output into a file. > matt get_pwm M037_0.6.txt 0.001 A,C,G,T > hsa_mbnl1_pwm.tab > matt get_sjDescription: This command is tailored to the result table of SANJUAN (a program for estimating inclusion levels of alternative splicing events from RNA-seq data). These tables describe exons and already contain their start and end-coordinate, chromosome, and strand. With this command, you'll add Ensembl gene ids to the table, which are necessary for subsequent Matt analyses. Example: SANJUAN outputs differentially spliced exons in a table called CEs_clean.txt. > matt get_sj CEs_clean.txt hg19 > exons_from_sj.tabwill output a table like CEs_clean.txt with an additional column GENEID_ENSEMBL. It will duplicate rows (exons) for which SANJUAN did output several GENE_SYMBOLs; one row for each of the reported genes with its gene name and Ensembl gene id. > matt get_vastDescription: This command is tailored to the result tables of VAST-TOOLS. Use this command to transform the output table from VAST-TOOLS into Matt format and extract sub-sets of reported alternative splicing events. VAST-TOOLs reports skipped exons, retained introns, alt3, alt5 events in one output table together with estimated PSI values across several data sets. Matt allows you to retrieve form this table events of specific types (skipped exons, retained introns, Alt3, Alt5) or events that have a PSI value within a user specified range across all samples. Before you run a high-level analysis you might need to defined groups of events (e.g. up-regulated vs down-regulated skipped exons) with Matt function def_cats applied to the result table of get_vast. More information on the format of the VAST-TOOLs output table can be found here. Example: With vts_out.tab being a final results table from VAST-TOOLS combine command > matt get_vast vts_out.tab -minqab LOW -minqglob N -complex IR,IR-S,IR-C -a kd1,kd2 -b ctr1,ctr2 > ir_events.tabextracts all intron retention events from vts_out.tab together with detailed information on PSI values (min, max, mean) across all samples, across samples -a kd1,kd2, across samples -b ctr1,ctr2, and dPSIs for the comparison of samples -a vs samples -b. In order to be considered in these calculations, PSI values need to have a minimum quality flag of LOW in samples -a and -b and N across all other samples. If for certain events no PSI values are left, NA is output. The user can augment the result table (e.g. ir_events.tab) with gene IDs extracted from any given gene annotation (e.g. GTF file Hsa19.gtf with gene ids in field gene_id) by > matt get_vast vts_out.tab -minqab LOW -minqglob N -complex IR,IR-S,IR-C -a kd1,kd2 -b ctr1,ctr2 -gtf Hsa19.gtf -f gene_id > ir_events.tabThe output table ir_events.tab will have one additional column GENEID with the extracted gene IDs. Of course do the chromosome annotations in the VAST-TOOLs output need to match to those in the GTF file.
Generally, the workflow is as follows: pre-process the output table of VAST-TOOLS for extracting PSI values with get_vast, define groups of events to be analyzed with def_cats, and eventually run a high-level analysis.
> matt col_uniqDescription: Use this command to get all distinct values of a column together with their number of occurrences. This command should be applied only to columns with categorical data. Example: The ir_events.tab containing intron-retention events from some VAST-TOOLs output obtained with command get_vast. This table has a column COMPLEX which contains for each event the sub-types (IR, IR-S, IR-C) as defined by VAST-TOOLS: > matt col_uniq ir_events.tab COMPLEXprints the three sub-types and their numbers of occurrences in table ir_events.tab. > matt get_colnmsDescription: Use this command to get all column names and their indices from a table. This command is useful to investigate the structure of tables and the kind of data in their columns. Example: With t1.tab being the table from above > matt get_colnms t1.tabprints the column headers EXON_ID, START, END, SCAFF, STRAND, SEQ, ENSEMBL_GID, GROUP and their indices. > matt get_colsDescription: Use this command to retrieve one or several columns from a table. Example: With t1.tab being the table from above > matt get_cols t1.tab START ENDprints the entire columns START and END from table t1.tab. You might want to redirect the output into a file to construct a new table consisting of the extracted columns > matt get_cols t1.tab START END > new_table.tab > matt get_matchDescription: For a given table with entities, find matching values from a second table. E.g., assume you have a table 1 with a column with gene ids and another table 2 with gene ids and gene names, then you can extract gene names from table 2 which match to the gene ids in table 1. Example: With t1.tab being the table from above and t2.tab describing a mapping GENE_ID→GENE_NAME with columns
> matt get_match t1.tab GENE_ID t2.tab GENE_ID GENE_NAMEprints a table with one column GENE_NAME. Each row of this table corresponds to a row of t1.tab and it contains the GENE_NAME for the corresponding GENE_ID of t1.tab. Piping allows the user to concatenate both with command add_cols and augment table t1.tab. > matt get_match t1.tab GENE_ID t2.tab GENE_ID GENE_NAME | matt add_cols t1.tab - > matt get_ovlDescription: Use this command to retrieve IDs of rows which occur in several tables. This command might be helpful if you want to determine the overlap of different data sets. Example: Table ir_events.tab contains intron-retention events and has a column EVENT with event IDs. Table ir_events_reference.tab contains reference intron-retention events with column EVENT containing their event IDs. Then > matt get_ovl ir_events.tab EVENT ir_events_reference.tab EVENT > common_ids.tabwrites a table with column OVERLAP with event ids that occur in table ir_events.tab and ir_events_reference.tab into file common_ids.tab. To extract all rows from table ir_events.tab for these events, you can apply get_rows as follows > matt get_rows ir_events.tab EVENT]common_ids.tab,OVERLAP[ > common_ir_events.tab > matt concDescription: Use this command for concatenating entries of several columns. This command might be useful for constructing artificial row IDs, i.e., event IDs which might be necessary, e.g., for get_ovl if the table not yet contains event IDs. Example: With t1.tab being the table from above > matt conc t1.tab EVENT_ID '_' START END SCAFF STRANDprints a table with one column EVENT_ID which contains in each row an event ID by concatenating START_END_SCAFF_STRAND. With command add_cols you might add this column with event IDs to the table t1.tab. > matt conc t1.tab EVENT_ID '_' START END SCAFF STRAND | matt add_cols t1.tab -Other separation symbols like the empty string '' or white space ' ' can be specified. > matt get_rowsDescription: Use this command to retrieve rows (items) from a table. Filters allow the user to select specific rows, i.e., rows which have specific values (numeric or categorical) in some of the columns, or whose vales are empty (empty string). Example: With t1.tab being the table from above > matt get_rows t1.tab START[1,5000000] '!STRAND]+['prints the rows from t1.tab with START in 1 to 5 Mio and STRAND is NOT +. Constraints can be:
> matt rand_rowsDescription: Use this command to randomly draw rows from a table. This command might be useful when you want to sub-sample data sets. Example: The table ir_events.tab contains intron-retention events: > matt rand_rows ir_events.tab 500 3628929065prints 500 randomly drawn (seed 3628929065) rows (without replacement) from ir_events.tab. You might want to save these events into a file for later use. > matt rand_rows ir_events.tab 500 3628929065 > 500_ir_events.tab > matt add_colsDescription: Use this command to add columns to existing tables. Columns to be added might be newly constructed or retrieved from another table with get_cols. Example: With t1.tab being the table from above and having constructed row IDs with conc like > matt conc t1.tab EVENT_ID _ START END SCAFF STRAND > event_ids.tabdoing > matt add_cols t1.tab event_ids.tabwill add the column EVENT_ID from event_ids.tab to the right end of table t1.tab. Combining both commands using piping is more efficient. > matt conc t1.tab EVENT_ID _ START END SCAFF STRAND | matt add_cols t1.tab -Important: Column names must be unique, i.e., you cannot join a column to a table if this table has already a column with the same column name. To rename columns, you might use rn_cols. It can be efficient to use piping here, e.g.: > matt rn_cols a.tab COLA:RENAMED_COLA | matt add_cols b.tab - > matt add_headerDescription: Use this command to add column names (entire table header) to a table yet without header to make it processable with Matt. Example: exons.bed being a BED file with the first three mandatory fields only > matt add_header exons.bed chrom chromStart chromEndadds three column names as header to the file exons.bed directly without printing it to screen. > matt add_rangeDescription: Use this command to add a column with a range index increasing for each row. Example: With t1.tab from above > matt add_range t1.tab IDprints t1.tab with extra column ID with index starting at 0 and increasing in each row. > matt add_rowsDescription: Use this command to add rows to existing tables. You might want to add rows of one table to another table, e.g., to join data sets into a single table. Example: With t1.tab being the table from above > matt add_rows t1.tab t1.tabwill add all rows from t1.tab to the bottom of t1.tab, which does not look very meaningful. Important: Both tables must have the same number of columns with the same column names, but their order can vary as Matt will re-order columns automatically if necessary. > matt add_valDescription: Use this command to add one column with a constant value across all rows to a existing table. You might use this command to add a column containing some kind of group IDs (e.g., up_regulated) to tables which you later join into one table. Example: With t1.tab being the table from above > matt add_val t1.tab DATE 22.02.2017will print table t1.tab to screen with a additional column DATE containing the value 22.02.2017 in all rows. > matt chg_labsDescription: Use this command to change labels (categorical entries) in a column of a table. You might use this command if you have a column with group IDs and want to change these IDs / labels. Example: With t1.tab being the table from above > matt chg_labs t1.tab EXON_ID ex1:ex11 ex2:ex22will print table t1.tab to screen where in column EXON_ID ex1 is changed for ex11, and ex2 for ex22. > matt def_catsDescription: Use this command (define categories) for creating a new column with some kind of group IDs. Generally, rows of table corresponds to events, while columns correspond to their features. Hence, you define group IDs (one for each row) wrt. entries in the feature columns using the same constraint syntax as used with command get_rows.
This command is important as many analyses essentially compare some groups of items (e.g. groups of exons, introns, sequences, or more), hence, the user needs to defined the groups to be compared.
> matt def_cats t1.tab GROUP2 'g1=START[0,5000000] STRAND]+[' 'g2=START[0,5000000] !STRAND]+[' 'g3=START[5000001,10000000]'will output a table with column GROUP2 with group IDs g1, g1, g2, g3, g1 categorizing each micro exon in table t1.tab wrt. the defined constraints. > matt rm_colsDescription: Use this command to remove columns from a table changing the original table. Example: With t1.tab being the table from above where we add a column with > matt add_val t1.tab DATE 22.02.2017 > t2.tabdoing > matt rm_cols t2.tab DATE STRANDwill print a table with columns DATE and STRAND to screen and remove these columns from the original t2.tab. Be careful: The original table will be changed by removing columns. This might lead to undesired data loss. > matt rn_colsDescription: Use this command to rename columns of tables. You might use this command before you want to join by-column two tables which have some column names in common. Tables must not contain column names several times. Each column needs to have a unique name. Example: With t1.tab being the table from above > matt rn_cols t1.tab SCAFF:SCAFFOLD STRAND:STRAND2will print t1.tab to screen with column names SCAFFOLD instead of SCAFF and STRAND2 instead of STRAND. > matt extr_seqsDescription: Use this command to output in FASTA format sequences from a table. You might use this command if you want to run other software on sequences which you have obtained with Matt. Often, software for analyzing sequences expect FASTA input format. Example: With t1.tab being the table from above > matt extr_seqs t1.tab SEQ EXON_IDwill print the sequences from column SEQ into FASTA format where values in column EXON_ID will be used as sequence IDs.
If the table not yet contains a column with kind of sequence IDs, you might want to use conc to construct them.
> matt get_seqsDescription: Use this command to retrieve genomic sub-sequences from FASTA files. Usually, you'll have a table specifying genomic coordinates (e.g. START, END, SCAFFOLD/CHROMOSOME, STRAND) and you want to retrieve genomic sequences with these coordinates (e.g., sequences of exons, introns, genes, splice sites or any other genomic entity) from a FASTA file (which contains all sequences of the entire genome). The command get_seqs has three modes making it easy to retrieve different genomic sequences wrt. to the coordinates. ![]() Schematic view of positions of START and END coordinates and the four length arguments for mode 3. ![]() The three modes make it easy to retrieve:
Example: With t1.tab being the table from above > matt get_seqs t1.tab START END 20 10 SCAFF STRAND Hsa_genome.fawill extract for all micro exons in t1.tab the up-stream intronic sequences of length 20 and the down-stream intronic sequences of length 10 and print these sequences as a table with columns SEQ_UP and SEQ_DOWN. The scaffold/chromosome IDs in column SCAFF of table t1.tab must be the same as used in Hsa_genome.fa. Using piping, you can add the retrieved sequences directly to the table t1.tab as get_seqs outputs the sequences as a two-column table. > matt get_seqs t1.tab START END 20 10 SCAFF STRAND ... ... Hsa_genome.fa | matt add_cols t1.tab - > matt cmpr_motifDescription: Use this command to plot a PDF for visually comparing two motifs (position weight matrices) of the same length. Motifs can be given either by tables describing position weight matrices (get_pwm) or by tables containing sequences (aligned, all of same length) from which Matt will estimate position weight matrices. Example: With hsa_mbnl1_pwm.tab and hsa_a2bp1_pwm.tab being two PWMs retrieved from CISBP-RNA, both being of length 7, > matt cmpr_motif hsa_mbnl1_pwm.tab hsa_a2bp1_pwm.tab mbnl1_vs_a2bp1.pdfgenerates the PDF mbnl1_vs_a2bp1.pdf which shows the motif plots (plot_motif) for both PWMs and in the center a diagram showing at each position the log2-ratio of probabilities of each nucleotide. ![]() > matt get_efeaturesDescription: Use this command for retrieving 75 features of interest for exons. Exons need to be described by a table with basic information like their genomic coordinates and a gene ID of genes the exons belong to. If the table does not yet contain gene IDs, you might use the command retr_geneids for extracting gene IDs from any GTF file for given genomic events, like exons, introns, genes. Overview of features ![]() Motivation for including features
For detecting features discriminating sets of exons, the command cmpr_exons is more suited. Examples: To get an overview and a short description of all features > matt get_efeatures explain With t1.tab and assuming it describes human micro exons > matt get_efeatures t1.tab START END SCAFF STRAND ENSEMBL_GID ... ... Hsa19.gtf Hsa19.fa Hsap > efeatures.tabwrites a table with columns containing exon feature and extracted sequences of the exons, up/down-stream introns, and splice sites. If only one or a few features are of interest, users can apply get_cols and extract specific feature columns only. > matt get_efeatures t1.tab START END SCAFF STRAND ENSEMBL_GID ... ... Hsa19.gtf Hsa19.fa Hsap | matt get_cols - EXON_LENGTH EXON_GCC > efeatures.tab
It is recommended to use GTF files with genome annotation from Ensembl, as Matt is tailored to their structure.
> matt get_gccDescription: Use this command to determine the GC content of genomic sequences. Example: With t1.tab being the table from above > matt get_gcc t1.tab SEQwill print a table with one column SEQ_GCC containing the determined values of the GC content.
Any letter not A, C, G, T, U, a, c, g, t, u will be completely neglected from the computation of the GC content.
> matt get_ifeaturesDescription: Use this command to retrieve 50 features of interest for introns. Introns need to be described by a table with basic information like their genomic coordinates and a gene ID of genes the introns belong to. If the table does not yet contain gene IDs, you might use the command retr_geneids for extracting gene IDs from any GTF file for given genomic events, like exons, introns, genes. Overview of features ![]() Motivations and references for these features are listed in the description of command get_ifeatures. For detecting features discriminating sets of introns, the command cmpr_introns is more suited. Examples: To get an overview and a short description of all features > matt get_ifeatures explain With table introns.tab describing human introns with columns START, END, SCAFF, STRAND, ENSEMBL_GID > matt get_ifeatures introns.tab START END SCAFF STRAND ENSEMBL_GID ... ... Hsa19.gtf Hsa19.fa Hsap > ifeatures.tabwrites a table with columns containing the feature values and extracted sequence of intron, up/down-stream exon, splice sites. If only one or a few features are of interest, users can apply get_cols and extract specific feature columns only. > matt get_ifeatures t1.tab START END SCAFF STRAND ENSEMBL_GID ... ... Hsa19.gtf Hsa19.fa Hsap | matt get_cols - INTRON_LENGTH INTRON_GCC > ifeatures.tab > matt get_pwm_hitsDescription: Use this command to predict hits in sequences with a motif position weight matrix model (PWM). Example: With seqs.tab being a table containing sequences (containing A,C,G,T) in column SEQ and some kind of sequence IDs in column ID and hsa_mbnl1_pwm.tab the MBNL1 PWM from CISBP-RNA whose alphabet we had redefined from A,C,G,U to A,C,G,T, the command > matt get_pwm_hits seqs.tab SEQ ID hsa_mbml1_pwm.tab 10 single INFER_BGPWMwill estimate a simple background PWM from all sequences (being identical to a homogeneous Markov model of order 0), and predict MBLN1 hits at positions where the likelihood ratio MBLN1 PWM vs. background PWM is at least 10 or higher, i.e., where it is at least 10 times more likely to actually have a MBNL1 binding site than having no MBNL1 binding site (background). The command prints a table with columns SEQ_ID, START, END, STRAND, HIT, LOGSCORE describing the hits. > matt get_pwm_profDescription: Use this command to plot positional distributions of hits of motif position weight matrices (PWMs) across sequences. This command involves predicting hits of PWMs as well. Hence, you need to specify the input sequences, the PWMs, and information on how hits should be predicted. Example: Assume table exons.tab describes exons and contains beside others these columns:
NAME FILE BG THRESH A2BP1 a2bp1.tab INFER_BGPWM 5 MBNL1 mbnl1.tab INFER_BGPWM 5Both files as1bp1.tab and mbnl1.tab were obtained by applying get_pwm to the original PWM files from CISBP-RNA to re-define the alphabet to A,C,G,T, fit it to the alphabet of the sequences. The command: > matt get_pwm_prof exons.tab SEQ_UP RL SEQ_DOWN LR ... ... pwms.tab NAME FILE THRESH BG DATASET[down,ndiff]will go through the following steps for each foreground PWM (MBNL1 and A2BP1 also known as RBFOX1) and each column (SEQ_UP and SEQ_DOWN) and each group (ndiff and down):
![]() The command (from above) > matt get_pwm_prof exons.tab SEQ_UP RL SEQ_DOWN LR ... ... pwms.tab NAME FILE THRESH BG DATASET[down,ndiff]creates a folder pwm_profiles and therein a website 0_all_results.html with all results and a table ydiff_scores_of_dens_and_cdens.tab which contains differences (between the densities down vs. ndiff) for identifying the PWMs for which the densities are most different. Example website with results: Matt: Positional distributions of hits of motif PWMs
Data file: exons.tab Ordered ResultsOrdering from top to bottom in each column: corresponds to most negativeto most positive differences of densities / cumulative densities to densities / cumulative densities of reference group Reference group: ndiff If reference group is none then differences are determined wrt. zero line / diagonal with slope 1.
PlotsA2BP1 - a2bp1.tabBack to Ordered Results![]() MBNL1 - mbnl1.tabBack to Ordered Results![]() > matt get_regexp_hitsDescription: Use this command to predict hits in sequences with Perl regular expressions. Example: With seqs.tab being a table containing sequences in column SEQ and some kind of sequence IDs in column ID the command > matt get_regexp_hits seqs.tab SEQ ID 'ACCG|G[GC]C|GCCT' singlewill search for possibly overlapping hits of ACCG or GGC or GCC or GCCT (case insensitive) and outputs them as a table with columns SEQ_ID, START, END, STRAND, HIT. The argument single states that Matt should not search in the reverse-complements. If several hits start at the same position, only the first gets reported according to their order in the call. If motifs are of variable length, the longest match will be reported only. > matt get_regexp_profDescription: Use this command to plot positional distributions of hits of Perl regular expressions across sequences. You need to specify the input sequences, and the regular expressions. Example: Assume table exons.tab describes exons and contains beside others these columns:
NAME REGEXP MBNL1 [GC][CT]TT[GA]C A2BP1 [TA]GCATG[CA]|TACATAThe command: > matt get_regexp_prof exons.tab SEQ_UP RL SEQ_DOWN LR ... ... regexps.tab DATASET[down,ndiff]will go through the following steps for each regular expression (MBNL1 and A2BP1) and each column (SEQ_UP and SEQ_DOWN) and each group (ndiff and down):
![]() The command (from above) > matt get_regexp_prof exons.tab SEQ_UP RL SEQ_DOWN LR ... ... regexps.tab DATASET[down,ndiff]creates the folder regexp_profiles and therein a website 0_all_results.html with all results and a table ydiff_scores_of_dens_and_cdens.tab which contains differences (between the densities down vs. ndiff) for identifying the regular expressions (patterns) for which the densities are most different. Example website with results: Matt: Positional distributions of hits of REGEXPs
Data file: exons.tab Ordered ResultsOrdering from top to bottom in each column: corresponds to most negativeto most positive differences of densities / cumulative densities to densities / cumulative densities of reference group Reference group: ndiff If reference group is none then differences are determined wrt. zero line / diagonal with slope 1.
PlotsMBNL1 - [GC][CT]TT[GA]CBack to Ordered Results![]() A2BP1 - [TA]GCATG[CA]|TACATABack to Ordered Results![]() > matt get_seqlDescription: Use this command to determine lengths of sequences. Example: With t1.tab being the table from above > matt get_seql t1.tab SEQwill print a table with one column SEQ_LENGTH with the lengths of the sequences in table t1.tab in column SEQ. > matt get_sssDescription: Use this command to estimate the splice site strength of 3' and 5' splice sites using maximum-entropy models (DOI: 10.1089/1066527041410418). Often you'll start with a table of exons or introns. First, you need to retrieve the sequences of the 3' and 5' splice sites. For this, you might use the command get_seqs (mode 3). Example: Assuming table introns.tab contains sequences (xxxGTxxxx, 3 exon and 6 intron positions) of the 5' splice sites in column 5SS_SEQ and sequences (xxxxxxxxxxxxxxxxxxAGxxx, 20 intron and 3 exon positions) of the 3' splice sites in column 3SS_SEQ, then > matt get_sss introns.tab 3SS_SEQ 5SS_SEQwill estimate the strength of the splice sites and output them as a table with two columns STRENGTH_3SS and STRENGTH_5SS. Using piping allows the user to extract the sequences and estimate the splice site strength at once. E.g., assume table introns.tab has columns START, END, SCAFF, STRAND describing human introns: > matt get_seqs introns.tab START END 3 6 20 3 SCAFF STRAND Hsa19.fa | matt ... ... get_sss - SEQ_UP SEQ_DOWN > matt plot_motifDescription: Use this command to create a motif plot as PDF similar to sequence logos for a given position weight matrix model (PWM) or from a set of aligned sequences of identical length. For obtaining PWMs, re-defining their alphabet or setting a minimum probability, you might use the command get_pwm. Example: With human_nova2_pwm.tab being the human NOVA2 PWM from CISBP-RNA with re-defined alphabet (A,C,G,T instead of A,C,G,U) and minimal probability 0.001 > matt plot_motif human_nova2_pwm.tabwill create a PDF human_nova2_pwm_motifplot.pdf visualizing the probabilities of the nucleotides at each position, the consensus sequence (letters at the top of the bars), and the information content at each position. ![]() > matt test_pwm_enrichDescription: Use this command to statistically test for enrichment/depletion of motifs comparing two sequence sets. The motifs are defined by position weight matrices (PWM). The sequences and the information on the PWMs need to be given as tables. This command first predicts PWM hits in the two sets of sequences and then applies a permutation test for estimating p-values for the enrichment/depletion. Example: Sequence data: Assume table exons.tab contains sequences and group IDs specifying to which group (cntr, up, down) each sequence belongs. Table exons.tab has these columns (and potentially many more):
PWM specification: Assume you have a 3 PWMs (pwm1, pwm2, pwm3) which you want to use for this enrichment analysis. Table pwms.tab contains information about these PWMs in the following columns
NAME FILE THRESH pwm1 /path/to/pwm1.tab 10 pwm2 /path/to/pwm2.tab 10 pwm3 /path/to/pwm3.tab 10Running > matt test_pwm_enrich exons.tab SEQ GROUP_ID[up,cntr,down,cntr] single ... ... pwms.tab NAME FILE THRESH INFER_BGMODEL quant 10000 > results.tabwill perform 6 comparisons: (pwm1, pwm2, pwm3) x (up vs. cntr, down vs. cntr), and will output a table with all information on these comparison. This table gets written into file results.tab by re-directing the output. Each comparison (say, for pwm1 and up vs. cntr) consists of these steps:
Test for positional differences of hits: The above described procedure considers the number of PWM hits across entire sequences. Imagine we predict 14 hits in the sequences of group up close to their left end, and 14 in sequences of group cntr close to their right end. Considering entire sequences, we won't find this positional difference; though this difference might be interesting from a scientific point of view. For detecting positional differences, Matt predicts the PWM hits in both sets of sequences once (as seen above), but then tests for significant enrichment/depletion 4 times considering:
![]() Output: a table containing detailed information (number of data points, test statistics, p values) for all four tests considering different parts of the sequences. The most important columns highlighting found significant enrichment are:
Number and length of sequences: Sequences can be of variable length and the two sequence sets might consist of a different number of sequences. Matt normalizes the number of found PWM hits in each sequence wrt. its length and within each data set wrt. its size (number of sequences). Avoid false positives: Normalizing the number of PWM hits might result in false positive findings. For example, if hits actually occur only in a specific sub-region of the sequences, e.g, only in the 5'-ends of introns, and if the sequence in one set have a very different length than sequences in the other set. ![]() Testing these sequences which are of very different length but contain the same number of PWM hits, after normalization it will look if there were more hits in group cntr (per sequence position) than in group up. In this or similar cases, it is recommended to cut the sequences to the regions (highlighted) where you actually expect hits. Change test statistic of the enrichment analysis: number of hits vs. number of sequence with hits For testing enrichment, one might either consider
> matt test_regexp_enrichDescription: Description: Use this command to statistically test for enrichment/depletion of motifs comparing two sequence sets. The motifs are defined by Perl regular expressions. The sequences and the information on the regular expressions need to be given as tables. This command first predicts exact matches of the regular expressions in the two sets of sequences and then applies a permutation test for estimating p-values. This command is very similar to test_pwm_enrich but predicts motif hits with Perl regular expressions get_regexp_hits. Example: Sequence data: Assume table exons.tab contains sequences and group IDs specifying to which group (cntr, up, down) each sequence belongs. Table exons.tab has these columns (and potentially many more):
Specification of Perl regular expressions / patterns: Assume you want to do the enrichment analysis with two regular expressions: ACCGT|ACCGA|ACCAA and AC[CGT]..C{2,5}GT. Table regexps.tab contains information about these regular expressions in the following columns
NAME REGEXP protA ACCGT|ACCGA|ACCAA protB AC[CGT]..C{2,5}GTRunning > matt test_regexp_enrich exons.tab SEQ GROUP_ID[up,cntr,down,cntr] single ... ... regexps.tab NAME REGEXP quant 10000 > results.tabwill perform 4 comparisons: (protA and protB) x (up vs. cntr and down vs. cntr), and will output a table with all information on these comparison. This table gets written into file results.tab as we re-direct the output. Each comparison (say, for protA and up vs. cntr) consists of these steps:
Test for positional differences of hits: The above described procedure considers the number of hits across entire sequences. Imagine we predict 14 hits in the sequences of group up close to their left end, and 14 in sequences of group cntr close to their right end. Considering entire sequences, we won't find this positional difference; though this difference might be interesting from a scientific point of view. For detecting positional differences, Matt searches for hits in both sets of sequences once (as seen above), but then tests for significant enrichment/depletion 4 times considering:
![]() Output: a table containing detailed information (number of data points, test statistics, p values) for all four tests considering different parts of the sequences. The most important columns highlighting found significant enrichment are:
Number and length of sequences: Sequences can be of variable length and the two sequence sets might consist of a different number of sequences. Matt normalizes the number of found hits in each sequence wrt. its length and within each data set wrt. its size (number of sequences). Avoid false positives: Normalizing the number of hits might result in false positive findings. For example, if hits actually occur only in a specific sub-region of the sequences, e.g, only in the 5'-ends of introns, and if the sequence in one set have a very different length than sequences in the other set. ![]() Testing these sequences which are of very different length but contain the same number of hits, after normalization it will look if there were more hits in group cntr (per sequence position) than in group up. In this or similar cases, it is recommended to cut the sequences to the regions (highlighted) where you actually expect hits. Change test statistic of the enrichment analysis: number of hits vs. number of sequence with hits For testing enrichment, one might either consider
> matt row_calcDescription: Use this command to apply calculations along rows. Example: With t1.tab being the table from above > matt row_calc t1.tab 'LENGTH=END-START+1'will determine the length of the exons and output it as a table with column LENGTH. Use piping to add this new column directly to t1.tab > matt row_calc t1.tab 'LENGTH=END-START+1' | matt add_cols t1.tab -Computing the log of the average of START, END, LENGTH (not very meaningful) with > matt row_calc t.tab 'MEAN=log((START+END+LENGTH)/3)'which creates a table with column MEAN. You can use any mathematical expression which is available within Perl. Column names in expressions are used like variables. Mathematical expressions must always be put inside ' '. > matt col_calcDescription: Use this command to apply some pre-defined calculations, like summation, along columns. If the table contains columns with group labels, this commands allows to apply calculations per group. Example: With t2.tab being this table A B V1 V2 a 1 1 2 a 2 1 3 b 2 1 4 a 1 1 5 > matt col_calc t2.tab sum V1,V2sums over the entire columns V1 and V2 and outputs V1_SUM V2_SUM 4 14 > matt col_calc t2.tab sum V1,V2 Asums over columns V1 and V2, considering entries in column A being group labels and outputs sums per group. A V1_SUM V2_SUM a 3 10 b 1 4 > matt col_calc t2.tab sum V1,V2 A,Bsums over columns V1 and V2, considering entries in columns A and B being group labels and outputs sums per group. A B V1_SUM V2_SUM a 1 2 7 a 2 1 3 b 2 1 4 > matt perm_testDescription: This command provides a flexible, statistical test for differences of features between groups. Because it is a permutation test and does not make assumption about the feature distributions. This test can be applied to these kinds of features:
Example: With exons.tab describing exons (see also get_efeatures) with these (and more) columns
> matt perm_test exons.tab DATASET[down,ndiff] 100000 ... ... -num LENGTH,GCC -card GENE_TYPE -cooc IS_IN_TRSwill apply a permutation test (100000 repeats) and test for significant differences between the two exon groups down vs. ndiff wrt. to the features exon length, exon GC content, and gene type. It also will test whether the exons of the two groups (down vs. ndiff) co-occur differently, e.g., whether they co-occur heavily in one group in a few transcripts, and much less (spread among many transcripts) in the other group. The command will print a text report (NOT a table) with all details and results. Co-occurrences: each row of the table corresponds to one element (in our case to an exon). Exons might be clustered, e.g., all exons occurring in the same transcript belong to one cluster. Exons which occur in the same cluster co-occur with each other. You might identify the clusters by giving them unique cluster IDs. In the example, cluster IDs are transcript IDs. Then, there need to be a column in the table which contains information about the clusters each element (exon) occurs in, e.g., IS_IN_TRS. This information must be given by a colon-separated list of cluster IDs, e.g., tr1:tr10:tr155 for exon5 encodes that exon5 actually occurs in these three clusters (transcripts). If a exon does not occur in any cluster, this list is empty. > matt retr_geneidsDescription: This is a utility command for automatically extracting gene IDs from gene annotations (GTF files) for genomic events like genes, exons, introns, etc. Example: With exons.tab being a table describing human exons (columns START, END, SCAFFOLD, STRAN) and Hsa19.gtf being the Ensembl gene annotation for the human genome (assembly hsa19) > matt retr_geneids exons.tab START END SCAFFOLD STRAND Hsa19.gtf -f gene_idretrieves for all exons in exons.tab gene ids from Hsa19.gtf, where gene ids are taken from field gene_id of Hsa19.gtf. > matt extr_scafidsDescription: This is a utility command for extracting/checking scaffold/sequence/chromosome IDs in FASTA, FASTQ, GFF, GTF files. Example: With Hsa19_gDNA.fasta containing all human chromosomes > matt extr_scafids Hsa19_gDNA.fasta FASTAwill print all IDs of human chromosomes. > matt get_kmers_from_pwmDescription: Use this command to determine the most likely kmers under a position weight matrix model (PWM). Either obtain the top N (integer ≥1) kmers or the most likely k-mer which together account for P (0<P<1) percent of the total binding specificity (probability) as encoded by the PWM. This command might help if you want to search for PWM hits but don't want to specify a threshold on the PWM log-score. In this case, you could determine the most likely kmers from a PWM and use these as a Perl regular expression and search for exact hits. Example: With mbnl1.tab being the PWM for human MBNL1 from CISBP-RNA (with re-defined alphabet A,C,G,T and minimum probability 0.001; see get_pwm) > matt get_kmers_from_pwm mbnl1.tab 5 5prints the 5 most likely kmers under the MBNL1 PWM. ID KMER PWM_LOGSCORE CUMULATIVE_PERCENT_BINDING PWM_NAME 1 AGCTTGC -1.98607349885094 0.137233215315936 mbnl1 2 TGCTTGC -2.09775518556902 0.259964844297944 mbnl1 3 GGCTTGC -2.38376620889964 0.352167512560293 mbnl1 4 CGCTTGC -2.49208691693447 0.434904633342322 mbnl1 5 AGCTTGG -3.67966981232818 0.460135937867991 mbnl1 > matt get_kmers_from_pwm mbnl1.tab 0.5 5prints the most likely kmers which together account for 50% of all binding specificity. ID KMER PWM_LOGSCORE CUMULATIVE_PERCENT_BINDING PWM_NAME 1 AGCTTGC -1.98607349885094 0.137233215315936 mbnl1 2 TGCTTGC -2.09775518556902 0.259964844297944 mbnl1 3 GGCTTGC -2.38376620889964 0.352167512560293 mbnl1 4 CGCTTGC -2.49208691693447 0.434904633342322 mbnl1 5 AGCTTGG -3.67966981232818 0.460135937867991 mbnl1 6 TGCTTGG -3.79135149904627 0.482701022411463 mbnl1 7 AGCTTAC -3.9437269446445 0.502076889638399 mbnl1 > matt mk_uniqDescription: This command allows the user to make-unique a table, i.e., omitting rows with replicated entries in specific columns. Example: With t2.tab being this table A B V1 V2 a 1 1 2 a 2 1 3 b 2 1 4 a 1 1 5 > matt mk_uniq t2.tab A Boutputs A B V1 V2 a 1 1 2 a 2 1 3 b 2 1 4 > matt perm_seqsDescription: This command allows the user to permute randomly nucleotides within sequences. With table seqs.tab containing sequences in column SEQ: > matt perm_seqs seqs.tab SEQoutputs a table with column SEQ_PERM with sequences from SEQ with randomly permuted nucleotides. > matt prnt_tabDescription: Use this command to print a tab-separated table nicely aligned to screen. This command is especially useful for getting an overview of the contents of tables which contain long strings with different lengths. Example: With t1.tab being the table from above > matt prnt_tab t1.tab -w 5 -s 3prints t1.tab where all cells get extended or truncated to 5 characters and with 3 blanks between columns. Use piping and less to see large table. > matt prnt_tab t1.tab -w 5 -s 3 | less -S > matt retr_rnaseqDescription: Allows the user to retrieve RNA-seq data from the Gene Expression Omnibus (GEO), a public repository where many researchers store their RNA-seq data related to their publications. Having at hand the GEO accession numbers of RNA-seq data sets, this command downloads the SRA archives, extracts RNA-seq reads as FASTQ or FASTA files, and, if wanted, renames the extracted FASTA/FASTQ files as specified by the user. Example: Specifying accession numbers and data set names in file accession_numbers.txt: SRR3532927 nova2kd3 SRR3532926 nova2kd2 SRR3532925 nova2kd1 SRR3532924 nova2wt3 SRR3532923 nova2wt2 SRR3532922 nova2wt1 > matt retr_rnaseq accession_numbers.txt -keepsra -fasta -o rnaseq_data -p 6downloads these six data archives, extracts FASTA files, splits them as RNA-seq data are paired-end reads, and renames final files. This is done in parallel with six processes. The output folder rnaseq_data looks then like: dataset_info.tab nova2kd1_1.fasta.gz nova2kd1_2.fasta.gz nova2kd2_1.fasta.gz nova2kd2_2.fasta.gz nova2kd3_1.fasta.gz nova2kd3_2.fasta.gz nova2wt1_1.fasta.gz nova2wt1_2.fasta.gz nova2wt2_1.fasta.gz nova2wt2_2.fasta.gz nova2wt3_1.fasta.gz nova2wt3_2.fasta.gz SRR3532922.sra SRR3532923.sra SRR3532924.sra SRR3532925.sra SRR3532926.sra SRR3532927.sra > matt sort_tabDescription: Use this command to sort rows of a table according to values (numeric or alpha-numeric) in a distinct column. Example: With t1.tab being the table from above > matt sort_tab t1.tab STARTwill print t1.tab to screen where lines are sorted wrt. START. > matt sort_tab t1.tab START -dwill do the same but sort decreasingly. > matt stratifyDescription: Use this command to stratify data sets, e.g., if you want to compare two sets of exons and want to make sure their length distributions are similar, you might first stratify these two sets of exons and then apply perm_test. Stratification can be performed wrt. any numerical feature. Example: With exons.tab being a table describing exons with these columns (and maybe more)
> matt stratify exons.tab GCC DATASET ndiff 0.05will stratify both groups such that their GC content distributions are similar. The reference group is ndiff. The algorithm searches for each ndiff exon one exon from the group down which a similar value in column GCC (± 0.05). Each pair of matched exons will get into the final result table. All exons from both groups without matching pair will be neglected. The command outputs a table like exons.tab reduced to the stratified exons. With col_calc you may check how well the means or medians of the groups agree after stratification. > matt cmpr_exonsDescription: Use this command if you want to compare groups of exons. This command goes through these steps:
Example: With exons.tab being a table describing human exons with these columns (and maybe more)
> matt cmpr_exons exons.tab START END SCAFFOLD STRAND GENEID_ENSEMBL ... ... Hsa19.gtf Hsa19.fa Hsap 150 DATASET[down,ndiff] down_vs_ndiffwill compare exons down vs. ndiff and leave untouched exons from group up. The argument 150 specifies the length of the up-stream and down-stream intronic regions which should be searched for SF1 hits. It will generate the output folder down_vs_ndiff and place therein a table with all exons (down and ndiff) and the extracted exon features. It will also place therein a PDF document summary.pdf containing all details of the comparison. Hint 1: You might first stratify the sets of exons with stratify and then apply cmpr_exons. Hint 2: You might choose colors for the box plots of the PDF report. See help message of cmpr_exons for getting help on this. Hint 3: DATASET[down,ndiff]: the last group specified should be the reference group (like non-differentially spliced). Hint 4: DATASET[down,ndiff] compares set down vs. ndiff, DATASET[down,up,ndiff] compares down vs. up, down vs. ndiff, up vs. ndiff. > matt cmpr_intronsDescription: Use this command if you want to compare groups of introns. This command goes through these steps:
Example: With introns.tab being a table describing human introns with these columns (and maybe more)
> matt cmpr_introns introns.tab START END SCAFFOLD STRAND GENEID_ENSEMBL ... ... Hsa19.gtf Hsa19.fa Hsap 150 DATASET[down,ndiff] down_vs_ndiffwill compare introns down vs. ndiff and leave untouched introns from group up. The argument 150 specifies the length of the 3'-end of the introns which should be searched for SF1 hits. It will generate the output folder down_vs_ndiff and place therein a table with all introns (down and ndiff) and the extracted intron features. It will also place therein a PDF document summary.pdf containing all details of the comparison. Hint 1: You might first stratify the sets of introns with stratify and then apply cmpr_introns. Hint 2: You might choose colors for the box plots of the PDF report. See help message of cmpr_introns for getting help on this. Hint 3: DATASET[down,ndiff]: the last group specified should be the reference group (like non-differentially spliced). Hint 4: DATASET[down,ndiff] compares group down vs. ndiff, DATASET[down,up,ndiff] compares down vs. up, down vs. ndiff, up vs. ndiff. > matt cmpr_featuresDescription: Use this command if you want to visually inspect and statistically test dependencies between a feature X and other features given by columns in a table. This command goes through these steps:
If the input table describes exons/introns, then it needs to contain these columns
Example: Table introns.tab describes introns by columns
> matt cmpr_features introns.tab -a DPSI -mattintron START END SCAFFOLD STRAND GENEID_ENSEMBL ... ... Hsa19.gtf Hsa19.fa Hsap 150 -points -bins 5 -o output_dirThe argument 150 specifies the length of the 3'-end of the introns which should be searched for SF1 hits and which should be included into the branch point analysis. It will generate the output folder output_dir and place therein:
Very important Hint 1: When comparing exon/intron features to, e.g. some dPSI, then choose thoroughly the set of exons/introns to be included, otherwise weak patterns might be swamped and invisible. Generally, it is a good idea to exclude constitutive exons/introns. Hint 2: You can change the colors for the box plots. Hint 3: You can plot or ommit data points in the box plots. Hint 4: You can plot or ommit outlier data points in the box plots. > matt rna_mapsDescription: Allows the user to produce motif RNA-maps for comparing enrichment of binding motifs of RNA/DNA binding proteins for different groups of exons or introns. Motifs can be defined as position weight matrix models (PWMs) or Perl regular expressions. This command will extract all necessary sequences from the genomic proximity of given exons/introns, predict hits of the specified motifs therein, and compute motif enrichment scores. If chosen by the user, a permutation test is applied for estimating p values and regions with significant differences in motif enrichment will be highlighted in the motif RNA-map. Output:
Example: Assume file nova2_regexpr.tab describes a Perl regular expression of the NOVA2 binding motif in a table with columns
matt rna_maps exons.tab UPSTRM_EX_BORDER START END DOSTRM_EX_BORDER SCAFFOLD ... STRAND GROUP[up,down,ndiff] 31 35 135 Mmu09.fasta nova2_regexp.tab TYPE ... NAME REGEXP THRESH BGMODEL -d nova2_rnamap -p 0.0001 10000The following graphic as PDF will be put into output folder nova2_rnamap ![]() Hint1: When you use the arguments -p or -fdr then statistical testing is performed and regions with significant enrichment/depletion are highlighted. The last group in GROUP[up,down,ndiff] is the reference group for testing, e.g., here up vs ndiff and down vs ndiff. Hence, highlighting will be plotted only for groups up and down, never for ndiff. Hint2: For statistical testing a permutation test is invoked. When testing many motifs, then this may increase the runtime considerably. In this case, you may omit the statistical test (omit arguments -p, -fdr), scroll through the PDF report focusing on the first and last motif RNA maps, and re-run this analysis with statistical test for a few motifs which gave you interesting findings. > matt rna_maps_cisbpDescription: Allows users to produce motif RNA maps for
Output:
Assume 3 groups of human exons are described in table exons.tab with columns
matt rna_maps_cisbp exons.tab UPSTRM_EX_BORDER START END DOSTRM_EX_BORDER SCAFFOLD ... STRAND GROUP[up,down,ndiff] 31 35 135 Hsa19.fasta cisbprna_regexps ... -d cisbp_rnamapsSee command rna_maps for an example motif RNA map. Hint1: When you use the arguments -p or -fdr then statistical testing is performed and regions with significant enrichment/depletetion are highlighted. The last group in GROUP[up,down,ndiff] is the reference group for testing, e.g., here up vs ndiff and down vs ndiff. Hence, highlighting will be plotted only for groups up and down, never for ndiff. Hint2: For statistical testing a permutation test is invoked. When testing many motifs, then this may increase the runtime considerably. In this case, you may omit the statistical test (omit arguments -p, -fdr), scroll through the PDF report focusing on the first and last motif RNA maps, and re-run this analysis with statistical test for a few motifs which gave you interesting findings. > matt test_cisbp_enrichDescription: Use this command to test for enrichment/depletion with all approx. 300 PWM motifs from the CISBP-RNA database. Example: With table t.tab having these columns (and maybe many more)
> matt test_cisbp_enrich t.tab SEQ_3P DATASET[down,ndiff,up,ndiff]will compare down vs. ndiff and up vs. ndiff. During each comparison and for each CISBP-RNA binding PWM, it will predict binding sites in all sequences and apply a permutation test to determine p-values for the differences wrt. the number of predicted hits in both data sets. It will generate a table containing all results (including number of predicted hits per group, test statistics, p-values, information about the binding motifs (PWMs) like associated RNA-binding protein). For PWMs for which significant enrichment/depletion of hits was found, test_cisbp_enrich also generates PWM profiles invoking get_pwm_prof. Internally, test_cisbp_enrich applies the command test_pwm_enrich. Hence per binding motif (PWM), test_cisbp_enrich tests for differences in the number of predicted hits between two groups of sequences as well as for positional differences of the predicted hist. In detail, test_cisbp_enrich searches for hits in both sets of sequences once, but then tests for significant enrichment/depletion 4 times considering:
![]() Output: a table containing detailed information (number of data points, test statistics, p values) for all four tests considering different parts of the sequences. The most important columns highlighting found significant enrichment are:
Number and length of sequences: Sequences can be of variable length and the two sequence sets might consist of a different number of sequences. Matt normalizes the number of found hits in each sequence wrt. its length and within each data set wrt. its size (number of sequences). Avoid false positives: Normalizing the number of hits might result in false positive findings. For example, if hits actually occur only in a specific sub-region of the sequences, e.g, only in the 5'-ends of introns, and if the sequence in one set have a very different length than sequences in the other set. ![]() Testing these sequences which are of very different length but contain the same number of hits, after normalization it will look if there were more hits in group cntr (per sequence position) than in group up. In this or similar cases, it is recommended to cut the sequences to the regions (highlighted) where you actually expect hits. Hint 1: You might run test_cisbp_enrich with a sub-set of CISBP-RNA PWMs. See help message for details on this. Hint 2: Change test statistic of the enrichment analysis: number of hits vs. number of sequence with hits For testing enrichment, one might either consider
ExamplesThis section shows applications of Matt to these tasks:
The examples contain many explanations and therefore appear lengthy.
The more familiar the user is with Matt, the easier and faster the work experience will become.
Comparing sets of exons wrt. their exon featuresThe data table data.tab contains alternative splicing events published in Irimia et. al, Cell, 2014. It was assembled from output of different VAST-TOOLS runs. Each event belongs to one group of
1. Making yourself familiar with data by checking, e.g., column names of the given table. > matt get_colnms data.tab ID COL_NAMES 1 GENE 2 EVENT 3 COORD 4 LENGTH 5 FullCO 6 TYPE 7 GROUPThe table data.tab has 7 columns containing information on alternative splicing events as output from Vast-Tools. Print the first rows nicely aligned: > matt prnt_tab data.tab -W 30 | less -S - GENE EVENT COORD LENGTH FullCO TYPE GROUP HsaEX0000026 chr12:120648733-120648909 177 ... S AS_noNeural HsaEX0000032 chr12:10749491-10749580 90 ... S AS_noNeural HsaEX0000033 chr12:10748290-10748526 237 ... C2 AS_noNeural HsaEX0000044 chr14:58760484-58760644 161 ... C1 AS_noNeural HsaEX0000048 chr14:58752309-58752474 166 ... C2 AS_noNeural HsaEX0000053 chr12:70636846-70636984 139 ... S AS_noNeural HsaEX0000062 chr14:101376570-101376641 72 ... S AS_noNeural HsaEX0000087 chr15:84850734-84850835 102 ... S AS_noNeural HsaEX0000091 chr15:57209263-57209410 148 ... S AS_noNeural A1BG HsaINT0000005 chr19:58858396-58858718 323 ... IR-S AS_noNeural A2M HsaINT0000025 chr12:9256997-9258831 1835 ... IR-S AS_noNeural A2M HsaINT0000026 cr12:9254271-9256834 2564 ... IR-S AS_noNeural A2M HsaINT0000027 chr12:9253804-9254042 239 ... IR-S AS_noNeuralColumn FullCO was not put on this website to save space. Column TYPE encodes the type of alternative splicing events. Check all different types of splicing events: > matt col_uniq data.tab TYPE | matt prnt_tab - -w 9 | less -S - TYPE_UNIQ FREQ Alt3 3444 Alt5 2449 C1 1604 C1* 312 C2 1287 C2* 259 C3 1134 C3* 237 IR-C 3483 IR-S 15494 ...2. Adapting labels in column TYPE We need to use command get_vast later which works with Vast-Tools standard types of alternative splicing events which are: S, C1, C2, C3, MIC, IR-S, IR-C, Alt3, Alt5. Table data.tab contains these standard types but also other types like S*, C1*. Hence, we adapt all labels to the standard types with command chg_labs. > matt chg_labs data.tab TYPE S*:S C1*:C1 C2*:C2 C3*:C3 MIC_M:MIC MIC_S:MIC > tmp.tab > mv tmp.tab data.tab3. Extracting exons and their start/end-coordinate, chromosome, strand The commands get_efeatures and cmpr_exons expect information on exons like their start and end coordinate but table data.tab does not contain this information, yet. The command get_vast is specifically designed for extracting this information from VAST-TOOLs output tables. It outputs an augmented table, from which we select only exon events by re-directing this table to the command get_rows. The final output table gets re-directed into file exons.tab. > matt get_vast data.tab COORD FullCO TYPE LENGTH | ... ... matt get_rows - TYPE]S,C1,C2,C3,MIC[ > exons.tabChecking how table exons.tab looks like by listing its column names. > matt get_colnms exons.tab ID COL_NAMES 1 GENE 2 EVENT 3 COORD 4 LENGTH 5 FullCO 6 TYPE 7 GROUP 8 SCAFFOLD 9 START 10 END 11 STRAND 12 UPSTRM_EX_BORDER 13 DOSTRM_EX_BORDERTo verify that exons.tab contains only exons, you could also apply col_uniq to column TYPE again. 4.1 Adding Ensembl gene IDs which are necessary for commands get_efeatures and cmpr_exons. Here, we use command get_match which takes as input a second table containing a mapping EVENT_ID to ENSEMBL_GENEID (mapping_EVENT2GENEID.tab) and outputs for each EVENT_ID in column EVENT of table exons.tab the corresponding GENEID. The resulting table (one column) gets added immediately to table exons.tab exploiting piping. > matt get_match exons.tab EVENT mapping_EVENT2GENEID.tab EVENT ENSEMBL_GENEID ... ... | matt add_cols exons.tab - 4.2 Adding Ensembl gene IDs Alternatively, we could use the command retr_geneids to extract gene IDs from a given GTF file for splicing events. The following commands will append to the table exons.tab a column ENSEMBL_GENEID. > matt retr_geneids exons.tab START END SCAFFOLD STRAND Hsa19_ensembl.gtf -f gene_id -n ENSEMBL_GENEID ... ... | matt add_cols exons.tab - 4.3 Adding Ensembl gene IDs Yet another option is to use the command get_vast with arguments -gtf and -f. In this case, it automatically extracts gene IDs from the given GTF and appends a column GENEID, which we rename to ENSEMBL_GENEID. > matt get_vast data.tab COORD FullCO TYPE LENGTH -gtf Hsa19_ensembl.gtf -f gene_id | ... ... matt get_rows - TYPE]S,C1,C2,C3,MIC[ | matt rn_cols - GENEID:ENSEMBL_GENEID > exons.tab 5. Checking number of exons in final table to get an idea about the size of the exon sets. > matt col_uniq exons.tab GROUP GROUP_UNIQ FREQ AS_noNeural 10465 NEURAL-DOWN 631 NEURAL-UP 10466. Extracting exon features for all exons You might want to download the USCS HG19 assembly sequences and the Ensembl gene annotation for HG19. For decompressing the file hg19.2bit you need to download as well twoBitToFa and run > ./twoBitToFa -noMask hg19.2bit Hsa19_gDNA.fastaThen extracting exon features and adding these immediately to the table exons.tab > fasta=/..path..to../Hsa19_gDNA.fasta > gtf=/..path..to../ensembl/Hsa19.gtf > matt get_efeatures exons.tab START END SCAFFOLD STRAND ENSEMBL_GENEID ... ... $gtf $fasta Hsap 150 | matt add_cols exons.tab -Table exons.tab now contains for all exons information about their genomic coordinate, gene, and their exon features including sequences like exon sequence, splice site sequences, sequences of up/down-stream introns. All these information can be used as input for further analyses, e.g., with R. The extracted sequences might be of interest for further analyses like de-novo motif search etc, which often expect their input as FASTA files. Using the command extr_seqs, you might extract sequences from specific columns of exons.tab in FASTA format. 7. Down-sample exon set AS_noNeural The sizes of the exon sets are: AS_noNeural=10465, NEURAL-DOWN=631, NEURAL-UP=1046. Hugh groups like AS_noNeural are unnecessary for comparing exon sets and would slow down analysis. Hence, we sub-sample 2000 AS_noNeural exons with rand_rows and create a new table exons_testsets.tab. > matt get_rows exons.tab GROUP]AS_noNeural[ ... ... | matt rand_rows - 2000 132927352 > exons_testsets.tab > matt get_rows exons.tab GROUP]NEURAL-DOWN,NEURAL-UP[ ... ... | matt add_rows exons_testsets.tab -The table exons_testsets.tab should be used as input for command cmpr_exons. It is important to know that cmpr_exons will extract features for all exons of its input table and append them to the input table. As a consequence, the input table must not contain already columns with identical column names because column names in a table must be unique. Hence, from exons_testsets.tab we select only the important columns and neglect the already added columns with exon features. > matt get_cols exons_testsets.tab START END SCAFFOLD STRAND ... ... ENSEMBL_GENEID GROUP > tmp.tab > mv tmp.tab exons_testsets.tabChecking the number of exons in the final table exons_testsets.tab confirms that the sampling worked as expected. > matt col_uniq exons_testsets.tab GROUP GROUP_UNIQ FREQ AS_noNeural 2000 NEURAL-DOWN 631 NEURAL-UP 10468. Statistically comparing exon sets and generating PDF report > fasta=/..path..to../Hsa19_gDNA.fasta > gtf=/..path..to../ensembl/Hsa19.gtf > matt cmpr_exons exons_testsets.tab START END SCAFFOLD STRAND ENSEMBL_GENEID ... ... $gtf $fasta Hsap 150 GROUP[NEURAL-UP,NEURAL-DOWN,AS_noNeural] ... ... cmpr_1 -colors:brown2,darkgoldenrod2,azure4With GROUP[NEURAL-UP,NEURAL-DOWN,AS_noNeural] we select the groups which should be compared, i.e., all pair-wise comparisons are done: NEURAL-UP vs. NEURAL-DOWN, NEURAL-UP vs. AS_noNeural, and NEURAL-DOWN vs. AS_noNeural. Putting AS_noNeural in the last place will give us box plots which contain the group AS_noNeural as reference group. All output gets written into folder cmpr_1, where you find a summary in form of a PDF document with all details on the comparisons and all graphics in sub-folder summary_graphics for later use. The PDF report contains all details and plots and is organized in four main parts:
Testing enrichment of TGC in up-stream regions of neural micro exonsWe want to statistically test if the 3-mer TGC occurs more frequent in the 200 nt up-stream regions of neural micro exons. We start again with the table data.tab, which we already have used for comparing different sets of neural and non-neural AS events. Events are grouped according to their origin
4. Down-sample exon set AS_noNeural The set of AS_noNeural is much larger than the other sets. > matt col_uniq exons.tab GROUP GROUP_UNIQ FREQ AS_noNeural 10465 NEURAL-DOWN 631 NEURAL-UP 1046Hence, we randomly sample 2000 AS_noNeural exons and construct a new table with them and all NEURAL-DOWN and NEURAL-UP exons. > matt get_rows exons.tab GROUP]AS_noNeural[ | ... ... matt rand_rows - 2000 923681819 > exons_testsets.tab > matt get_rows exons.tab GROUP]NEURAL-DOWN,NEURAL-UP[ | ... ... matt add_rows exons_testsets.tab -Verifying the sizes of the exon sets in exons_testsets.tab: > matt col_uniq exons_testsets.tab GROUP GROUP_UNIQ FREQ AS_noNeural 2000 NEURAL-DOWN 631 NEURAL-UP 1046 5. Retrieving up-stream sequences of exons The command get_seqs is the utility command for retrieving different types of sequences. For a given genomic region, e.g. an exon in this case, it allows to retrieve the up-stream and down-stream sequences (*Mode 2 in its help message) and will output these as a table with two columns SEQ_UP and SEQ_DOWN. Here we define as length for the up- and down-stream region 200 nt and add the columns with sequences directly to table exons_testsets.tab. > fasta=/..path..to../Hsa19_gDNA.fasta > matt get_seqs exons_testsets.tab START END 200 200 SCAFFOLD STRAND $fasta | ... ... matt add_cols exons_testsets.tab -Verifying the result: > matt get_colnms exons_testsets.tab ID COL_NAMES 1 START 2 END 3 SCAFFOLD 4 STRAND 5 TYPE 6 LENGTH 7 GROUP 8 SEQ_UP 9 SEQ_DOWNLooking at the retrieved sequences nicely shows/confirms that (almost) all up-stream sequences end in AG, as expected, > matt get_cols exons_testsets.tab SEQ_UP | less -S - SEQ_UP ...CAAGGCTGTCTTTTCTCCATAG ...TTAAATTGTCTTCTCTCCACAG ...ACTTGAATTTCTACTCCCACAG ...CAACTTTCTAATCTATCCCTAG ...CCCCCTTGTGTTCCTCCTGCAG ...ATTTTGTTATTTTTTTTTTTAG ...GCTGTTCGTTTTTCTTTCTTAG ...CTTCGCCTCCTTGCTGTTTCAG ...and (almost) all down-stream sequences start with GT: > matt get_cols exons_testsets.tab SEQ_DOWN | less -S - SEQ_DOWN GTGAGCTGGCAGAAATGAAGGGACT GTAAGACCAGTTTACCTTTTGAAGA GTAATACTTTGGGGAAGGAAGTTTT GTAAGAGTATTTTCCTTAATCAGTT GTATGTGGCTGTGGTGAACGGGCAC GCATGTAGCACCACACCTAACTGGA GTAAGCCTGCCAGGAGCCACTCAGC GTGAGCTCTGAGCCTCTGGCATTTC ... 7. Defining sets of micro exons Lastly, before we apply the enrichment analysis, we need to define the groups of exons we want to compare:
> matt def_cats exons_testsets.tab GROUP2 ... ... 'nmic_s=LENGTH[3,15] GROUP]NEURAL-UP,NEURAL-DOWN[' ... ... 'nmic=LENGTH[16,27] GROUP]NEURAL-UP,NEURAL-DOWN[' ... ... 'nex=LENGTH[28,1000000000] GROUP]NEURAL-UP,NEURAL-DOWN[' ... ... 'ex=GROUP]AS_noNeural[' | matt add_cols exons_testsets.tab -Verifying results > matt col_uniq exons_testsets.tab GROUP2 GROUP2_UNIQ FREQ ex 2000 nex 1371 nmic 155 nmic_s 151 7. Statistically testing enrichment of TGC in up-stream sequences The data are ready prepared for applying the command test_regexp_enrich. We use this command because the 3-mer TGC can be described by a simple regular expression TGC. We create a simple tab-separated table describing this regular expression and save it in tgc.tab which looks like REGEXP NAME TGC TGCWhen you copy'n'paste this into tgc.tab, please make sure that columns are tab-separated. Then we run the enrichment analysis comparing nmic_s vs. ex, nmic vs. ex, and nex vs. ex using the sequences as given (single) and for computing p-values running a permutation test 100000 times: > matt test_regexp_enrich exons_testsets.tab SEQ_UP ... ... GROUP2[nmic_s,ex,nmic,ex,nex,ex] single tgc.tab NAME ... ... REGEXP quant 1000 > res.tabThe permutation test performs 1000 iterations for speed. To obtain reliable p-values, 50000 or 100000 iterations are recommended. A comprehensive summary of the enrichment analyses done is written into res.tab (via re-direction of the output). This file is best studied as a spread sheet (using, e.g., Libre Office Calc). The most interesting columns are (please see explanation of test_regexp_enrich for more details):
8. Visualizing positional distributions of TGC in up-stream regions of exons The command get_regexp_prof allows to generate positional distribution of hits of regular expressions across sequences. As we have the 200 nt up-stream but also the 200 nt down-stream regions, we visualize both together running: > matt get_regexp_prof exons_testsets.tab SEQ_UP RL SEQ_DOWN LR tgc.tab ... ... GROUP2[nmic_s,nmic,nex,ex]This will create a web page in folder regexp_profiles which you can study using any web browser. This web page looks as follows. Matt: Positional distributions of hits of REGEXPs
Data file: exons_testsets.tab Ordered ResultsOrdering from top to bottom in each column: corresponds to most negativeto most positive differences of densities / cumulative densities to densities / cumulative densities of reference group Reference group: ex If reference group is none then differences are determined wrt. zero line / diagonal with slope 1.
PlotsTGC - TGCBack to Ordered Results![]() It should be mentioned that these plots show densities and cumulative densities. Hence, they give information about the position of TGC hits across sequences; but they do not give information about the absolute number of hits. The plots show that TGC hits in the up-stream intronic regions of neural micro exons (red and blue lines) are clearly concentrated closer to the 3' splice site as in case of neural exons and non-neural exons. Downstream analysis of Nova2 regulated exonsA Shell script with all commands and explanations is available. Commands therein can be copy'n'pasted to a Shell where Matt is available.After having downloaded the RNA-seq data from GEO with help of Matt, and after having estimated inclusion levels of alternative-splicing events genome-wide with help of the program VAST-TOOLs, this downstream analysis consists of the following steps
![]() ![]() Extending Matt: add new commands/functionalityMatt can be extended by new commands/functions. Extensions can be implemented in any scripting/programming language. Though, the easiest way might be to use Perl as Matt is mainly implemented in Perl and already includes a Perl module with utility functions for working with tables.A template script new_command.pl showing how Matt commands are usually implemented in Perl is available in sub-directory abilities of the Matt installation directory. This Perl script implements already the usual structure and can be extended quickly to a full implementation of a new Matt command. This template script new_command.pl contains detailed comments/help explaining how it needs to be extended. It also contains instructions for implementing Matt commands in other scripting/programming languages. FAQ / Trouble shooting1. Which letters do genomic sequences contain?Usually, they contain A, C, G, T. All CISBP-RNA motifs are defined wrt. these letters. Hence, when you run test_cisbp_enrich, your sequences must contain A, C, G, T, and maybe N but all potential binding positions containing at least one N will be neglected. Whenever you extract sequences with, e.g., get_seqs and your FASTA file contains A, C, G, U, then the extracted sequences will contain these letter as well. Whenever you use your own position weight matrices (PWMs), these might be defined wrt. A, C, G, U. In this case, when you want to use these PWMs with command test_pwm_enrich then your sequences should be defined wrt. A, C, G, U as well. 2. Which column names do my tables need to have? You are free to choose any column names you prefer, but they should be alpha-numeric and, if possible, upper-case: A-Z, 0-9, and '_', and they should be meaningful. When applying Matt commands to your tables, often you need to specify the column names of your tables to which you want to apply Matt. E.g., get_seqs expects your table to contain a column with start coordinates and another with end coordinates. You can call these columns START and END, but any other name is possible. 3. I don't get a PDF report from cmpr_exons or cmpr_introns. These PDF reports are generated with pdflatex using the following packages. Please make sure all of them are available on your system.
4. I get strange errors when extracting sequences from a FASTA. ... uninitialized value $chr in substitution (s///) at get_seqs.pl line 118, <$fh> chunk 954. ... uninitialized value $seq in substitution (s///) at get_seqs.pl line 119, <$fh> chunk 954. ... uninitialized value $seq in transliteration (tr///) at get_seqs.pl line 120, <$fh> chunk 954. ... uninitialized value $chr in hash element at get_seqs.pl line 121, <$fh> chunk 954. We saw these errors when the FASTA file was stored and read from a pen drive/USB stick. Copying the FASTA to the local hard drive solved this issue. How to citeMatt builds on the great work of others. Hence, when publishing results which you have obtained with Matt, please follow these rules for citation.
|
System requirements • Linux/Mac OS • Perl • R, Rscript • pdflatex • any PDF viewer • any Web browser License Matt is open-source and available under GNU LGPLv3 Additional material • Intro to rna_maps Matt paper • A. Gohr, M. Irimia, Matt: Unix tools for alternative splicing analysis, Bioinformatics, 2018, DOI: 10.1093/bioinformatics/bty606 Matt poster • At NGS 2017 Please cite as well Matt uses great tools of others. |