Supplemental Data for "Cisml: an XML-based Output Format for
Sequence Element Detection Software"
Authors: Peter Haverty and Zhiping Weng
Example Data Files
- A hypothetical result of searching a group of promoters with a group of
TRANSFAC Position Specific Scoring Matrices one at a time.
- A hypothetical result of searching each sequence in a group of promoters with a group of TRANSFAC Position Specific Scoring Matrices in order to detect clusters of transcription factor binding sites.
Three Descriptions of CisML
- 1. W3C Schema
- The preferred method of CisML validation corresponding to the XML namespace http://zlab.bu.edu/schema/cisml
- 2. DTD
- An alternative for CisML validation for XML parsers that have not yet implemented support for W3C Schema
- 3. English:
- The outermost element. Contains the elements program-name, parameters,
and groups of either multi-pattern-scan or pattern.
- The name of the program that created this output file.
- A tag to contain the parameters given to the program used to create this data file. This tag contains the pattern-file, sequence-file, background-seq-file, sequence-filtering, pattern-pvalue-cutoff, sequence-pvalue-cutoff and site-pvalue-cutoff tags. You can add additional tags for other parameters if you like.
- The name of the file containing the patterns used to create this data.
- The name of the file containing the sequences which were searched for patterns.
- The name of a file containing sequences used to represent the background frequencies of patterns (if applicable).
- A cutoff used by the program that created this file to select significant patterns.
- A cutoff used by the program that created this file to select sequences with significant overall matches to a pattern. A number of matched-elements in a sequence may contribute to a sequence's overall significance.
- A cutoff used by the program that created this file to select sequence elements with significant matches to a pattern.
- Used to describe the sequence filtering method used (if any).
- Contains one of two values, "on" or "off", indicating whether sequence filtering was used.
- Contains the name of the sequence filtering program used.
- Can contain a group of patterns used to scan one or more sequences to
detect clusters of patterns.
- The probability that a group of patterns exists in the sequences scanned.
- A score for the match between this pattern group and the group of sequences scanned.
- A pattern used to search one or more sequences. Pattern includes a group
of scanned-sequences and may include user defined tags describing the
pattern used (if they are defined under a separate namespace).
- A unique identifier for this pattern. E.g. a TRANSFAC accession: M00192.
- The name of the pattern. E.g. CREB binding site.
- The name of the database supplying the pattern (if any). E.g. TRANSFAC
- The full Life Sciences ID string (http://www.i3c.org/wgr/ta/resources/lsid/docs/index.asp) of the pattern (if available).
- The probability that this pattern exists in the one or more sequences it contains.
- A score representing the quality of the match between this pattern and sequences it contains.
- A sequence scanned by the enclosing pattern.
- A unique identifier of the sequence scanned. In the case of promoter sequences, it may be the accession of the gene associated with the scanned promoter.
- The name of this sequence.
- The name of the database supplying the sequence (if any). E.g. GenBank
- The full Life Sciences ID string (http://www.i3c.org/wgr/ta/resources/lsid/docs/index.asp) of the sequence (if available).
- A score representing the quality of the match between this sequence and the pattern it was scanned with.
- The probability that the element represented by the containing pattern exists in this sequence.
- The length of the sequence.
- The actual sequence element detected in a sequence by a pattern.
- The first position of the pattern in the sequence.
- The last position of the pattern in the sequence. For matches on the opposite strand of DNA, the stop position will be less than the start position.
- A score representing the quality of the match between this element and the pattern that detected it.
- The probability associated with the match between this element and the pattern that detected it.
- If matched-elements found within one sequence are grouped into distinct clusters, this attribute may denote which cluster each matched-element belongs to.
- The sequence of the matched-element. For example, the 12 bases matched by a transcription factor binding site matrix.
- Generates a text report of the P-Values for matches of a series of patterns to a group of sequences.
- Generates an html report of the P-Values for matches of a series of patterns to a group of sequences.
- Generates a text report of the patterns detected in each of a group of sequences.
- Generates an html report of the patterns detected in each of a group of sequences.
- Generates a PDF report of the patterns detected in each of a group of sequences. The locations of each pattern are depicted graphically using SVG graphics. Generating a PDF using this stylesheet requires a print formatter, such as FOP, that understands XSL Formatting Objects.
- A simple Cascading Style Sheet used to style the output of cisml.pattern.html.xsl and cisml.sequence.html.xsl.
Recommended XSLT processors
- SAXON is a Java-based XSLT processor. Version 7 implements new features of XSLT 2.0 which are necessary for cisml.sequence.text.xsl and cisml.sequence.html.xsl.
- FOP is a Java-based print formatter that uses XSL Formatting Objects (XSL-FO) to render XML documents in a variety of formats, including PDF.
- LibXML and LibXSLT
- LibXML and LibXSLT are C libraries for parsing XML and processing XSLT documents. They are commonly included as standard in Linux distributions and are available for use from C and Perl. These libraries can also be used to process XSLT on the command line with the xsltproc program. xsltproc is significantly faster than Java-based programs, but does not currently implement XSLT 2.0.
Style Sheets Applied to Example Data Files
|Example Data File|
Producing reports for cisml.multiscan.xml is simple with the XSLT 2.0 feature "for-each-group" and not straightforward without it. We therefore recommend using SAXON version 7 or later to process cisml.sequence.graphics.in.pdf.xsl and create a Formatting Objects file. FOP can then use this Formatting Objects file to produce PDF output. The following UNIX commands demonstrate the process:
java -jar saxon7.jar cisml.xml cisml.sequence.graphics.in.pdf.xsl > foo.fo
fop.sh -fo foo.fo -pdf cisml.sequence.graphics.pdf
Converting Other Formats to CisML
The best way to use CisML is to modify motif search programs to output CisML directly. When that is not feasible, one may parse another format to produce CisML. We provide a few examples:
|Program||Program Output||Parser||CisML||CisML to Text|
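The conversion route described above can be sketched in a few lines of Python. This is only an illustration, not one of the parsers provided here: the program name "ExampleScanner", the input tuples, and the "accession" attribute name on pattern and scanned-sequence are assumptions, and the result is not validated against the CisML Schema (the parameters element is left empty). The element names, the namespace, and the start, stop and score attributes follow the descriptions on this page.

```python
import xml.etree.ElementTree as ET

CISML_NS = "http://zlab.bu.edu/schema/cisml"


def q(tag):
    """Qualify a tag name with the CisML namespace."""
    return f"{{{CISML_NS}}}{tag}"


def hits_to_cisml(program_name, hits):
    """Build a CisML tree from (pattern_id, seq_id, start, stop, score, site) tuples."""
    root = ET.Element(q("cis-element-search"))
    ET.SubElement(root, q("program-name")).text = program_name
    ET.SubElement(root, q("parameters"))  # left empty in this sketch

    patterns = {}  # pattern_id -> <pattern> element
    scanned = {}   # (pattern_id, seq_id) -> <scanned-sequence> element
    for pattern_id, seq_id, start, stop, score, site in hits:
        if pattern_id not in patterns:
            # "accession" is an assumed attribute name for the unique identifier.
            patterns[pattern_id] = ET.SubElement(
                root, q("pattern"), accession=pattern_id)
        key = (pattern_id, seq_id)
        if key not in scanned:
            scanned[key] = ET.SubElement(
                patterns[pattern_id], q("scanned-sequence"), accession=seq_id)
        match = ET.SubElement(
            scanned[key], q("matched-element"),
            start=str(start), stop=str(stop), score=str(score))
        ET.SubElement(match, q("sequence")).text = site
    return root


# Hypothetical hits, as they might look after parsing another program's output.
hits = [
    ("M00192", "promoter1", 8203, 8224, 0.876, "cgatgtcatagagtACGTgtca"),
    ("M00192", "promoter2", 15, 36, 0.91, "cgatgtcatagagtACGTgtca"),
]

ET.register_namespace("", CISML_NS)  # serialize CisML as the default namespace
root = hits_to_cisml("ExampleScanner", hits)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Grouping hits by pattern first and by sequence second mirrors the nesting described above, where a pattern contains scanned-sequences and each scanned-sequence contains its matched-elements.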
The CisML Schema is defined so that you may add your own elements to capture program-specific features that CisML does not account for. The parameters, pattern, multi-pattern-scan, scanned-sequence, and matched-element elements can include elements you define, provided they are declared under a different namespace. Here is an example taken from the output of our parser for MatInspector.
First, add a namespace declaration to the start of the file (here, xmlns:mi="MatInspector"):
<?xml version="1.0"?>
<cis-element-search
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:mi="MatInspector"
    xsi:schemaLocation="http://zlab.bu.edu/schema/cisml cisml.xsd"
    xmlns="http://zlab.bu.edu/schema/cisml">
  <program-name>MatInspector</program-name>
...
The namespace identifier ("MatInspector") needn't correspond to any real defined schema, but it would be better for users of your data if it did.
Second, add your tags AFTER the CisML tags and make sure to use the namespace prefix ("mi" in this case):
<matched-element start="8203" stop="8224" score="0.876">
  <sequence>cgatgtcatagagtACGTgtca</sequence>
  <mi:cor-sim>0.976</mi:cor-sim>
</matched-element>
The <mi:cor-sim> element denotes the match of an element to the core motif (called the "Core Similarity") as calculated by MatInspector.
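Reading such an extended file takes a namespace-aware parser. The following Python sketch shows one way a consumer might pull out both the CisML fields and the MatInspector extension. The document in DOC is a trimmed, hypothetical example: the pattern and scanned-sequence wrappers carry no attributes and the parameters element is omitted, so it is not a complete, valid CisML file. The function name read_matches and the dictionary keys are likewise illustrative.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace map used only for lookups in this script.
NS = {"c": "http://zlab.bu.edu/schema/cisml", "mi": "MatInspector"}

# Trimmed, hypothetical document built around the matched-element shown above.
DOC = """<?xml version="1.0"?>
<cis-element-search xmlns="http://zlab.bu.edu/schema/cisml" xmlns:mi="MatInspector">
  <program-name>MatInspector</program-name>
  <pattern>
    <scanned-sequence>
      <matched-element start="8203" stop="8224" score="0.876">
        <sequence>cgatgtcatagagtACGTgtca</sequence>
        <mi:cor-sim>0.976</mi:cor-sim>
      </matched-element>
    </scanned-sequence>
  </pattern>
</cis-element-search>
"""


def read_matches(xml_text):
    """Return one dict per matched-element, including the optional extension."""
    root = ET.fromstring(xml_text)
    rows = []
    for match in root.iterfind(".//c:matched-element", NS):
        core = match.find("mi:cor-sim", NS)  # None when the extension is absent
        rows.append({
            "start": int(match.get("start")),
            "stop": int(match.get("stop")),
            "score": float(match.get("score")),
            "site": match.findtext("c:sequence", namespaces=NS),
            "core_similarity": float(core.text) if core is not None else None,
        })
    return rows


rows = read_matches(DOC)
for row in rows:
    print(row)
```

Because the extension element lives in its own namespace, a reader that only knows CisML can ignore it, while a MatInspector-aware reader can look it up explicitly as above.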
Programs that use CisML
- The promoter analysis component of CARRIE uses CisML. A Web Server for ROVER is under construction.
Last updated 2-18-04