Supplemental Data for "Cisml: an XML-based Output Format for Sequence Element Detection Software"

Authors: Peter Haverty and Zhiping Weng


Example Data Files

cisml.xml
A hypothetical result of searching a group of promoters with a group of TRANSFAC Position Specific Scoring Matrices one at a time.
cisml.multiscan.xml
A hypothetical result of searching each sequence in a group of promoters with a group of TRANSFAC Position Specific Scoring Matrices in order to detect clusters of transcription factor binding sites.

Three Descriptions of CisML

1. W3C Schema
The preferred method of CisML validation corresponding to the XML namespace http://zlab.bu.edu/schema/cisml
2. DTD
An alternative for CisML validation for XML parsers that have not yet implemented support for W3C Schema
3. English:
cis-element-search
The outermost element. Contains the elements program-name, parameters, and groups of either multi-pattern-scan or pattern.
program-name
The name of the program that created this output file.
parameters
A tag to contain the parameters given to the program used to create this data file. This tag contains the pattern-file, sequence-file, background-seq-file, sequence-filtering, pattern-pvalue-cutoff, sequence-pvalue-cutoff and site-pvalue-cutoff tags. You can add additional tags for other parameters if you like.
pattern-file
The name of the file containing the patterns used to create this data.
sequence-file
The name of the file containing the sequences which were searched for patterns.
background-seq-file
The name of a file containing sequences used to represent the background frequencies of patterns (if applicable).
pattern-pvalue-cutoff
A cutoff used by the program that created this file to select significant patterns.
sequence-pvalue-cutoff
A cutoff used by the program that created this file to select sequences with significant overall matches to a pattern. Several matched-elements in a sequence may contribute to a sequence's overall significance.
site-pvalue-cutoff
A cutoff used by the program that created this file to select sequence elements with significant matches to a pattern.
sequence-filtering
Describes the sequence filtering method used (if any).
Attributes
on-off
Takes one of two values: "on" if sequence filtering was used, "off" if it was not.
type
Contains the name of the sequence filtering program used.
multi-pattern-scan
Can contain a group of patterns used to scan one or more sequences to detect clusters of patterns.
Attributes
pvalue
The probability that a group of patterns exists in the sequences scanned.
score
A score for the match between this pattern group and the group of sequences scanned.
pattern
A pattern used to search one or more sequences. Pattern includes a group of scanned-sequences and may include user defined tags describing the pattern used (if they are defined under a separate namespace).
Attributes
accession
A unique identifier for this pattern. E.g. a TRANSFAC accession: M00192.
name
The name of the pattern. E.g. CREB binding site.
db
The name of the database supplying the pattern (if any). E.g. TRANSFAC
lsid
The full Life Sciences ID string (http://www.i3c.org/wgr/ta/resources/lsid/docs/index.asp) of the pattern (if available).
pvalue
The probability that this pattern exists in the one or more sequences it contains.
score
A score representing the quality of the match between this pattern and sequences it contains.
scanned-sequence
A sequence scanned by the enclosing pattern.
Attributes
accession
A unique identifier of the sequence scanned. In the case of promoter sequences, it may be the accession of the gene associated with the scanned promoter.
name
The name of this sequence.
db
The name of the database supplying the sequence (if any). E.g. genbank.
lsid
The full Life Sciences ID string (http://www.i3c.org/wgr/ta/resources/lsid/docs/index.asp) of the sequence (if available).
score
A score representing the quality of the match between this sequence and the pattern it was scanned with.
pvalue
The probability that the element represented by the containing pattern exists in this sequence.
length
The length of the sequence.
matched-element
The actual sequence element detected in a sequence by a pattern.
Attributes
start
The first position of the pattern in the sequence.
stop
The last position of the pattern in the sequence. For matches on the opposite strand of DNA, the stop position will be less than the start position.
score
A score representing the quality of the match between this element and the pattern that detected it.
pvalue
The probability associated with the match between this element and the pattern that detected it.
clusterid
If matched-elements found within one sequence are grouped into distinct clusters, this attribute may denote which cluster each matched-element belongs to.
sequence
The sequence of the matched-element. For example, the 12 bases matched by a transcription factor binding site matrix.
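
The element hierarchy described above (cis-element-search, pattern, scanned-sequence, matched-element) can be walked with any namespace-aware XML parser. The following is a minimal sketch in Python using only the standard library. The embedded document is hypothetical: the program name, file names, accessions, positions, and scores are invented for illustration, and the fragment has not been checked against the CisML schema.

```python
import xml.etree.ElementTree as ET

# The CisML namespace, as given in the W3C Schema description.
NS = {"c": "http://zlab.bu.edu/schema/cisml"}

# Hypothetical CisML document; every value here is invented for illustration.
EXAMPLE = """<cis-element-search xmlns="http://zlab.bu.edu/schema/cisml">
  <program-name>example-scanner</program-name>
  <parameters>
    <pattern-file>patterns.txt</pattern-file>
    <sequence-file>promoters.fa</sequence-file>
    <site-pvalue-cutoff>0.001</site-pvalue-cutoff>
  </parameters>
  <pattern accession="M00192" name="hypothetical pattern" db="TRANSFAC">
    <scanned-sequence accession="SEQ1" name="promoter 1" length="1000">
      <matched-element start="101" stop="112" score="0.91" pvalue="0.0004">
        <sequence>TGACGTCATGAC</sequence>
      </matched-element>
      <matched-element start="401" stop="412" score="0.88" pvalue="0.0009">
        <sequence>TGACGTAATGAC</sequence>
      </matched-element>
    </scanned-sequence>
  </pattern>
</cis-element-search>"""


def list_matches(xml_text):
    """Return one tuple per matched-element:
    (pattern accession, sequence accession, start, stop, pvalue, site)."""
    root = ET.fromstring(xml_text)
    rows = []
    for pattern in root.findall("c:pattern", NS):
        for seq in pattern.findall("c:scanned-sequence", NS):
            for hit in seq.findall("c:matched-element", NS):
                rows.append((
                    pattern.get("accession"),
                    seq.get("accession"),
                    int(hit.get("start")),
                    int(hit.get("stop")),
                    float(hit.get("pvalue")),
                    hit.findtext("c:sequence", default="", namespaces=NS),
                ))
    return rows


if __name__ == "__main__":
    for row in list_matches(EXAMPLE):
        print("\t".join(str(field) for field in row))
```

Because every CisML element lives in the namespace above, the lookups must be namespace-qualified; an unqualified search for "pattern" would find nothing.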

Style Sheets

cisml.pattern.text.xsl
Generates a text report of the P-Values for matches of a series of patterns to a group of sequences.
cisml.pattern.html.xsl
Generates an html report of the P-Values for matches of a series of patterns to a group of sequences.
cisml.sequence.text.xsl
Generates a text report of the patterns detected in each of a group of sequences.
cisml.sequence.html.xsl
Generates an html report of the patterns detected in each of a group of sequences.
cisml.sequence.graphics.in.pdf.xsl
Generates a PDF report of the patterns detected in each of a group of sequences. The locations of each pattern are depicted graphically using SVG graphics. Generating a PDF using this stylesheet requires a print formatter, such as FOP, that understands XSL Formatting Objects.
cisml.css
A simple Cascading Style Sheet used to style the output of cisml.pattern.html.xsl and cisml.sequence.html.xsl.

Recommended XSLT processors

SAXON
SAXON is a Java-based XSLT processor. Version 7 implements new features of XSLT 2.0 which are necessary for cisml.sequence.text.xsl and cisml.sequence.html.xsl.
FOP
FOP is a Java-based print formatter that uses XSL Formatting Objects (XSL-FO) to render XML documents in a variety of formats, including PDF.
LibXML and LibXSLT
LibXML and LibXSLT are C libraries for parsing XML and processing XSLT documents. They are commonly included as standard in Linux distributions and are available for use from C and Perl. These libraries can also be used to process XSLT on the command line with the xsltproc program. xsltproc is significantly faster than Java-based programs, but does not currently implement XSLT 2.0.

Sample Reports

Style Sheets Applied to Example Data Files

                                      Example Data File
Stylesheet                            cisml.xml    cisml.multiscan.xml
cisml.pattern.text.xsl                TXT          TXT
cisml.pattern.html.xsl                HTML         HTML
cisml.sequence.text.xsl               TXT          TXT
cisml.sequence.html.xsl               HTML         HTML
cisml.sequence.graphics.in.pdf.xsl    PDF          PDF

Reports for cisml.multiscan.xml are simple to produce with the XSLT 2.0 instruction "for-each-group" and are not straightforward otherwise. We therefore recommend SAXON version 7 or later to process cisml.sequence.graphics.in.pdf.xsl and create a Formatting Objects file. FOP can then use this Formatting Objects file to produce PDF output. The following UNIX commands demonstrate the process:
java -jar saxon7.jar cisml.xml cisml.sequence.graphics.in.pdf.xsl > foo.fo
fop.sh -fo foo.fo -pdf cisml.sequence.graphics.pdf
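
For readers unfamiliar with "for-each-group", the kind of regrouping it performs can be pictured outside XSLT. The Python sketch below is not taken from the stylesheets. It assumes the grouping of interest is by scanned sequence, so that matches stored pattern by pattern are collected per sequence, and it uses a trimmed-down hypothetical document (invented accessions and positions, not schema-checked).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

CISML_NS = "http://zlab.bu.edu/schema/cisml"
NS = {"c": CISML_NS}

# Hypothetical, trimmed-down document: two patterns, two sequences.
DOC = """<cis-element-search xmlns="http://zlab.bu.edu/schema/cisml">
  <program-name>example-scanner</program-name>
  <pattern accession="P1" name="pattern one">
    <scanned-sequence accession="SEQ1" name="promoter 1">
      <matched-element start="10" stop="21"/>
    </scanned-sequence>
    <scanned-sequence accession="SEQ2" name="promoter 2">
      <matched-element start="55" stop="66"/>
    </scanned-sequence>
  </pattern>
  <pattern accession="P2" name="pattern two">
    <scanned-sequence accession="SEQ1" name="promoter 1">
      <matched-element start="40" stop="49"/>
    </scanned-sequence>
  </pattern>
</cis-element-search>"""


def matches_by_sequence(xml_text):
    """Regroup pattern-major CisML into {sequence accession: sorted hits}.

    Each hit is a (start, stop, pattern accession) tuple.
    """
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    # iter() also reaches pattern elements nested inside multi-pattern-scan.
    for pattern in root.iter("{%s}pattern" % CISML_NS):
        for seq in pattern.findall("c:scanned-sequence", NS):
            for hit in seq.findall("c:matched-element", NS):
                grouped[seq.get("accession")].append(
                    (int(hit.get("start")), int(hit.get("stop")),
                     pattern.get("accession"))
                )
    return {acc: sorted(hits) for acc, hits in grouped.items()}


if __name__ == "__main__":
    for accession, hits in sorted(matches_by_sequence(DOC).items()):
        print(accession, hits)
```

In XSLT 1.0 the same regrouping typically needs key-based workarounds, which is why an XSLT 2.0 processor is the simpler route for the sequence-oriented reports.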

Converting Other Formats to CisML

The best way to use CisML is to modify motif search programs to output CisML directly. When that is not feasible, one may parse another format to produce CisML. We provide a few examples:

Program         Program Output    Parser    CisML    CisML to Text
tfscan          TXT               perl      XML      TXT
MatInspector    TXT               perl      XML      TXT
CBUST           TXT               perl      XML      TXT
POSSUM          TXT               perl      XML      TXT
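
The parsers listed above are written in Perl and are not reproduced here. As a rough illustration of the same idea, the Python sketch below converts a hypothetical tab-separated hit list (columns: pattern, sequence, start, stop, score, site) into CisML. The input format and the function name hits_to_cisml are assumptions made for this example and do not correspond to the output of any of the programs in the table. A real converter would also fill in the parameters element, and the result should be validated against the schema.

```python
import xml.etree.ElementTree as ET

CISML_NS = "http://zlab.bu.edu/schema/cisml"
# Serialize CisML elements in the default namespace (no prefix).
ET.register_namespace("", CISML_NS)


def _q(tag):
    """Qualify a tag name with the CisML namespace."""
    return "{%s}%s" % (CISML_NS, tag)


def hits_to_cisml(program_name, hit_lines):
    """Build a CisML string from hypothetical tab-separated hit lines.

    Each line: pattern, sequence, start, stop, score, site.
    Blank lines and lines starting with '#' are skipped.
    """
    root = ET.Element(_q("cis-element-search"))
    ET.SubElement(root, _q("program-name")).text = program_name
    patterns = {}  # pattern id -> <pattern> element
    scanned = {}   # (pattern id, sequence id) -> <scanned-sequence> element
    for line in hit_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        pat, seq, start, stop, score, site = line.split("\t")
        if pat not in patterns:
            patterns[pat] = ET.SubElement(
                root, _q("pattern"), {"accession": pat, "name": pat})
        if (pat, seq) not in scanned:
            scanned[(pat, seq)] = ET.SubElement(
                patterns[pat], _q("scanned-sequence"),
                {"accession": seq, "name": seq})
        hit = ET.SubElement(
            scanned[(pat, seq)], _q("matched-element"),
            {"start": start, "stop": stop, "score": score})
        ET.SubElement(hit, _q("sequence")).text = site
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    demo = [
        "# pattern\tsequence\tstart\tstop\tscore\tsite",
        "P1\tS1\t5\t16\t0.9\tACGTACGTACGT",
        "P2\tS1\t7\t12\t0.7\tGATAAG",
    ]
    print(hits_to_cisml("demo-scanner", demo))
```

Hits are nested under their pattern first and their sequence second, mirroring the pattern, scanned-sequence, matched-element hierarchy of CisML.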

Extending CisML

The CisML Schema is defined in such a way that you may add your own elements to include specific features of a program that may not be accounted for in CisML. The parameters, pattern, multi-pattern-scan, scanned-sequence, and matched-element elements can include elements you define if they are declared under a different namespace. Here is an example taken from the output of our parser for MatInspector:

First, add a namespace declaration to the start of the file (the xmlns:mi line below):
<?xml version="1.0"?>

<cis-element-search
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:mi="MatInspector"
xsi:schemaLocation="http://zlab.bu.edu/schema/cisml cisml.xsd"
xmlns="http://zlab.bu.edu/schema/cisml"
>
            <program-name>MatInspector</program-name>
...

The namespace identifier ("MatInspector") needn't correspond to any real defined schema, but it would be better for users of your data if it did.
Second, add your tags AFTER the CisML tags and make sure to use the namespace prefix ("mi" in this case):

<matched-element start="8203" stop="8224" score="0.876">
    <sequence>cgatgtcatagagtACGTgtca</sequence>
    <mi:cor-sim>0.976</mi:cor-sim>
</matched-element>

The <mi:cor-sim> element denotes the match of an element to the core motif (called the "Core Similarity") as calculated by MatInspector.
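
A consumer that knows about the extension can read it by mapping a prefix to the same namespace identifier, while consumers that do not know about it can simply ignore it. The Python sketch below shows this under stated assumptions: the matched-element is the one from the example above, but the enclosing pattern and scanned-sequence attributes are hypothetical placeholders added only to make the fragment a complete document.

```python
import xml.etree.ElementTree as ET

CISML_NS = "http://zlab.bu.edu/schema/cisml"
# "MatInspector" is the namespace identifier used in the example above.
NS = {"c": CISML_NS, "mi": "MatInspector"}

# The pattern and scanned-sequence attributes are hypothetical placeholders.
FRAGMENT = """<cis-element-search
    xmlns="http://zlab.bu.edu/schema/cisml"
    xmlns:mi="MatInspector">
  <program-name>MatInspector</program-name>
  <pattern accession="hypothetical-matrix" name="hypothetical matrix">
    <scanned-sequence accession="hypothetical-seq" name="hypothetical promoter">
      <matched-element start="8203" stop="8224" score="0.876">
        <sequence>cgatgtcatagagtACGTgtca</sequence>
        <mi:cor-sim>0.976</mi:cor-sim>
      </matched-element>
    </scanned-sequence>
  </pattern>
</cis-element-search>"""


def core_similarities(xml_text):
    """Return (start, score, core similarity) for each matched-element.

    The core similarity is None when no mi:cor-sim child is present.
    """
    root = ET.fromstring(xml_text)
    results = []
    for hit in root.iter("{%s}matched-element" % CISML_NS):
        core = hit.findtext("mi:cor-sim", namespaces=NS)
        results.append((
            int(hit.get("start")),
            float(hit.get("score")),
            None if core is None else float(core),
        ))
    return results


if __name__ == "__main__":
    print(core_similarities(FRAGMENT))
```

The prefix used by the reader does not have to be "mi"; only the namespace identifier it maps to must match the one declared in the file.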

Programs that use CisML

ROVER
The promoter analysis component of CARRIE uses CisML. A Web Server for ROVER is under construction.
Other programs from ZLAB are being converted to CisML.
Last updated 2-18-04
