“A Path from RDF to XTM and back for BioPAX”
Name:
Nada Hashmi
Email Address:
Introduction:
Scientists seeking to
understand the inner workings of cells have access to a multitude of pathway
data resources. However, the representations of pathway data within these
resources are neither consistent nor interchangeable. Furthermore, the databases
containing information about biological pathways are almost as many and varied
as the pathways themselves. [1, 9, 10] To counter these problems, a broad effort in the
biopathways community called BioPAX was formed in August 2002. Its participants
decided that creating a standard data exchange format for pathway information
was not only a good first step toward building an open source pathway
information resource, but also a desirable end in itself, because it would
facilitate sharing of pathway information
between existing databases, both public and private. Since that time, BioPAX
has grown into a large, active open source community and has released two of
four levels of the specification for biological pathway data representation. [2]
BioPAX (Biological Pathway
data-eXchange format) is a specification in OWL/RDF for representing
signal transduction, gene regulation, molecular interaction and metabolic
pathways to enable coherent and reliable queries across multiple
databases. The use of these semantic web technologies enables
web services to seamlessly query and join BioPAX data from multiple locations.
[3] At the
same time, while RDF works better with OWL and with tools for inferencing, XTM
(XML Topic Maps) has many more tools available to help build, merge, and change
metadata files. Building, merging, and changing such files is the actual focus
of most work in BioPAX today, and OWL can only validate the result. [4] For this reason, a
conversion toolkit that converts BioPAX to and from XTM must be
created.
Lexikos Corporation, an
R&D consultancy focused on intelligent Java software components, has been a
major player in creating a toolkit for BioPAX to help with change management,
pathway model integration, data provenance tracking, etc. [5] This project aims to
help build one part of that toolkit: an open source RDF and XTM conversion
API mediated by the BioPAX ontology.
Project Objectives and Deliverables:
Format conversion from
OWL to XTM or vice versa in the general case is a hard problem. [4, 6] An RDF triple
can be mapped to at least six different topic map constructs, and without prior
knowledge of the semantics, the correct choice cannot be made. For the
same reason, mapping topic map characteristics to RDF results in a loss of
higher level semantics. While many solutions and approaches have been
outlined, Garshol states that the missing information must be added from another
source. In his approach, he defines a formal vocabulary to describe exactly
what is missing, and a "mapping file" that uses it, to create a
utility ("plugin") in his tool Omnigator that can map RDF files into
his core Topic Map engine. However, Omnigator is a closed source tool. [7]
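To make the role of such a mapping file concrete, here is a minimal Python
sketch. It is not Garshol's vocabulary or the Omnigator plugin: the MAPPING
table, the "ex:" property names and the construct labels are all hypothetical,
chosen only to show how an explicit mapping removes the ambiguity of a triple.

```python
# Hypothetical mapping table: RDF predicate -> topic map construct.
# The predicate names and construct labels are illustrative assumptions,
# not the vocabulary defined by Garshol or the properties used by BioPAX.
MAPPING = {
    "ex:name": "name",
    "ex:comment": "occurrence",
    "ex:participant": "association",
    "rdf:type": "instance-of",
}


def classify(triple, mapping=MAPPING):
    """Return the topic map construct chosen for one (s, p, o) triple.

    Without an entry in the mapping the predicate is ambiguous, so the
    function refuses to guess and raises KeyError instead.
    """
    _subject, predicate, _obj = triple
    return mapping[predicate]


def convert(triples, mapping=MAPPING):
    """Group triples by the construct the mapping assigns to them."""
    result = {}
    for triple in triples:
        result.setdefault(classify(triple, mapping), []).append(triple)
    return result


triples = [
    ("ex:glycolysis", "rdf:type", "ex:Pathway"),
    ("ex:glycolysis", "ex:name", "Glycolysis"),
    ("ex:glycolysis", "ex:participant", "ex:glucose"),
]
grouped = convert(triples)
```

A predicate that is absent from the table raises an error rather than being
guessed, which mirrors the point above: the missing semantics have to come
from another source.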
The objectives and
deliverables of this project are based on the approach by Garshol:
1) To create a ‘mapping file’ for RDF conversion to and from XTM for
BioPAX data.
2) To create and develop an open source toolkit to help facilitate an
automatic conversion.
3) To create an API to help researchers modify and create mappings
and conversions for BioPAX to and from their personal format.
4) To test and analyze the toolkit for correctness, accuracy and
reliability.
5) To make this toolkit available as open source to the community
under the GNU LGPL. [8]
Preliminary Work:
The BioPAX ontology
supports a wide range of detail and multiple levels of abstraction to permit
support for a variety of data models. Currently, semantic mapping is done
manually: mapping the database fields from the native database format to
the BioPAX format typically requires at least one expert from the source
database and at least one expert in BioPAX. So great was the need to develop
mappings to and from BioPAX that over 40 volunteers helped develop various
levels and are continuing to help develop BioPAX. Contributions have been made
by representatives from over 8 different databases, including aMAZE [15],
BioCyc [16], BIND [17], eMIM [18], INOH [19], PATIKA [20], Reactome [21] and
WIT/PUMA2 [22]. Organizations such as the Proteomics Standards Initiative [23],
Chemical Markup Language [24], SBML [25] and CellML [26] have also made
significant contributions.
Despite the dire need,
work on BioPAX continues to be primarily volunteer-based. The volunteers are
only able to offer part-time help, even though the workload calls for full-time
staff members and resources. Without full-time staff members for BioPAX,
precious time is being wasted that otherwise could be spent to help save lives.
Significance:
Cancer continues to be
one of the leading causes of death all over the world. The American Cancer
Society reported nearly 1.4 million expected cases and nearly 500,000 deaths
in 2005 in the United States alone. [27]
BioPAX and our project will further cancer research by significantly reducing
the time and resources necessary for functional genomics. Automatic
conversions that can be modified to suit individual cases free scientists
from work that is tedious, monotonous and prone to human error
when performed manually. BioPAX can also have an impact on environmental research,
specifically microbial research in which the metabolic pathways of prokaryotes are
studied for use in bioremediation.
The results of this
project will also be a contribution to the following communities:
1) BioPAX Community: Multiple export formats of OWL-validated models
make BioPAX more widely useful and easier to integrate with software designed
around other standards, paradigms or languages, such as CellML, Jess, Prolog,
Lisp, KIF, etc.
2) Biological Pathways Community: A flexible automatic conversion
toolkit will allow the scientific community to save time, money and resources
whilst being able to focus on the biological research.
3) RDF/OWL/XTM Communities: Multiple import formats make it easier to
blend metadata across paradigms. Furthermore, the results of this project will
allow a better understanding of the intricate problems of semantic format
conversions. The analysis could potentially lead to innovative methods and
solutions for this problem domain.
Project Implementation Details:
The conversion from RDF
to XTM is a five-step methodology:
1) RDF to Jena
2) Jena to Java+
3) Java+ to TMAPI
4) TMAPI to TinyTIM
5) TinyTIM to XTM*
*For a complete
description of the specified technologies, please refer to the Glossary at the
end of the proposal.
"Java+", above
and below, means "Java objects in the porting code, as augmented by any
specially-built mapping ontology". It is at this step that a mapping
ontology/file will be created to help mediate the process. This mapping ontology
will be based on BioPAX.
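The shape of this five-step path can be sketched in a few lines of Python.
This is only an illustration of how the stages hand data to one another: it
uses the standard library in place of Jena, TMAPI and TinyTIM, the MAPPING
table and "ex:" predicates are hypothetical, and the output is a simplified
XTM-like document rather than schema-valid XTM.

```python
import xml.etree.ElementTree as ET

# Assumed mapping of predicates to topic map constructs (illustrative only).
MAPPING = {"ex:name": "name", "ex:comment": "occurrence"}


def parse_rdf(lines):
    """Step 1 stand-in: read 'subject predicate object' lines into triples."""
    return [tuple(line.split(" ", 2)) for line in lines]


def annotate(triples):
    """Step 2 stand-in: attach the mapped construct to every triple."""
    return [(s, MAPPING[p], o) for s, p, o in triples]


def build_topics(annotated):
    """Steps 3-4 stand-in: collect names and occurrences per topic."""
    topics = {}
    for subject, construct, value in annotated:
        topic = topics.setdefault(subject, {"name": [], "occurrence": []})
        topic[construct].append(value)
    return topics


def serialize(topics):
    """Step 5 stand-in: write a simplified, XTM-like XML string."""
    root = ET.Element("topicMap")
    for topic_id, data in topics.items():
        topic = ET.SubElement(root, "topic", id=topic_id)
        for name in data["name"]:
            ET.SubElement(topic, "baseName").text = name
        for text in data["occurrence"]:
            ET.SubElement(topic, "occurrence").text = text
    return ET.tostring(root, encoding="unicode")


rdf_lines = [
    "glycolysis ex:name Glycolysis",
    "glycolysis ex:comment Breaks down glucose",
]
xtm = serialize(build_topics(annotate(parse_rdf(rdf_lines))))
```

The mapping is consulted only in the second stage, which is where the "Java+"
objects would carry the extra semantics in the proposed toolkit.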
The conversion from XTM
to RDF is also a five-step methodology; however, different technologies are
used at key steps:
1) XTM to TM4J
2) TM4J to TMAPI+
3) TMAPI+ to Java+
4) Java+ to Jena
5) Jena to RDF
"TM4J" here
replaces "TinyTIM" as the Topic Map engine, and "TMAPI+"
replaces "TMAPI". These changes go together, because TinyTIM
does not support TMAPI+ (which is useful). TM4J is also a generally nicer engine:
it is better supported and more scalable. It has a bigger open source support
group and a much larger user community, so any related questions are more
likely to get answered.
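The reverse direction can be sketched the same way. Again this is a hedged
illustration in Python using only the standard library, not TM4J, TMAPI+ or
Jena: the INVERSE table, the "ex:" predicates and the simplified XTM-like
input are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed inverse mapping: topic map construct -> RDF predicate.
INVERSE = {"baseName": "ex:name", "occurrence": "ex:comment"}


def read_topics(xtm_text):
    """Steps 1-2 stand-in: parse simplified XTM-like XML into tuples."""
    root = ET.fromstring(xtm_text)
    for topic in root.findall("topic"):
        for child in topic:
            yield topic.get("id"), child.tag, child.text


def to_triples(characteristics):
    """Steps 3-4 stand-in: turn each characteristic into an RDF triple."""
    return [(tid, INVERSE[tag], text) for tid, tag, text in characteristics]


def write_rdf(triples):
    """Step 5 stand-in: emit one 'subject predicate object' line each."""
    return [" ".join(triple) for triple in triples]


xtm = (
    '<topicMap><topic id="glycolysis">'
    "<baseName>Glycolysis</baseName>"
    "<occurrence>Breaks down glucose</occurrence>"
    "</topic></topicMap>"
)
rdf_lines = write_rdf(to_triples(read_topics(xtm)))
```

Because the inverse table is the mirror image of the forward mapping, a
document converted one way and then back should yield the same statements,
which is one simple check the testing objective could use.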
Project Members:
Additional Project Mentors:
Dan Corwin got his
Master's at MIT in Systems Modeling (ME), then fell in love with AI. He
has spent the two decades since devising associative models of human memory that
serialize into English clauses, seeking a theory to explain why natural
language is so effortless. In other work, he wrote commercial word
processors for Wang and Lotus; simulated NASA aerospace vehicles; created
several scripting languages; and started five small software companies.
His current work targets a web toolkit for mapping paragraphs and biotech
databases into Topic Maps and RDF, with special focus on the OWL-constrained cellular
pathway models of BioPAX, one of the first industrial-scale results of the
semantic web.
Dr. Joanne Luciano has
been an active lead in BioPAX, the BioPathways Consortium, and the emerging
Semantic Web for Life Sciences. She is one of the developers of a pathway
ontology that is likely to become the standard for all pathway-related
bio-research. She is an authority in pathway modeling, familiar with systems
such as EcoCyc, BIND, WIT, KEGG, SBML, and CellML. She is one of the developers
of an OWL-based version of the BioPAX ontology (using Stanford's Protégé
knowledge environment) and its uses within the RDF framework. Her
familiarity with the processes involved in these initiatives, the tools and the
science, together with her contact with the community of researchers in the multiple
disciplines involved, gives her a unique combination of skills, contacts,
knowledge and know-how, which she holds from an unbiased position.
Jeremy Zucker graduated
in 1997 with degrees in Computer Science and Applied Mathematics from the
University of Colorado. Soon after, Jeremy began working as a research
scientist at the MIT Artificial Intelligence Lab under Gerald J. Sussman and Hal
Abelson, where he developed biologically-inspired programming paradigms.
In 1999, Jeremy left MIT to co-found Proprium Enterprises, a hedge fund
based in New York and Sydney, Australia. As Chief Research Scientist,
Jeremy designed and implemented web services to enable real-time investment
analysis among other Y2K analyses. In 2002, Jeremy began working as a
bioinformatics specialist at the Dana-Farber Cancer Institute, and a
computational biologist for the Church lab at Harvard Medical School. He
currently leads a team of developers for DARPA’s BioSPICE project, the DOE
Genomes to Life initiative, and BioPAX. His teaching experience includes curriculum
assistant for the Harvard Department of Systems Biology's introductory course,
and the Intelligent Systems in Molecular Biology (ISMB) 2005 tutorial on
Semantic Aggregation, Integration, and Inference of Biological Pathways.
Glossary of Technologies:
XML: eXtensible Markup Language
In HTML, tags give
meaning and structure to a web document. XML is similar to HTML, except that,
in XML, you can define the tags (this is why it is called extensible). The
structure of an XML document is specified by an XML schema or a document type
definition (DTD), which describes the XML tags and the allowable
structure of the document.
RDF: Resource Description Framework
RDF can be written in XML syntax
and formatted as an XML document (angle-bracketed tagged data); however, that is
where the similarity ends. Whereas XML documents can be represented as trees,
RDF documents are graphs. All data in RDF are described using
subject-predicate-object triples, which define the ‘semantics’ or ‘domain logic’
needed to connect various data items and specify their relationship to each
other. The resources that RDF describes are identified by uniform resource
identifiers (or URIs), which resemble web addresses. This is where the power of RDF comes
from. Each triple states a single subject-predicate-object ‘fact’; if one
assembles all the facts and they reuse the same URIs, then what results is
the globally distributed network called the semantic web. [12]
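The way shared URIs link separate facts into one graph can be shown with a
small Python sketch. The two data sources and every http://example.org/ URI
below are hypothetical and exist only for illustration.

```python
# Two hypothetical data sources that mention the same resource by URI.
source_a = {
    ("http://example.org/glucose", "http://example.org/name", "Glucose"),
}
source_b = {
    ("http://example.org/glycolysis", "http://example.org/consumes",
     "http://example.org/glucose"),
}

# Because both sources reuse one URI for glucose, a plain set union is
# enough to connect their facts into a single graph.
graph = source_a | source_b


def describe(graph, uri):
    """Return every triple in which the URI is the subject or the object."""
    return sorted(t for t in graph if uri in (t[0], t[2]))


facts = describe(graph, "http://example.org/glucose")
```

Here `facts` holds one triple from each source, even though neither source
knew about the other.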
OWL: Web Ontology Language
OWL, which is defined
using RDF, is a language designed for ontology construction and deployment. OWL
adds the required semantic constraints on the RDF language(s) used for data
documents. Together, RDF and OWL form a logic model that can be used throughout
either data repositories or knowledge bases and inference engines. The
structure and meaning of any group of documents can be precisely defined and
related to each other. For example, similar models, one using ‘mother’ and the
other using ‘female parent’, can be semantically linked using statements such
as owl:equivalentClass. Similarly, one could use another OWL statement,
owl:disjointWith, to describe the fact that the two concepts ‘mother’ and
‘father’ are disjoint (i.e. no one can be both a mother and a father). One
could also use cardinality constraints to restrict the number of mothers one
may have to exactly one, or one could create subclasses of mother to be
surrogate, step, biological and genetic. The advantage of OWL is that you get
these rich semantics in a machine-readable format. [11]
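The effect of the equivalence and disjointness statements in this example can
be imitated in plain Python. This is a toy, closed-world check and not an OWL
reasoner (real OWL tools use open-world semantics and far richer inference);
the class names, individuals and both tables are hypothetical.

```python
# Assumed declarations, written as plain Python data for illustration.
EQUIVALENT = {"FemaleParent": "Mother"}     # alias -> canonical class name
DISJOINT = [frozenset({"Mother", "Father"})]


def canonical(cls):
    """Replace a class by its declared equivalent, if there is one."""
    return EQUIVALENT.get(cls, cls)


def instances_of(cls, types):
    """Individuals typed with the class or with an equivalent class."""
    target = canonical(cls)
    return sorted(ind for ind, c in types if canonical(c) == target)


def disjointness_violations(types):
    """Individuals asserted to belong to two classes declared disjoint."""
    classes = {}
    for ind, c in types:
        classes.setdefault(ind, set()).add(canonical(c))
    return sorted(ind for ind, cs in classes.items()
                  if any(pair <= cs for pair in DISJOINT))


types = [
    ("alice", "Mother"),
    ("beth", "FemaleParent"),
    ("bob", "Father"),
    ("sam", "Mother"),
    ("sam", "Father"),
]
mothers = instances_of("FemaleParent", types)
conflicts = disjointness_violations(types)
```

A query for either class name returns the same individuals, and "sam" is
flagged because the data place that individual in two disjoint classes.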
Topic maps are an ISO
standard for the representation and interchange of knowledge, with an emphasis
on the findability of information. The standard is formally known as ISO/IEC
13250:2003. Topic maps are based on topics, associations and occurrences.
[4]
XTM: XML Topic Map
As defined in the XTM
specification, a topic is an addressable resource within a computer that stands
in for (or “reifies”) some real-world subject. A topic map conveys knowledge
about those resources (and therefore about the subjects they reify) through a
superimposed layer, or map, of the resources. Furthermore, a topic map captures
information about subjects, and the relationships between subjects, in a way
that is implementation-independent.
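One consequence of topics standing in for subjects is that two topic maps can
be merged: topics that identify the same subject are combined into one. The
Python sketch below shows that idea under simplifying assumptions. The Topic
class, its fields and the example URIs are hypothetical, and the merging rules
of the actual standard cover more cases than this.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    subject: str                                  # URI identifying the subject
    names: set = field(default_factory=set)
    occurrences: set = field(default_factory=set)


def merge(*topic_maps):
    """Combine topic maps, unifying topics that share a subject URI."""
    merged = {}
    for topic_map in topic_maps:
        for topic in topic_map:
            target = merged.setdefault(topic.subject, Topic(topic.subject))
            target.names |= topic.names
            target.occurrences |= topic.occurrences
    return merged


map_a = [Topic("http://example.org/glucose", {"Glucose"})]
map_b = [
    Topic("http://example.org/glucose", {"D-glucose"}, {"A simple sugar"}),
    Topic("http://example.org/atp", {"ATP"}),
]
combined = merge(map_a, map_b)
```

After merging, the glucose topic carries both names and the occurrence, while
the ATP topic is carried over unchanged.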
Jena is an open source
Java framework to help build semantic web applications. The Jena Framework
includes:
· A RDF API
· Reading and writing RDF in RDF/XML, N3 and
N-Triples
· An OWL API
· In-memory and persistent storage
· RDQL – a query language for RDF
TMAPI:
Common Topic Map Application Programming Interface
TMAPI
is a programming interface for accessing and manipulating data held in a topic
map. The TMAPI specification defines a set of core interfaces which must be
implemented by a compliant application as well as a set of additional
interfaces which may be implemented by a compliant application or which may be
built upon the core interfaces. TMAPI has been developed in an open
process by developers working on topic map processors and topic map
applications and placed into the public domain. There are no restrictions on
its use.
TinyTIM: TINY Topic Map
Engine & TMAPI implementation
TinyTIM is a very small,
easy to use, in memory Topic Map engine. It implements the TMAPI interfaces, so
one can work with Topic Maps via the TMAPI standard. TMAPI will be for XTM what
DOM is for XML.
TM4J Engine: a topic map
processing engine written in Java providing a pure Java API, support for the
Tolog query language, support for importing XTM and LTM syntaxes, support for
exporting XTM syntax, and persistence of topic map information in a wide variety of
databases.
References: