“A Path from RDF to XTM and back for BioPAX”

 

Name:

Nada Hashmi

 

Email Address:

nadahashmi@gmail.com

 

Introduction:

 

Scientists seeking to understand the inner workings of cells have access to a multitude of pathway data resources. However, the representations of pathway data within these resources are neither consistent nor interchangeable. Furthermore, the databases containing information about biological pathways are almost as numerous and varied as the pathways themselves. [1, 9, 10] To counter these problems, a broad effort in the biopathways community called BioPAX was formed in August 2002. Its founders decided that a standard data exchange format for pathway information was not only a good first step toward building an open source pathway information resource, but also a desirable end in itself, because it would facilitate sharing of pathway information between existing databases, both public and private. Since then, BioPAX has grown into a large, active open source community and has released two of four levels of the specification for biological pathway data representation. [2]

 

BioPAX (Biological Pathway data-eXchange format) is a specification in OWL/RDF for representing signal transduction, gene regulation, molecular interaction and metabolic pathways to enable coherent and reliable queries across multiple databases.  The use of these semantic web technologies enables web services to seamlessly query and join BioPAX data from multiple locations. [3] However, while RDF is better served by OWL and by tools for inferencing, XTM (XML Topic Maps) has many more tools available to help build, merge, and change metadata files.  Building, merging, and changing such files is the actual focus of most work in BioPAX today, whereas OWL can only validate the result. [4] For this reason, a conversion toolkit that converts BioPAX data to and from XTM must be created.

 

Lexikos Corporation, an R&D consultancy focused on intelligent Java software components, has been a major contributor to a toolkit for BioPAX that supports change management, pathway model integration, data provenance tracking, and related tasks. [5] This project aims to build one part of that toolkit: an open source RDF and XTM conversion API mediated by the BioPAX ontology.

 

Project Objectives and Deliverables:

 

Format conversion from OWL to XTM or vice versa is a hard problem in the general case. [4, 6]  An RDF triple can be mapped to at least six different topic map constructs, and without prior knowledge of the semantics the correct choice cannot be made.  For the same reason, mapping topic map characteristics to RDF results in a loss of higher-level semantics.  While many solutions and approaches have been outlined, Garshol states that the missing information must be added from another source. In his approach, he defines a formal vocabulary to describe exactly what is missing and a "mapping file" that uses it, and builds on these a utility ("plugin") for his tool Omnigator that maps RDF files into its core Topic Map engine.  However, Omnigator is a closed source tool. [7]
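
The role of such a mapping file can be illustrated with a minimal sketch. It is written in plain Python rather than against Omnigator or any real RDF or Topic Map library, and the predicate names (bp:NAME, bp:COMMENT, bp:PARTICIPANTS) and construct labels are illustrative assumptions, not Garshol's actual vocabulary. The point is only that the topic map construct is looked up per predicate instead of being guessed from the triple itself.

```python
# Hypothetical mapping table: RDF predicate -> topic map construct.
# In the proposed toolkit this information would come from the mapping file.
MAPPING = {
    "rdf:type": "instance-of",
    "bp:NAME": "name",
    "bp:COMMENT": "occurrence",
    "bp:PARTICIPANTS": "association",
}


def map_triple(triple, mapping=MAPPING):
    """Describe the topic map construct that one RDF triple should become."""
    subject, predicate, obj = triple
    if predicate not in mapping:
        # Without a declared mapping the correct construct cannot be chosen.
        raise KeyError(f"no mapping declared for predicate {predicate!r}")
    return {
        "construct": mapping[predicate],
        "topic": subject,
        "type": predicate,
        "value": obj,
    }


# Illustrative triples; the identifiers are placeholders, not real BioPAX data.
triples = [
    ("ex:glycolysis", "rdf:type", "bp:pathway"),
    ("ex:glycolysis", "bp:NAME", "Glycolysis"),
    ("ex:reaction1", "bp:PARTICIPANTS", "ex:glucose"),
]
constructs = [map_triple(t) for t in triples]
for c in constructs:
    print(c["topic"], "->", c["construct"])
```

A predicate that is absent from the table raises an error instead of falling back to a default, which mirrors the observation that the correct choice cannot be made without prior knowledge of the semantics.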

 

The objectives and deliverables of this project are based on the approach by Garshol:

1)      To create a ‘mapping file’ for RDF conversion to and from XTM for BioPAX data.

2)      To develop an open source toolkit that automates the conversion.

3)      To create an API to help researchers create and modify mappings and conversions between BioPAX and their own formats.

4)      To test and analyze the toolkit for correctness, accuracy and reliability.

5)      To make this toolkit available as open source to the community under the GNU LGPL license.[8]

 

 

Preliminary Work:

 

The BioPAX ontology supports a wide range of detail and multiple levels of abstraction so that it can accommodate a variety of data models.  Currently, semantic mapping is done manually: mapping the database fields from the native database format to the BioPAX format typically requires at least one expert from the source database and at least one expert in BioPAX. So great was the need for mappings to and from BioPAX that over 40 volunteers have helped develop the various levels and continue to help develop BioPAX. Contributions have come from representatives of over 8 different databases, including aMAZE [15], BioCyc [16], BIND [17], eMIM [18], INOH [19], PATIKA [20], Reactome [21] and WIT/PUMA2 [22].  Organizations such as the Proteomics Standards Initiative [23], Chemical Markup Language [24], SBML [25] and CellML [26] have also made significant contributions.

 

Despite this pressing need, work on BioPAX continues to be primarily volunteer-based. The volunteers can offer only part-time help, although the workload calls for full-time staff and resources.  Without full-time staff for BioPAX, time that could otherwise be spent on research that helps save lives is being lost.

 

Significance:

 

Cancer continues to be one of the leading causes of death worldwide. The American Cancer Society reported nearly 1.4 million expected cases and nearly 500,000 deaths in 2005 in the United States alone. [27] BioPAX and our project will further cancer research by significantly reducing the time and resources necessary for functional genomics.  Automatic conversions that can be modified to suit individual cases free scientists from work that is tedious, monotonous and prone to human error when performed manually.  BioPAX can also benefit environmental research, specifically microbial research in which the metabolic pathways of prokaryotes are studied for use in bioremediation.

 

The results of this project will also be a contribution to the following communities:

 

1)      BioPAX Community: Multiple export formats of OWL-validated models make BioPAX more widely useful and easier to integrate with software designed around other standards, paradigms or languages, such as CellML, Jess, Prolog, Lisp and KIF.

2)      Biological Pathways Community: A flexible automatic conversion toolkit will allow the scientific community to save time, money and resources whilst being able to focus on the biological research.

3)      RDF/OWL/XTM Communities: Multiple import formats make it easier to blend metadata across paradigms. Furthermore, the results of this project will allow a better understanding of the intricate problems of semantic format conversions. The analysis could potentially lead to innovative methods and solutions for this problem domain.

 

Project Implementation Details:

 

The conversion from RDF to XTM follows a five-step methodology:

 

1)      RDF to Jena

2)      Jena to Java+

3)      Java+ to TMAPI

4)      TMAPI to TinyTIM

5)      TinyTIM to XTM*
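
The five steps above can be sketched end to end as follows. This is a minimal illustration in plain Python under stated assumptions: it does not call Jena, TMAPI or TinyTIM, the function names and the mapping dictionary are hypothetical, and the output is a simplified XTM-style document without namespaces rather than valid XTM 1.0. Each function stands in for one or two of the listed steps.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: predicate -> slot in the intermediate topic record.
MAPPING = {"type": "types", "NAME": "names", "COMMENT": "occurrences"}


def read_triples(lines):
    """Step 1 stand-in (RDF to Jena): parse 'subject predicate object' lines."""
    return [tuple(line.split(None, 2)) for line in lines if line.strip()]


def build_topics(triples, mapping=MAPPING):
    """Steps 2-3 stand-in (Jena to Java+ to TMAPI): group values per topic."""
    topics = {}
    for subject, predicate, obj in triples:
        record = topics.setdefault(
            subject, {"types": [], "names": [], "occurrences": []}
        )
        record[mapping[predicate]].append(obj)
    return topics


def write_xtm_like(topics):
    """Steps 4-5 stand-in (TMAPI to TinyTIM to XTM): serialize to XML text."""
    root = ET.Element("topicMap")
    for topic_id, record in topics.items():
        topic = ET.SubElement(root, "topic", id=topic_id)
        for type_id in record["types"]:
            instance_of = ET.SubElement(topic, "instanceOf")
            ET.SubElement(instance_of, "topicRef", href="#" + type_id)
        for name in record["names"]:
            base_name = ET.SubElement(topic, "baseName")
            ET.SubElement(base_name, "baseNameString").text = name
        for text in record["occurrences"]:
            occurrence = ET.SubElement(topic, "occurrence")
            ET.SubElement(occurrence, "resourceData").text = text
    return ET.tostring(root, encoding="unicode")


# Placeholder input; real input would be BioPAX RDF/XML read through Jena.
lines = [
    "glycolysis type pathway",
    "glycolysis NAME Glycolysis pathway",
    "glycolysis COMMENT Converts glucose to pyruvate",
]
xtm_text = write_xtm_like(build_topics(read_triples(lines)))
print(xtm_text)
```

Keeping the mapping as data that is passed into the middle stage, rather than hard-coding it, is what would let a BioPAX-based mapping ontology mediate the conversion.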

 

*For a complete description of the specified technologies, please refer to the Glossary at the end of the proposal.



"Java+", above and below, means "Java objects in the porting code, as augmented by any specially-built mapping ontology".  At this step, a mapping ontology (or mapping file) will be created to mediate the process. This mapping ontology will be based on BioPAX.

  

The conversion from XTM to RDF also follows a five-step methodology; however, different technologies are used at key steps:

 

1)      XTM to TM4J

2)      TM4J to TMAPI+

3)      TMAPI+ to Java+

4)      Java+ to Jena

5)      Jena to RDF
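
The return path can be sketched in the same spirit. Again this is plain Python with hypothetical names and a hypothetical record layout, not TM4J, TMAPI or Jena code. It also shows why the mapping matters on the way back: if two predicates were mapped to the same topic map construct, the inverse is ambiguous and the original predicate cannot be recovered without extra information.

```python
def invert(mapping):
    """Build the slot -> predicate mapping, refusing ambiguous inverses."""
    inverse = {}
    for predicate, slot in mapping.items():
        if slot in inverse:
            raise ValueError(f"slot {slot!r} maps back to more than one predicate")
        inverse[slot] = predicate
    return inverse


def topics_to_triples(topics, inverse):
    """Turn topic records back into (subject, predicate, object) triples."""
    triples = []
    for topic_id, record in topics.items():
        for slot, values in record.items():
            for value in values:
                triples.append((topic_id, inverse[slot], value))
    return triples


# Hypothetical mapping and topic records, standing in for parsed XTM.
mapping = {"type": "types", "NAME": "names", "COMMENT": "occurrences"}
topics = {
    "glycolysis": {
        "types": ["pathway"],
        "names": ["Glycolysis"],
        "occurrences": [],
    }
}
triples = topics_to_triples(topics, invert(mapping))
print(triples)
```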

 

"TM4J" here replaces "TinyTIM" as the Topic Map engine, and "TMAPI+" replaces "TMAPI".  These changes go together, because TinyTIM does not support TMAPI+, which is useful here. TM4J is generally the more capable engine: it is better supported and more scalable.  It also has a larger open source support group and a much larger user community, so related questions are more likely to be answered.

 

Project Members:

 

   Additional Project Mentors:

 

 

Dan Corwin earned his Master's at MIT in Systems Modeling (ME), then fell in love with AI.  He has spent the two decades since devising associative models of human memory that serialize into English clauses, seeking a theory to explain why natural language is so effortless.  In other work, he wrote commercial word processors for Wang and Lotus; simulated NASA aerospace vehicles; created several scripting languages; and started five small software companies.  His current work targets a web toolkit for mapping paragraphs and biotech databases into Topic Maps and RDF, with special focus on the OWL-constrained cellular pathway models of BioPAX, one of the first industrial-scale results of the semantic web.

 

 

Dr. Joanne Luciano has been an active lead in BioPAX, the BioPathways Consortium, and the emerging Semantic Web for Life Sciences.  She is one of the developers of a pathway ontology that is likely to become the standard for all pathway-related bio-research. She is an authority in pathway modeling, familiar with systems such as EcoCyc, BIND, WIT, KEGG, SBML, and CellML. She is also one of the developers of an OWL-based version of the BioPAX ontology (using Stanford's Protégé knowledge environment) and of its uses within the RDF framework.  Her familiarity with the processes involved in these initiatives, the tools, and the science, together with her contact with the community of researchers in the multiple disciplines involved, gives her a unique combination of skills, contacts, knowledge and know-how, which she brings from an unbiased position.

 

 

Jeremy Zucker graduated in 1997 with degrees in Computer Science and Applied Mathematics from the University of Colorado. Soon after, Jeremy began working as a research scientist at the MIT Artificial Intelligence Lab under Gerald J. Sussman and Hal Abelson, where he developed biologically-inspired programming paradigms.  In 1999, Jeremy left MIT to co-found Proprium Enterprises, a hedge fund based in New York and Sydney, Australia.  As Chief Research Scientist, Jeremy designed and implemented web services to enable real-time investment analysis among other Y2K analyses. In 2002, Jeremy began working as a bioinformatics specialist at the Dana-Farber Cancer Institute and as a computational biologist for the Church lab at Harvard Medical School. He currently leads a team of developers for DARPA’s BioSPICE project, the DOE Genomes to Life initiative, and BioPAX. His teaching experience includes serving as curriculum assistant for the Harvard Department of Systems Biology's introductory course, and the Intelligent Systems for Molecular Biology (ISMB) 2005 tutorial on Semantic Aggregation, Integration, and Inference of Biological Pathways.

 

  Student:

 

 

Nada Hashmi earned an MS in Computer Science from the University of Maryland, College Park in 2004 and a BA in Math and Computer Science from Washington College in 2002.  Her achievements include receiving the prestigious Presidential Fellowship Award in 2005 and the William Gover Duvall ’30 Prize in 2002 for being the top Math and CS student in the graduating class.  She has published over nine papers, including a book chapter.  Her work with Fujitsu Laboratories on the application of Semantic Web technologies in the bioinformatics domain has been patented.   Her experience with both semantic web technologies and bioinformatics provides her with the skill set needed to help ensure this project’s success.

 

A full, detailed resume is available at: http://www.nadahashmi.com/resume.pdf

 

 

Glossary of Technologies:

 

  1. XML: http://www.w3.org/XML/

 

XML: eXtensible Markup Language

In HTML, tags give meaning and structure to a web document. XML is similar to HTML, except that, in XML, you can define the tags yourself (this is why it is called extensible). The structure of an XML document is described by an XML schema or a document type definition (DTD), which specifies the XML tags and the allowable structure of the document.

 

  2. RDF: http://www.w3.org/TR/rdf-primer/

 

RDF: Resource Description Framework

RDF is commonly written in XML syntax and formatted as an XML document (angle-bracketed tagged data); however, that is where the similarity ends. Whereas XML documents can be represented as trees, RDF documents are graphs. All data in RDF are described using subject-predicate-object triples, which define the ‘semantics’ or ‘domain logic’ needed to connect various data items and specify their relationship to each other. The things that RDF describes are identified by uniform resource identifiers (URIs), which resemble web addresses. This is where the power of RDF comes from. Each triple states a single subject-predicate-object ‘fact’; if one assembles all the facts and they reuse the same URIs, what results is the globally distributed network called the semantic web. [12]
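
How shared URIs turn separate facts into one graph can be shown with a small sketch. It uses plain Python tuples rather than an RDF library, and the example.org URIs and the two "sources" are made-up placeholders. Two triples published independently become connected only because the object of one is the subject of the other.

```python
from collections import defaultdict

# Two hypothetical sources, each publishing one fact about shared URIs.
source_a = {
    ("http://example.org/pathway/glycolysis",
     "http://example.org/vocab/hasStep",
     "http://example.org/reaction/r1"),
}
source_b = {
    ("http://example.org/reaction/r1",
     "http://example.org/vocab/catalyzedBy",
     "http://example.org/protein/hexokinase"),
}


def build_graph(*sources):
    """Merge triple sets into an adjacency map: subject -> [(predicate, object)]."""
    graph = defaultdict(list)
    for subject, predicate, obj in set().union(*sources):
        graph[subject].append((predicate, obj))
    return graph


def reachable(graph, start):
    """Return every node that can be reached from start by following triples."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(obj for _, obj in graph.get(node, []))
    return seen


graph = build_graph(source_a, source_b)
linked = reachable(graph, "http://example.org/pathway/glycolysis")
print(sorted(linked))
```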

 

  3. OWL:  http://www.w3.org/TR/owl-features/

 

OWL: Web Ontology Language

OWL, which is defined using RDF, is a language designed for ontology construction and deployment. OWL adds the required semantic constraints on the RDF language(s) used for data documents. Together, RDF and OWL form a logic model that can be used throughout data repositories, knowledge bases and inference engines. The structure and meaning of any group of documents can be precisely defined and related to each other. For example, similar models, one using ‘mother’ and the other using ‘female parent’, can be semantically linked using statements such as owl:equivalentClass. Similarly, one could use another OWL statement, owl:disjointWith, to describe the fact that the two concepts ‘mother’ and ‘father’ are disjoint (i.e. no one can be both a mother and a father). One could also use cardinality constraints to restrict the number of mothers one may have to exactly one, or one could create subclasses of mother such as surrogate, step, biological and genetic. The advantage of OWL is that you get these rich semantics in a machine-readable format. [11]

 

  4. Topic Maps: http://www.ontopia.net/topicmaps/materials/tao.html

 

Topic maps are an ISO standard for the representation and interchange of knowledge, with an emphasis on the findability of information. The standard is formally known as ISO/IEC 13250:2003.  Topic maps are based on topics, associations and occurrences. [4]
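
The three building blocks can be pictured with a small data-structure sketch. The class and field names below are illustrative assumptions in plain Python; they are not the TMAPI interfaces or the ISO data model, and the identifiers and URL are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """A topic with its names and occurrences (links to information resources)."""
    topic_id: str
    names: list = field(default_factory=list)
    occurrences: list = field(default_factory=list)  # (type, resource) pairs


@dataclass
class Association:
    """A typed relationship in which topics play named roles."""
    assoc_type: str
    roles: dict = field(default_factory=dict)  # role name -> topic id


@dataclass
class TopicMap:
    topics: dict = field(default_factory=dict)
    associations: list = field(default_factory=list)

    def add_topic(self, topic):
        self.topics[topic.topic_id] = topic

    def associate(self, assoc_type, **roles):
        # Every role player must already exist as a topic in this map.
        missing = [tid for tid in roles.values() if tid not in self.topics]
        if missing:
            raise KeyError(f"unknown topics: {missing}")
        self.associations.append(Association(assoc_type, dict(roles)))

    def associations_of(self, topic_id):
        return [a for a in self.associations if topic_id in a.roles.values()]


tm = TopicMap()
tm.add_topic(Topic("hexokinase", names=["Hexokinase"]))
tm.add_topic(Topic("glycolysis", names=["Glycolysis"],
                   occurrences=[("description", "http://example.org/glycolysis")]))
tm.associate("participates-in", participant="hexokinase", pathway="glycolysis")
print(tm.associations_of("glycolysis"))
```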

 

  5. XTM: http://www.topicmaps.org/xtm/1.0/

 

XTM: XML Topic Maps

As defined in the XTM specification, a topic is an addressable resource within a computer that stands in for (or “reifies”) some real-world subject. A topic map conveys knowledge about those resources (and therefore about the subjects they reify) through a superimposed layer, or map, of the resources. Furthermore, a topic map captures information about subjects, and the relationships between subjects, in a way that is implementation-independent.

 

  6. Jena: http://jena.sourceforge.net/

 

Jena is an open source Java framework to help build semantic web applications. The Jena Framework includes:

·         An RDF API

·         Reading and writing RDF in RDF/XML, N3 and N-Triples

·         An OWL API

·         In-memory and persistent storage

·         RDQL – a query language for RDF

 

  7. TMAPI: http://www.tmapi.org/

 

TMAPI: Common Topic Map Application Programming Interface

 

TMAPI is a programming interface for accessing and manipulating data held in a topic map. The TMAPI specification defines a set of core interfaces which must be implemented by a compliant application as well as a set of additional interfaces which may be implemented by a compliant application or which may be built upon the core interfaces.  TMAPI has been developed in an open process by developers working on topic map processors and topic map applications and placed into the public domain. There are no restrictions on its use.

 

  8. TinyTIM: http://tinytim.sourceforge.net/

 

TinyTIM: TINY Topic Map Engine & TMAPI implementation

TinyTIM is a very small, easy-to-use, in-memory Topic Map engine. It implements the TMAPI interfaces, so one can work with Topic Maps via the TMAPI standard. TMAPI is intended to be for XTM what DOM is for XML.

 

  9. TM4J Engine:  http://tm4j.org/

TM4J Engine: a topic map processing engine written in Java. It provides a pure Java API; support for the Tolog query language; support for importing the XTM and LTM syntaxes; support for exporting the XTM syntax; and persistence of topic map information in a wide variety of databases.

 

References:

  1. Cary, M.P. et al. (2005) Pathway information for systems biology. FEBS Lett. 579, 1815-1820
  2. http://www.biopax.org/index.html
  3. Luciano, J.S. (2005) PAX of mind for pathway researchers. Drug Discov. Today 10(13), 937-942
  4. http://www.ontopia.net/topicmaps/materials/rdf.html
  5. http://www.lexikos.com/exhibit/cms/biopax/usecases.htm
  6. http://www.w3.org/TR/rdftm-survey/
  7. http://www.ontopia.net/omnigator/docs/navigator/userguide.html#rdf-support
  8. http://www.gnu.org/copyleft/lesser.html
  9. Krieger, C.J. et al. (2004) MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res. 32, D438-D442
  10. Keseler, I.M. et al. (2005) EcoCyc: a comprehensive database resource for Escherichia coli. Nucleic Acids Res. 33, D334-D337
  11. Noy, N. and McGuinness, D. (2001) Ontology Development 101: A Guide to Creating Your First Ontology. Stanford University
  12. Berners-Lee, T. et al. (2001) The semantic web. Scientific American.
  13. http://www.biopax.org/Docs/BioPAX_Roadmap.html
  14. http://www.biopax.org/About.html
  15. http://www.amaze.ulb.ac.be/
  16. http://www.biocyc.org/
  17. http://www.bind.ca/
  18. http://discover.nci.nih.gov/mim/index.jsp
  19. http://www.inoh.org/
  20. http://www.patika.org/
  21. http://www.reactome.org/
  22. http://compbio.mcs.anl.gov/puma2/cgi-bin/index.cgi
  23. http://psidev.sourceforge.net/
  24. http://www.xml-cml.org/
  25. http://www.sbml.org/
  26. http://www.cellml.org/
  27. http://www.cancer.org/docroot/STT/stt_0.asp