
Technical Section

This page describes the GOBU file formats and the utilities that help you prepare your own data.

Contents

  1. Preface
  2. GOBU File Formats
    1. Node String
    2. List Format
    3. Tree Format
  3. Data Processing Utilities
    1. OBO-format GO file
    2. TreeFileMaker
    3. TreeSorter
    4. AddAnnotation
    5. ExpressionCounter
    6. GOSlimMapper
    7. TreeFromSet
    8. Accessing EntrezGene database
  4. Suggested Steps

Preface

This document is written according to the following rules:

GOBU File Formats

We strongly recommend reading at least the Node String and List Format sections.
Node String
In GOBU's design, every user node has its own node string. The node string describes the node's characteristics, including its type and content. A node string therefore needs at least two fields: a type field and a content field.

To identify these two fields in a single string, GOBU treats a colon (':') as the field separator: the string "aa:bb" has "aa" as its type field and "bb" as its content field. On its own, this design would not permit type fields that contain colons. To remedy this, node string processing uses the backslash ('\') as an escape character: the character immediately following an escape character is taken literally, without any special meaning. In the GOBU source code these ideas are implemented in bio301.dataproc.EscapeStringTokenizer (the class name is given only as a reference for programmers). EscapeStringTokenizer produces a "processed" type field and an "unprocessed" content field, as shown in the following table:

input               output: type   output: content
GO:0004567          GO             0004567
G\O\::0\0\04567     GO:            0\0\04567
G\\O\::0\0\04567    G\O:           0\0\04567
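The splitting rule in the table above can be sketched as follows. This is a minimal Python illustration, not the actual bio301.dataproc.EscapeStringTokenizer code; the function name split_node_string is hypothetical, and returning None as the type field when no separator is found is an assumption made only for this sketch.

```python
def split_node_string(node_string, sep=":", esc="\\"):
    """Split a node string into (processed type field, unprocessed content field).

    Hypothetical helper: the name and the (None, node_string) result for
    strings without a separator are assumptions of this sketch.
    """
    type_chars = []
    i = 0
    while i < len(node_string):
        ch = node_string[i]
        if ch == esc and i + 1 < len(node_string):
            # The character right after an escape character is taken literally.
            type_chars.append(node_string[i + 1])
            i += 2
        elif ch == sep:
            # First unescaped separator: the rest is the content, left untouched.
            return "".join(type_chars), node_string[i + 1:]
        else:
            type_chars.append(ch)
            i += 1
    return None, node_string


# The three rows of the table above (raw strings keep the backslashes as typed).
for example in ("GO:0004567", r"G\O\::0\0\04567", r"G\\O\::0\0\04567"):
    print(example, "->", split_node_string(example))
```

Only the type field is unescaped; the content field is returned exactly as written, which matches the "unprocessed" column of the table.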

The following table specifies type fields and content fields of all five classes of user nodes:

class            type        content   example
R-nodes          RP          any       RP:LL.1542
GO annotations   GO          integer   GO:0008372
P-nodes          PR          any       PR:LOC:(240773309,240816925)@chr1@Human
Unit nodes       UN          any       UN:group1
normal nodes     otherwise   any       name:CYMP
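The class table above amounts to a lookup on the processed type field. The sketch below is only an illustration in Python; the names NODE_CLASSES, node_class and is_go_content are hypothetical, and the digit check for GO content is an assumption about what "integer" means here, not documented GOBU behaviour.

```python
# Type field -> node class, as listed in the table above.
NODE_CLASSES = {
    "RP": "R-node",
    "GO": "GO annotation",
    "PR": "P-node",
    "UN": "Unit node",
}


def node_class(type_field):
    """Return the node class for a processed type field.

    Any other type field (or a missing one) falls into the "normal node" class.
    """
    return NODE_CLASSES.get(type_field, "normal node")


def is_go_content(content):
    """Assumed check: GO annotation content consists of ASCII digits only."""
    return content.isascii() and content.isdigit()


print(node_class("RP"))    # R-node
print(node_class("name"))  # normal node
```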

In the user tree, user nodes are displayed with different icons according to their classes. Node names are determined according to the following table:

class            icon (colored, uncolored)   node name
R-nodes          [icons]                     content
GO annotations   [icons]                     corresponding GO term
P-nodes          [icons]                     content
Unit nodes       [icons]                     content
normal nodes     [icons]                     node string (processed type field, if any)

Special Note for P-Nodes
P-nodes are designed for representing user-defined data types, so their node names are processed by EscapeStringTokenizer to determine their data types. If a node name contains no field separator, the node name itself is treated as the data type field.
List Format
List format is the first GOBU file format, and a list format file is very easy to produce using database queries. In list format, every line of the file represents a tree path to be added under the user tree root, and every path is encoded as a sequence of node strings with <TAB> standing for the edges between them. For example, the tree paths in Fig. 1 are described by the following lines:
Fig. 1
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>conf: 100.0
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>GO:0005868
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>GO:0000166
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>GO:0005524
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>GO:0017111
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>GO:0042623
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>GO:0003777
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>GO:0007018
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>GO:0007052
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>PR:LOC:(100421045,100507307)@chr14@Human

To generate a correct list format file, just remember to generate paths that cover the whole user tree. The lines do not need to be in any special order, but note that tree siblings (nodes having the same parent node) are ordered by their first appearance in the file. We believe that a qualified technician can easily generate this kind of data using simple SQL queries. Note that list format file names must not end with ".tree".
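To illustrate how list format lines combine into one tree, here is a minimal Python sketch. It is not GOBU code: the function name build_tree is hypothetical, and it assumes that identical node strings under the same parent denote the same node. It relies on Python dictionaries keeping insertion order (Python 3.7 or later), which gives the first-appearance sibling order described above.

```python
def build_tree(lines):
    """Merge list format lines into a nested dict: node string -> children."""
    root = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        node = root
        # <TAB> in the examples stands for a real tab character.
        for node_string in line.split("\t"):
            # Reuse an existing child or create it; a new child is appended
            # after the siblings that appeared earlier in the file.
            node = node.setdefault(node_string, {})
    return root


lines = [
    "UN:group6\ttax9606\tRP:33350932\tconf: 100.0",
    "UN:group6\ttax9606\tRP:33350932\tGO:0005868",
    "UN:group6\ttax9606\tRP:33350932\tGO:0000166",
]
tree = build_tree(lines)
print(list(tree["UN:group6"]["tax9606"]["RP:33350932"]))
```

The three lines share the prefix UN:group6, tax9606, RP:33350932, so the resulting tree has a single R-node with three children.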

Tree Format
Unlike the list format, every line of a tree format file contains exactly one node string, preceded by depth <Space> characters, where the depth of a node is the number of edges between that node and the tree root. The lines must appear in the visiting order of a depth-first search. For example, the user tree in Fig. 2 is described by the following lines:
Fig. 2
User Data
<Space>chr1
<Space><Space>1
<Space><Space><Space>RP:LL.1542
<Space><Space><Space><Space>name:CYMP
<Space><Space><Space>RP:LL.1772
<Space><Space><Space><Space>name:DNAH14
<Space><Space><Space><Space>GO:0008372
<Space><Space><Space><Space>GO:0003777
<Space><Space><Space><Space>GO:0000004
<Space><Space><Space>RP:LL.1989
<Space><Space><Space><Space>name:EKV

Note that tree format file names must end with ".tree".
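The tree format layout can be sketched as a depth-first traversal that writes one node string per line, indented by its depth. The Python below is only an illustration under assumed names (to_tree_format and the nested-dict tree representation are hypothetical); it is not the TreeFileMaker implementation.

```python
def to_tree_format(name, children, depth=0):
    """Return tree format lines for a node and its subtree in depth-first order.

    `children` maps each child's node string to that child's own children.
    """
    lines = [" " * depth + name]  # one leading space per edge from the root
    for child_name, grandchildren in children.items():
        lines.extend(to_tree_format(child_name, grandchildren, depth + 1))
    return lines


# Part of the tree from Fig. 2, written as nested dicts.
subtree = {
    "chr1": {
        "1": {
            "RP:LL.1542": {"name:CYMP": {}},
            "RP:LL.1772": {"name:DNAH14": {}, "GO:0008372": {}},
        }
    }
}
print("\n".join(to_tree_format("User Data", subtree)))
```

The output would be saved under a file name ending in ".tree", as required above.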


Data Processing Utilities

The following utilities require more memory than the default heap size provided by the Java Virtual Machine (JVM). We advise you to add an option such as "-Xmx800M" (a Sun JVM option) to your java command. All classes described in this section are in the bio301.goutil.gobu.data package.
OBO-format GO file
Since version 0.95, GOBU supports the OBO-format GO file. If you are using GOBU 0.95 or later, you may replace the options "-G data\component.ontology.txt -G data\function.ontology.txt -G data\function.ontology.txt" with "-OBO data\gene_ontology.1_2.obo.txt" for all tools described in this section.
TreeFileMaker
TreeSorter

AddAnnotation

ExpressionCounter

GOSlimMapper

TreeFromSet

Accessing EntrezGene database


Suggested Steps

We advise you to take the following steps to build a data file.
  1. Identify R-nodes: Decide on your basic objects, i.e., the objects you want to count in the GO distribution.
  2. Organize R-nodes: Plan the structure part of your user tree, i.e., the tree content above (and including) the R-nodes, according to the data you have in hand. For example, we organize R-nodes (gene IDs) according to their chromosome and map location (see Fig. 5).
    Fig. 5
  3. Generate structure part: If the information for the structure part of the user tree is held in your database, it should be easy to generate a list format data file using SQL queries. TreeFileMaker can then translate that data file into a tree format file.
  4. Append annotations: Generate annotations for R-nodes in list format (paths should start with R-nodes), then use AddAnnotation to append these annotations.
You may sometimes need to repeat the above steps a few times to obtain a satisfactory user tree data file.