Technical Section

This page describes the GOBU file formats and the utilities that help you prepare your own data.

Preface
GOBU File Formats
Data Processing Utilities
Suggested Steps

Preface

This document follows these conventions:

GOBU's design is based on a number of definitions, so the first appearance of each GOBU term is set in bold italic.
To avoid ambiguity, we use the following placeholders where necessary:
1. <TAB> for a tab character
2. <Space> for a space character

GOBU File Formats

We strongly recommend that you read at least the Node String and List Format sections.

Node String

In GOBU's design, every user node has its own node string, which describes the node's characteristics, including its type and content. A node string therefore needs at least two fields: a type field and a content field.

To separate these two fields in a single string, GOBU treats a colon (':') as the field separator. That is, the string "aa:bb" has "aa" as its type field and "bb" as its content field. On its own, this design would not permit type fields that contain colons. To remedy this, node string processing uses an escape character (backslash, '\'): the character immediately following an escape character is processed without any special meaning. In the GOBU source code, these ideas are implemented in bio301.dataproc.EscapeStringTokenizer (the class name is given only as a reference for programmers). EscapeStringTokenizer produces a "processed" type field and an "unprocessed" content field, as shown in the following table:

input	type (output)	content (output)
GO:0004567	GO	0004567
G\O\::0\0\04567	GO:	0\0\04567
G\\O\::0\0\04567	G\O:	0\0\04567
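
The splitting rule in the table can be sketched in a few lines of Python. This is not the code of EscapeStringTokenizer, only an illustration written to reproduce the three rows above; the function name split_node_string and the choice to return None as the type field when no separator is found are assumptions of this sketch.

```python
def split_node_string(node_string, separator=":", escape="\\"):
    """Split a node string into (processed type field, unprocessed content field).

    Assumption of this sketch: when no unescaped separator is found,
    the result is (None, node_string).
    """
    type_chars = []
    i = 0
    while i < len(node_string):
        ch = node_string[i]
        if ch == escape and i + 1 < len(node_string):
            # The character right after an escape character is taken literally.
            type_chars.append(node_string[i + 1])
            i += 2
        elif ch == separator:
            # First unescaped separator: everything after it is the content,
            # which is returned without any escape processing.
            return "".join(type_chars), node_string[i + 1:]
        else:
            type_chars.append(ch)
            i += 1
    return None, node_string


# The three rows of the table (raw strings keep the backslashes as typed).
for example in ["GO:0004567", r"G\O\::0\0\04567", r"G\\O\::0\0\04567"]:
    print(example, "->", split_node_string(example))
```

Only the type field is unescaped; the content field is passed through untouched, which is why the backslashes in "0\0\04567" survive.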

The following table specifies type fields and content fields of all five classes of user nodes:

class	type	content	example
R-nodes	RP	any	RP:LL.1542
GO annotations	GO	integer	GO:0008372
P-nodes	PR	any	PR:LOC:(240773309,240816925) @chr1@Human
Unit nodes	UN	any	UN:group1
normal nodes	otherwise	any	name:CYMP
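
When preparing data, it can help to check which class each node string will fall into. The following Python sketch mirrors the table above; it is not GOBU code. The function name classify is hypothetical, escape characters are ignored for brevity, and rejecting a GO content field that is not made of digits is an assumption of this sketch rather than documented GOBU behavior.

```python
# Type field -> class, as listed in the table above.
NODE_CLASSES = {
    "RP": "R-node",
    "GO": "GO annotation",
    "PR": "P-node",
    "UN": "Unit node",
}


def classify(node_string):
    """Return the class of a node string according to its type field.

    Simplification: the string is split at the first colon and escape
    characters are not interpreted.
    """
    type_field, separator, content = node_string.partition(":")
    if not separator:
        # No field separator at all: treated here as a normal node.
        return "normal node"
    node_class = NODE_CLASSES.get(type_field, "normal node")
    if node_class == "GO annotation" and not content.isdigit():
        # Assumption of this sketch: flag GO content that is not an integer.
        raise ValueError(f"GO content is not an integer: {node_string!r}")
    return node_class


for example in ["RP:LL.1542", "GO:0008372", "UN:group1", "name:CYMP"]:
    print(example, "->", classify(example))
```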

In the user tree, user nodes are displayed with different icons according to their classes. Node names are determined according to the following table:

class	icon (colored)	icon (uncolored)	node name
R-nodes			content
GO annotations			corresponding GO term
P-nodes			content
Unit nodes			content
normal nodes			node string (processed type field, if any)

Special Note for P-Nodes

P-nodes are designed for representing user-defined data types, so their node names are processed by EscapeStringTokenizer to determine their data types. If a node name contains no field separator, the node name itself is treated as the data type field.

List Format

List format is the first GOBU file format. It is very easy to produce a list format file using database queries. In list format, every line of the file is a tree path to be added to the user tree root, and every path is encoded as node strings with <TAB> as the internal edges. For example, the tree paths in Fig. 1 are described by the following lines:

Fig. 1

UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>conf: 100.0
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>GO:0005868
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>GO:0000166
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>GO:0005524
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>GO:0017111
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>GO:0042623
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>GO:0003777
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>GO:0007018
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>GO:0007052
UN:group6<TAB>tax9606<TAB>RP:33350932<TAB>PR:LOC:(100421045,100507307)@chr14@Human

To generate a correct list format file, just remember to generate paths that cover the whole user tree. You do not need to generate these lines in any special order, but note that tree siblings (nodes having the same parent node) will be ordered by their first appearance. We believe that a qualified technician could easily generate this kind of data using simple SQL queries. Note that list format file names should not end with ".tree".
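
As a minimal sketch of producing such lines from query results, the following Python code joins node strings with tab characters. The row layout (unit, taxon, gene ID, GO ID) and the function name list_format_lines are hypothetical and only chosen to match the example in Fig. 1; adapt them to your own tree structure.

```python
def list_format_lines(rows):
    """Yield one list format line per (unit, taxon, gene_id, go_id) row."""
    for unit, taxon, gene_id, go_id in rows:
        # One tree path from the root down to the GO annotation.
        path = [f"UN:{unit}", taxon, f"RP:{gene_id}", f"GO:{go_id}"]
        if any("\t" in node_string for node_string in path):
            # A tab inside a node string would be read as an extra edge.
            raise ValueError(f"node string contains a tab character: {path!r}")
        yield "\t".join(path)


# Hypothetical rows, e.g. the result of a database query.
rows = [
    ("group6", "tax9606", "33350932", "0005868"),
    ("group6", "tax9606", "33350932", "0000166"),
]
lines = list(list_format_lines(rows))
for line in lines:
    print(line)
```

When writing the lines to disk, remember that the file name should not end with ".tree".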

Tree Format

Unlike the list format, every line of a tree format file contains exactly one node string preceded by depth <Space> characters, where the depth of a node is the number of edges between this node and the tree root. The order of the lines should be the same as the visiting order of a depth-first search. For example, the user tree in Fig. 2 is described by the following lines:

Fig. 2

User Data
<Space>chr1
<Space><Space>1
<Space><Space><Space>RP:LL.1542
<Space><Space><Space><Space>name:CYMP
<Space><Space><Space>RP:LL.1772
<Space><Space><Space><Space>name:DNAH14
<Space><Space><Space><Space>GO:0008372
<Space><Space><Space><Space>GO:0003777
<Space><Space><Space><Space>GO:0000004
<Space><Space><Space>RP:LL.1989
<Space><Space><Space><Space>name:EKV

Note that tree format file names must end with ".tree".
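
The relationship between the two formats can be sketched in Python as follows. This is not the implementation of TreeFileMaker, only an illustration of the rules stated above: shared path prefixes are merged, siblings keep the order of their first appearance, and lines are emitted in depth-first order with one leading space per depth level. The function name and the root label "User Data" (taken from the Fig. 2 example) are assumptions of this sketch.

```python
def paths_to_tree_lines(paths, root="User Data"):
    """Convert tree paths (lists of node strings) into tree format lines."""
    tree = {}  # nested dicts; dicts keep insertion order in Python 3.7+
    for path in paths:
        node = tree
        for node_string in path:
            # Reuse an existing child or create it at its first appearance.
            node = node.setdefault(node_string, {})

    lines = [root]  # the root has depth 0, so it gets no leading space

    def visit(children, depth):
        # Depth-first traversal: emit a node, then its whole subtree.
        for node_string, grandchildren in children.items():
            lines.append(" " * depth + node_string)
            visit(grandchildren, depth + 1)

    visit(tree, 1)
    return lines


paths = [
    ["chr1", "1", "RP:LL.1542", "name:CYMP"],
    ["chr1", "1", "RP:LL.1772", "name:DNAH14"],
    ["chr1", "1", "RP:LL.1772", "GO:0008372"],
]
tree_lines = paths_to_tree_lines(paths)
print("\n".join(tree_lines))
```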

Data Processing Utilities

The following utilities require more memory than the default size provided by the Java Virtual Machine (JVM). We advise you to use an option such as "-Xmx800M" (an option of the Sun JVM) in your java command. All classes described in this section are in the bio301.goutil.gobu.data package.

OBO-format GO file

From version 0.95 on, OBO-format GO files are supported. If you are using GOBU 0.95 or later, you may replace the options "-G data\component.ontology.txt -G data\function.ontology.txt -G data\process.ontology.txt" with "-OBO data\gene_ontology.1_2.obo.txt" for all tools described in this section.

TreeFileMaker

Purpose: To translate a list format file into a tree format file.
Options:
1. -I: Specify the input file (list format)
2. -O: Specify the output file (tree format)
3. -G: Specify a GO data file. We advise you to use this option three times to load the component, function and process GO data.
Example command (Windows command line window, in GOBU folder): java -Xmx600M -classpath gobu.jar bio301.goutil.gobu.data.TreeFileMaker -I user\LocusData -O user\LocusData.to.tree -G data\component.ontology.txt -G data\function.ontology.txt -G data\process.ontology.txt

TreeSorter

Purpose: Given a tree format file, sort tree siblings into an easy-to-browse order (see Fig. 3 for an example).

Fig. 3 (before and after sorting)
Options:
1. -I: Specify the input file (tree format)
2. -O: Specify the output file (tree format)
3. -G: Specify a GO data file. We advise you to use this option three times to load the component, function and process GO data.
Example command (Windows command line window, in GOBU folder): java -Xmx600M -classpath gobu.jar bio301.goutil.gobu.data.TreeSorter -I user\LocusData.to.tree -O user\LocusData.ts.tree -G data\component.ontology.txt -G data\function.ontology.txt -G data\process.ontology.txt

AddAnnotation

Purpose: Given a tree format file and a list format file, append sub-tree information (described in the list format file) to the tree format file (see Fig. 4 for an example).

Fig. 4 (tree format file + list format file = resulting tree)
Options:
1. -I: Specify the input file (tree format)
2. -S: Specify the sub-tree information file (list format)
3. -O: Specify the output file (tree format)
4. -G: Specify a GO data file. We advise you to use this option three times to load the component, function and process GO data.
Description: This program first loads the input file as a user tree, then loads the sub-tree file as a second user tree. It records every depth-1 user node in the second tree and looks for these nodes in the first tree. Whenever a recorded user node is found in the first tree, the program attaches the corresponding sub-tree (of the second tree) to the first tree.
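
The attachment step described above can be sketched in Python with user trees held as nested dictionaries. This is an illustration only, not the AddAnnotation source code; the function names are hypothetical, and merging an attached child with an existing child of the same name (instead of adding a duplicate) is an assumption of this sketch.

```python
def merge(target, source):
    """Copy every node of source into target, reusing nodes that already exist."""
    for node_string, children in source.items():
        merge(target.setdefault(node_string, {}), children)


def add_annotation(tree, subtrees):
    """Attach subtrees[name] below every node called name found in tree.

    tree:     nested dicts, node string -> children (the first user tree)
    subtrees: depth-1 node string of the second tree -> its children
    The first tree is modified in place.
    """
    for node_string, children in tree.items():
        # Search the existing descendants first, then attach at this node,
        # so that freshly attached sub-trees are not searched again.
        add_annotation(children, subtrees)
        if node_string in subtrees:
            merge(children, subtrees[node_string])


tree = {"chr1": {"1": {"RP:LL.1772": {"name:DNAH14": {}}}}}
subtrees = {"RP:LL.1772": {"GO:0008372": {}, "GO:0003777": {}}}
add_annotation(tree, subtrees)
print(tree)
```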

ExpressionCounter

Purpose: Given data described in a tree format file, compute, for each GO term, the number of genes or the sum of a specified property.
Options:
1. -I: Specify the input file (tree format)
2. -O: Specify the output file (text format)
3. -OBO: Specify the OBO-format GO data file
4. -P: Specify the property to be summed up. If this option is not given, only the numbers of genes are computed. Be very sure that you are defining P-nodes correctly.
5. -L: This option takes two parameters, x and y, with x < y, and restricts the reported GO terms to those at levels x and y. If this option is not given, all GO terms are reported.

GOSlimMapper

Purpose: Transform user data with GO annotations into the same data with GO annotations restricted to a specified GO slim.
Options:
1. -S: GO slim file
2. -OBO: the OBO-format GO data file
3. -I: input file (tree format)
4. -O: output file (tree format)

TreeFromSet

Purpose: Given a reference tree, which usually contains the data of a whole genome, generate trees for several gene sets of interest. This tool is very useful for applications of the MultiView plugin.
Options:
1. -R: the reference tree
2. -OBO: the OBO-format GO data file
3. -N: This option takes two parameters. The first parameter specifies the name of the gene set, and the second parameter specifies the file that contains the gene set. Be sure that genes are defined as R-nodes in the reference tree and that the accessions in the file are the same as those of these R-nodes.

Accessing EntrezGene database

Purpose: Two simple programs are provided for extracting information from the NCBI EntrezGene tables gene_info and gene2go:
1. ExtractEntrezGene
  - Purpose: Extract gene IDs from the gene_info table for a specified species (by taxonomy ID) and gene type (see the README file in the NCBI EntrezGene ftp directory), then organize them according to their chromosome and map location (as described in Overview).
  - Options:
    1. -I: Specify the input file (gene_info file)
    2. -O: Specify the output file (tree format)
    3. -T: Specify the species (by taxonomy ID)
    4. -G: Specify the gene type
2. ExtractEntrezGeneGO
  - Purpose: Extract GO annotations from the gene2go table for a specified species (by taxonomy ID).
  - Options:
    1. -I: Specify the input file (gene2go file)
    2. -O: Specify the output file (list format)
    3. -T: Specify the species (by taxonomy ID)
Example command (Windows command line window, in GOBU folder, after downloading gene_info and gene2go files from NCBI EntrezGene ftp directory):
1. (extract all human protein-coding gene IDs) java -Xmx600M -classpath gobu.jar bio301.goutil.gobu.data.ExtractEntrezGene -I gene_info -O tmpHumanGene.tree -T 9606 -G protein-coding
2. (extract all human GO annotations) java -Xmx600M -classpath gobu.jar bio301.goutil.gobu.data.ExtractEntrezGeneGO -I gene2go -O tmpHumanGO -T 9606
3. (append GO annotations) java -Xmx600M -classpath gobu.jar bio301.goutil.gobu.data.AddAnnotation -I tmpHumanGene.tree -S tmpHumanGO -O tmpHumanGene.tree -G data\component.ontology.txt -G data\function.ontology.txt -G data\process.ontology.txt
4. (sort the resulting tree into an easy-to-browse order) java -Xmx600M -classpath gobu.jar bio301.goutil.gobu.data.TreeSorter -I tmpHumanGene.tree -O HumanGene.tree -G data\component.ontology.txt -G data\function.ontology.txt -G data\process.ontology.txt
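
If you run these four commands repeatedly, a small driver script may be convenient. The Python sketch below only assembles the same command lines as above and runs them in order; the helper names gobu_command and run_all are hypothetical, the script assumes it is started in the GOBU folder with java on the PATH, and it uses only the options shown in this section.

```python
import subprocess

# The three GO data files passed with -G, as in the commands above.
GO_OPTIONS = [
    ("-G", r"data\component.ontology.txt"),
    ("-G", r"data\function.ontology.txt"),
    ("-G", r"data\process.ontology.txt"),
]


def gobu_command(tool, options, max_heap="600M"):
    """Build the java command line for one tool of bio301.goutil.gobu.data."""
    command = ["java", f"-Xmx{max_heap}", "-classpath", "gobu.jar",
               f"bio301.goutil.gobu.data.{tool}"]
    for flag, value in options:
        command += [flag, value]
    return command


steps = [
    gobu_command("ExtractEntrezGene", [("-I", "gene_info"), ("-O", "tmpHumanGene.tree"),
                                       ("-T", "9606"), ("-G", "protein-coding")]),
    gobu_command("ExtractEntrezGeneGO", [("-I", "gene2go"), ("-O", "tmpHumanGO"),
                                         ("-T", "9606")]),
    gobu_command("AddAnnotation", [("-I", "tmpHumanGene.tree"), ("-S", "tmpHumanGO"),
                                   ("-O", "tmpHumanGene.tree")] + GO_OPTIONS),
    gobu_command("TreeSorter", [("-I", "tmpHumanGene.tree"),
                                ("-O", "HumanGene.tree")] + GO_OPTIONS),
]


def run_all(commands):
    """Run the commands in order and stop at the first one that fails."""
    for command in commands:
        subprocess.run(command, check=True)


# run_all(steps)  # uncomment to execute the pipeline in the GOBU folder
```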

Suggested Steps

We advise you to take the following steps to build a data file.

Identify R-nodes: Determine your basic objects, i.e., the objects you want to count in the GO distribution.
Organize R-nodes: Plan the structure part of your user tree, i.e., the tree content above (and including) the R-nodes, according to the data you have in hand. For example, we organize R-nodes (gene IDs) according to their chromosome and map location (see Fig. 5).

Fig. 5
Generate structure part: If the information for the structure part of the user tree is held in your database, it should be easy to generate a list format data file using SQL queries. TreeFileMaker can then translate the data file into a tree format file.
Append annotations: Generate annotations for the R-nodes in list format (paths should start with R-nodes), then use AddAnnotation to append these annotations.

You may need to repeat the above steps a few times to obtain a satisfactory user tree.
