Thursday, May 7, 2015

Installing JCVI's GenomeProperties locally

One might wonder why I'm attempting this at all. Well, I worked at TIGR with Dan Haft for many years, and I've carried the database structures with me through my career. So I've got a leg up on your average mortal in terms of a) having and b) understanding the database structure that underlies the GP system.

That being said, I didn't anticipate this being easy.

First, I built a database called GenProp in MySQL with the relevant table structure, and populated it with data from the FTP site. (The GP v3.1 release notes are pretty easy to follow.) I suggest doing this with SQLite, because the program is written to operate off an SQLite db. Using something else requires some code alteration.

Despite the README claiming there are no prerequisites, there are quite a few perl modules you need to have installed for the system to build and operate. To wit:
  • Class::Accessor
  • Config::IniFiles
  • TimeDate
  • Module::Pluggable
  • Readonly
  • Exception::Class
  • DBD::SQLite

I unpacked the software and attempted the Build build. The build went okay after I got the requisite modules installed, but './Build test' resulted in many failures, looking like:
$ ./Build test
t/00.load.t ....... 1/12
#   Failed test 'use GenProp::Genome::Property;'
#   at t/00.load.t line 9.
#     Tried to use 'GenProp::Genome::Property'.


#   Failed test 'use GenProp::Genome::Evidence;'
#   at t/00.load.t line 10.
#     Tried to use 'GenProp::Genome::Evidence'.


#   Failed test 'use GenProp::Genome::Cluster;'
#   at t/00.load.t line 11.
#     Tried to use 'GenProp::Genome::Cluster'.


#   Failed test 'use GenProp::Model;'
#   at t/00.load.t line 12.
#     Tried to use 'GenProp::Model'.


#   Failed test 'use GenProp::Model::Property;'
#   at t/00.load.t line 13.
#     Tried to use 'GenProp::Model::Property'.


#   Failed test 'use GenProp::Model::PropLink;'
#   at t/00.load.t line 14.
#     Tried to use 'GenProp::Model::PropLink'.


#   Failed test 'use GenProp::Model::Evidence;'
#   at t/00.load.t line 15.
#     Tried to use 'GenProp::Model::Evidence'.


#   Failed test 'use GenProp::Config;'
#   at t/01.config.t line 9.
#     Tried to use 'GenProp::Config'.

Not surprisingly, the GenProp module distributed with v3.1 does not reflect the directory structure listed in the messages. This appears to be legacy that wasn't cleaned up in the distro.

Pressing on...

The meat of the system is evaluate_props.pl (in the scripts subdir). This is what runs the rules against a genome/HMM hits database and stores the results in the GenProp database. The program can run off GFF3 format files, rather than a TIGR-style small genome database schema, which is convenient for those who aren't running TIGR-style small genome database (i.e., everybody).

The description of how to encode the required information in the GFF file can be found essentially in two places: 1) the GFF3 format definition at Sequence Ontology, and 2) the htab2gff.pl program distributed with the other GenProp code (specifically for the attributes needed for HMM hit entries). One important detail is that the HMM hit records in the GFF file should have HMM_hit in the 'type' field (field 3) (rather than the GFF3-mandated protein_hmm_match - so, yes, these GFF3 files won't validate).

Minimally, you need to make sure each HMM_hit record is linked by a Parent attribute to a CDS, which in turn is linked by a Parent attribute to a gene.

Example:
ITZX_scaf_103 PNNL gene 14974 16209 . - . ID=gene16660;locus_tag=HLSNC01_00224;bin=v3_bin01;alias=ITZX_scaf_103_13,6666666.44063.peg.1126
ITZX_scaf_103 PNNL CDS 14974 16209 . - 0 ID=CDS16660;Parent=gene16660;feat_name=HLSNC01_00224;locus_tag=HLSNC01_00224;bin=v3_bin01;product=amidase%2C hydantoinase/carbamoylase family;ec_number=3.5.1.6
ITZX_scaf_103 PNNL HMM_hit 15000 16191 419.1 - . ID=TIGR01879.HLSNC01_00224;Parent=CDS16660;Target=TIGR01879 7 404;trusted_cutoff=204.00;noise_cutoff=143.85

I wrote a script that dumps my database schema to this gff format. One can also use the htab2gff.pl to combine an existing genome GFF3 file and hmmer search results (in htab format) into a properly formatted file. [The caveat here is that the trusted and noise cutoffs must be encoded in the htab file, or in a database that you would have to configure the program to contact.]

[This brings up another aside: What is an htab file, you ask? It's just hmmer search results (either hmmsearch or hmmscan) in a tab-delimited format. The conversion can be accomplished with this program, although it may need some modification to work with newer versions of hmmer.]

I had to change the GenProp.ini file (in the conf subdir) to point to my TIGR-style GenProp database, and provide login information.

My program incantation then looked like:
/PATH/TO/GenomeProperties/scripts/evaluate_props.pl -d HLUCCbin01:myShare/v3_bin01.gff -a ALL

And output looks like:
HLUCCbin01      GenProp0037:    State some evidence     Val 0.666666666666667   Assigned-by RULES
HLUCCbin01      GenProp0038:    State some evidence     Val 0.5 Assigned-by RULES
HLUCCbin01      GenProp0039:    State none found        Val 0   Assigned-by RULES
HLUCCbin01      GenProp0046:    State YES       Val 1   Assigned-by RULES
HLUCCbin01      GenProp0047:    State none found        Val 0   Assigned-by RULES
As you can see, the output very helpfully does not include a description of the property, just the state. I've actually fixed this in my version of the program.

The evaluate_props.pl program updates the 'property' and 'step_feat_link' tables in your GenProp database if you are using a database as the data source. If you are using GFF input, the GenProp database is not updated.

Again, if you used a database as your data source, one can use the prop_report.pl program to list which features fill which steps of the property. Not having done this, I instead wrote flatfile_prop_report.pl, which takes the GFF output from evaluate_props.pl as input.The ouput looks like:
$ perl /share/scripts/GenomeProperties/scripts/flatfile_prop_report.pl -f for_GP.gff -a GenProp0061

-----------------------------------------------------------------------------------
id      accession       type    property name
------- --------------- ------- ---------------------------------------------------
2043    GenProp0061    SYSTEM    lipoprotein system lgt/lsp/lnt

-----------------------
GO_id            get_GO
---------------  ------
GO:0006497            1

GO:0042158            1



-----------------------------------------------------------------------------------
property definition:
-----------------------------------------------------------------------------------
YES
NO

Some membrane proteins are anchored to the membrane by a lipid moiety
attached to a modified N-terminus rather than by a transmembrane helix
. The characteristic enzymes of this modification system are

lsp: prolipoprotein signal peptidase
lgt: prolipoprotein diacylglycerol transferase
lnt: apolipoprotein N-acyltransferase
The modifications are cleavage (by Lsp) of the lipoprotein signal
sequence at a motif resembling L-X-G/A-C to leave Cys as the new N-
terminal residue, then modification of the Cys sulfhydryl by
prolipoprotein diacylglyceryl transferase (Lgt), then acylation of the
Cys NH2 by apolipoprotein N-acyltransferase (Lnt).
-----------------------------------------------------------------------------------
parent properties                                               id      accession
--------------------------------------------------------------- ------- -----------
protein modification, prosthetic groups and cofactors           4016    GenProp0075

-----------------------------------------------------------------------------------
stepnum in_rule step_name
------- ------- -------------------------------------------------------------------
lgt        1    prolipoprotein diacylglyceryl transferase
lnt        0    apolipoprotein N-acyltransferase
lsp        1    prolipoprotein signal peptidase

-----------------------------------------------------------------------------------
step_num get_GO method          query
-------- ------ --------------- -----------------------
lgt    (prolipoprotein diacylglyceryl transferase)
         1         HMM         TIGR00544                  scaffold_26_122

lnt    (apolipoprotein N-acyltransferase)
         1         HMM         TIGR00546                  scaffold_0_365

lsp    (prolipoprotein signal peptidase)
         1         HMM         TIGR00077                  scaffold_30_198
...and Bob's your uncle. Now you can, from flatfile input, run and view GenomeProperties. A small amount of clever engineering can connect these scripts to your favorite relational database scheme.

Wednesday, May 6, 2015

NCBI-22

While registering a BioProject:

If you have not registered your sample, please register at BioSample. At the end of that process, you will be returned to this submission.

So I follow the link and start setting up my BioSample until I get to the page which says:

If you have not registered your project, please register at BioProject. At the end of that process, you will be returned to this submission.

I tried to talk to my colleague Dr. Doctor Doctor about this, but he was out, despite his secretary telling me he was in.