Thursday, May 26, 2016

Xander - assembling target genes from metagenomic data

I've acquired a large metagenomic data set in which I wish to find specific genes. Recently, the Michigan State group led by James Cole published Xander, a new tool for assembling targeted genes directly from metagenomic data. Let's try it out.

First the pre-requisites:
When installing RDPTools, the 'git submodule init' step failed for TaxonomyTree with a 'connection refused' error. I didn't notice at first, and this caused make to fail when it hit Clustering. I ended up solving it by simply doing a git clone of TaxonomyTree within the RDPTools directory. After that, make ran smoothly.
I also had to install uchime (which is interesting since Bob Edgar, its author, now considers it obsolete). There was some issue with making the source, so I just downloaded the precompiled binary, and it works for me.
hmmer3.1 I already had.
I updated to openJDK 1.8 just for fun.
Python 2.7 was also already installed.

Now the main course.
I cloned the Xander project from Github, and, following the instructions in the README, edited the xander_setenv.sh to reflect what I wanted to do. My sequence file is 30G, so I set the FILTER_SIZE to 38 and the MAX_JVM_HEAP to 100G (I have 256G of RAM on my server). I also upped the MIN_COUNT to 5 for this test run.
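If I'm reading the Xander README right, FILTER_SIZE is the log2 of the bloom filter size in bits (38 → 32 GB, 40 → 128 GB), which makes it easy to sanity-check the memory math before launching:

```shell
# Sanity check (assumes FILTER_SIZE = log2 of bloom filter size in bits,
# per my reading of the README): FILTER_SIZE=38 costs 32 GiB of RAM.
filter_size=38
bytes=$(( (1 << filter_size) / 8 ))
gib=$(( bytes / 1024 / 1024 / 1024 ))
echo "bloom filter at FILTER_SIZE=${filter_size}: ${gib} GiB"
```

With MAX_JVM_HEAP at 100G, that leaves plenty of headroom for the filter itself; the rest of the heap goes to the assembly.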
Lessons from trouble-shooting:
  1. Make sure you set the JAR_DIR to the RDPTools directory (that's not clear from the comments)
  2. Leave off the trailing "/"s for directories
  3. Do not use relative paths! Absolute paths for everything.
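For reference, the relevant part of my my_xander_setenv.sh ended up looking like the sketch below (paths are examples from my layout, not Xander defaults — adjust to your system):

```shell
# Hypothetical excerpt of my_xander_setenv.sh -- everything absolute, no trailing slashes.
SEQFILE=/home/bifx/data/metagenome.fastq   # absolute path! the script cd's away from here
WORKDIR=/home/bifx/xander_work             # no trailing slash
REF_DIR=/home/bifx/Xander_assembler        # no trailing slash
JAR_DIR=/home/bifx/RDPTools                # the RDPTools checkout itself, not a subdir
FILTER_SIZE=38
MAX_JVM_HEAP=100G
MIN_COUNT=5
```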

I ran the 'build', 'find' and 'search' steps separately to simplify troubleshooting. My first issue was that run_xander_skel.sh shipped with a leftover absolute path from the developers' system, so it didn't execute. That was an easy edit.

I got some failures during the 'build' step due to putting a relative path in the my_xander_setenv.sh file for the sequence file. (The run_xander_skel.sh script makes the working subdir and then cd's to it, so if you provide a relative path, it can't find the file.)
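Here's a toy reproduction of that failure mode, with no Xander involved:

```shell
# Toy reproduction: like run_xander_skel.sh, we mkdir a working subdir and
# cd into it before trying to read the sequence file.
launch_dir=$(mktemp -d) && cd "$launch_dir"
echo ">read1" > seqs.fa
rel=seqs.fa                  # what I had in my_xander_setenv.sh at first
abs=$launch_dir/seqs.fa      # what it needs to be
mkdir workdir && cd workdir  # what the skeleton script does
[ -f "$rel" ] || echo "relative path: not found"
[ -f "$abs" ] && echo "absolute path: found"
```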

I was also seeing failures due to Java running out of heap space. Turns out 100G wasn't enough for my 122M sequences, so we cranked it up to 250G, and it ran. I am now concerned about what happens when I try my files that are twice as big...

Again because of relative paths in my setenv file, the last part of the 'find' step failed and dumped all results into a single uniq_starts.txt file in the k45 subdir. That caused the 'search' step to fail, naturally.

So we get to the 'search' step. I ran:
$ /home/bifx/Xander_assembler/bin/run_xander_skel.sh my_xander_setenv.sh "search" "rplB"

### Search contigs rplB
java -Xmx250G -jar /home/bifx/RDPTools/hmmgs.jar search -p 20 1 100 ../k45.bloom /home/bifx/Xander_assembler/gene_resource/rplB/for_enone.hmm /home/bifx/Xander_assembler/gene_resource/rplB/rev_enone.hmm gene_starts.txt 1> stdout.txt 2> stdlog.txt
### Merge contigs
java -Xmx250G -jar /home/bifx/RDPTools/hmmgs.jar merge -a -o merge_stdout.txt -s C3 -b 50 --min-length 150 /home/bifx/Xander_assembler/gene_resource/rplB/for_enone.hmm stdout.txt gene_starts.txt_nucl.fasta
Read in 625 contigs, wrote out 0 merged contigs in 0.577s


Things seem to have run okay, but the stdlog.txt has almost 50K warnings like:
Warning: kmer not in bloomfilter: aacaagcgcacgaacagcatgatcgttcagcgtcgtcacaaacga

Why would these kmers not be in the bloom filter? Is it the MIN_COUNT setting? I contacted the developer, and she suggested dropping MIN_COUNT to 2. (I didn't think this data set would suffer from low coverage, but apparently it does.) That got me through the merge step, only to choke at the kmer_coverage calculation with a Java error:
Exception in thread "main" java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
        at edu.msu.cme.rdp.kmer.cli.KmerCoverage.adjustCount(KmerCoverage.java:233)
        at edu.msu.cme.rdp.kmer.cli.KmerCoverage.printCovereage(KmerCoverage.java:249)
        at edu.msu.cme.rdp.kmer.cli.KmerCoverage.main(KmerCoverage.java:388)
        at edu.msu.cme.rdp.kmer.cli.Main.main(Main.java:48)

kmer_coverage failed

This was solved by running with OpenJDK 1.8.0 rather than 1.7.0. (The jar was evidently compiled against Java 8, where ConcurrentHashMap.keySet() returns the new KeySetView type; run it on a Java 7 JRE and you get exactly this NoSuchMethodError.)
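For anyone juggling multiple JDKs, pointing just the current session at 1.8 avoids touching system defaults (the JDK path below is distro-dependent — check your packaging):

```shell
# Illustrative only -- the install path varies by distro/packaging.
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
export PATH="$JAVA_HOME/bin:$PATH"
java -version   # should now report 1.8.0
```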

I wanted to make my own reference genes. Fortunately, for some genes there were already models built at fungene, so I downloaded the .seeds and .hmm files. For those without, I found TIGRFAM models, which should also work; you can download both the HMM and seed files from the JCVI site.

I still needed the framebot.fa and nucl.fa files. I (laboriously) downloaded ~30K protein and nucleotide sequences from fungene (note here that Firefox had trouble with this site - I ended up using Konqueror to do those downloads), and used CD-HIT to make the files non-redundant. I chose to use cutoffs of 95% id for protein and 90% id for nucleotide.
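Those cutoffs translate into something like the following (word sizes follow the CD-HIT user guide recommendations for these identity thresholds; file names are mine):

```shell
# cd-hit for protein at 95% id (-n 5 word size), cd-hit-est for nucleotide
# at 90% id (-n 8 word size). -M and -T just cap memory/threads.
cd-hit     -i framebot_raw.faa -o framebot.fa -c 0.95 -n 5 -M 16000 -T 8
cd-hit-est -i nucl_raw.fna     -o nucl.fa     -c 0.90 -n 8 -M 16000 -T 8
```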

I applied the Xander_patch to hmmer-3.0 (as instructed in the README.md that comes with the Xander distribution) and modified prepare_gene_ref.sh to point to the right directories (make sure you change JAR_DIR, REF_DIR, and hmmer_xanderpatch).

Running prepare_gene_ref.sh yielded the expected forward and reverse hmm files.

One other note about making your own reference genes is that the taxonomy abundance breakdown (the taxonabundance output file) depends on the framebot.fa file having the semicolon-delimited lineage as the description (see one of the provided framebot.fa files to see what I mean). I wrote a perl script that can take a fasta file, parse for the protein accession, retrieve the lineage information from NCBI, format it, and output the properly formatted framebot.fa file.
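For reference, the headers I aim to produce look like the sketch below. The accession, lineage, and sequence are made up for illustration, and I'm inferring the exact layout from the distributed files — check one of the provided framebot.fa files for the authoritative format:

```shell
# Illustrative framebot.fa record: accession, then the semicolon-delimited
# lineage as the description. All values here are placeholders.
acc="WP_011735249"
lineage="Bacteria;Proteobacteria;Gammaproteobacteria;Escherichia coli"
header=">${acc} ${lineage}"
printf '%s\n%s\n' "$header" "MAVKKVKPTSPGRR"   # placeholder protein sequence
```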

Thursday, May 7, 2015

Installing JCVI's GenomeProperties locally

One might wonder why I'm attempting this at all. Well, I worked at TIGR with Dan Haft for many years, and I've carried the database structures with me through my career. So I've got a leg up on your average mortal in terms of a) having and b) understanding the database structure that underlies the GP system.

That being said, I didn't anticipate this being easy.

First, I built a database called GenProp in MySQL with the relevant table structure and populated it with data from the FTP site. (The GP v3.1 release notes are pretty easy to follow.) In hindsight, I suggest doing this with SQLite instead, because the program is written to operate off an SQLite db; using anything else requires some code alteration.
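If you go the SQLite route, loading the FTP flat files is painless with the CLI's .import command. A toy sketch (the table, columns, and file name here are illustrative, not the real GP schema — follow the release notes for that):

```shell
# Illustrative load of one tab-delimited flat file into SQLite.
# Table/column names are made up; the real schema is in the GP release notes.
printf 'GenProp0037\tsome property name\tSYSTEM\n' > property.tab
sqlite3 GenProp.db <<'EOF'
CREATE TABLE property (prop_acc TEXT PRIMARY KEY, property TEXT, prop_type TEXT);
.separator "\t"
.import property.tab property
EOF
sqlite3 GenProp.db "SELECT property FROM property WHERE prop_acc = 'GenProp0037';"
```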

Despite the README claiming there are no prerequisites, there are quite a few perl modules you need to have installed for the system to build and operate. To wit:
  • Class::Accessor
  • Config::IniFiles
  • TimeDate
  • Module::Pluggable
  • Readonly
  • Exception::Class
  • DBD::SQLite

I unpacked the software and attempted the build. It went okay after I got the requisite modules installed, but './Build test' resulted in many failures, looking like:
$ ./Build test
t/00.load.t ....... 1/12
#   Failed test 'use GenProp::Genome::Property;'
#   at t/00.load.t line 9.
#     Tried to use 'GenProp::Genome::Property'.
#   Failed test 'use GenProp::Genome::Evidence;'
#   at t/00.load.t line 10.
#     Tried to use 'GenProp::Genome::Evidence'.
#   Failed test 'use GenProp::Genome::Cluster;'
#   at t/00.load.t line 11.
#     Tried to use 'GenProp::Genome::Cluster'.
#   Failed test 'use GenProp::Model;'
#   at t/00.load.t line 12.
#     Tried to use 'GenProp::Model'.
#   Failed test 'use GenProp::Model::Property;'
#   at t/00.load.t line 13.
#     Tried to use 'GenProp::Model::Property'.
#   Failed test 'use GenProp::Model::PropLink;'
#   at t/00.load.t line 14.
#     Tried to use 'GenProp::Model::PropLink'.
#   Failed test 'use GenProp::Model::Evidence;'
#   at t/00.load.t line 15.
#     Tried to use 'GenProp::Model::Evidence'.
#   Failed test 'use GenProp::Config;'
#   at t/01.config.t line 9.
#     Tried to use 'GenProp::Config'.
Not surprisingly, the GenProp module distributed with v3.1 does not reflect the directory structure listed in the messages. This appears to be legacy that wasn't cleaned up in the distro.

Pressing on...

The meat of the system is evaluate_props.pl (in the scripts subdir). This is what runs the rules against a genome/HMM hits database and stores the results in the GenProp database. The program can run off GFF3 format files rather than a TIGR-style small genome database schema, which is convenient for those who aren't running a TIGR-style small genome database (i.e., everybody).

The description of how to encode the required information in the GFF file can be found essentially in two places: 1) the GFF3 format definition at Sequence Ontology, and 2) the htab2gff.pl program distributed with the other GenProp code (specifically for the attributes needed for HMM hit entries). One important detail is that the HMM hit records in the GFF file should have HMM_hit in the 'type' field (field 3) (rather than the GFF3-mandated protein_hmm_match - so, yes, these GFF3 files won't validate).

Minimally, you need to make sure each HMM_hit record is linked by a Parent attribute to a CDS, which in turn is linked by a Parent attribute to a gene.

Example:
ITZX_scaf_103 PNNL gene 14974 16209 . - . ID=gene16660;locus_tag=HLSNC01_00224;bin=v3_bin01;alias=ITZX_scaf_103_13,6666666.44063.peg.1126
ITZX_scaf_103 PNNL CDS 14974 16209 . - 0 ID=CDS16660;Parent=gene16660;feat_name=HLSNC01_00224;locus_tag=HLSNC01_00224;bin=v3_bin01;product=amidase%2C hydantoinase/carbamoylase family;ec_number=3.5.1.6
ITZX_scaf_103 PNNL HMM_hit 15000 16191 419.1 - . ID=TIGR01879.HLSNC01_00224;Parent=CDS16660;Target=TIGR01879 7 404;trusted_cutoff=204.00;noise_cutoff=143.85

I wrote a script that dumps my database schema to this gff format. One can also use the htab2gff.pl to combine an existing genome GFF3 file and hmmer search results (in htab format) into a properly formatted file. [The caveat here is that the trusted and noise cutoffs must be encoded in the htab file, or in a database that you would have to configure the program to contact.]
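Since these files intentionally won't validate as GFF3, I find a quick home-grown check of the Parent chain worthwhile before feeding them to evaluate_props.pl. The sketch below (my own, not part of GenomeProperties) builds a two-record file and verifies every HMM_hit's Parent resolves to a CDS ID; it assumes CDS records precede their HMM_hits, as they do in my dumps:

```shell
# Build a minimal test file with real tab separators.
printf 'ITZX_scaf_103\tPNNL\tCDS\t14974\t16209\t.\t-\t0\tID=CDS16660;Parent=gene16660\n'  > check.gff
printf 'ITZX_scaf_103\tPNNL\tHMM_hit\t15000\t16191\t419.1\t-\t.\tID=TIGR01879.X;Parent=CDS16660\n' >> check.gff

# Collect CDS IDs, then flag any HMM_hit whose Parent is not among them.
awk -F'\t' '
  $3 == "CDS" && match($9, /ID=[^;]+/) {
    ids[substr($9, RSTART + 3, RLENGTH - 3)] = 1
  }
  $3 == "HMM_hit" && match($9, /Parent=[^;]+/) {
    p = substr($9, RSTART + 7, RLENGTH - 7)
    if (!(p in ids)) { print "orphan HMM_hit -> " p; bad = 1 }
  }
  END { exit bad }
' check.gff && echo "Parent links OK"
```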

[This brings up another aside: What is an htab file, you ask? It's just hmmer search results (either hmmsearch or hmmscan) in a tab-delimited format. The conversion can be accomplished with this program, although it may need some modification to work with newer versions of hmmer.]

I had to change the GenProp.ini file (in the conf subdir) to point to my TIGR-style GenProp database, and provide login information.

My program incantation then looked like:
/PATH/TO/GenomeProperties/scripts/evaluate_props.pl -d HLUCCbin01:myShare/v3_bin01.gff -a ALL

And output looks like:
HLUCCbin01      GenProp0037:    State some evidence     Val 0.666666666666667   Assigned-by RULES
HLUCCbin01      GenProp0038:    State some evidence     Val 0.5 Assigned-by RULES
HLUCCbin01      GenProp0039:    State none found        Val 0   Assigned-by RULES
HLUCCbin01      GenProp0046:    State YES       Val 1   Assigned-by RULES
HLUCCbin01      GenProp0047:    State none found        Val 0   Assigned-by RULES
As you can see, the output very helpfully does not include a description of the property, just the state. I've actually fixed this in my version of the program.

The evaluate_props.pl program updates the 'property' and 'step_feat_link' tables in your GenProp database if you are using a database as the data source. If you are using GFF input, the GenProp database is not updated.

Again, if you used a database as your data source, you can use the prop_report.pl program to list which features fill which steps of the property. Not having done this, I instead wrote flatfile_prop_report.pl, which takes the GFF output from evaluate_props.pl as input. The output looks like:
$ perl /share/scripts/GenomeProperties/scripts/flatfile_prop_report.pl -f for_GP.gff -a GenProp0061

-----------------------------------------------------------------------------------
id      accession       type    property name
------- --------------- ------- ---------------------------------------------------
2043    GenProp0061    SYSTEM    lipoprotein system lgt/lsp/lnt

-----------------------
GO_id            get_GO
---------------  ------
GO:0006497            1
GO:0042158            1


-----------------------------------------------------------------------------------
property definition:
-----------------------------------------------------------------------------------
YES
NO

Some membrane proteins are anchored to the membrane by a lipid moiety
attached to a modified N-terminus rather than by a transmembrane helix.
The characteristic enzymes of this modification system are

lsp: prolipoprotein signal peptidase
lgt: prolipoprotein diacylglycerol transferase
lnt: apolipoprotein N-acyltransferase

The modifications are cleavage (by Lsp) of the lipoprotein signal
sequence at a motif resembling L-X-G/A-C to leave Cys as the new N-
terminal residue, then modification of the Cys sulfhydryl by
prolipoprotein diacylglyceryl transferase (Lgt), then acylation of the
Cys NH2 by apolipoprotein N-acyltransferase (Lnt).
-----------------------------------------------------------------------------------
parent properties                                               id      accession
--------------------------------------------------------------- ------- -----------
protein modification, prosthetic groups and cofactors           4016    GenProp0075

-----------------------------------------------------------------------------------
stepnum in_rule step_name
------- ------- -------------------------------------------------------------------
lgt        1    prolipoprotein diacylglyceryl transferase
lnt        0    apolipoprotein N-acyltransferase
lsp        1    prolipoprotein signal peptidase

-----------------------------------------------------------------------------------
step_num get_GO method          query
-------- ------ --------------- -----------------------
lgt    (prolipoprotein diacylglyceryl transferase)
         1         HMM         TIGR00544                  scaffold_26_122

lnt    (apolipoprotein N-acyltransferase)
         1         HMM         TIGR00546                  scaffold_0_365

lsp    (prolipoprotein signal peptidase)
         1         HMM         TIGR00077                  scaffold_30_198
...and Bob's your uncle. Now you can run and view GenomeProperties entirely from flatfile input. A small amount of clever engineering can connect these scripts to your favorite relational database schema.

Wednesday, May 6, 2015

NCBI-22

While registering a BioProject:

If you have not registered your sample, please register at BioSample. At the end of that process, you will be returned to this submission.

So I follow the link and start setting up my BioSample until I get to the page which says:

If you have not registered your project, please register at BioProject. At the end of that process, you will be returned to this submission.

I tried to talk to my colleague Dr. Doctor Doctor about this, but he was out, despite his secretary telling me he was in.

Monday, September 15, 2014

Firefox/Seamonkey/IceCat and the "Couldn't load XPCOM" error

So I ran into this error upon installing Seamonkey v2.29 in order to do some compatibility testing. This was particularly annoying because I have a version of Firefox that is running just fine.

I had downloaded the Seamonkey 64-bit contributed build, untarred it in my /usr/lib64 directory, and then attempted to launch, only to see the dreaded:
Couldn't load XPCOM.

A search on the web led to many suggestions of re-installing, which I considered to be using a machete when a scalpel should suffice.

So a little more digging suggested the problem is the program's inability to link with the correct version of xulrunner. The seamonkey subdir does contain its own libxul.so and run-mozilla.sh, but apparently those are not interfacing well with the rest of the system.

A peek inside /usr/lib64/xulrunner showed that it, too, has a version of run-mozilla.sh. So I simply executed:
$ /usr/lib64/xulrunner/run-mozilla.sh /usr/lib64/seamonkey/seamonkey

and problem solved.

This, I'm sure, is an entirely unsophisticated description and understanding of what's going on, but the solution works (at least for me on my system).

Tuesday, February 18, 2014

Getting AMPHORA2 to work with BioPerl 1.6.9

AMPHORA2 (like its predecessor AMPHORA) is a great tool by Martin Wu and Alexandra Scott that takes you directly from genomic or metagenomic sequence to phylogeny through identification and analysis of conserved, single-copy phylogenetic marker genes. I have found it to be super handy for assessing metagenomic assemblies, both during and after the binning process. There is one problem, however, that may be preventing some people from using it.

'Known issues'

There's a note in the AMPHORA2 README files that states:
KNOWN ISSUES

Bioperl
AMPHORA2 has been tested on Bioperl 1.6.1. People have reported problems running AMPHORA2 on Bioperl 1.6.9. For example, the following error has been reported when running Phylotyping.pl:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: parse error: expected ; or ) or ,

STACK: Error::throw
STACK: Bio::Root::Root::throw
/usr/local/share/perl/5.14.2/Bio/Root/Root.pm:472
STACK: Bio::TreeIO::NewickParser::parse_newick
/usr/local/share/perl/5.14.2/Bio/TreeIO/NewickParser.pm:195
STACK: Bio::TreeIO::newick::next_tree
/usr/local/share/perl/5.14.2/Bio/TreeIO/newick.pm:143
STACK: main::assign_phylotype
/usr/local/Bioinf/AMPHORA2/Scripts/Phylotyping.pl:154
STACK: /usr/local/Bioinf/AMPHORA2/Scripts/Phylotyping.pl:70

Downgrading Bioperl from 1.6.9 to 1.6.1 solves the problem.
The problem lies in the fact that AMPHORA2 uses RAxML to build trees, and the tree files produced by RAxML don't follow the Newick/New Hampshire or NHX format. To wit, RAxML puts edge labels in brackets after the branch length, e.g.:

((REF-YP_004290029-NC_015216:0.23840847960367719804[I116],REF-YP_004520254-NC_015574:0.18701622612251170286[I117]):0.03268127637536134139[I115],REF-YP_447902-NC_007681:0.32812155226299949407[I118])



Newick!

(Okay, okay, there is this page that may allow for the format they use, but it certainly isn't generally agreed upon in the bioinformatics community, as evidenced by the fact that...)

The BioPerl module that parses Newick trees (Bio::TreeIO::NewickParser.pm) only recognizes brackets in the context of the NHX format (i.e. [&&NHX:tag1=value1]), and thus fails and bails when it hits these tags. Apparently, in version 1.6.1 of BioPerl, NewickParser.pm would drop these values in a 'bootstrap' tag, and AMPHORA2 takes advantage of this placement (or misplacement, as the case may be) to track taxonomy throughout the tree.
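To see the clash concretely: strip the bracketed labels and you recover plain Newick that BioPerl parses happily. (Not an option for AMPHORA2 itself, of course, since it needs those labels, but handy if you just want the topology out of a RAxML tree.)

```shell
# Shortened RAxML-style tree; sed deletes every [...] edge label,
# leaving standard Newick.
tree='((A:0.2384[I116],B:0.1870[I117]):0.0326[I115],C:0.3281[I118]);'
echo "$tree" | sed 's/\[[^]]*\]//g'
```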

Note that in the BioPerl HOWTO:Trees, it states that
A newly added capability (after 1.5.2) allows you to specify that the internal nodes id encode bootstrap values instead of IDs. Provide the
-internal_node_id => 'bootstrap'
to Bio::TreeIO and it will automatically move the ids over to the bootstrap slot.
This, however, does not solve the AMPHORA2 issue.

The solution suggested by the Wu lab (downgrade to BioPerl 1.6.1) is unsatisfying on several levels; thus, I came up with the following hack for NewickParser.pm:

      } elsif ($state == 4) { # optional NHX tags
        if($token =~ /\[\&\&NHX/) {
            # careful: this regexp gets rid of all NHX wrapping in one step

            $self->_start('nhx_tag');
            $token =~ /\[\&\&NHX\:(\S+)\]/;
            if ($1) {
                # NHX may be empty, presumably at end of file, just before ";"
                my @attributes = split ':', $1;
                foreach my $attribute (@attributes) {
                    $attribute =~ s/\s+//;
                    my($key,$value) = split '=', $attribute;

                    $self->_start('tag_name');
                    $self->_chars($key);
                    $self->_end('tag_name');

                    $self->_start('tag_value');
                    $self->_chars($value);
                    $self->_end('tag_value');
                  }
            }
            $self->_end('nhx_tag');
            $token = next_token(\$newick, ",);"); #move to , or )
        } elsif ($token =~ /\[/) {
            # This is a hack to make AMPHORA2 work

            if ($token =~ /\[(\S+)\]/) {
                $self->_start('bootstrap');
                $self->_chars($1);
                $self->_end('bootstrap');

             }
            $token = next_token(\$newick, ",);");
        }
        $state = 5;

      } elsif ($state == 5) { # end node


(I shouldn't have to say this next bit, but I will anyway...)
Adding this code may cause unexpected errors in other use cases! Be careful!

Thank you, Martin, for writing this tool!
