Tuesday, February 18, 2014

Getting AMPHORA2 to work with BioPerl 1.6.9

AMPHORA2 (like its predecessor AMPHORA) is a great tool by Martin Wu and Alexandra Scott that takes you directly from genomic or metagenomic sequence to phylogeny through identification and analysis of conserved, single-copy phylogenetic marker genes. I have found it to be super handy for assessing metagenomic assemblies, both during and after the binning process. There is one problem, however, that may be preventing some people from using it.

The Death Star exploding
'Known issues'

There's note in the AMPHORA2 README files that states:
KNOWN ISSUES

Bioperl
AMPHORA2 has been tested on Bioperl 1.6.1. People have reported problems running AMPHORA2 on Bioperl 1.6.9. For example, the following error has been reported when running Phylotyping.pl:

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: parse error: expected ; or ) or ,

STACK: Error::throw
STACK: Bio::Root::Root::throw
/usr/local/share/perl/5.14.2/Bio/Root/Root.pm:472
STACK: Bio::TreeIO::NewickParser::parse_newick
/usr/local/share/perl/5.14.2/Bio/TreeIO/NewickParser.pm:195
STACK: Bio::TreeIO::newick::next_tree
/usr/local/share/perl/5.14.2/Bio/TreeIO/newick.pm:143
STACK: main::assign_phylotype
/usr/local/Bioinf/AMPHORA2/Scripts/Phylotyping.pl:154
STACK: /usr/local/Bioinf/AMPHORA2/Scripts/Phylotyping.pl:70

Downgrading Bioperl from 1.6.9 to 1.6.1 solves the problem.
The problem lies in the fact that AMPHORA2 uses RAxML to build trees, and the tree files produced by RAxML don't follow the Newick/New Hampshire or NHX format. To wit, RAxML puts edge labels in brackets after the branch length, e.g.:

((REF-YP_004290029-NC_015216:0.23840847960367719804[I116],REF-YP_004520254-NC_015574:0.18701622612251170286[I117]):0.03268127637536134139[I115],REF-YP_447902-NC_007681:0.32812155226299949407[I118])



Newick!

(Okay, okay, there is this page that may allow for the format they use, but it certainly isn't generally agreed upon in the bioinformatics community, as evidenced by the fact that...)

The BioPerl module that parses Newick trees (Bio::TreeIO::NewickParser.pm) only recognizes brackets in the context of NHX format (i.e. [&&NHX:tag1=value1), and thus fails and bails when it hits these tags. Apparently, in version 1.6.1 of BioPerl, NewickParser.pm would drop these values in a 'bootstrap' tag, and AMPHORA2 takes advantage of this placement (or misplacement as the case may be) to track taxonomy throughout the tree.

Note that in the BioPerl HOWTO:Trees, it states that
A newly added capability (after 1.5.2) allows you to specify that the internal nodes id encode bootstrap values instead of IDs. Provide the
-internal_node_id => 'bootstrap'
to Bio::TreeIO and it will automatically move the ids over to the bootstrap slot.
This, however, does not solve the AMPHORA2 issue.

The solution suggested by the Wu lab (downgrade to BioPerl 1.6.1) is unsatisfying on several levels, thus, I came up with the following hack for NewickParser.pm:

      } elsif ($state == 4) { # optional NHX tags
        if($token =~ /\[\&\&NHX/) {
            # careful: this regexp gets rid of all NHX wrapping in one step

            $self->_start('nhx_tag');
            $token =~ /\[\&\&NHX\:(\S+)\]/;
            if ($1) {
                # NHX may be empty, presumably at end of file, just before ";"
                my @attributes = split ':', $1;
                foreach my $attribute (@attributes) {
                    $attribute =~ s/\s+//;
                    my($key,$value) = split '=', $attribute;

                    $self->_start('tag_name');
                    $self->_chars($key);
                    $self->_end('tag_name');

                    $self->_start('tag_value');
                    $self->_chars($value);
                    $self->_end('tag_value');
                  }
            }
            $self->_end('nhx_tag');
            $token = next_token(\$newick, ",);"); #move to , or )
        } elsif ($token =~ /\[/) {
            # This is a hack to make AMPHORA2 work

            if ($token =~ /\[(\S+)\]/) {
                $self->_start('bootstrap');
                $self->_chars($1);
                $self->_end('bootstrap');

             }
            $token = next_token(\$newick, ",);");
        }
        $state = 5;

      } elsif ($state == 5) { # end node


(I shouldn't have to say this next bit, but I will anyway...)
Adding this code may cause unexpected errors in other use cases! Be careful!

Thank you, Martin, for writing this tool!

http://snhs-plin.barry.edu/Biotech_Workshop/center.gif
Pawer to the People!


3 comments:

pyrimidine said...
This comment has been removed by the author.
pyrimidine said...

Hey Bill, nice easy fix! I've committed this to the bioperl repo, we'll try to get it into the next release should the testing hold up (so far it works fine).

Here's the commit.

Bill Nelson said...

Glad someone found this useful. I shan't comment on the 'correctness' of either the RAxML tree format, how AMPHORA bends it to its will, or how TreeIO handles Newick tree data. I just wanted AMPHORA2 to work. :)