Wednesday, August 17, 2016

Xander part II - Building your own models

I touched on this process a little bit in my original post about Xander, but I thought I'd spell out how I do it in more detail. As an example, we will build a model to complement the rplB model that is packaged with Xander. That model really only recognizes bacterial RplB, but I'm also interested in Archaea, so I'll build an rplB_arch model.

First I searched for Archaeal RplB sequences in NCBI. The more comprehensive and specific your search, the better, so I used: 
(rplB[gene] OR "ribosomal protein L2"[title]) AND "Archaea"[Organism]
 to search the Protein database and got 744 hits. Those were downloaded as a fasta file. 

Then I simplified the sequence set for the purposes of building the HMM, using cd-hit:
cd-hit -c 0.8 -i NCBI.rplB_arch.pep -o NCBI.rplB_arch.nr.pep
The non-redundant set was aligned using muscle, screened for truncated sequences, trimmed for gappy (>50% gap) columns, and made non-redundant (again) to 80% identity. I use belvu for all this (part of Sanger's SeqTools package). This should help build a good model. 
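
For the alignment step, a default MUSCLE run (v3 syntax) is all that's needed; the output filename here is arbitrary:

muscle -in NCBI.rplB_arch.nr.pep -out NCBI.rplB_arch.nr.afa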


This left me with 162 sequences. The alignment was saved in Stockholm format (alas, belvu does not write a correct Stockholm format, and thus the file must be edited) and an HMM was built using HMMer v3.0 (if you build it with 3.1, prepare_gene_ref.sh won't work later).
hmmbuild -n rplB_arch rplB_arch.hmm rplB_arch.sto
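
For reference, a minimal valid Stockholm file is just the version line, one name/sequence row per sequence, and a terminating '//':

# STOCKHOLM 1.0
seq1  MKAVRGHKQLA
seq2  MKGVRGHKRLA
//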

The new HMM was tested by searching against a combined database of the rplB framebot file and the rplB_arch nr file to see how it scores against Archaeal and Bacterial sequences.
hmmsearch --cpu 16 -E 1e-5 --tblout rplB_arch.uniprot.tab -o rplB_arch.uniprot.hmmsearch --acc --noali rplB_arch.hmm /scripts/db/uniprot/uniref90.fasta

[Figure: comparison of hmmer bit scores (y-axis) for each sequence against the rplB and rplB_arch models; sequences score well against either rplB or rplB_arch.]
The scores suggest a trusted cutoff (lowest score that excludes known false positives) of 300 and a noise cutoff (lowest score that includes true positives) of 200. These can be added to the model file, but that isn't necessary.
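
If you do want to record them, HMMER3 model files carry Pfam-style cutoffs as TC and NC lines in the header, each with a full-sequence and a per-domain value; something like this near the top of rplB_arch.hmm (I haven't checked whether Xander actually reads them):

TC    300.00 300.00;
NC    200.00 200.00;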

In addition to the HMM, Xander also requires a nucleotide fasta file to use as a reference for chimera checking, and a protein fasta file used to help identify the beginning of the protein. Obviously, for these purposes, you want the files to be comprehensive and only contain full-length sequences. I wrote a program that will take a fasta file and generate a nucl.fa and framebot.fa file, drawing the nucleotide sequences from ENA using their REST interface, and drawing the taxonomy information from NCBI using EUtilities. This program is not robust, but serves my purposes for now. [As an aside, why is it impossible to get nucleotide sequence from a protein accession from NCBI? There appears to literally be no way to do it.]

[ !--- UPDATE ---!]

I think the most difficult part of this process is generating the framebot.fa and nucl.fa files, depending on what databases you are trying to draw references from. My current approach is to build the HMM based on the well-curated sequences from UniProt, then search the HMM against RefSeq, screen for hits that score above the determined trusted cutoff, and then use the protein IDs to pull the protein sequences from the RefSeq database (using fastacmd or blastdbcmd),

fastacmd -d /scripts/db/NCBI/nrRefSeq_Prok.faa -i good.cNorB.ids -o good.cNorB.faa 

and get to the NA sequence and taxonomy using eutils (using first elink from protein to nuccore, and then efetch from nuccore, and elink from protein to taxonomy and efetch from taxonomy, respectively). This part has been coded into build_nucl_framebot.pl, available on my github site.
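
For what it's worth, the screening and lookup steps can be sketched at the command line. The awk line pulls IDs that score above the trusted cutoff out of an hmmsearch --tblout table (the filenames and the cutoff of 300 are just illustrative), and the NCBI lookups use the Entrez Direct tools, which are command-line wrappers around the same elink/efetch EUtilities calls described above; the accession is made up.

# column 6 of the --tblout table is the full-sequence bit score
awk '!/^#/ && $6 >= 300 {print $1}' rplB_arch.refseq.tab > good.rplB_arch.ids

# protein accession -> linked nucleotide record, fetched as fasta
esearch -db protein -query "WP_012345678.1" | elink -target nuccore | efetch -format fasta

# protein accession -> taxonomy record, for the lineage
esearch -db protein -query "WP_012345678.1" | elink -target taxonomy | efetch -format xml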
[ !------------!]

[!---UPDATE 2---!]
I ran into an interesting error today. For one model, the framebot step was failing with an error that looked like:

Exception in thread "main" java.lang.IllegalArgumentException: Cannot score V, O
    at edu.msu.cme.rdp.alignment.pairwise.ScoringMatrix.score(ScoringMatrix.java:180)
    at edu.msu.cme.rdp.framebot.core.FramebotCore.computeMatrix(FramebotCore.java:81)
    at edu.msu.cme.rdp.framebot.core.FramebotCore.processSequence(FramebotCore.java:67)
    at edu.msu.cme.rdp.framebot.cli.FramebotMain.framebotItUp_prefilter(FramebotMain.java:165)
    at edu.msu.cme.rdp.framebot.cli.FramebotMain.main(FramebotMain.java:496)
    at edu.msu.cme.rdp.framebot.cli.Main.main(Main.java:50)


It turns out this was due to a sequence in the framebot file that contained an illegal character (an 'O'). Repairing that sequence fixed the problem.
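
A quick way to find such sequences (just a sketch, not the exact command I used) is to print the header of any entry whose sequence lines contain something outside the standard amino-acid alphabet:

# assumes uppercase sequences; legitimate ambiguity codes like B or Z will also be flagged
awk '/^>/ {hdr=$0; next} /[^ACDEFGHIKLMNPQRSTVWYX]/ {print hdr; print}' framebot.fa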
[ !------------!]

The last thing to do is run prepare_gene_ref.sh. A few things to note:
  • This process requires the 'patched' version of hmmer3.0. Follow the instructions in the README.md file (they're way at the bottom and very simple).
  • prepare_gene_ref.sh also needs to be altered to fit your environment. Change the JAR_DIR, REF_DIR and hmmer_xanderpatch paths to suit your setup.
  • The hmm file must be named [name].hmm; the seed file must be named [name].seeds; the nucleotide file must be named nucl.fa; the framebot file must be named framebot.fa.
../../../bin/prepare_gene_ref.sh rplB_arch

Once this is run, take a look at the ref_aligned.faa file that is written to the parent directory. If some sequences are not aligned properly, they may not have good representatives in the .seeds file (and thus the HMM).
That should be it. As long as you have done all this within the Xander subdirectory structure, you should be able to use your new model to search metagenomes.

Thursday, May 26, 2016

Xander - assembling target genes from metagenomic data

I've acquired a large metagenomic data set in which I wish to find specific genes. Recently, the Michigan State group led by James Cole published a new tool to assemble targeted genes from a metagenome called Xander. Let's try it out.

First the pre-requisites:
  • When installing RDPTools, I ran into a problem: for some reason, during the 'submodule init' step, TaxonomyTree was getting a 'connection refused' error. I didn't notice at first, and it caused make to fail when it hit Clustering. I ended up solving this by simply doing a git clone of TaxonomyTree within the RDPTools directory (see the commands just after this list). After that, make ran smoothly.
  • I also had to install uchime (which is interesting since Bob Edgar, its author, now considers it obsolete). There was some issue with making the source, so I just downloaded the precompiled binary, and it works for me.
  • hmmer3.1 I already had.
  • I updated to openJDK 1.8 just for fun.
  • Python 2.7 was also already installed.
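
The TaxonomyTree workaround was something like this (the repository URL is my guess at the rdpstaff GitHub location):

cd RDPTools
git clone https://github.com/rdpstaff/TaxonomyTree.git
make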

Now the main course.
I cloned the Xander project from Github, and, following the instructions in the README, edited the xander_setenv.sh to reflect what I wanted to do. My sequence file is 30G, so I set the FILTER_SIZE to 38 and the MAX_JVM_HEAP to 100G (I have 256G of RAM on my server). I also upped the MIN_COUNT to 5 for this test run.
Lessons from trouble-shooting:
  1. Make sure you set the JAR_DIR to the RDPTools directory (that's not clear from the comments)
  2. Leave off the trailing "/"s for directories
  3. Do not use relative paths! Absolute paths for everything.
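
Putting those together, the lines I actually changed in my_xander_setenv.sh ended up looking roughly like this (paths are placeholders for my setup; the variable names follow the template that ships with Xander):

SEQFILE=/home/bifx/data/my_metagenome.fasta   # absolute path to the 30G read file
JAR_DIR=/home/bifx/RDPTools                   # the RDPTools directory itself, no trailing slash
REF_DIR=/home/bifx/Xander_assembler           # no trailing slash
FILTER_SIZE=38       # bloom filter size parameter, bumped up for the large read file
MAX_JVM_HEAP=100G    # java heap; I later had to raise this to 250G
MIN_COUNT=5          # minimum kmer count; I later dropped this to 2
K_SIZE=45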

I ran the 'build', 'find' and 'search' steps separately to simplify troubleshooting. My first issue was that they left an absolute path in run_xander_skel.sh, so it didn't execute. That was an easy edit.

I got some failures during the 'build' step due to putting a relative path in the my_xander_setenv.sh file for the sequence file. (The run_xander_skel.sh script makes the working subdir and then cd's to it, so if you provide a relative path, it can't find the file.)

I also was seeing some failures due to java running out of heap space. Turns out 100G wasn't enough for my 122M sequences. So we cranked it up to 250G, and it ran. I am now concerned about what happens when I try my files that are twice as big...

Again because of relative paths in my setenv file, the last part of the 'find' step failed and dumped all results into a single uniq_starts.txt file in the k45 subdir. That caused the 'search' step to fail, naturally.

So we get to the 'search' step. I ran:
$ /home/bifx/Xander_assembler/bin/run_xander_skel.sh my_xander_setenv.sh "search" "rplB"

### Search contigs rplB
java -Xmx250G -jar /home/bifx/RDPTools/hmmgs.jar search -p 20 1 100 ../k45.bloom /home/bifx/Xander_assembler/gene_resource/rplB/for_enone.hmm /home/bifx/Xander_assembler/gene_resource/rplB/rev_enone.hmm gene_starts.txt 1> stdout.txt 2> stdlog.txt
### Merge contigs
java -Xmx250G -jar /home/bifx/RDPTools/hmmgs.jar merge -a -o merge_stdout.txt -s C3 -b 50 --min-length 150 /home/bifx/Xander_assembler/gene_resource/rplB/for_enone.hmm stdout.txt gene_starts.txt_nucl.fasta
Read in 625 contigs, wrote out 0 merged contigs in 0.577s


Things seem to have run okay, but the stdlog.txt has almost 50K warnings like:
Warning: kmer not in bloomfilter: aacaagcgcacgaacagcatgatcgttcagcgtcgtcacaaacga

Why would these kmers not be in the bloom filter? Is it the MIN_COUNT setting? I contacted the developer, and she suggested dropping the MIN_COUNT to 2 (I didn't think this data set would suffer from low coverage, but apparently it does). This got me through the merge step only to choke at the kmer_coverage calculation. A java error:
Exception in thread "main" java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
        at edu.msu.cme.rdp.kmer.cli.KmerCoverage.adjustCount(KmerCoverage.java:233)
        at edu.msu.cme.rdp.kmer.cli.KmerCoverage.printCovereage(KmerCoverage.java:249)
        at edu.msu.cme.rdp.kmer.cli.KmerCoverage.main(KmerCoverage.java:388)
        at edu.msu.cme.rdp.kmer.cli.Main.main(Main.java:48)

kmer_coverage failed

This was solved by running with OpenJDK 1.8.0 rather than 1.7.0. (The return type of ConcurrentHashMap.keySet() changed in Java 8, so code compiled against JDK 8 throws this NoSuchMethodError when run on a Java 7 JVM.)

I wanted to make my own reference genes. Fortunately, for some genes there were already models built at fungene, so I downloaded the .seeds and .hmm files. For the ones without, I found TIGRFAM models, which should also work. You can download both the HMM and seed files from the JCVI site.

I still needed the framebot.fa and nucl.fa files. I (laboriously) downloaded ~30K protein and nucleotide sequences from fungene (note here that Firefox had trouble with this site - I ended up using Konqueror to do those downloads), and used CD-HIT to make the files non-redundant. I chose to use cutoffs of 95% id for protein and 90% id for nucleotide.
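
Roughly like this (the input filenames are placeholders; -n 8 is the word size the cd-hit documentation recommends for a 90% nucleotide threshold):

cd-hit -c 0.95 -i fungene_rplB_prot.fa -o framebot.fa
cd-hit-est -c 0.90 -n 8 -i fungene_rplB_nucl.fa -o nucl.fa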

I applied the Xander_patch to hmmer-3.0 (as instructed in the README.md that comes with the Xander distribution) and modified prepare_gene_ref.sh to point to the right directories (make sure you change JAR_DIR, REF_DIR, and hmmer_xanderpatch).

Running prepare_gene_ref.sh yielded the expected forward and reverse hmm files.

One other note about making your own reference genes is that the taxonomy abundance breakdown (the taxonabundance output file) depends on the framebot.fa file having the semicolon-delimited lineage as the description (see one of the provided framebot.fa files to see what I mean). I wrote a perl script that can take a fasta file, parse for the protein accession, retrieve the lineage information from NCBI, format it, and output the properly formatted framebot.fa file.
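
For illustration, a description line formatted that way looks roughly like this (the accession and lineage are made up to show the format):

>ABC12345 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Rhizobium
MAIRKYKPTTPGRR...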