Thursday, May 26, 2016

Xander - assembling target genes from metagenomic data

I've acquired a large metagenomic data set in which I wish to find specific genes. Recently, the Michigan State group led by James Cole published a new tool to assemble targeted genes from a metagenome called Xander. Let's try it out.

First the pre-requisites:
When installing RDPTools, I had a problem in that, for some reason, when I tried to do the 'submodule init' step, TaxonomyTree was getting a 'connection refused' error. I didn't notice at first, and this was causing make to fail when it hit Clustering. I ended up solving this by simply doing a git clone of TaxonomyTree within the RDPTools directory. After that, make ran smoothly.
I also had to install uchime (which is interesting since Bob Edgar, its author, now considers it obsolete). There was some issue with making the source, so I just downloaded the precompiled binary, and it works for me.
hmmer3.1 I already had.
I updated to openJDK 1.8 just for fun.
Python 2.7 was also already installed.

Now the main course.
I cloned the Xander project from Github, and, following the instructions in the README, edited the xander_setenv.sh to reflect what I wanted to do. My sequence file is 30G, so I set the FILTER_SIZE to 38 and the MAX_JVM_HEAP to 100G (I have 256G of RAM on my server). I also upped the MIN_COUNT to 5 for this test run.
Lessons from trouble-shooting:
  1. Make sure you set the JAR_DIR to the RDPTools directory (that's not clear from the comments)
  2. Leave off the trailing "/"s for directories
  3. Do not use relative paths! Absolute paths for everything.

I ran the 'build', 'find' and 'search' steps separately to simplify troubleshooting. My first issue was that they left an absolute path in run_xander_skel.sh, so it didn't execute. That was an easy edit.

I got some failures during the 'build' step due to putting a relative path in the my_xander_setenv.sh file for the sequence file. (The run_xander_skel.sh script makes the working subdir and then cd's to it, so if you provide a relative path, it can't find the file.)

I also was seeing some failures due to java running out of heap space. Turns out 100G wasn't enough for my 122M sequences. So we cranked it up to 250G, and it ran. I am now concerned about what happens when I try my files that are twice as big...

Again because of relative paths in my setenv file, the last part of the 'find' step failed and dumped all results into a single uniq_starts.txt file in the k45 subdir. That caused the 'search' step to fail, naturally.

So we get to the 'search' step. I ran:
$ /home/bifx/Xander_assembler/bin/run_xander_skel.sh my_xander_setenv.sh "search" "rplB"

### Search contigs rplB
java -Xmx250G -jar /home/bifx/RDPTools/hmmgs.jar search -p 20 1 100 ../k45.bloom /home/bifx/Xander_assembler/gene_resource/rplB/for_enone.hmm /home/bifx/Xander_assembler/gene_resource/rplB/rev_enone.hmm gene_starts.txt 1> stdout.txt 2> stdlog.txt
### Merge contigs
java -Xmx250G -jar /home/bifx/RDPTools/hmmgs.jar merge -a -o merge_stdout.txt -s C3 -b 50 --min-length 150 /home/bifx/Xander_assembler/gene_resource/rplB/for_enone.hmm stdout.txt gene_starts.txt_nucl.fasta
Read in 625 contigs, wrote out 0 merged contigs in 0.577s


Things seem to have run okay, but the stdlog.txt has almost 50K warnings like:
Warning: kmer not in bloomfilter: aacaagcgcacgaacagcatgatcgttcagcgtcgtcacaaacga

Why would these kmers not be in the bloom filter? Is it the MIN_COUNT setting? I contacted the developer, and she suggested dropping the MIN_COUNT to 2 (I didn't think this data set would suffer from low coverage, but apparently it does). This got me through the merge step only to choke at the kmer_coverage calculation. A java error:
 Exception in thread "main" java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;

        at edu.msu.cme.rdp.kmer.cli.KmerCoverage.adjustCount(KmerCoverage.java:233)

        at edu.msu.cme.rdp.kmer.cli.KmerCoverage.printCovereage(KmerCoverage.java:249)

        at edu.msu.cme.rdp.kmer.cli.KmerCoverage.main(KmerCoverage.java:388)

        at edu.msu.cme.rdp.kmer.cli.Main.main(Main.java:48)

kmer_coverage failed

This was solved by running with OpenJDK 1.8.0 rather than 1.7.0.

I wanted to make my own reference genes. Fortunately, for some there were already models built at fungene, so I downloaded the .seeds and .hmm files. For the ones without, I found TIGRfam models, which should also work. You can download both the HMM and seed files from the JCVI site.

I still needed the framebot.fa and nucl.fa files. I (laboriously) downloaded ~30K protein and nucleotide sequences from fungene (note here that Firefox had trouble with this site - I ended up using Konqueror to do those downloads), and used CD-HIT to make the files non-redundant. I chose to use cutoffs of 95% id for protein and 90% id for nucleotide.

I applied the Xander_patch to hmmer-3.0 (as instructed in the README.md that comes with the Xander distribution), modified the prepare_gene_ref.sh to point to the right directories (make sure you change JAR_DIR, REF_DIR, and hmmer_xanderpatch).

Running prepare_gene_ref.sh yielded the expected forward and reverse hmm files.

One other note about making your own reference genes is that the taxonomy abundance breakdown (the taxonabundance output file) depends on the framebot.fa file having the semicolon-delimited lineage as the description (see one of the provided framebot.fa files to see what I mean). I wrote a perl script that can take a fasta file, parse for the protein accession, retrieve the lineage information from NCBI, format it, and output the properly formatted framebot.fa file.