First the pre-requisites:
When installing RDPTools, I had a problem in that, for some reason, when I tried to do the 'submodule init' step, TaxonomyTree was getting a 'connection refused' error. I didn't notice at first, and this was causing make to fail when it hit Clustering. I ended up solving this by simply doing a git clone of TaxonomyTree within the RDPTools directory. After that, make ran smoothly.
I also had to install uchime (which is interesting since Bob Edgar, its author, now considers it obsolete). There was some issue with making the source, so I just downloaded the precompiled binary, and it works for me.
hmmer3.1 I already had.
I updated to openJDK 1.8 just for fun.
Python 2.7 was also already installed.
Now the main course.
I cloned the Xander project from Github, and, following the instructions in the README, edited the xander_setenv.sh to reflect what I wanted to do. My sequence file is 30G, so I set the FILTER_SIZE to 38 and the MAX_JVM_HEAP to 100G (I have 256G of RAM on my server). I also upped the MIN_COUNT to 5 for this test run.
Lessons from trouble-shooting:
- Make sure you set the JAR_DIR to the RDPTools directory (that's not clear from the comments)
- Leave off the trailing "/"s for directories
- Do not use relative paths! Absolute paths for everything.
I ran the 'build', 'find' and 'search' steps separately to simplify troubleshooting. My first issue was that they left an absolute path in run_xander_skel.sh, so it didn't execute. That was an easy edit.
I got some failures during the 'build' step due to putting a relative path in the my_xander_setenv.sh file for the sequence file. (The run_xander_skel.sh script makes the working subdir and then cd's to it, so if you provide a relative path, it can't find the file.)
I also was seeing some failures due to java running out of heap space. Turns out 100G wasn't enough for my 122M sequences. So we cranked it up to 250G, and it ran. I am now concerned about what happens when I try my files that are twice as big...
Again because of relative paths in my setenv file, the last part of the 'find' step failed and dumped all results into a single uniq_starts.txt file in the k45 subdir. That caused the 'search' step to fail, naturally.
So we get to the 'search' step. I ran:
$ /home/bifx/Xander_assembler/bin/run_xander_skel.sh my_xander_setenv.sh "search" "rplB"
### Search contigs rplB
java -Xmx250G -jar /home/bifx/RDPTools/hmmgs.jar search -p 20 1 100 ../k45.bloom /home/bifx/Xander_assembler/gene_resource/rplB/for_enone.hmm /home/bifx/Xander_assembler/gene_resource/rplB/rev_enone.hmm gene_starts.txt 1> stdout.txt 2> stdlog.txt
### Merge contigs
java -Xmx250G -jar /home/bifx/RDPTools/hmmgs.jar merge -a -o merge_stdout.txt -s C3 -b 50 --min-length 150 /home/bifx/Xander_assembler/gene_resource/rplB/for_enone.hmm stdout.txt gene_starts.txt_nucl.fasta
Read in 625 contigs, wrote out 0 merged contigs in 0.577s
Things seem to have run okay, but the stdlog.txt has almost 50K warnings like:
Warning: kmer not in bloomfilter: aacaagcgcacgaacagcatgatcgttcagcgtcgtcacaaacga
Why would these kmers not be in the bloom filter? Is it the MIN_COUNT setting? I contacted the developer, and she suggested dropping the MIN_COUNT to 2 (I didn't think this data set would suffer from low coverage, but apparently it does). This got me through the merge step only to choke at the kmer_coverage calculation. A java error:
Exception in thread "main" java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
at edu.msu.cme.rdp.kmer.cli.KmerCoverage.adjustCount(KmerCoverage.java:233)
at edu.msu.cme.rdp.kmer.cli.KmerCoverage.printCovereage(KmerCoverage.java:249)
at edu.msu.cme.rdp.kmer.cli.KmerCoverage.main(KmerCoverage.java:388)
at edu.msu.cme.rdp.kmer.cli.Main.main(Main.java:48)
kmer_coverage failed
This was solved by running with OpenJDK 1.8.0 rather than 1.7.0.
I wanted to make my own reference genes. Fortunately, for some there were already models built at fungene, so I downloaded the .seeds and .hmm files. For the ones without, I found TIGRfam models, which should also work. You can download both the HMM and seed files from the JCVI site.
I still needed the framebot.fa and nucl.fa files. I (laboriously) downloaded ~30K protein and nucleotide sequences from fungene (note here that Firefox had trouble with this site - I ended up using Konqueror to do those downloads), and used CD-HIT to make the files non-redundant. I chose to use cutoffs of 95% id for protein and 90% id for nucleotide.
I applied the Xander_patch to hmmer-3.0 (as instructed in the README.md that comes with the Xander distribution), modified the prepare_gene_ref.sh to point to the right directories (make sure you change JAR_DIR, REF_DIR, and hmmer_xanderpatch).
Running prepare_gene_ref.sh yielded the expected forward and reverse hmm files.
One other note about making your own reference genes is that the taxonomy abundance breakdown (the taxonabundance output file) depends on the framebot.fa file having the semicolon-delimited lineage as the description (see one of the provided framebot.fa files to see what I mean). I wrote a perl script that can take a fasta file, parse for the protein accession, retrieve the lineage information from NCBI, format it, and output the properly formatted framebot.fa file.
6 comments:
Hello,
I am trying to run the Xander testdata before I use my actual data, and am having some of the same issues as yourself. I am configuring the my_xander_setenv.sh file with the following specifications, leaving out the trailing /'s for directories.
SEQFILE=/Users/home/testdata/test_reads.fa
WORKDIR=/Users/home/Xander_assembler-master/testdata
REF_DIR=/Users/home/Xander_assembler-master/testdata
JAR_DIR=/Users/home/anaconda/pkgs/rdp-tools-2.0.2-0/bin
UCHIME=/macqiime/bin//uchime
HMMALIGN=/Users/home/hmmer-3.1b2-macosx-intel/bin/hmmalign
I have tried setting JAR_DIR to both /bin and /bin/lib, which contain different executables (lib contains KmerFilter, and bin contains hmmgs.jar). I also had to edit the run_xander_skel.sh file to change the BASEDIR to (BASEDIR=/Users/home/Xander_assembler-master/bin).
When I run the script, I get the following:
$>../bin/run_xander_skel.sh my_xander_setenv.sh "build find search" "nifH nirK rplB nosZ"
+ source my_xander_setenv.sh
++ SEQFILE=/Users/home/testdata/test_reads.fa
++ WORKDIR=/Users/home/Xander_assembler-master/testdata
++ REF_DIR=/Users/home/Xander_assembler-master/testdata
++ JAR_DIR=/Users/home/anaconda/pkgs/rdp-tools-2.0.2-0/bin/lib/KmerFilter
++ UCHIME=/macqiime/bin//uchime
++ HMMALIGN=/Users/home/hmmer-3.1b2-macosx-intel/bin/hmmalign
++ SAMPLE_SHORTNAME=test
++ K_SIZE=45
++ FILTER_SIZE=32
++ MAX_JVM_HEAP=2G
++ MIN_COUNT=2
++ THREADS=4
++ PRUNE=20
++ PATHS=1
++ LIMIT_IN_SECS=100
++ MIN_BITS=50
++ MIN_LENGTH=150
++ DIST_CUTOFF=0.01
++ NAME=k45
+ mkdir -p /Users/home/Xander_assembler-master/testdata/k45
+ cd /Users/home/Xander_assembler-master/testdata/k45
+ '[' -f k45.bloom ']'
+ echo '### Build bloom filter'
### Build bloom filter
+ java -Xmx2G -jar /Users/home/anaconda/pkgs/rdp-tools-2.0.2-0/bin/hmmgs.jar build /Users/home/testdata/test_reads.fa k45.bloom 45 32 2 4 30
+ echo 'build bloom filter failed'
build bloom filter failed
+ exit 1
It seems like the bloom build function is failing for the testdata, and I can't seem to figure out what's wrong with it. I am working on a Mac OS X system. Any help or advice would be most appreciated. Thanks!
I have the exact same problem - all the dependencies are installed, but I am getting the same build bloom filter error message with the test data. I also had to change the run_xander_skel.sh to an absolute PATH. Were you ever able to solve this?
Hey Karen,
I was not able to solve this problem so far. If you do come up with a solution, though, please share!
Best, --Matt
Hi Matt,
Have you tried?
JAR_DIR=/Users/home/anaconda/pkgs/rdp-tools-2.0.2-0
--Bill
Hey Bill,
I just gave this a try and am still getting the same error message:
+ mkdir -p /mnt/research/rdp/public/RDPTools/Xander_assembler/testdata/k45
+ cd /mnt/research/rdp/public/RDPTools/Xander_assembler/testdata/k45
+ '[' -f k45.bloom ']'
+ echo '### Build bloom filter'
### Build bloom filter
+ java -Xmx2G -jar /mnt/research/rdp/public/RDPTools//hmmgs.jar build /mnt/research/rdp/public/RDPTools/Xander_assembler/testdata/test_reads.fa k45.bloom 45 32 2 4 30
+ echo 'build bloom filter failed'
build bloom filter failed
+ exit 1
Thanks for the tip though. I should note that when I just run bloom I get a similar message (and it seems like others are having the same problem as well).
--Matt
Matt,
Hrm. It looks like there is still a file that has incorrect path information. See how your error message keeps referring to "/mnt/research/rdp/public/RDPTools"? I'm guessing that directory structure doesn't exist on your system.
In the run_xander_skel.sh file, change the line:
BASEDIR=/mnt/research/rdp/public/RDPTools/Xander_assembler/bin
to the correct path for your system (/Users/home/Xander_assembler-master/bin?), and also double-check that you have correct path info in the xander_setenv.sh file.
--Bill
Post a Comment