First I searched for Archaeal RplB sequences in NCBI. The more comprehensive and specific your search, the better, so I used:
(rplB[gene] OR "ribosomal protein L2"[title]) AND "Archaea"[Organism]to search the Protein database and got 744 hits. Those were downloaded as a fasta file.
Then I simplified the sequence set for the purposes of building the hmm using cd-hit
cd-hit -c 0.8 -i NCBI.rplB_arch.pep -o NCBI.rplB_arch.nr.pep
The non-redundant set was aligned using muscle, screened for truncated sequences, trimmed for gappy (>50% gap) columns, and made non-redundant (again) to 80% identity. I use belvu for all this (part of Sanger's SeqTools package). This should help build a good model.
This left me with 162 sequences. The alignment was saved in Stockholm format (alas, belvu does not write a correct Stockholm format, and thus the file must be edited) and an hmm was built using HMMer v3.0 (if you build it with 3.1, prepare_gene.sh won't work later).
hmmbuild -n rplB_arch rplB_arch.hmm rplB_arch.sto
The new HMM was tested by searching against a combined database of the rplB framebot file and the rplB_arch nr file to see how it scores against Archaeal and Bacterial sequences
hmmsearch --cpu 16 -E 1e-5 --tblout rplB_arch.uniprot.tab -o rplB_arch.uniprot.hmmsearch --acc --noali rplB_arch.hmm /scripts/db/uniprot/uniref90.fasta
Comparison of hmmer bit scores. Y-axis is bit score. Note sequences score well either against rplB or rplB_arch. |
The scores suggest a trusted cutoff (lowest score that excludes known false positives) of 300 and a noise cutoff (lowest score that includes true positives) of 200. These can be added to the model file, but isn't necessary.
In addition to the HMM, Xander also requires a nucleotide fasta file to use as a reference for chimera checking, and a protein fasta file used to help identify the beginning of the protein. Obviously, for these purposes, you want the files to be comprehensive and only contain full-length sequences. I wrote a program that will take a fasta file and generate a nucl.fa and framebot.fa file, drawing the nucleotide sequences from ENA using their REST interface, and drawing the taxonomy information from NCBI using EUtilities. This program is not robust, but serves my purposes for now. [As an aside, why is it impossible to get nucleotide sequence from a protein accession from NCBI? There appears to literally be no way to do it.]
[ !--- UPDATE ---!]
I think the most difficult part of this process can be generating the framebot.fa and nucl.fa files depending on what databases you are trying to draw references from. My current approach is to build the HMM based on the well-curated sequences from Uniprot, then search the HMM against RefSeq, screen for those that score above the determined trusted cutoff, and then use the protein ids to pull the protein sequences from the RefSeq database (using fastacmd or blastdbcmd),
fastacmd -d /scripts/db/NCBI/nrRefSeq_Prok.faa -i good.cNorB.ids -o good.cNorB.faa
and get to the NA sequence and taxonomy using eutils (using first elink from protein to nuccore, and then efetch from nuccore, and elink from protein to taxonomy and efetch from taxonomy, respectively). This part has been coded into build_nucl_framebot.pl, available on my github site.
[ !------------!]
[!---UPDATE 2---!]
I ran into an interesting error today. For one model, the framebot step was failing with an error that looked like:
Exception in thread "main" java.lang.IllegalArgumentException: Cannot score V, O
at edu.msu.cme.rdp.alignment.pairwise.ScoringMatrix.score(ScoringMatrix.java:180)
at edu.msu.cme.rdp.framebot.core.FramebotCore.computeMatrix(FramebotCore.java:81)
at edu.msu.cme.rdp.framebot.core.FramebotCore.processSequence(FramebotCore.java:67)
at edu.msu.cme.rdp.framebot.cli.FramebotMain.framebotItUp_prefilter(FramebotMain.java:165)
at edu.msu.cme.rdp.framebot.cli.FramebotMain.main(FramebotMain.java:496)
at edu.msu.cme.rdp.framebot.cli.Main.main(Main.java:50)
It turns out this was due to a sequence in the framebot file that contained an illegal character (an 'O'). Repairing that sequence fixed the problem.
[ !------------!]
The last thing to do is run prepare_gene_ref.sh. Two things to note:
- This process requires the 'patched' version of hmmer3.0. Follow the instructions in the README.md file (they're way at the bottom and very simple).
- prepare_gene_ref.sh also needs to be altered to fit your environment. Change the JAR_DIR, REF_DIR and hmmer_xanderpatch paths to suit your setup.
- The hmm file must be named [name].hmm; the seed file must be named [name].seeds; the nucleotide file must be named nucl.fa; the framebot file must be named framebot.fa.
Once this is run, take a look at the ref_aligned.faa file that is written to the parent directory. If some sequences are not aligned properly, they may not have good representatives in the .seeds file (and thus the HMM).
That should be it. As long as you have done all this within the Xander subdirectory structure, you should be able to use your new model to search metagenomes.