I was having real problems using NCBI's EUtils to download sequences. The program would only work intermittently: the 'get' commands would sometimes return nothing at all, which was quite frustrating, and I attributed it to the EUtils server.
Turns out the problem was (is) the LWP perl module. When I switched to using a wget system call instead of LWP, everything worked consistently.
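For what it's worth, here is roughly what the swap looks like in practice. This is a minimal sketch, assuming the fetches were done via something like LWP::Simple's get(); the efetch parameters are just an example, and a real script should add retries and respect NCBI's rate limits.

#!/usr/bin/env perl
# Fetch an EUtils efetch URL with a wget system call instead of LWP.
use strict;
use warnings;

my $acc = shift @ARGV or die "usage: $0 <protein accession>\n";
my $url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
        . "?db=protein&id=$acc&rettype=fasta&retmode=text";

# the LWP version that was failing intermittently for me looked something like:
#   use LWP::Simple;
#   my $result = get($url);

# the wget system call that behaves consistently:
my $result = `wget -q -O - '$url'`;
die "empty response from EUtils\n" unless length($result // '');
print $result;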
Of course, this doesn't solve the problem of there still being no good way to get a nucleotide sequence from a protein accession, but we take the little victories where we get them...
Friday, April 14, 2017
Wednesday, August 17, 2016
Xander part II - Building your own models
I touched on this process a little bit in my original post about Xander, but I thought I'd spell out more how I do it.
As an example, we will build a model to complement the rplB model that is packaged with Xander. That model really only recognizes bacterial RplB, but I'm also interested in Archaea, so I'll build a rplB_arch model.
First I searched for Archaeal RplB sequences in NCBI. The more comprehensive and specific your search, the better, so I used:
(rplB[gene] OR "ribosomal protein L2"[title]) AND "Archaea"[Organism]
to search the Protein database. This returned 744 hits, which I downloaded as a fasta file.
Then I simplified the sequence set for the purpose of building the HMM, using cd-hit:
cd-hit -c 0.8 -i NCBI.rplB_arch.pep -o NCBI.rplB_arch.nr.pep
The non-redundant set was aligned using muscle, screened for truncated sequences, trimmed for gappy (>50% gap) columns, and made non-redundant (again) to 80% identity. I use belvu (part of Sanger's SeqTools package) for all of this. This should help build a good model.
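If you would rather script the gap-column trimming than do it interactively in belvu, a minimal sketch might look like this (it assumes an aligned fasta; sequence lines may be wrapped):

#!/usr/bin/env perl
# Drop alignment columns that are more than 50% gaps.
use strict;
use warnings;

my (%seq, @order, $id);
while (<>) {
    chomp;
    if (/^>(\S+)/) { $id = $1; push @order, $id; }
    else           { $seq{$id} .= $_; }
}

my $len  = length $seq{ $order[0] };
my $nseq = scalar @order;
my @keep;
for my $col (0 .. $len - 1) {
    my $gaps = grep { substr($seq{$_}, $col, 1) =~ /[-.]/ } @order;
    push @keep, $col if $gaps / $nseq <= 0.5;   # keep columns that are <=50% gap
}

for my $i (@order) {
    print ">$i\n", join('', map { substr($seq{$i}, $_, 1) } @keep), "\n";
}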
This left me with 162 sequences. The alignment was saved in Stockholm format (alas, belvu does not write a correct Stockholm format, and thus the file must be edited) and an HMM was built using HMMER v3.0 (if you build it with 3.1, prepare_gene.sh won't work later).
hmmbuild -n rplB_arch rplB_arch.hmm rplB_arch.sto
The new HMM was tested by searching against a combined database of the rplB framebot file and the rplB_arch nr file to see how it scores against Archaeal and Bacterial sequences
hmmsearch --cpu 16 -E 1e-5 --tblout rplB_arch.uniprot.tab -o rplB_arch.uniprot.hmmsearch --acc --noali rplB_arch.hmm /scripts/db/uniprot/uniref90.fasta
[Figure: comparison of HMMER bit scores (y-axis: bit score); sequences score well against either rplB or rplB_arch.]
The scores suggest a trusted cutoff (the lowest score that excludes known false positives) of 300 and a noise cutoff (the lowest score that still includes true positives) of 200. These can be added to the model file, but that isn't necessary.
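To see where the hits from a search like the one above fall relative to those cutoffs, something like the following works (a minimal sketch; the sixth column of the --tblout file is the full-sequence bit score, and the cutoff values are the ones determined above):

#!/usr/bin/env perl
# Partition hmmsearch --tblout hits by trusted/noise cutoffs.
use strict;
use warnings;

my $trusted = 300;   # lowest score that excludes known false positives
my $noise   = 200;   # lowest score that still includes true positives

my $tblout = shift @ARGV or die "usage: $0 <hmmsearch tblout file>\n";
open my $in, '<', $tblout or die "Can't open $tblout: $!";
while (my $line = <$in>) {
    next if $line =~ /^#/;              # skip comment lines
    my @f = split /\s+/, $line;
    my ($id, $score) = ($f[0], $f[5]);  # target name, full-sequence bit score
    my $call = $score >= $trusted ? 'trusted'
             : $score >= $noise   ? 'gray zone'
             :                       'noise';
    print join("\t", $id, $score, $call), "\n";
}
close $in;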
In addition to the HMM, Xander also requires a nucleotide fasta file to use as a reference for chimera checking, and a protein fasta file used to help identify the beginning of the protein. Obviously, for these purposes, you want the files to be comprehensive and only contain full-length sequences. I wrote a program that will take a fasta file and generate a nucl.fa and framebot.fa file, drawing the nucleotide sequences from ENA using their REST interface, and drawing the taxonomy information from NCBI using EUtilities. This program is not robust, but serves my purposes for now. [As an aside, why is it impossible to get nucleotide sequence from a protein accession from NCBI? There appears to literally be no way to do it.]
[ !--- UPDATE ---!]
Depending on which databases you are trying to draw references from, I think the most difficult part of this process can be generating the framebot.fa and nucl.fa files. My current approach is to build the HMM from the well-curated sequences in UniProt, then search the HMM against RefSeq, screen for hits that score above the determined trusted cutoff, and then use the protein IDs to pull the protein sequences from the RefSeq database (using fastacmd or blastdbcmd),
fastacmd -d /scripts/db/NCBI/nrRefSeq_Prok.faa -i good.cNorB.ids -o good.cNorB.faa
and then get the nucleotide sequence and taxonomy using EUtils (elink from protein to nuccore followed by efetch from nuccore for the sequence, and elink from protein to taxonomy followed by efetch from taxonomy for the lineage). This part has been coded into build_nucl_framebot.pl, available on my GitHub site.
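For illustration, the nucleotide half of that chain boils down to something like this (a minimal sketch, not the actual script: it assumes wget is on the PATH, that accession.version identifiers are accepted as the elink id, and it does only crude XML parsing). Note that the linked nuccore record may be a whole genome, in which case you would still need to cut out the CDS region.

#!/usr/bin/env perl
# protein accession -> linked nuccore record -> fasta, via EUtils elink/efetch.
use strict;
use warnings;

my $eutils = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils';

foreach my $acc (@ARGV) {
    # 1. elink: find nuccore record(s) linked to this protein accession
    my $xml = `wget -q -O - '$eutils/elink.fcgi?dbfrom=protein&db=nuccore&id=$acc'`;
    # crude parse: take the first <Id> in the first <LinkSetDb> block
    my ($linkset) = $xml =~ m{<LinkSetDb>(.*?)</LinkSetDb>}s;
    unless ($linkset) { warn "no nuccore link found for $acc\n"; next; }
    my ($nuc_id) = $linkset =~ m{<Id>(\d+)</Id>};

    # 2. efetch: pull the linked nucleotide record as fasta
    print `wget -q -O - '$eutils/efetch.fcgi?db=nuccore&id=$nuc_id&rettype=fasta&retmode=text'`;

    sleep 1;   # be polite to the EUtils servers
}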
[ !------------!]
[!---UPDATE 2---!]
I ran into an interesting error today. For one model, the framebot step was failing with an error that looked like:
Exception in thread "main" java.lang.IllegalArgumentException: Cannot score V, O
at edu.msu.cme.rdp.alignment.pairwise.ScoringMatrix.score(ScoringMatrix.java:180)
at edu.msu.cme.rdp.framebot.core.FramebotCore.computeMatrix(FramebotCore.java:81)
at edu.msu.cme.rdp.framebot.core.FramebotCore.processSequence(FramebotCore.java:67)
at edu.msu.cme.rdp.framebot.cli.FramebotMain.framebotItUp_prefilter(FramebotMain.java:165)
at edu.msu.cme.rdp.framebot.cli.FramebotMain.main(FramebotMain.java:496)
at edu.msu.cme.rdp.framebot.cli.Main.main(Main.java:50)
It turns out this was due to a sequence in the framebot file that contained an illegal character (an 'O'). Repairing that sequence fixed the problem.
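(The 'O', incidentally, is the one-letter code for pyrrolysine, which the scoring matrix apparently can't handle.) A quick way to screen a framebot file for characters like this, as a minimal sketch (adjust the allowed set to taste):

#!/usr/bin/env perl
# Flag fasta sequences containing characters outside the usual amino acid alphabet.
use strict;
use warnings;

my $allowed = 'ACDEFGHIKLMNPQRSTVWYXBZ';   # standard residues plus common ambiguity codes
my ($id, %bad);
while (<>) {
    chomp;
    if (/^>(\S+)/) { $id = $1; next; }
    for my $char (split //, uc $_) {
        next if $char =~ /[\s*-]/;                       # ignore whitespace, stops, gaps
        $bad{$id}{$char}++ if index($allowed, $char) < 0;
    }
}
for my $i (sort keys %bad) {
    print "$i contains: ", join(' ', sort keys %{ $bad{$i} }), "\n";
}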
[ !------------!]
The last thing to do is run prepare_gene_ref.sh. A few things to note:
- This process requires the 'patched' version of hmmer3.0. Follow the instructions in the README.md file (they're way at the bottom and very simple).
- prepare_gene_ref.sh also needs to be altered to fit your environment. Change the JAR_DIR, REF_DIR and hmmer_xanderpatch paths to suit your setup.
- The hmm file must be named [name].hmm; the seed file must be named [name].seeds; the nucleotide file must be named nucl.fa; the framebot file must be named framebot.fa.
Once this is run, take a look at the ref_aligned.faa file that is written to the parent directory. If some sequences are not aligned properly, they may not have good representatives in the .seeds file (and thus the HMM).
That should be it. As long as you have done all this within the Xander subdirectory structure, you should be able to use your new model to search metagenomes.
Thursday, May 26, 2016
Xander - assembling target genes from metagenomic data
I've acquired a large metagenomic data set in which I wish to find specific genes. Recently, the Michigan State group led by James Cole published a new tool to assemble targeted genes from a metagenome called Xander. Let's try it out.
First the pre-requisites:
When installing RDPTools, the 'submodule init' step was, for some reason, getting a 'connection refused' error for TaxonomyTree. I didn't notice at first, and this caused make to fail when it hit Clustering. I ended up solving it by simply doing a git clone of TaxonomyTree within the RDPTools directory. After that, make ran smoothly.
I also had to install uchime (which is interesting, since Bob Edgar, its author, now considers it obsolete). There was some issue with building it from source, so I just downloaded the precompiled binary, which works for me.
hmmer3.1 I already had.
I updated to openJDK 1.8 just for fun.
Python 2.7 was also already installed.
Now the main course.
I cloned the Xander project from Github, and, following the instructions in the README, edited the xander_setenv.sh to reflect what I wanted to do. My sequence file is 30G, so I set the FILTER_SIZE to 38 and the MAX_JVM_HEAP to 100G (I have 256G of RAM on my server). I also upped the MIN_COUNT to 5 for this test run.
Lessons from trouble-shooting:
- Make sure you set the JAR_DIR to the RDPTools directory (that's not clear from the comments)
- Leave off the trailing "/"s for directories
- Do not use relative paths! Absolute paths for everything.
I ran the 'build', 'find' and 'search' steps separately to simplify troubleshooting. My first issue was that run_xander_skel.sh ships with an absolute path left in it, so it didn't execute. That was an easy edit.
I got some failures during the 'build' step due to putting a relative path in the my_xander_setenv.sh file for the sequence file. (The run_xander_skel.sh script makes the working subdir and then cd's to it, so if you provide a relative path, it can't find the file.)
I also was seeing some failures due to java running out of heap space. Turns out 100G wasn't enough for my 122M sequences. So we cranked it up to 250G, and it ran. I am now concerned about what happens when I try my files that are twice as big...
Again because of relative paths in my setenv file, the last part of the 'find' step failed and dumped all results into a single uniq_starts.txt file in the k45 subdir. That caused the 'search' step to fail, naturally.
So we get to the 'search' step. I ran:
$ /home/bifx/Xander_assembler/bin/run_xander_skel.sh my_xander_setenv.sh "search" "rplB"
### Search contigs rplB
java -Xmx250G -jar /home/bifx/RDPTools/hmmgs.jar search -p 20 1 100 ../k45.bloom /home/bifx/Xander_assembler/gene_resource/rplB/for_enone.hmm /home/bifx/Xander_assembler/gene_resource/rplB/rev_enone.hmm gene_starts.txt 1> stdout.txt 2> stdlog.txt
### Merge contigs
java -Xmx250G -jar /home/bifx/RDPTools/hmmgs.jar merge -a -o merge_stdout.txt -s C3 -b 50 --min-length 150 /home/bifx/Xander_assembler/gene_resource/rplB/for_enone.hmm stdout.txt gene_starts.txt_nucl.fasta
Read in 625 contigs, wrote out 0 merged contigs in 0.577s
Things seem to have run okay, but the stdlog.txt has almost 50K warnings like:
Warning: kmer not in bloomfilter: aacaagcgcacgaacagcatgatcgttcagcgtcgtcacaaacga
Why would these kmers not be in the bloom filter? Is it the MIN_COUNT setting? I contacted the developer, and she suggested dropping the MIN_COUNT to 2 (I didn't think this data set would suffer from low coverage, but apparently it does). This got me through the merge step only to choke at the kmer_coverage calculation. A java error:
Exception in thread "main" java.lang.NoSuchMethodError: java.util.concurrent.ConcurrentHashMap.keySet()Ljava/util/concurrent/ConcurrentHashMap$KeySetView;
at edu.msu.cme.rdp.kmer.cli.KmerCoverage.adjustCount(KmerCoverage.java:233)
at edu.msu.cme.rdp.kmer.cli.KmerCoverage.printCovereage(KmerCoverage.java:249)
at edu.msu.cme.rdp.kmer.cli.KmerCoverage.main(KmerCoverage.java:388)
at edu.msu.cme.rdp.kmer.cli.Main.main(Main.java:48)
kmer_coverage failed
This was solved by running with OpenJDK 1.8.0 rather than 1.7.0.
I wanted to make my own reference genes. Fortunately, for some genes there were already models built at FunGene, so I downloaded the .seeds and .hmm files. For the ones without, I found TIGRFAM models, which should also work. You can download both the HMM and seed files from the JCVI site.
I still needed the framebot.fa and nucl.fa files. I (laboriously) downloaded ~30K protein and nucleotide sequences from fungene (note here that Firefox had trouble with this site - I ended up using Konqueror to do those downloads), and used CD-HIT to make the files non-redundant. I chose to use cutoffs of 95% id for protein and 90% id for nucleotide.
I applied the Xander_patch to hmmer-3.0 (as instructed in the README.md that comes with the Xander distribution), modified the prepare_gene_ref.sh to point to the right directories (make sure you change JAR_DIR, REF_DIR, and hmmer_xanderpatch).
Running prepare_gene_ref.sh yielded the expected forward and reverse hmm files.
One other note about making your own reference genes is that the taxonomy abundance breakdown (the taxonabundance output file) depends on the framebot.fa file having the semicolon-delimited lineage as the description (see one of the provided framebot.fa files to see what I mean). I wrote a perl script that can take a fasta file, parse for the protein accession, retrieve the lineage information from NCBI, format it, and output the properly formatted framebot.fa file.
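The skeleton of that script is roughly the following (a sketch only: it assumes accession.version works as the elink id and that the fasta headers start with the accession; check one of the provided framebot.fa files for the exact header layout Xander expects):

#!/usr/bin/env perl
# For each protein accession in a fasta file, look up the NCBI taxonomy lineage
# and write it as the sequence description.
use strict;
use warnings;

my $eutils = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils';

my ($acc, %seq, @order);
while (<>) {
    chomp;
    if (/^>(\S+)/) { $acc = $1; push @order, $acc; }
    else           { $seq{$acc} .= $_; }
}

foreach my $a (@order) {
    # protein -> taxonomy via elink, then pull <Lineage> from the taxonomy record
    my $xml       = `wget -q -O - '$eutils/elink.fcgi?dbfrom=protein&db=taxonomy&id=$a'`;
    my ($taxid)   = $xml =~ m{<LinkSetDb>.*?<Id>(\d+)</Id>}s;
    my $tax_xml   = $taxid ? `wget -q -O - '$eutils/efetch.fcgi?db=taxonomy&id=$taxid'` : '';
    my ($lineage) = $tax_xml =~ m{<Lineage>(.*?)</Lineage>}s;
    $lineage ||= 'unknown';
    print ">$a $lineage\n$seq{$a}\n";
    sleep 1;   # be polite to the EUtils servers
}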
Thursday, May 7, 2015
Installing JCVI's GenomeProperties locally
One might wonder why I'm attempting this at all. Well, I worked at TIGR with Dan Haft for many years, and I've carried the database structures with me through my career. So I've got a leg up on your average mortal in terms of a) having and b) understanding the database structure that underlies the GP system.
That being said, I didn't anticipate this being easy.
First, I built a database called GenProp in MySQL with the relevant table structure, and populated it with data from the FTP site. (The GP v3.1 release notes are pretty easy to follow.) I suggest doing this with SQLite, because the program is written to operate off an SQLite db. Using something else requires some code alteration.
Despite the README claiming there are no prerequisites, there are quite a few perl modules you need to have installed for the system to build and operate. To wit:
- Class::Accessor
- Config::IniFiles
- TimeDate
- Module::Pluggable
- Readonly
- Exception::Class
- DBD::SQLite
I unpacked the software and attempted the Module::Build build ('perl Build.PL', then './Build'). The build went okay after I got the requisite modules installed, but './Build test' resulted in many failures, looking like:
$ ./Build test
t/00.load.t ....... 1/12
# Failed test 'use GenProp::Genome::Property;'
# at t/00.load.t line 9.
# Tried to use 'GenProp::Genome::Property'.
# Failed test 'use GenProp::Genome::Evidence;'
# at t/00.load.t line 10.
# Tried to use 'GenProp::Genome::Evidence'.
# Failed test 'use GenProp::Genome::Cluster;'
# at t/00.load.t line 11.
# Tried to use 'GenProp::Genome::Cluster'.
# Failed test 'use GenProp::Model;'
# at t/00.load.t line 12.
# Tried to use 'GenProp::Model'.
# Failed test 'use GenProp::Model::Property;'
# at t/00.load.t line 13.
# Tried to use 'GenProp::Model::Property'.
# Failed test 'use GenProp::Model::PropLink;'
# at t/00.load.t line 14.
# Tried to use 'GenProp::Model::PropLink'.
# Failed test 'use GenProp::Model::Evidence;'
# at t/00.load.t line 15.
# Tried to use 'GenProp::Model::Evidence'.
# Failed test 'use GenProp::Config;'
# at t/01.config.t line 9.
# Tried to use 'GenProp::Config'.
Not surprisingly, the GenProp module distributed with v3.1 does not reflect the directory structure listed in the messages. This appears to be legacy that wasn't cleaned up in the distro.
Pressing on...
The meat of the system is evaluate_props.pl (in the scripts subdir). This is what runs the rules against a genome/HMM hits database and stores the results in the GenProp database. The program can run off GFF3-format files, rather than a TIGR-style small genome database schema, which is convenient for those who aren't running a TIGR-style small genome database (i.e., everybody).
The description of how to encode the required information in the GFF file can be found essentially in two places: 1) the GFF3 format definition at Sequence Ontology, and 2) the htab2gff.pl program distributed with the other GenProp code (specifically for the attributes needed for HMM hit entries). One important detail is that the HMM hit records in the GFF file should have HMM_hit in the 'type' field (field 3) (rather than the GFF3-mandated protein_hmm_match - so, yes, these GFF3 files won't validate).
Minimally, you need to make sure each HMM_hit record is linked by a Parent attribute to a CDS, which in turn is linked by a Parent attribute to a gene.
Example:
ITZX_scaf_103 PNNL gene 14974 16209 . - . ID=gene16660;locus_tag=HLSNC01_00224;bin=v3_bin01;alias=ITZX_scaf_103_13,6666666.44063.peg.1126
ITZX_scaf_103 PNNL CDS 14974 16209 . - 0 ID=CDS16660;Parent=gene16660;feat_name=HLSNC01_00224;locus_tag=HLSNC01_00224;bin=v3_bin01;product=amidase%2C hydantoinase/carbamoylase family;ec_number=3.5.1.6
ITZX_scaf_103 PNNL HMM_hit 15000 16191 419.1 - . ID=TIGR01879.HLSNC01_00224;Parent=CDS16660;Target=TIGR01879 7 404;trusted_cutoff=204.00;noise_cutoff=143.85
I wrote a script that dumps my database schema to this gff format. One can also use the htab2gff.pl to combine an existing genome GFF3 file and hmmer search results (in htab format) into a properly formatted file. [The caveat here is that the trusted and noise cutoffs must be encoded in the htab file, or in a database that you would have to configure the program to contact.]
[This brings up another aside: What is an htab file, you ask? It's just hmmer search results (either hmmsearch or hmmscan) in a tab-delimited format. The conversion can be accomplished with this program, although it may need some modification to work with newer versions of hmmer.]
I had to change the GenProp.ini file (in the conf subdir) to point to my TIGR-style GenProp database, and provide login information.
My program incantation then looked like:
/PATH/TO/GenomeProperties/scripts/evaluate_props.pl -d HLUCCbin01:myShare/v3_bin01.gff -a ALL
And output looks like:
HLUCCbin01 GenProp0037: State some evidence Val 0.666666666666667 Assigned-by RULES
HLUCCbin01 GenProp0038: State some evidence Val 0.5 Assigned-by RULES
HLUCCbin01 GenProp0039: State none found Val 0 Assigned-by RULES
HLUCCbin01 GenProp0046: State YES Val 1 Assigned-by RULES
HLUCCbin01 GenProp0047: State none found Val 0 Assigned-by RULES
As you can see, the output very helpfully does not include a description of the property, just the state. I've actually fixed this in my version of the program.
The evaluate_props.pl program updates the 'property' and 'step_feat_link' tables in your GenProp database if you are using a database as the data source. If you are using GFF input, the GenProp database is not updated.
Again, if you used a database as your data source, you can use the prop_report.pl program to list which features fill which steps of the property. Not having done this, I instead wrote flatfile_prop_report.pl, which takes the GFF output from evaluate_props.pl as input:
$ perl /share/scripts/GenomeProperties/scripts/flatfile_prop_report.pl -f for_GP.gff -a GenProp0061
...and Bob's your uncle. Now you can, from flatfile input, run and view GenomeProperties. A small amount of clever engineering can connect these scripts to your favorite relational database scheme. The output looks like:
-----------------------------------------------------------------------------------
id accession type property name
------- --------------- ------- ---------------------------------------------------
2043 GenProp0061 SYSTEM lipoprotein system lgt/lsp/lnt
-----------------------
GO_id get_GO
--------------- ------
GO:0006497 1
GO:0042158 1
-----------------------------------------------------------------------------------
property definition:
-----------------------------------------------------------------------------------
YES
NO
Some membrane proteins are anchored to the membrane by a lipid moiety
attached to a modified N-terminus rather than by a transmembrane helix
. The characteristic enzymes of this modification system are
lsp: prolipoprotein signal peptidase
lgt: prolipoprotein diacylglycerol transferase
lnt: apolipoprotein N-acyltransferase
The modifications are cleavage (by Lsp) of the lipoprotein signal
sequence at a motif resembling L-X-G/A-C to leave Cys as the new N-
terminal residue, then modification of the Cys sulfhydryl by
prolipoprotein diacylglyceryl transferase (Lgt), then acylation of the
Cys NH2 by apolipoprotein N-acyltransferase (Lnt).
-----------------------------------------------------------------------------------
parent properties id accession
--------------------------------------------------------------- ------- -----------
protein modification, prosthetic groups and cofactors 4016 GenProp0075
-----------------------------------------------------------------------------------
stepnum in_rule step_name
------- ------- -------------------------------------------------------------------
lgt 1 prolipoprotein diacylglyceryl transferase
lnt 0 apolipoprotein N-acyltransferase
lsp 1 prolipoprotein signal peptidase
-----------------------------------------------------------------------------------
step_num get_GO method query
-------- ------ --------------- -----------------------
lgt (prolipoprotein diacylglyceryl transferase)
1 HMM TIGR00544 scaffold_26_122
lnt (apolipoprotein N-acyltransferase)
1 HMM TIGR00546 scaffold_0_365
lsp (prolipoprotein signal peptidase)
1 HMM TIGR00077 scaffold_30_198
Wednesday, May 6, 2015
NCBI-22
While registering a BioProject:
If you have not registered your sample, please register at BioSample. At the end of that process, you will be returned to this submission.
So I follow the link and start setting up my BioSample until I get to the page which says:
If you have not registered your project, please register at BioProject. At the end of that process, you will be returned to this submission.
I tried to talk to my colleague Dr. Doctor Doctor about this, but he was out, despite his secretary telling me he was in.
Monday, September 15, 2014
Firefox/Seamonkey/IceCat and the "Couldn't load XPCOM" error
So I ran into this error upon installing Seamonkey v2.29 in order to do some compatibility testing. This was particularly annoying because I have a version of Firefox that is running just fine.
I had downloaded the Seamonkey 64-bit contributed build, unpacked it (tar/bzip2) into my /usr/lib64 directory, and then attempted to launch, only to see the dreaded:
Couldn't load XPCOM.
A search on the web led to many suggestions of re-installing, which I considered to be using a machete when a scalpel should suffice.
A little more digging suggested that the problem is whether the program links against the correct version of xulrunner. The seamonkey subdir does contain its own libxul.so and run-mozilla.sh, but apparently those are not interfacing well with the rest of the system.
A peek inside /usr/lib64/xulrunner showed that it, too, has a version of run-mozilla.sh, so I simply executed:
$ /usr/lib64/xulrunner/run-mozilla.sh /usr/lib64/seamonkey/seamonkey
and problem solved.
This, I'm sure, is an entirely unsophisticated description and understanding of what's going on, but the solution works (at least for me on my system).
Tuesday, February 18, 2014
Getting AMPHORA2 to work with BioPerl 1.6.9
AMPHORA2 (like its predecessor AMPHORA) is a great tool by Martin Wu and Alexandra Scott that takes you directly from genomic or metagenomic sequence to phylogeny through identification and analysis of conserved, single-copy phylogenetic marker genes. I have found it to be super handy for assessing metagenomic assemblies, both during and after the binning process. There is one problem, however, that may be preventing some people from using it.
[Image: 'Known issues']
There's a note in the AMPHORA2 README files that states:
KNOWN ISSUES
Bioperl
AMPHORA2 has been tested on Bioperl 1.6.1. People have reported problems running AMPHORA2 on Bioperl 1.6.9. For example, the following error has been reported when running Phylotyping.pl:
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: parse error: expected ; or ) or ,
STACK: Error::throw
STACK: Bio::Root::Root::throw
/usr/local/share/perl/5.14.2/Bio/Root/Root.pm:472
STACK: Bio::TreeIO::NewickParser::parse_newick
/usr/local/share/perl/5.14.2/Bio/TreeIO/NewickParser.pm:195
STACK: Bio::TreeIO::newick::next_tree
/usr/local/share/perl/5.14.2/Bio/TreeIO/newick.pm:143
STACK: main::assign_phylotype
/usr/local/Bioinf/AMPHORA2/Scripts/Phylotyping.pl:154
STACK: /usr/local/Bioinf/AMPHORA2/Scripts/Phylotyping.pl:70
Downgrading Bioperl from 1.6.9 to 1.6.1 solves the problem.
The problem lies in the fact that AMPHORA2 uses RAxML to build trees, and the tree files produced by RAxML don't follow the Newick/New Hampshire or NHX format. To wit, RAxML puts edge labels in brackets after the branch length, e.g.:
((REF-YP_004290029-NC_015216:0.23840847960367719804[I116],REF-YP_004520254-NC_015574:0.18701622612251170286[I117]):0.03268127637536134139[I115],REF-YP_447902-NC_007681:0.32812155226299949407[I118])
[Image caption: Newick!]
(Okay, okay, there is this page that may allow for the format they use, but it certainly isn't generally agreed upon in the bioinformatics community, as evidenced by the fact that...)
The BioPerl module that parses Newick trees (Bio::TreeIO::NewickParser.pm) only recognizes brackets in the context of the NHX format (i.e., [&&NHX:tag1=value1]), and thus fails and bails when it hits these tags. Apparently, in version 1.6.1 of BioPerl, NewickParser.pm would stash these values in a 'bootstrap' tag, and AMPHORA2 takes advantage of this placement (or misplacement, as the case may be) to track taxonomy throughout the tree.
Note that in the BioPerl HOWTO:Trees, it states that
A newly added capability (after 1.5.2) allows you to specify that the internal nodes id encode bootstrap values instead of IDs. Provide the
-internal_node_id => 'bootstrap'
to Bio::TreeIO and it will automatically move the ids over to the bootstrap slot.
This, however, does not solve the AMPHORA2 issue.
The solution suggested by the Wu lab (downgrade to BioPerl 1.6.1) is unsatisfying on several levels, so I came up with the following hack for NewickParser.pm:
} elsif ($state == 4) { # optional NHX tags
    if($token =~ /\[\&\&NHX/) {
        # careful: this regexp gets rid of all NHX wrapping in one step
        $self->_start('nhx_tag');
        $token =~ /\[\&\&NHX\:(\S+)\]/;
        if ($1) {
            # NHX may be empty, presumably at end of file, just before ";"
            my @attributes = split ':', $1;
            foreach my $attribute (@attributes) {
                $attribute =~ s/\s+//;
                my($key,$value) = split '=', $attribute;
                $self->_start('tag_name');
                $self->_chars($key);
                $self->_end('tag_name');
                $self->_start('tag_value');
                $self->_chars($value);
                $self->_end('tag_value');
            }
        }
        $self->_end('nhx_tag');
        $token = next_token(\$newick, ",);"); #move to , or )
    } elsif ($token =~ /\[/) {
        # This is a hack to make AMPHORA2 work
        if ($token =~ /\[(\S+)\]/) {
            $self->_start('bootstrap');
            $self->_chars($1);
            $self->_end('bootstrap');
        }
        $token = next_token(\$newick, ",);");
    }
    $state = 5;
} elsif ($state == 5) { # end node
(I shouldn't have to say this next bit, but I will anyway...)
Adding this code may cause unexpected errors in other use cases! Be careful!
Thank you, Martin, for writing this tool!