kwilkins226/TALEffectorClassifier

This directory contains Python scripts that take Target Finder output files and create the corresponding Weka input files for a Naive Bayes machine learning classifier that distinguishes true from false positive Target Finder predictions (Cernadas et al, 2014; Wilkins et al, under review).

# License

All source code is available under an ISC license.

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.

# Description

The script msu_gff_parser.py creates a dictionary from a gff genome annotation file. Like all scripts in this directory, it was designed for the MSU7 rice genome annotation (Kawahara et al, 2013) and may not handle the quirks of other gff file formats.
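As a rough illustration of what building a dictionary from a GFF file can look like, here is a minimal sketch. It is not the code in msu_gff_parser.py: the function name `parse_gff`, the choice to key features by their `Parent` (or `ID`) attribute, and the example records are all assumptions made for this example.

```python
from collections import defaultdict


def parse_gff(lines):
    """Group GFF features by their Parent attribute (or ID if no Parent).

    Hypothetical layout for illustration; the real parser may organize
    its dictionary differently.
    """
    features = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/pragma lines
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # skip malformed records
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
        # Column 9 holds semicolon-separated key=value pairs.
        attr = dict(f.split("=", 1) for f in attrs.split(";") if "=" in f)
        key = attr.get("Parent", attr.get("ID"))
        features[key].append(
            {
                "seqid": seqid,
                "type": ftype,
                "start": int(start),
                "end": int(end),
                "strand": strand,
            }
        )
    return dict(features)


# Made-up records, not taken from the MSU7 annotation.
example = [
    "##gff-version 3",
    "Chr1\texample\tmRNA\t100\t900\t.\t+\t.\tID=geneA.1;Parent=geneA",
    "Chr1\texample\tfive_prime_UTR\t100\t150\t.\t+\t.\tID=utrA;Parent=geneA.1",
]
genes = parse_gff(example)
```

Keying by parent makes it easy to look up every sub-feature of a transcript (for example its 5' UTR) in one step.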

The script generate_msu7_promoters.py uses a FASTA-formatted genome sequence and the corresponding annotation to create a FASTA-formatted file containing all promoters in that genome, where a promoter is defined as the 1000 base pairs upstream of the annotated transcriptional start site plus the 5' UTR, if one is annotated. The script also creates pickled files containing all the genome features used by the machine learning classifier. The script depends on msu_gff_parser.py. The output file prefix must end with the beginning of a file name, not with a directory. The script can be run as follows:

python ./generate_msu7_promoters.py -a annotation_file.gff -g genome_file.fa -o output_file_prefix
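The promoter definition above (1000 bp upstream of the transcriptional start site plus any annotated 5' UTR) can be sketched as follows. This is an illustrative sketch, not the script's implementation: the function names, the 1-based `tss` coordinate, and the decision to reverse-complement minus-strand promoters are assumptions.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def promoter(chrom_seq, tss, strand, utr_len=0, upstream=1000):
    """Extract a promoter in transcript orientation.

    tss is the 1-based position of the transcriptional start site,
    utr_len is the length of the annotated 5' UTR (0 if none).
    """
    if strand == "+":
        start = max(0, tss - 1 - upstream)  # 0-based, inclusive
        end = tss - 1 + utr_len             # 0-based, exclusive
        return chrom_seq[start:end]
    # Minus strand: "upstream" lies at higher genomic coordinates.
    start = max(0, tss - utr_len)
    end = tss + upstream
    return reverse_complement(chrom_seq[start:end])


# Toy chromosome with a short upstream window so the result is easy to check.
chrom = "AAAACCCCGGGGTTTT"
plus_promoter = promoter(chrom, tss=9, strand="+", utr_len=2, upstream=4)
minus_promoter = promoter(chrom, tss=8, strand="-", utr_len=2, upstream=4)
```

In both cases the returned sequence reads upstream region first, then 5' UTR, so downstream motif searches can treat every promoter the same way regardless of strand.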

The script get_msu7_Ypatch_TATABox.py searches a fasta-formatted file of promoter sequences for TATA box and Y patch sequences as defined by Yamamoto et al (2007). The script can be run as follows:

python ./get_msu7_Ypatch_TATABox.py -p promoters.fa -o output_file_prefix
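To give a sense of how such a motif scan might work, the sketch below searches a promoter for two patterns and reports hit positions relative to the transcriptional start site. The regular expressions are placeholders chosen for this example (a TATA-like hexamer and a run of eight or more pyrimidines); they are not the definitions from Yamamoto et al (2007), and `find_motifs` is a hypothetical name.

```python
import re

# Placeholder patterns for illustration only; NOT the published definitions.
TATA_PATTERN = re.compile(r"TATA[AT]A")
YPATCH_PATTERN = re.compile(r"[CT]{8,}")


def find_motifs(seq, tss_index):
    """Return motif hits as (offset_from_tss, matched_sequence) pairs.

    tss_index is the 0-based index of the TSS within seq, so negative
    offsets are upstream of the TSS.
    """
    seq = seq.upper()
    hits = {"TATA": [], "Ypatch": []}
    for name, pattern in (("TATA", TATA_PATTERN), ("Ypatch", YPATCH_PATTERN)):
        for match in pattern.finditer(seq):
            hits[name].append((match.start() - tss_index, match.group()))
    return hits


# Made-up promoter: 3 bp, a TATA-like site, spacer, a pyrimidine run, then the TSS.
promoter_seq = "GGG" + "TATAAA" + "G" * 9 + "CTCTCTTCC" + "GGG"
result = find_motifs(promoter_seq, tss_index=27)
```

Reporting offsets relative to the TSS, rather than raw string indices, keeps hits comparable across promoters of different lengths.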

The script write_weka_commands_and_input_files.py takes as input the tab-delimited text output from the Target Finder TAL effector binding site prediction tool, as well as the information about genomic features output by the previous two scripts. It outputs Weka input files for the Target Finder results and Weka commands for running the classifier on these input files. The script can be run as follows:

python ./write_weka_commands_and_input_files.py -i input_folder_containing_target_finder_output -g output_file_prefix_from_previous_scripts -o output_folder -s number_of_lines_in_target_finder_header -c AllFeaturesPlusIdNB2.model -r file_for_weka_output -w weka.jar
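The general shape of this step (skip the Target Finder header, write an ARFF file, and build a Weka command) might look like the sketch below. Everything specific here is an assumption for illustration: the attribute names, the two-column input, the helper names, and the Weka class and options in `weka_command`. The commands the script actually writes may differ, and Weka options vary between versions.

```python
import csv
import os
import tempfile


def read_predictions(lines, header_lines):
    """Skip the header lines, then split each remaining line on tabs."""
    return [row for row in csv.reader(lines[header_lines:], delimiter="\t") if row]


def write_arff(path, relation, attributes, rows):
    """Write a minimal ARFF file: relation, attribute declarations, data."""
    with open(path, "w") as handle:
        handle.write(f"@relation {relation}\n\n")
        for name, kind in attributes:
            handle.write(f"@attribute {name} {kind}\n")
        handle.write("\n@data\n")
        for row in rows:
            handle.write(",".join(str(value) for value in row) + "\n")


def weka_command(weka_jar, model, arff,
                 classifier="weka.classifiers.bayes.NaiveBayes"):
    """Assumed command form: load a saved model and predict on a test set."""
    return ["java", "-cp", weka_jar, classifier, "-l", model, "-T", arff, "-p", "0"]


# Made-up input with a two-line header (compare the -s option above).
raw = ["header line 1\n", "Name\tScore\n", "geneA\t5.2\n", "geneB\t7.9\n"]
rows = read_predictions(raw, header_lines=2)

# Hypothetical attributes; "?" marks the unknown class value in ARFF.
attributes = [("id", "string"), ("score", "numeric"), ("class", "{true,false}")]
data = [(name, float(score), "?") for name, score in rows]

arff_path = os.path.join(tempfile.mkdtemp(), "predictions.arff")
write_arff(arff_path, "target_finder", attributes, data)
with open(arff_path) as handle:
    arff_text = handle.read()

command = weka_command("weka.jar", "AllFeaturesPlusIdNB2.model", arff_path)
```

Returning the command as a list of arguments makes it straightforward to either join it into a line of a shell script or pass it to `subprocess.run`.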

The file AllFeaturesPlusIdNB2.model is a Weka model file that contains a Naive Bayes classifier for distinguishing true and false positive Target Finder predictions. It is the same classifier created by Cernadas et al (2014), updated to use transcriptional and translational start sites from an annotation file in Wilkins et al (under review).

# References

Cernadas, R.A., Doyle, E.L., Niño-Liu, D.O., Wilkins, K.E., Bancroft, T., Wang, L., Schmidt, C.L., Caldo, R., Yang, B., White, F.F., Nettleton, D., Wise, R.P., and Bogdanove, A.J. (2014). Code-assisted discovery of TAL effector targets in bacterial leaf streak of rice reveals contrast with bacterial blight and a novel susceptibility gene. PLoS Pathogens 10, 1-24. doi: 10.1371/journal.ppat.1003972.

Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten I. (2009) The WEKA data mining software: an update. SIGKDD Explor Newsl 11, 10–18. doi: 10.1145/1656274.1656278

Kawahara, Y., De La Bastide, M., Hamilton, J., Kanamori, H., Mccombie, W.R., Ouyang, S., Schwartz, D., Tanaka, T., Wu, J., Zhou, S., Childs, K., Davidson, R., Lin, H., Quesada-Ocampo, L., Vaillancourt, B., Sakai, H., Lee, S.S., Kim, J., Numa, H., Itoh, T., Buell, C.R., and Matsumoto, T. (2013). Improvement of the Oryza sativa Nipponbare reference genome using next generation sequence and optical map data. Rice 6, 4. doi: 10.1186/1939-8433-6-4.

Wilkins K, Booher N, Wang L, and Bogdanove A. (under review) TAL effector content and host transcriptional response across diverse strains of the rice bacterial leaf streak pathogen Xanthomonas oryzae pv. oryzicola.

Yamamoto YY, Ichida H, Matsui M, Obokata J, Sakurai T, Satou M, Seki M, Shinozaki K, and Abe T. (2007) Identification of plant promoter constituents by analysis of local distribution of short sequences. BMC Genomics 8, 67. doi: 10.1186/1471-2164-8-67
