| 1 | [[PageOutline]] |
| 2 | = General notes on data conversion, conversion tools, and formats. = |
| 3 | |
| 4 | It's very easy. You can do global replacements on the genotype lines. Do not use Excel. I always recommend using the vim text editor. |
| 5 | |
| 6 | ''Anonymous'', this is actually not a good strategy for data conversion. |
| 7 | == Formats == |
| 8 | |
| 9 | === Hapmap === |
| 10 | This is a very popular format for very large projects. It is less popular with the phylogeography crowd, and lots of tools don't recognize it, so you have to convert. |
| 11 | |
| 12 | '''hapMap2Genind''' |
| 13 | * Thanks to [http://openwetware.org/wiki/User:Lindsay_V._Clark Lindsay V Clark] for the [http://openwetware.org/images/e/eb/HapMap2genind.R.txt R script to convert hapmap to genind, the native adegenet format] |
| 14 | * Another converts [http://openwetware.org/images/c/c5/HapMap2genlight.R.txt hapmap to genlight], and there's also one for [http://openwetware.org/images/7/7a/Genind2structure.R.txt genind2structure]. |
| 15 | * It looks like these scripts were written as part of a larger [https://www.ideals.illinois.edu/bitstream/handle/2142/49963/README.txt?sequence=31 project] |
| 16 | * '''NB''': The default hapmap format we get back from Cornell has two columns with hashes in the column name (rs# and assembly#). The hapMap2genind function (actually read.table()) does _not_ like the hashes (they are comment characters). Options: 1) put quotes around these two column names (so they become "rs#" and "assembly#"), or 2) add the argument '''comment.char = ""''' to the read.table() call, e.g. read.table(table, blah, blah, '''comment.char = ""'''). Obviously option 2 will mess with you if there are actual comments in your file ;) |
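Option 1 from the NB above can be scripted instead of done by hand. This is only a minimal sketch in Python, assuming a tab-delimited hapmap file whose first line is the header. The function names and the file names in the usage note are made up for illustration and are not part of the hapMap2genind script.

```python
def quote_hash_columns(header_line, sep="\t"):
    """Wrap every header field that contains '#' in double quotes.

    Fields that are already quoted are left alone, so running this twice
    on the same header is harmless.
    """
    fields = header_line.rstrip("\n").split(sep)
    quoted = [
        '"' + field + '"' if "#" in field and not field.startswith('"') else field
        for field in fields
    ]
    return sep.join(quoted)


def quote_hapmap_header(src_path, dst_path, sep="\t"):
    """Copy a hapmap file, rewriting only the first (header) line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        header = src.readline()
        dst.write(quote_hash_columns(header, sep) + "\n")
        # Genotype lines are copied through unchanged.
        for line in src:
            dst.write(line)
```

Usage would be something like quote_hapmap_header("cornell.hmp.txt", "cornell.quoted.hmp.txt") (hypothetical file names), after which the quoted copy is the one handed to hapMap2genind.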
| 17 | |
| 18 | == Conversions == |
| 19 | === [http://www.maizegenetics.net/#!tassel/c17q9 Tassel] === |
| 20 | [http://www.maizegenetics.net/#!tassel/c17q9 Tassel] is part of the pipeline Cornell uses to process GBS data. It uses hapmap format, which not a lot of tools understand. The Tassel CLI does have simple conversion tools for getting data out of hapmap and into VCF, which is more widely supported. |
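A rough sketch of driving the hapmap to VCF conversion from Python follows. The launcher name (run_pipeline.pl) and the flags (-fork1, -h, -export, -exportType VCF, -runfork1) are written from memory of the Tassel 5 pipeline and have not been tested here, so treat them as assumptions and check them against the Tassel documentation for your installed version. The function names are hypothetical.

```python
import subprocess


def tassel_hapmap_to_vcf_cmd(hapmap_path, out_prefix, launcher="run_pipeline.pl"):
    """Build the argument list for a Tassel hapmap -> VCF export.

    ASSUMPTION: launcher name and flags follow the Tassel 5 CLI as
    remembered; verify against your installation before relying on it.
    """
    return [
        launcher,
        "-fork1",
        "-h", hapmap_path,        # load a hapmap genotype file
        "-export", out_prefix,    # output file name or prefix
        "-exportType", "VCF",     # write VCF instead of hapmap
        "-runfork1",
    ]


def convert_hapmap_to_vcf(hapmap_path, out_prefix, launcher="run_pipeline.pl"):
    """Run the conversion; raises CalledProcessError if Tassel exits non-zero."""
    subprocess.run(
        tassel_hapmap_to_vcf_cmd(hapmap_path, out_prefix, launcher),
        check=True,
    )
```

Building the command as a list (rather than one shell string) avoids quoting problems when file names contain spaces.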
| 21 | |
| 22 | === GATK === |
| 23 | * From the Broad Institute: [https://www.broadinstitute.org/gatk/ GATK] |
| 24 | The Genome Analysis Toolkit or GATK is a software package developed at the Broad Institute to analyze high-throughput sequencing data. |
| 25 | Pretty 'upstream', but it will supposedly convert hapmap to VCF, if we need that. |
| 26 | === Samtools === |
| 27 | Samtools is a suite of programs for interacting with high-throughput sequencing data. Diego recommended this. I have installed it but haven't tested it. It doesn't take hapmap as input either, but it does take VCF. |
| 28 | * http://www.htslib.org/ |
| 29 | |
| 30 | === PGDSpider === |
| 31 | Swiss-army-knife conversion software, written in Java, that reads and writes dozens of formats. Definitely not plug and play: although it does convert stuff, you have to craft an 'answer file' to tell it about the specifics of your data. The GUI helps you do this; the CLI, not so much. It doesn't take hapmap as an input format, though it does take VCF. Here's the [http://www.cmpg.unibe.ch/software/PGDSpider/PGDSpider%20manual_vers%202-0-7-2.pdf manual]. |