wiki:DataFormatsAndConversion

Version 1 (modified by iovercast, 10 years ago)

--

General notes on data conversion, conversion tools, and formats.

It's very easy. You can do global replacements on the genotype lines. Do not use Excel. I always recommend using the vim text editor.

Anonymous, this is actually not a good strategy for data conversion.

Formats

Hapmap

This is a very popular format for very large projects. It is not so popular within the phylogeography crowd, and lots of tools don't recognize it, so you have to convert.

hapMap2Genind

  • Thanks to Lindsay V Clark for the R script to convert hapmap to genind, the native adegenet format.
  • NB: The default hapmap format we get back from Cornell has two columns with hashes in the column name (rs# and assembly#). The hapMap2genind function (actually read.table()) does _not_ like the hashes, because they are comment characters. Options: 1) put quotes around these two column names (so they become "rs#" and "assembly#"), or 2) add the argument comment.char = "" to the read.table() call, e.g. read.table(table, blah, blah, comment.char = ""). Obviously option 2 will mess with you if there are actual comments in your file ;)
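
If you just want to check the header outside R, here is a minimal Python sketch. Python's csv module has no notion of a comment character, so the hashes in rs# and assembly# survive untouched. The sample data and the read_hapmap helper are made up for illustration; the column layout is assumed to be the usual 11 hapmap metadata columns followed by one column per sample.

```python
import csv
import io

# Tiny made-up hapmap fragment (tab-separated); real files have many more rows.
HAPMAP_TEXT = (
    "rs#\talleles\tchrom\tpos\tstrand\tassembly#\tcenter\tprotLSID\t"
    "assayLSID\tpanelLSID\tQCcode\tsampleA\tsampleB\n"
    "S1_100\tA/G\t1\t100\t+\tNA\tNA\tNA\tNA\tNA\tNA\tA\tR\n"
)


def read_hapmap(handle):
    """Return (header, rows) from a tab-separated hapmap file handle.

    Hypothetical helper: each row becomes a dict keyed by column name.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # first line holds the column names, hashes included
    rows = [dict(zip(header, fields)) for fields in reader]
    return header, rows


header, rows = read_hapmap(io.StringIO(HAPMAP_TEXT))
print(header[0], header[5])                 # the two hash-containing column names
print(rows[0]["rs#"], rows[0]["alleles"])   # first marker and its alleles
```

For a real file, pass an open file handle (opened with newline="") instead of the io.StringIO wrapper.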

Conversions

Tassel

Tassel is part of the pipeline Cornell uses to process GBS data. It uses hapmap format, which not a lot of tools understand. The Tassel CLI does have simple conversion tools for getting the data out of hapmap and into VCF, which is more widely supported.
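
To give a feel for what a hapmap to VCF conversion involves at the level of a single genotype, here is a rough Python sketch. This is not how Tassel does it; call_to_gt is a hypothetical helper. It assumes biallelic SNPs, that you already know which allele is the reference and which is the alternate (the hapmap alleles column does not guarantee the order), and that calls are either two letters (AG) or a single IUPAC letter (R). Anything else, including N and indel codes, is treated as missing.

```python
# Standard IUPAC ambiguity codes for two-base heterozygous calls.
IUPAC_HET = {
    "R": "AG", "Y": "CT", "S": "CG",
    "W": "AT", "K": "GT", "M": "AC",
}


def call_to_gt(call, ref, alt):
    """Translate one hapmap genotype call into a VCF GT string.

    Accepts two-letter calls ("AG") or single-letter calls ("A", "R").
    Calls that cannot be resolved against ref/alt become "./." (missing).
    """
    call = call.strip().upper()
    if len(call) == 1:
        # Single letter: IUPAC heterozygote, otherwise a homozygous base.
        call = IUPAC_HET.get(call, call * 2)
    index = {ref: "0", alt: "1"}
    if len(call) != 2 or any(base not in index for base in call):
        return "./."
    # Sort so that heterozygotes are always written as 0/1, never 1/0.
    return "/".join(sorted(index[base] for base in call))


for call in ["AA", "AG", "GA", "R", "G", "N"]:
    print(call, call_to_gt(call, ref="A", alt="G"))
```

A real converter also has to write the VCF header and the fixed columns (CHROM, POS, ID, REF, ALT and so on), which is why using an existing tool is usually the better route.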

GATK

  • From the Broad Institute: the Genome Analysis Toolkit, or GATK, is a software package developed at the Broad Institute to analyze high-throughput sequencing data.

Pretty 'upstream', but it will supposedly convert hapmap to VCF, if we need that.

Samtools

Samtools is a suite of programs for interacting with high-throughput sequencing data. Diego recommended this. I have installed it but haven't tested it. It also doesn't take hapmap as input, but it does take VCF.

PGDSpider

Swiss-army-knife conversion software, written in Java, that reads and writes dozens of formats. Definitely not plug and play: even though it does convert stuff, you have to craft an 'answer file' to tell it about the specifics of your data. The GUI helps you do this; the CLI, not so much. It doesn't take hapmap as an input format, though it does take VCF. See the PGDSpider manual for details.