[http://www.biostat.washington.edu/suminst/sisg/general Homepage for the workshop]. They have great [http://www.biostat.washington.edu/suminst/sisg/scholarships scholarship] opportunities for grad students to cover registration and travel.

== Notes from the time I went (July 2015) ==

=== SISG 9 - Population Genetics ===
Taught by Bruce Weir and Jerome Goudet. [http://www2.unil.ch/popgen/teaching/SISG15/ Homepage for the R scripts from the workshop].
* [http://www2.unil.ch/popgen/teaching/SISG15/Practicals.pdf R basics (very quick)].
** It did help me learn about data.frame and lists; they make much more sense now.
* [http://www2.unil.ch/popgen/teaching/SISG15/alfreq_hwtest.pdf Allele Frequency and Hardy-Weinberg]
** I want to know if there's a continuous function that describes the maximum value of the binomial distribution from 0-1. Seems like there should be.
** The EM algorithm for two loci isn't guaranteed to converge; sometimes it gets stuck flipping back and forth between two intermediate values. Seems easy to fix, but I'd be curious to know if there is some distribution of genotypes that is guaranteed to break it.

=== SISG 18 - MCMC for Genetics ===
Taught by Eric Anderson and Matthew Stephens.

'''[https://github.com/eriqande/sisg_mcmc_course Lecture github site]'''
* Monday AM
** Probability as representation of uncertainty vs long-run frequency.
** Mean (expectation) of a Beta(alpha, beta) distribution is alpha/(alpha+beta)
** Jeffreys prior - a=b=0.5
** Marginal distribution of y, integrating out over theta
** "Propagating uncertainty" - Take uncertainty into account down the line.
* Monday AM II
** Monte Carlo method - "In search of a definition..." - Approximate '''expectation based on sample mean''' of simulated random variables.
*** "Simple sample mean..."
** Wright-Fisher model
*** Sampling with replacement between generations
** Markov chains
*** Transition probability matrices. Do they have to be symmetric?
*** Limiting distribution (''ergodic Markov chain''): regardless of where you start, as t->inf the probability of being in a given state converges to the same value.
*** Time averages over the chain converge to expectations under the limiting distribution.
*** "known only up to scale" - shape but not normalizing constant?
*** Reversible jump MCMC? Bridge sampling? Importance sampling?
** Ergodicity
*** No transient states - no states the chain can leave and never come back to.
*** Irreducible - any state is reachable from any other state in a finite number of steps
*** Aperiodic - returns to a state are not restricted to multiples of some fixed period (can't get stuck in a fixed-length loop)
** Stationary distribution of Markov chain
*** General balance equation: πP = π, where P is a transition probability matrix and π is the stationary distribution.
*** Time-reversible Markov chains satisfy detailed balance, π_i P_ij = π_j P_ji, which is sufficient (though not necessary) for general balance.
** Metropolis-Hastings algorithm
*** Take state i, propose state j, accept the proposed move with probability min{1, Hastings ratio}
*** '''Hastings ratio:''' f(j)/f(i) × q(i|j)/q(j|i)
**** Ratio of target densities × ratio of proposal densities
**** With a symmetric proposal, q(i|j) = q(j|i), so the second factor cancels.
**** The larger f(j) is relative to f(i), the higher the acceptance probability.
* Monday PM
** easyMCMC in R
*** Sticky chains: big proposal SD = too few accepted changes, very small SD = too many accepted changes.
*** In complex problems, acceptance rate should be ~1%
*** Higher-dimensional problems generally mean a lower acceptance rate (should propose more dramatic moves, since explored space is more complex)
*** Multi-modal target (need MCMC proposal SD wide enough to traverse all modes)
** Multidimensional MCMC
*** Component-wise MCMC/Gibbs sampling
*** Genotype freq. and inbreeding
** Simple component-wise M-H sampling
*** Propose and accept/reject each parameter individually
** Gibbs sampling (full conditional distribution)
*** Latent variables - missing data models. What data would you need in order to make it really easy to solve the problem?
*** Distribution conditional on fixed state of all other parameters
*** Gibbs sampling is a special case of component-wise M-H sampling: each parameter is proposed from its full conditional given all other parameters, so every proposal is accepted.
** Wrap-up
*** MCMC almost always proposes small changes to subsets of the variables
*** Detailed balance, irreducible chain, latent variables
* Tuesday AM
** structure admixture model: hybrid zones, gene flow, population structure, subpopulations
*** Falush 2003 - non-independence between loci, allele freqs in pops incorporating inbreeding
*** Falush 2007 - dominant markers and null alleles
*** Beaumont 2001 (Scottish wildcats)
** structure prior pop info model: multilocus genotypes, sampling locations, known symmetrical migration rate, migration limited to most recent ''n'' generations.
*** More parameters
*** Oh fuck that's what the Q-matrix is, derp.
** NewHybrids (Anderson 2002) - does not require known locations, allows more than one migrant ancestor, but only 2 sources, non-symmetrical migration, dependence within loci is modeled
** BayesAss+ - Specialized models, detect recent immigrants, estimate separate migration rates, multiple locales/subpops.
*** Multilocus, requires distinct sampling locales, assumes no LD, subpops are known, infrequent migration
** MCMC in structure
*** Expected values can be approximated with sample means.
*** Dirichlet is the multivariate generalization of the beta dist
*** wat conjugate prior??
*** Dirichlet vector with k components that sum to 1
** [http://rpubs.com/eriqande/scot-cats Rrunstruct] - Usage for the R structure wrapper code
* Tuesday PM
** Latent variables could make Gibbs sampling easier?
** Haplotyping (Phase)
*** Clark's method (search population for common haplotypes)
**** Identify unambiguous individuals, construct known haplotypes, disambiguate unknown haplotypes from combinations of known haplotypes.
**** Results may depend on order of observation, frequency is ignored, only matches exact haplotypes, doesn't measure uncertainty
*** Bayesian method
**** Iterate through multiple times
**** Use haplotype freq information
**** Account for uncertainty
**** Haplotypes will look similar to ones you've seen before.
**** Incorporating recombination is trickier.
**** "Pseudo-Gibbs sampler"
**** Stronger modelling assumptions tend to underestimate uncertainty.
* Wednesday AM
** Bayesian model choice
*** Posterior odds = Prior odds × Bayes factor
**** Posterior odds: ratio of posterior probabilities of the models given the data
**** Prior odds: ratio of prior probabilities of the models
**** Bayes factor: ratio of marginal likelihoods, P(data | model 1) / P(data | model 2)
*** The Bayes factor (ratio of "marginal likelihoods") separates what the data say about the models from the prior odds, so you can see how different prior assumptions about the models would change the results.
*** Bayes factor does not rely on prior odds, which is why people use it. It still has to be interpreted in light of the prior odds and the context.
*** If you collect enough data, the posterior odds in favor of the true model will go to infinity with probability 1.
*** Sensitivity analysis: Bayes factors can be peculiarly sensitive to the priors in ways you wouldn't expect, so testing different priors could be informative.
*** Model choice: Don't use flat priors on things that are only present in one model
* Wednesday AM2
** How to reduce variance of sampling: reduce variance of function sampled or increase the number of samples taken.
** minimal relevance sampling: Choice of density of sampling will influence the variance of your Monte Carlo estimate.
** Importance sampling is pretty much all about multiplying things by 1 (f(x) = [f(x)/g(x)] g(x) for a sampling density g), dressed up in a "tricky fashion".
** '''Importance sampling:''' How to sample wisely for your Monte Carlo estimates.
*** [http://ib.berkeley.edu/labs/slatkin/eriq/classes/guest_lect/mc_lecture_notes.pdf Thorough lecture notes on this].
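A small sketch of the "multiply by 1" trick, written by me after the fact and not taken from the course material. It assumes a toy problem of my own choosing: estimating the tail probability P(X > 4) for X ~ N(0, 1) by drawing from a shifted proposal N(4, 1) and weighting each draw by f(x)/g(x). The function names, threshold, sample size and seed are all hypothetical.

```python
import math
import random


def tail_prob_importance(threshold=4.0, n=100_000, seed=1):
    """Estimate P(X > threshold) for X ~ N(0, 1), sampling from N(threshold, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)  # draw from the proposal density g
        if x > threshold:
            # Importance weight f(x)/g(x); the 1/sqrt(2*pi) constants cancel.
            total += math.exp(-0.5 * x * x + 0.5 * (x - threshold) ** 2)
    return total / n


def tail_prob_exact(threshold=4.0):
    """Exact normal tail probability via the complementary error function."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


if __name__ == "__main__":
    print("importance sampling:", tail_prob_importance())
    print("exact:              ", tail_prob_exact())
```

Sampling straight from N(0, 1) would almost never land above 4, so the naive sample mean would be mostly zeros; putting the sampling density where the integrand matters is what cuts the variance.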
** "Poor mixing is the evil cousin of reducibility"
** Metropolis-coupled MCMC (heated chains)
*** Chains with exponent modifiers 0 < x < 1
** Simulated annealing
*** Chains with exponent modifiers x > 1

==== quotes ====
* "Out of all the tomorrows we might experience...."
* "Uncertainty is, intrinsically, personal."
* "Random draws to mapped calls..."
* "It's very hard to get rational behavior out of a committee."
* "The problem is: How flat matters."
* "How to specify your prior: look into your heart and think about what you know."
* "You're going to be wrong whatever you do. These are cartoons of reality, none of these models are right."
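
==== Metropolis-Hastings sketch ====
A minimal sketch of the Metropolis-Hastings recipe from Monday, added afterwards and not taken from the course's easyMCMC code. It assumes a random-walk normal proposal (symmetric, so the Hastings ratio reduces to f(j)/f(i)) and an unnormalized Beta(3, 7) target picked purely for illustration, i.e. a density "known only up to scale". All function names and parameter values here are hypothetical.

```python
import math
import random


def log_target(theta, a=3.0, b=7.0):
    """Log of an unnormalized Beta(a, b) density; -inf outside (0, 1)."""
    if theta <= 0.0 or theta >= 1.0:
        return -math.inf
    return (a - 1.0) * math.log(theta) + (b - 1.0) * math.log(1.0 - theta)


def metropolis(n_steps=50_000, proposal_sd=0.2, start=0.5, seed=1):
    """Random-walk Metropolis; returns (samples, acceptance rate)."""
    rng = random.Random(seed)
    current = start
    samples = []
    accepted = 0
    for _ in range(n_steps):
        proposed = rng.gauss(current, proposal_sd)  # symmetric proposal
        log_ratio = log_target(proposed) - log_target(current)
        # Accept with probability min(1, f(proposed) / f(current)).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            current = proposed
            accepted += 1
        samples.append(current)  # a rejected move repeats the current state
    return samples, accepted / n_steps


if __name__ == "__main__":
    draws, rate = metropolis()
    print("sample mean:", sum(draws) / len(draws))
    print("acceptance rate:", rate)
```

The sample mean should land near alpha/(alpha+beta) = 0.3. Rerunning with a very small or very large `proposal_sd` shows the sticky-chain behaviour: tiny steps are nearly always accepted but barely move, while huge steps are mostly rejected.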