Quicklinks: General Information, Lecture Notes, Final Assignment

The main theme of this course is the rigorous mathematical analysis of **probabilistic and
combinatorial structures arising from biology**, mostly in the study of evolution and
genetics. **No biology background is required.**
The course should be of interest to probabilists, combinatorialists, applied mathematicians,
theoretical computer scientists, computational biologists, and biostatisticians.

Various stochastic processes on combinatorial structures will be considered, including random trees, Markov models on trees, multitype branching processes, finite Markov chains, random walks, exchangeable partitions, and Kingman's coalescent. Here is a tentative list of topics:

*Mathematical Phylogenetics (a.k.a. the mathematics behind the Tree of Life)*

- Trees and splits
- Characters and compatibility
- Tree-based metrics
- Markov models on trees
- Ancestral reconstruction

*Mathematical Population Genetics*

- Wright-Fisher model and Kingman's coalescent
- Infinite-alleles and infinite-sites models
- Ancestral recombination graph
- Diffusion theory

**Instructor:**Sebastien Roch**Time and place:**MoWeFr 12:05PM-12:55PM in VAN VLECK B105**Office hours:**MoWeFr 1:30PM-2:30PM in VAN VLECK 823**Prerequisites:**No biology background is required. A graduate course in stochastic processes (e.g. MATH/STAT 831) will be useful.**Texts (not required):**Topics covered will be taken mostly from- [SS] Phylogenetics by Semple and Steel
- [D] Probability Models for DNA Sequence Evolution by Durrett
- [GW] The Cartoon Guide to Genetics by Gonick and Wheelis (a useful reference on molecular biology)

**Grades**will be based on an (optional) homework and a final assignment consisting in reading a research paper of your choice and presenting it in class.

UPDATE [Sep 1, 2015]: The latest version of the lecture notes can be found here.

*Introduction*

**Lec 1 [Sep 5]:**Crash course in molecular evolution.

*Phylogenetics*

**Lec 2 [Sep 7]:**X-trees.**Lec 3 [Sep 10]:**X-trees (continued); Splits-Equivalence Theorem.**Lec 4 [Sep 12]:**Splits-Equivalence Theorem (continued).**Lec 5 [Sep 14]:**Splits-Equivalence Theorem (continued); Maximum Parsimony.**Lec 6 [Sep 17]:**Maximum Parsimony (continued).**Lec 7 [Sep 19]:**Approximating Maximum Parsimony.**Lec 8 [Sep 21]:**Approximating Maximum Parsimony (continued).**Lec 9 [Sep 24]:**Approximating Maximum Parsimony (continued). Quartet Theorem.**Lec 10 [Sep 26]:**Quartet Theorem (continued).**Lec 11 [Sep 28]:**Tree metrics.**Lec 12 [Oct 1]:**Tree metrics (continued).**Lec 13 [Oct 3]:**Tree-metric theorem.**Lec 14 [Oct 5]:**Markov models on trees.**Lec 15 [Oct 8]:**Markov models on trees (continued).**Lec 16 [Oct 10]:**Markov models on trees (continued).**Lec 17 [Oct 12]:**Consistency.**Lec 18 [Oct 15]:**Consistency (continued).**Lec 19 [Oct 17]:**Consistency (continued).**Lec 20 [Oct 19]:**Asymptotic sample complexity.**Lec 21 [Oct 22]:**Asymptotic sample complexity (continued).**Lec 22 [Oct 24]:**Asymptotic sample complexity (continued).**Lec 23 [Oct 26]:**Ancestral reconstruction.**Lec 24 [Oct 29]:**Ancestral reconstruction (continued).**Lec 25 [Oct 31]:**Kesten-Stigum bound.**Lec 26 [Nov 2]:**Eigenvector-based estimation; Steel's conjecture.

*Population genetics*

**Lec 27 [Nov 5]:**Wright-Fisher model.**Lec 28 [Nov 7]:**Wright-Fisher model (continued); Kingman's coalescent.**Lec 29 [Nov 9]:**Kingman's coalescent (continued); Ewens' Sampling Formula.**Lec 30 [Nov 12]:**Ewens' Sampling Formula (continued); Chinese restaurant process.**Lec 31 [Nov 14]:**Chinese restaurant process (continued); Infinite-sites model.**Lec 32 [Nov 16]:**Infinite-sites model (continued); Testing neutrality**Lec 33 [Nov 19]:**Recombination.**Lec 34 [Nov 21]:**Estimating the recombination rate.**Nov 23:**No class.**Lec 35 [Nov 26]:**Wright-Fisher diffusion.**Lec 36 [Nov 28]:**Wright-Fisher diffusion (continued); Fixation in the diffusion limit.**Lec 37 [Nov 30]:**A Genealogical Interpretation of Principal Components Analysis**Dec 3:**No class.**Lec 38 [Dec 5]:**Discordance of Species Trees with Their Most Likely Gene Trees

*Presentations*

**Dec 7:**- Holcomb - Summary - Paper: Shuffling Chromosomes
- Pimentel - Summary - Paper: Reconstruction on Trees: Beating the Second Eigenvalue

**Dec 10:**- Aroor - Summary - Paper: Phylogenetic mixtures on a single tree can mimic a tree of another topology
- Blackman - Summary - Paper: Population Structure and Eigenanalysis
- Solis - Summary - Paper: Consistency of Bayesian inference of resolved phylogenetic trees

**Dec 12:****Dec 14:**- Ho - Summary - Paper: Selecting Taxa to Save or Sequence: Desirable Criteria and a Greedy Solution
- Chapman - Summary - Paper: Retractions of finite distance functions onto tree metrics
- Koyama - Summary - Paper: Random Walks on Trees and Matchings

The optional homework is here. It covers only the phylogenetics part of the course. It is due during class on December 14.

For your final assignment, choose a research paper to present in class. Examples of possible papers follow. Check with me when you have made your choice. The assignment is to give a short presentation in class. Some of the papers below also contain simulations and empirical results, but I expect you to focus on the theoretical parts of the paper. The rules for the presentation are as follows:

- 14 minutes (strictly enforced)
- Blackboard only
- Discuss only one result. Do not try to summarize the entire paper.
- Sketch the proof of one key idea: lemma, computation, special case, detailed example.
- Do not spend more than a third of the talk on motivation.
- Write a 2-page summary of your presentation, to be posted here before your talk. You may use the following template.

Partial list of possible papers (a **[X]** indicates the paper has already been assigned):

- An evolutionary model for maximum likelihood alignment of DNA sequences by Thorne et al.
- On the Approximability of Numerical Taxonomy (Fitting Distances by Tree Metrics) by Agarwala et al.
- On the Uniqueness of the Selection Criterion in Neighbor-Joining by Bryant
**[X]**Reconstruction on Trees: Beating the Second Eigenvalue by Mossel**[X]**A phase transition for a random cluster model on phylogenetic trees by Mossel and Steel- More Taxa Are Not Necessarily Better for the Reconstruction of Ancestral Character States by Li et al.
**[X]**Selecting Taxa to Save or Sequence: Desirable Criteria and a Greedy Solution by Bordewich et al.**[X]**Random Walks on Trees and Matchings by Diaconis and Holmes- Geometry of the Space of Phylogenetic Trees by Billera et al.
**[X]**Retractions of finite distance functions onto tree metrics by Moulton and Steel**[X]**Phylogenetic mixtures on a single tree can mimic a tree of another topology by Matsen and Steel- Invariants of Some Probability Models Used in Phylogenetic Inference by Evans and Speed
- Ancestral Processes with Selection by Krone and Neuhauser
**[X]**Population Structure and Eigenanalysis by Patterson et al.- Haplotyping as perfect phylogeny by Gusfield
**[X]**Shuffling Chromosomes by Durrett**[X]**A Martingale Approach to the Study of Occurrence of Sequence Patterns in Repeated Experiments by Li- Strong Limit Theorems of Empirical Functionals for Large Exceedances of Partial Sums of I.I.D. Variables by Dembo and Karlin
**[X]**Consistency of Bayesian inference of resolved phylogenetic trees by Steel

- Inferring phylogenies by Felsenstein
- Fundamentals of molecular evolution by Graur and Li
- Combinatorial Stochastic Processes (St-Flour Lectures) by Pitman
- Ancestral Inference in Population Genetics (St-Flour Lectures) by Tavare
- Recent progress in coalescent theory by Berestycki
- Logarithmic Combinatorial Structures by Arratia et al.
- Mathematical Population Genetics by Ewens
- Algebraic statistics for computational biology edited by Pachter and Sturmfels

Supported partly by NSF grant DMS-1248176.

Last updated: Dec 14, 2012.