Monday, September 29, 2014

Goofy genealogies

Family pedigrees seem to be confusing things, because there are two distinct interpretations of the expression "family tree".

First, the pedigree tree could be drawn with a particular contemporary person at the root of the tree, so that the tree expands backwards in time to increasing numbers of ancestors at the leaves (ie. an "ascent tree"). In some ways this seems quite illogical as an analogy, given that the base of a real tree is the origin of its growth.

Second, the pedigree tree could be drawn with a particular ancestor at the root of the tree, so that the tree expands forwards in time to increasing numbers of descendants at the leaves (ie. a "descent tree"). This is more logical, although we often draw the root at the top. (The following example is actually a network, rather than strictly a tree; see also Pedigrees and phylogenies are networks not trees.)

Pedigrees are generally somewhat different from phylogenies, but in phylogenetics we do choose the latter option for interpreting trees — we start with a collection of contemporary leaves and try to reconstruct the tree backwards towards the common ancestor. Thus the root is at the "base" of the tree, even when we draw the root at the top of the diagram.

In popular usage these distinctions are often blurred. Consider this "family tree" of the Disney character Goofy. It is taken from Gilles R. Maurice's Calisota web page, where the character names are listed clearly.

This is based on the first usage described above, since Goofy himself is at the base and his ancestors are at the leaves. This is actually closer to a lineage than to a tree, especially as no females seem to be involved at any stage.

However, roughly the same information can be presented the other way around. This cartoon is taken from a different Calisota page.

Here, Goofy is now at the top of the tree and his ancestry proceeds downwards, with the oldest ancestor at the base (except for his son!). This really is confusing.

Wednesday, September 24, 2014

Splits and neighborhoods in splits graphs

I have written before about How to interpret splits graphs. However, it is worth emphasizing a few points, so that people don't keep Mis-interpreting splits graphs.

A splits graph can potentially represent two main types of pattern. First, like a clustering analysis, it represents groups in the data that are in some way similar. Each group is represented by an explicit split in the graph (see Recognizing groups in splits graphs). The clusters may be hierarchically arranged (each group nested within another group), and they may overlap, so that objects can simultaneously be a member of more than one group. If the clusters do not overlap then the graph will be a tree.

Second, like an ordination analysis, a splits graph can summarize the multi-dimensional neighborhoods of the different objects. That is, the relative distance between the points on the graph summarizes the relationships among the objects: closer objects, as measured along the edges of the graph, are more similar.

These two patterns often appear in the same splits graph. Unfortunately, many published papers mis-interpret neighborhoods as splits. If there is an explicit split representing a cluster of interest, then the data can be said to support that possible cluster. However, if no such split exists, then the graph is agnostic with respect to that cluster — there might be no support for it in the data, or the split might be left out of the graph because other splits out-weigh it. So, graph objects occupying a particular neighborhood might not be well-supported by the original data, contrary to the interpretation sometimes seen in the literature.

This can be illustrated with a specific example, taken from: Sicoli MA, Holton G (2014) Linguistic phylogenies support back-migration from Beringia to Asia. PLOS One 9: e91722.

The splits graph is a consensus network, summarizing all of the splits with at least 10% support in 3000 Bayesian MCMC trees. The authors note that the dashed line represents a "primary division" between the groups, and that the differently colored objects represent "clear groupings".

However, the dashed line is supported only by a small split, which has a larger contradictory split (that puts the North PCA group with the Plains-Apachean group). This split thus cannot be said to be well supported. Furthermore, the South Alaska grouping is not supported by any split shown in the graph (there are, however, two splits that combine uniquely to support it). That is, the South Alaska grouping represents a neighborhood rather than a supported cluster. Finally, the Alaska-Canada-1 grouping is also not supported by an uncontradicted split (ie. the tcb taa tau samples could as easily be part of the West Alaska grouping). All of the other identified groups are supported by unique and uncontradicted splits.

So, there are three types of pattern in this splits graph with respect to the groups of interest to the authors: uncontradicted splits, contradicted splits, and neighborhoods, representing good support, medium support and agnosticism, respectively. It is important to recognize these three possibilities, and to interpret them correctly with respect to "support" for any conclusions.
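The three cases can be sketched in code. This is a minimal Python sketch, not the procedure used by any splits-graph program. It treats a split as the set of samples on one side, and uses the standard rule that two splits are compatible when at least one of the four intersections of their sides is empty. Split weights are ignored entirely, and the sample names, toy splits and function names are all hypothetical.

```python
def is_compatible(side_a, side_b, taxa):
    """Two splits are compatible if at least one of the four
    intersections of their sides is empty."""
    a1, b1 = set(side_a), set(side_b)
    a2, b2 = set(taxa) - a1, set(taxa) - b1
    return any(not (x & y) for x in (a1, a2) for y in (b1, b2))


def classify_group(group, splits, taxa):
    """Classify a group of interest against the splits shown in a graph."""
    taxa = frozenset(taxa)
    group = frozenset(group)
    shown = [frozenset(s) for s in splits]
    # A split can be written from either side, so check the complement too.
    if group not in shown and (taxa - group) not in shown:
        return "neighborhood"
    contradicted = any(not is_compatible(group, s, taxa) for s in shown)
    return "contradicted split" if contradicted else "uncontradicted split"


# Hypothetical toy graph with five samples and three displayed splits.
taxa = {"a", "b", "c", "d", "e"}
splits = [{"a", "b"}, {"b", "c"}, {"d", "e"}]

print(classify_group({"d", "e"}, splits, taxa))  # uncontradicted split
print(classify_group({"a", "b"}, splits, taxa))  # contradicted split
print(classify_group({"c", "d"}, splits, taxa))  # neighborhood
```

In the terms used above, these three outcomes correspond to good support, medium support and agnosticism. A real assessment would also compare the weights of the conflicting splits.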

As an aside, I will point out that in the other splits graph in the same paper (a NeighborNet): the dashed line is not supported by any split, two of the colored groupings are not supported by any split, and two of the others have only a small contradicted split. Thus, the "primary division" and the "clear groupings" mostly represent neighborhoods, and are thus only dubiously supported.

Monday, September 22, 2014

Reducing networks to trees

I have commented before about the perceived tendency to resist thinking about evolutionary relationships as networks (Resistance to network thinking), and even to present reticulating evolutionary relationships as trees rather than as networks (The dilemma of evolutionary networks and Darwinian trees). Charles Darwin seems to be the guilty party in starting this phenomenon.

This behavior becomes particularly obvious when we consider family genealogies. A good example appears when we consider the family relationships of the Olympian gods of Ancient Greece. Several illustrations of these relationships are gathered together on the Olympian Gods Family Tree web page.

Noteworthy is the particularly frisky nature of Zeus, who "got around a bit", to put it mildly. As shown in the first diagram, Zeus was the offspring of Cronus and Rhea. However, he then fathered children with at least nine people, including two of his own sisters, an aunt, a first cousin, and several first cousins once removed, among others. This creates the complex network shown.

However, not everyone wants to draw family genealogies as reticulating networks. After all, they are usually called "family trees". As shown by the examples below, the most common way to reduce a network to a tree is simply to repeat people's names as often as necessary. That is, rather than have them appear once (representing their birth) with multiple reticulating connections representing their reproductive relationships, they appear repeatedly, once for their birth and once for each relationship, so that there are no reticulations. I will leave it to you to count how often Zeus appears in each of these so-called family trees.
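The counting rule just described (once for birth, once for each relationship) can be sketched in a few lines of Python. The data below are a small toy subset of well-known mythological relationships, assumed here for illustration only and not transcribed from the diagrams on the web page. The data layout and function name are hypothetical.

```python
from collections import Counter

# Toy subset: each union is (parent, parent, [children]).
unions = [
    ("Cronus", "Rhea", ["Zeus", "Hera", "Demeter"]),
    ("Zeus", "Hera", ["Ares"]),
    ("Zeus", "Demeter", ["Persephone"]),
    ("Zeus", "Leto", ["Apollo", "Artemis"]),
]


def tree_appearances(unions):
    """Count how often each name is written when reticulations are removed
    by repetition: once as a child, plus once for each union."""
    counts = Counter()
    for parent_a, parent_b, children in unions:
        counts[parent_a] += 1
        counts[parent_b] += 1
        for child in children:
            counts[child] += 1
    return counts


counts = tree_appearances(unions)
people = len(counts)                  # nodes needed in the network
written = sum(counts.values())        # names written in the "tree"
print(counts["Zeus"], people, written)
```

In the network each person is a single node, so the difference between `written` and `people` is the number of duplicated names that the tree-like drawing needs in order to hide the reticulations.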

Clearly, this is misleading, and it makes no sense to obscure the fact that a so-called tree is actually a reticulate network. If relationships are reticulate then it is best to illustrate them that way, rather than to disguise the networks as trees.

Wednesday, September 17, 2014

Using data-display networks to assess evolutionary inferences

Phylogenetic networks are of two types: those that produce direct evolutionary inferences about gene flow (eg. hybridization networks, HGT networks), and those that display multiple patterns in multivariate datasets without any necessary evolutionary implications. The latter (called data-display networks) can be used both a priori as tools for exploratory data analysis (EDA), and a posteriori as a means of evaluating (or cross-checking) the support for inferences derived from other analyses (such as evolutionary networks).

Here, I present an example of the a posteriori usage.

The data and initial analysis come from:
Fu Q, Meyer M, Gao X, Stenzel U, Burbano HA, Kelso J, Pääbo S. (2013) DNA analysis of an early modern human from Tianyuan Cave, China. Proceedings of the National Academy of Sciences of the USA 110: 2223-2227.
They describe their genome data and evolutionary analysis like this:
We have extracted DNA from a 40,000-year-old anatomically modern human from Tianyuan Cave outside Beijing, China.
To investigate the relationship of the Tianyuan individual to present-day populations, we compared it to chromosome 21 sequences from 11 present-day humans from different parts of the world (a San, a Mbuti, a Yoruba, a Mandenka, and a Dinka from Africa; a French and a Sardinian from Europe; a Papuan, a Dai, and a Han from Asia; and a Karitiana from South America) and a Denisovan individual, each sequenced to 24- to 33-fold genomic coverage. Denisovans are an extinct group of Asian hominins related to Neandertals [and used as an outgroup]. In the combined dataset, 86,525 positions variable in at least one individual are of high quality in all 13 individuals.
To more accurately gauge how the population from which the Tianyuan individual is derived was related to Eurasian populations, while taking gene flow between populations into account, we used a recent approach that estimates a maximum-likelihood tree of populations and then identifies relationships between populations that are a poor fit to the tree model and that may be due to gene flow [using the TreeMix program] ... The maximum-likelihood tree [reproduced above] shows that the branch leading to the Tianyuan individual is long, due to its lower sequence quality. However, among Eurasian populations, Tianyuan clearly falls with Asian rather than European populations (bootstrap support 100%). The strongest signal not compatible with a bifurcating tree is an inferred gene-flow event that suggests that 6.7% of chromosome 21 in the Papuan individual is derived from Denisovans ... When this is taken into account, the Tianyuan individual appears ancestral to all Asian individuals studied. We note, however, that the relationship of the Tianyuan and Papuan individuals is not resolved (bootstrap support 31%).
Setting aside the faux pas about the Tianyuan individual being "ancestral" to the others (it is shown in the tree-based figure as the sister group not the ancestor), most of the other interpretations can be assessed by looking at the multivariate data independently of any evolutionary inference. This can be done using the pairwise nucleotide differences among the samples (provided in Table 1 of the paper) and a NeighborNet data-display network, as shown in the splits graph below.

(Note that the authors refer to their figure as a "tree", although it is an introgression network.) We can note the following points, some of which support the authors' conclusions and some of which don't:
  • All terminal edges in the network are long, and so there is actually not much genomic information on chromosome 21 about relationships.
  • The network splits do roughly match the tree splits, and so the network apparently does reflect some evolutionary information.
  • The identified gene flow from the Denisovan to the Papuan is represented by a clear split in the network. The weight (0.7335) makes it the fifth largest non-trivial split. That is, it is larger than some of the splits that purportedly represent tree-like evolution.
  • The largest split (weight = 2.8942) separates the non-African samples from the African samples + Denisovan outgroup, which does accord with the postulated dispersal of humans out of Africa.
  • The second (1.1459) and third (0.8073) largest splits are near the root of the tree.
  • The European split is the fourth largest (0.7670). The South American sample is included with the Asian group, reflecting the idea that the native people of the Americas migrated there from Asia across the Bering Strait.
  • The relationships among the Asian samples in the network do not all match those in the tree. Notably, the Han+Dai split (0.5124) is smaller than the Han+Karitiana split (0.6292), and yet the former appears in the tree with 100% bootstrap support.
  • The Han+Dai+Karitiana split is well supported (0.4450), but the Han+Dai+Karitiana+Papuan split is not (0.0152), as reflected in the 31% bootstrap value for the latter in the tree.
  • The Han+Dai+Karitiana+Papuan+Tianyuan split is not displayed in the network, although it has a long edge in the tree. The closest network split, as displayed, includes the Denisovan sample. Thus, the network emphasizes the reticulate Denisovan-Papuan relationship at the expense of showing all of the tree-like relationships among the Asian samples.
  • The Tianyuan edge is not long in the network whereas it is long in the tree. This is likely to be because of uncertainty in its placement in the tree, rather than poor sequence quality, as claimed by the authors.
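The ranking of the split weights in the list above can be checked with a short Python sketch. It uses only the nine weights quoted in this post, not the full set of splits in the network, and the dictionary labels are informal names chosen here for illustration.

```python
# Split weights as quoted in the list above (labels are informal).
quoted_weights = {
    "non-African vs African+Denisovan": 2.8942,
    "second largest (near root)": 1.1459,
    "third largest (near root)": 0.8073,
    "European": 0.7670,
    "Denisovan+Papuan": 0.7335,
    "Han+Karitiana": 0.6292,
    "Han+Dai": 0.5124,
    "Han+Dai+Karitiana": 0.4450,
    "Han+Dai+Karitiana+Papuan": 0.0152,
}

# Rank the splits from largest to smallest weight.
ranked = sorted(quoted_weights.items(), key=lambda item: item[1], reverse=True)
rank = {name: position for position, (name, _) in enumerate(ranked, start=1)}

for name, weight in ranked:
    print(f"{rank[name]:>2}  {weight:.4f}  {name}")
```

Among the quoted values, the Denisovan+Papuan split ranks fifth, and the Han+Karitiana split outranks the Han+Dai split that appears in the tree.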

Thus, the data-display network questions some of the details of the authors' evolutionary network. However, it does support placing the Tianyuan sample with the Asian ones, as well as possible gene flow from the Denisovan sample to the Papuan one.

It thus seems to be a valuable procedure to cross-check any evolutionary analysis with a data-display network. As I have noted before (Networks and bootstraps as tree-support criteria; How networks differ from bootstrapped trees), bootstrap values on a tree are insufficient as a means of assessing the robustness of evolutionary diagrams.

Monday, September 15, 2014

Guitars and networks

I have noted before that the evolutionary history of musical instruments is likely to be a reticulating network rather than being tree-like (Cornets: from a tree to a network). As another illustration of the pattern, we can consider the evolution over the past few centuries of the Spanish or flamenco guitar (taken from the Origem do nome Violão blog post).

This genealogy (with time proceeding from left to right) shows three basic characteristics that seem to be common in anthropological histories. First, there are multiple roots: in this case, three different instruments from the 16th century have provided input into the modern acoustic guitar. Second, there is an early history of reticulation, with ideas for new instrumentation being taken freely from among the existing instruments, in this case presumably in the search for better sound reproduction. Third, there is simple transformational evolution, with new models replacing the previous ones in popularity; for example, over the past 100 years the Spanish guitar has simply gotten larger (this is Cope's Rule).

Wednesday, September 10, 2014

The importance of the Amish for reticulate genealogies

I noted in my previous blog post (Charles Darwin and the coalescent) that the multispecies coalescent needs to be based on a network model not a tree model. This is because reticulation processes occur both within species and between species — there is gene flow within genealogies and within phylogenies.

Reticulate genealogies are nothing new, and I have blogged about some of the best-known human genealogies with reticulations due to consanguinity (marriage between close relatives):
King Charles II of Spain
Charles Darwin
Henri Toulouse-Lautrec
Albert Einstein
Pharaoh Tutankhamun
Pharaoh Cleopatra

Importantly, in the modern world there are quite a few genealogical datasets available for study. For example, the Kinsources repository has c. 100 datasets from around the world, covering multi-generational histories for nearly 350,000 individuals. These data are actively used for research (eg. Bailey et al. 2014).

However, the best documented human genealogies are those for the various Anabaptist populations, who moved from Europe to North America during the 18th and 19th centuries. Anabaptists have mostly closed populations (ie. marriages occur solely within a population), and are thus inbred. Most importantly, they maintain detailed written genealogies. These populations include the Mennonites, Hutterites and Amish, the latter being the best known.

As noted by Agarwala et al. (2001):
The term "Anabaptist" literally means "rebaptizer" and is used to refer to a Christian movement that arose in central Europe in the first half of the 16th century. Adherents support adult baptism, pacifism, and separation of church and state. Among the large Anabaptist groups existing today are Mennonites (who were originally followers of Menno Simons), Amish (originally followers of Jakob Ammann who split away from the Mennonites at the end of the 17th century), and Hutterites (originally followers of Jakob Hutter). Amish and Mennonites emigrated to North America in multiple waves in the 18th and 19th centuries. The Hutterites began emigrating to the northern and western parts of North America in the late 1800s.
Distribution of Amish settlements in North America
Note the rapid expansion over the past 25 years.

The Mennonites originated in the Swiss Alps, and diffused northward into Germany and the Netherlands. The Dutch/North German Mennonites began the migration to America in the 1680s, followed by a much larger migration of Swiss/South German Mennonites beginning in 1707. The Amish are an early split from the Swiss/South German group that occurred in 1693. There are now at least 200,000 Amish in the eastern United States and eastern Canada (see the map above, taken from here), with the numbers apparently growing rapidly with recently increasing movement westward. There are various subgroups (eg. Old Order Amish, New Order Amish). There are about 1.7 million Mennonites worldwide, with c. 150,000 in the eastern United States and eastern Canada. The genealogies of 295,000 Mennonite and Amish individuals from the eastern USA have been databased (Agarwala et al. 2001).

The Hutterites originated as an Anabaptist offshoot in the Tyrolean Alps in the 1500s, but now there are c. 135,000 Hutterites living on 1,350 communal farms in the northern United States (principally South Dakota) and western Canada. Genealogical records trace all extant Hutterites to 90 ancestors who lived during the early 1700s to the early 1800s (see Ober et al. 1999).

These Anabaptist groups are frequently used in medical studies, because it is possible to relate disease occurrences to the recorded genealogy, and thus to assess the genetic component of the disease (eg. Dorsten et al. 1999, Hou et al. 2013). So, the literature is replete with figures showing the distribution of different diseases plotted onto the genealogy. I have included some of the Amish ones here, to illustrate the extreme reticulation that results when inbreeding is ongoing over many generations.
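One standard way to put a number on the inbreeding in such a recorded genealogy is the inbreeding coefficient, which equals the kinship coefficient of an individual's parents. The following is a minimal Python sketch of the usual recursive calculation, and not the method of any of the cited studies. It assumes that founders are unrelated and not inbred, and that individuals are listed with parents before children. The toy pedigree and all names are hypothetical.

```python
from functools import lru_cache


def make_kinship(parents, order):
    """parents maps child -> (father, mother); founders are absent from it.
    order lists every individual, with parents before their children."""
    index = {person: i for i, person in enumerate(order)}

    @lru_cache(maxsize=None)
    def kinship(a, b):
        # Always expand the individual listed later, so that we never
        # recurse through an ancestor of the other individual.
        if index[a] < index[b]:
            a, b = b, a
        if a not in parents:  # founder: assumed unrelated and not inbred
            return 0.5 if a == b else 0.0
        father, mother = parents[a]
        if a == b:
            return 0.5 * (1.0 + kinship(father, mother))
        return 0.5 * (kinship(father, b) + kinship(mother, b))

    return kinship


def inbreeding(person, parents, kinship):
    """Inbreeding coefficient = kinship coefficient of the two parents."""
    if person not in parents:
        return 0.0
    father, mother = parents[person]
    return kinship(father, mother)


# Hypothetical pedigree: C and D are first cousins, and E is their child.
parents = {
    "A": ("G1", "G2"), "B": ("G1", "G2"),
    "C": ("A", "X"), "D": ("B", "Y"),
    "E": ("C", "D"),
}
order = ["G1", "G2", "X", "Y", "A", "B", "C", "D", "E"]
kinship = make_kinship(parents, order)
print(inbreeding("E", parents, kinship))  # 0.0625, ie. 1/16
```

With many generations of marriages within a closed population, the same recursion accumulates contributions from every shared ancestor, which is what the heavily reticulated pedigrees illustrate graphically.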

This first one is from Georgi et al. (2014). The diseased people are marked in red.

The next one is from Garner et al. (2001).

This one is from Lee et al. (2008).

The final one is from Racette et al. (2002).

Here is one small part of this genealogy, which emphasizes that between-generation marriages are an important component of the consanguinity.


Agarwala R, Schaffer A, Tomlin J (2001) Towards a complete North American Anabaptist genealogy II: analysis of inbreeding. Human Biology 73: 533-545.

Bailey DH, Hill KR, Walker RS (2014) Fitness consequences of spousal relatedness in 46 small-scale societies. Biology Letters 10: 20140160.

Dorsten L, Hotchkiss L, King T (1999) The effect of inbreeding on early childhood mortality: twelve generations of an Amish settlement. Demography 36: 263-271.

Garner C, McInnes LA, Service SK, Spesny M, Fournier E, Leon P, Freimer NB (2001) Linkage analysis of a complex pedigree with severe bipolar disorder, using a Markov chain Monte Carlo method. American Journal of Human Genetics 68: 1061-1064.

Georgi B, Craig D, Kember RL, Liu W, Lindquist I, Nasser S, Brown C, Egeland JA, Paul SM, Bućan M (2014) Genomic view of bipolar disorder revealed by whole genome sequencing in a genetic isolate. PLoS Genetics 10: e1004229.

Hou L, Faraci G, Chen DT, Kassem L, Schulze TG, Shugart YY, McMahon FJ (2013) Amish revisited: next-generation sequencing studies of psychiatric disorders among the Plain people. Trends in Genetics 29: 412-418.

Lee SL, Murdock DG, McCauley JL, Bradford Y, Crunk A, McFarland L, Jiang L, Wang T, Schnetz-Boutaud N, Haines JL (2008) A genome-wide scan in an Amish pedigree with parkinsonism. Annals of Human Genetics 72: 621-629.

Ober C, Hyslop T, Hauck WW (1999) Inbreeding effects on fertility in humans: evidence for reproductive compensation. American Journal of Human Genetics 64: 225-231.

Racette BA, Rundle M, Wang JC, Goate A, Saccone NL, Farrer M, Lincoln S, Hussey J, Smemo S, Lin J, Suarez B, Parsian A, Perlmutter JS (2002) A multi-incident, Old-Order Amish family with PD. Neurology 58: 568-574.

Monday, September 8, 2014

Inbreeding creates the most complex networks

In an earlier blog post (The ultimate phylogenetic network?) I reproduced the lattice network from the anthropologist Franz Weidenreich. This comes close to being as complex as a network can get when applied to groups of organisms. However, when we study the genealogy of individuals, the network can get much more complex. This will be most true when there are marriages between close relatives (consanguinity), which creates inbreeding.

The family pedigree (or family tree!) shown here is for a group of people in a recently isolated population from the southwestern area of The Netherlands. There are 4,645 people involved, covering 18 generations (one row each). The average number of consanguineous loops for the 103 study individuals is 71.7, which is what is creating all of the cross-connections that make the network look so horrendous. (Consanguineous or inbreeding loops are illustrated here.)
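The basic condition for a consanguineous loop is that an individual's two parents share at least one ancestor. Here is a minimal Python sketch of that check. It only lists the shared ancestors and does not reproduce the loop count reported by Liu et al. The toy pedigree and function names are hypothetical.

```python
def ancestors(person, parents):
    """All ancestors of person; parents maps child -> (father, mother)."""
    found = set()
    stack = list(parents.get(person, ()))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, ()))
    return found


def shared_ancestors_of_parents(person, parents):
    """People through whom both parents of person are connected.
    A non-empty result means the pedigree contains a consanguineous loop."""
    if person not in parents:
        return set()
    father, mother = parents[person]
    # Include the parents themselves, to catch between-generation unions
    # in which one parent is an ancestor of the other.
    father_line = ancestors(father, parents) | {father}
    mother_line = ancestors(mother, parents) | {mother}
    return father_line & mother_line


# Hypothetical pedigree: C and D are first cousins, and E is their child.
pedigree = {
    "A": ("G1", "G2"), "B": ("G1", "G2"),
    "C": ("A", "X"), "D": ("B", "Y"),
    "E": ("C", "D"),
}
print(sorted(shared_ancestors_of_parents("E", pedigree)))  # ['G1', 'G2']
print(sorted(shared_ancestors_of_parents("C", pedigree)))  # []
```

In a pedigree like the one shown here, most study individuals would return many shared ancestors, each reachable by several paths, which is why the drawing is so densely cross-connected.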

The genealogy is from:
Liu F, Arias-Vásquez A, Sleegers K, Aulchenko YS, Kayser M, Sanchez-Juan P, Feng BJ, Bertoli-Avella AM, van Swieten J, Axenovich TI, Heutink P, van Broeckhoven C, Oostra BA, van Duijn CM (2007) A genomewide screen for late-onset Alzheimer disease in a genetically isolated Dutch population. American Journal of Human Genetics 81: 17-31.

Wednesday, September 3, 2014

Charles Darwin and the coalescent

The full title of Charles Darwin's most famous book was On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. It is important to note that this title juxtaposes the concepts of between-species variation and within-species variation (Darwin usually referred to "races" rather than to "breeds", "subspecies", etc). This was one of his major insights: the idea that there is a continuum of variation in biology through time (or, as he put it, that it is arbitrary whether variants are treated as different races or as different species).

As I recently noted, this paved the way for between-species phylogenies to be seen as directly analogous to within-species genealogies (The role of biblical genealogies in phylogenetics). Previous applications of genealogies to non-humans (such as those of Buffon and Duchesne) had been explicitly restricted to within-species relationships.

This conceptual integration of within-species and between-species relationships has become explicit in modern biology by using multispecies coalescent models to integrate population genetics and phylogenetics. As noted by Reid et al. (2014):
These models treat populations, rather than alleles sampled from a single individual, as the focal units in phylogenetic trees. The multispecies coalescent model connects traditional phylogenetic inference, which seeks primarily to infer patterns of divergence between species, and population genetic inference, which has typically focused on intraspecific evolutionary processes. The development of these models was motivated by the common empirical observation that genealogies estimated from different genes are often discordant and the discovery that, if ignored, this discordance can bias parameters of direct interest to systematists, such as the relationships and divergence times among species.
However, as specifically emphasized by Reid et al.:
In order to reconcile discordance among gene trees and uncover true species relationships, the first gene tree/species tree models assumed that discordance is solely the result of stochastic coalescence of gene lineages within a species phylogeny ... Coalescent stochasticity, however, is not the only source of gene tree discordance. Selection, hybridization, horizontal gene transfer, gene duplication/extinction, recombination, and phylogenetic estimation error can also result in discordance.
They examined this situation by studying the fit of the multispecies coalescent model:
to 25 published data sets. We show that poor model fit is detectable in the majority of data sets; that this poor fit can mislead phylogenetic estimation; and that in some cases it stems from processes of inherent interest to systematists ...
Our analyses suggest that poor fit to the multispecies coalescent model can mislead inference in empirical studies. In the case of recent hybridization, the consequences may be severe, as species divergences are forced to post-date gene divergences ... When topological conflict among coalescent genealogies is the result of ancient hybridization, balancing selection, or gene duplication and extinction, the consequences may be less severe.
In other words, tree-based phylogenetics is inadequate in practice because of gene flow. Within-species genealogies and between-species phylogenies intersect in the concept of a network, not a tree. That is, the multispecies coalescent needs to be based on a network model not a tree model:
The biological processes that generate variation in gene tree topologies should be explicitly modeled, as should relevant dynamics of molecular evolution. Increasingly complex multispecies coalescent models are being implemented, but there are tradeoffs. Some examine gene duplication and extinction or migration but cannot estimate divergence times.
So, current models are inadequate. It will be interesting to see how these approaches develop to incorporate gene flow (reticulation) into what has heretofore been a tree model (modeling only ancestor-descendant relationships), as we are still in need of methods for estimating rooted evolutionary networks.


Reid NM, Hird SM, Brown JM, Pelletier TA, McVay JD, Satler JD, Carstens BC (2014) Poor fit to the multispecies coalescent is widely detectable in empirical data. Systematic Biology 63: 322-333.