Tuesday, November 14, 2017

Power laws and cryptocurrencies

The Power Law is used to describe phenomena where large occurrences are rare but small ones are quite common. For example, there are few billionaires while most people make only a modest income; there are few large cities but many small towns; there are few very frequent words but many rare words.

Mathematically, Power Laws are of interest because of what is known as "scale invariance", as well as the fact that there is no well-defined average value. Furthermore, Power Laws are considered to be universal — you can read about this in Wikipedia. One of the more obvious places that we might expect to find them is in the exchange rates of currencies (their "worth") — there will be a few of great worth (the "major currencies") and lots of lesser worth.

For example, I recently read the headline: Bitcoin isn't "too expensive", says BTCC boss Bobby Lee. He was defending the price of the digital currency Bitcoin, which has increased in value more than 600 percent this year, claiming that this is not evidence of a financial bubble, but instead is evidence that the currency is proving its utility in the digital world. Obviously, I cannot let this claim pass without turning a quantitative eye upon it.


Bitcoin is the original cryptocurrency, established in 2009, just after the financial crash of that time. It is a digital currency, which by design has no central bank or regulatory authority supporting it. The coins don’t exist in a tangible form, but instead exist solely in a digital "wallet". Nevertheless, they can still be exchanged and used in transactions, just as with any fiat currency.

Bitcoin is based on a technology now referred to as the blockchain, which seriously has the potential to redefine future economic and legal transactions. Indeed, it is the blockchain idea that has proven to be of interest to financial and legal institutions, not the currency itself (which is just an example of using the blockchain). Blockchain is a distributed digital database, where every transaction is broadcast over the net and stored publicly, making it immutable as well as transparent. Compared to traditional financial and legal systems, this provides increased security, higher efficiency, greater error resistance, and reduced transaction costs. You can read about it in The ultimate 3500-word guide in plain English to understand Blockchain.

Bitcoin was launched for around $US0.005 (ie. half a cent). It was pretty much ignored for 4 years, but it has increased greatly in popularity over the past 4 years. Its exchange rate first exploded to a peak in late 2013, followed by a slow decline of nearly 90% (associated with the collapse of the Mt Gox digital currency exchange). It has achieved near-manic popularity in the past year, as shown in the first graph.

From CoinGecko
Bitcoin exchange rate with the US dollar

So, we now have headlines like this: Bitcoin just surged over $4000 and is near biggest financial crash in 400 years. The reference is to what is known as Tulip mania, in the Netherlands in 1636-1637, where the tulip bulb prices quickly went from 1 guilder to 60, exploded to 1,000 or more, and then crashed. This is the context within which Bobby Lee made his claim (quoted above) that the current Bitcoin price is not too high.

The important point for our purposes here is that Bitcoin has spawned a host of imitators. So, there are now, or have been, more than 1,000 cryptocurrencies in existence. Many of them are intended as genuine digital currencies, each one addressing one or more of the perceived limitations of the original Bitcoin (such as its inability to scale up to a large number of transactions, or to process transactions faster). Indeed, we may see Bitcoin as a proof of concept and/or pilot study for digital currencies.

Most of the so-called altcoins, however, are not intended as general-use currencies at all. Instead, they form a totally new mode of fundraising for start-up companies, which now sell custom cryptocurrencies in order to raise investment. That is, instead of issuing shares as an IPO (initial public offer) they have an ICO (initial coin offer), thus bypassing the traditional venture capital processes. There is a whole new world of digital finance emerging (see Cryptocurrency mania fuels hype and fear at venture firms).


In order to assess the comparative price of Bitcoin to the altcoins, I need the exchange rate of the current crop of cryptocurrencies. I took the CoinGecko rates at 14:25 UTC on 11 November 2017 (they change by the minute!). There were 735 coins listed, of which I took the top 100 exchange rates in US dollars. I then ignored the data for the Bit20 coin, which is actually related to an index fund, and thus has a price that is unrelated to the other currencies.

The next graph shows the currencies listed in the rank order of their value. This should illustrate a special case of the Power Law that is known as Zipf's Law, which refers to the "size" of each event relative to its rank order of size. The standard way to evaluate the Zipf pattern is to plot the data with both axes of the graph converted to logarithms, under which circumstances the data should form a straight line.
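As a minimal sketch of that log-log check, the following Python fits a straight line to log(value) against log(rank) by ordinary least squares. The "prices" are synthetic numbers generated from an exact power law purely for illustration (they are not the CoinGecko data), and the function name `fit_zipf` is hypothetical.

```python
import math

def fit_zipf(values):
    """Least-squares fit of log10(value) = intercept + slope * log10(rank).

    Values are sorted in decreasing order, so rank 1 is the largest.
    Returns (slope, intercept, r_squared).
    """
    ordered = sorted(values, reverse=True)
    xs = [math.log10(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log10(v) for v in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Synthetic "prices" following value = 7000 * rank^(-2); not real data.
synthetic_prices = [7000.0 * rank ** -2.0 for rank in range(1, 100)]
slope, intercept, r_squared = fit_zipf(synthetic_prices)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R^2 = {r_squared:.4f}")
```

With real exchange rates the points will scatter around the fitted line. A coin lying well above the line would be a candidate for "over-priced" relative to the others, and one well below it a candidate for "under-priced".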

As you can see, the exchange rates do fit Zipf's Law very well. In particular, Bitcoin, which is the #1 ranked coin, is not over-priced relative to the other coins. Note that this does not address the question as to whether all of the coins are over-priced or not. That would be a separate question, about the intrinsic value of cryptocurrencies.

Note that the top 25 ranked coins do not fit the Power Law as well as do the remaining 75 coins. So, we might also look at these top coins separately. This is shown in the next graph.

These 25 coins also fit Zipf's Law very well, but the power exponent is clearly smaller than for the remaining coins. In this case, Bitcoin fits the Power Law even better than before. Like it or not, relative to the other coins, Bitcoin is, indeed, not "too expensive".

Very few of the coins appear to be over-priced (ie. far above the line), but a few of them might be considered under-priced (ie. far below the line). In particular, the #4 ranked coin is the SegWit2x [Futures]. This coin represents a controversial suggestion to split off from Bitcoin. It has not received a great deal of support from the Bitcoin community, and the proposed split was officially suspended only a few days ago. Whether it will go ahead eventually is unclear. The #5 ranked coin is Dash, which is often touted as a currency much more like cash, in the sense that the users can remain almost completely anonymous (which is actually a bit tricky with Bitcoin).

In the world of currency exchange, the big three pieces of information about each currency are (i) the Price of each coin, (ii) the Market Capitalization, which is the total coin supply multiplied by the coin price, and (iii) the Liquidity, which refers to how easy it is to buy and sell coins without causing a change in their price (it is used to measure the market share, market maturity and market acceptance). We could summarize this information for each coin by using a phylogenetic network.

So, I took the information as supplied by CoinGecko (see above) in US dollars, and log-transformed the numbers (economic worth is usually considered to be log-normally distributed). I then calculated the Manhattan distances pairwise between the currencies, and plotted this using a NeighborNet graph, as shown in the final figure. The 10 top-price currencies have their full name shown, while the remainder are labeled with their exchange abbreviation. As usual, coins that have similar financial characteristics are near each other in the network; and the further apart the coins are in the network then the more different are their characteristics.
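The distance step can be sketched in a few lines of Python. The coin names and the (Price, Capitalization, Liquidity) values below are invented for illustration, and the helper names are hypothetical; only the log-transform and the Manhattan distance come from the description above.

```python
import math
from itertools import combinations

# Hypothetical (price, market capitalization, liquidity) values in US dollars.
# These numbers are invented for illustration; they are not CoinGecko data.
coins = {
    "AAA": (6000.0, 1.0e11, 2.0e9),
    "BBB": (300.0, 3.0e10, 5.0e8),
    "CCC": (0.2, 8.0e9, 1.0e8),
    "DDD": (150.0, 2.0e6, 4.0e4),
}

def log_features(values):
    """Log-transform each measurement (base 10)."""
    return [math.log10(v) for v in values]

def manhattan(a, b):
    """Sum of absolute coordinate differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

logged = {name: log_features(vals) for name, vals in coins.items()}

# Pairwise distance matrix stored as a dict keyed by (coin, coin).
distances = {
    (a, b): manhattan(logged[a], logged[b])
    for a, b in combinations(sorted(logged), 2)
}

for (a, b), d in distances.items():
    print(f"{a} - {b}: {d:.3f}")
```

A matrix like this is the kind of input that a NeighborNet analysis takes; the network construction itself is not shown here. Note that a zero value (such as a coin with no Market Capitalization) has no logarithm, so it would need separate handling, and how that was done for the figure is not stated.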

There are basically four neighborhoods in the graph, representing four different types of coins. Those coins at the top-right of the network all have a high Price, Capitalization and Liquidity. These are the coins that currently dominate the market. Moving leftwards from there in the graph, the Price, Capitalization and Liquidity all decrease, so that the coins in the middle of the network have low values of all three criteria. The coins at the top-left of the network have a relatively high Price but still have a low Capitalization and Liquidity. Those coins isolated at the bottom of the network currently have no Market Capitalization at all, even though they are available for trading and thus have a Price (this includes the SegWit2x Futures).


So, should you invest your hard-earned savings in cryptocurrencies? Plenty of people are doing so. For example, Coinbase, the largest cryptocurrency exchange in the USA, reportedly now has 12 million customers.

The general consensus seems to be "yes" to investment only if you like a bit of a gamble, because you may win big, but otherwise the answer is currently "no". The attributes that currently make cryptocurrencies such a speculative investment, such as their big price swings, their volatility and unpredictability, and their potentially lucrative payoffs, actually make them pretty useless as currencies. If you are looking for a long-term investment, then you probably need to find an altcoin that is either useful as a transaction medium, or provides an innovative application of the blockchain technology.

Tuesday, November 7, 2017

PhyloNetworks: a package for phylogenetic networks

Recently, another computer package was released that is of relevance to this blog. This is described in a forthcoming paper:
Claudia Solís-Lemus, Paul Bastide, Cécile Ané (2017) PhyloNetworks: a package for phylogenetic networks. Molecular Biology and Evolution (in press).
The authors describe the package this way:
PhyloNetworks is a Julia package for the inference, manipulation, visualization and use of phylogenetic networks in an interactive environment. Inference of phylogenetic networks is done with maximum pseudolikelihood from gene trees or multi-locus sequences (SNaQ), with possible bootstrap analysis. PhyloNetworks is the first software providing tools to summarize a set of networks (from a bootstrap or posterior sample) with measures of tree edge support, hybrid edge support, and hybrid node support. Networks can be used for phylogenetic comparative analysis of continuous traits, to estimate ancestral states or do a phylogenetic regression.

The SNaQ analysis is described in a previous paper:
Solís-Lemus C, Ané C (2016) Inferring phylogenetic networks with maximum pseudolikelihood under incomplete lineage sorting. PLOS Genetics 12: e1005896.
The phylogenetic model used incorporates: mutations (as usual), incomplete lineage sorting of alleles in ancestral populations (using the coalescent), and horizontal inheritance of genes (ie. reticulations in the network). The likelihood is decomposed into quartets, which makes the likelihood calculations relatively fast, and also allows the analyses to be scaled up to many species and many genes.
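To give a feel for what a decomposition into quartets involves, here is a small Python sketch that enumerates the four-taxon subsets of a taxon list, along with the three unrooted resolutions that each subset can take. It illustrates the combinatorics only. It is not the SNaQ pseudolikelihood, and the taxon labels and the function `quartet_splits` are made up for the example.

```python
from itertools import combinations
from math import comb

# Hypothetical taxon labels, used only for illustration.
taxa = ["A", "B", "C", "D", "E", "F"]

# Every unordered set of four taxa is one quartet.
quartets = list(combinations(taxa, 4))

def quartet_splits(quartet):
    """Return the three possible unrooted resolutions of a four-taxon set."""
    a, b, c, d = quartet
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]

print(len(quartets), "quartets for", len(taxa), "taxa")
print(quartet_splits(quartets[0]))
```

The number of quartets is n-choose-4 (here `comb(6, 4)`), which is a polynomial in the number of taxa.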

The PhyloNetworks software is open source, and is available with documentation at:
Have fun learning to use the Julia system, which I had never even heard of before investigating this new package!

Note: In spite of the similarity in name, this new package has nothing to do with Luay Nakhleh's PhyloNet package, nor with the Phylogenetic Networks blog.

Tuesday, October 31, 2017

"Man gave names to all those animals": cats and dogs

This is a joint post by Guido Grimm and Johann-Mattis List.

As specialists, we rarely dare to dive into cross-disciplinary research. However, in a small series of posts, we will now try to open a door between linguistics, phylogenetics, biogeography, and molecular genetics (with its various subdisciplines), using the curious cases of domestic animals, such as cat, dog, goat, and sheep, and what these are called in various Eurasian languages, with a special focus on Indo-European languages.

Today's post will introduce the little dataset that we have created, and discuss the findings for the names of cats and dogs. A follow-up post will be devoted to goats and sheep.

Domesticated animals and their names

Various types of archaeological and biological research revolve around the domestication of animals — GoogleScholar gives tens of thousands of hits for search items such as "cat domestication"; and we have several blog posts about the need for networks to illustrate the genealogy of domestication. However, linguistic literature on these topics is rather sparse, often related to specific language families, such as domesticated animals in the Indo-European proto-society (Anthony and Ringe 2015).

Nevertheless, many studies mention the potential value of linguistic evidence as some specific kind of indirect evidence, which should be considered when carrying out research on domestication (see, for example, Kraft et al. 2014). Furthermore, the public interest in domestic animals such as cat, dog, goat and sheep, is reflected by the number of languages in which Wikipedia articles are available: the domestic dog (219 entries), our most trusted companion animal, narrowly beats the cat (211 entries), our least-productive domestic animal but, according to cliché, an obligatory accessory for e.g. literates, thinkers, and little old ladies (entry counts include extinct ones like Gothic). Sheep are available for 166 languages, and goats for 142.

One doesn't have to travel far to recognize substantial differences between the four animal names. For example, when Guido moved to Sweden, the most confusing thing was "Fåret Shaun", which he knew as "Shaun, das Schaf" in German, or "Shaun, the sheep" in English. [As an aside, Shaun's name is a pun in English, but not in German or Swedish.] While Swedish and German / English differ greatly in the pronunciation of the words they use to denote "sheep", the Swedish and German words for "cat" (Swedish katt, German Katze), "dog" (hund vs. Hund), and "goat" (get vs. Geiß) are essentially the same (using Guido's dialect of German). They also are basically the same for many other essential items, such as "house" (hus vs. Haus), and "hand" (hand vs. Hand).

Since Guido moved to France, he has been watching "Shaun le mouton"; and Hund ("dog") has become chien. He now needs to look for chèvre ("goat") when choosing his cheeses; but his cats are called chats, which is similar in writing (and linguistic history) but phonetically rather different, as the word is pronounced as [ʃa] (sha).

When Mattis visited China, he had few problems memorizing the word for "cat", as the Chinese word māo is quite similar to the sound which cats are alleged to make in many languages (see the list on Wikipedia for cross-linguistic similarities of onomatopoeia). The words for "sheep" and "goat", on the other hand, were surprisingly the same, the former being called miányáng, which roughly translates as "soft sheep/goat", while the latter is called shānyáng which translates to "mountain sheep/goat".

Differences in animal naming

We were intrigued by these differences and similarities of animal names across different languages. So, we decided to investigate this further, by comparing pronunciation differences for "dog", "cat", "goat", and "sheep" across a larger sample of languages. For this purpose, we selected 28 different languages, and searched for the translations as they are given in the different Wikipedia articles. We then manually added the pronunciations, based on different sources, such as Wiktionary, our own knowledge of some of the languages, or specialized sources listing translations and transcriptions (Key and Comrie 2016; Huang et al. 1992).

We then used the overall pronunciation distances for all languages as proposed by Jäger (2015), who applied sophisticated alignment algorithms to a sample of 40 historically stable words per language for a large sample of North Eurasian languages (taken from the ASJP database). Since our sample contains languages which have never been shown to be historically related, the networks which we inferred from these distances should not be interpreted as true phylogenies, but rather as an aid for visualizing overall similarities among them.

To compare the pronunciation differences of our small datasets of animal names, we used the LingPy software (List and Forkel 2016, http://lingpy.org) to cluster the data into preliminary sets of phonetically similar words. As we lack the data to carry out deep inference of truly historical similarities, for this purpose we used the Sound-Class-Based Phonetic Alignment Algorithm (for details, see List et al. 2017). This algorithm compares words for shallow phonetic similarity with some degree of historical information. As a result, the inferred clusters do not (as we will see below) reflect true instances of cognacy (homology), but rather serve as a proxy for similarity of pronunciation.
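To illustrate the general idea of comparing words through sound classes, here is a toy Python sketch. It is emphatically not LingPy's Sound-Class-Based Phonetic Alignment: the class mapping is a made-up, very coarse one, the comparison is a plain edit distance, and the word forms are rough ASCII stand-ins rather than real transcriptions. All names in the block are hypothetical.

```python
# Toy sound classes: a made-up, very coarse mapping for illustration only.
SOUND_CLASSES = {
    "a": "V", "e": "V", "i": "V", "o": "V", "u": "V",
    "k": "K", "g": "K",
    "t": "T", "d": "T",
    "s": "S", "z": "S",
    "n": "N", "m": "M", "h": "H",
}

def to_classes(word):
    """Replace each symbol by its sound class ('?' if unknown)."""
    return [SOUND_CLASSES.get(ch, "?") for ch in word.lower()]

def edit_distance(a, b):
    """Classic Levenshtein distance between two sequences."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def class_distance(w1, w2):
    """Edit distance on sound classes, scaled to the range 0..1."""
    c1, c2 = to_classes(w1), to_classes(w2)
    return edit_distance(c1, c2) / max(len(c1), len(c2))

# Rough ASCII stand-ins for pronunciations, not real transcriptions.
for pair in [("kat", "gato"), ("kat", "katse"), ("hund", "kat")]:
    print(pair, round(class_distance(*pair), 2))
```

Because "k" and "g" fall into the same class here, "kat" and "gato" differ only by the final vowel. Words could then be grouped into clusters by putting a threshold on such a distance, which gives a proxy for similarity of pronunciation rather than evidence of cognacy.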

Cats and Dogs

It is commonly assumed that the dog (Canis lupus familiaris, literally the 'domestic wolf-dog') was the first animal domesticated by humans, although it has not yet been settled exactly when and where. Multiple domestication events are quite likely, with respect to the (grey) wolves' (Canis lupus) natural behaviour (i.e. living in small family groups with complex social structure) and being originally distributed across Eurasia, although genetic studies have led to inconclusive results (compare the contradicting results in Frantz et al. 2016 and Botigue et al. 2017). Its trainability and pack-loyalty make the wolf an excellent hunting companion, and wolf packs migrate naturally over long distances, which perfectly fits early (pre-cultivation) human societies of hunters and gatherers. Accordingly, ages of up to 30,000 BC have been proposed for the dog's domestication (Botigue et al. 2017).

In contrast, the cat, Felis silvestris (literally the 'forest cat'), is a solitary, very elusive animal. It was domesticated much later, and most likely in the Near East (Driscoll et al. 2009). In contrast to other domestic animals, it has no direct use (other than luxury), and rather trains its owners than being trained (e.g., there are no police cats, and very rarely circus cats). But the cat decimates rodents and other small mammals, as well as birds. Thus, the domestication of cats likely followed the cultivation of wheat, and is possibly instrumental for building up fixed settlements and agricultural societies (Driscoll et al. 2009). Thus, George R.R. Martin's fictional character Haviland Tuf may be right when judging all human societies throughout the universe by how they treat cats: "civilized" people cherish them, "barbaric" societies don't!

Figure 1: Terms for cat in our sample

Thus, the hypothesis is that the dog was probably with us from the dawn of our civilization, while the cat opportunistically followed human settlements because these provide a surplus of food (and ultimately shelter). This idea is well reflected by the literal and phonetic variation of the words for "cat" (Figure 1) and "dog" (Figure 2). Cats are called by essentially the same names in all western Eurasian languages (be they Indo-European or not), but the word for dog can be phonetically very different in even closely related languages.

As you can see in the plot, the name for "cat" (English) is effectively similar across all of the Indo-European languages of western Eurasia in our sample, while the name for "dog" sounds quite different. Given that similar names for "cat" can be found in languages of northern Africa (Pfeifer 1993: s. v. "Katze"), this provides additional evidence for the Near-East domestication of the cat; and we can assume that the word traveled to Europe along with its carriers. On the other hand, the differences in the names for "dog" across all Indo-European languages in our sample reflect language change, rather than different naming practices. With the exception of Indic, Greek, and the Slavic languages, which coined new terms (cf. Derksen 2008: 431, and the cognate sets in IELex), the dog terms in Romance (with exception of Spanish), Germanic (with exception of English), Baltic and Armenian all evolved from the same root.

Figure 2: Terms for dog in our sample

With respect to the genetics of the dog (origin unclear) and the cat (origin in the Near East), plus the migration history of European people, the most likely hypothesis, which is also supported by Indo-European linguists, assumes that the dog was already with the humans before the Indo-European languages formed, following their migrations. Given the importance of the term, people may have avoided replacing it with a new term. This is also reflected in the cross-linguistic stability of the concept "dog", usually listed as one of the most stable concepts which are rarely replaced by neologisms ("dog" ranks at place 16 of Starostin's 2007 stability scale; "cat" is not even included).

With linguistic methods for language comparison, we can show that these words share a common origin, but stability does not imply that the pronunciation of the words is not affected. It is difficult to say how fast pronunciation evolves in general, but assuming that greater phonetic differences indicate a greater amount of elapsed time is a useful proxy. Since many Indo-European languages arrived in Europe by migration waves from the steppes of Central Asia, it is little surprise that each of these waves brought its modified variant of the original term for "dog" in Proto-Indo-European to Europe. Given the importance of the term for the daily lives of the people, speakers of one language variety would also not necessarily feel obliged to borrow the terms from neighboring language communities.

In Hebrew (not included in Figure 1), the word for cat is חתול khatúl. The Celtic Irish term is cat, and even the Basques, with their entirely unrelated language, have the word katu, probably a borrowing from the surrounding Romance languages (cf. Spanish gato). When the Germanic tribes (BC) and Slavs (AD) arrived on horseback, accompanied by their *hunda- (Kroonen 2013: 256), or their *pesə (Derksen 2008: 431), they settled down, started farming, and then took up the *kattōn- and the *kotə from the locals. This is interesting, because we have to assume (based on genetics and modern distribution of the wild subspecies of Felis silvestris) that there were always wild cats in the European woods. Either the word for them was lost in surviving languages, or the hunters and gatherers living in Europe never bothered to name a small furry animal that – at best – could be just glimpsed.

Notably, the South Asian Indo-European languages and the East Asian Sino-Tibetan languages have their own terms for cats (Figure 1), but the word is globally quite invariable in stark contrast to the terms for "dog".

Where does this lead?

Our graphs at this point indicate many curiosities. Nevertheless, by mapping words associated with animals (or plants), crucial for the history of human civilisation, we may tap into a completely new data set to discuss different scenarios erected by archaeologists and historians regarding domestication and beyond. While linguists, archaeologists, and geneticists have been working a lot on these questions on their own, examples for a rigorous collaboration, involving larger datasets and common research questions, are – to our current knowledge from sifting the literature – still rather rare. Furthermore, most linguistic accounts are anecdotal. They provide valuable insights, but these insights are not amenable to empirical investigations, as they are only reflected in prose. As a result, recent articles concentrating on archaeogenetic studies often ignore linguistic evidence completely. Given the uncertainty about the origin of domesticated animals and plants, despite advanced methods and techniques in archaeology and genetics, it seems that this strategy of simply putting linguistic evidence to one side deserves some re-evaluation.

It seems to be about time to pursue these questions in data-driven frameworks. When doing so, however, we need to be careful in the way we treat linguistic data as evidence. What we need is a thorough understanding of the processes underlying "naming" in language evolution. We constantly modify our lexicon, be it (i) by no longer using certain words, (ii) by using certain previously unfashionable words more frequently, (iii) by coining new words, or (iv) by borrowing words from our linguistic neighbors. So far, we still barely understand under which conditions societies will tend to keep a certain word against pressure from linguistic neighbors who use a different term, or when they will prefer to coin their own new words for newly introduced techniques, animals, or plants, instead of taking the words along with the technology.

Linguists can say a few things about this; and etymological dictionaries, some of which we also consulted for this study, offer a wealth of information for some terms. However, without formalizing our linguistic knowledge, providing standardization efforts (compare the Tsammalex or the Concepticon projects) and improvement of algorithms for automatic sequence comparison, linguists will have a hard time keeping pace with quickly evolving disciplines like archaeogenetics and archaeology.

  • Anthony, D. and D. Ringe (2015) The Indo-European homeland from linguistic and archaeological perspectives. Annual Review of Linguistics 1: 199-219.
  • Botigue, L., S. Song, A. Scheu, S. Gopalan, A. Pendleton, M. Oetjens, A. Taravella, T. Seregely, A. Zeeb-Lanz, R. Arbogast, D. Bobo, K. Daly, M. Unterlander, J. Burger, J. Kidd, and K. Veeramah (2017) Ancient European dog genomes reveal continuity since the Early Neolithic. Nature Communications 8: 16082.
  • Derksen, R. (2008) Etymological dictionary of the Slavic inherited lexicon. Brill: Leiden and Boston.
  • Driscoll, C., D. Macdonald, and S. O’Brien (2009) From wild animals to domestic pets, an evolutionary view of domestication. Proceedings of the National Academy of Sciences 106 Suppl 1: 9971-9978.
  • Frantz, L.A., V.E. Mullin, M. Pionnier-Capitan, O. Lebrasseur, M. Ollivier, A. Perri, A. Linderholm, V. Mattiangeli, M.D. Teasdale, E.A. Dimopoulos, A. Tresset, M. Duffraisse, F. McCormick, L. Bartosiewicz, E. Gal, É.A. Nyerges, M.V. Sablin, S. Bréhard, M. Mashkour, A. Bălăşescu, B. Gillet, S. Hughes, O. Chassaing, C. Hitte, J.-D. Vigne, K. Dobney, C. Hänni, D.G. Bradley, G. Larson (2016) Genomic and archaeological evidence suggest a dual origin of domestic dogs. Science 352: 1228-1231.
  • Huáng Bùfán 黃布凡 (1992) Zàngmiǎn yǔzú yǔyán cíhuì [A Tibeto-Burman lexicon]. Zhōngyāng Mínzú Dàxué 中央民族大学 [Central Institute of Minorities]: Běijīng 北京.
  • Jäger, G. (2015) Support for linguistic macrofamilies from weighted alignment. Proceedings of the National Academy of Sciences 112: 12752-12757.
  • Key, M. and B. Comrie (2016) The intercontinental dictionary series. Max Planck Institute for Evolutionary Anthropology: Leipzig.
  • Kraft, K., C. Brown, G. Nabhan, E. Luedeling, J. Luna Ruiz, G. Coppens d’Eeckenbrugge, R. Hijmans, and P. Gepts (2014) Multiple lines of evidence for the origin of domesticated chili pepper, Capsicum annuum, in Mexico. Proceedings of the National Academy of Sciences of the United States of America 111: 6165-6170.
  • Kroonen, G. (2013) Etymological dictionary of Proto-Germanic. Brill: Leiden and Boston.
  • List, J.-M. and R. Forkel (2016) LingPy. A Python library for historical linguistics. Max Planck Institute for the Science of Human History: Jena.
  • List, J.-M., S. Greenhill, and R. Gray (2017) The potential of automatic word comparison for historical linguistics. PLOS ONE 12: 1-18.
  • Pfeifer, W. (1993) Etymologisches Wörterbuch des Deutschen. Akademie: Berlin.
  • Starostin, S. (2007) Opredelenije ustojčivosti bazisnoj leksiki [Determining the stability of basic words]. In: S. A. Starostin: Trudy po jazykoznaniju [S. A. Starostin: Works on linguistics]. Languages of Slavic Cultures: Moscow. 580-590.

Final Remark

Given that we had little time to review all of the literature on domestication in these disciplines, we may well have missed important aspects, and we may well have even failed to be original in our claims. We would like to encourage potential readers of this blog to provide us with additional hints and productive criticism. In case you know more about these topics than we have reported here, please get in touch with us — we will be glad to learn more.

Tuesday, October 24, 2017

Let's distinguish between Hennig and Cladistics

There are theoretically an infinite number of ways to mathematically analyze any set of data, and yet it is unlikely that all (or even most) of these will have any relevance to a study of biology. In this sense, the philosophy of phylogenetic analysis needs to show that there is a strong basis for treating any particular mathematical analysis as having biological relevance. This is a point that I have discussed before: Is there a philosophy of phylogenetic networks?

Willi Hennig clearly has some role to play here. However, his ideas are often treated as being solely related to one particular form of phylogenetic analysis — cladistics. In this post I will point out that his work has a much greater relevance than that — he provides a crucial logical step that applies to all phylogenetic inference.

The steps of phylogenetic inference are shown in the first figure, which is taken from my earlier post. The first step is a mathematical inference from character data to tree/network; the second step is a logical inference that the mathematical summary resulting from the first step has some biological relevance; and the third step is a practical inference that the biological summary applies to whole organisms as well as to their characters.

The logic of phylogeny reconstruction


Hennig's concept of "shared innovations" (which he called synapomorphies) is the only thing that allows us to use mathematical phylogenetics in the pursuit of genealogical history. Without this concept, the mathematics could just produce something like the arithmetic mean, a mathematical concept with no connection to real objects (unlike the median or mode, which will always be real). The idea of shared innovations is what leads us to believe that the mathematical summary (whether tree or network) might actually also be a close approximation to the real thing. This is a separate concept from cladistics, which is simply a mathematical algorithm based on a particular optimality criterion (parsimony), just like maximum likelihood or Bayesian approaches. So, shared innovations underlie the use of parsimony, likelihood and distance methods alike: Willi Hennig (and, before him, Karl Brugmann in linguistics) is relevant no matter what algorithm we use.

Mathematical analyses

If they are to represent genealogical history, then all trees and networks in phylogenetics will be directed acyclic graphs (DAGs), mathematically. There are many ways to produce a DAG, some of which have had varying degrees of popularity in phylogenetics, and some of which have not been used at all.

To produce an acyclic line graph (in which nodes are connected by edges), we can start with character data or distance data. We can then use various optimality criteria to choose among the many graphs that could apply to the data, such as parsimony (usually associated with cladistics) and likelihood (either as maximum likelihood or integrated likelihood). We can also ensure that the graph is directed (ie. the edges have arrows), by choosing a root location, either directly as part of the analysis or a posteriori by specifying an outgroup.
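What "directed and acyclic" means in practice can be checked mechanically. The Python sketch below tests whether a list of parent-to-child edges forms an acyclic graph with a single root, using Kahn's algorithm (a standard topological-sorting method, chosen here for illustration). The small network and the function name `is_rooted_dag` are hypothetical.

```python
from collections import deque

# A small hypothetical network as directed edges (parent -> child).
# Node "H" has two parents, i.e. it is a reticulation.
edges = [
    ("root", "X"), ("root", "Y"),
    ("X", "A"), ("X", "H"),
    ("Y", "H"), ("Y", "C"),
    ("H", "B"),
]

def is_rooted_dag(edges):
    """Return True if the edges form an acyclic graph with exactly one root.

    Uses Kahn's algorithm: repeatedly remove nodes with no incoming edges;
    if every node can be removed, there is no directed cycle.
    """
    children = {}
    indegree = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        children.setdefault(child, [])
        indegree[child] = indegree.get(child, 0) + 1
        indegree.setdefault(parent, 0)
    roots = [node for node, deg in indegree.items() if deg == 0]
    if len(roots) != 1:
        return False
    queue = deque(roots)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return removed == len(indegree)

print(is_rooted_dag(edges))
print(is_rooted_dag(edges + [("B", "X")]))
```

The second call adds an edge from a descendant back to one of its ancestors, which creates a directed cycle. Such a graph could not represent a genealogy, since no organism can be its own ancestor.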

All of these approaches are mathematically valid, as are a number of others. They all provide a mathematical summary of the data. This is step one of the phylogenetic inference, as illustrated above.

But what of step two? Biologists need a summary of the data that has biological relevance, as well, not just mathematical relevance. This has long been a thorn in the side of biologists — just because they can perform a particular mathematical calculation does not automatically mean that the calculation is relevant to their biological goal.

Consider the simplest mathematics of all — calculating the central location of a set of data. There are many ways to do this, mathematically — indeed, there are technically an infinite number of ways. These include the mode, the median, the arithmetic mean, the geometric mean, and the harmonic mean. All of these are mathematically valid, but do any of them produce a central location that describes biology?

The mode does, because it is the most common observation in the dataset. The median usually does, because it is the "middle" observation in the dataset. But what of the various means? There is no necessary reason for them to describe biology, although they are perfectly valid mathematics.

For instance, the modal number of children in modern families is 2, meaning that more families have this number than any other number of children. The median number is also 2, meaning that half of the families have 2 or fewer children and half of the families have 2 or more. So, these mathematical summaries are also descriptions of real families. But the means are not. For example, the arithmetic mean number of children is 2.2, which does not describe any real family. If you ever find a family with 2.2 children, then you should probably call the police, to investigate!
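
The point is easy to check numerically. The following is a small sketch using Python's standard statistics module; the ten family sizes are invented for illustration (chosen so that the mean comes out at 2.2), and are not real census data.

```python
from statistics import mean, median, mode

# Hypothetical numbers of children in ten families (illustrative data only).
children = [0, 1, 1, 2, 2, 2, 2, 3, 4, 5]

summaries = {
    "mode": mode(children),
    "median": median(children),
    "arithmetic mean": mean(children),
}

for name, value in summaries.items():
    # A summary "describes a real family" only if some family has that value.
    observed = value in children
    print(f"{name}: {value} (matches an observed family: {observed})")
```

With these numbers the mode and the median both coincide with an observed family size, while the mean does not.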

Mathematically valid data summaries have a lot of relevance, but they do not necessarily describe biological concepts. I can use the mean number of children per local family to estimate the number of schools that I might need in that area, but I cannot use it to describe the families themselves. This is a classic case of "horses for courses".


So, in phylogenetics we need some piece of logic that says that we can expect our DAG (a mathematical concept) to be a representation of a genealogy (a biological concept). Our genealogical estimate may still be wrong (and indeed it probably will be, in some way!), but that is a separate issue. The DAG needs to be a reasonable representation, not a correct one. Correctness needs to be a result of our data, not our mathematics.

This is where Willi Hennig comes in. Hennig's ideas, and the ideas derived from them, are illustrated in the second figure.

Hennig explicitly noted that characters have a genealogical polarity, with ancestral states being modified into derived states through evolutionary time. Furthermore, he noted that it is only the derived states that are of relevance to studying evolutionary history: the sharing of derived character states reveals evolutionary history, but shared ancestral states tell us nothing.

We have done two things with these Hennigian ideas. Some people have been interested in classification, for which the concept of monophyly is relevant, and others have been interested in reconstructing the genealogies, rather than simply interpreting them.


Reconstructing a tree-like phylogenetic history is conceptually straightforward, although it took a long time for someone (Hennig 1966) to explain the most appropriate approach. Interestingly, the study of historical linguistics has developed the same methodology (Platnick and Cameron 1977; Atkinson and Gray 2005), thus independently arriving at exactly the same solution to what is, in effect, exactly the same problem. From this point of view, the logical inference itself is uncontroversial; and its generic nature means that it can be used for any objects with characteristics that can be identified and measured, and that follow a history of descent with modification. I will, however, discuss this in terms of biology — you can make the leap to other objects yourself.

The objective is to infer the ancestors of the contemporary organisms, and the ancestors of those ancestors, etc., all the way back to the most recent common ancestor of the group of organisms being studied. Ancestors can be inferred because the organisms share unique characteristics (shared innovations, or shared derived character states). That is, they have features that they hold in common and that are not possessed by any other organisms. The simplest explanation for this observation is that the features are shared because they were inherited from an ancestor. The ancestor acquired a set of heritable (i.e. genetically controlled) characteristics, and passed those characteristics on to its offspring. We observe the offspring, note their shared characteristics, and thus infer the existence of the unobserved ancestor(s). If we collect a number of such observations, what we often find is that they form a set of nested groupings of the organisms.
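
The idea of nested groupings can be made concrete with a short Python sketch. It assumes a hypothetical binary character matrix in which 1 marks the derived state and 0 the ancestral state (so polarity is taken as already known); the taxon names and helper functions are invented for illustration. Each character defines the group of taxa sharing its derived state, and the groups are tree-like if every pair is either nested or disjoint.

```python
from itertools import combinations

# Hypothetical matrix: 1 = derived state, 0 = ancestral state.
matrix = {
    "A": [1, 1, 0, 0],
    "B": [1, 1, 0, 0],
    "C": [1, 0, 1, 0],
    "D": [1, 0, 1, 0],
    "E": [0, 0, 0, 0],
}


def innovation_groups(matrix):
    """Return, for each character, the taxa sharing the derived state (groups of 2+)."""
    n_chars = len(next(iter(matrix.values())))
    groups = []
    for i in range(n_chars):
        group = frozenset(t for t, states in matrix.items() if states[i] == 1)
        if len(group) >= 2:
            groups.append(group)
    return groups


def is_nested(groups):
    """True if every pair of groups is either nested or disjoint."""
    return all(g <= h or h <= g or not (g & h) for g, h in combinations(groups, 2))


groups = innovation_groups(matrix)
print(sorted(sorted(g) for g in groups))
print("nested:", is_nested(groups))

# A conflicting innovation shared by B and C breaks the nesting.
conflicting = groups + [frozenset({"B", "C"})]
print("nested with conflict:", is_nested(conflicting))
```

When the groups are nested, they can be drawn as a tree; when they are not, some of the shared states cannot all be single innovations, and a network may be the more honest summary.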

Hennig, in particular, was interested in the interpretation of phylogenetic trees, rather than their reconstruction. He did this interpretation in terms of monophyletic groups (also called clades), each of which consists of an ancestor and all of its descendants. These are natural groups in terms of their evolutionary history, whereas other types of groups (eg. paraphyletic, polyphyletic) are not. So, a phylogenetic tree consists of a set of nested clades, which are the groups that are represented and given names in formal taxonomic schemes.

For phylogenetic trees, there is thus a rationale for treating a tree diagram as a representation of evolutionary history. For example, in a study of a set of gene sequences, first we produce a mathematical summary of the data based on a quantitative model. We then infer that this summary represents the gene history, based on the Hennigian logic that the patterns are formed from a nested series of shared innovations (this is a logical inference about the biology being represented by the mathematical summary). We then infer that this gene history represents the organismal history, based on the practical observation that gene changes usually track changes in the organisms in which they occur (ie. a pragmatic inference).

Mis-interpretations of Hennig

What I have said above has led to various mis-interpretations of Hennig's role in phylogenetics.

First, he did not propose any specific method for producing a phylogenetic tree (or network). He was concerned about the logic of the diagram, not how to get it in the first place. He distinguished shared derived character states, or shared innovations (he called them synapomorphies), from shared ancestral states (symplesiomorphies), and noted that only the former are relevant for phylogenies. So, distance methods will also work in phylogenetics provided the distances are based on homologous apomorphic features. If they are not so based, then they are simply mathematical constructions, which may or may not represent anything to do with phylogeny. Distances estimated from plesiomorphic features can be used to construct a tree, obviously, but there is no reason to expect that tree to represent a phylogeny.

Second, parsimony analysis was developed independently of Hennig, by people such as Farris, Nelson and Platnick. This came to be called cladistics, intended by Ernst Mayr to be a derogatory term for the new form of analysis. The fact that the Willi Hennig Society is associated exclusively with cladistics has nothing to do with Hennig himself, or with the logic of his approach to phylogenetics. You need to clearly distinguish between Hennig and Cladistics!

Third, Hennig was more interested in classification than he was in phylogeny reconstruction. This seems to cause confusion for gene jockeys and linguists, in particular, who often associate phylogenetics solely with classification (see, for example, Felsenstein 2004, chapter 10). Sure, Hennig was primarily interested in the interpretation of phylogenies, rather than their construction. However, that was simply a personal point of view. The logic of his work transcends his own personal interests. Without him, no genealogical reconstruction makes logical sense, in genetics or linguistics. Mathematical methods for summarizing data were developed independently in genetics and linguistics, just as they were in other areas of biology and also in stemmatology. However, without the concept of shared innovations, these methods remain mathematical summaries, not estimates of genealogies.

Finally, Hennig's work was not original, being naturally a synthesis of much previous work. In biology, the work of Walter Zimmermann is frequently noted (eg. Donoghue & Kadereit 1992), and in linguistics the work of Karl Brugmann is obviously important (see Mattis' post Arguments from authority, and the Cladistic Ghost, in historical linguistics). Sometimes, wheels have to be re-invented many times before the general populace comes to realize just how important they are.


Atkinson QD, Gray RD (2005) Curious parallels and curious connections — phylogenetic thinking in biology and historical linguistics. Systematic Biology 54: 513-526.

Donoghue MJ, Kadereit W (1992) Walter Zimmermann and the growth of phylogenetic theory. Systematic Biology 41: 74-85.

Felsenstein J (2004) Inferring Phylogenies. Sinauer Associates, Sunderland MA.

Hennig W (1966) Phylogenetic Systematics. University of Illinois Press, Urbana IL. [Translated by DD Davis and R Zangerl from W. Hennig 1950. Grundzüge einer Theorie der Phylogenetischen Systematik. Deutscher Zentralverlag, Berlin.]

Platnick NI, Cameron HD (1977) Cladistic methods in textual, linguistic, and phylogenetic analysis. Systematic Zoology 26: 380-385.

Tuesday, October 17, 2017

Networks, not trees, identify "weak spots" in phylogenetic trees

A major application of networks in exploratory data analysis is to identify signal oddities and visualise ambiguity. Thus, they would be the natural choice when it comes to pinpointing weaknesses in phylogenetic trees. This is particularly so when the aim is to propose a relatively stable (and intuitive) ‘phylogenetic’ (identifying likely monophyla sensu Hennig) or ‘cladistic’ (clade-based) systematic framework for a group of organisms. In other words, whenever we try to translate branching patterns into monophyletic groups.

‘Weak spots’ in phylogenetic trees are relationships with either little or ambiguous support, or branching patterns strongly affected by sampling (taxa and characters). These are topological phenomena that are rather the rule than the exception when studying extinct groups of organisms (e.g. spermatophytes or ‘long-necks’).

One example involves what was probably one of the fiercest groups of marine predators: the mosasaurs (mosasauroid squamates; Madzia & Cau 2017). I will discuss this example in this post.

Fig. 1. The tree-based systematic groups of mosasaurs (Mosasauroidae plus ancient relatives) when applying Madzia & Cau's nomenclature to their Bayesian-inferred majority-rule consensus tree. Most higher taxa (above genus) are "branch-based", except for the "node-based" Mosasauridae, Russellosaurina (wrong suffix, kept as rank-less taxon by the authors), Tethysaurinae, and Yaguarasaurinae. Genera represented by a single OTU in blue, 'non-monophyletic' genera in red. Thick branches received near unambiguous support (PP ≥ 0.95)

Madzia & Cau “re-examined a data set that results from modifications assembled in the course of the last 20 years and performed multiple parsimony analyses and Bayesian tip-dating analysis” in order to identify the ‘weak spots’ and take them into account when providing a revised cladistic nomenclature of “the ‘traditionally’ recognized mosasauroid clades” (Fig. 1). They define possibly monophyletic groups via recurring branching patterns in their various trees, along with the position of key taxa in those trees (see their chapter Phylogenetic [in fact: cladistic] nomenclature). This allows the groups to “self-destruct” when not forming a clade, and to be replaced.

Although the combination of unweighted and differentially weighted parsimony and Bayesian tip-dating analyses could be methodologically interesting (when examined in detail), it is hardly necessary in order to identify weaknesses and strengths of the data matrix used – going back to Bell 1997, and being emended since (see Introduction of Madzia & Cau) – to define possible monophyletic (or other) groups. A quick and simple neighbour-net splits graph would have done the trick, too.

The situation regarding tree inference, e.g. parsimony

The mosasaurid data matrix suffers from the typical problems: ambiguous, highly homoplasious signals, paired with a few missing data issues (typically lack of data overlap). Adding to this is the miscellaneous signal from taxa regarded as outgroups (here: ancient potential members of the mosasaurs): Adriosaurus suessi (which the authors used to root their trees), Dolichosaurus longicollis, and Pontosaurus kornhuberi. Accordingly, standard parsimony analysis fails to provide a useful result for about half of the taxa, when documented in the traditional fashion (see my last post); a strict consensus cladogram of all most parsimonious trees (MPTs) is shown in Fig. 2A.

Fig. 2 Strict consensus graphs based on 152 equally (most) parsimonious trees inferred from the matrix (all characters treated as unweighted and unordered) using PAUP*. Green, unambiguous placement/grouping; turquoise, weakly 'rogue-ish'; red, rogue taxa

But even the Adams consensus tree (Fig. 2B) is more informative, and the (near) strict consensus network (only showing splits that occur in more than a single MPT) highlights where the equally parsimonious solutions agree and disagree, and which taxa act more ‘rogueish’ than others (Fig. 2C). Weighting and Bayesian inference naturally produce more resolved trees; but the question remains whether the overall higher to unambiguous branch support sufficiently reflects the signal in the character matrix.
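
The difference between a strict consensus and a (near) strict consensus network comes down to a frequency threshold on groupings. Here is a minimal Python sketch, assuming each tree is simply represented as the set of groupings (clusters of taxa) it contains; the three toy trees and the function name are hypothetical, and real analyses would use splits exported from a program such as PAUP* or SplitsTree.

```python
from collections import Counter


def grouping_counts(trees):
    """Count in how many trees each grouping (cluster of taxa) occurs."""
    counts = Counter()
    for clusters in trees:
        counts.update(set(clusters))
    return counts


# Three hypothetical equally parsimonious trees for taxa A-E.
trees = [
    {frozenset("AB"), frozenset("ABC"), frozenset("DE")},
    {frozenset("AB"), frozenset("ABC"), frozenset("DE")},
    {frozenset("AB"), frozenset("CD"), frozenset("CDE")},
]

counts = grouping_counts(trees)

# Strict consensus: only groupings found in every tree survive.
strict = {g for g, n in counts.items() if n == len(trees)}

# Near-strict network: keep every grouping found in more than one tree.
recurring = {g for g, n in counts.items() if n > 1}

print("strict:", sorted("".join(sorted(g)) for g in strict))
print("recurring:", sorted("".join(sorted(g)) for g in recurring))
```

In this toy case the strict consensus keeps a single grouping, while the lower threshold also retains the alternatives that two of the three trees agree on, which is the extra information that the network displays.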

Data sets of extinct organisms need neighbour-nets, to start with

The consensus network of the most (equally) parsimonious trees (MPT; Fig. 2C) informs us about equally valid topological alternatives and ‘rogueness’. Using the branch-length averaging option, we can visualize character support to some degree for the alternatives. But there is a quicker and more comprehensive alternative, when it comes to (tree-)incompatible signal.

The neighbour-net (Fig. 3) directly identifies potentially strong signals and ‘weak spots’. First, we can see that the outgroup taxa are not clustered, which is never good. Obviously, they are not too useful to infer an ingroup root (Madzia & Cau discuss the outgroup sampling bias). Only one of the outgroups, Pontosaurus, is placed close to the Aigialosauridae, which collects the earliest diverging Mosasauroidea lineage (see Fig. 1). Their signals are likely to mess up any tree inference (Fig. 2).

Fig. 3 The neighbour-net based on simple (Hamming) mean distances inferred from Madzia & Cau's matrix. Colouring as in Fig. 1

Trivial (data-wise) lineages are e.g. the Tylosaurinae, supported by a very long narrow branch; this lineage is characterised by high group coherence and distinctness to any other taxon/taxon group and will inevitably have high support and be placed close (phylogenetically and absolute) to the Plioplatecarpinae (Figs 2, 3). The Mosasaurinae are equally well circumscribed, with only one putative member, Dallasaurus, being substantially apart from the rest, and bridging Mosasaurinae and Halisaurinae, their putative sisters. Hence, trees will favour splits rejecting the "Natantia" group unless Dallasaurus is excluded from the inference.

Species of the same genera are conspicuously grouped; this differs from Madzia & Cau’s trees, where Mosasaurus or Prognathodon species are collected in the same subtrees, but are “non-monophyletic”, i.e. do not form an exclusive clade. Based on the neighbour-net, the main reason may be terminal noise and resulting flat likelihood surfaces (hence, low posterior probabilities). The placement of the older members of the mosasaurs (classified as Tethysaurinae and Yaguarasaurinae) to each other, and the slightly older outgroup taxa, is clearly difficult with this matrix, even though there is no ambiguity, e.g. in the MPT sample (Fig. 2). Hence, the branch-lengths do not reflect synapomorphies or rarely shared apomorphies in this subtree, but instead shared convergences — a perfect phylogeny always generates a perfectly tree-like distance matrix.
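
Whether a distance matrix is "perfectly tree-like" can be checked with the four-point condition: for every set of four taxa, the two largest of the three possible pairwise sums of distances must be equal. The following Python sketch applies this to a hypothetical four-taxon example; the distances and the function name are invented for illustration.

```python
from itertools import combinations


def four_point_ok(dist, taxa, tol=1e-9):
    """Check the four-point condition for every quartet of taxa."""
    def d(x, y):
        return dist[frozenset((x, y))]

    for a, b, c, e in combinations(taxa, 4):
        sums = sorted([d(a, b) + d(c, e), d(a, c) + d(b, e), d(a, e) + d(b, c)])
        # The two largest sums must coincide for a tree-like (additive) matrix.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


taxa = ["A", "B", "C", "D"]

# Distances read off a hypothetical tree ((A,B),(C,D)) with all edge lengths 1.
tree_like = {
    frozenset("AB"): 2, frozenset("CD"): 2,
    frozenset("AC"): 3, frozenset("AD"): 3,
    frozenset("BC"): 3, frozenset("BD"): 3,
}

# The same matrix after a convergence makes A and D look more similar.
with_convergence = dict(tree_like)
with_convergence[frozenset("AD")] = 2

print(four_point_ok(tree_like, taxa))
print(four_point_ok(with_convergence, taxa))
```

Violations of this condition are what appear as boxes in a neighbour-net.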

Oddly placed taxa in the neighbour-net? Probably unrepresentative distances; and the quick fix

In contrast to trees, the network in Fig. 3 fails to resolve a likely position for one Prognathodon species: P. currii, and the large associated box indicates a data issue. The pairwise distances of the oddly placed P. currii and the probably misplaced Dolichosaurus, are poorly defined: both have zero-distances to non-similar taxa, but also to each other. But whereas Dolichosaurus differs from other members of Prognathodon by mean morphological distances (MD) of 0.5–1.0 (1.0 means it differs in all defined characters!), P. currii is much more similar to its congeners (MD = 0.17–0.27 and 0.46). Their other affinities also lie with strongly different taxon sets.
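
How zero distances to dissimilar taxa can arise is easy to show with a sketch of a mean (Hamming) distance that skips missing data. The three character strings below are hypothetical, as is the function name; I am assuming that "?" codes a missing observation and that the distance is the proportion of differing characters among those scored in both taxa.

```python
def mean_distance(a, b, missing="?"):
    """Proportion of differing characters among those scored in both taxa.

    Returns (distance, number of characters compared); distance is None
    if the two taxa have no scored characters in common.
    """
    shared = [(x, y) for x, y in zip(a, b) if x != missing and y != missing]
    if not shared:
        return None, 0
    differences = sum(x != y for x, y in shared)
    return differences / len(shared), len(shared)


# Hypothetical taxa: one poorly preserved fossil and two well-scored taxa.
fragment = "01????????"
taxon_y = "0101100110"
taxon_z = "0110011001"

print(mean_distance(fragment, taxon_y))
print(mean_distance(fragment, taxon_z))
print(mean_distance(taxon_y, taxon_z))
```

Here the fragment has a distance of zero to two taxa that differ from each other in most of their characters, simply because only two characters can be compared. Reporting the number of compared characters next to each distance makes such unrepresentative values easy to spot.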

Their position in the neighbour-net is the result of a missing data artefact. Because the neighbour-net is just a 2-dimensional graph, it cannot resolve such severe signal ambiguity. Unrepresentative distances are the major (only) obstacle for neighbour-nets in the context of extinct groups. Trees are more decisive in such cases, when the few covered characters fit well the preferred tree's topology. By removing the outgroup taxa and P. currii, we can generate a neighbour-net (Fig. 4) in line with, and going beyond, the Bayesian-tree-based groups suggested by Madzia & Cau (Fig. 1).

Fig. 4 Same data and method as shown in Fig. 3; four OTUs were excluded, the non-Mosasauroidea (outgroup) and the misplaced Prognathodon currii

Using networks to define taxonomic groups

Just based on the neighbour-nets (Figs 3, 4), circumscription of genera and higher taxa can be discussed (assuming that morphology mirrors phylogeny). For instance, Mosasaurus can be kept as-is or can include Plotosaurus; whereas the Clidastes form a clearly distinct taxon (whether paraphyletic/ monophyletic or clade/grade may be impossible to decide, see Fig. 1). Including (all) Prognathodon in the Globidensini remains an option; Eremiasaurus may be included, too, or included in the likely sister clade, the Mosasaurini.  

Dallasaurus is not only the oldest possible but clearly the most unique (primitive?) member of the Mosasaurinae, and the Halisaurinae likely represent their early diverged sister lineage. Treating Tylosaurinae and Plioplatecarpinae as reciprocally monophyletic sister lineages makes sense with respect to the older taxa and the co-eval Mosasaurinae-Halisaurinae lineage. The ancient forms are generally more similar to Plioplatecarpinae (+ Tylosaurinae) than to the Mosasaurinae and Halisaurinae lineages; but whether they should be included in the same systematic group ("Russellosaurina") cannot be judged based on the data matrix or the inferred trees (see also Figs 1, 2). Their topological attraction may be due to more shared primitive features (Hennig's ‘symplesiomorphies’), and the "Russellosaurina" could be a paraphyletic clade.

An interesting pronounced central edge bundle in the network in Fig. 4, which agrees well with Madzia & Cau's Bayesian consensus tree (Fig. 1), is the one separating all oldest, potentially more primitive taxa/lineages (> 90 Ma) from the later more diversified lineages (Mosasaurinae, Halisaurinae, Plioplatecarpinae, and Tylosaurinae). Regarding primitiveness vs. derivedness, an option to map characters on networks and extract alternative trees directly from the network would be handy (see also David’s 500th post).

Fig. 5 Bootstrap (BS) support network based on 10,000 BS (pseudo)replicates optimised under parsimony. Only splits that occurred in at least 20% of the BS replicates are shown; trivial splits are collapsed. Some taxa have low, but unchallenged support, in other cases no preference at all is found (e.g. for the highest level bracketing taxa) or two alternatives compete with each other.

Also in the case of the mosasaurs: when we want to use phylogenetic trees as the sole (or main) basis for classification, rather than neighbour-nets (see my last post) and common sense backed up by EDA (e.g. Fig. 4; Bomfleur et al. 2017), the method of choice would be the support consensus networks based on parsimony (example provided in Fig. 5), least-squares, and/or likelihood bootstrapping pseudoreplicate samples, in addition to or instead of the Bayesian-inferred topologies sample. The posterior probabilities in Madzia & Cau’s tip-dated tree and Bayesian majority-rule consensus tree include values << 1.0, which already can be an indication of very strong signal conflict or just lack of discriminating signal (flat likelihood surfaces).

We should not be over-confident in PP, when the underlying data are not tree-like at all, as they too easily tilt towards one alternative (see also Zander 2004). The same holds for post-analysis character weighting, designed to eliminate (down-weigh) conflicting signals. While parsimony and distance methods are more easily affected by branching artefacts, probabilistic methods may struggle with flat likelihood surfaces. Thus, bootstrap support networks should be the first choice for ‘phylogenetic’ (by identifying Hennigian monophyla) or ‘cladistic’ (clade-based) classification as they show the robustness of the signal for the preferred and other topological alternatives, and can be generated under different optimality criteria. Having a certain support for a clade is nice, but one should always consider the support for alternatives, and consider how many characters support or oppose an alternative.

Morphological matrices need to be analysed using network approaches

Madzia & Cau’s study is methodologically interesting by providing a tip-dated Bayesian tree for an extinct group of organisms. A one-to-one comparison of their parsimony-BS support using different character and weighting schemes vs. Bayesian PP may be interesting, too — note the difference between the tip-dated tree and the majority rule consensus trees for several critical branches. However, following the current standard practice, no BS pseudoreplicate and Bayesian saved topologies samples were provided. Regarding the main objective, the identification of ‘weak spots’ to propose enhanced systematic groups, networks (Figs 2–5) would have been more informative and straightforward.

No matter what classification philosophy is applied, when we deal with morphological matrices of extinct groups of organisms, the first step should always be to explore the primary signal in the data before we infer trees using (highly) sophisticated methods, and interpret them — the latter may actually obscure ‘weak spots’ rather than identifying them. The quickest analyses are neighbour-nets, but watch out for odd pairwise distance patterns (easily visualised using heat maps)!

The second step is producing support consensus networks, for the fine-tuning and to decide on the most probable trees to explain the data. Regarding classification, we should ask ourselves whether we really want inevitably unstable clade-based classification systems (when dealing with extinct organisms), or robust ones that reflect the general data situation and include potentially or likely paraphyletic taxa (see e.g. Clidastes in Figs 2–5 and Madzia & Cau's trees, and their elaborate discussion of higher level taxa, which – to a good degree – could become superfluous when allowing paraphyletic taxa).


All graphics, and some primary data files, are publicly available from figshare. An archive including all re-analysis files can be downloaded at www.palaeogrimm.org.


Bell GL (1997) A phylogenetic revision of North American and Adriatic Mosasauroidea. In: Callaway JM, and Nicholls EL, eds. Ancient Marine Reptiles. San Diego: Academic Press, pp. 293–332 [cited from Madzia & Cau 2017]

Bomfleur B, Grimm GW, McLoughlin S. 2017. The fossil Osmundales (Royal Ferns)—a phylogenetic network analysis, revised taxonomy, and evolutionary classification of anatomically preserved trunks and rhizomes. PeerJ 5:e3433. https://peerj.com/articles/3433/.

Madzia D, Cau A (2017) Inferring 'weak spots' in phylogenetic trees: application to mosasauroid nomenclature. PeerJ 5: e3782. https://peerj.com/articles/3782/.

Zander RH (2004) Minimal values of reliability of Bootstrap and Jackknife proportions, Decay index, and Bayesian posterior probability. PhyloInformatics 2: 1–13.

Tuesday, October 10, 2017

Where to retire in the USA

Some weeks ago I published a post on recommended countries for Where to retire. Not everyone wants to leave their homeland, however, and so for many of our readers it may therefore be relevant to consider which states in the USA might be recommended as most desirable for retirees.

In this regard, the Bankrate web site has recently considered Where are the best and worst states to retire? They collated data (from various sources) for each of the 50 states for the following eight characteristics:
  • Cost of living
  • Healthcare quality
  • Crime rate
  • Cultural and social vitality
  • Weather
  • Taxes (income and sales taxes)
  • Senior citizens' overall well-being
  • The prevalence of other seniors
For 2017, the states were then ranked from 1–50 for each of these characteristics separately. These rankings were then weighted according to a survey of the reported relative importance of each of these characteristics — they are listed above in the order of decreasing importance. From the weighted data, Bankrate produced an overall ranking of the states for their desirability to retirees, which you can check out on their web site.
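
The arithmetic behind such an overall ranking can be sketched as a weighted sum of ranks. The Python below is only an illustration: the three states, their ranks and the weights are all hypothetical, since I am not reproducing Bankrate's actual weights or data here.

```python
# Hypothetical survey weights (larger = more important); not Bankrate's values.
weights = {"cost_of_living": 0.5, "healthcare": 0.3, "crime": 0.2}

# Hypothetical per-characteristic ranks (1 = best).
ranks = {
    "State X": {"cost_of_living": 1, "healthcare": 3, "crime": 2},
    "State Y": {"cost_of_living": 3, "healthcare": 1, "crime": 1},
    "State Z": {"cost_of_living": 2, "healthcare": 2, "crime": 3},
}


def weighted_score(state_ranks):
    """Weighted sum of ranks; a lower score means a better overall position."""
    return sum(weights[name] * state_ranks[name] for name in weights)


overall = sorted(ranks, key=lambda state: weighted_score(ranks[state]))

for position, state in enumerate(overall, start=1):
    print(position, state, round(weighted_score(ranks[state]), 2))
```

Note that State Y ends up behind State X despite being best on two of the three characteristics, which shows how much a single-number ranking depends on the chosen weights.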

However, this ranking is overly simplistic, because it suggests that there is only one main dimension to retirement desirability, from best to worst. Clearly, retirement is multi-dimensional — there is no reason to expect the eight characteristics to be highly correlated. Therefore a network analysis would be handy to explore which characteristics differ between the states.

As for my previous analysis, I have calculated the Manhattan distance pairwise between the states; and I am displaying this in the figure using a NeighborNet network. States that have similar retirement characteristics are near each other in the network; and the further apart they are in the network then the more different are their characteristics.
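
For readers unfamiliar with it, the Manhattan distance between two states is just the sum of the absolute differences in their ranks across the characteristics. A minimal Python sketch, using three hypothetical states with three ranks each rather than the real 50 × 8 data:

```python
from itertools import combinations

# Hypothetical ranks for three characteristics (not the Bankrate data).
ranks = {
    "State X": [1, 3, 2],
    "State Y": [3, 1, 1],
    "State Z": [2, 2, 3],
}


def manhattan(a, b):
    """Sum of absolute differences between two equally long rank vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


distances = {
    (s, t): manhattan(ranks[s], ranks[t]) for s, t in combinations(ranks, 2)
}

for (s, t), value in distances.items():
    print(f"{s} vs {t}: {value}")
```

The resulting pairwise distance matrix is what a program such as SplitsTree takes as input for the NeighborNet.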

In the network graph I have highlighted Bankrate's top 10 ranked states in green and their bottom 10 states in red. Note that they do not cluster neatly in the network, emphasizing the importance of considering the different characteristics, rather than just averaging them into a single ranking.

So, the network does not represent a single trend (from best to worst) — this would produce a long thin graph. Instead, the network scatters the states broadly, indicating that they have multiple relationships with each other — the eight retirement characteristics are not highly correlated. Indeed, the network is L-shaped, suggesting two main trends. The main part of the L has the north-eastern and west-coast states at one end and the mid-western and western states at the other, while the short part of the L separates out the south-eastern and south-western states. There are several obvious exceptions to these broad patterns (eg. Kentucky).

You can see that the north-eastern states tend to cluster together as being among the most desirable retirement locations (in Bankrate's ranking), and that the southern states tend to cluster together as being among the least desirable.

California is interesting because it ranks in the top two for Weather and Culture, but near the bottom for everything else. Hawaii ranks highly on Well-being and Culture but very poorly on Taxes, Crime rate, and Cost of living (where it is dead last). Florida, naturally, ranks first for Prevalence of seniors, but it is ranked mediocre to poor on everything else (including its hurricane-prone weather). New York is ranked first for Culture but mediocre to poor for everything else (and is ranked last for Taxes).

Alaska is ranked best for Taxes, Mississippi is best for Cost of living, Vermont is top for Crime rate (being low!), and Maine is best for Health care. Of these, only the latter state scores well for other characteristics, being second for Crime Rate and Prevalence of seniors. This puts it in the overall top three states, along with New Hampshire and Colorado.

New Hampshire gets the top spot by ranking well on everything except Cost of living and Weather — it is close to last for the latter characteristic!

So, the bottom line is that there is no state that particularly stands out as most suitable for retirees — in terms of desirable characteristics, what you win on the swings you lose on the roundabouts. Hardly surprising, really.

If you are interested in retiring to a particular city, then this recent web page may also be of relevance to you: Top 25 cities where you can live large on less than $70k.

Neither this nor the previous analysis (for countries) has addressed the issue of politics. Political voting is not randomly distributed, and some people prefer to live surrounded by voters similar to themselves. If this is you, then Wikipedia has a map indicating which states you might prefer.

Tuesday, October 3, 2017

Clades, cladograms, cladistics, and why networks are inevitable

During the work for another post, I stumbled on a kind of gap-in-knowledge that has nagged me for quite some time. This gap exists because researchers like to stay within chosen philosophical viewpoints, rather than reassessing their stance.

This gap involves the use of cladistic methodology in a manner that obscures information about evolutionary history, rather than revealing it. A clade, a subtree in a rooted tree that fulfills the parsimony criterion (or, indeed, any other criterion), may or may not reflect monophyly in a Hennigian sense, i.e. inclusive common origin. This is especially true for studies of extinct lineages.

I will explore this idea here in some detail.

Assumptions when studying fossils

Phylogenetic papers dealing with the evolution of extinct groups of organisms frequently use strict consensus trees (typically cladograms) of a sample of equally parsimonious trees (MPT) as the sole or main basis for their conclusions. They do this under two important implicit assumptions:
  • The morphological differentiation patterns encoded in a character matrix provide a generally treelike signal. In other words, the data patterns in the morphological matrix can be explained by a single, dichotomous, 1-dimensional graph. This assumption is also the basis for posterior filtering or down-weighting of characters that support splits (taxon bipartitions) conflicting with the branches in the inferred tree(s).
  • Morphological evolution is generally parsimonious. Although this may apply for characters that evolved only once or only evolve under very rare conditions, total evidence and DNA-constrained analysis demonstrate that this is not generally the case: the tree inferred by total-evidence or molecular constraints is typically longer than the tree(s) with the fewest character changes inferred on the morphological partition alone.
Another implicit assumption seems to be that all fossil specimens must represent extinct sister clades, and that no fossil specimen is ancestral to any other (or to an extant species) — hence, all taxa can be treated as terminals (not ancestors). Rooting typically relies on outgroups, under the assumption that ingroup-outgroup branching artefacts (such as long-branch attraction) play no role for parsimony inference when using morphological data sets.

In many of these morphology-phylogenetic papers (using parsimony or other methods) the authors state that they have conducted a “cladistic” study (I also made this error in my master's thesis; Grimm 1999). Cladistics is a classification system established by Hennig (1950) that relies on synapomorphies (exclusively shared, derived traits) that are linked with groups of inclusive common origin, the so-called monophyla.

More than 80 years earlier, Haeckel (1866) used the German word monophyletisch to refer to “natural” groups defined by a shared evolutionary history (a common origin). The latter could also include what Hennig identified as paraphyla: groups that have a common origin, but are not inclusive. To avoid confusion between Haeckelian and Hennigian monophyletic groups, Ashlock (1971) suggested the term holophyletic for the latter. This can be useful when a classification should recognise evolutionary relationships but needs to classify potentially or definitely paraphyletic groups for reasons of practicality (see e.g. Bomfleur, Grimm & McLoughlin 2017). Here, I will stick to Hennig’s terminology, as it is much more commonly used (although not necessarily correctly applied).

From a theoretical (and computational) point of view, Hennig’s monophyla are a brilliant concept, as they can be inferred using a rooted tree. The test for monophyly is simple: Do A and B have a common ancestor? If yes, identify all taxa that are part of the same subtree as A and B. Unfortunately, we often find more than one possible tree, and roots can be misleading.
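
The subtree test can be sketched in a few lines of Python. The nested-tuple tree, the taxon labels and the function names below are hypothetical; the sketch only checks whether a group forms a clade in one given rooted tree, so its answer is only as good as that tree and its root.

```python
def leaves(node):
    """Return the set of tip names below a node (nested tuples, str tips)."""
    if isinstance(node, str):
        return {node}
    return set().union(*(leaves(child) for child in node))


def smallest_clade(tree, taxa):
    """Return the tips of the smallest subtree of a rooted tree containing all taxa."""
    taxa = set(taxa)
    for child in (() if isinstance(tree, str) else tree):
        if taxa <= leaves(child):
            # One child already holds every requested taxon: descend further.
            return smallest_clade(child, taxa)
    return leaves(tree)


def is_clade(tree, group):
    """True if the smallest enclosing subtree adds no taxon to the group."""
    return smallest_clade(tree, group) == set(group)


# Hypothetical rooted tree: ((T1,T2),(T3,(T4,T5)))
tree = (("T1", "T2"), ("T3", ("T4", "T5")))
print(sorted(smallest_clade(tree, ["T3", "T4"])))
print(is_clade(tree, ["T4", "T5"]), is_clade(tree, ["T3", "T4"]))
```

Here T4 and T5 form a clade, whereas T3 and T4 do not, because their smallest common subtree also contains T5. Re-rooting the same tree, or swapping it for another equally parsimonious tree, can change both answers.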

Strict consensus trees poorly represent the alternative topologies in an MPT sample

All consensus-tree approaches are limited in their ability to depict the topological alternatives in a tree sample, but strict consensus trees are probably the worst (see e.g. Felsenstein 2004, chapter 30). They have also become obsolete with the development of consensus networks (Holland & Moulton 2003), and their subsequent implementation in freely accessible software packages such as SplitsTree (Huson 1998; Huson & Bryant 2006) and, more recently, the PHANGORN library for R (Schliep 2011; Schliep et al. 2017).

Figure 1 illustrates this difference for two extreme cases of binary matrices and their MPT collections. The two datasets in Fig. 1 reflect a substantially different data situation. The data in one matrix are perfectly tree-unlike (completely “confused about relationships”): any possible non-trivial bipartition of the 5-taxon set is supported by one (parsimony-informative) character. The data in the other matrix reflect two incongruent trees: each character is compatible with either one of the trees (parsimony-informative characters) or both trees (unique characters). The non-treelike matrix allows for many more MPTs than does the tree-like matrix, which results in two MPTs perfectly matching the two conflicting true trees. But both consensus analyses result in the same, unresolved (polytomous) strict consensus tree. In contrast, the two consensus networks highlight the difference in the quality between the data sets and the MPT sample.

Fig. 1 Non-treelike and treelike data, and the representation of their most-parsimonious tree collections as strict consensus trees and networks
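
The perfectly tree-unlike matrix is easy to reconstruct. The sketch below (Python, with made-up taxon labels A to E; the actual matrix in Fig. 1 may order or code its characters differently) builds one binary character for each non-trivial bipartition of a 5-taxon set and counts how many pairs of characters cannot fit on the same tree.

```python
from itertools import combinations

taxa = ["A", "B", "C", "D", "E"]   # hypothetical labels for a 5-taxon set
all_taxa = frozenset(taxa)

# A non-trivial split of five taxa separates two taxa from the other three.
splits = [frozenset(pair) for pair in combinations(taxa, 2)]

# One binary character per split: state 1 for the pair, state 0 for the rest.
matrix = {t: "".join("1" if t in s else "0" for s in splits) for t in taxa}


def compatible(s1, s2):
    """Two splits fit on one tree if at least one of the four
    side-by-side intersections is empty."""
    c1, c2 = all_taxa - s1, all_taxa - s2
    return any(not (x & y) for x in (s1, c1) for y in (s2, c2))


conflicts = sum(not compatible(a, b) for a, b in combinations(splits, 2))

for taxon, row in matrix.items():
    print(taxon, row)
print(len(splits), "characters,", conflicts, "conflicting pairs")
```

There are ten such characters, and 30 of the 45 character pairs are in conflict: each character clashes with six others. No single tree can accommodate more than two of them.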

Another example is given in Figure 2, which shows four trees that differ only in the placement of one taxon (T8). This is a common phenomenon, particularly when dealing with extinct groups of organisms. The three main reasons for such topological ambiguity are:
  1. Indecisive data regarding the exact position of T8 with respect to the members of the red (T1–T4) and green clades (T5–T7).
  2. Conflicting data: T8 shows a combination of traits that are otherwise restricted to (parts of) the green or red clade.
  3. T8 is an ancestor or primitive member of the green or red clade, or both.

Fig. 2 A single rogue taxon (T8) with ambiguous affinities collapses the strict consensus tree. In contrast, the consensus network can simultaneously show all alternatives, and identifies T8 as the source of topological ambiguity.

The strict consensus tree shows only three clades (three pairs of sister taxa) and a large polytomy, but the strict consensus network shows simultaneously the topology of all four trees and the position of T8 in these trees. From the consensus network, it is clear that the members of the red and green clades share a common origin. T8 can easily be identified as the rogue taxon (lineage).
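
The effect of a rogue taxon can be reproduced with a small Python sketch. The four rooted trees below are my own hypothetical reconstruction of such a situation, not the trees of Fig. 2, and for simplicity the sketch counts rooted clades rather than the unrooted splits a consensus network actually displays.

```python
from collections import Counter


def clades(tree):
    """Return the non-trivial clades (frozensets of tips) of a rooted tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        tips = frozenset().union(*(visit(child) for child in node))
        found.add(tips)
        return tips

    found.discard(visit(tree))   # the root clade is shared by definition
    return found


def prune(clade_set, taxon):
    """Drop one taxon from every clade and discard clades that become trivial."""
    return {c - {taxon} for c in clade_set if len(c - {taxon}) > 1}


red = (("T1", "T2"), ("T3", "T4"))
green = ("T5", ("T6", "T7"))
trees = [
    ((red, "T8"), green),                              # T8 sister to red
    ((("T1", "T2"), (("T3", "T4"), "T8")), green),     # T8 nested in red
    (red, (green, "T8")),                              # T8 sister to green
    (red, ("T5", (("T6", "T7"), "T8"))),               # T8 nested in green
]

counts = Counter(c for t in trees for c in clades(t))
strict = [c for c, n in counts.items() if n == len(trees)]
pruned = Counter(c for t in trees for c in prune(clades(t), "T8"))

for clade, n in sorted(counts.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(n, " ".join(sorted(clade)))
```

Only the three sister pairs occur in all four trees, which is all a strict consensus tree would show. The full tally keeps the information a strict consensus discards: the red and green clades each occur in three of the four trees, and once T8 is pruned they occur in all four.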

Cladograms are incomplete representations of evolutionary trees

Figure 3 shows one of the first phylogenetic trees ever produced, and how it would look in the results section of a cladistic study. The tree was produced 150 years ago by Franz Martin Hilgendorf, more than 100 years before Hennig’s ideas were introduced to the Anglo-Saxon world and became mainstream. Hilgendorf was a palaeontology Ph.D. student at the same institute (in Tübingen, Germany) that also awarded my doctorate. Quenstedt, his supervisor, forced a quick graduation to get him and his heretical Darwinian ideas out of his university; there are thus no figures in Hilgendorf's thesis, and he published a phylogenetic tree only after he left Tübingen. It shows the evolution of derived forms (terminals) from putative ancestral forms (placed at the nodes) of fossil snails from the Steinheimer Becken, and clearly distinguishes ancestors and sisters. At some point, Hilgendorf even considered including the reticulation of lineages to better explain some forms, but later dropped this idea, feeling it would violate Darwin’s principle (Rasser 2006; see The dilemma of evolutionary networks and Darwinian trees).

Fig. 3 Hilgendorf's phylogenetic tree of fossil snails and its representation in form of a cladogram. The coloured fields and boxes refer to a series of nested clades, which here equal monophyletic groups.

Translating Hilgendorf’s tree into a cladogram comes with a loss of information about the evolution of the snails. Some ancestors are placed as sisters to their descendants (e.g. 18 vs. 18a and 19) and others are collected in a polytomy together with their descendants/descending lineages (e.g. 15, the ancestor of the siblings 16, 17, and the 18+). The loss of information regarding assumed ancestor-descendant relationships is dramatic. But this is no problem for cladistic classification: all clades in the cladogram in Fig. 3 (boxes) refer to Hennigian monophyletic groups seen in the original phylogenetic tree (coloured backgrounds). The polytomies in the cladogram are hard polytomies and do not reflect uncertainty or ambiguity. This contrasts with most cladograms depicted in the phylogenetic (“cladistic”) literature, where polytomies can also reflect lack of support or topological ambiguity.

Accepting the possibility that some fossils (fossil forms) may be ancestral to others (or their modern counterparts), or at least represent an ancestral, underived form, we actually should not infer plain parsimony trees but median networks (Bandelt et al. 1995). Median networks and related inferences (reduced median networks: Bandelt et al. 1995; median joining networks: Bandelt, Forster & Röhl 1999) work under the same optimality criterion (evolution is parsimonious) but allow taxa to be placed at the nodes (the “median”) of the graph. In doing so, they depict ancestor-descendant relationships. That they have not been used for morphological data so far, nor in palaeophylogenetic studies (as far as I know), may have to do with their vulnerability to homoplasy and missing data. High levels of homoplasy are common in morphological matrices, and missing data can be a problem when working with extinct organisms.
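
The "median" can be illustrated with a minimal Python sketch for binary characters: the median of three taxa is their position-wise majority state. The four character strings are invented for illustration, and the sketch covers only this one building block, not the full median-network or median-joining algorithms.

```python
from collections import Counter


def median(a, b, c):
    """Position-wise majority state of three equally long binary character strings."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(a, b, c))


def differences(a, b):
    """Number of characters in which two taxa differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical taxa: X is the most primitive form, A the ancestor of B and C.
X, A, B, C = "0000", "1100", "1110", "1101"
print(median(X, B, C) == A)          # A coincides with the median node

# Same taxa with a fifth character that changed in parallel in B and C.
X2, A2, B2, C2 = "00000", "11000", "11101", "11011"
m = median(X2, B2, C2)
print(m, differences(m, A2))         # the median no longer coincides with A
```

In the first case the sampled taxon A is identical to the median of X, B and C, so it sits on the node as their ancestor. In the second case a single parallel change in B and C moves the median one step away from A, which is the kind of vulnerability to homoplasy mentioned above.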

An ideal matrix, in which each divergence is followed by the accumulation of synapomorphies (or “autapomorphies”, unique traits, close to the tips), results in a median network perfectly depicting the evolutionary tree (Figure 4). As soon as convergent evolution steps in, a median network can easily become chaotic, although less so for a median-joining network. Note that half of the characters are homoplasious, and yet the median-joining network is still largely treelike (Fig. 4), with only one 2-dimensional box. The true tree is included in the network; but an E-G clade evolving from D is indicated as alternative to the correct (and monophyletic) FGH clade, with G and H evolving from F. Another deviation from the true tree is that A, the ancestor of B and C, is not placed at the node, but is closer to the all-common ancestor X.

Fig. 4 Two datasets, one without (left) and one with homoplasy (right), and their median(-joining) networks. Green branches refer to exact fits with the true tree, red indicate deviation or conflict with the true tree.

Paraphyletic clades...

Figures 5A and B show the corresponding MPT for the ideal matrix and the strict consensus tree vs. strict consensus network for the matrix affected by homoplasy. As our ideal matrix includes actual ancestors, the MPT rooted with the most primitive taxon X (the common ancestor of A–H) cannot resolve the exact relationships, in contrast to the median network. It thus represents the true tree only partly. But it also does not show any clade that is not monophyletic.

In the case of the partly homoplasious data, the median-joining network reconstructs a synapomorphy of the clade BC, because A is not placed on the node. This is because one character in our matrix is a methodologically undetectable parallelism: the same trait evolved in the sister taxa B and C, but only after both evolved from A. Clade BC is non-inclusive (paraphyletic), since A is the direct ancestor of both B and C and the clade BC lacks a real synapomorphy (if we go back to Hennig's concept). The reconstructed A would, however, be a stem taxon and clade BC would be inclusive (monophyletic) with one (inferred) synapomorphy. But this is a purely semantic problem of cladistics. In the real world, we will rarely have the data to discern whether A represents: the last common ancestor of B and C, a stem taxon of the ABC-lineage (a’), a very early precursor of B or C (b/c), or an ancient sister lineage of A, B, and/or C (a*). For practicality, one would eventually include all fossil forms with A-ish appearance in a paraphyletic taxon A (Fig. 5C), in (silent) violation of the cladistic rule to name only monophyletic groups.

Fig. 5A The median network compared to the single most-parsimonious tree inferred based on the ideal matrix

Fig. 5B The median-joining network compared to the strict consensus tree and networks of five most-parsimonious trees inferred based on the matrix with homoplasy. Red edges indicate deviations from or conflicts with the true tree.

Fig. 5C Potential monophyla that could be inferred from the median-joining network (Clades XY), when rooted with the most ancient taxon X. Groups that are monophyletic according to the true tree in blue, groups that are not in orange.

The strict consensus tree of the five MPTs that can be inferred from the homoplasious matrix shows only the paraphyletic (pseudo-monophyletic) clade BC and two monophyletic clades (ABC and D–H), and it contains no further information about the actual topology of the five MPTs. Its lack of resolution is due to the ancestors, which typically have fewer derived traits (no autapomorphies and fewer synapomorphies), in combination with the homoplasy-induced topological ambiguity. In contrast, the strict consensus networks reveal that all five MPTs place D, the ancestor of the D–H lineage, as (zero branch length) sister to a technically paraphyletic E–H clade, thereby identifying D as the most primitive form of the monophyletic D–H clade. Furthermore, all MPTs recognise a paraphyletic FH clade (F again on a zero-length branch). They disagree in the placement of G, which is either sister to F+H (monophyletic FGH clade) or sister to E (a wrong EG clade).

... and monophyletic grades

Figure 6 shows a scenario in which paraphyletic groups are resolved as clades and monophyletic groups form grades, both because of outgroup-ingroup branching artefacts. The derived outgroup O is notably distinct from all ingroup taxa showing a character suite of convergently evolved traits that are randomly shared with parts of the ingroup. Within the ingroup, members of clade DEF are much more derived than are A and C.

Fig. 6 Ingroup-outgroup long-branch attraction can turn monophyla into grades and paraphyla into clades. The ingroup (A–F) consists of a sequence of nested monophyletic lineages (green shades) including two taxa (lowercase letters) that are ancestral to others. Each ingroup lineage evolved (convergent) traits also found in the outgroup O. The data allow inferring two MPTs that misplace O. The outgroup-misinformed root leads to a series of nested clades that are paraphyletic. Splits congruent with the actual monophyletic groups in green, those in conflict with the true tree in red.

Parsimony-tree inference finds two MPTs, which, rooted with the outgroup O, recognise a distinctly paraphyletic A–D+X clade. In both outgroup-rooted MPTs, the monophyletic DEF group is dissolved into a grade. By the way: using neighbour-joining (NJ) to find a tree fulfilling the least-squares (LS) criterion based on the corresponding pairwise mean distance matrix, the outgroup-inferred root is still misplaced with respect to the primitive taxa (X, A–C), but the DEF monophylum is correctly resolved as a clade. Call the Spanish Inquisition! A “phenetic” clustering algorithm finds a tree that is less wrong than the MPTs.
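
For readers unfamiliar with the distance side of this comparison, the following Python sketch shows one common way to obtain a mean pairwise distance matrix from a character matrix: the proportion of differing characters among those scored in both taxa. The four-taxon matrix, the "?" code for missing data and the function names are assumptions for illustration; this is not the matrix behind Figs 6 and 7.

```python
def mean_distance(row_a, row_b, missing="?"):
    """Proportion of differing characters among those scored in both taxa."""
    scored = [(a, b) for a, b in zip(row_a, row_b) if missing not in (a, b)]
    if not scored:
        return float("nan")      # no character scored in both taxa
    return sum(a != b for a, b in scored) / len(scored)


def distance_matrix(matrix):
    """Mean pairwise distances for every ordered pair of taxa."""
    taxa = list(matrix)
    return {(s, t): mean_distance(matrix[s], matrix[t]) for s in taxa for t in taxa}


# Hypothetical matrix: X primitive, A and B ingroup, O a derived outgroup.
matrix = {"X": "000000", "A": "100000", "B": "110?00", "O": "0101?1"}
dist = distance_matrix(matrix)

for taxon in matrix:
    print(taxon, " ".join(f"{dist[taxon, other]:.2f}" for other in matrix))
```

A matrix of this kind is the input for neighbour-joining and for the neighbour-net discussed next. Unlike a cladogram, it retains the amount of difference between taxa.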

The most comprehensive display of the misleading signal in this matrix is nevertheless the neighbour-net (NNet; Figure 7), which includes both the parsimony and LS-solutions, and it can be used to map the competing support patterns surfacing in a bootstrap analysis of the data. In this network we can see that the signal is not compatible with a single tree, and that the signal from the distant outgroup O is too ambiguous for rooting the ingroup. Based on this graph, one can argue to delete the outgroup, thereby deleting all non-treelike signal — a NNet (or median network) excluding O matches exactly the true tree.

Fig. 7 Neighbour-net based on mean pairwise distances (same data as in Fig. 6). The outgroup O provides a strongly ambiguous (non-treelike) signal, thus triggering a series of splits (in red) that conflict with the true tree (shown in grey). Edges compatible with the true tree are shown in green. The numbers refer to non-parametric bootstrap support estimated under three optimality criteria, with 10,000 (pseudo)replicates each: least-squares (LS; via neighbour-joining), maximum likelihood (ML; using Lewis' 1-parameter Mk model), and maximum parsimony (MP). Upper right: A splits-rose illustrating the competing support patterns for proximal splits involving O: green, the split seen in the true tree; reddish, the competing splits seen in the two MPTs.

We need to accept that a clade, a subtree in a rooted tree (see e.g. Felsenstein 2004) fulfilling the parsimony criterion (or any other criterion), may or may not reflect monophyly in a Hennigian sense, i.e. inclusive common origin. Thus, it is imperative to distinguish between a classification concept that interprets trees (cladistics) and the method used to infer trees (typically parsimony, in the case of extinct lineages). This is especially so when one has to work with stand-alone data, such as morphological data of extinct groups of organisms.

Aside from the clades/grades ↔ monophyla / paraphyla / can't-say problem, the instability of clades in a parsimony or otherwise optimised rooted tree, or the alternative clades that can be inferred from the more data-comprehensive networks, make it difficult to enforce a strictly cladistic naming scheme. For the example shown in Fig. 2, we would be unable to name the red and green clades until the exact position of T8 is settled (see also Bomfleur, Grimm & McLoughlin 2017). In the end, the overall diversity patterns (studied using exploratory data analysis) may remain the most solid ground for classification.

It should also be obligatory in phylogenetic studies to use networks to display both competing topological alternatives and incompatible data patterns. There should also always be some information on edge-lengths. Consensus trees are insufficient, as they mask conflicting data patterns, and cladograms mask the amount of change.


Ashlock PD. (1971) Monophyly and associated terms. Systematic Zoology 20:63–69.

Bandelt H-J, Forster P, Röhl A. (1999) Median-joining networks for inferring intraspecific phylogenies. Molecular Biology and Evolution 16:37–48.

Bandelt H-J, Forster P, Sykes BC, Richards MB. (1995) Mitochondrial portraits of human populations using median networks. Genetics 141:743–753.

Bomfleur B, Grimm GW, McLoughlin S. (2017) Figure 8 of: The fossil Osmundales (Royal Ferns)—a phylogenetic network analysis, revised taxonomy, and evolutionary classification of anatomically preserved trunks and rhizomes. PeerJ 5:e3433.

Felsenstein J. (2004) Inferring phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc.

Grimm GW. (1999) Phylogenie der Cycadales. Diploma thesis. Eberhard Karls Universität. [in German]

Haeckel E. (1866) Generelle Morphologie der Organismen. Berlin: Georg Reimer.

Hennig W. (1950) Grundzüge einer Theorie der phylogenetischen Systematik. Berlin: Dt. Zentralverlag.

Holland B, Moulton V. (2003) Consensus networks: A method for visualising incompatibilities in collections of trees. In: Benson G, and Page R, eds. Algorithms in Bioinformatics: Third International Workshop, WABI, Budapest, Hungary Proceedings. Berlin, Heidelberg, Stuttgart: Springer Verlag, p. 165–176.

Huson DH. (1998) SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 14:68–73.

Huson DH, Bryant D. (2006) Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution 23:254–267.

Rasser MW. (2006) 140 Jahre Steinheimer Schnecken-Stammbaum: der älteste fossile Stammbaum aus heutiger Sicht. Online version, originally published in Geologica et Palaeontologica, vol. 40.

Schliep K, Potts AJ, Morrison DA, Grimm GW. (2017) Intertwining phylogenetic trees and networks. Methods in Ecology and Evolution DOI:10.1111/2041-210X.12760.

Schliep KP. (2011) Phangorn: phylogenetic analysis in R. Bioinformatics 27:592–593.