The Genealogical World of Phylogenetic Networks: October 2012

Wednesday, October 31, 2012

When is there support for a large phylogeny?

I have commented before (see this post: Networks and bootstraps as tree-support criteria) that there is often a discrepancy between bootstrap support on a phylogenetic tree (performed as a data analysis) and the output of a network analysis (performed as an exploratory data analysis, EDA). Here I will present a couple of examples of large phylogenies.

Soltis et al. (2011) report for their study of angiosperm phylogeny that:

We conducted two primary analyses based on 640 species representing 330 families. The first included 25,260 aligned base pairs from 17 genes, representing all three plant genomes, i.e., nucleus, plastid, and mitochondrion ... Phylogenetic analyses using maximum likelihood were conducted in the program RAxML ... Many important questions of deep-level relationships in the non-monocot angiosperms have now been resolved with strong support ... Our analyses confirm that with large amounts of sequence data, most deep-level relationships within the angiosperms can be resolved.

By "strong support" the authors mean that most of the branches on their phylogenetic tree have >85% bootstrap support.

I performed a NeighborNet analysis on their data file of aligned sequences (available in TreeBASE). This was quite a challenge for the SplitsTree program, testing whether SplitsTree can handle 640 taxa. It turns out that it can, but it is very slow to re-draw the figure. Thus, rotating the figure, as I did here, took a very long time.

The network shows that the bootstrap support is not as convincing as it sounds. There is not much clear tree-like structure in the dataset.

As an alternative example, Decker et al. (2009) used a NeighborNet to display their data about cattle domestication:

We constructed a phylogenomic network to accurately describe the relationships between 48 cattle breeds and facilitate inferences concerning the history of domestication and breed formation ... Due to memory limitations in SplitsTree, genotypes at 14,023 SNPs were used to construct a network of 372 individuals belonging to 48 breeds. Default settings in SplitsTree were used to construct the networks ... This figure reveals that the history of breed formation in cattle has been complicated and has involved bottlenecks, evolution in isolation, coancestry, migration, and admixture.

The network simply shows that the different breeds can be recognized but that their relationships are not easy to resolve.

In neither of these two examples does there seem to be much reason to be confident in any conclusions about evolutionary relationships, as the network in both cases is essentially an unresolved bush.

The problem is likely to be the size of the phylogenies, as the potential complexity of a dataset increases combinatorially with the number of taxa (each added taxon can potentially have a reticulation with every one of the existing taxa). A dataset thus needs a very strong tree signal when there are hundreds of taxa, if the network is to show anything more than the disorganized blobs displayed here. This seems to be an unlikely scenario for most taxonomic groups, especially when using genetic data.

If this idea is correct then we will need to start thinking about potential solutions, in order to fully utilize networks for EDA. Perhaps the most obvious approach is to filter out the smaller patterns before constructing the network, with "smaller" being defined relative to the objective of the analysis. This approach is already used, for example, for consensus networks (where only a specified percentage of the splits in the input trees is included in the network) and super networks (where splits are also filtered in order to keep the network planar; Huson et al. 2006; Whitfield et al. 2008).
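The percentage-based filtering used for consensus networks can be sketched roughly as follows. This is only an illustration under stated assumptions, not the SplitsTree implementation: each input tree is assumed to be already reduced to a list of its non-trivial splits, each split is given by one of its two sides, and the function names and toy data are hypothetical.

```python
from collections import Counter

def normalize(split, taxa):
    """Represent a split by the side that does NOT contain a fixed reference taxon."""
    ref = min(taxa)  # arbitrary but fixed reference taxon
    side = frozenset(split)
    return side if ref not in side else frozenset(taxa) - side

def filter_splits(trees, taxa, threshold):
    """Keep the splits found in at least `threshold` (a fraction) of the trees."""
    counts = Counter()
    for tree in trees:
        # a tree contributes each of its distinct splits exactly once
        counts.update({normalize(s, taxa) for s in tree})
    n = len(trees)
    return {s: c / n for s, c in counts.items() if c / n >= threshold}

# Hypothetical toy input: three trees on five taxa, each given as a list of splits
taxa = {"A", "B", "C", "D", "E"}
trees = [
    [{"A", "B"}, {"D", "E"}],
    [{"A", "B"}, {"C", "D"}],
    [{"A", "B"}, {"D", "E"}],
]
kept = filter_splits(trees, taxa, threshold=0.5)
for split, freq in sorted(kept.items(), key=lambda kv: -kv[1]):
    print(sorted(split), round(freq, 2))
```

Raising the threshold removes the rarer splits first, which is one way of making "smaller" patterns relative to the objective of the analysis.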

References

Decker J.E., Pires J.C., Conant G.C., McKay S.D., Heaton M.P., Chen K., Cooper A., Vilkki J., Seabury C.M., Caetano A.R., Johnson G.S., Brenneman R.A., Hanotte O., Eggert L.S., Wiener P., Kim J.-J., Kim K.S., Sonstegard T.S., Van Tassell C.P., Neibergs H.L., McEwan J.C., Brauning R., Coutinho L.L., Babar M.E., Wilson G.A., McClure M.C., Rolf M.M., Kim J., Schnabel R.D., Taylor J.F. (2009) Resolving the evolution of extant and extinct ruminants with high-throughput phylogenomics. Proceedings of the National Academy of Sciences of the U.S.A. 106: 18644-18649.

Huson D.H., Steel M., Whitfield J.B. (2006) Reducing distortion in phylogenetic networks. Lecture Notes in Bioinformatics 4175: 150-161.

Soltis D.E., Smith S.A., Cellinese N., Wurdack K.J., Tank D.C., Brockington S.F., Refulio-Rodriguez N.F., Walker J.B., Moore M.J., Carlsward B.S., Bell C.D., Latvis M., Crawley S., Black C., Diouf D., Xi Z., Rushworth C.A., Gitzendanner M.A., Sytsma K.J., Qiu Y.L., Hilu K.W., Davis C.C., Sanderson M.J., Beaman R.S., Olmstead R.G., Judd W.S., Donoghue M.J., Soltis P.S. (2011) Angiosperm phylogeny: 17 genes, 640 taxa. American Journal of Botany 98: 704-730.

Whitfield J.B., Cameron S.A., Huson D.H., Steel M.A. (2008) Filtered z-closure supernetworks for extracting and visualizing recurrent signal from incongruent gene trees. Systematic Biology 57: 939-947.

Monday, October 29, 2012

Tattoo Monday VII

Here are three more tattoos in our never-ending compilation of evolutionary tree tattoos from around the internet. This circular design for a phylogenetic tree is quite popular (see Tattoo Monday, Tattoo Monday V and Tattoo Monday X), and it appears in more diverse body locations than any other design.

Thursday, October 25, 2012

Another phylogenetic network outside science

I have noted before that evolutionary networks are used in both biology and the social sciences, and that you will also occasionally find them elsewhere, as a means of displaying historical relationships among objects or concepts (see this blog post: Phylogenetic networks outside science). However, the uses outside science are not always successful, as shown in this previous blog post (Direction is important when showing history). Here I discuss another example of a phylogeny of ideas rather than objects that has potential problems.

This is labelled as a Computer Languages Timeline but, just like the previous example of a GNU/Linux Distribution Timeline that I discussed, it is actually drawn as a set of linearized genealogies. These are evolutionary networks rather than trees because there is horizontal transfer (ideas added) and recombination (ideas replaced) among the languages.


The basic problem with this example is that it is not time-consistent. That is, the connections between languages begin at one time and end at another time. This does not happen with the GNU/Linux example. Many of the connections seem to have arbitrary begin/end times, which is not only unnecessary but also confusing.

There is, however, a good reason for some of the connections not being time-consistent. This occurs when a previous version of one computer language is used as the source of ideas for a later language, so that the information does indeed travel through time, in the manner that I have already discussed for phylogenies of ideas rather than objects (see this blog post: Time inconsistency in evolutionary networks). Examples in the Computer Languages phylogeny include the use of Fortran I (1956) as the basis for IAL (1958), and the use of Fortran II (1957) as the basis for Basic (1964).

It is important to distinguish these two types of time inconsistency. There is a logical basis for the transfer of ideas through time, in which case the reticulation connections should be drawn to reflect the time inconsistency; but there is no logical basis for the other time inconsistencies, and their use should be avoided in the diagrams.

Monday, October 22, 2012

The network history of the Carnival of Evolution

We recently hosted the 52nd edition of the Carnival of Evolution here at this blog, and since then I have done a bit of digging into the history of the Carnival (or CoE). I am sharing here some of what I found, mostly numbers.

The Carnival of Evolution was founded at the end of August 2008, by Daniel Brown, then of the Biochemical Soul blog. Most blog Carnivals seem to last for only a few issues, but the CoE has continued for more than 50 editions as a monthly summary of "all that is best in evolution blogging". Indeed, it is the only Carnival currently listed as "active" in the Science category, out of the 48 that have existed at one time or another (see this 2009 blog post by Grrl Scientist on the demise of science carnivals).

For longevity, it cannot yet compete with some other biology carnivals, such as I and the Bird, which appeared for 149 fortnightly editions from July 2005 to April 2011, but it has lasted better than most other carnivals — there are 2,964 carnivals listed, but only 136 of these posted an edition during Aug-Oct 2012. For example, the 51st edition of the CoE celebrated precisely 4 years, while the 52nd edition appeared after 1,500 days of continuous blogging. The early editions were intended to be fortnightly, but after a missed hosting early on (at the Life Before Death blog) the plan was changed to roughly monthly intervals, as shown in the first graph.

Frequency histogram of the times between CoE editions.

The first 18 CoE editions were administered by the afore-mentioned Daniel Brown; but circumstances change for most bloggers, and so he passed the baton to Bjørn Østman, who has carried it since then. There have been 47 different host blogs for the 52 editions — Biochemical Soul, Carnival of Evolution, Greg Laden's Blog, Observations of a Nerd and Quintessence of Dust have all hosted twice. Furthermore, as individuals, Daniel Brown hosted 3 times, and Bjørn Østman, Greg Laden, Steve Matheson, Christie Wilcox and Psi Wavefunction (the Scarlet Pimpernel of evolution blogging) have each hosted twice (not always at the same blog!).

Unfortunately, not all of the 52 Carnivals are still available at the original blogs — number 13 was at a now-deleted blog, the blog hosting number 37 is now access-restricted, and the blogs for numbers 7, 9, and 15 no longer have links to the relevant pages. Fortunately, two of these Carnivals have been archived at the Internet Archive Wayback Machine (#9 and #15), and one is available in a slightly re-formatted form at the blog aggregator Planet Atheism (#37). As for the extinct two, Pleiotropy has a sample list of some of the post topics for #13; but for #7 the only information available is that it was "a short but sweet edition".

This issue raises the related question as to the fate of the 47 blog hosts since their Carnival hosting. As far as I can tell, 1 has been deleted, 1 is now restricted access, 11 have stopped new posts, 5 have continued in another form (eg. another name or address), and the remaining 29 are extant. This shows some remarkable longevity in evolution blogging.

I have looked through the 50 available Carnivals, and I can report as follows. (Note: In the following I took a restricted view of "blog posts" as not including press announcements, of which there have been quite a few.)

There are 283 separate blogs mentioned in the Carnivals, although several of these blogs were moved and/or renamed versions of other blogs, as bloggers tend to move about a fair bit. Of these blogs, 161 (57%) were One Hit Wonders (ie. they were featured only once), as shown in the second graph.

Frequency histogram of the number of Carnivals in which each blog was cited.

The two blogs with the highest number of Carnival appearances are: Pleiotropy (by Bjørn Østman, naturally!), which was featured in 68% of the Carnivals from number 9 onwards; and NeuroDojo (by Zen Faulkes), which was featured in 69% of the Carnivals from #11 onwards. A special mention should also go to Living the Scientific Life (Grrl Scientist), which was featured in 62% of the Carnivals during numbers 1-29 (she has now drifted into a different form of blogging).

A separate issue is how many actual posts were contributed by each blog (some people blog a lot more than others), which is shown in the third graph.

Frequency histogram of the total number of posts cited for each blog.

The record of 52 posts is held by The Mermaid's Tale (present from Carnival #30 onwards), which is a multi-author blog (Anne Buchanan, Holly Dunsworth, Ken Weiss), giving them an advantage quantity-wise (and also in diversity of subjects). Mind you, the sole-authored NeuroDojo has 51 posts and Pleiotropy has 49! The record of 10 posts cited in one Carnival is held by Sandwalk (Larry Moran), which was the "specially featured" blog in Carnival #28. Sandwalk also contributed 7 posts to Carnival #50, while The Loom (Carl Zimmer) contributed 6 to #31, and The Mermaid's Tale contributed 6 to #33.

Frequency histogram of the number of blogs cited in each Carnival.

If we look at these data the other way around, we can contemplate the number of blogs cited per Carnival, in the fourth graph, and the total number of posts cited per Carnival, in the fifth graph.

Frequency histogram of the number of blog posts cited in each Carnival.

The maximum number of blogs cited in any one Carnival is 44 and the record number of posts is 69, both held by The Dispersal of Darwin (Michael Barton), the host of Carnival #31. (This Carnival also produced the greatest number of One Hit Wonders.) The minimum number is 6 for both criteria, interestingly enough in Carnival #6. The highest average number of posts per blog was 2.1, in Carnival #33 (41 posts cited from 20 blogs).

Fortunately, the number of posts has shown a steady upward curve, as indicated in the sixth graph, although not always at the one-blog-post-per-day rate set in the earliest days. However, over the past 20 Carnivals there has been an average of 1.06 blog posts cited per day of passing time, so we are certainly holding our own.

The steady growth of the CoE through time.

That's enough about the numbers. What themes have been employed by the CoE host bloggers to present their Carnival? The idea of theme-based presentations was introduced by Daniel Brown in CoE #10, but they appeared only sporadically until CoE #44, since when they have become de rigueur.

The themes we have had are (in order): Darwin's journal, phylogenetic analysis, superstars, a real carnival, Feed Your Head, a football game, a Darwin letter, the Origin of Species, a scientific conference, a conference slide presentation, phylogenetic trees, a newspaper report, an Icelandic saga, mousetraps, a set of teaching modules, Darwin's Restaurant, and phylogenetic networks. Clearly, invention is the name of the game. The most inventive may well be Adrian Thysse's slide presentation in CoE #45; while probably the most outrageous came from Psi Wavefunction in CoE #20, who performed original phylogenetic analyses of the blog posts while claiming no prior knowledge about how to do it!

Finally, and most importantly, we can ask: How has the Carnival of Evolution changed through time? This is precisely what a phylogenetic network is designed to tell us, as shown in the final figure, which is based on a phylogenetic analysis of the data concerning which blogs were cited in which Carnivals.

NeighborNet graph of the similarity relationships between the Carnivals. The data being summarized are the number of posts from each blog. So, in the data matrix each row is a Carnival and each column is a blog, with the data in each cell being the number of posts. The length of the terminal branches of the network roughly reflects how many blogs were featured, whereas the intricate network of inter-connecting lines indicates the complex patterns as to which blogs were featured in which Carnivals. Note that #7 and #13 are absent.
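A data matrix of this shape (rows are Carnivals, columns are blogs, cells are post counts) has to be turned into pairwise distances before a NeighborNet can be drawn. The sketch below shows one way of doing that step. The choice of Manhattan distance is an assumption made for illustration, since the distance actually used for the figure is not stated, and the toy counts and function names are hypothetical.

```python
def manhattan(u, v):
    """Sum of absolute differences in post counts between two Carnivals."""
    return sum(abs(a - b) for a, b in zip(u, v))

def distance_matrix(rows):
    """Return the Carnival names (sorted) and a symmetric matrix of distances."""
    names = sorted(rows)
    return names, [[manhattan(rows[a], rows[b]) for b in names] for a in names]

# Hypothetical toy data: three Carnivals (rows) by four blogs (columns)
counts = {
    "CoE01": [2, 1, 0, 0],
    "CoE02": [1, 1, 1, 0],
    "CoE52": [0, 0, 2, 3],
}
names, dist = distance_matrix(counts)
for name, row in zip(names, dist):
    print(name, row)
```

In this toy example the two early Carnivals are close to each other and both are far from the late one, which is the kind of gradient through time that the network displays.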

As you can see, the analysis shows that there has been a gradient through time. For example, the first 12 Carnivals are at the bottom-right of the diagram and the most recent 12 are at the upper-left of the diagram, with the others arranged in between. (Note that they are not in perfect order.) This means that the blogs being cited have greatly changed through time. For example, none of the blogs featured in the first three Carnivals re-appeared among the most recent four Carnivals.

The biggest change appears to have occurred with Carnival #31. This is indicated by the big gap in the diagram between Carnival #28 and #31 (it is arrowed). Presumably, this had nothing to do with the CoE host (the afore-mentioned Dispersal of Darwin), but has more to do with the time, which was the end of 2010. The world of social media was changing rapidly at that time, and several of the bloggers either stopped blogging or moved house.

It is tempting to interpret the relationships among the blogs in more detail, but that might be tempting fate. I will content myself with pointing out that, in the diagram, the sister to my blog (#52) is Pharyngula (#48), which must be an example of the well-known phylogenetic artifact of long-branch attraction.

Anyway, that's all there is in my network history of the ongoing Carnival of Evolution. Congratulations to all of the people involved in this successful Carnival; but someone else will have to write the centenary history, when it falls due.

Saturday, October 20, 2012

The Future of Phylogenetic Networks: Photos

Evidence that we were in the Netherlands.

The Lorentz Center building. The Center offices are to the right and left of the yellow window.

The seminar room. Left to right: Axel Janke, David Morrison, Hans-Jürgen Bandelt, Steven Kelk, Mike Steel.

The dinner cruise around the canals.

Dan Gusfield (obscured at the left) telling the young people how it is. Left to right: Mareike Fischer (plus dog), Chris Whidden, Leo van Iersel, Steven Kelk, Simone Linz, and Céline Scornavacca (resting).

Left to right: James Oldman (chopped off), Juan-Diego Santillana-Ortiz (green jacket), Irma Lozada Chávez (at back), Jack Koolen, Axel Janke, David Morrison, Magnus Bordewich (back to camera), James McInerney, Mike Steel (obscured), and Charles Semple.

Three men and a goat, at Leiden market. Left to right: David Morrison, Scot Kelchner, and Jim Whitfield.

Friday, October 19, 2012

The Future of Phylogenetic Networks: Day 5

There were only two talks today. First, Leo van Iersel provided an excellent summary of the week's talks and discussions. He put into context all of the points that had been made throughout the week, and provided a clear framework for future activities.

Finally, Daniel Huson provided the keynote talk for the week. He especially covered the need for algorithms to address real data (eg. nonbinary trees, multiple trees, partial trees), for the output to return all relevant solutions, and to provide tools for visualization and interactive analysis.

All in all, a very productive week was had by all. The speakers are especially to be thanked, for providing overviews of and introductions to the various topics; as well as those people who contributed the most to the discussion sessions. The communication between the biologists and the mathematicians was excellent (they are, after all, simply two parts of a continuum), with everyone learning something new as well as making new friendships. Hopefully, this workshop will be the start of a series of such meetings, which would then be a forum for building a substantial body of work on the topic of phylogenetic networks.

The Future of Phylogenetic Networks: Day 4

There were two talks today and two lengthy discussion sessions.

Hans-Jürgen Bandelt started the day with a smallish audience that increased as time progressed. This may have something to do with the noise emanating from the hotel bar the previous night. Hans surveyed the field of splits networks, and especially their uses for data quality control in forensic and medical databases. The extent of this use was news to most of the audience.

The morning discussion turned out to be about the extent to which it might or might not be desirable to have some sort of "standards" or even "protocols" for effective use of network techniques. Clearly, many if not most users of phylogenetics are not experts, and misuse or misunderstanding of networks is a real possibility. There was no particular consensus on this issue.

The afternoon discussions covered three topics. First, we discussed ways of detecting hybridization using networks, as opposed to detecting it using direct biological techniques. There were arguments both for and against the successful use of networks, with the consensus being that networks have a useful role.

We then proceeded to a consideration of the extent to which network methodology is prepared for the expected influx of genome-scale datasets. The answer seems to be "not yet", but even with the available methods there is much scope for effective analysis.

The third topic was the practical use of current methods and programs for exploratory analysis of multi-gene data. A number of additions or modifications to current implementations were suggested, including some measure of "tree-likeness" to rank different genes in terms of their data patterns.

The day finished with Charles Semple, who is always first to breakfast and last to leave. He explained the various measures optimizing the process of combining trees, notably reticulation number and its various definitions. There was much use of the whiteboard, which is the mark of a true mathematician.

Wednesday, October 17, 2012

The Future of Phylogenetic Networks: Day 3

There were four talks today and one discussion session. We also spent the evening on a boat cruise around the waterways north of Leiden, supping on an Indonesian buffet and consuming some distinctly non-Indonesian desserts. No-one embarrassed themselves, and so (sadly) there are no stories to be told.

Jim Whitfield started the day's talks by contemplating the varied ways in which entomologists might be interested in using a network, especially with genome-scale data. He eventually worked his way around to the topic of datasets for validating network algorithms, which would need to cover all of these possibilities.

Barbara Gravendeel continued the same theme, by presenting some datasets, mainly involving orchids, for which there is independent evidence that some of the taxa are hybrids. Such datasets could be used for algorithm validation.

The discussion session followed, which focussed on the various ways in which validation datasets could be made available. In the short term, it seems likely that a web page will be set up with this information for each dataset: (i) a link to the online dataset, (ii) a link to the relevant publication(s), and (iii) a brief description of what relevant data patterns are believed to be included in the dataset.

Mike Steel then started the very mathematical afternoon. He mainly contemplated the extent to which non-tree biological processes could create tree-like patterns in the data. There are theoretical ways to differentiate various signatures of gene tree incongruence in the context of triplets, but also sources of inconsistency in phylogeny reconstruction.

Dan Gusfield finished with a coverage of ancestral recombination graphs, including their possible use in biology, but mainly some of the potential things that create problems for reconstruction. The mathematical part of the audience looked very enthused by the end of the afternoon.

The Future of Phylogenetic Networks: Day 2

There were five talks today and one discussion session. We were rained upon for much of the day, although it was better in the afternoon.

Scot Kelchner started by explaining how networks are currently used in botany. Plants have long been accepted as having a complex history, and so there is great potential for using networks to explore this complexity, from detecting unexpected hybrids to assessing lineage sorting. He emphasized the valuable role of networks as a paradigm for both exploratory data analysis and hypothesis generation.

Teun Boekhout addressed the enormous network complexity of fungi, although it seems that very few practitioners are using the available mathematical methods.

James McInerney, somewhat the worse for wear, then discussed microbiology, where much attention has been given recently to horizontal evolutionary processes. He explained the continued dominance of the tree paradigm within prokaryote studies as resulting from interest in the transmission tree of the pathogens and its obvious connection to a phylogenetic tree. He then turned to the Tree of Life, which bacteriologists seem to see as their special preserve, pointing out its inadequacy as a model, before turning to the range of microbiological questions that exist only in the context of a network paradigm.

Eric Bapteste then further explored the concept of network thinking as opposed to tree thinking, emphasizing how limited the tree paradigm is when studying the known diversity of biological phenomena. Most interestingly, however, he also noted that even the network paradigm has limitations for studying biodiversity, as phylogenetic networks need to be linked to other types of biological networks.

Vincent Moulton produced the only mathematical talk of the day, covering the mathematical quantification of branch support within networks, which is clearly of interest for both data exploration and hypothesis generation. To date, bootstraps are the only method implemented in the software, although they are rarely used in practice. Delta plots have also been proposed, and are quick to calculate, but there are other theoretical possibilities to be explored.
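To show why delta plots are quick to calculate, here is a minimal sketch of a quartet-based delta score computed from a distance matrix. It follows the definition as it is commonly given (for each quartet, take the three pairwise sums of distances, and divide the gap between the two largest by the gap between the largest and smallest); the function names and the two toy matrices are hypothetical, and this is not the code of any particular program.

```python
from itertools import combinations

def quartet_delta(d, x, y, u, v):
    """Delta for one quartet: 0 is perfectly tree-like, 1 is maximally non-tree-like."""
    sums = sorted(
        [d[x][y] + d[u][v], d[x][u] + d[y][v], d[x][v] + d[y][u]],
        reverse=True,
    )
    spread = sums[0] - sums[2]
    # if all three sums are equal the quartet is a star, which counts as tree-like
    return 0.0 if spread == 0 else (sums[0] - sums[1]) / spread

def mean_delta(d):
    """Average delta over all quartets (requires at least four taxa)."""
    values = [quartet_delta(d, *q) for q in combinations(range(len(d)), 4)]
    return sum(values) / len(values)

# Additive distances from the tree ((A,B),(C,D)) with all branch lengths equal to 1
tree_like = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
# Distances around a square A-B-C-D, which no tree can represent exactly
box_like = [
    [0, 1, 2, 1],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [1, 2, 1, 0],
]
print(mean_delta(tree_like), mean_delta(box_like))
```

The cost is one pass over all quartets, so it grows with the fourth power of the number of taxa, but each quartet needs only a handful of additions.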

The topic for the discussion was not pre-determined, and turned out to be just how much automation would be useful in network analyses. The consensus is that an attempt at complete automation would be counter-productive. However, the most thorny issue of debate was exactly which types of network are likely to be most useful to biologists. The problem here is that the mathematicians need an explicit description of such networks in order to produce them, while the biologists do not yet have such a description. The issue remained unresolved when we adjourned to the coffee room.

Tuesday, October 16, 2012

The Future of Phylogenetic Networks: Day 1

There were three talks today and two discussion sessions.

Steven Kelk and I got things rolling by introducing the topic of networks from the mathematical and biological perspectives, respectively. I thought that we both did a good job, but I learned more from Steven's talk than from my own.

Luay Nakhleh then presented some of the computational challenges of moving from a tree perspective to a network. Perhaps the most interesting of these to a biologist is the decreasing independence of reticulations as gene sampling increases (eg. due to gene linkage). Independence is a basic assumption of most computational methods in biology, and the consequences of violating this assumption are rarely addressed. However, for network construction the potential non-independence of reticulations seems to be of fundamental importance for any biological interpretation of reticulation causes.

Computationally, the obvious challenge is the complexity of scoring a network compared to a tree. Calculating the parsimony score of a network is hard enough (although trivial for a tree), and scoring the likelihood is even worse. This calls into question the practicality of using likelihood in the context of networks.
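For comparison, the tree case really is short. The sketch below scores a single character on a rooted binary tree using Fitch's algorithm, which is the standard way of doing this; it is named here as an illustration and was not necessarily the method referred to in the talk. The nested-tuple tree format and the toy characters are hypothetical.

```python
def fitch(tree, states):
    """Return (possible ancestral states, number of changes) for one character.

    `tree` is a leaf name (str) or a pair (left, right) of subtrees;
    `states` maps each leaf name to its observed character state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        # the children can agree on a state, so no extra change is needed
        return common, left_cost + right_cost
    # the children disagree: keep both options and count one change
    return left_set | right_set, left_cost + right_cost + 1

tree = (("A", "B"), ("C", "D"))
compatible = {"A": "G", "B": "G", "C": "T", "D": "T"}
conflicting = {"A": "G", "B": "T", "C": "G", "D": "T"}
print(fitch(tree, compatible)[1], fitch(tree, conflicting)[1])
```

Each node is visited once, so the score of a tree is obtained in time linear in the number of taxa for each character. No such single bottom-up pass is available for a network, where a character may follow different parents at each reticulation.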

However, perhaps the most interesting challenge is how to model inter-locus incompatibility. Within-locus mutations are currently addressed using substitution/indel models in phylogenetic tree-building, but the special focus of networks is on the inter-locus patterns, about which we know much less in terms of appropriate modelling.

Axel Janke and Katharina Huber finished the day by leading discussions on why so few people currently use networks in phylogenetics, and the obvious cultural divide between mathematicians and biologists, respectively. The biologists seemed to dominate the first discussion and the mathematicians the second one.

In the former case, the main conclusion from the discussion was that the current phylogenetic "culture" is focussed so strongly on trees that the extra benefit of using a network is not obvious to practitioners. Indeed, there is still considerable focus on getting people to think in terms of trees rather than linear evolution, so that moving on to the complexity of networks simply confounds the situation. Suggestions were forthcoming about how we could be proactive in addressing this issue, including increasing the profile of networks in journals, but also the need to provide more biological information from analyses than merely the network topology.

As for the cultural divide, this is a long-standing issue that arises from the different thought processes involved in mathematics and empirical science, and the consequent differences in language. The consensus was that there are no hurdles that can't be overcome given sufficient time and patience. Moreover, trans-disciplinary people are becoming more common, which nullifies many of the potential problems.

So, a productive start to the workshop was made, which bodes well for the rest of the week. Sadly, this was the sunniest day since I arrived in the Netherlands, and I spent it indoors!

Sunday, October 14, 2012

Workshop: The Future of Phylogenetic Networks

I am currently sitting in a hotel room in Leiden (in the Netherlands), ready to take part tomorrow in the above workshop, which is sponsored by the Lorentz Center.

The workshop has been organized by Steven Kelk, Leo van Iersel, Leen Stougie and myself, and the program and abstracts can be found here. It runs for the whole week, 15 Oct - 19 Oct 2012. We have gathered together a real wealth of talent in the field, both biologists and mathematicians, and so we are expecting a productive week for everyone involved. In particular, the weather is predicted to be changeable during the workshop, which is to be expected in northern Europe in October, and so we shall all have to stay indoors and actually discuss phylogenetic networks.

I am hoping to add some blog posts based on what happens at the workshop, as it proceeds. We have, unfortunately, had a couple of "no shows", but otherwise things are proceeding smoothly.

Thursday, October 11, 2012

An open question about computational complexity

This is a guest blog post by:

Jesper Jansson

Laboratory of Mathematical Bioinformatics, Kyoto University, Japan

Here is an open problem for people interested in computational complexity issues related to phylogenetic networks.

In a recent paper we introduced a parameter called the "minimum spread" that measures a kind of structural complexity of phylogenetic networks:
T. Asano, J. Jansson, K. Sadakane, R. Uehara, G. Valiente (2012) Faster computation of the Robinson-Foulds distance between phylogenetic networks. Information Sciences 197: 77-90.

The definition is as follows:

The "minimum spread" of a rooted phylogenetic network N is the smallest integer x such that the leaves of N can be relabeled by distinct positive integers in a way that at every node u in N, the set of all leaf descendants of u forms at most x consecutive intervals.

For example, any phylogenetic tree has minimum spread 1 because if we do a depth-first traversal of the tree and number the leaves in the order that they are discovered, then at each node, the set of leaf descendants corresponds to a single consecutive interval. This property was used in, for example, Day's algorithm from 1985 for comparing phylogenetic trees and constructing a strict consensus tree.
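The definition and the tree example can be made concrete with a small sketch. It computes the spread of one given leaf labeling (the largest number of consecutive intervals formed at any node) and shows that a depth-first numbering of a tree gives spread 1. The dictionary representation of the network and all function names are hypothetical choices for illustration.

```python
def leaf_sets(children, root):
    """Map every node to the set of leaves below it (memoized, so shared nodes are fine)."""
    memo = {}
    def visit(u):
        if u not in memo:
            kids = children.get(u, [])
            memo[u] = frozenset([u]) if not kids else frozenset().union(*(visit(c) for c in kids))
        return memo[u]
    visit(root)
    return memo

def count_intervals(labels):
    """Number of maximal runs of consecutive integers in a collection of labels."""
    s = set(labels)
    return sum(1 for x in s if x - 1 not in s)  # count the start of each run

def spread(children, root, labeling):
    """Largest number of intervals formed by the leaf descendants of any node."""
    return max(
        count_intervals(labeling[leaf] for leaf in leaves)
        for leaves in leaf_sets(children, root).values()
    )

def dfs_labeling(children, root):
    """Number the leaves 1, 2, 3, ... in the order a depth-first traversal discovers them."""
    labeling = {}
    def visit(u):
        kids = children.get(u, [])
        if not kids:
            labeling.setdefault(u, len(labeling) + 1)
        for c in kids:
            visit(c)
    visit(root)
    return labeling

# The rooted tree ((A,B),(C,D)), written as node -> list of children
tree = {"r": ["x", "y"], "x": ["A", "B"], "y": ["C", "D"]}
good = dfs_labeling(tree, "r")
bad = {"A": 1, "C": 2, "B": 3, "D": 4}
print(spread(tree, "r", good), spread(tree, "r", bad))
```

With the depth-first numbering every node covers a single interval, whereas the second labeling splits the leaves below node x into the two intervals {1} and {3}.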

Similarly, any level-k phylogenetic network has minimum spread at most k+1 (see our paper for the proof). Moreover, any "leaf-outerplanar network" has minimum spread 1, where a "leaf-outerplanar network" is a network that admits a non-crossing layout in the plane with the root (if any) and all leaves lying on the outer face. Today's existing software typically outputs such networks. So, for certain classes of phylogenetic networks, we automatically get a nice upper bound on the minimum spread.

Having a small minimum spread means that the phylogenetic network is "tree-like" in the sense that its cluster collection has a space-efficient representation. But are compact representations of the clusters in a network useful?

Well, they can be employed to compare phylogenetic networks quickly, for example when using the Robinson-Foulds distance to measure the dissimilarity between two phylogenetic networks. There may be other applications, too. On the other hand, if a phylogenetic network is "chaotic" and non-tree-like then the minimum spread will not be a helpful parameter when looking for a compact encoding of its branching information.
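As a rough illustration of why interval encodings help, the sketch below stores each cluster as a short tuple of (low, high) intervals and compares two cluster collections by their symmetric difference. This is not the algorithm from the paper; the function names are hypothetical, both networks are assumed to share the same leaf numbering, and normalization conventions vary (some definitions halve the count).

```python
def to_intervals(labels):
    """Encode a set of integer leaf labels as a tuple of (low, high) intervals."""
    runs = []
    for x in sorted(labels):
        if runs and x == runs[-1][1] + 1:
            runs[-1][1] = x                 # extend the current run
        else:
            runs.append([x, x])             # start a new run
    return tuple((lo, hi) for lo, hi in runs)


def rf_distance(clusters1, clusters2):
    """Size of the symmetric difference of two cluster collections.

    Assumption: both collections use the same leaf numbering. No halving
    or other normalization is applied here.
    """
    enc1 = {to_intervals(c) for c in clusters1}
    enc2 = {to_intervals(c) for c in clusters2}
    return len(enc1 ^ enc2)


# Two hypothetical cluster collections on leaves 1..5.
n1 = [{1, 2}, {1, 2, 3}, {4, 5}, {1, 2, 3, 4, 5}]
n2 = [{2, 3}, {1, 2, 3}, {4, 5}, {1, 2, 3, 4, 5}]
print(to_intervals({1, 2, 5}))   # ((1, 2), (5, 5))
print(rf_distance(n1, n2))       # 2
```

With spread x, every cluster needs at most x pairs of integers, regardless of how many leaves it contains.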

At this point in time, not much is known about how to compute the minimum spread efficiently. As an example, consider the class of level-k networks for any fixed k > 1. According to Lemma 6 in our paper, we can always find a leaf relabeling function that yields spread at most k+1 in linear time, but that might not be the minimum possible for some particular level-k network.
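For very small networks, the minimum spread can at least be found by exhaustive search, which gives a baseline to test faster ideas against. The sketch below is an assumption-laden illustration: it takes the cluster collection as given (sets of leaf names), tries every relabeling with the integers 1..n (gaps in the labels can never reduce an interval count, so this loses nothing), and runs in factorial time.

```python
from itertools import permutations


def count_intervals(labels):
    """Number of maximal runs of consecutive integers in a set of labels."""
    s = sorted(labels)
    if not s:
        return 0
    return 1 + sum(1 for a, b in zip(s, s[1:]) if b != a + 1)


def minimum_spread(leaves, clusters):
    """Brute-force minimum spread: try all n! relabelings of the leaves."""
    best = None
    for perm in permutations(range(1, len(leaves) + 1)):
        relabel = dict(zip(leaves, perm))
        worst = max(count_intervals({relabel[x] for x in c}) for c in clusters)
        if best is None or worst < best:
            best = worst
    return best


# Hypothetical network in which every pair of the three leaves forms a cluster
# (each leaf sits below a reticulation); one pair must always be non-adjacent.
net_clusters = [{"a", "b"}, {"b", "c"}, {"a", "c"}, {"a", "b", "c"}]
print(minimum_spread(["a", "b", "c"], net_clusters))  # 2

# Clusters of a four-leaf tree, for comparison.
tree_clusters = [{"a", "b"}, {"c", "d"}, {"a", "b", "c", "d"}]
print(minimum_spread(["a", "b", "c", "d"], tree_clusters))  # 1
```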

As observed by Sylvain Guillemot and Philippe Gambette (independently of each other), a related result for the k-Consecutive Ones Problem implies that computing the minimum spread of an arbitrary phylogenetic network is NP-hard in the general case, although we can expect it to be easier when restricted to special cases:
P. W. Goldberg, M. C. Golumbic, H. Kaplan, R. Shamir (1995) Four strikes against physical mapping of DNA. Journal of Computational Biology 2: 139-152.

In summary, the following is still open:

What is the computational complexity of computing the minimum spread when restricted to particular classes of phylogenetic networks?

Monday, October 8, 2012

Open questions about evolutionary networks, part 3

There are a number of issues that have been of interest to the phylogenetics community with regard to the construction of evolutionary trees that have not yet been addressed for evolutionary networks. These can be considered to be "open questions" — ones that need widespread discussion at some stage, either by biologists or by computational scientists (or both). This blog post finishes my list of some of these topics (see Part 1 and Part 2).

Robustness of branch/reticulation estimates

It is de rigueur in the world of phylogenetic tree building to pepper the tree branches with bootstrap values or posterior probabilities, or frequently both, especially if these estimates are >50%. On the other hand, these values are almost never seen in the world of phylogenetic networks.

If there is a direct link between the network and some character-state data, then bootstrap values can be calculated for a network in the same manner as for a tree: one simply builds many networks from the re-sampled character data. However, this procedure may be computationally infeasible if the network method does not have a practical running time, because that running time is multiplied by the number of bootstrap replicates.
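The re-sampling step itself is simple, as the following minimal sketch suggests. The alignment format (a dict of equal-length strings) and all function names are assumptions for illustration. In particular, column_splits is only a toy stand-in for a real network method: it records the split induced by each two-state column, and would be replaced by whatever method actually builds the network.

```python
import random
from collections import Counter


def resample_columns(alignment, rng):
    """Draw alignment columns with replacement (one bootstrap pseudo-replicate)."""
    taxa = list(alignment)
    n_sites = len(alignment[taxa[0]])
    cols = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {t: "".join(alignment[t][i] for i in cols) for t in taxa}


def column_splits(alignment):
    """Toy stand-in for a network method: splits induced by two-state columns."""
    taxa = list(alignment)
    n_sites = len(alignment[taxa[0]])
    splits = set()
    for i in range(n_sites):
        if len({alignment[t][i] for t in taxa}) == 2:
            ref = alignment[taxa[0]][i]
            # Represent the split by the side that excludes the first taxon.
            splits.add(frozenset(t for t in taxa if alignment[t][i] != ref))
    return splits


def bootstrap_support(alignment, build_splits, replicates=100, seed=1):
    """Fraction of replicates in which each split is recovered."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(replicates):
        counts.update(build_splits(resample_columns(alignment, rng)))
    return {split: n / replicates for split, n in counts.items()}


# Hypothetical four-taxon alignment with conflicting columns.
alignment = {"a": "AAGTC", "b": "AAGTT", "c": "CCGAT", "d": "CCGAC"}
for split, value in bootstrap_support(alignment, column_splits).items():
    print(sorted(split), value)
```

The cost is the number of replicates times the cost of one call to build_splits, which is where a slow network method becomes the limiting factor.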

Moreover, this procedure is not necessarily straightforward for other types of data from which we might build a network. For example, if we are building a network by minimizing the number of reticulations needed to reconcile a set of conflicting trees, the application of the bootstrap has not yet been evaluated. The computational focus to date has been on the optimization problem, not on the re-sampling problem. And, of course, in the absence of a likelihood model for reticulation events, posterior probabilities cannot be calculated at all.

So, this is another area where the lack of methods commonly associated with tree building seems to be a handicap for the widespread acceptance of network-based methodology.

Can biologists correctly interpret networks?

I have used this quote in an earlier blog post, but it is relevant again here. Baum and Smith (2012, Tree Thinking: An Introduction to Phylogenetic Biology) have noted the following:

"We do not know why it should be so, but we have learned from working with thousands of students that, without contrary training, people tend to have a one-dimensional and progressive view of evolution. We tend to tell evolution as a story with a beginning, a middle, and an end. Against that backdrop, phylogenetic trees are challenging; they are not linear but branching and fractal, with one beginning and many equally valid ends. Tree thinking is, in short, counterintuitive."

This is a well-studied problem. For example, there have been a number of studies of students taking introductory biology courses at tertiary institutions (mostly in the U.S.A.), aimed at identifying the "major misconceptions" entertained by these students. Certain basic problems are discussed by almost all of the authors concerned (both inside and outside the U.S.A.). I have written more extensively on this topic in a post at the Scientopia blog (Ambiguity in phylogenies), which you can read if you are unfamiliar with the current state of affairs. That blog post lists most of the important issues as well as the available literature.

That evolution professionals often suffer the same sort of problem is also well known. I have written more extensively on this topic in a previous post at this blog (Evolutionary trees: old wine in new bottles?). This blog post also lists the relevant literature.

What is worse, some professional organizations apparently know no better. For example, the Federation of American Societies for Experimental Biology (FASEB), which describes itself as "the policy voice of biological and biomedical researchers" in the U.S.A., has this Advocacy Card on their web site:

FASEB was also giving away similar bumper stickers at the recent 20th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB — July 2012, in Long Beach, CA), as discussed at the Byte Size Biology blog. Clearly, this image confounds linear evolution with tree-based evolution — this distinction is crucial to phylogenetic analysis, and yet confusion about these two things is rampant.

This leads me to an obvious question: if people have so much trouble going from a linear view of evolution to a tree-based view, are they going to have even more trouble going to a network-based view?

I cannot answer this question (yet). At one extreme, maybe the big conceptual leap is going from a chain to a tree, and a network is just a complicated tree, so that the conceptual leap is not great. Alternatively, maybe a tree is difficult because it is a set of linked and overlapping chains, and therefore a network is very difficult because it is a set of linked and overlapping trees. Maybe reality will turn out to be somewhere in between these two extremes.

There are at least two issues that are likely to be of importance here, in addition to those concerned with trees:

it is difficult to recognize monophyletic groups (clades) in a network, because the ancestry of any one taxon may be complicated (eg. what is a Most Recent Common Ancestor in a reticulated network? — see this blog post);
it is difficult to distinguish the different possible causes of reticulations (recombination, hybridization, HGT).

We will presumably find out how difficult things are after we have developed a set of widely used methods for constructing evolutionary networks.

Saturday, October 6, 2012

Open questions about evolutionary networks, part 2

Randomness, which is expected to create stochastic variation (such as homoplasy), but which may also be due to bias (eg. selection);
Rooting, with different "gene trees" being rooted in different places; and
Reticulation, which can have any one of several causes (eg. hybridization, HGT, recombination).

If we want an evolutionary network to display only Reticulation then we need to deal with the first two issues, either before-hand or at the same time.

I have previously discussed published examples in which several trees have been presented, from different gene segments, that differ from each other in the location of their outgroup root — eg Figures 4.7 and 4.27 of my book (Introduction to Phylogenetic Networks), and also in the Grass Phylogeny Working Group dataset (see this blog post). In at least one of these cases, there are no reticulate evolutionary events at all, merely an uncertain root. That is, a network was constructed showing putative hybridizations and yet the only evolutionary pattern in the data was that the single unrooted species tree had different roots in the different gene trees.

In all of these cases, it is difficult to present an evolutionary network, because many of the resulting reticulations reflect the differences in the outgroup roots rather than true evolutionary reticulation events. Clearly, we cannot accept a situation where incompatibility among the trees is created by an uncertain root, rather than by conflicting signals due to reticulation processes. This is further discussed in the next section.

Randomness refers to uncertainty in any of the relationships depicted by the tree. Stochastic variation has long been recognized in phylogenetics, and it is the principal issue that most tree-building methods try to address in their algorithms. Biologically, stochastic variation usually arises from short evolutionary intervals (represented as short branch lengths in the tree), but may also arise from inadequate tree-building models, etc. It is the problem that branch-support estimates are designed to quantify, such as bootstrap values or posterior probabilities.

In the "normal" statistical world, random data variation is assumed to be associated with estimation errors. For phylogenetic data, these might include incorrect data (eg. contamination), inappropriate sampling, and model mis-specification. Alternatively, these errors might lead to bias rather than random variation. If so, then the sources of bias should be dealt with via exploratory data analysis, and the offending information can then be corrected or deleted.

However, when we are specifically trying to study reticulate evolution, there will also be many possible biological causes of data conflicts, which are not the result of either reticulation or estimation errors, such as homoplasy (parallelism, convergence, reversal), duplication/loss, and various complex molecular activities (such as sequence inversion, duplication, and transposition). All of these issues need to be dealt with under the concept of "non-reticulation variation".

Separating reticulation-caused data conflict from non-reticulation data conflict requires a null model for reticulation. This is discussed below.

Standardizing the root

Rooting is, to my mind, a problem that has not yet been dealt with properly in phylogenetics. I find that the differences among a set of gene trees are often little more than the relative location of the root (the common ancestor). That is, the unrooted gene trees are (almost) identical, but they have been rooted in somewhat different places (often not too far from each other). In one sense, this is simply Randomness occurring with respect to the root. However, its effect can be great, because it can potentially affect all of the rest of the network, whereas Randomness in most other locations will have only local effects on the topology.

Situations where incompatibility among trees is created by an uncertain root, rather than by conflicting signals due to reticulation processes, can be dealt with by pre-processing of the data (prior to network analysis). Here, I will make a few suggestions, just to get the ball rolling.

If we have a set of "gene trees", then problems with incompatible rooting might be dealt with using polychotomies. That is, we could try to create a set of rooted gene trees with the "same" root by deleting conflicting basal edges from the tree. For example, an algorithm might look like this:

unroot all of the gene trees
find the most-common root — the root location in an unrooted tree defines a split, so the most-common root will be the root-split that occurs in the largest number of trees (unless there are multiple outgroup taxa that make the ingroup non-monophyletic)
any rooted gene tree consistent with that root (displays that split) can be used unmodified
any gene tree with a nearby root could then be modified so that some of its edges are contracted into a polychotomy until the unrooted tree is consistent with the common root, and the resulting less-refined tree would then be used as the rooted tree (obviously, it would be necessary to explicitly define "nearby")
the remaining gene trees would then be set aside and not used in the network analysis.
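The first three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: trees are nested tuples with a binary root, the helper names are hypothetical, a tree counts as consistent only if its own root split matches the most-common one, and ties are broken by whichever split is seen first. The contraction and discarding steps are left out because they depend on how "nearby" is defined.

```python
from collections import Counter


def leaves(node):
    """Set of leaf names below a node of a nested-tuple tree."""
    if isinstance(node, tuple):
        out = set()
        for child in node:
            out |= leaves(child)
        return frozenset(out)
    return frozenset([node])


def root_split(tree):
    """Bipartition of the leaves induced by a binary root."""
    if not isinstance(tree, tuple) or len(tree) != 2:
        raise ValueError("expected a rooted tree with a binary root")
    left, right = tree
    return frozenset([leaves(left), leaves(right)])


def partition_by_root(trees):
    """Return (most-common root split, trees sharing it, remaining trees)."""
    counts = Counter(root_split(t) for t in trees)
    common, _ = counts.most_common(1)[0]
    consistent = [t for t in trees if root_split(t) == common]
    others = [t for t in trees if root_split(t) != common]
    return common, consistent, others


# Three hypothetical gene trees with outgroup "o"; the third is rooted elsewhere.
t1 = ("o", (("a", "b"), ("c", "d")))
t2 = ("o", (("a", "c"), ("b", "d")))
t3 = (("o", "a"), ("b", ("c", "d")))
common, consistent, others = partition_by_root([t1, t2, t3])
print(len(consistent), len(others))  # 2 1
```

The trees in others are the candidates for edge contraction or for being set aside.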

This algorithm might work more often than not in practice, although it is easy to think of situations where it will be a very uncertain procedure.

If the ingroup is not monophyletic, then the biologist should fix this before proceeding with the network analysis. This is a "biological problem" of sampling, not a mathematical one — perhaps the problem arises from deep coalescence, for example. If there is no clear "most-common root" among the trees, then perhaps we could define an "average" or centroid root of some sort. We would then proceed with the rest of the method.

An alternative to this "polychotomy method" might be to use the coalescent to construct an "approximate" species tree from the multiple gene trees (there are now several methods to do this), and then in the network analysis we could allow the gene trees to differ from the species tree only with respect to the poorly supported branches in the species tree. That is, we would use the well-supported parts of the coalescent tree as a backbone common to all of the gene trees, and for the uncertain parts we would use each of the gene trees. However, I am not certain of the applicability of the coalescent to higher taxa (as opposed to closely related species).

I have often thought that duplication-loss is another potential cause of problems with the root. It is not immediately obvious how to approach this, but some suggestions have been made by Burleigh et al. (2011).

A different strategy would be to try all possible roots and see which one(s) minimize the network complexity. This might be computationally intensive, depending on the size of the dataset and the network method used. It might be necessary to restrict the roots tested to those observed among the input trees.

Null models for reticulation

Once we have the root standardized for the dataset, we are then set the task of separating reticulation-caused data conflict from non-reticulation data conflict. This requires a null model for data conflict — any data conflict that cannot be accommodated by the null model is a candidate for explanation as the result of a reticulation event.

Looking at the literature, it seems to me that the most commonly accepted null model is deep coalescence (incomplete lineage sorting) (Meng and Kubatko 2009; Kubatko and Meng 2010). For example, a maximum-likelihood method has been developed that models hybridization in the presence of deep coalescence (Kubatko 2009). One can also use the coalescent as an optimality criterion to choose among alternative networks, with lineage sorting under the coalescent as the null hypothesis (Huson et al. 2005; Buckley et al. 2006; Than et al. 2007; Lyngsø et al. 2008; Joly et al. 2009).

However, the sole use of deep coalescence effectively ignores the other non-reticulation causes of data conflict, as listed above (under Separating randomness and rooting from reticulation). Now, I suppose that it is possible that this approach will work in practice, but it seems unlikely to me that this will be so. Effectively, this approach assumes that the gene trees correspond to the true underlying coalescent trees. This is unlikely because the gene trees are inferred and therefore can be incorrect, due to the other (listed) non-reticulation causes of data conflict. Moreover, if there are multiple types of reticulation event occurring then the approach might fail. For example, if one wishes to study hybridization, then the coalescence methods assume that recombination occurs only between and not within the regions used to infer the gene trees, which is also unlikely.

So, a more comprehensive null model seems to be needed, one that includes more than simply traditional statistical randomness plus deep coalescence. The default expectation at the moment seems to be that deep coalescence occurs above the species level, so that all data sets should be tree-like, whereas the objective here is to detect the non-tree-like parts of evolutionary history.

Dealing with stochastic error and bias

In addition to null models, we may also need pre-processing to deal with stochastic error and bias. There is a limit to what can be done with a single null model, and phylogenetic data are rarely simple. Here, I make a few suggestions, once again to start some discussion.

If we have a set of "gene trees", then perhaps the most obvious approach is to delete uncertain edges. That is, they would appear as polychotomies in the gene trees. This allows refined versions of these trees to be represented in the network, rather than requiring extra edges in a network to accommodate all of them. An alternative is to weight all of the edges with respect to their data "support", with the expectation that poorly supported edges would only appear in the network if they are consistently supported across a number of the gene trees.
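Deleting uncertain edges amounts to contracting every internal edge whose support falls below some threshold. A minimal sketch follows, assuming a hypothetical encoding in which a leaf is a string and an internal node is a pair (support, list of children), with the support value referring to the edge above that node. The threshold of 70 is arbitrary.

```python
def _collapse(node, threshold, is_root):
    """Return the list of nodes that replace `node` after contraction."""
    if isinstance(node, str):                  # leaf: nothing to contract
        return [node]
    support, children = node
    kept = []
    for child in children:
        kept.extend(_collapse(child, threshold, is_root=False))
    if not is_root and support < threshold:
        return kept                            # splice children into the parent
    return [(support, kept)]


def contract_weak_edges(tree, threshold):
    """Contract internal edges with support below `threshold` into polychotomies."""
    return _collapse(tree, threshold, is_root=True)[0]


# Hypothetical gene tree: the edge above (c, (d, e)) has only 40% support.
tree = (100, [(95, ["a", "b"]), (40, ["c", (80, ["d", "e"])])])
print(contract_weak_edges(tree, 70))
# (100, [(95, ['a', 'b']), 'c', (80, ['d', 'e'])])
```

The root is never contracted, and well-supported clades nested inside a weak one are kept intact.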

I think that there are two types of support that could be relevant to uncertainty: (1) classic branch support, such as bootstrap values; and (2) the set of multiple equally optimal or nearly optimal trees. These two types coincide in bayesian analysis, as it is currently implemented in phylogenetics, because in bayesian analysis the branch support is derived from the set of nearly optimal trees. I suspect that (2) may be a better idea than (1), because it expresses something about the tree itself rather than each edge alone; and it is used in the SpNet method (Nakhleh et al. 2005), for example, where each gene tree is a consensus tree of several nearly-optimal trees. The appeal of using polychotomies is that it is simple. The main arguments against it may be the work required for the calculations in methods such as maximum likelihood (both parsimony and bayesian analyses do the necessary calculations anyway), and the fact that it may create non-dense sets of triplets (Jansson and Sung 2006), for example.

Another idea might be to delete taxa that have no consistent position among the input trees. The idea here is that biologically we are looking for things like hybridization and HGT, and we are not expecting this to involve any one taxon in combination with many other taxa. Therefore, extremely uncertain positions are unlikely to reflect Reticulation but rather Randomness (or lack of information). Creating polychotomies would lose a lot of information in this situation, and so it would be better to flag these taxa as problematic, and then leave them out of the network analysis. This is basically the concept used for largest common pruned trees (or agreement subtrees), except that here we don't prune the data all the way down to a tree (see Abby et al. 2010). This also seems to be the idea behind the Dendroscope program's option to deal only with clusters that appear in a certain percentage of the trees. The problem with the Dendroscope approach, however, is that a cluster generated by HGT (say) that appears in only one tree will be ignored. It would thus be better to use the variation in position of individual taxa, rather than presence/absence of clusters.

References

Abby S.S., Tannier E., Gouy M., Daubin V. (2010) Detecting lateral gene transfers by statistical reconciliation of phylogenetic forests. BMC Bioinformatics 11: 324.

Buckley T., Cordeiro M., Marshall D., Simon C. (2006) Differentiating between hypotheses of lineage sorting and introgression in New Zealand Alpine cicadas (Maoricicada dugdale). Systematic Biology 55: 411-425.

Burleigh J.G., Bansal M.S., Eulenstein O., Hartmann S., Wehe A., Vision T.J. (2011) Genome-scale phylogenetics: inferring the plant tree of life from 18,896 gene trees. Systematic Biology 60: 117-125.

Huson D.H., Klöpper T., Lockhart P.J., Steel M.A. (2005) Reconstruction of reticulate networks from gene trees. Lecture Notes in Bioinformatics 3500: 233-249.

Jansson J., Sung W.-K. (2006) Inferring a level-1 phylogenetic network from a dense set of rooted triplets. Theoretical Computer Science 363: 60-68.

Joly S., McLenachan P.A., Lockhart P.J. (2009) A statistical approach for distinguishing hybridization and incomplete lineage sorting. American Naturalist 174: E54-E70.

Kubatko L.S. (2009) Identifying hybridization events in the presence of coalescence via model selection. Systematic Biology 58: 478-488.

Kubatko L.S., Meng C. (2010) Accommodating hybridization in a multilocus phylogenetic network. In: Knowles L.L., Kubatko L.S. (eds) Estimating Species Trees: Practical and Theoretical Aspects, pp. 99-113. Wiley-Blackwell, Hoboken NJ.

Lyngsø R.B., Song Y.S., Hein J. (2008) Accurate computation of likelihoods in the coalescent with recombination via parsimony. Lecture Notes in Computer Science 4955: 463-477.

Meng C., Kubatko L.S. (2009) Detecting hybrid speciation in the presence of incomplete lineage sorting using gene tree incongruence: a model. Theoretical Population Biology 75: 35-45.

Nakhleh L., Warnow T., Linder C.R., St John K. (2005) Reconstructing reticulate evolution in species — theory and practice. Journal of Computational Biology 12: 796-811.

Than C., Ruths D., Innan H., Nakhleh L. (2007) Confounding factors in HGT detection: statistical error, coalescent effects, and multiple solutions. Journal of Computational Biology 14: 517-535.

Thursday, October 4, 2012

Open questions about evolutionary networks, part 1

(i) character-state changes (eg. nucleotide substitution, nucleotide insertion / deletion, or their amino acid equivalents), and
(ii) character-block events (eg. inversion, duplication / loss, transposition, recombination, hybridization, horizontal gene transfer).

To date, phylogenetic tree-building has concentrated on (i), and methods have been developed using optimization criteria such as minimum distance, maximum parsimony, maximum likelihood, and bayesian analysis (which, strictly speaking, does not involve an optimization criterion).

Most of the data-display network methods have also been based on optimizing data-type (i), notably the splits-graph methods (see this primer), which conceptually can be seen as based on either maximum parsimony or minimum distance. Moreover, it is possible to optimize the character data directly onto a network by maximizing either the parsimony scores (eg. Hein 1990, 1993; Dickerman 1998; Nakhleh et al. 2005; Jin et al. 2006a, 2007a, 2007b) or the likelihood scores (eg. von Haeseler and Churchill 1993; Strimmer and Moulton 2000; Strimmer et al. 2001; Jin et al. 2006b; Snir and Tuller 2009). The likelihood scores can also be evaluated in a bayesian context (Radice 2011).

However, evolutionary networks can differ from evolutionary trees by explicitly taking into account data-type (ii), either instead of or in addition to (i). So far, maximum parsimony has been the criterion of choice for doing this, in the sense that the available methods minimize the count of the number of events. For example, a large amount of work has been done to minimize the number of reticulation nodes when reconciling a set of incompatible phylogenetic trees, or alternatively minimizing the level (see this blog post).

However, this means that there are currently few available likelihood-based methods that will allow us to build networks directly from quantitative evolutionary models of how non-tree events occur. The most obvious exception here is the recent development of Admixture graphs (see this blog post), some at least of which are based on an approximate maximum-likelihood model (Pickrell and Pritchard 2012).

This seems to be a serious omission, given that model-based methods are among the most widely used of those available for phylogenetic trees, at least among those users who want a robust analysis (Kelchner and Thomas 2006). Likelihood has effectively replaced maximum parsimony as an optimization criterion for tree building. (The quick-and-dirty distance-based methods will probably always out-rank the other methods, because they can be useful as a "first approximation".)

It may not be easy to create likelihood models for non-tree events, perhaps even more so given the number of different types of events that need to be modelled. Nevertheless, the lack of such models seems to be a handicap for the widespread acceptance of network-based methods.

Partitioned models for likelihood analyses

This topic is a direct extension of the previous one. Current likelihood models for tree-building analyses can be applied independently to different partitions of the type-(i) character data, and this partitioning is considered to be a valuable part of any likelihood analysis (eg. Blair and Murphy 2011). Indeed, it is the desirability of model partitioning that seems to be a major component of the increasing move from maximum likelihood to bayesian analysis, as well as the ease of implementing models that deal with heterogeneity among and within lineages (especially relaxed molecular clocks).

Partitioned models allow us to add complexity that can deal with heterogeneity within a dataset (Endicott et al. 2009), by a priori or a posteriori choice of partitions with greater inter- than intra-partition variability in substitution rates. For example, there is substitution-rate heterogeneity within genes (eg. different codon positions in protein-coding genes, paired versus unpaired positions in RNA-coding genes), as well as between genes (e.g. house-keeping genes versus rRNA-coding genes), between coding and non-coding regions (e.g. introns versus exons, as well as transcribed spacers and the mitochondrial control region), and between genomes (e.g. nuclear versus mitochondrial). Failure to correctly account for this heterogeneity can seriously mislead phylogenetic analyses; and automated procedures for devising partition schemes have now been developed (eg. Lanfear et al. 2012).

Partitioning is not a panacea for heterogeneity, of course, and there are potential problems that need to be addressed concerning partition choice and its consequences (see Brown et al. 2010; Marshall 2010; Fan et al. 2011). None of these issues have yet been addressed in the context of evolutionary networks, although there seems to be no barrier to the use of partitioning for network likelihood models. On the other hand, dealing with evolutionary heterogeneity among and within lineages may actually be a bigger problem, given the increased complexity of the lineages in a network.

Mixture models for likelihood analyses

An alternative approach to dealing with heterogeneity is through the use of mixture models. Here, the likelihood of each character is calculated under more than one model, and these likelihoods are then combined. For example, the parameters of several substitution models, as well as the probability with which each model applies to each alignment position, can be determined directly from the data. Such models have been developed for nucleotide (Pagel and Meade 2004) and amino-acid (Le et al. 2008) sequences, but this is otherwise a very under-explored part of phylogenetic analysis. Nevertheless, computer programs are becoming more readily available (eg. Stamatakis 2006).
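The way the per-model likelihoods are combined can be shown with a small sketch: the likelihood of each site is the weighted sum of its likelihoods under the component models, and the log-likelihood of the alignment is the sum of the site logarithms. The numbers below are made up, the function name is hypothetical, and the weights are held fixed here, whereas in practice they would be estimated from the data along with the model parameters.

```python
import math


def mixture_log_likelihood(site_likelihoods, weights):
    """Log-likelihood of an alignment under a mixture of models.

    site_likelihoods[i][k] is the likelihood of site i under model k;
    weights[k] is the probability that model k applies to a site.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    total = 0.0
    for per_model in site_likelihoods:
        site = sum(w * lik for w, lik in zip(weights, per_model))
        total += math.log(site)
    return total


# Two hypothetical sites evaluated under two hypothetical substitution models.
site_likelihoods = [[0.2, 0.4], [0.1, 0.3]]
print(mixture_log_likelihood(site_likelihoods, [0.5, 0.5]))
```

For these numbers the site likelihoods are 0.3 and 0.2, so the result is log(0.06).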

It would presumably be possible to combine data types (i) and (ii) using this approach. Indeed, this has obvious theoretical advantages for networks, although the resulting models may be overly complex. It seems likely that the ability to model, say, hybridization versus recombination, as alternative causes of reticulations in a phylogeny will be a part of any successful attempt to produce a widely used method of phylogenetic analysis.

References

Blair C., Murphy R.W. (2011) Recent trends in molecular phylogenetic analysis: where to next? Journal of Heredity 102: 130-138.

Brown J.M., Hedtke S.M., Lemmon A.R., Moriarty Lemmon E. (2010) When trees grow too long: investigating the causes of highly inaccurate bayesian branch-length estimates. Systematic Biology 59: 145-161.

Dickerman A.W. (1998) Generalizing phylogenetic parsimony from the tree to the forest. Systematic Biology 47: 414-426.

Endicott P., Ho S.Y.W., Metspalu M., Stringer C. (2009) Evaluating the mitochondrial timescale of human evolution. Trends in Ecology and Evolution 24: 515-521.

Fan Y., Wu R., Chen M.-H., Kuo L., Lewis P.O. (2011) Choosing among partition models in bayesian phylogenetics. Molecular Biology and Evolution 28: 523-532.

Hein J. (1990) Reconstructing evolution of sequences subject to recombination using parsimony. Mathematical Biosciences 98: 185-200.

Hein J. (1993) A heuristic method to reconstruct the history of sequences subject to recombination. Journal of Molecular Evolution 36: 396-405.

Jin G., Nakhleh L., Snir S., Tuller T. (2006a) Efficient parsimony-based methods for phylogenetic network reconstruction. Bioinformatics 23: e123-e128.

Jin G., Nakhleh L., Snir S., Tuller T. (2006b) Maximum likelihood of phylogenetic networks. Bioinformatics 22: 2604-2611.

Jin G., Nakhleh L., Snir S., Tuller T. (2007a) Inferring phylogenetic networks by the maximum parsimony criterion: a case study. Molecular Biology and Evolution 24: 324-337.

Jin G., Nakhleh L., Snir S., Tuller T. (2007b) A new linear-time heuristic algorithm for computing the parsimony score of phylogenetic networks: theoretical bounds and empirical performance. Lecture Notes in Bioinformatics 4463: 61-72.

Kelchner S.A., Thomas M.A. (2006) Model use in phylogenetics: nine key questions. Trends in Ecology and Evolution 22: 87-94.

Lanfear R., Calcott B., Ho S.Y.W., Guindon S. (2012) PartitionFinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Molecular Biology and Evolution 29: 1695-1701.

Le S.Q., Lartillot N., Gascuel O. (2008) Phylogenetic mixture models for proteins. Philosophical Transactions of the Royal Society of London, B: Biological Sciences 363: 3965-3976.

Marshall D.C. (2010) Cryptic failure of partitioned bayesian phylogenetic analyses: lost in the land of long trees. Systematic Biology 59: 108-117.

Nakhleh L., Jin G., Zhao F., Mellor-Crummey J. (2005) Reconstructing phylogenetic networks using maximum parsimony. In: Proceedings of the 2005 IEEE Computational Systems Bioinformatics Conference, pp. 93-102. IEEE Computer Society, Washington DC.

Pagel M., Meade A. (2004) A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Systematic Biology 53: 571-581.

Pickrell J.K., Pritchard J.K. (2012) Inference of population splits and mixtures from genome-wide allele frequency data. Unpublished ms (http://arxiv.org/abs/1206.2332).

Radice R. (2011) A Bayesian Approach to Phylogenetic Networks. PhD thesis, University of Bath, UK.

Snir S., Tuller T. (2009) The NET-HMM approach: phylogenetic network inference by combining maximum likelihood and hidden markov models. Journal of Bioinformatics and Computational Biology 7: 625-644.

Stamatakis A. (2006) RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22: 2688-2690.

Strimmer K., Moulton V. (2000) Likelihood analysis of phylogenetic networks using directed graphical methods. Molecular Biology and Evolution 17: 875-881.

Strimmer K., Wiuf C., Moulton V. (2001) Recombination analysis using directed graphical models. Molecular Biology and Evolution 18: 97-99.

von Haeseler A., Churchill G.A. (1993) Network models for sequence evolution. Journal of Molecular Evolution 37: 77-85.

Monday, October 1, 2012

Carnival of Evolution, Number 52 — the Network Edition

Welcome to the 52nd edition of the Carnival of Evolution, hosted here at The Genealogical World of Phylogenetic Networks blog.

Exordium

For those of you not familiar with the Carnival of Evolution, at the beginning of every month the Carnival provides a collection of some of the most interesting of the recent blog posts about biological evolution. The Carnival is hosted by a different blog every month: last month's Carnival can be found at The Stochastic Scientist blog; and next month's Carnival will be hosted by the Sorting Out Science blog at the beginning of November.

The theme for the presentations this month is, of course, phylogenetic networks. You can skip straight on to the blog posts if you are familiar with such networks.

Introduction to phylogenetic networks

For those of you not familiar with phylogenetic networks, the host blog this month is about the use of networks in evolutionary analysis, as a replacement for (or an adjunct to) the usual use of phylogenetic trees. The 46th edition of the Carnival of Evolution (hosted at the Synthetic Daisies blog) provided a good introduction to trees in the study of evolution, which are used as a metaphor for branching genealogical history. In this blog, we take evolutionary trees to the next logical stage — reticulating networks.

Networks have received considerable attention in the recent biological literature, not least in microbiology (where horizontal gene transfer is often considered to be rampant) and botany (where hybridization has always been considered to be common). They have also received increasing attention in the computational sciences.

Networks are acknowledged to have two main uses within phylogenetics: (i) exploratory data analysis, in which conflicting data patterns are displayed and their quality and quantity assessed; and (ii) evolutionary analysis, in which the historical (genealogical) patterns involve not only vertical descent (parent to offspring) but also reticulations due to horizontal processes (such as horizontal gene transfer, hybridization, recombination, and genome fusion).

A network is thus more general than a tree (or "more complete"), because it is a tree that also has reticulations. For example, the Decision Tree presented with the Carnival of Evolution #46 might look like this if it were a network:

Note that for some of the leaves there are multiple paths through the network from the root, whereas a tree is restricted to a single path between any two points. This is the essence of why networks are being introduced into evolutionary studies, because the evolutionary history of many organisms involves complex pathways of "descent with modification" (as Darwin put it).
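The difference between a single path and multiple paths can be sketched in a few lines of Python. This is only an illustration: the two small graphs and their node names (`root`, `A`, `B`, `x`, `y`, `z`) are hypothetical, and `count_paths` is a helper invented for this example, not part of any phylogenetics package.

```python
def count_paths(children, node, leaf):
    """Count the directed paths from `node` down to `leaf`."""
    if node == leaf:
        return 1
    return sum(count_paths(children, child, leaf)
               for child in children.get(node, []))

# Hypothetical rooted tree: every leaf has exactly one parent.
tree = {"root": ["A", "B"], "A": ["x", "y"], "B": ["z"]}

# The same graph with one reticulation added: leaf "y" now has two parents.
network = {"root": ["A", "B"], "A": ["x", "y"], "B": ["y", "z"]}

for name, graph in (("tree", tree), ("network", network)):
    counts = {leaf: count_paths(graph, "root", leaf) for leaf in "xyz"}
    print(name, counts)
```

In the tree every leaf is reached by one path from the root, whereas in the network the reticulate leaf `y` is reached by two.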

If you would like to know more about this blog, then the simplest access is via the various Pages, listed at the top of the right-hand column, which gather together the blog posts related to particular topics. The most popular blog posts for non-specialists are in the History and the Analyses sections, as well as in the Tattoo section; so please take a look around the blog while you are here.

This month's Carnival posts

For this edition of the Carnival of Evolution, the featured posts have been incorporated into a series of phylogenetic networks. Each network represents a typical topology that you might encounter in the scientific literature, illustrating the relationships between the blog posts. I have been somewhat selective this month, by not including anything about the ongoing arguments between evolutionists and creationists.

Posts about networks

To get the Carnival off on the right foot, we will start with a collection of blog posts that are themselves about biological networks. They are all different, and not all of them involve phylogenetic networks, so they display the diversity of what networks are used for in biology. The network shown here is a NeighborNet, which connects the topics based on overall similarity.

Franklin Harold, guesting at the Small Things Considered blog, discusses the evolution of the eukaryotes in Begetting the Eukarya: an unexpected light. The eukaryotes originated from the fusion of several genomes, so that the Tree of Life is not a tree at that point in evolutionary history. Sadly, in this blog post the suggested alternative image to a tree is not a network but "a pointed Gothic arch thrusting out of the prokaryotic underbrush", which you will have to check out for yourselves.

John Hawks, at his personal blog, introduces us to the world of human evolution when he explores Denisova at high coverage. The archaic Neandertals and Denisovans have recently been shown to have been involved in gene flow with early modern humans, which re-writes the story of human evolutionary history.

The Genealogical World of Phylogenetic Networks then asks, in light of this information, Why do we still use trees for the Neandertal genealogy? Clearly, a network is more appropriate than a tree for phylogenetic analysis when there is horizontal gene flow.

Razib Khan, at the Gene Expression blog, develops this theme in Across the sea of grass: how Northern Europeans got to be ~10% Northeast Asian. He takes us into more modern times when he ponders recent evidence concerning the evolutionary relationships between Neolithic farmer migrants and the indigenous Mesolithic southern European populations.

Dienekes Pontikos, at his Anthropology Blog, then ponders Structural stability and ancient connections between languages, in which a phylogenetic network is used to discover unexpected geographic clusters of similarity among the families of modern languages.

On a related topic, Jeremy Yoder, The Molecular Ecologist, considers Genes...in...space! by looking at ways to summarize multivariate geographical patterns among human genotypes. Sadly, this inadvertently demonstrates just why one should not use Principal Components Analysis for this type of data analysis: the right idea but the wrong tool. The second axis of the ordination is frequently nothing more than a quadratic function of the first axis (i.e. a mathematical artifact), as shown clearly by two of the three ordinations reproduced in the blog post. This is one of several reasons why we should use a network instead of PCA.
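The quadratic artifact can be reproduced with a minimal sketch in plain Python. Everything here is an assumption made for illustration: 20 hypothetical populations spaced evenly along a single geographic gradient, a similarity that declines linearly with distance, and helper names (`double_center`, `top_eigenvector`, `correlation`) invented for this example. The ordination is an eigen-analysis of the double-centred similarity matrix, which is the same kind of calculation that underlies PCA-type ordinations, done here by simple power iteration rather than with a library routine.

```python
import math

def double_center(m):
    """Subtract row and column means, then add back the grand mean."""
    n = len(m)
    row = [sum(r) / n for r in m]
    col = [sum(m[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    return [[m[i][j] - row[i] - col[j] + grand for j in range(n)]
            for i in range(n)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(r, v)) for r in m]

def top_eigenvector(m, iters=500):
    """Power iteration: returns the largest eigenvalue and its unit eigenvector."""
    v = [math.sin(i + 1.0) for i in range(len(m))]  # arbitrary fixed start
    for _ in range(iters):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    lam = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return lam, v

def correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Assumed data: similarity falls off linearly along a one-dimensional gradient.
n = 20
similarity = [[1.0 - abs(i - j) / n for j in range(n)] for i in range(n)]
centred = double_center(similarity)

# First axis, then remove it (deflation) and extract the second axis.
lam1, axis1 = top_eigenvector(centred)
deflated = [[centred[i][j] - lam1 * axis1[i] * axis1[j] for j in range(n)]
            for i in range(n)]
lam2, axis2 = top_eigenvector(deflated)

linear_r = correlation(axis1, axis2)
quadratic_r2 = correlation([a * a for a in axis1], axis2) ** 2
print("correlation of axis 2 with axis 1:", round(linear_r, 6))
print("R^2 of axis 2 against (axis 1)^2:", round(quadratic_r2, 6))
```

In this sketch the two axes are linearly uncorrelated, yet the second axis is almost perfectly predicted by the square of the first. The data contain only one real gradient, so the second axis adds no independent information.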

Bradly Alicea, writing at the Synthetic Daisies blog, then moves away from phylogenetic networks and into gene regulatory networks, with Cascades in common: biological network function in evolution. He connects these networks to biological evolution by pointing out that they contribute to both adaptive variation and to variation between species.

Finally, the ever-present Bjørn Østman, from the Pleiotropy blog, considers Epistasis in evolution. He uses genetic interaction networks to look at epistatic interactions (which are non-additive interaction effects resulting from mutations) and their role in evolution, particularly in adaptation and speciation.

Human evolution

Human evolution is always of interest to humans, and so there is a steady stream of blog posts about this topic every month. The network shown here is a Recombination network, with the converging pair of arrows indicating, in this case, a topic that combines two of the others.

This month, Kathy Orlinsky, writing as The Stochastic Scientist, discusses recent evidence that Our methylomes make us human. One explanation for how humans and chimpanzees can be so different when their genomes are so similar is that the DNA methylation of their genes is different.

On a slightly more tasty note, Heather Pringle, from The Last Word On Nothing blog, considers The sweetness of human evolution. The search for a honey diet has probably played a much more complex (and interesting) role in human history than you have heretofore realized.

Writing at the Nothing in Biology Makes Sense! blog, Jonathan Yoder then muses, appropriately enough, about the Evolution of diabetes? Type 2 Diabetes is a highly multifactorial disorder, so don't expect an answer any time soon.

Gunnar De Winter, masquerading as The Beast, The Bard and The Bot, then contemplates fatty brains and what their genes might tell us about history, in Once upon a (complicated) time in Africa.

Next, Faye Flam, from the Planet of the Apes, ponders What whales tell us about the evolution of menopause. Very few species have a long post-reproductive period for females, and these include several species of whale as well as humans, so a comparative analysis might be very revealing.

Ed Yong, over at the Not Exactly Rocket Science blog, continues the cetacean theme with Same gene linked to bigger brains of dolphins and primates — in this case, the title says it all.

Finally, Helen Thompson and Shankar Vedantam, writing at The Salt blog, reflect on How food and clothing size labels affect what we eat and what we wear. This is my personal favorite post of the month, because it tells us everything we need to know about human evolution.

The study of heterozygosis

Bodies are interesting things, especially the differences between males and females, and this month we have a few blog posts about that topic. The network shown here is a Hybridization Network, in which the paired arrows indicate three hybridization events, in this case showing hybridization of topics based on sex (male versus female).

PZ Myers, over at the Pharyngula blog, shows a great interest in reconstructing reproductive anatomy, in O brave new world that has such penises in't. I'm sure that you will be just as interested in regrowing penises as he is.

Emily Weigel, on the other hand, shows that the Beacon blog is more interested in Maternal effects — mothers have more of an effect on their offspring than most daughters ever want to admit.

As a compromise position, Jerry Coyne, at the Why Evolution is True blog, contemplates A gynandromorph cardinal. Externally, one half of the bird is male and the other half is female, divided lengthwise! A similar thing happens in fruitflies, although they take it to the extreme.

Marc Srour, over at the Teaching Biology blog, then muses about Wolbachia: the ubiquitous male-killing, feminising parasite, which has at least four different ways to alter the insect host's reproduction in order to increase its own maternal transfer.

Finally, Suzanne Elvidge, at the Genome Engineering blog, reports about Men on your mind: male DNA in women’s brains — this concerns the first description of male microchimerism in the female human brain.

Evolutionary theory

Theory is either fascinating or terribly dull, depending on whether you like to spend your time in the pub or in the field. There is room in the world for both types of scientist, and this collection of posts comes from the former group. The network shown here indicates that there are no particularly close relationships among the blog posts.

Joachim Dagg, living in the Mousetrap, starts us off by conducting a Thought experiment about recombination. The resulting conclusion is that the maintenance of sex is a problem distinct and separate from the maintenance of recombination rates.

Jeremy Yoder, still at The Molecular Ecologist blog, reflects on the problem of Isolating isolation by distance — in population genetics, can we distinguish isolation by distance from population structure? The answer appears to be rather complicated.

Ford Denison, at the This Week in Evolution blog, provides us with some thoughts that the editor excised from his book on Darwinian Agriculture, when he asks Biomimicry of forests or trees? The answer is presented as a Galilean dialog between an engineer and a couple of expert biologists.

Andrew Hendry, contributing to the Eco-Evolutionary Dynamics blog, ponders the difficulty of making arguments for biodiversity preservation solely from a consideration of ecosystem services, in Ecosystem disservices and assisted elimination.

Finally, The Genealogical World of Phylogenetic Networks reflects on Metaphors for evolutionary relationships, which surveys the rich world of evocative metaphors used in evolutionary studies.

Evolution in practice

This collection of posts comes from those scientists who have been contemplating the evolving world from inside the lab or out in the field. The network shown here is a Median Network, which simply displays all of the character-state differences between the posts (the central structure is a cube in this case).
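For readers wondering where a cube comes from, here is a small sketch of the median-closure idea behind median networks for binary characters. The four sequences are hypothetical (they are not the data behind the figure), and `median` and `median_closure` are helper names made up for this example.

```python
from itertools import combinations

def median(a, b, c):
    """Majority state at each character for three binary sequences."""
    return tuple(sorted(states)[1] for states in zip(a, b, c))

def median_closure(seqs):
    """Keep adding medians of triples until no new sequence appears."""
    nodes = set(seqs)
    changed = True
    while changed:
        changed = False
        for a, b, c in combinations(sorted(nodes), 3):
            m = median(a, b, c)
            if m not in nodes:
                nodes.add(m)
                changed = True
    return nodes

# Hypothetical data: three binary characters that conflict pairwise.
observed = [(0, 0, 0), (1, 1, 0), (1, 0, 1), (0, 1, 1)]
nodes = median_closure(observed)

# Join any two nodes that differ in exactly one character.
edges = [(a, b) for a, b in combinations(sorted(nodes), 2)
         if sum(x != y for x, y in zip(a, b)) == 1]
print(len(nodes), "nodes,", len(edges), "edges")
```

With these four sequences the closure fills in the four missing corners, giving the 8 nodes and 12 edges of a cube. Every conflicting character contributes its own set of parallel edges.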

The Mostly Open Ocean blog muses about Rapid speciation in starfish. There have been profound changes to life history in the two daughter species arising from the recent speciation event, involving selection on many morphological and physiological traits.

Carl Zimmer, weaving at The Loom blog, revisits one of his favorite experiments in The birth of the new, the rewiring of the old. The experiment is Richard Lenski's 24-year study of evolutionary change in Escherichia coli, which now encompasses an unheard-of 55,000 "generations". The results to date are, to say the least, fascinating.

Ken Weiss, contributing to The Mermaid's Tale, discusses the opposite trend in Evolving...to stay the same? — the horseshoe crab seems to have changed very little for 150 million years.

Jerry Coyne, still at the Why Evolution is True blog, reflects on the same phenomenon in Horseshoe crabs aren’t really "living fossils", but he elaborates instead on some of the differences between the fossil and contemporary species.

Finally, Greg Laden, popping up at the 10,000 Birds blog, contemplates The incredulous New Caledonian crows, which apparently can distinguish the concept of an Unknown Causal Agent from that of a Hidden Causal Agent, which most animal species cannot do.

The ENCODE debacle

The network shown here is a Horizontal Transfer Network, with the two dashed lines showing the transfer of text (quotations) in this case.

Early in the month we saw what may well be the nadir of scientific journalism, when the ENCODE (Encyclopedia of DNA Elements) consortium provided the excuse for a media blitz associated with the co-ordinated release of 30 papers in some of the high-profile genome-oriented journals. Most notably, the media reports focussed almost entirely on the apparently new claim that 80% of human DNA is not "junk" (as opposed to the previous claim that 80% is junk DNA). This new claim rests almost entirely on a re-definition of "junk DNA" rather than any new data about DNA function, so this is not much of a contribution from the ~400 ENCODE people, let alone a good reason for a media bombardment. Not unexpectedly, the blogosphere exploded with outrage at the distortions involved in the media reporting. A few selected blog posts are included here to commemorate the event.

Mike White, sitting in The Finch and Pea pub, sets the scene with ENCODE media fail (or, Where’s the null hypothesis?). Michael Eisen, at the It Is NOT Junk blog, then starts the attack on the media with This 100,000 word post on the ENCODE media bonanza will cure cancer, while Larry Moran, strolling on the Sandwalk, develops the attack with The ENCODE data dump and the responsibility of science journalists. Ryan Gregory, from the Genomicron blog, then weighs in with A slightly different response to today’s ENCODE hype, as does PZ Myers, popping up at the Panda's Thumb, with The ENCODE delusion. Sean Eddy, at his Cryptogenomicron blog, presents a DNA-researcher's perspective by asking incredulously ENCODE says what? Finally, Ewan Birney, the bioinformatician co-ordinating the ENCODE project, presents My own thoughts at his personal blog.

Genome science reporting can only get better, and less embarrassing, from here on. However, the simple fact that several reputable science journal editors got together to orchestrate the release of the papers on the same day, thus unnecessarily delaying the publication of some of the papers (by several months), strikes me as outrageous. (Casey Bergman, at the brilliantly titled I Wish You'd Made Me Angry Earlier blog, discusses this in The cost to science of the ENCODE publication embargo.) The possibility of a media extravaganza seems to have loomed larger in the minds of the editors than did their journals' role in communication among scientists. We take them seriously, so why can't they do the same for us?

Terminus

Well, that's it for this month. While you wait for the next edition, you will find the Carnival of Evolution on Facebook and Twitter, as well as at the official Carnival of Evolution blog. Past posts and future hosts can be found on the Carnival index page.

Next month's Carnival will be hosted at Sorting Out Science. You can submit posts for the next edition using the Carnival submission form (which requires you to log in), or by sending an email to Bjørn Østman.