My Structure experience (I)

Cover image obtained from Puechmaille 2016


Working on something that is not what you like the most is definitely among the hardest things in life, but sometimes (or maybe just once in a lifetime) it can be good. I’m about to finish the second and last thesis of this master’s and, after a successful semester at Harvard (which I consider the best academic months of my life), I decided to do something new. For this second thesis I’m testing the effectiveness of a piece of software. It doesn’t sound biological at all, but if we consider that this program is (or was?) the most commonly used in the world in the field of population genetics, then it starts to make sense.

I’m working with “Structure”. This computer program appeared in 2000, presented in a paper by Pritchard and colleagues. You just have to search for it in Google Scholar and see how many times the paper has been cited to get an idea of the impact this program has had on the field of population genetics. What does Structure do? Well, as you can infer from its name, the program tries to unravel the genetic structure that your data set might have. If you have taken samples from various individuals of a certain species, Structure can tell you whether those individuals are more than just “a group of individuals”, that is, whether they belong to more than one genetically differentiated group. For example, you might be taking blood samples along a transect of several kilometers and treating all the individuals as members of the same population. However, when analyzing the genetic information (with Structure, for example) you might realize that those individuals do not form just one population, but two that are next to each other! Most of the time these genetic differences are not visible to the naked eye, so a genetic analysis is the only way to be sure that we are looking at more than one population.

Structure has been used for different kinds of problems in fields like conservation genetics, speciation and, most of all, hybridization events (especially because of the effect hybridization has on the previous two!). Going back to the transect example where we found two genetically differentiated populations, it’s possible to find an intermediate zone in that transect where both “pure” populations are in contact and where their members interbreed, giving rise to a third, hybrid group or population.

Many studies of hybridization have used Structure. The results are quite clear and easy for anybody to interpret, as you can see in the following figure:

Figure from Pritchard and Wen, 2003.

Each vertical bar (bars of the same color cannot be told apart in the image) represents the genetic information of an individual, and the different colors represent the different groups created by Structure, which we will call clusters from now on, and to which the individuals have been assigned. We may also have categorized the sampled individuals beforehand based on, for example, the zones in which they were found; in the figure, let’s say we found the individuals in zones 1, 2, 3 and 4. We could then conclude that zones 1 and 2 are genetically well differentiated, given that the individuals of each zone are mostly associated with a single color (zone 1 with blue and zone 2 with red). However, zones 3 and 4 look as if they represent only one genetic group, because the individuals of both zones fall almost entirely in the green cluster.
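
To make this reading of the figure concrete, here is a minimal sketch in Python. It is not Structure’s own code or output format: the `results` list, the membership values and the color names are all made up for illustration. It averages the cluster membership of the individuals in each zone and reports which cluster dominates.

```python
from collections import defaultdict

# Hypothetical per-individual results: (zone, [share of blue, red, green]).
# These numbers are invented; they only mimic the pattern described in the text.
results = [
    (1, [0.95, 0.03, 0.02]),
    (1, [0.90, 0.05, 0.05]),
    (2, [0.02, 0.96, 0.02]),
    (2, [0.04, 0.90, 0.06]),
    (3, [0.01, 0.03, 0.96]),
    (4, [0.02, 0.02, 0.96]),
]
clusters = ["blue", "red", "green"]

# Group the membership vectors by sampling zone.
by_zone = defaultdict(list)
for zone, q in results:
    by_zone[zone].append(q)

# For each zone, average every cluster's share and keep the largest one.
dominant = {}
for zone, rows in sorted(by_zone.items()):
    means = [sum(col) / len(rows) for col in zip(*rows)]
    dominant[zone] = clusters[means.index(max(means))]

print(dominant)
```

With these invented values, zones 1 and 2 each get their own color, while zones 3 and 4 both end up dominated by green, which is the pattern that would suggest a single genetic group.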

However, if we look at the image, close to the left end of the red group we’ll find a bar that seems to have half of its length colored green. This individual might be a hybrid produced by a cross between individuals of the red and green clusters. To detect hybrid or introgressed individuals (descendants of hybrids that bred with pure individuals) it is common to set a threshold on the proportion of the genome that comes from a given pure population. In the previous example of the individual whose genotype is half red and half green, this proportion is about 0.5. It is called the admixture proportion. As I mentioned, we can set thresholds to consider an individual as part of a pure population or as a hybrid. For example, we could say that if 90% or more of an individual’s genetic information is assigned to a cluster, then this individual can be considered a member of that cluster. This, of course, depends on each particular study.
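
The threshold rule can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name `classify`, the individuals and their admixture proportions are hypothetical, and the 0.9 cutoff is just the example value from the paragraph above, not a recommendation.

```python
from typing import Sequence

def classify(q: Sequence[float], threshold: float = 0.9) -> str:
    """Label one individual from its admixture proportions (one value per cluster)."""
    if abs(sum(q) - 1.0) > 1e-6:
        raise ValueError("admixture proportions must sum to 1")
    # Index of the cluster that contributes the largest share.
    best = max(range(len(q)), key=lambda k: q[k])
    if q[best] >= threshold:
        return f"cluster {best + 1}"
    # No cluster reaches the threshold: treat the individual as admixed.
    return "admixed"

# Hypothetical individuals; columns are clusters 1 (blue), 2 (red), 3 (green).
individuals = {
    "ind_A": [0.97, 0.02, 0.01],
    "ind_B": [0.01, 0.50, 0.49],  # roughly half red, half green
    "ind_C": [0.03, 0.05, 0.92],
}
labels = {name: classify(q) for name, q in individuals.items()}
print(labels)
```

Changing `threshold` changes who counts as “pure”, which is why the cutoff has to be justified in each study.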

So far everything sounds really good, and surely for many years a lot of people were satisfied with this program. But, as expected, others started to doubt the effectiveness of the software and actually found various scenarios in which Structure could give wrong results. Factors like the time of divergence between pure populations (Kalinowski, 2011), the number of loci used as genetic markers (Vähä and Primmer, 2006) and sample size (Puechmaille, 2016, among others) seem to dramatically affect the results. These studies, apart from shedding light on these weaknesses, gave suggestions to alleviate them when using Structure to analyze data and, moreover, highlighted several of its strengths, which are probably the reasons why it is still widely used. However, the last of the factors I mentioned, sample size, is a bit murkier and, from what I’ve read so far, there are no clear recommendations about how to deal with this problem.

Sample size is probably one of the basic characteristics of any study. We are expected to have representative samples of a universe that we define beforehand, like a geographic region, a particular population or an age group. The problem with obtaining samples to be analyzed by Structure is that the real genetic groups (the clusters) are unknown to us at the beginning. It’s possible to sample different defined “zones”, as in the first example where I considered “zones” 1, 2, 3 and 4; but if it turns out that we are taking “unbalanced” samples of the clusters (the color groups we don’t know yet), there is a good chance that the results won’t reflect reality.

Let’s imagine we have ten sub-populations (each represented by one of the circles in each figure), and assume that all of them are genetically different from one another. Now imagine we sample six of those ten sub-populations (the colored ones) and feed the information to Structure. In this case we know that the six sub-populations are genetically different, so we would expect Structure to show a result like the one on the left, where each sub-population has its own cluster. The figure on the right represents a case in which we sample those six sub-populations in an unbalanced way (represented by the different sizes of the circles). Structure might show misleading results, such as merging the sub-populations represented by smaller sample sizes into the same cluster (purple). Figure from Puechmaille, 2016.
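
The title of Puechmaille (2016) names sub-sampling as one way to alleviate uneven sampling. A minimal Python sketch of that general idea follows. It is my own illustration, not the procedure from that paper: the function `subsample_evenly`, the group names and the sample sizes are hypothetical, and it assumes we can already group individuals in some way (for example by sampling zone), which is not the same as knowing the true clusters.

```python
import random
from typing import Dict, List

def subsample_evenly(samples: Dict[str, List[str]], seed: int = 0) -> Dict[str, List[str]]:
    """Randomly keep the same number of individuals from every group."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    n = min(len(ids) for ids in samples.values())  # size of the smallest group
    return {group: sorted(rng.sample(ids, n)) for group, ids in samples.items()}

# Hypothetical unbalanced design: very different sample sizes per sub-population.
unbalanced = {
    "subpop_1": [f"s1_{i}" for i in range(40)],
    "subpop_2": [f"s2_{i}" for i in range(25)],
    "subpop_3": [f"s3_{i}" for i in range(5)],
}
balanced = subsample_evenly(unbalanced)
print({group: len(ids) for group, ids in balanced.items()})
```

The obvious cost is that every group is cut down to the size of the smallest one, so a lot of data can be thrown away.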

What we know is that there is, for sure, a problem. What is not known is how exactly the different sample sizes affect the results. In other words, we know the qualitative effect of unbalanced sample sizes (the results simply won’t reflect reality), but we don’t know the quantitative effect (to what degree and in which directions the results are affected by the sample sizes). The goal of this project, my second master’s thesis, is to find and describe these quantitative effects.

I planned to explain the whole thesis in a single post, but it seems that I accidentally started a new series. In the next (probably two) posts I will try to explain how Structure works and, most importantly, the results I got (which are, in my opinion, really interesting). I should say that, based on what I’ve heard from my supervisors and colleagues, the results and the whole project seem to be quite controversial. I’ve met a lot of skepticism, which I guess is good when you work in science, and is to be expected given all the impact this program has had since it first appeared.

References

Kalinowski ST. (2011). The computer program STRUCTURE does not reliably identify the main genetic clusters within species: simulations and implications for human population structure. Heredity 106(4): 625-632.
Pritchard JK, Stephens M & Donnelly P. (2000). Inference of population structure using multilocus genotype data. Genetics 155(2): 945-959.
Pritchard JK & Wen W. (2003). Documentation for STRUCTURE software: Version 2. Available from http://pritch.bsd.uchicago.edu.
Puechmaille SJ. (2016). The program STRUCTURE does not reliably recover the correct population structure when sampling is uneven: sub-sampling and new estimators alleviate the problem. Molecular Ecology Resources. doi: 10.1111/1755-0998.12512
Vähä JP & Primmer CR. (2006). Efficiency of model-based Bayesian methods for detecting hybrid individuals under different hybridization scenarios and with different numbers of loci. Molecular Ecology 15(1): 63-72.


Ken S. Toyama

Evolution, Ecology and Biodiversity
