This tool visualizes the geographic footprint of a Y-DNA subclade using STR (Short Tandem Repeat) haplotype data. It explores a deceptively simple question:
Where has the target subclade been present the longest?
A region where a subclade is common today does not necessarily indicate long-term presence — observed frequency may reflect a recent population expansion or a founder effect. Conversely, a region with fewer but highly diverse carriers may be consistent with deeper historical roots. The map combines both signals to identify areas where long-term, stable presence is most plausible.
What it shows: Where carriers of the target subclade are geographically concentrated.
A high frequency means many samples cluster in a region. This is the most intuitive layer — it answers "Where is the target subclade common today?" — but frequency alone is not informative about time depth. A recent founder effect can produce a dense cluster with very little internal variation.
What it shows: Where carriers within the same region have accumulated the most genetic differences from each other.
STR markers mutate over generations. When carriers of a subclade have co-existed in the same area for an extended period, their haplotypes gradually drift apart. The expected squared allele difference between two lineages increases linearly with their divergence time (Goldstein et al., 1995; Slatkin, 1995), so high local diversity is consistent with a population that has had time to accumulate mutations — suggesting presence over many generations rather than a recent arrival.
"The difference in repeat score between alleles carries information about the amount of time that has passed since they shared a common ancestral allele. [...] We show analytically that the expectation of this distance is a linear function of time." — Goldstein et al., 1995
"With microsatellite loci, there is abundant evidence that the size of a new mutant allele depends on the size of the allele that mutated." — Slatkin, 1995
Low diversity in a dense region, conversely, is more consistent with a recent expansion: many carriers, but their haplotypes are still nearly identical. This is the hallmark of a founder effect — a small number of closely related lineages rapidly proliferating in a new territory. Each successive founding event reduces diversity in the new colony (Ramachandran et al., 2005).
"Each bottleneck episode decreases expected heterozygosity in the new colony by a factor of 1 − 1/(2Nᵦ)." — Ramachandran et al., 2005
What it shows: Regions where the target subclade is both frequent AND internally diverse — the strongest candidates for long-term continuous presence.
This is the key layer. It multiplies the two signals above. A high combined score means: "There are many carriers here, and they are genetically different from each other." This pattern is predicted by the diversity-peak hypothesis (Ramachandran et al., 2005) — populations tend to exhibit both high frequency and high internal diversity in regions of prolonged habitation.
"Heterozygosities in the globally distributed populations of the data set are best explained by an expansion originating in Africa and [...] no geographic origin outside of Africa accounts as well for the observed patterns of genetic diversity." — Ramachandran et al., 2005
"Expected heterozygosity has been found to decrease linearly with distance from a possible site for the geographic origin of modern humans." — Ramachandran et al., 2005
In short: frequency indicates where the target subclade is common, diversity reflects relative time depth, and their product highlights regions where long-term presence is most likely.
Geographic visualization of Y-chromosome Short Tandem Repeat (STR) genetic diversity using Kernel Density Estimation (KDE) contour maps.
| Layer | Metric | Description |
|---|---|---|
| Haplotype Frequency | Sample count per geographic cluster, normalized to [0, 1] | Geographic density of the target subclade |
| Haplotype Diversity | Mutation-rate-weighted pairwise genetic distance (mu-ASD) | Regions of highest genetic diversity |
| Frequency x Diversity | Product of normalized frequency and diversity | Candidate regions of long-term continuous presence |
The diversity score is based on pairwise mu-normalized ASD (Goldstein et al., 1995; Slatkin, 1995).
For a pair of haplotypes $(i, j)$ sharing $L$ comparable loci:
$$\text{ASD}_\mu(i, j) = \frac{1}{L} \sum_{l} \left( x_{il} - x_{jl} \right)^2 \cdot \frac{\mu_{\text{median}}}{\mu_l}$$
Standard ASD treats all loci equally. Inverse weighting by $\mu_{\text{median}} / \mu_l$ amplifies slow-mutating markers (more informative about deep divergence) and dampens fast-mutating ones (noise-prone). This is proportional to Goldstein's divergence time estimator $T_l = (\Delta x)^2 / (2\mu_l)$, differing by a constant factor $2\mu_{\text{median}}$. Empirical confirmation that slow-mutating markers carry more geographic signal is provided by Zhou et al. (2020), who showed that Y-STRs with low mutation rates (e.g. DYS392, DYS438) produce the largest allele frequency differences between populations.
Per-locus mutation rates are drawn from the Heinilä dataset (ISOGG) for 96 of 102 markers, with 6 estimated via median-ratio calibration. Population-scale estimates by Willems et al. (2016) — derived from 222,000 meioses across 702 Y-STRs using phylogeny-aware inference — provide independent validation of the rates used here.
A floor prevents extreme weights from ultra-slow markers:
$$\mu_{\text{eff}} = \max\!\left(\mu_l,\; P_{10}(\mu)\right)$$
Sampling density is typically heterogeneous across the study area. Naive pairwise comparison produces artificially high diversity in densely sampled regions simply because more haplotype pairs are available.
Stage 1: Geographic Deduplication
Samples within a micro-cell are grouped. Within each cell, samples are sorted by marker count (descending). Each sample is compared to existing representatives using total allele step distance. If the distance falls below a per-test-type threshold, the sample is merged; otherwise it becomes a new representative.
| Test | Step-distance threshold |
|---|---|
| Y-DNA37 | 3 |
| Y-DNA67 | 7 |
| Y-DNA111 | 11 |
| Big Y-700 | 11 |
Each representative tracks the number of raw samples it absorbed ($n_{\text{raw}}$), used later for Bayesian shrinkage.
Stage 2: K-Nearest Neighbors
For each geographic cluster, the $K$ nearest unique locations are found via KDTree. All $\binom{K}{2}$ pairwise mu-ASD values are computed. The cluster diversity score is the 75th percentile of these pairwise distances — less sensitive to outliers than the mean, less conservative than the median.
Stage 2b: Intra-Cluster Override
If intra-cluster diversity (pairwise mu-ASD among haplotypes within the same cell, without distance-decay) exceeds the KNN diversity, it takes precedence:
$$D_{\text{final}} = \max(D_{\text{knn}},\; D_{\text{local}})$$
Pairwise ASDs within the KNN neighborhood are weighted by geographic proximity:
$$\text{ASD}_{\text{w}}(i, j) = \text{ASD}_\mu(i, j) \cdot e^{-d_{ij} / \lambda}$$
$d_{ij}$ is the geographic distance (km), $\lambda$ is the characteristic decay length. At $d = \lambda$, the contribution is $e^{-1} \approx 0.37$. This ensures local diversity is valued more highly than regional mixing from distant KNN neighbors, consistent with the landscape genetics principle that spatial proximity constrains gene flow (Manel et al., 2003) and the well-established pattern of isolation by distance in human populations (Ramachandran et al., 2005).
Small clusters receive a reliability penalty based on total raw sample count:
$$n_{\text{pairs}} = \frac{n_{\text{raw}} (n_{\text{raw}} - 1)}{2}, \qquad r = \frac{n_{\text{pairs}}}{n_{\text{pairs}} + k}, \qquad D_{\text{shrunk}} = D \cdot r$$
With $k = 15$: for 15+ raw samples $r > 0.9$; for 3 samples $r \approx 0.17$.
Clusters with abnormally large KNN radii (sparse regions) are penalized:
$$f_{\text{loc}} = \min\!\left(1,\; \frac{\tilde{r}_K}{r_K}\right), \qquad D_{\text{final}} = D_{\text{shrunk}} \cdot f_{\text{loc}}$$
$\tilde{r}_K$ — median $K$-th neighbor distance across all clusters; $r_K$ — the cluster's own.
$$\hat{D} = \text{clip}\!\left(\frac{D}{P_{90}(D)},\; 0,\; 1\right)$$
Using $P_{90}$ instead of max prevents outliers from compressing the color range. The top 10% saturate at 1.0.
A logistic function with parameters derived from the data distribution:
$$m = Q_3, \qquad s = \frac{4}{\text{IQR}}, \qquad w = \frac{1}{1 + e^{-s(\hat{D} - m)}}$$
The $Q_3$ midpoint focuses contrast on the top quartile. The $4/\text{IQR}$ steepness aligns the transition zone with the interquartile range: $Q_1 \to w \approx 0$, $Q_3 \to w \approx 1$.
Clusters with $w < 0.05$ are zeroed out to prevent faint KDE artifacts.
Weighted cluster positions are fed into Gaussian KDE (Silverman, 1986; Scott's rule bandwidth, configurable multiplier) to produce a continuous density surface on a regular grid.
$$f = \text{clip}\!\left(1 - \frac{d - r_{\text{mask}}}{w_{\text{trans}}},\; 0,\; 1\right), \qquad \rho \leftarrow \rho \cdot f$$
Prevents density from extending into empty regions while avoiding hard mask edges.
Iso-density contour paths are generated, normalized to [0, 1], and mapped to a color gradient. Contour paths are smoothed with a uniform filter to eliminate grid artifacts. Each layer normalizes independently by its own density maximum.
$$W_{\text{comb}} = \hat{F} \cdot \hat{D} \cdot \sqrt{n_{\text{div}}}$$
Only clusters present in both layers contribute (inner join). The $\sqrt{n_{\text{div}}}$ factor adjusts for sample-size reliability. The result is normalized to [0, 1].
A high combined score indicates a region where the haplogroup is both frequent and internally diverse — consistent with the diversity-peak hypothesis (Ramachandran et al., 2005).
Samples from different STR panels (Y-DNA37, Y-DNA67, Y-DNA111, Big Y-700) are pooled. Cross-panel pairs use only the intersection of available markers; ASD is normalized by $L$, making scores comparable regardless of panel size. This normalization is essential because direct comparison of diversity between marker sets with different mutational mechanisms and rates would otherwise be misleading (Mona et al., 2013).
All spatial parameters are inferred from the input dataset's coordinate distribution. Each can be overridden with a fixed value.
Nearest-neighbor distance analysis (Clark & Evans, 1954) is used as the foundation for spatial parameter inference. For each unique sample location, the Euclidean distance to its closest neighbor is computed in km:
$$\text{NND}_i = \min_{j \neq i} \sqrt{(\Delta\varphi \cdot c_{\text{lat}})^2 + (\Delta\lambda \cdot c_{\text{lat}} \cos\bar\varphi)^2}$$
where $c_{\text{lat}} = 111.32$ km/deg and $\bar\varphi$ is the dataset centroid latitude. The resulting NND array is the foundation of all spatial parameter inference.
$$r_{\text{cluster}} = P_{78}(\text{NND}) \times 2$$
The 78th percentile captures the typical gap between distinct sampling regions, excluding outliers. The $\times 2$ multiplier scales this to a grid cell size encompassing inter-cluster spacing. CV across 70% subsamples: ~5.9%.
$$r_{\text{dedup}} = \tilde{x}(\text{NND}) \times 1.5$$
The median NND represents typical spacing between distinct locations. The $\times 1.5$ multiplier creates cells large enough to merge co-located samples while preserving genuinely distinct nearby locations. CV: ~8%.
$$\lambda = \text{IQR}(\text{NND}) \times 2$$
The characteristic length scale for the exponential decay $e^{-d/\lambda}$. At $d = \lambda$, contribution is $e^{-1} \approx 0.37$. IQR captures the spread of inter-sample spacings without sensitivity to extremes. Fallback when IQR = 0: $\tilde{x}(\text{NND}) \times 3$.
$$K = \lfloor N^{0.6} \rceil, \quad K \leq N - 1$$
Sub-linear scaling: exponent 0.6 balances between $\sqrt{N}$ (too aggressive for small $N$) and a fixed fraction of $N$ (too large for big $N$).
$$m = Q_3(\hat{D}), \qquad s = \frac{4}{\text{IQR}(\hat{D})}$$
Applied after $P_{90}$ normalization. The factor 4 derives from the logistic function property: slope at midpoint equals $s/4$. Fallback when IQR = 0: $s = 60$.
The ancient DNA (aDNA) layer consists of two independent components:
Ancient Heatmap — a simple age-weighted heat overlay of archaeological samples. This is a lightweight visualization that does not interact with the KDE-based modern analysis (Frequency, Diversity, Frequency x Diversity). It shows where ancient carriers have been found and how old they are.
Meta-Centroid Gravity Area — an exploratory heuristic that combines the ancient age-weighted centroid with the modern Frequency x Diversity layer to estimate a combined center of gravity. Unlike the heatmap, this component bridges ancient and modern signals.
The heatmap provides spatial context — showing where ancient samples of the same haplogroup have been found and their relative age. It is not a statistical model; it is a raw scatter of what has been found. Ancient and modern layers are rendered independently and do not influence each other's KDE calculations.
Each ancient sample contributes a weighted heat point. The weight is a power-scaled function of age:
$$w = f + \left(\frac{t}{t_{\max}}\right)^p \cdot (1 - f)$$
$t$ — sample age (Years Before Present), $t_{\max}$ — maximum age in the dataset, $p$ — power exponent, $f$ — weight floor. Both $p$ and $f$ are inferred from the data (see Age Parameter Inference in Part 3).
When modern Frequency x Diversity data is available, ancient samples are split into two rendering groups based on their significance factor (see Significance Factor in Part 2):
Both groups are on the same map layer — visually this is one heatmap.
Clickable circle markers are placed at each sample location. Samples within 5 km of each other are spread in a circle to avoid overlap. Clicking a marker shows a popup card with Sample ID, age, archaeological culture, site name, and a link to the source publication (if available).
Ancient DNA is deliberately kept separate from the Frequency / Diversity / Frequency x Diversity layers. The reason is sampling bias: aDNA availability is largely a matter of luck.
Because of this, integrating aDNA into KDE-based diversity or frequency calculations would introduce systematic distortion — treating an inherently incomplete and non-random dataset as if it were representative.
Two scenarios are worth distinguishing:
Ancient and modern clouds overlap. When the age-weighted ancient centroid and the modern Frequency x Diversity peak coincide geographically, the meta-centroid reinforces both signals. This convergence is the strongest available indicator of a genetic Urheimat — a region of prolonged, continuous presence where the haplogroup both originated (or arrived early) and maintained high diversity through to the present.
Ancient and modern clouds are geographically separated. When the two centroids diverge — for instance, ancient samples concentrate in one region while modern diversity peaks elsewhere — the meta-centroid falls between them. In this case the gravity area may intuitively indicate a candidate search region for undiscovered paleogenomic evidence: a corridor or intermediate zone where future aDNA sampling could bridge the gap between the known ancient and modern distributions. The hypothesis is that if a haplogroup migrated from region A (ancient attestation) to region B (modern diversity peak), intermediate territory should contain transitional archaeological signatures that have not yet been sampled.
The meta-centroid is the weighted geographic center of gravity computed from two point clouds:
The balance between the two is controlled by the ancient share $\alpha$, inferred from data (see Ancient Share in Part 3). Within the ancient cloud, the oldest sample receives a data-driven boost to pull the centroid toward the deepest time-depth signal (see Oldest Sample Boost in Part 3).
$$\text{lat}_{\text{meta}} = \frac{\sum_i w_i \cdot \text{lat}_i}{\sum_i w_i}, \qquad w_i = \begin{cases} w_{\text{age}} \cdot s_i \cdot \alpha / W_A & \text{ancient} \\ w_{\text{FxD}} \cdot (1-\alpha) / W_M & \text{modern} \end{cases}$$
$\alpha$ — ancient share, $s_i$ — significance factor, $W_A$ and $W_M$ — respective totals before rescaling.
A shared per-sample significance factor determines how much each ancient sample contributes to both the visual heatmap and the meta-centroid. It combines geographic distance from the modern KDE footprint with age-based rescue:
$$s_i = \begin{cases} 1.0 & d_i \leq R_{90} \\ \min\!\left(1,\; e^{-(d_i - R_{90})/R_{90}} + \left(1 - e^{-(d_i - R_{90})/R_{90}}\right) \cdot \frac{t_i}{t_{\max}} + f\right) & d_i > R_{90} \end{cases}$$
$d_i$ — distance from the modern cloud centroid, $R_{90}$ — 90th percentile radius of the modern cloud, $t_i / t_{\max}$ — age ratio (1.0 for the oldest sample), $f$ — significance floor.
Effect: young samples far from the main KDE footprint are dampened in both weight and visual brightness. Old samples resist dampening — a 10,000 BP sample 600 km from the KDE center retains near-full significance.
The additive floor $f$ ensures that even the youngest, most distant samples retain a minimum level of influence. It is derived from the subclade's TMRCA (Time to Most Recent Common Ancestor):
$$f = \frac{2 \cdot \sigma(t)}{T_{\text{TMRCA}}}$$
$\sigma(t)$ — standard deviation of all ancient sample ages, $T_{\text{TMRCA}}$ — TMRCA of the subclade (from the Subclade data sheet).
Rationale. The floor is a 2-sigma normalization of the age spread relative to the evolutionary depth of the subclade. A dataset whose samples span a wide age range relative to the subclade's history (high $\sigma / T$) produces a generous floor — the samples cover enough of the subclade's timeline that even distant ones carry meaningful temporal signal. Conversely, if all samples cluster within a narrow time window relative to TMRCA, the floor is small and distance-based dampening dominates.
When TMRCA is not available, the fallback denominator is $t_{\max}$ (maximum sample age).
The gravity area is rendered as a synthetic HeatMap ellipse centered on the meta-centroid:
The result is a red-purple heat blob showing the combined center of gravity of ancient time-depth and modern diversity.
All parameters below are inferred from the ancient sample distribution and subclade metadata rather than hard-coded. Each was selected through stability testing: candidate formulas were evaluated on the full dataset and with the oldest sample removed. The chosen formulas minimize sensitivity to individual outliers.
$$p = 1 + \frac{H(t)}{H_{\max}}$$
$H(t)$ — Shannon entropy of the binned age distribution, $H_{\max} = \ln(B)$ where $B$ is the number of non-empty bins.
Rationale. Normalized entropy measures how evenly ages are spread across the timeline. A uniform distribution ($H / H_{\max} \to 1$) yields $p \to 2$: strong compression, appropriate because many age ranges are represented and mid-range samples would otherwise dominate. A highly clustered distribution ($H / H_{\max} \to 0$) yields $p \to 1$: near-linear weighting, preserving whatever age differentiation exists. Entropy is a whole-distribution statistic and is insensitive to individual outlier removal (empirically <0.1% shift).
$$f = \frac{\tilde{t}}{T_{\text{ref}}} \cdot (1 - G)$$
$\tilde{t}$ — median sample age, $T_{\text{ref}}$ — TMRCA (falls back to $t_{\max}$), $G$ — Gini coefficient of the age distribution.
Rationale. The floor determines the minimum visibility of the youngest samples. It combines two signals:
$$b = 1 + \ln\!\left(1 + \frac{\Delta t}{\text{IQR}(t)}\right)$$
$\Delta t = t_{\max} - t_{2\text{nd}}$ — gap between the oldest and second-oldest sample, $\text{IQR}(t)$ — interquartile range of all ages.
Rationale. The oldest sample often represents the deepest archaeological attestation of the haplogroup and should pull the meta-centroid toward that signal. The boost scales logarithmically with the age gap normalized by IQR:
IQR is used as the denominator because it resists outliers, keeping the boost stable when extreme samples are added or removed (empirically <5% shift).
$$\alpha = \max\!\left(0.2,\; c \cdot (1 - o) + 0.5 \cdot o\right)$$
$c = \min(1, \; t_{\max} / T_{\text{TMRCA}})$ — temporal coverage: what fraction of the subclade's evolutionary depth the oldest ancient sample reaches. $o$ — overlap ratio: fraction of ancient samples that fall inside the $P_{90}$ radius of the modern cloud.
Rationale. The formula interpolates between two regimes:
A floor of 0.2 prevents the ancient share from vanishing entirely when coverage is low and overlap is high, while not over-weighting young paleogenomic samples against large modern datasets. When TMRCA is unavailable, $t_{\max}$ is used as fallback, giving $c = 1$.
$$n_\sigma = 0.1 + 0.3 \cdot (1 - o) \cdot \max(1,\; d / R_{90})$$
$o$ — overlap ratio (same as in Ancient Share), $d / R_{90}$ — distance between ancient and modern centroids divided by the $P_{90}$ radius of the modern cloud.
Rationale. The ellipse extent scales with two factors:
Overview of the dataset used to render this map.
| Total samples | 1002 |
| Y-STR marker panel | 102 markers |
Geographic density of samples, computed by counting samples in spatial grid cells.
| Geographic clusters | 203 |
| Samples per cluster | 1 – 54 (median 2) |
| KDE area | ~61 thousand km² |
Mutation-rate-weighted genetic diversity via pairwise ASD in KNN neighborhoods.
| Test type | Samples | Markers |
|---|---|---|
| Y-DNA37 | 301 | 37 |
| Y-DNA67 | 66 | 67 |
| Y-DNA111 | 23 | 102 |
| Big Y-700 | 177 | 102 |
| Total | 567 |
| After deduplication | 567 → 504 unique (63 collapsed) |
| KNN neighborhood | K = 42 |
| Clusters scored | 187 |
| KDE area | ~92 thousand km² |
Intersection of frequency and diversity signals: 80 clusters with both metrics.
| KDE area | ~64 thousand km² |
Parameters inferred from the nearest-neighbor distance (NND) distribution.
| Clustering radius | 18.9 km |
| Dedup radius | 5.1 km |
| Decay length | 14.6 km |
| Median NND | 3.4 km |
| Samples | 33 |
| Age range | 800 – 11,425 years BP |
| Cultures represented | 22 |