About

What this map shows

This tool visualizes the geographic footprint of a Y-DNA subclade using STR (Short Tandem Repeat) haplotype data. It explores a deceptively simple question:

Where has the target subclade been present the longest?

A region where a subclade is common today does not necessarily indicate long-term presence — observed frequency may reflect a recent population expansion or a founder effect. Conversely, a region with fewer but highly diverse carriers may be consistent with deeper historical roots. The map combines both signals to identify areas where long-term, stable presence is most plausible.


The three layers

Haplotype Frequency

What it shows: Where carriers of the target subclade are geographically concentrated.

A high frequency means many samples cluster in a region. This is the most intuitive layer — it answers "Where is the target subclade common today?" — but frequency alone is not informative about time depth. A recent founder effect can produce a dense cluster with very little internal variation.

Haplotype Diversity

What it shows: Where carriers within the same region have accumulated the most genetic differences from each other.

STR markers mutate over generations. When carriers of a subclade have co-existed in the same area for an extended period, their haplotypes gradually drift apart. The expected squared allele difference between two lineages increases linearly with their divergence time (Goldstein et al., 1995; Slatkin, 1995), so high local diversity is consistent with a population that has had time to accumulate mutations — suggesting presence over many generations rather than a recent arrival.

"The difference in repeat score between alleles carries information about the amount of time that has passed since they shared a common ancestral allele. [...] We show analytically that the expectation of this distance is a linear function of time." — Goldstein et al., 1995

"With microsatellite loci, there is abundant evidence that the size of a new mutant allele depends on the size of the allele that mutated." — Slatkin, 1995

Low diversity in a dense region, conversely, is more consistent with a recent expansion: many carriers, but their haplotypes are still nearly identical. This is the hallmark of a founder effect — a small number of closely related lineages rapidly proliferating in a new territory. Each successive founding event reduces diversity in the new colony (Ramachandran et al., 2005).

"Each bottleneck episode decreases expected heterozygosity in the new colony by a factor of 1 − 1/(2Nᵦ)." — Ramachandran et al., 2005

Frequency x Diversity

What it shows: Regions where the target subclade is both frequent AND internally diverse — the strongest candidates for long-term continuous presence.

This is the key layer. It multiplies the two signals above. A high combined score means: "There are many carriers here, and they are genetically different from each other." This pattern is predicted by the diversity-peak hypothesis (Ramachandran et al., 2005) — populations tend to exhibit both high frequency and high internal diversity in regions of prolonged habitation.

"Heterozygosities in the globally distributed populations of the data set are best explained by an expansion originating in Africa and [...] no geographic origin outside of Africa accounts as well for the observed patterns of genetic diversity." — Ramachandran et al., 2005

"Expected heterozygosity has been found to decrease linearly with distance from a possible site for the geographic origin of modern humans." — Ramachandran et al., 2005

In short: frequency indicates where the target subclade is common, diversity reflects relative time depth, and their product highlights regions where long-term presence is most likely.


References

  • Goldstein, D. B., et al. (1995). An evaluation of genetic distances for use with microsatellite loci. Genetics, 139(1), 463-471. doi:10.1093/genetics/139.1.463
  • Slatkin, M. (1995). A measure of population subdivision based on microsatellite allele frequencies. Genetics, 139(1), 457-462. doi:10.1093/genetics/139.1.457
  • Ramachandran, S., et al. (2005). Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. PNAS, 102(44), 15942-15947. doi:10.1073/pnas.0507611102

Haplotype Diversity Methodology

1. Overview

Geographic visualization of Y-chromosome Short Tandem Repeat (STR) genetic diversity using Kernel Density Estimation (KDE) contour maps.

Layer Metric Description
Haplotype Frequency Sample count per geographic cluster, normalized to [0, 1] Geographic density of the target subclade
Haplotype Diversity Mutation-rate-weighted pairwise genetic distance (mu-ASD) Regions of highest genetic diversity
Frequency x Diversity Product of normalized frequency and diversity Candidate regions of long-term continuous presence

2. Haplotype Diversity Metric

2.1 Mutation-Rate-Normalized ASD

The diversity score is based on pairwise mu-normalized ASD (Goldstein et al., 1995; Slatkin, 1995).

For a pair of haplotypes $(i, j)$ sharing $L$ comparable loci:

$$\text{ASD}_\mu(i, j) = \frac{1}{L} \sum_{l} \left( x_{il} - x_{jl} \right)^2 \cdot \frac{\mu_{\text{median}}}{\mu_l}$$

  • $x_{il}$ — allele value of sample $i$ at locus $l$
  • $\mu_l$ — per-locus mutation rate
  • $\mu_{\text{median}} / \mu_l$ — inverse mutation rate weight

2.2 Inverse Mutation Rate Weighting

Standard ASD treats all loci equally. Inverse weighting by $\mu_{\text{median}} / \mu_l$ amplifies slow-mutating markers (more informative about deep divergence) and dampens fast-mutating ones (noise-prone). This is proportional to Goldstein's divergence time estimator $T_l = (\Delta x)^2 / (2\mu_l)$, differing by a constant factor $2\mu_{\text{median}}$. Empirical confirmation that slow-mutating markers carry more geographic signal is provided by Zhou et al. (2020), who showed that Y-STRs with low mutation rates (e.g. DYS392, DYS438) produce the largest allele frequency differences between populations.

2.3 Mutation Rate Sources

Per-locus mutation rates are drawn from the Heinilä dataset (ISOGG) for 96 of 102 markers, with 6 estimated via median-ratio calibration. Population-scale estimates by Willems et al. (2016) — derived from 222,000 meioses across 702 Y-STRs using phylogeny-aware inference — provide independent validation of the rates used here.

2.4 Mutation Rate Floor

A floor prevents extreme weights from ultra-slow markers:

$$\mu_{\text{eff}} = \max\!\left(\mu_l,\; P_{10}(\mu)\right)$$


3. Spatial Sampling Pipeline

3.1 Sample Density Bias

Sampling density is typically heterogeneous across the study area. Naive pairwise comparison produces artificially high diversity in densely sampled regions simply because more haplotype pairs are available.

3.2 Geographic Deduplication + KNN

Stage 1: Geographic Deduplication

Samples within a micro-cell are grouped. Within each cell, samples are sorted by marker count (descending). Each sample is compared to existing representatives using total allele step distance. If the distance falls below a per-test-type threshold, the sample is merged; otherwise it becomes a new representative.

Test Step-distance threshold
Y-DNA37 3
Y-DNA67 7
Y-DNA111 11
Big Y-700 11

Each representative tracks the number of raw samples it absorbed ($n_{\text{raw}}$), used later for Bayesian shrinkage.

Stage 2: K-Nearest Neighbors

For each geographic cluster, the $K$ nearest unique locations are found via KDTree. All $\binom{K}{2}$ pairwise mu-ASD values are computed. The cluster diversity score is the 75th percentile of these pairwise distances — less sensitive to outliers than the mean, less conservative than the median.

Stage 2b: Intra-Cluster Override

If intra-cluster diversity (pairwise mu-ASD among haplotypes within the same cell, without distance-decay) exceeds the KNN diversity, it takes precedence:

$$D_{\text{final}} = \max(D_{\text{knn}},\; D_{\text{local}})$$

3.3 Distance-Decay Weighting

Pairwise ASDs within the KNN neighborhood are weighted by geographic proximity:

$$\text{ASD}_{\text{w}}(i, j) = \text{ASD}_\mu(i, j) \cdot e^{-d_{ij} / \lambda}$$

$d_{ij}$ is the geographic distance (km), $\lambda$ is the characteristic decay length. At $d = \lambda$, the contribution is $e^{-1} \approx 0.37$. This ensures local diversity is valued more highly than regional mixing from distant KNN neighbors, consistent with the landscape genetics principle that spatial proximity constrains gene flow (Manel et al., 2003) and the well-established pattern of isolation by distance in human populations (Ramachandran et al., 2005).

3.4 Bayesian Shrinkage

Small clusters receive a reliability penalty based on total raw sample count:

$$n_{\text{pairs}} = \frac{n_{\text{raw}} (n_{\text{raw}} - 1)}{2}, \qquad r = \frac{n_{\text{pairs}}}{n_{\text{pairs}} + k}, \qquad D_{\text{shrunk}} = D \cdot r$$

With $k = 15$: for 15+ raw samples $r > 0.9$; for 3 samples $r \approx 0.17$.

3.5 Locality Dampening

Clusters with abnormally large KNN radii (sparse regions) are penalized:

$$f_{\text{loc}} = \min\!\left(1,\; \frac{\tilde{r}_K}{r_K}\right), \qquad D_{\text{final}} = D_{\text{shrunk}} \cdot f_{\text{loc}}$$

$\tilde{r}_K$ — median $K$-th neighbor distance across all clusters; $r_K$ — the cluster's own.


4. Normalization and Contrast

4.1 P90 Normalization

$$\hat{D} = \text{clip}\!\left(\frac{D}{P_{90}(D)},\; 0,\; 1\right)$$

Using $P_{90}$ instead of max prevents outliers from compressing the color range. The top 10% saturate at 1.0.

4.2 Data-Driven Sigmoid Contrast

A logistic function with parameters derived from the data distribution:

$$m = Q_3, \qquad s = \frac{4}{\text{IQR}}, \qquad w = \frac{1}{1 + e^{-s(\hat{D} - m)}}$$

The $Q_3$ midpoint focuses contrast on the top quartile. The $4/\text{IQR}$ steepness aligns the transition zone with the interquartile range: $Q_1 \to w \approx 0$, $Q_3 \to w \approx 1$.

4.3 Weight Floor

Clusters with $w < 0.05$ are zeroed out to prevent faint KDE artifacts.


5. KDE Visualization

5.1 Kernel Density Estimation

Weighted cluster positions are fed into Gaussian KDE (Silverman, 1986; Scott's rule bandwidth, configurable multiplier) to produce a continuous density surface on a regular grid.

5.2 Soft Proximity Falloff

$$f = \text{clip}\!\left(1 - \frac{d - r_{\text{mask}}}{w_{\text{trans}}},\; 0,\; 1\right), \qquad \rho \leftarrow \rho \cdot f$$

Prevents density from extending into empty regions while avoiding hard mask edges.

5.3 Contour Rendering

Iso-density contour paths are generated, normalized to [0, 1], and mapped to a color gradient. Contour paths are smoothed with a uniform filter to eliminate grid artifacts. Each layer normalizes independently by its own density maximum.


6. Combined Layer: Frequency x Diversity

$$W_{\text{comb}} = \hat{F} \cdot \hat{D} \cdot \sqrt{n_{\text{div}}}$$

Only clusters present in both layers contribute (inner join). The $\sqrt{n_{\text{div}}}$ factor adjusts for sample-size reliability. The result is normalized to [0, 1].

A high combined score indicates a region where the haplogroup is both frequent and internally diverse — consistent with the diversity-peak hypothesis (Ramachandran et al., 2005).


7. Unified Sample Pool

Samples from different STR panels (Y-DNA37, Y-DNA67, Y-DNA111, Big Y-700) are pooled. Cross-panel pairs use only the intersection of available markers; ASD is normalized by $L$, making scores comparable regardless of panel size. This normalization is essential because direct comparison of diversity between marker sets with different mutational mechanisms and rates would otherwise be misleading (Mona et al., 2013).


8. Limitations

  • STR-only: SNP data is not used for diversity calculation. STR mutation rates provide continuous distance measures without requiring phylogenetic reconstruction.
  • Multi-copy markers: Corresponding allele positions are compared (a vs. a, b vs. b). Missing copies are duplicated from the first.
  • Outlier sensitivity: A single divergent sample can inflate p75 ASD across its KNN neighborhood. Bayesian shrinkage mitigates this for small clusters.
  • Coordinate precision: The deduplication radius should exceed the coordinate rounding precision of the input data.
  • Euclidean geographic distance: Pairwise distances are computed as straight-line km, which may underestimate effective separation across geographic barriers (mountains, seas). Connectivity-based distances can produce different spatial structure (Mona et al., 2013).

9. References

  • Goldstein, D. B., et al. (1995). An evaluation of genetic distances for use with microsatellite loci. Genetics, 139(1), 463-471. doi:10.1093/genetics/139.1.463
  • Slatkin, M. (1995). A measure of population subdivision based on microsatellite allele frequencies. Genetics, 139(1), 457-462. doi:10.1093/genetics/139.1.457
  • ISOGG Wiki. Y-STR Mutation Rates. https://isogg.org/wiki/Mutation_rates — per-locus mutation rates (Heinila, M. dataset).
  • Clark, P. J., & Evans, F. C. (1954). Distance to nearest neighbor as a measure of spatial relationships in populations. Ecology, 35(4), 445-453. doi:10.2307/1931034
  • Manel, S., et al. (2003). Landscape genetics: combining landscape ecology and population genetics. Trends in Ecology & Evolution, 18(4), 189-197. doi:10.1016/S0169-5347(03)00008-9
  • Ramachandran, S., et al. (2005). Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. PNAS, 102(44), 15942-15947. doi:10.1073/pnas.0507611102
  • Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall. doi:10.1007/978-1-4899-3324-9
  • Mona, S., et al. (2013). Investigating sex-specific dynamics using uniparental markers: West New Guinea as a case study. Ecology and Evolution, 3(8), 2647-2660. doi:10.1002/ece3.660
  • Willems, T., et al. (2016). Population-scale sequencing data enable precise estimates of Y-STR mutation rates. American Journal of Human Genetics, 98(5), 919-933. doi:10.1016/j.ajhg.2016.04.001
  • Zhou, Y., et al. (2020). Characterizing Y-STRs in the evaluation of population differentiation using the mean of allele frequency difference between populations. Genes, 11(5), 566. doi:10.3390/genes11050566

Calibration Methodology

All spatial parameters are inferred from the input dataset's coordinate distribution. Each can be overridden with a fixed value.


Core Statistic: Nearest-Neighbor Distance (NND)

Nearest-neighbor distance analysis (Clark & Evans, 1954) is used as the foundation for spatial parameter inference. For each unique sample location, the Euclidean distance to its closest neighbor is computed in km:

$$\text{NND}_i = \min_{j \neq i} \sqrt{(\Delta\varphi \cdot c_{\text{lat}})^2 + (\Delta\lambda \cdot c_{\text{lat}} \cos\bar\varphi)^2}$$

where $c_{\text{lat}} = 111.32$ km/deg and $\bar\varphi$ is the dataset centroid latitude. The resulting NND array is the foundation of all spatial parameter inference.


Inferred Parameters

Clustering radius — geographic grid cell size

$$r_{\text{cluster}} = P_{78}(\text{NND}) \times 2$$

The 78th percentile captures the typical gap between distinct sampling regions, excluding outliers. The $\times 2$ multiplier scales this to a grid cell size encompassing inter-cluster spacing. CV across 70% subsamples: ~5.9%.

Deduplication radius — micro-cell for geographic deduplication

$$r_{\text{dedup}} = \tilde{x}(\text{NND}) \times 1.5$$

The median NND represents typical spacing between distinct locations. The $\times 1.5$ multiplier creates cells large enough to merge co-located samples while preserving genuinely distinct nearby locations. CV: ~8%.

Distance-decay length scale — exponential decay parameter $\lambda$

$$\lambda = \text{IQR}(\text{NND}) \times 2$$

The characteristic length scale for the exponential decay $e^{-d/\lambda}$. At $d = \lambda$, contribution is $e^{-1} \approx 0.37$. IQR captures the spread of inter-sample spacings without sensitivity to extremes. Fallback when IQR = 0: $\tilde{x}(\text{NND}) \times 3$.

KNN neighborhood size — number of nearest unique locations

$$K = \lfloor N^{0.6} \rceil, \quad K \leq N - 1$$

Sub-linear scaling: exponent 0.6 balances between $\sqrt{N}$ (too aggressive for small $N$) and a fixed fraction of $N$ (too large for big $N$).

Sigmoid contrast parameters — logistic contrast curve

$$m = Q_3(\hat{D}), \qquad s = \frac{4}{\text{IQR}(\hat{D})}$$

Applied after $P_{90}$ normalization. The factor 4 derives from the logistic function property: slope at midpoint equals $s/4$. Fallback when IQR = 0: $s = 60$.

Paleogenomics

The ancient DNA (aDNA) layer consists of two independent components:

  1. Ancient Heatmap — a simple age-weighted heat overlay of archaeological samples. This is a lightweight visualization that does not interact with the KDE-based modern analysis (Frequency, Diversity, Frequency x Diversity). It shows where ancient carriers have been found and how old they are.

  2. Meta-Centroid Gravity Area — an exploratory heuristic that combines the ancient age-weighted centroid with the modern Frequency x Diversity layer to estimate a combined center of gravity. Unlike the heatmap, this component bridges ancient and modern signals.


Part 1: Ancient Heatmap

Purpose

The heatmap provides spatial context — showing where ancient samples of the same haplogroup have been found and their relative age. It is not a statistical model; it is a raw scatter of what has been found. Ancient and modern layers are rendered independently and do not influence each other's KDE calculations.

Age Weighting

Each ancient sample contributes a weighted heat point. The weight is a power-scaled function of age:

$$w = f + \left(\frac{t}{t_{\max}}\right)^p \cdot (1 - f)$$

$t$ — sample age (Years Before Present), $t_{\max}$ — maximum age in the dataset, $p$ — power exponent, $f$ — weight floor. Both $p$ and $f$ are inferred from the data (see Age Parameter Inference in Part 3).

Near/Far HeatMap Split

When modern Frequency x Diversity data is available, ancient samples are split into two rendering groups based on their significance factor (see Significance Factor in Part 2):

  • Near samples (significance >= 0.5): rendered at full brightness, stable across zoom levels. These are samples within or close to the modern KDE footprint.
  • Far samples (significance < 0.5): rendered with reduced opacity and zoom-scaled blobs. At zoom-out, distant clusters of a few young samples fade rather than glowing as brightly as the main concentration.

Both groups are on the same map layer — visually this is one heatmap.

Sample Markers

Clickable circle markers are placed at each sample location. Samples within 5 km of each other are spread in a circle to avoid overlap. Clicking a marker shows a popup card with Sample ID, age, archaeological culture, site name, and a link to the source publication (if available).

Why ancient samples are not part of the main analysis

Ancient DNA is deliberately kept separate from the Frequency / Diversity / Frequency x Diversity layers. The reason is sampling bias: aDNA availability is largely a matter of luck.

  • Geographic bias. Archaeological excavations are concentrated in specific regions. Vast areas have never been sampled for aDNA — absence on the map does not mean absence in history.
  • Lineage bias. Not every branch of a haplogroup has been found in ancient remains. Whether a particular subclade appears in the archaeological record depends on which sites were excavated, which bones preserved DNA, and which studies chose to publish.
  • Temporal bias. Older periods are dramatically underrepresented. A 10,000-year-old sample surviving to extraction is the exception, not the rule.

Because of this, integrating aDNA into KDE-based diversity or frequency calculations would introduce systematic distortion — treating an inherently incomplete and non-random dataset as if it were representative.


Part 2: Meta-Centroid and Gravity Area

Interpretation

Note: This overlay is an exploratory heuristic — the author does not claim statistical rigor for the conclusions it suggests. The interpretation below remains speculative and should be treated as a hypothesis-generating tool rather than a definitive spatial model.

Two scenarios are worth distinguishing:

  • Ancient and modern clouds overlap. When the age-weighted ancient centroid and the modern Frequency x Diversity peak coincide geographically, the meta-centroid reinforces both signals. This convergence is the strongest available indicator of a genetic Urheimat — a region of prolonged, continuous presence where the haplogroup both originated (or arrived early) and maintained high diversity through to the present.

  • Ancient and modern clouds are geographically separated. When the two centroids diverge — for instance, ancient samples concentrate in one region while modern diversity peaks elsewhere — the meta-centroid falls between them. In this case the gravity area may intuitively indicate a candidate search region for undiscovered paleogenomic evidence: a corridor or intermediate zone where future aDNA sampling could bridge the gap between the known ancient and modern distributions. The hypothesis is that if a haplogroup migrated from region A (ancient attestation) to region B (modern diversity peak), intermediate territory should contain transitional archaeological signatures that have not yet been sampled.

Meta-Centroid

The meta-centroid is the weighted geographic center of gravity computed from two point clouds:

  1. Ancient cloud — each sample weighted by its age (power-scaled) and significance factor.
  2. Modern cloud — Frequency x Diversity cluster positions weighted by their normalized combined score.

The balance between the two is controlled by the ancient share $\alpha$, inferred from data (see Ancient Share in Part 3). Within the ancient cloud, the oldest sample receives a data-driven boost to pull the centroid toward the deepest time-depth signal (see Oldest Sample Boost in Part 3).

$$\text{lat}_{\text{meta}} = \frac{\sum_i w_i \cdot \text{lat}_i}{\sum_i w_i}, \qquad w_i = \begin{cases} w_{\text{age}} \cdot s_i \cdot \alpha / W_A & \text{ancient} \\ w_{\text{FxD}} \cdot (1-\alpha) / W_M & \text{modern} \end{cases}$$

$\alpha$ — ancient share, $s_i$ — significance factor, $W_A$ and $W_M$ — respective totals before rescaling.

Significance Factor

A shared per-sample significance factor determines how much each ancient sample contributes to both the visual heatmap and the meta-centroid. It combines geographic distance from the modern KDE footprint with age-based rescue:

$$s_i = \begin{cases} 1.0 & d_i \leq R_{90} \\ \min\!\left(1,\; e^{-(d_i - R_{90})/R_{90}} + \left(1 - e^{-(d_i - R_{90})/R_{90}}\right) \cdot \frac{t_i}{t_{\max}} + f\right) & d_i > R_{90} \end{cases}$$

$d_i$ — distance from the modern cloud centroid, $R_{90}$ — 90th percentile radius of the modern cloud, $t_i / t_{\max}$ — age ratio (1.0 for the oldest sample), $f$ — significance floor.

Effect: young samples far from the main KDE footprint are dampened in both weight and visual brightness. Old samples resist dampening — a 10,000 BP sample 600 km from the KDE center retains near-full significance.

Significance Floor

The additive floor $f$ ensures that even the youngest, most distant samples retain a minimum level of influence. It is derived from the subclade's TMRCA (Time to Most Recent Common Ancestor):

$$f = \frac{2 \cdot \sigma(t)}{T_{\text{TMRCA}}}$$

$\sigma(t)$ — standard deviation of all ancient sample ages, $T_{\text{TMRCA}}$ — TMRCA of the subclade (from the Subclade data sheet).

Rationale. The floor is a 2-sigma normalization of the age spread relative to the evolutionary depth of the subclade. A dataset whose samples span a wide age range relative to the subclade's history (high $\sigma / T$) produces a generous floor — the samples cover enough of the subclade's timeline that even distant ones carry meaningful temporal signal. Conversely, if all samples cluster within a narrow time window relative to TMRCA, the floor is small and distance-based dampening dominates.

When TMRCA is not available, the fallback denominator is $t_{\max}$ (maximum sample age).

Gravity Area Rendering

The gravity area is rendered as a synthetic HeatMap ellipse centered on the meta-centroid:

  1. Ellipse shape is derived from the covariance matrix of the modern Frequency x Diversity cloud, measured relative to the meta-centroid. This stretches the ellipse toward the modern KDE footprint.
  2. Eigendecomposition of the covariance gives principal axes (orientation and aspect ratio).
  3. A regular grid in eigenspace is mapped back to geographic coordinates. Each synthetic point receives a Gaussian weight decaying from the centroid.

The result is a red-purple heat blob showing the combined center of gravity of ancient time-depth and modern diversity.


Part 3: Data-Driven Parameter Inference

All parameters below are inferred from the ancient sample distribution and subclade metadata rather than hard-coded. Each was selected through stability testing: candidate formulas were evaluated on the full dataset and with the oldest sample removed. The chosen formulas minimize sensitivity to individual outliers.

Weight Power

$$p = 1 + \frac{H(t)}{H_{\max}}$$

$H(t)$ — Shannon entropy of the binned age distribution, $H_{\max} = \ln(B)$ where $B$ is the number of non-empty bins.

Rationale. Normalized entropy measures how evenly ages are spread across the timeline. A uniform distribution ($H / H_{\max} \to 1$) yields $p \to 2$: strong compression, appropriate because many age ranges are represented and mid-range samples would otherwise dominate. A highly clustered distribution ($H / H_{\max} \to 0$) yields $p \to 1$: near-linear weighting, preserving whatever age differentiation exists. Entropy is a whole-distribution statistic and is insensitive to individual outlier removal (empirically <0.1% shift).

Weight Floor

$$f = \frac{\tilde{t}}{T_{\text{ref}}} \cdot (1 - G)$$

$\tilde{t}$ — median sample age, $T_{\text{ref}}$ — TMRCA (falls back to $t_{\max}$), $G$ — Gini coefficient of the age distribution.

Rationale. The floor determines the minimum visibility of the youngest samples. It combines two signals:

  • $\tilde{t} / T_{\text{ref}}$ — temporal coverage: what fraction of the subclade's evolutionary history the typical sample represents. High coverage means even young samples are relatively informative.
  • $(1 - G)$ — age equality: a Gini coefficient near 0 (equal ages) keeps the floor generous; near 1 (extreme age inequality) shrinks it, letting the power curve do the differentiation.

Oldest Sample Boost

$$b = 1 + \ln\!\left(1 + \frac{\Delta t}{\text{IQR}(t)}\right)$$

$\Delta t = t_{\max} - t_{2\text{nd}}$ — gap between the oldest and second-oldest sample, $\text{IQR}(t)$ — interquartile range of all ages.

Rationale. The oldest sample often represents the deepest archaeological attestation of the haplogroup and should pull the meta-centroid toward that signal. The boost scales logarithmically with the age gap normalized by IQR:

  • Small gap ($\Delta t \ll \text{IQR}$): boost $\approx 1$ — no special treatment needed; the oldest is not an outlier.
  • Large gap ($\Delta t \gg \text{IQR}$): boost grows but never explodes — logarithmic saturation prevents a single extreme sample from dominating.

IQR is used as the denominator because it resists outliers, keeping the boost stable when extreme samples are added or removed (empirically <5% shift).

Ancient Share

$$\alpha = \max\!\left(0.2,\; c \cdot (1 - o) + 0.5 \cdot o\right)$$

$c = \min(1, \; t_{\max} / T_{\text{TMRCA}})$ — temporal coverage: what fraction of the subclade's evolutionary depth the oldest ancient sample reaches. $o$ — overlap ratio: fraction of ancient samples that fall inside the $P_{90}$ radius of the modern cloud.

Rationale. The formula interpolates between two regimes:

  • High overlap ($o \to 1$): most ancient samples sit within the modern cloud. The share tends toward 0.5 — equal balance, because both signals agree geographically and neither should dominate.
  • Low overlap ($o \to 0$): ancient and modern clouds are separated. The share rises toward $c$ — temporal coverage. If the ancient samples span most of the subclade's history ($c \to 1$), the ancient centroid gets strong pull; if they cover only a small fraction, the pull is weaker.

A floor of 0.2 prevents the ancient share from vanishing entirely when coverage is low and overlap is high, while not over-weighting young paleogenomic samples against large modern datasets. When TMRCA is unavailable, $t_{\max}$ is used as fallback, giving $c = 1$.

Ellipse Extent (N-Sigma)

$$n_\sigma = 0.1 + 0.3 \cdot (1 - o) \cdot \max(1,\; d / R_{90})$$

$o$ — overlap ratio (same as in Ancient Share), $d / R_{90}$ — distance between ancient and modern centroids divided by the $P_{90}$ radius of the modern cloud.

Rationale. The ellipse extent scales with two factors:

  • (1 - overlap): full overlap produces a compact ellipse (0.1 sigma) — the gravity area is a tight dot confirming co-location. Low overlap produces a larger ellipse bridging the two clouds.
  • max(1, d / R_{90}): when the centroid separation exceeds the modern cloud radius, the ellipse expands further to cover the gap. When centroids are close (ratio < 1), this factor has no effect, preserving the base overlap behavior.

Data Summary

Overview of the dataset used to render this map.

Dataset

Total samples1002
Y-STR marker panel102 markers

Haplotype Frequency

Geographic density of samples, computed by counting samples in spatial grid cells.

Geographic clusters203
Samples per cluster1 – 54 (median 2)
KDE area~61 thousand km²

Haplotype Diversity

Mutation-rate-weighted genetic diversity via pairwise ASD in KNN neighborhoods.

Test typeSamplesMarkers
Y-DNA3730137
Y-DNA676667
Y-DNA11123102
Big Y-700177102
Total567
After deduplication567 → 504 unique (63 collapsed)
KNN neighborhoodK = 42
Clusters scored187
KDE area~92 thousand km²

Frequency × Diversity

Intersection of frequency and diversity signals: 80 clusters with both metrics.

KDE area~64 thousand km²

Spatial Calibration

Parameters inferred from the nearest-neighbor distance (NND) distribution.

Clustering radius18.9 km
Dedup radius5.1 km
Decay length14.6 km
Median NND3.4 km

Ancient DNA

Samples33
Age range800 – 11,425 years BP
Cultures represented22