Info: the math

You tap one button. The app reads your location and the current time, then shows real past misfortunes near you or on a day like today. Every design choice that could be tuned is tuned from your reactions.

Two learning problems

We show exactly one story at a time; reacting to it reveals the next. How many stories you see is therefore not a design choice to tune; it is just how long you keep going. The remaining chrome (button/background color, light/dark, face) is a discrete Thompson-sampling bandit: each color×mode×face combination is an arm, re-sampled for every card, and the reward is engagement, meaning whether you react at all (any of the three buttons) rather than abandon the card. Arm $a$ keeps a Beta posterior over its engage rate, $\theta_a \sim \mathrm{Beta}(1+\text{engaged}_a,\,1+\text{abandoned}_a)$; for each card we draw one sample per arm and show the argmax, so the background changes per card as the bandit explores.
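As a rough illustration of the chrome bandit described above, here is a minimal Python sketch. The class and function names (`BetaArm`, `choose_arm`, `record`) and the option values are hypothetical placeholders, not the app's actual code; only the Beta posterior and the sample-then-argmax rule come from the text.

```python
import random


class BetaArm:
    """Engagement counts for one color x mode x face combination."""

    def __init__(self):
        self.engaged = 0
        self.abandoned = 0

    def sample(self, rng):
        # Draw theta_a ~ Beta(1 + engaged_a, 1 + abandoned_a).
        return rng.betavariate(1 + self.engaged, 1 + self.abandoned)


def choose_arm(arms, rng):
    """Draw one posterior sample per arm and return the key of the argmax."""
    return max(arms, key=lambda key: arms[key].sample(rng))


def record(arms, key, engaged):
    """Update the shown arm: engaged means any of the three buttons was pressed."""
    if engaged:
        arms[key].engaged += 1
    else:
        arms[key].abandoned += 1


# Hypothetical option values, for illustration only.
COLORS = ["red", "blue"]
MODES = ["light", "dark"]
FACES = ["face_a", "face_b"]

rng = random.Random(0)
arms = {(c, m, f): BetaArm() for c in COLORS for m in MODES for f in FACES}

shown = choose_arm(arms, rng)      # chrome for this card
record(arms, shown, engaged=True)  # the user reacted to the card
```

With flat Beta(1, 1) priors every combination starts equally likely to be shown; arms that keep getting abandoned are sampled less often as their posteriors tighten.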

Story selection is the interesting one. We learn a classifier

$$\hat p \;=\; p(\text{up}\mid X_{\text{story}},X_{\text{impression}})$$

where the inputs cross story (location, date, severity, tone, theme) with impression (your location, local day-of-week and month). The feature vector is continuous, so two stories at roughly the same distance and age still get different scores. It contains standardized log-distance $z_d$ and its square; standardized log-age $z_a$ and its square; their product; a standardized badness (severity) term $b$ and its interaction with distance; GDELT tone $\tau$ and its magnitude; weekday and month match indicators; a smooth same-day-of-year bump; and indicators for the most common disaster themes:

$$\phi \;=\; \big[\,1,\; z_d,\; z_d^2,\; z_a,\; z_a^2,\; z_d z_a,\; b,\; b\,z_d,\; \tau,\; |\tau|,\; \text{dow},\; \text{moy},\; \text{anniv},\; \text{theme}_{1\ldots 8}\,\big]$$
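To make the layout of $\phi$ concrete, here is a sketch that assembles the 21 entries in the order shown. Several details are assumptions made for illustration and are not stated in the text: `log1p` as the log transform (so zero distance or age is defined), a Gaussian shape and 3-day width for the anniversary bump, a 365-day circular year, the placeholder theme names, and the `stats` dictionary of (mean, std) pairs.

```python
import math

# Placeholder names; the text only says "the most common disaster themes".
THEMES = ["theme_%d" % i for i in range(1, 9)]


def zscore(x, mean, std):
    return (x - mean) / std


def anniversary_bump(story_doy, today_doy, width_days=3.0):
    """Smooth same-day-of-year bump (assumed Gaussian in circular day distance)."""
    gap = abs(story_doy - today_doy) % 365
    gap = min(gap, 365 - gap)
    return math.exp(-0.5 * (gap / width_days) ** 2)


def build_phi(dist_km, age_days, badness, tone, same_dow, same_month,
              story_doy, today_doy, story_themes, stats):
    """Return phi in the order used in the formula above.

    badness is assumed to be standardized already; stats maps
    "log_dist" and "log_age" to (mean, std) pairs from training data.
    """
    z_d = zscore(math.log1p(dist_km), *stats["log_dist"])
    z_a = zscore(math.log1p(age_days), *stats["log_age"])
    phi = [
        1.0,                      # intercept
        z_d, z_d ** 2,            # distance terms
        z_a, z_a ** 2,            # age terms
        z_d * z_a,                # distance x age
        badness, badness * z_d,   # severity and its distance interaction
        tone, abs(tone),          # GDELT tone and magnitude
        1.0 if same_dow else 0.0,
        1.0 if same_month else 0.0,
        anniversary_bump(story_doy, today_doy),
    ]
    phi += [1.0 if name in story_themes else 0.0 for name in THEMES]
    return phi
```

Because distance, age, and the anniversary bump enter as continuous values rather than buckets, two candidates rarely share an identical $\phi$.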

The classifier is a logistic regression, $\hat p=\sigma(w^\top\phi)$. We fit it the honest way: a batch Newton (IRLS) solve over all logged feedback that finds the exact regularized optimum, with the ridge $\lambda$ chosen by held-out cross-validation. At this feedback volume there is no reason to do anything cheaper, so the worker does not update the weights online at all; it only scores with the current fit. The model refreshes when the offline fit is re-run:

$$w^{*}(\lambda)=\arg\min_{w}\ \sum_i \ell\big(y_i,\sigma(w^\top\phi_i)\big)+\tfrac{\lambda}{2}\lVert w\rVert^2,\qquad \lambda=\arg\min_{\lambda}\ \text{CV-NLL}(\lambda).$$
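The batch fit can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the production solver: the function names, the small Gaussian-elimination helper, the interleaved fold assignment, and the candidate grid passed to `pick_lambda` are all hypothetical. Following the formula as written, the ridge penalty is applied to every weight, including the intercept.

```python
import math


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def fit_irls(Phi, y, lam, iters=25, tol=1e-10):
    """Newton steps on sum_i NLL_i + (lam / 2) * ||w||^2."""
    d = len(Phi[0])
    w = [0.0] * d
    for _ in range(iters):
        grad = [lam * wj for wj in w]
        H = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
        for phi, yi in zip(Phi, y):
            p = sigmoid(dot(w, phi))
            s = p * (1.0 - p)
            for i in range(d):
                grad[i] += (p - yi) * phi[i]
                for j in range(d):
                    H[i][j] += s * phi[i] * phi[j]
        step = solve(H, grad)
        w = [wj - sj for wj, sj in zip(w, step)]
        if max(abs(sj) for sj in step) < tol:
            break
    return w


def nll(w, Phi, y):
    total = 0.0
    for phi, yi in zip(Phi, y):
        p = min(max(sigmoid(dot(w, phi)), 1e-12), 1.0 - 1e-12)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total


def pick_lambda(Phi, y, grid, k=5):
    """Return the grid value with the lowest k-fold held-out NLL."""
    def cv_nll(lam):
        total = 0.0
        for fold in range(k):
            train = [i for i in range(len(y)) if i % k != fold]
            test = [i for i in range(len(y)) if i % k == fold]
            w = fit_irls([Phi[i] for i in train], [y[i] for i in train], lam)
            total += nll(w, [Phi[i] for i in test], [y[i] for i in test])
        return total
    return min(grid, key=cv_nll)
```

A typical offline run would call `pick_lambda` on the logged feedback, refit once with `fit_irls` on all of it at the chosen $\lambda$, and publish the resulting weights for the worker to score with.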

Inverse gap weighting

Given the scored candidates, we do not show the single highest-scoring story every time (that never explores) nor a uniform random one (that never exploits). We sample story $s$ with probability inversely proportional to how far its score sits below the best:

$$p(s)\;\propto\;\frac{1}{\epsilon \;+\; \hat p(\text{up}\mid s^{*}) \;-\; \hat p(\text{up}\mid s)}$$

where $s^{*}$ is the top-scoring candidate and the gap $\hat p^{*}-\hat p_s \in [0,1]$. The best story (gap $0$) gets the largest weight; weaker stories get weight that shrinks with their gap. The single knob $\epsilon>0$ trades exploration for exploitation: small $\epsilon$ concentrates on the best, large $\epsilon$ spreads toward uniform. This is the inverse-gap-weighting reduction of a contextual bandit to a regression oracle. Each shown story's selection probability (its propensity) is logged. Because every candidate has nonzero probability of being shown, the logged propensities let us later compute an unbiased off-policy estimate of what a different $\epsilon$ would have done.
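The sampling rule and the propensity log can be sketched as below. The function names and the log-record layout `(scores, shown_index, propensity, reward)` are hypothetical; the off-policy estimator shown is the standard inverse-propensity-scoring (IPS) estimator, used here as one plausible way to do the later evaluation, not necessarily the one the app uses.

```python
import random


def igw_probs(scores, eps):
    """p(s) proportional to 1 / (eps + best_score - score_s)."""
    best = max(scores)
    weights = [1.0 / (eps + (best - s)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_story(scores, eps, rng):
    """Pick a candidate index and return it with its propensity for logging."""
    probs = igw_probs(scores, eps)
    idx = rng.choices(range(len(scores)), weights=probs, k=1)[0]
    return idx, probs[idx]


def ips_value(log, new_eps):
    """IPS estimate of the mean reward a different eps would have earned.

    log is a list of (scores, shown_index, propensity, reward) records,
    where propensity is the probability logged when the story was shown.
    """
    total = 0.0
    for scores, idx, propensity, reward in log:
        total += igw_probs(scores, new_eps)[idx] / propensity * reward
    return total / len(log)


rng = random.Random(0)
scores = [0.8, 0.6, 0.3]                  # hypothetical p-hat values for three candidates
idx, propensity = sample_story(scores, eps=0.1, rng=rng)
```

For these three scores and $\epsilon = 0.1$ the weights are roughly $10$, $3.33$, and $1.67$, so the best story is shown about two thirds of the time while the other two still get regular exposure.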

Setting the one knob

$\epsilon$ is a single global exploration level, the same for every story (in inverse-gap weighting the exploration rate is a property of the policy, not something re-solved per example). For one request with candidate set $\mathcal{S}$, a policy $\pi$ that shows story $s$ with probability $\pi(s)$ has expected click-rate $\sum_{s\in\mathcal{S}}\pi(s)\,\hat p_s$. Write $V(\epsilon)$ for our IGW policy and $V^{*}$ for the oracle, the policy that puts all its mass on the best candidate:

$$V(\epsilon)=\sum_{s\in\mathcal{S}} p(s\mid\epsilon)\,\hat p_s,\qquad V^{*}=\sum_{s\in\mathcal{S}} p^{*}(s)\,\hat p_s=\max_{s\in\mathcal{S}}\hat p_s,$$

where $p^{*}(s)$ is the $\epsilon\to 0$ limit of $p(s\mid\epsilon)$ (a point mass on the argmax, so the sum collapses to the max). We choose the one $\epsilon$ whose expected click-rate under the model, averaged over requests, is half the oracle's:

$$\frac{\mathbb{E}_{\text{req}}\!\left[V(\epsilon)\right]}{\mathbb{E}_{\text{req}}\!\left[V^{*}\right]}\;=\;0.5.$$

$V(\epsilon)$ slides monotonically from $V^{*}$ (as $\epsilon\to 0$, all mass on the best) down to the candidate mean (as $\epsilon\to\infty$, uniform), so exactly one value of $\epsilon$ hits the half-of-oracle target, provided the request-averaged candidate mean falls below half the request-averaged oracle value; if it does not, even uniform sampling beats the target and no $\epsilon$ reaches it. The value is solved offline (no Gaussian process, no Thompson sampling) and written to a config row the worker reads as a constant, so a tap never pays solver latency. The chosen $\epsilon$ and each story's propensity are logged, so a different target could be evaluated off-policy later.
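Because the ratio is monotone in $\epsilon$, a one-dimensional root finder is enough. The sketch below uses bisection in log space; the function names, the search bracket `[1e-9, 1e6]`, and the iteration count are assumptions chosen for illustration, and the text does not say which solver is actually used offline.

```python
import math


def igw_probs(scores, eps):
    best = max(scores)
    weights = [1.0 / (eps + (best - s)) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def igw_value(scores, eps):
    """V(eps) for one request: expected score under the IGW policy."""
    return sum(p * s for p, s in zip(igw_probs(scores, eps), scores))


def solve_eps(requests, target=0.5, lo=1e-9, hi=1e6, iters=200):
    """Find eps with mean V(eps) / mean V* == target by log-space bisection.

    requests is a list of per-request candidate score lists.
    """
    oracle = sum(max(scores) for scores in requests) / len(requests)

    def ratio(eps):
        value = sum(igw_value(scores, eps) for scores in requests) / len(requests)
        return value / oracle

    # The ratio decreases as eps grows, so the target must lie inside the bracket.
    if ratio(hi) > target or ratio(lo) < target:
        raise ValueError("target ratio is not reachable inside the bracket")
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if ratio(mid) > target:
            lo = mid   # still too greedy: explore more
        else:
            hi = mid   # too exploratory: concentrate more
    return math.sqrt(lo * hi)


# One hypothetical request: a clear best story and three weak ones.
eps = solve_eps([[0.9, 0.1, 0.1, 0.1]])
```

For this single request the algebra can be done by hand: $V(\epsilon) = (1.2\epsilon + 0.72)/(4\epsilon + 0.8)$, and setting it to half of $0.9$ gives $\epsilon = 0.6$. A request set such as `[[0.9, 0.8]]` raises `ValueError` instead, because uniform sampling there already earns well over half the oracle value.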