You tap one button. The app reads your location and the current time, then shows real past misfortunes near you or on a day like today. Every design choice that could be tuned is tuned from your reactions.
We show exactly one story at a time; reacting to it reveals the next. How many stories you see is therefore not a design choice to tune; it is just how long you keep going. The remaining chrome (button/background color, light/dark, face) is tuned by a discrete Thompson-sampling bandit: each color×mode×face combination is an arm, re-sampled for every card, and the reward is engagement, meaning whether you react at all (any of the three buttons) rather than abandon the card. Arm $a$ keeps a Beta posterior over its engagement rate, $\theta_a \sim \mathrm{Beta}(1+\text{engaged}_a,\,1+\text{abandoned}_a)$. For each card we draw one sample per arm and show the arm with the largest sample, so the background changes from card to card as the bandit explores.
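A minimal sketch of the chrome bandit described above, using only the Python standard library. The option lists (`COLORS`, `MODES`, `FACES`) and the class name `ChromeBandit` are hypothetical placeholders; the text does not list the actual options or say how the counts are stored.

```python
import itertools
import random

# Hypothetical option lists; the real colors, modes, and faces are not given in the text.
COLORS = ["red", "teal", "amber"]
MODES = ["light", "dark"]
FACES = ["face_a", "face_b"]

ARMS = list(itertools.product(COLORS, MODES, FACES))


class ChromeBandit:
    """Beta-Bernoulli Thompson sampling over color x mode x face arms."""

    def __init__(self, arms, rng=None):
        self.rng = rng if rng is not None else random.Random()
        # Per-arm counts [engaged, abandoned]; zero counts mean a Beta(1, 1) prior.
        self.counts = {arm: [0, 0] for arm in arms}

    def choose(self):
        # Draw theta_a ~ Beta(1 + engaged_a, 1 + abandoned_a) once per arm
        # and return the arm with the largest draw.
        def draw(arm):
            engaged, abandoned = self.counts[arm]
            return self.rng.betavariate(1 + engaged, 1 + abandoned)

        return max(self.counts, key=draw)

    def update(self, arm, engaged):
        # engaged=True for any reaction button, False when the card is abandoned.
        self.counts[arm][0 if engaged else 1] += 1

    def posterior_mean(self, arm):
        engaged, abandoned = self.counts[arm]
        return (1 + engaged) / (2 + engaged + abandoned)
```

Calling `choose()` before each card and `update()` after the reaction (or abandonment) is enough to reproduce the per-card re-sampling; arms with little data have wide posteriors and so still get shown occasionally.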
Story selection is the interesting one. We learn a classifier that outputs $\hat p_s$, the predicted probability that you click story $s$. Its inputs cross the story (its location and date) with the impression (your location, local day of week, and month). The feature vector consists of the derived cross-terms: bucketed distance between you and the event (near / mid / far), bucketed time since it (within a year / years / decades), and whether the weekday or month matches. The classifier is an online logistic regression, updated one example at a time by stochastic gradient descent.
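As a rough illustration, the cross-term features and the per-example update could look like the sketch below. It uses the standard log-loss step $w \leftarrow w + \eta\,(y-\hat p)\,x$, where the symbols $w$, $\eta$, $y$, and $x$ are notation introduced here. The bucket cut-offs, the learning rate, the bias term, and every function and class name are assumptions; the text names the buckets but not their boundaries.

```python
import math

# Hypothetical bucket edges; the text names the buckets but not their cut-offs.
NEAR_KM, MID_KM = 10.0, 100.0
RECENT_YEARS, YEARS_LIMIT = 1.0, 10.0

FEATURES = [
    "bias",  # assumed intercept term
    "dist_near", "dist_mid", "dist_far",
    "age_within_year", "age_years", "age_decades",
    "weekday_match", "month_match",
]


def cross_features(distance_km, years_ago, weekday_match, month_match):
    """One-hot distance and age buckets plus match flags, in FEATURES order."""
    dist = [distance_km < NEAR_KM,
            NEAR_KM <= distance_km < MID_KM,
            distance_km >= MID_KM]
    age = [years_ago < RECENT_YEARS,
           RECENT_YEARS <= years_ago < YEARS_LIMIT,
           years_ago >= YEARS_LIMIT]
    flags = [True] + dist + age + [weekday_match, month_match]
    return [1.0 if flag else 0.0 for flag in flags]


class OnlineLogReg:
    """Logistic regression trained one example at a time."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.lr = lr  # hypothetical step size

    def predict(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, clicked):
        # One SGD step on the log loss: w <- w + lr * (y - p_hat) * x.
        err = (1.0 if clicked else 0.0) - self.predict(x)
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]
```

With one-hot buckets each weight is simply a log-odds offset for that bucket, which keeps the model easy to inspect after it has seen some traffic.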
Given the scored candidates, we do not show the single highest-scoring story every time (that never explores), nor a uniform random one (that never exploits). Instead we sample story $s$ with a probability that shrinks as its score $\hat p_s$ falls further below the best score $\hat p^{*}$, the score of the top-scoring candidate $s^{*}$. The gap $\hat p^{*}-\hat p_s$ lies in $[0,1]$. The best story (gap $0$) gets the largest weight; weaker stories get weights that shrink with their gaps. The single knob $\epsilon>0$ trades exploration for exploitation: small $\epsilon$ concentrates on the best, large $\epsilon$ spreads toward uniform. This is the inverse-gap-weighting reduction of a contextual bandit to a regression oracle. Each shown story's selection probability (its propensity) is logged, which lets us later estimate, off-policy and without bias, what a different $\epsilon$ would have done.
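One weighting that matches this description is $P(s) \propto 1/(\epsilon + \hat p^{*} - \hat p_s)$, normalised over the candidates. That exact form is an assumption made for this sketch, not something the text confirms; the function names are hypothetical as well.

```python
import random


def igw_probabilities(scores, epsilon):
    """Assumed form: P(s) proportional to 1 / (epsilon + (best - score_s))."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    best = max(scores)
    weights = [1.0 / (epsilon + (best - p)) for p in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_story(scores, epsilon, rng=random):
    """Sample a candidate index and return it with its propensity."""
    probs = igw_probabilities(scores, epsilon)
    index = rng.choices(range(len(scores)), weights=probs, k=1)[0]
    # The propensity is returned so it can be logged alongside the impression.
    return index, probs[index]
```

Under this form a tiny $\epsilon$ puts nearly all the mass on the top candidate, while a very large $\epsilon$ makes every weight nearly equal, matching the two limits described above.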
Because we now show one story at a time, there is no curve to learn: $\epsilon$ is set deterministically, per request, by a target. We want the story we actually show to be good but not always the single best, so we pick $\epsilon$ so that $V(\epsilon)$, the expected click probability of the sampled story, is a fixed fraction of the best candidate's: $V(\epsilon)=0.8\,\hat p^{*}$.
$V(\epsilon)$ slides monotonically from $\hat p^{*}$ (as $\epsilon\to 0$, all mass on the best) down to the candidate mean (as $\epsilon\to\infty$, uniform), so a single value hits the $0.8\,\hat p^{*}$ target whenever it is reachable. The worker finds it by a few steps of bisection on each request — no Gaussian process, no Thompson sampling, no offline fit. When the candidate set is so tight that even the uniform average already beats the target, we just explore as much as possible. The chosen $\epsilon$ and the resulting propensity are still logged, so a different target could be evaluated off-policy later.
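The per-request search might be sketched as follows. The weighting $1/(\epsilon + \text{gap})$, the search bounds, the step count, and the geometric midpoint are all assumptions chosen for illustration; the text only says that the worker bisects for a few steps.

```python
def igw_probabilities(scores, epsilon):
    # Assumed weighting: 1 / (epsilon + gap), normalised over the candidates.
    best = max(scores)
    weights = [1.0 / (epsilon + (best - p)) for p in scores]
    total = sum(weights)
    return [w / total for w in weights]


def expected_value(scores, epsilon):
    """V(epsilon): expected predicted click probability of the sampled story."""
    probs = igw_probabilities(scores, epsilon)
    return sum(q * p for q, p in zip(probs, scores))


def solve_epsilon(scores, fraction=0.8, lo=1e-6, hi=1e6, steps=40):
    """Bisect for V(epsilon) = fraction * best; lo, hi, steps are hypothetical."""
    target = fraction * max(scores)
    if expected_value(scores, hi) >= target:
        return hi  # even near-uniform sampling meets the target: explore fully
    if expected_value(scores, lo) <= target:
        return lo  # even near-greedy sampling falls short: stay greedy
    for _ in range(steps):
        mid = (lo * hi) ** 0.5  # geometric midpoint, since epsilon spans many decades
        if expected_value(scores, mid) > target:
            lo = mid  # still above target: move toward uniform
        else:
            hi = mid  # at or below target: move toward greedy
    return (lo * hi) ** 0.5
```

For example, with scores `[0.5, 0.3, 0.1, 0.1]` the target is `0.4` and the candidate mean is `0.25`, so a solution exists inside the bracket; with `[0.5, 0.45, 0.45]` the mean already exceeds `0.4` and the function returns the upper bound.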