Data: GDELT

Every story is a real news event from The GDELT Project, a free, open database that monitors world news in near real time and geocodes what it finds. We cite GDELT and link back, and we never republish article bodies, only a short summary and the source link.

Two GDELT tables, and why we use the GKG

GDELT publishes a new batch every 15 minutes. There are two relevant streams: the Events table (CAMEO-coded who-did-what-to-whom) and the Global Knowledge Graph, or GKG (themes, entities, locations, and tone extracted per article). CAMEO has no clean code for "flood" or "car crash": it encodes actors and actions, not disasters. The GKG carries an explicit theme taxonomy with per-location latitude/longitude, so it is the right source for misfortune. We pull the .gkg.csv.zip files (tab-delimited, no header, despite the .csv name).
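As a rough illustration of what "tab-delimited, no header" means in practice, the sketch below reads rows out of a downloaded archive. The function name `iter_gkg_rows` is hypothetical, and the 0-based column positions are our reading of the GKG 2.1 codebook, so check them against the codebook before relying on them.

```python
import io
import zipfile

# Assumed 0-based column positions in the 27-column GKG v2.1 layout.
COL_URL, COL_THEMES, COL_LOCATIONS, COL_TONE = 4, 7, 9, 15
EXPECTED_COLUMNS = 27


def iter_gkg_rows(zip_bytes):
    """Yield each well-formed row of a .gkg.csv.zip payload as a list of strings."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            with archive.open(name) as raw:
                text = io.TextIOWrapper(raw, encoding="utf-8", errors="replace")
                for line in text:
                    # No header and no quoting: a plain split on tabs is enough.
                    fields = line.rstrip("\r\n").split("\t")
                    if len(fields) == EXPECTED_COLUMNS:
                        yield fields
```

Rows with an unexpected column count are skipped here rather than repaired, which keeps the parser simple at the cost of dropping the occasional malformed line.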

The disaster filter

We keep only records whose GKG themes intersect an allowlist of concrete misfortune, verified against GDELT's own theme lookup:

NATURAL_DISASTER (+ _EARTHQUAKE / _FLOOD / _HURRICANE / _WILDFIRE / _TSUNAMI / _TORNADO)
MANMADE_DISASTER (+ _TRAFFIC_ACCIDENT / _PLANE_CRASH / _CAR_CRASH / _DERAILMENT)
DISASTER_FIRE, MARITIME_INCIDENT, RAIL_INCIDENT
KILL, WOUND, CRISISLEX_T03_DEAD, CRISISLEX_T02_INJURED

We deny MANMADE_DISASTER_IMPLIED (too noisy) and drop rows whose tone is not negative, because a positive-tone "KILL" is usually a film or book review, not a tragedy. Even so, the tagging is imperfect, so a language model gives each surviving record a final yes/no "is this a real-world disaster with casualties or damage?" gate before it becomes a story.
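The theme and tone checks above can be sketched as follows. This is a minimal version under several assumptions: themes arrive as a semicolon-delimited string, the first comma-separated value of the tone field is the average tone, themes are matched exactly rather than by prefix, and "deny" means a denied theme never counts as a match. The helper names are hypothetical.

```python
def _expand(prefix, suffixes):
    """Return the bare theme plus each suffixed variant, e.g. NATURAL_DISASTER_FLOOD."""
    return {prefix} | {f"{prefix}_{suffix}" for suffix in suffixes}


ALLOW = (
    _expand("NATURAL_DISASTER",
            ["EARTHQUAKE", "FLOOD", "HURRICANE", "WILDFIRE", "TSUNAMI", "TORNADO"])
    | _expand("MANMADE_DISASTER",
              ["TRAFFIC_ACCIDENT", "PLANE_CRASH", "CAR_CRASH", "DERAILMENT"])
    | {"DISASTER_FIRE", "MARITIME_INCIDENT", "RAIL_INCIDENT",
       "KILL", "WOUND", "CRISISLEX_T03_DEAD", "CRISISLEX_T02_INJURED"}
)
DENY = {"MANMADE_DISASTER_IMPLIED"}


def matched_themes(themes_field):
    """Allowlisted themes present in a semicolon-delimited themes field."""
    themes = {theme for theme in themes_field.split(";") if theme}
    return (themes & ALLOW) - DENY


def average_tone(tone_field):
    """First value of the comma-delimited tone field, or None if unparseable."""
    try:
        return float(tone_field.split(",")[0])
    except ValueError:
        return None


def keep_record(themes_field, tone_field):
    """True when the tone is negative and at least one allowlisted theme matches."""
    tone = average_tone(tone_field)
    return tone is not None and tone < 0 and bool(matched_themes(themes_field))
```

With exact matching the deny set is redundant, since a denied theme is simply absent from the allowlist; it only starts to matter if matching is ever loosened to prefixes.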

The pipeline (deterministic, re-runnable)

Stage	What it does
`fetch_gdelt.py`	sample one GKG batch per day across a window; parse the 27-column GKG v2.1 layout; extract the most specific location's lat/lon and the page title; keep allowlisted themes
`build_corpus.py`	dedup by URL and rounded (lat,lon,date); precompute day-of-week and month; rank by a "badness" score; round-robin across region × weekday × month for coverage
`gen_stories.py`	language-model tragedy gate + one calm factual sentence per event
`load_d1.py`	emit SQL; bulk-load the edge database
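The deduplication and calendar precomputation in the `build_corpus.py` row could look roughly like the sketch below. It assumes each event is a dict with `url`, `lat`, `lon`, and an ISO `date` string; those field names, the one-decimal rounding, and the function names are illustrative choices, not the pipeline's actual ones.

```python
from datetime import date


def dedup(events, decimals=1):
    """Keep the first event per URL and per rounded (lat, lon, date) key."""
    seen_urls, seen_places = set(), set()
    kept = []
    for event in events:
        place_key = (
            round(event["lat"], decimals),
            round(event["lon"], decimals),
            event["date"],
        )
        if event["url"] in seen_urls or place_key in seen_places:
            continue
        seen_urls.add(event["url"])
        seen_places.add(place_key)
        kept.append(event)
    return kept


def with_calendar_fields(event):
    """Return a copy of the event with integer weekday (Monday=0) and month."""
    day = date.fromisoformat(event["date"])
    return {**event, "weekday": day.weekday(), "month": day.month}
```

Rounding coordinates before comparing collapses several articles about the same incident in the same town into one entry, even when they come from different URLs.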

Each stage writes a JSON-lines artifact so it can be re-run independently. The corpus is a few thousand geocoded, dated events; matching "near me / a day like today" is a bounding-box query filtered on integer weekday and month columns.
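To make the "near me / a day like today" lookup concrete, here is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in for the edge database. The `events` table, its column names, the kilometre radius, and both function names are assumptions for illustration; the box also ignores wraparound at the antimeridian.

```python
import math
import sqlite3

KM_PER_DEGREE_LAT = 111.0  # rough average, good enough for a coarse box


def bounding_box(lat, lon, radius_km):
    """Return (min_lat, max_lat, min_lon, max_lon) around a point."""
    dlat = radius_km / KM_PER_DEGREE_LAT
    # Longitude degrees shrink toward the poles; clamp to avoid dividing by ~0.
    shrink = max(math.cos(math.radians(lat)), 0.01)
    dlon = radius_km / (KM_PER_DEGREE_LAT * shrink)
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon


QUERY = """
SELECT id FROM events
WHERE lat BETWEEN ? AND ?
  AND lon BETWEEN ? AND ?
  AND weekday = ? AND month = ?
ORDER BY id
"""


def find_matches(conn, lat, lon, radius_km, weekday, month):
    """IDs of events inside the box that share the given weekday and month."""
    box = bounding_box(lat, lon, radius_km)
    return [row[0] for row in conn.execute(QUERY, (*box, weekday, month))]
```

Because weekday and month are stored as integers, the whole match is a handful of range and equality comparisons, with no date parsing or distance computation at query time.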