Every story is a real news event from The GDELT Project, a free, open database that monitors world news in near real time and geocodes what it finds. We cite GDELT and link back, and we never republish article bodies, only a short summary and the source link.
GDELT publishes a new batch every 15 minutes. There are two relevant
streams: the Events table (CAMEO-coded who-did-what-to-whom) and the
Global Knowledge Graph (GKG) (themes, entities, locations, tone
extracted per article). CAMEO has no clean code for "flood" or "car crash" —
it encodes actors and actions, not disasters. The GKG carries an explicit
theme taxonomy with per-location latitude/longitude, so it is the right
source for misfortune. We pull the .gkg.csv.zip files (tab-delimited,
no header, despite the .csv name).
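
As a rough illustration of reading one such batch, the sketch below unzips a downloaded file in memory and splits each line on tabs. The function name `iter_gkg_rows` is hypothetical, and the column positions are assumptions taken from our reading of the 27-column GKG v2.1 layout; they should be checked against the GDELT codebook before use.

```python
import io
import zipfile

# Assumed 0-based column positions in the 27-column GKG v2.1 layout.
COL_DATE = 1        # publication timestamp
COL_URL = 4         # DocumentIdentifier (the article URL for web sources)
COL_THEMES = 7      # V1 themes, semicolon-delimited
COL_LOCATIONS = 9   # V1 locations
COL_TONE = 15       # V2Tone, comma-delimited, average tone first
EXPECTED_COLUMNS = 27


def iter_gkg_rows(zip_bytes):
    """Yield each well-formed record as a list of 27 string fields."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            with archive.open(name) as raw:
                # No header row, so every line is data. Splitting on tabs by
                # hand avoids the csv module's quoting and field-size rules.
                for line in io.TextIOWrapper(raw, encoding="utf-8", errors="replace"):
                    fields = line.rstrip("\r\n").split("\t")
                    if len(fields) == EXPECTED_COLUMNS:
                        yield fields
```

Rows with an unexpected column count are skipped rather than repaired, which is a deliberate simplification for a sampled corpus.
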
We keep only records whose GKG themes intersect an allowlist of concrete misfortune, verified against GDELT's own theme lookup:
- NATURAL_DISASTER (+ _EARTHQUAKE / _FLOOD / _HURRICANE / _WILDFIRE / _TSUNAMI / _TORNADO)
- MANMADE_DISASTER (+ _TRAFFIC_ACCIDENT / _PLANE_CRASH / _CAR_CRASH / _DERAILMENT)
- DISASTER_FIRE, MARITIME_INCIDENT, RAIL_INCIDENT
- KILL, WOUND, CRISISLEX_T03_DEAD, CRISISLEX_T02_INJURED

We deny MANMADE_DISASTER_IMPLIED (too noisy) and drop
non-negative-tone rows, because a positive-tone "KILL" is usually a film or
book review, not a tragedy. Even so, the tagging is imperfect, so a language
model gives each surviving record a final yes/no "is this a real-world
disaster with casualties or damage?" gate before it becomes a story.
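
The theme and tone filter described above could look roughly like this. It is a minimal sketch, not the pipeline's actual code: `keep_record` is a hypothetical name, the "+ suffix" entries are expanded into full theme names on the assumption that they are simple concatenations, and the tone is assumed to be the first comma-separated value of the tone field.

```python
ALLOW = frozenset({
    "NATURAL_DISASTER",
    "NATURAL_DISASTER_EARTHQUAKE", "NATURAL_DISASTER_FLOOD",
    "NATURAL_DISASTER_HURRICANE", "NATURAL_DISASTER_WILDFIRE",
    "NATURAL_DISASTER_TSUNAMI", "NATURAL_DISASTER_TORNADO",
    "MANMADE_DISASTER",
    "MANMADE_DISASTER_TRAFFIC_ACCIDENT", "MANMADE_DISASTER_PLANE_CRASH",
    "MANMADE_DISASTER_CAR_CRASH", "MANMADE_DISASTER_DERAILMENT",
    "DISASTER_FIRE", "MARITIME_INCIDENT", "RAIL_INCIDENT",
    "KILL", "WOUND", "CRISISLEX_T03_DEAD", "CRISISLEX_T02_INJURED",
})
DENY = frozenset({"MANMADE_DISASTER_IMPLIED"})


def keep_record(themes_field, tone_field):
    """Return True if a record has an allowlisted theme and negative tone."""
    themes = {t for t in themes_field.split(";") if t}
    # Denied themes never count as a match. With exact-name matching this is
    # only a guard; it matters if the allowlist is ever widened to prefixes.
    if not (themes - DENY) & ALLOW:
        return False
    try:
        tone = float(tone_field.split(",")[0])
    except ValueError:
        return False  # missing or malformed tone: treat as not negative
    return tone < 0
```

Records that pass this check would still go through the language-model gate; the function only narrows the candidate set.
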

| Stage | What it does |
|---|---|
| fetch_gdelt.py | sample one GKG batch per day across a window; parse the 27-column GKG v2.1 layout; extract the most specific location's lat/lon and the page title; keep allowlisted themes |
| build_corpus.py | dedup by URL and rounded (lat,lon,date); precompute day-of-week and month; rank by a "badness" score; round-robin across region × weekday × month for coverage |
| gen_stories.py | language-model tragedy gate + one calm factual sentence per event |
| load_d1.py | emit SQL; bulk-load the edge database |

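
To make the build_corpus.py row more concrete, here is one way the dedup and date precomputation could be written. This is a sketch under assumptions: the event field names (`url`, `lat`, `lon`, `date`), the ISO date format, and the one-decimal rounding precision are all hypothetical choices, not values taken from the pipeline.

```python
from datetime import date


def dedup_events(events, precision=1):
    """Drop events that repeat a URL or a rounded (lat, lon, date) key.

    Each kept event gains integer `weekday` (Monday = 0) and `month` fields.
    """
    seen_urls = set()
    seen_places = set()
    kept = []
    for ev in events:
        day = date.fromisoformat(ev["date"])
        place_key = (round(ev["lat"], precision), round(ev["lon"], precision), day)
        if ev["url"] in seen_urls or place_key in seen_places:
            continue  # same article, or same place on the same day
        seen_urls.add(ev["url"])
        seen_places.add(place_key)
        kept.append({**ev, "weekday": day.weekday(), "month": day.month})
    return kept
```

Rounding to one decimal place merges reports within roughly ten kilometres of each other on the same day; a finer precision would keep more near-duplicates.
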
Each stage writes a JSON-lines artifact so it can be re-run independently. The corpus is a few thousand geocoded, dated events; matching "near me / a day like today" is a bounding-box query plus integer weekday/month columns.
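
The matching query can be sketched with Python's built-in sqlite3 module standing in for the edge database. The table name `events`, its columns, and the one-degree half-width of the box are assumptions for illustration only.

```python
import sqlite3

# Hypothetical schema; the real table and column names may differ.
SCHEMA = """
CREATE TABLE events (
    url TEXT PRIMARY KEY,
    lat REAL NOT NULL,
    lon REAL NOT NULL,
    weekday INTEGER NOT NULL,  -- 0 = Monday
    month INTEGER NOT NULL     -- 1 = January
)
"""


def nearby_on_a_day_like(conn, lat, lon, weekday, month, half_deg=1.0):
    """Return events inside a lat/lon box that share the weekday and month."""
    return conn.execute(
        """
        SELECT url, lat, lon FROM events
        WHERE lat BETWEEN ? AND ?
          AND lon BETWEEN ? AND ?
          AND weekday = ? AND month = ?
        ORDER BY url
        """,
        (lat - half_deg, lat + half_deg,
         lon - half_deg, lon + half_deg,
         weekday, month),
    ).fetchall()
```

This simple box does not wrap across the antimeridian or adjust for longitude narrowing near the poles, which a production query would likely need to handle.
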