The Anatomy of Fear: Quantifying Horror.
An interactive data visualization study of how horror films generate fear. Analyzed 129 horror screenplays across 9,760 scenes and 11,204 horror signals using a GPT-4o-mini + GPT-4o pipeline (99.97% success rate, $2.28 total cost), then built nine custom D3.js visualizations into a scrollytelling site that reveals horror's power law: a tiny set of elite signals like scream and blood drive most of the actual scares.

129
screenplays
9,760
scenes
11,204
horror signals
$2.28
total LLM cost
chapter 01 —
The Power Law of Horror.
Horror follows a power law. Atmospheric elements like "dark" and "night" set the mood, but a handful of elite signals (scream, blood, kill, knife, death) drive most of the actual scares. Across 11,204 signal occurrences in 9,760 scenes, the top 10 signals accounted for 68% of all fear spikes (scenes with fear above 0.70), despite making up just 5% of our 207-term lexicon. The site turns that finding into nine interactive D3.js visualizations.
chapter 02 —
Why Horror?
Horror is uniquely suited to data analysis because its effectiveness lives in mood, pacing, and signal rather than plot logic. Streaming platforms make billion-dollar decisions about what scares people, yet most analysis of what makes horror work is still qualitative. We wanted to see whether the craft of horror (which words, beats, and structures actually work) could be quantified and presented visually for screenwriters, directors, and curious viewers.
chapter 03 —
The Python Pipeline.
how the data flows —
Raw screenplay .txt files
129 IMSDb scripts
Scene segmentation
regex on INT./EXT./FADE markers
Scene chunking
4 scenes per call · ~2k-token budget
GPT-4o-mini → GPT-4o → fallback
three-tier hybrid extraction
JSON schema validation
jsonschema, 0 to 1 scores, required fields
Flatten to CSV
flatten_scene_row(), 5 master tables
D3 visualization datasets
6 viz-ready files in cleaner_datasets/
python source files —
hybrid_horror_parser.py
Core AI parser. OpenAI calls, JSON validation, fallback logic, scene flattening, and CSV export.
run_full_analysis.py
Production wrapper. Points the parser at data/horror_screenplays/ and writes timestamped outputs.
requirements.txt
Python dependencies (openai, pandas, jsonschema, etc.).
config.env.example
Safe local API-key template. The actual key sits in config.env (gitignored).
What the AI returns per scene
{
"scene_index": 0,
"heading": "INT. BASEMENT - NIGHT",
"location": "BASEMENT",
"time_of_day": "NIGHT",
"characters": ["LAURIE", "MICHAEL"],
"dialogue_stats": {
"lines": 12,
"words": 140,
"question_rate": 0.25,
"exclamation_rate": 0.08,
"avg_line_words": 11.7
},
"action_stats": {
"words": 210,
"stage_directions": 9
},
"horror_signals": {
"night": 1,
"dark": 2,
"blood": 0,
"scream": 1
},
"tension_score": 0.82,
"fear_emotion": 0.74,
"sentiment": -0.63,
"scene_summary": "A character moves through a dark basement while a threat closes in."
}
the deep dive —
Why AI here?
Hand-coding 129 screenplays scene by scene (recording location, time of day, characters, dialogue/action mix, horror vocabulary, emotional tone) is consistent at small scale but slow and unrepeatable at corpus scale. The pipeline uses an LLM to produce structured computational annotations across all 9,760 scenes with the same prompt, then validates each response against a strict JSON schema so the data downstream looks the same whether it came from gpt-4o-mini, gpt-4o, or a conservative fallback row.
Screenplay ingestion + scene segmentation
Walked the input directory to collect 129 .txt screenplays (skipping the `downloaded_files.json` metadata file). The `split_scenes_heuristic()` parser normalizes line endings and scans for screenplay markers (INT./EXT., FADE IN/OUT, CUT TO, DISSOLVE TO, numbered scene dividers, CONTINUED), filters anything under 50 words to remove fragments and page artifacts, and truncates scenes above 2,000 words before model batching. Result: 9,760 cleanly-bounded scenes ready for AI analysis.
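A minimal sketch of the segmentation step, under stated assumptions: `split_scenes_sketch()`, `SCENE_MARKER`, and the exact regex are hypothetical stand-ins, since the real pattern inside `split_scenes_heuristic()` isn't shown. It mirrors the described behavior: normalize line endings, start a new scene at each marker line, drop scenes under 50 words, and truncate anything over 2,000 words.

```python
import re

# Assumed marker pattern (the real regex in split_scenes_heuristic() is not shown):
# optional scene number, optional "(", then a heading or transition keyword.
SCENE_MARKER = re.compile(
    r"^\s*(?:\d+\s+)?\(?(?:INT\.|EXT\.|INT/EXT\.|FADE IN|FADE OUT|CUT TO|DISSOLVE TO|CONTINUED)"
)
MIN_WORDS = 50     # shorter chunks are treated as fragments or page artifacts
MAX_WORDS = 2000   # longer scenes are truncated before batching


def split_scenes_sketch(text: str) -> list:
    """Split a screenplay at marker lines, then filter and truncate scenes."""
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    raw_scenes, current = [], []
    for line in lines:
        # A marker line closes the scene being built and starts a new one.
        if SCENE_MARKER.match(line) and current:
            raw_scenes.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        raw_scenes.append("\n".join(current))

    scenes = []
    for scene in raw_scenes:
        words = scene.split()
        if len(words) < MIN_WORDS:
            continue
        if len(words) > MAX_WORDS:
            # Truncation rejoins words with spaces, so line breaks are lost here.
            scene = " ".join(words[:MAX_WORDS])
        scenes.append(scene)
    return scenes
```

Any text before the first marker (a title page, for example) becomes its own candidate scene and survives only if it clears the 50-word filter.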
Hybrid AI strategy: three-tier fallback
Tier 1: GPT-4o-mini with temperature 0, max_tokens 1500, response_format JSON object, handling 99.965% of scenes. Tier 2: GPT-4o fallback only after Tier 1 fails twice on retry, picking up the remaining edge cases. Tier 3: a conservative fallback record (Unknown location/time, zeroed signals, tension/fear=0.5, sentiment=0) preserves row structure when both models fail. Real-world fallback rate: 0.035% across 9,760 scenes.
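The tier logic can be sketched as a nested loop. `analyze_with_fallback()` and `call_model` are hypothetical names, with `call_model` standing in for the OpenAI request plus JSON/schema validation; it is assumed to raise on any failure. "Retry up to twice" is read here as one first attempt plus two retries per model, which is an assumption.

```python
from typing import Callable, Tuple

MODEL_TIERS = ("gpt-4o-mini", "gpt-4o")  # Tier 1, then Tier 2
RETRIES_PER_MODEL = 2  # assumption: one first attempt plus two same-model retries


def fallback_record(scene_index: int) -> dict:
    """Tier 3: conservative row that keeps the output shape intact."""
    return {
        "scene_index": scene_index,
        "location": "Unknown",
        "time_of_day": "Unknown",
        "horror_signals": {},  # zeroed signals, represented here as an empty dict
        "tension_score": 0.5,
        "fear_emotion": 0.5,
        "sentiment": 0.0,
    }


def analyze_with_fallback(
    scene_index: int,
    prompt: str,
    call_model: Callable[[str, str], dict],
) -> Tuple[dict, str]:
    """Return (record, source). call_model is a hypothetical stand-in that
    raises on API errors or on output that fails JSON/schema validation."""
    for model in MODEL_TIERS:
        for _attempt in range(1 + RETRIES_PER_MODEL):
            try:
                return call_model(model, prompt), model
            except Exception:
                continue  # retry the same model, then advance to the next tier
    return fallback_record(scene_index), "fallback"
```

The broad `except Exception` keeps the sketch short; a production version would catch only API and validation errors so genuine bugs still surface.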
Parallel processing for throughput
ThreadPoolExecutor at the script level (max_workers=6) plus a second ThreadPoolExecutor inside each script for chunks (max_workers=3), allowing up to 18 requests in flight at once. Scenes are batched at up to 4 per chunk with a ~2,000-token budget per request, estimated at one token per four characters. The full 129-film corpus processes in under three hours for $2.28 total, 96.5% cheaper than running GPT-4o on everything.
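A rough sketch of the batching and the two-level thread pool, assuming the greedy packing rule below (the source states the limits, not the packing algorithm). `chunk_scenes()`, `process_script()`, and `process_corpus()` are hypothetical names, and `analyze_chunk` stands in for one model request that returns one row per scene.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_SCENES_PER_CHUNK = 4
TOKEN_BUDGET = 2000


def estimate_tokens(text: str) -> int:
    """Rough estimate used for budgeting: about one token per four characters."""
    return len(text) // 4


def chunk_scenes(scenes: list) -> list:
    """Greedily pack scenes into chunks of at most 4 scenes and ~2,000 tokens."""
    chunks, current, used = [], [], 0
    for scene in scenes:
        cost = estimate_tokens(scene)
        full = len(current) == MAX_SCENES_PER_CHUNK or used + cost > TOKEN_BUDGET
        if current and full:
            chunks.append(current)
            current, used = [], 0
        current.append(scene)  # an oversized scene still gets a chunk to itself
        used += cost
    if current:
        chunks.append(current)
    return chunks


def process_script(scenes, analyze_chunk):
    """Inner pool: up to 3 chunk requests per script."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        results = list(pool.map(analyze_chunk, chunk_scenes(scenes)))
    return [row for chunk_rows in results for row in chunk_rows]


def process_corpus(scripts, analyze_chunk):
    """Outer pool: up to 6 scripts at once, so at most 6 x 3 = 18 requests in flight."""
    with ThreadPoolExecutor(max_workers=6) as pool:
        return list(pool.map(lambda s: process_script(s, analyze_chunk), scripts))
```

Because each outer worker opens its own inner pool, concurrency multiplies rather than adds, which is why both worker counts stay small. Threads suit this workload because the time is spent waiting on network I/O rather than computing.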
Prompt engineering + JSON validation
The prompt builder injects film title, scene indices, and raw text (each scene truncated to 300 words inside the prompt) alongside a compact example object and a hard instruction: "Return ONLY valid JSON. No explanations, no markdown, no extra text." Post-call cleanup trims whitespace, strips anything before the first `{` and after the last `}`, then runs `json.loads()`. A `jsonschema` validator enforces required fields, integer scene indices, string-array characters, non-negative dialogue/action counts, fear/tension in 0–1, sentiment in −1 to 1, and no extra top-level fields. Same-model retry up to twice before advancing tiers.
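The cleanup and validation steps can be sketched as follows. The schema is a reconstruction from the constraints listed above and the example scene object, not the project's actual schema, and treating all twelve fields as required is an assumption. `extract_json()` and `validate_scene()` are hypothetical names; `jsonschema.validate(instance=..., schema=...)` is the real third-party call, imported lazily so the parsing helper works without it.

```python
import json

# Reconstructed from the described constraints; the required list is assumed.
SCENE_SCHEMA = {
    "type": "object",
    "additionalProperties": False,  # no extra top-level fields
    "required": [
        "scene_index", "heading", "location", "time_of_day", "characters",
        "dialogue_stats", "action_stats", "horror_signals",
        "tension_score", "fear_emotion", "sentiment", "scene_summary",
    ],
    "properties": {
        "scene_index": {"type": "integer"},
        "heading": {"type": "string"},
        "location": {"type": "string"},
        "time_of_day": {"type": "string"},
        "characters": {"type": "array", "items": {"type": "string"}},
        "dialogue_stats": {"type": "object", "additionalProperties": {"type": "number", "minimum": 0}},
        "action_stats": {"type": "object", "additionalProperties": {"type": "number", "minimum": 0}},
        "horror_signals": {"type": "object", "additionalProperties": {"type": "integer", "minimum": 0}},
        "tension_score": {"type": "number", "minimum": 0, "maximum": 1},
        "fear_emotion": {"type": "number", "minimum": 0, "maximum": 1},
        "sentiment": {"type": "number", "minimum": -1, "maximum": 1},
        "scene_summary": {"type": "string"},
    },
}


def extract_json(raw: str) -> dict:
    """Trim the reply, keep the first '{' through the last '}', then parse it."""
    text = raw.strip()
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in model output")
    return json.loads(text[start:end + 1])


def validate_scene(obj: dict) -> None:
    """Raise jsonschema.ValidationError if obj breaks the schema."""
    from jsonschema import validate  # third-party; imported lazily
    validate(instance=obj, schema=SCENE_SCHEMA)
```

Slicing from the first `{` to the last `}` recovers objects wrapped in stray prose or markdown fences, which is exactly the failure the "Return ONLY valid JSON" instruction tries to prevent.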
Horror signal detection: 6-family lexicon
Each scene is scored against a fixed lexicon of 207 horror terms collapsing to 187 unique hs_* columns across six families: Atmosphere & Setting (night, dark, fog, basement, cabin, woods, cemetery, abandoned), Sound & Voice (scream, whisper, moan, gasp, shriek, howl, heartbeat, footsteps), Threats & Violence (blood, knife, gun, weapon, blade, chainsaw, stab, brutal, death), Supernatural (ghost, demon, possessed, spirit, witch, curse, haunted), Psychological (fear, panic, dread, paranoid, disturbed, terrifying), and Movement (chase, run, stalk, pursue, hide, escape, trapped). Top signals by raw count: night 3,694, blood 1,460, death 1,213, scream 1,187. Top by fear impact: scream +0.691, blood +0.562, death +0.438, night +0.297.
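As an illustration of lexical counting, here is a deterministic whole-word counter over a tiny, hypothetical slice of the lexicon. The example scene object shows the model returning `horror_signals` itself, so this sketch is an alternative way to produce comparable counts, not necessarily how the pipeline did it. The source also doesn't say how 207 terms collapse into 187 columns; this version matches exact tokens only, so "darkness" does not count as "dark".

```python
import re
from collections import Counter

# Illustrative subset only; the real lexicon has 207 terms across six families.
LEXICON = {
    "Atmosphere & Setting": ["night", "dark", "fog", "basement"],
    "Sound & Voice": ["scream", "whisper", "footsteps"],
    "Threats & Violence": ["blood", "knife", "death"],
}


def count_signals(scene_text: str) -> dict:
    """Exact whole-word, case-insensitive counts keyed by hs_* column name."""
    tokens = Counter(re.findall(r"[a-z]+", scene_text.lower()))
    return {f"hs_{term}": tokens[term] for terms in LEXICON.values() for term in terms}


def family_totals(counts: dict) -> dict:
    """Roll per-term counts up to their family."""
    return {
        family: sum(counts[f"hs_{term}"] for term in terms)
        for family, terms in LEXICON.items()
    }
```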
Emotional scoring on calibrated 0–1 scales
Tension is rubric-anchored: 0–0.2 calm, 0.3–0.5 unease, 0.6–0.8 high, 0.9–1.0 extreme suspense. Fear follows the same rubric (little/none → intense terror). Sentiment runs −1 to +1. API temperature is held at 0 throughout, since this is an extraction task: consistency matters more than creative variation. Production averages across all 129 films: tension 0.436, fear 0.310.
Dialogue + action structural analysis
Alongside emotion, the model estimates structural metrics per scene: dialogue lines, dialogue words, question rate, exclamation rate, average line length, action words, and stage-direction count. This lets us separate talk-heavy scenes from action-heavy ones and powers the "silence amplifies dread" finding later in the analysis (a −0.34 correlation between dialogue density and tension).
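A sketch of how the dialogue/tension relationship could be measured. The dialogue ratio here (dialogue words divided by dialogue plus action words) is an assumed definition, as is reading the "23% higher tension" comparison between ratios below 0.30 and above 0.70 as a relative difference in mean tension. `dialogue_ratio()`, `pearson()`, and `tension_gap()` are hypothetical helpers.

```python
import math
from statistics import mean


def dialogue_ratio(dialogue_words: int, action_words: int) -> float:
    """Assumed definition: share of a scene's words that are dialogue."""
    total = dialogue_words + action_words
    return dialogue_words / total if total else 0.0


def pearson(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def tension_gap(ratios, tensions, low=0.30, high=0.70) -> float:
    """Relative mean tension of low-dialogue scenes vs dialogue-heavy scenes."""
    quiet = [t for r, t in zip(ratios, tensions) if r < low]
    talky = [t for r, t in zip(ratios, tensions) if r > high]
    return mean(quiet) / mean(talky) - 1
```

For the example scene object (140 dialogue words, 210 action words) the assumed ratio is 0.4. `pearson()` is undefined when either series is constant; on Python 3.10+, `statistics.correlation()` computes the same coefficient.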
Flattening + visualization-ready datasets
`flatten_scene_row()` converts the nested JSON into flat CSV columns (`dialogue_stats.lines` → `dialogue_lines`, `horror_signals.blood` → `hs_blood`, character arrays joined by pipes). Five master CSVs come out: scenes_detailed (9,760×204), horror_signals (9,760×190), emotional_analysis (9,760×7), dialogue_analysis (9,760×8), and a 1-row analysis_summary with run totals. These get cleaned into six viz-ready files (viz1_horror_signals_by_film, viz2a_tension_journey, viz2b_fear_journey, viz3_horror_effectiveness, viz4_film_comparison, viz5_horror_categories) for D3 consumption.
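A minimal sketch of the flattening idea. `flatten_scene_sketch()` and `rows_to_csv()` are hypothetical; column names other than `dialogue_lines` and `hs_blood` follow an assumed prefix rule, and the real `flatten_scene_row()` may name or order columns differently.

```python
import csv
import io


def flatten_scene_sketch(film: str, scene: dict) -> dict:
    """Flatten one nested scene object into a single CSV-ready row (illustrative)."""
    row = {
        "film": film,
        "scene_index": scene["scene_index"],
        "location": scene.get("location", ""),
        "time_of_day": scene.get("time_of_day", ""),
        "characters": "|".join(scene.get("characters", [])),  # pipe-joined array
        "tension_score": scene.get("tension_score"),
        "fear_emotion": scene.get("fear_emotion"),
        "sentiment": scene.get("sentiment"),
    }
    # Nested dicts become prefixed columns, e.g. dialogue_stats.lines -> dialogue_lines.
    for prefix, key in (("dialogue", "dialogue_stats"), ("action", "action_stats"), ("hs", "horror_signals")):
        for name, value in scene.get(key, {}).items():
            row[f"{prefix}_{name}"] = value
    return row


def rows_to_csv(rows: list) -> str:
    """Write rows using the union of their columns; any missing cell becomes 0."""
    fieldnames = list(dict.fromkeys(col for row in rows for col in row))
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval=0)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

A default of 0 suits the `hs_*` columns, where an absent key means the signal never appeared; any other field that can be missing would need its own default.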
IMDb metadata integration
Joined screenplay data with IMDb ratings, votes, cast, and duration using a standardized "title + year" join key. This powers the Rating Constellation correlation between horror craft and audience reception, which comes out at −0.245: technical horror chops don't guarantee a high IMDb score.
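The join key is what makes the merge reliable, since screenplay archives and IMDb often format titles differently. The normalization rules below (accent folding, moving a trailing ", The" to the front, turning punctuation into spaces) are assumptions, not the project's documented rules, and `join_key()` is a hypothetical name.

```python
import re
import unicodedata


def join_key(title: str, year) -> str:
    """Normalize 'title + year' into a key both sources can agree on (assumed rules)."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))  # drop accents
    text = text.lower().strip()
    text = re.sub(r"^(.*),\s*(the|a|an)$", r"\2 \1", text)  # "exorcist, the" -> "the exorcist"
    text = text.replace("&", " and ")
    text = re.sub(r"[^a-z0-9]+", " ", text).strip()  # punctuation and quotes become spaces
    return f"{text}|{year}"
```

With pandas, the key can be added as a column on both frames and joined with something like `scripts.merge(imdb, on="join_key", how="left", indicator=True)` (frame names hypothetical); the resulting `_merge` column flags scripts that found no IMDb match.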
chapter 04 —
Key findings.
68%
Elite signals drive fear
The top 10 signals (scream, blood, kill, knife, death, shadow, fear, dark, silent, night) account for 68% of fear spikes above 0.70, yet make up only 5% of the lexicon. "Scream" appears in 9% of scenes but adds +0.37 fear. "Dark" appears in 38% but adds only +0.18. Strategic deployment beats scattering.
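One way to compute figures like these, under assumed definitions: "fear impact" as the difference in mean fear between scenes that contain a signal and scenes that don't, and "share of fear spikes" as the fraction of scenes above 0.70 fear that contain at least one elite signal. The source doesn't define either metric, so `fear_impact()` and `spike_share()` are hypothetical and may not reproduce the published numbers.

```python
from statistics import mean

SPIKE_THRESHOLD = 0.70


def fear_impact(scenes: list, signal: str) -> float:
    """Assumed definition: mean fear with the signal minus mean fear without it."""
    with_sig = [s["fear"] for s in scenes if s["signals"].get(signal, 0) > 0]
    without = [s["fear"] for s in scenes if s["signals"].get(signal, 0) == 0]
    if not with_sig or not without:
        return 0.0
    return mean(with_sig) - mean(without)


def spike_share(scenes: list, elite: set) -> float:
    """Share of fear spikes (fear > 0.70) containing at least one elite signal."""
    spikes = [s for s in scenes if s["fear"] > SPIKE_THRESHOLD]
    if not spikes:
        return 0.0
    hits = sum(1 for s in spikes if any(s["signals"].get(t, 0) > 0 for t in elite))
    return hits / len(spikes)
```

Under these definitions a rare signal can still score high impact, which is the pattern the scream/dark contrast describes.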
73%
Sustained unease beats constant terror
Average tension (0.52) consistently exceeds average fear (0.41), and 73% of scenes score higher on tension than on fear. Effective horror maintains baseline unease and punctuates it with shock moments. Valleys make peaks feel higher.
−0.34
Silence amplifies dread
Scenes with low dialogue ratios (< 0.30) showed 23% higher tension than dialogue-heavy scenes (> 0.70). Correlation: −0.34. Letting sounds, visuals, and pauses do the work outperforms exposition.
75–90%
Fear clusters toward the climax
Across 129 films, 28% of fear spikes happen in the third quarter and 29% in the final quarter. Most films reserve their highest fear peak (> 0.80) for the 75–90% runtime window. Horror follows surprisingly consistent pacing.
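A sketch of the pacing measurement, assuming scene order stands in for runtime (scene i of n placed at its midpoint, (i + 0.5) / n) and that spikes are pooled across films before taking shares. `runtime_position()`, `spike_quartile_shares()`, and `peak_in_window()` are hypothetical helpers.

```python
def runtime_position(scene_index: int, n_scenes: int) -> float:
    """Scene order as a proxy for runtime: scene i of n sits at (i + 0.5) / n."""
    return (scene_index + 0.5) / n_scenes


def spike_quartile_shares(films: list, threshold: float = 0.70) -> list:
    """Pooled share of fear spikes in each runtime quarter across all films."""
    counts = [0, 0, 0, 0]
    for fears in films:  # one list of per-scene fear scores per film
        for i, fear in enumerate(fears):
            if fear > threshold:
                # Positions are always below 1.0, so the quarter index is 0-3.
                counts[int(runtime_position(i, len(fears)) * 4)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * 4


def peak_in_window(fears: list, lo: float = 0.75, hi: float = 0.90) -> bool:
    """True when the film's single highest-fear scene lands inside [lo, hi]."""
    peak = max(range(len(fears)), key=fears.__getitem__)
    return lo <= runtime_position(peak, len(fears)) <= hi
```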
chapter 05 —
Visualizations.
Nine D3.js scenes wrap a scrollytelling narrative. Each one is built directly from the AI-extracted CSVs.

scene 01 —
Blood Flow of Horror.
Sankey diagram tracing how the six signal families (Audio, Visual, Pace, Threat, Setting, Psyche) branch into individual horror terms. Stream thickness encodes frequency; node brightness encodes fear impact.

scene 02 —
Heartbeat of Terror.
Fear progression across normalized film runtime, animated as a live BPM monitor. Skull markers flag peak terror moments (fear > 0.70).

scene 03 —
Mapping the Spikes.
Multi-film timeline with tombstone markers for fear spikes and lantern markers for tension spikes. Compare any subset of films side by side.

scene 04 —
The Ladder of Fear.
3×3 Markov transition matrix between Calm, Unease, and Panic states. Cell darkness encodes transition probability.
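A sketch of how such a matrix could be built. Which score drives the states and where the cut-offs sit are not stated, so this version assumes fear scores with Calm below 0.3, Unease below 0.6, and Panic otherwise, loosely following the rubric bands described earlier. Counts are taken within each film so no transition crosses from one film into the next, then each row is normalized into probabilities.

```python
STATES = ("Calm", "Unease", "Panic")


def to_state(score: float) -> int:
    """Map a 0-1 score to a state index (assumed cut-offs)."""
    if score < 0.3:
        return 0  # Calm
    if score < 0.6:
        return 1  # Unease
    return 2      # Panic


def transition_matrix(films: list) -> list:
    """Row-normalized 3x3 matrix: entry [a][b] estimates P(next = b | current = a)."""
    counts = [[0] * len(STATES) for _ in STATES]
    for scores in films:  # one list of per-scene scores per film
        states = [to_state(s) for s in scores]
        for a, b in zip(states, states[1:]):
            counts[a][b] += 1
    return [
        [c / sum(row) if sum(row) else 0.0 for c in row]
        for row in counts
    ]
```

Each non-empty row sums to 1, so cell darkness in the chart can map directly to these values.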

scene 05 —
What Actually Works.
Bubble chart plotting signal frequency against emotional impact. Color encodes shock-heavy (red) vs tension-heavy (blue).

scene 06 —
Impact Dripline.
Ranked bar chart sorting all 207 signals by combined fear + tension impact. The steep drop-off curve makes the power law impossible to miss.

scene 07 —
Does Scary Equal Good?
Scatter of horror impact score against IMDb rating across all 129 films, with correlation line and rating-range filters.

scene 08 —
Horror Fingerprint.
6-axis radar comparing each film's balance across the six signal families. Slider-driven recommender suggests films matching your preferred horror mix.
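A sketch of one possible recommender, assuming each film's radar is its share of signals per family and the sliders express a target mix on the same axes. Ranking by Euclidean distance (nearest neighbor) is a swapped-in technique for illustration; `family_shares()` and `recommend()` are hypothetical, and the family names follow the Sankey labels.

```python
import math

FAMILIES = ("Audio", "Visual", "Pace", "Threat", "Setting", "Psyche")


def family_shares(counts: dict) -> list:
    """Turn per-family signal counts (or slider values) into shares that sum to 1."""
    total = sum(counts.get(f, 0) for f in FAMILIES)
    return [counts.get(f, 0) / total if total else 0.0 for f in FAMILIES]


def recommend(films: dict, slider_mix: dict, k: int = 3) -> list:
    """Return the k film titles whose family balance is closest to the slider mix."""
    target = family_shares(slider_mix)
    ranked = sorted(films, key=lambda title: math.dist(family_shares(films[title]), target))
    return ranked[:k]
```

Normalizing both sides to shares compares a film's balance rather than its volume, so a long, signal-dense script doesn't outrank a shorter one with the same mix.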

scene 09 —
Film Dossiers.
Browsable gallery of every analyzed film with quick access to its full breakdown: runtime fear curve, top signals, family balance, IMDb context.
chapter 06 —
Tools & technologies.
D3.js v7
Custom interactive visualizations: Sankey, radar, live BPM animation, all responsive across desktop and mobile.
GPT-4o-mini + GPT-4o
Hybrid LLM pipeline. Primary parser at 99.97% success; fallback only on the hardest 0.035% of scenes. Total cost $2.28.
Python (pandas, NumPy)
End-to-end data pipeline: deduplication, scene boundary detection, signal counting, validation, IMDb joins.
Observable Framework
Scrollytelling narrative wrapping the nine visualization scenes with client-side data caching.
Playwright
Automated screenshot capture for documentation and the visualization gallery.
chapter 07 —
Real-World Applications.
For screenwriters and directors: a quantitative reference for how to deploy elite signals at climactic moments and what fear-pacing structures actually land.
For film students and researchers: the nine scenes turn qualitative film theory into something you can interact with. Compare slashers to psychological horror, study how 1980s pacing differs from 2010s pacing, see where outliers sit.
For streaming platforms: a template for correlating craft signals with audience reception, identifying underserved horror recipes, and surfacing films with unique signal fingerprints.
Methodologically: the project demonstrates a reusable LLM screenplay-analysis pipeline (prompt templates, lexicon, emotion scoring framework) that scaled to 129 films and 10K scenes for under three dollars.
chapter 08 —
What this is, and what it isn't.
the honest caveats —
- AI-generated scores are computational annotations, not human-coded ground truth. Treat them as comparative signals, not exact measurements.
- Different model versions of gpt-4o-mini / gpt-4o can produce slightly different outputs for the same scene. The committed CSVs are pinned to one production run.
- OCR errors and inconsistent screenplay formatting affect scene splitting; some scenes break in non-ideal places.
- Signal counting is lexical, so a word like "dark" gets a hit whether it's literal, metaphorical, or atmospheric. Context is approximated, not understood.
- Fear and tension are inherently subjective. The 0–1 rubrics were calibrated against samples, but reasonable raters could still disagree on edge cases.
- The corpus is 129 horror screenplays from one source (IMSDb). Findings are descriptive of this collection, not the genre in full.