gantt
title Project Plan – Epstein Files Network Analysis
dateFormat YYYY-MM-DD
tickInterval 7day
section Setup & Data
Repository & Quarto setup :a1, 2026-02-17, 7d
Dataset loading & EDA :a2, 2026-02-24, 14d
section Analysis Pipeline
Entity extraction tests :a3, 2026-03-03, 14d
Data model & feature engineering :a4, 2026-03-17, 7d
Entity resolution & alias mapping :a5, 2026-03-31, 14d
Network construction (NetworkX) :a6, 2026-03-31, 14d
Network metrics (betweenness, structural holes) :a7, 2026-04-07, 14d
section Visualization
Streamlit prototype :a8, 2026-03-10, 14d
Dashboard refinement :a9, 2026-04-14, 14d
Data story drafting :a10, 2026-04-14, 14d
Data story finalized :a11, 2026-04-28, 7d
Deployment (Streamlit / GitHub Pages) :a12, 2026-04-28, 7d
section Coaching & Feedback
Concept coaching :milestone, m1, 2026-03-24, 1d
On-site coaching :milestone, m2, 2026-04-21, 1d
Online coaching :milestone, m3, 2026-04-28, 1d
On-site coaching :milestone, m4, 2026-05-05, 1d
section Finalization
Final testing & documentation :a13, 2026-05-05, 14d
Presentation slides & rehearsal :a14, 2026-05-19, 7d
Final presentation :milestone, m5, 2026-05-26, 1d
Project Charter – Epstein Files Network Analysis
1 Context and Scope
This project analyzes the publicly released Epstein Files – legal documents, emails, and records published by the U.S. Department of Justice and the U.S. House Oversight Committee. The dataset is sourced from HuggingFace (Nikity/Epstein-Files), which provides pre-extracted text with direct DOJ source URLs for provenance verification.
The project addresses the following research question:
Which network structure underlies the documented connections of Jeffrey Epstein, and which persons or institutions functioned as structural bridges between otherwise separate domains?
The visualization product will be published as a publicly accessible Streamlit web application and a Quarto-based data story on GitHub Pages.
2 Project Objectives and Success Criteria
The project enables users to explore the documented relationship network of key individuals in the Epstein Files. Specifically:
- Identify which actors function as structural bridges between otherwise disconnected clusters (betweenness centrality / structural holes)
- Explore the network interactively, filtered by domain, time period, and relationship type
- Trace every visualized connection back to its source document
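The two bridge measures named above can be illustrated on a toy graph. This is a minimal sketch, assuming NetworkX as listed under the project tools; the node names are placeholders and not entities from the dataset. It relies on `nx.betweenness_centrality` and `nx.constraint` (Burt's constraint, where lower values suggest a node spans a structural hole).

```python
import networkx as nx

# Toy graph with placeholder node names (not taken from the dataset):
# two tightly knit triangles joined only through the node "bridge".
G = nx.Graph()
G.add_edges_from([
    ("a1", "a2"), ("a2", "a3"), ("a1", "a3"),  # cluster A
    ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),  # cluster B
    ("a1", "bridge"), ("bridge", "b1"),        # the only link between A and B
])

# Betweenness centrality: share of shortest paths that pass through each node
# (normalized by default).
betweenness = nx.betweenness_centrality(G)

# Burt's constraint: low values indicate a node whose contacts are not
# connected to each other, i.e. a node spanning a structural hole.
constraint = nx.constraint(G)

top_broker = max(betweenness, key=betweenness.get)

for node in sorted(G, key=betweenness.get, reverse=True):
    print(f"{node:>6}  betweenness={betweenness[node]:.2f}  "
          f"constraint={constraint[node]:.2f}")
```

In this toy graph the node that joins the two clusters ranks highest on betweenness and has a lower constraint than the nodes embedded in a single cluster, which is the pattern the analysis looks for in the real network.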
Success criteria:
- Research question is answered visually and analytically
- Dataset meets course requirements: 100+ observations, 6+ features, mix of numerical and categorical variables
- Full pipeline is reproducible via GitHub repository and documented in Quarto
- Visualization product is publicly deployed and accessible
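The dataset criterion above can be checked mechanically. The following is a sketch only: the function name, the column names, and the rule that a column counts as numerical or categorical when all of its values are numbers or strings are assumptions made for illustration, not part of the course requirements.

```python
from numbers import Number

def meets_course_requirements(rows, min_rows=100, min_features=6):
    """Check row count, feature count, and the numerical/categorical mix.

    `rows` is a list of dicts that share the same keys (one dict per observation).
    """
    if not rows or len(rows) < min_rows:
        return False
    columns = list(rows[0].keys())
    if len(columns) < min_features:
        return False
    # A column is treated as numerical if every value is a number (bools excluded).
    numerical = [
        c for c in columns
        if all(isinstance(r[c], Number) and not isinstance(r[c], bool) for r in rows)
    ]
    # A column is treated as categorical if every value is a string.
    categorical = [c for c in columns if all(isinstance(r[c], str) for r in rows)]
    return bool(numerical) and bool(categorical)

# Hypothetical node table with six features (column names are assumptions).
sample_rows = [
    {
        "entity_id": f"E{i:04d}",
        "entity_type": "person" if i % 2 else "organization",
        "domain": "finance",
        "degree": i % 7,
        "betweenness": i / 1000,
        "document_count": i,
    }
    for i in range(120)
]

print(meets_course_requirements(sample_rows))
```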
Out of scope:
- Full ingestion of the complete DOJ corpus – a defined, documented subset is used
- LLM/RAG-based question answering
- Speculation or unverified claims about individuals
3 Stakeholder Analysis
| Stakeholder | Role | Goals | Relationship |
|---|---|---|---|
| Project team – Group 7 | Developers | Successfully complete the project, learn new methods, achieve a good grade | Internal – responsible for all deliverables |
| Course instructors | Evaluators | Assess technical quality, reproducibility, and visualization design | External – define requirements, provide feedback, grade the project |
| General public / researchers | End users | Explore documented connections in a transparent, data-driven way | External – secondary audience for the public platform |
| Journalists / investigators | End users | Cross-reference documented relationships with source material | External – potential future users of the platform |
4 User Analysis
5 Situation Assessment
Available resources:
- Dataset: Nikity/Epstein-Files on HuggingFace (4.1M rows, Parquet, pre-extracted text, DOJ source URLs)
- Personnel: 2 team members, approx. 2–3 hours/week each over 8–10 weeks (~20–30 hours per person)
- Tools: Python, NetworkX, spaCy, Streamlit, Quarto, GitHub
Constraints and risks:
- Limited time requires a strict scope – no large-scale infrastructure work
- Entity resolution (name disambiguation) is the main analytical risk; mitigated by alias mapping and manual validation of top entities
- Dataset may not cover 100% of released documents; coverage will be documented as a limitation
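The alias-mapping mitigation named above could take roughly the following shape. This is a minimal sketch under stated assumptions: the alias table, the entity IDs, and the helper names `normalize` and `resolve` are hypothetical, and all names are placeholders rather than entries from the dataset. Mentions that do not match return `None` so they can be routed to manual validation.

```python
import re
from typing import Dict, Optional

# Hypothetical alias table: each known variant maps to one canonical entity ID.
# Keys are stored in normalized form (lowercase, no punctuation).
ALIASES: Dict[str, str] = {
    "jane q example": "PERSON_001",
    "j example": "PERSON_001",
    "example jane": "PERSON_001",
    "acme holdings llc": "ORG_001",
    "acme holdings": "ORG_001",
}

def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(cleaned.split())

def resolve(name: str, aliases: Dict[str, str] = ALIASES) -> Optional[str]:
    """Return the canonical ID for a mention, or None to flag it for manual review."""
    return aliases.get(normalize(name))

for mention in ["Jane Q. Example", "Example, Jane", "ACME Holdings, LLC", "Unknown Person"]:
    print(f"{mention!r} -> {resolve(mention)}")
```

An exact-match table is deliberately conservative: it never merges two entities by accident, at the cost of leaving unseen variants unresolved until they are added by hand.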
6 Visualization Concept
The product combines two components:
- Interactive network dashboard (Streamlit): force-directed graph with nodes as entities, edges as documented co-occurrences. Node size encodes betweenness centrality. Filterable by domain, time, and hop distance.
- Guided data story (Quarto/GitHub Pages): narrative walkthrough of key findings with annotated charts and methodology documentation.
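The co-occurrence edges and the node-size encoding described above can be sketched as follows. The input structure, the document and entity IDs, and the size range are assumptions made for illustration. Each edge keeps the list of documents in which both entities appear, so that a connection shown in the dashboard can be traced back to its sources.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: document ID -> set of resolved entity IDs found in it.
doc_entities = {
    "doc_001": {"PERSON_001", "PERSON_002", "ORG_001"},
    "doc_002": {"PERSON_001", "PERSON_002"},
    "doc_003": {"PERSON_003"},  # a single entity produces no edge
}

def build_cooccurrence_edges(doc_entities):
    """Map each unordered entity pair to the sorted list of documents containing both."""
    edges = defaultdict(list)
    for doc_id, entities in sorted(doc_entities.items()):
        # Sorting the entities gives every pair one canonical (a, b) order.
        for pair in combinations(sorted(entities), 2):
            edges[pair].append(doc_id)
    return dict(edges)

def node_size(centrality, min_size=5.0, max_size=40.0):
    """Linearly map a centrality value in [0, 1] to a display size (assumed range)."""
    return min_size + centrality * (max_size - min_size)

edges = build_cooccurrence_edges(doc_entities)
for (a, b), sources in sorted(edges.items()):
    print(f"{a} -- {b}: weight={len(sources)}, sources={sources}")
```

The edge weight here is simply the number of shared documents; other weighting schemes are possible and would not change the provenance lists.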
[Detailed design decisions and value mapping (cognitive, communicative, experiential) – to be completed by team]
7 Project Plan
8 Roles and Contact Details
| Role | Name | Contact |
|---|---|---|
| Student | Sendogan Kulakci | kulaksen@students.zhaw.ch |
| Student | Lewis Birrer | birrelew@students.zhaw.ch |
| Lecturer | Dr. Manuel Dömer | doem@zhaw.ch |