Coursetexts · System Architecture

Copyright Compliance
Automation Pipeline

copyright.coursetexts.org
7-stage · multi-modal
human-in-the-loop
PDF Lecture Upload
professor's slide deck · any embedding format
01
Image Extraction
Hybrid fitz (PyMuPDF) + YOLOv10 (DocLayout-finetuned) detection. pHash deduplication with a Hamming-distance threshold of 15. Handles discrete objects, SVG-nested XObjects, and complex embeddings.
PyMuPDF · YOLOv10 · Modal GPU
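The deduplication step can be sketched as below. This is a minimal illustration, not the pipeline's code: it assumes 64-bit perceptual hashes have already been computed, and it assumes two images count as duplicates when their Hamming distance is at or below 15 (one reading of the threshold stated above). The names `hamming` and `dedupe` are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(a ^ b).count("1")


def dedupe(hashes: dict[str, int], threshold: int = 15) -> dict[str, str]:
    """Map every image id to the id of its first-seen representative.

    Assumption: distance <= threshold means "same image".
    """
    representatives: list[str] = []
    mapping: dict[str, str] = {}
    for image_id, value in hashes.items():
        for rep in representatives:
            if hamming(value, hashes[rep]) <= threshold:
                mapping[image_id] = rep  # near-duplicate of an earlier image
                break
        else:
            representatives.append(image_id)  # new distinct image
            mapping[image_id] = image_id
    return mapping
```

A mapping of this shape would also let a later stage apply one replacement to every copy of the same image.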
02
Reverse Search + License Attribution
Google + Bing SERP results ranked by 4-tier source authority (Wikimedia → publishers → .edu/.gov → reposts). DINOv2 cosine similarity gates against misattribution. Deterministic extractors exit early; the LLM is invoked only on ambiguous high-tier sources.
SERP API · DINOv2 · Tier-gated LLM
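A rough sketch of how the similarity gate and the tier ranking could combine. Everything here is assumed for illustration: the tier labels, the `min_sim` cutoff of 0.8, the candidate dictionary layout, and the function names. Embeddings are plain lists of floats standing in for DINOv2 vectors.

```python
import math

# Assumed tier labels; lower number = higher authority.
TIER = {"wikimedia": 1, "publisher": 2, "edu_gov": 3, "repost": 4}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pick_source(query_vec, candidates, min_sim=0.8):
    """Return the highest-authority candidate that passes the similarity gate.

    candidates: list of dicts with 'url', 'tier', and 'embedding' keys.
    Returns None when no candidate is similar enough.
    """
    matches = [c for c in candidates
               if cosine(query_vec, c["embedding"]) >= min_sim]
    if not matches:
        return None
    return min(matches, key=lambda c: TIER[c["tier"]])
```

Gating before ranking means a high-authority page showing a different image never outranks a lower-tier page showing the right one.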
03
Fair Use Evaluation
VLM judge scores all four § 107 factors: purpose/character, nature, amount/substantiality, market effect. Returns per-factor reasoning + confidence. Image + slide context + license metadata as joint input.
Gemini 2.5 Pro · 17 U.S.C. § 107
auditor can override judge verdict or select a web replacement from reverse-search results
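One way to represent the judge's output and the auditor override, as a sketch only. The field names, the 0.0 to 1.0 confidence scale, and the `effective` property are assumptions; the source states only that the judge returns per-factor reasoning and confidence and that an auditor can override the verdict.

```python
from dataclasses import dataclass
from typing import Optional

# The four statutory factors, with hypothetical key names.
FACTORS = ("purpose_character", "nature",
           "amount_substantiality", "market_effect")


@dataclass
class FactorScore:
    reasoning: str
    confidence: float  # assumed scale: 0.0 to 1.0


@dataclass
class FairUseVerdict:
    passes: bool                           # the judge's own verdict
    factors: dict[str, FactorScore]
    auditor_override: Optional[bool] = None

    def __post_init__(self):
        # Reject verdicts that do not cover all four factors.
        missing = set(FACTORS) - set(self.factors)
        if missing:
            raise ValueError(f"missing factors: {sorted(missing)}")

    @property
    def effective(self) -> bool:
        """Auditor decision when present, otherwise the judge's verdict."""
        if self.auditor_override is None:
            return self.passes
        return self.auditor_override
```

Keeping the judge's verdict and the override in separate fields preserves an audit trail of what the model said versus what the human decided.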
fair use passes?
✕ no — replace
decision
Chart or Graph?
If structured data underlies the image, route to deterministic regeneration — not generative AI. Preserves numerical fidelity; presentational elements are redrawn.
04a · Graph Regeneration
Qwen3-VL extracts values, labels, and chart type → serialized JSON → GPT-4o generates Matplotlib code → executed on a non-interactive backend. Colorblind-safe palettes applied.
Qwen3-VL-30B · GPT-4o · Matplotlib
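The intermediate JSON step might look something like the following. The schema (`chart_type`, `title`, `labels`, `values`), the set of allowed chart types, and the function name are all hypothetical; the source does not specify the serialized format.

```python
import json

# Assumed set of supported chart types.
ALLOWED_TYPES = {"bar", "line", "scatter", "pie"}


def serialize_chart(chart_type, labels, values, title=""):
    """Validate extracted chart data and serialize it to a JSON string."""
    if chart_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported chart type: {chart_type}")
    if len(labels) != len(values):
        raise ValueError("labels and values must have the same length")
    spec = {
        "chart_type": chart_type,
        "title": title,
        "labels": [str(label) for label in labels],
        "values": [float(v) for v in values],  # numbers kept as numbers
    }
    return json.dumps(spec, sort_keys=True)
```

Passing numbers through a structured spec, rather than asking a model to redraw pixels, is what lets the regenerated chart keep the original values exactly.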
04b · Generative Replacement
Gemini 2.5 Pro strips trademarks → descriptive prompt → Nano-banana-pro (primary) → GPT-image-1 (fallback) → DALL-E 3 (final). GPT-4o-mini VLM judge validates output; img2img pivot for persistent failures. Bria AI for commercially-safe mode.
Gemini 2.5 Pro · Nano-banana-pro · GPT-image-1 · Bria AI
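The primary → fallback → final chain with judge validation can be sketched generically. The generators and the judge are passed in as plain callables here, so no real image API is called; `generate_with_fallback` and its signature are assumptions for illustration.

```python
from typing import Callable, Optional, Sequence

Generator = Callable[[str], Optional[bytes]]  # prompt -> image bytes or None
Judge = Callable[[bytes], bool]               # image bytes -> accepted?


def generate_with_fallback(
    prompt: str,
    generators: Sequence[tuple[str, Generator]],
    judge: Judge,
) -> Optional[tuple[str, bytes]]:
    """Try each generator in order; return the first judge-approved image.

    Returns (generator_name, image_bytes), or None if every attempt fails.
    """
    for name, generate in generators:
        try:
            image = generate(prompt)
        except Exception:
            continue  # treat provider errors as a failed attempt
        if image is not None and judge(image):
            return name, image
    return None
```

A `None` result is the point where a caller could switch to a different strategy, such as the img2img pivot mentioned above.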
✓ yes — retain
Attribution Extraction
If a professor-provided caption exists (nearest-text heuristic + template matcher), inherit it. Otherwise compose one from reverse-search metadata: host, license type, URL. Editable in the frontend before PDF generation.
PyMuPDF · Heuristic matcher
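The inherit-or-compose rule reduces to a small function. The output format (`Source: … | License: … | URL`) and the name `build_attribution` are hypothetical; the source names only the three metadata fields.

```python
from typing import Optional


def build_attribution(caption: Optional[str], host: str = "",
                      license_type: str = "", url: str = "") -> str:
    """Inherit an existing caption, else compose one from search metadata."""
    if caption and caption.strip():
        return caption.strip()  # professor-provided caption takes priority
    parts = [
        f"Source: {host}" if host else "",
        f"License: {license_type}" if license_type else "",
        url,
    ]
    # Skip empty fields; an empty result signals "nothing to attribute yet".
    return " | ".join(p for p in parts if p)
```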
05
Image Replacement
Tier-1: direct XObject stream swap preserving Z-order and text overlays. Tier-2 fallback: background-camouflage overlay (perimeter pixel sampling) + centered refit. Duplicate tracking applies replacements document-wide.
PyMuPDF · XObjects · Letterboxing
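The Tier-2 fallback combines two small computations, sketched here on plain Python data. Choosing the most common border color is an assumption (an average would be another option), and both function names are hypothetical.

```python
from collections import Counter


def perimeter_color(pixels):
    """Most common color on the outer border of a pixel grid.

    pixels: list of rows, each a list of (r, g, b) tuples.
    """
    height, width = len(pixels), len(pixels[0])
    border = [pixels[y][x]
              for y in range(height) for x in range(width)
              if y in (0, height - 1) or x in (0, width - 1)]
    return Counter(border).most_common(1)[0][0]


def centered_fit(box_w, box_h, img_w, img_h):
    """Scale an image to fit inside a box, preserving aspect ratio.

    Returns (x_offset, y_offset, new_width, new_height) within the box.
    """
    scale = min(box_w / img_w, box_h / img_h)
    new_w, new_h = img_w * scale, img_h * scale
    return ((box_w - new_w) / 2, (box_h - new_h) / 2, new_w, new_h)
```

The sampled color fills the original image's rectangle, and the replacement is then drawn at the centered offsets so the leftover margins blend into the slide background.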
06
Attribution Injection
3-tier spatial logic: (1) proximity placement below/above/left/right with collision detection against existing content rects; (2) vertical scaling + bottom safe-zone if congested; (3) batch grouping with Llama-3.2-11B descriptive captions to disambiguate stacked credits.
PyMuPDF · Llama 3.2-11B
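Tier (1) of the spatial logic can be illustrated with plain rectangles. This sketch assumes `(x0, y0, x1, y1)` rectangles with y increasing downward, a 4-point gap, and left-aligned captions; `place_caption` and `intersects` are hypothetical names, and tiers (2) and (3) are not covered.

```python
Rect = tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def intersects(a: Rect, b: Rect) -> bool:
    """True when two rectangles overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def place_caption(image: Rect, size, occupied, page: Rect, gap: float = 4.0):
    """Try below, above, left, right; return the first collision-free spot.

    size: (width, height) of the caption box.
    occupied: rectangles of existing page content.
    Returns (side, rect), or None when every side is blocked.
    """
    w, h = size
    x0, y0, x1, y1 = image
    candidates = {
        "below": (x0, y1 + gap, x0 + w, y1 + gap + h),
        "above": (x0, y0 - gap - h, x0 + w, y0 - gap),
        "left": (x0 - gap - w, y0, x0 - gap, y0 + h),
        "right": (x1 + gap, y0, x1 + gap + w, y0 + h),
    }
    for side, rect in candidates.items():
        on_page = (page[0] <= rect[0] and page[1] <= rect[1]
                   and rect[2] <= page[2] and rect[3] <= page[3])
        if on_page and not any(intersects(rect, o) for o in occupied):
            return side, rect
    return None
```

A `None` result corresponds to the congested case, where the later tiers would take over.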
OCW-ready
Legally Compliant PDF
all images cleared · attributed · replaced · download-ready
primary pipeline stage
replacement branch (non-compliant)
retain branch (fair use passes)
↰ human-in-the-loop override point