Coursetexts Copyright Pipeline

PDF Lecture Upload

professor's slide deck · any embedding format

Image Extraction

Hybrid detection: fitz (PyMuPDF) extraction plus a Doclayout-fine-tuned YOLOv10 model. pHash deduplication by Hamming distance (threshold 15). Handles discrete objects, SVG-nested XObjects, and complex embeddings.

PyMuPDF · YOLOv10 · Modal GPU
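The deduplication step can be sketched as follows. This is a minimal illustration, assuming perceptual hashes are already available as 64-bit integers and that two images count as duplicates when their hashes differ by at most 15 bits (so a distance above 15 means "distinct"). The function names and the greedy first-seen grouping are assumptions, not the pipeline's confirmed implementation.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def dedupe(hashes: dict[str, int], threshold: int = 15) -> dict[str, str]:
    """Map each image id to the id of its first-seen representative.

    Assumption: distance <= threshold means "same image",
    distance > threshold means "distinct image".
    """
    representatives: list[str] = []
    assignment: dict[str, str] = {}
    for image_id, image_hash in hashes.items():
        for rep in representatives:
            if hamming(image_hash, hashes[rep]) <= threshold:
                assignment[image_id] = rep  # near-duplicate of an earlier image
                break
        else:
            representatives.append(image_id)  # no close match: new group
            assignment[image_id] = image_id
    return assignment
```

Grouping by representative is what would let a later replacement be applied to every copy of the same image in the deck.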

Reverse Search + License Attribution

Google + Bing SERP results ranked by 4-tier authority (Wikimedia → publishers → .edu/.gov → reposts). A DINOv2 cosine-similarity gate guards against misattribution. Deterministic extractors exit early; the LLM is invoked only on ambiguous high-tier sources.

SERP API · DINOv2 · Tier-gated LLM
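One way to picture the tier ranking and the similarity gate together is shown below. It is a sketch under stated assumptions: embeddings are plain lists of floats, `PUBLISHER_HOSTS` is a hypothetical allowlist, and the `min_similarity` value of 0.8 is illustrative rather than taken from the pipeline.

```python
import math
from typing import Optional
from urllib.parse import urlparse

# Hypothetical publisher allowlist, for illustration only.
PUBLISHER_HOSTS = ("nature.com", "springer.com")


def _in_domain(host: str, domain: str) -> bool:
    return host == domain or host.endswith("." + domain)


def authority_tier(url: str) -> int:
    """1 = Wikimedia, 2 = publishers, 3 = .edu/.gov, 4 = everything else."""
    host = (urlparse(url).hostname or "").lower()
    if _in_domain(host, "wikimedia.org"):
        return 1
    if any(_in_domain(host, p) for p in PUBLISHER_HOSTS):
        return 2
    if host.endswith(".edu") or host.endswith(".gov"):
        return 3
    return 4


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_source(query: list[float],
                candidates: list[tuple[str, list[float]]],
                min_similarity: float = 0.8) -> Optional[str]:
    """Drop visually dissimilar hits, then prefer the highest-authority tier."""
    scored = [(authority_tier(url), -cosine(query, emb), url)
              for url, emb in candidates
              if cosine(query, emb) >= min_similarity]
    return min(scored)[2] if scored else None
```

Applying the similarity gate before ranking means a high-authority page showing a different image cannot win on authority alone.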

Fair Use Evaluation

VLM judge scores all four § 107 factors: purpose/character, nature, amount/substantiality, market effect. Returns per-factor reasoning + confidence. Image + slide context + license metadata as joint input.

Gemini 2.5 Pro · 17 U.S.C. § 107
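The judge's per-factor output could be represented and validated along these lines. The field names (`favors_fair_use`, `confidence`, `reasoning`) and the 0 to 1 confidence range are assumptions made for illustration. The sketch deliberately stops short of combining the factors into a pass/fail decision, since the source does not say how that is done.

```python
from dataclasses import dataclass

# The four statutory factors; the key names are hypothetical.
FACTORS = ("purpose_character", "nature",
           "amount_substantiality", "market_effect")


@dataclass(frozen=True)
class FactorScore:
    favors_fair_use: bool
    confidence: float  # assumed to lie in [0.0, 1.0]
    reasoning: str


def parse_verdict(raw: dict) -> dict[str, FactorScore]:
    """Validate a judge response and require all four factors."""
    missing = [name for name in FACTORS if name not in raw]
    if missing:
        raise ValueError(f"missing factors: {missing}")
    verdict = {}
    for name in FACTORS:
        item = raw[name]
        confidence = float(item["confidence"])
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"{name}: confidence out of range")
        verdict[name] = FactorScore(bool(item["favors_fair_use"]),
                                    confidence, str(item["reasoning"]))
    return verdict
```

Rejecting incomplete responses up front keeps a partially scored image from reaching the auditor as if it had been fully evaluated.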

auditor can override judge verdict or select a web replacement from reverse-search results

fair use passes?

✕ no — replace

decision

Chart or Graph?

If structured data underlies the image, route to deterministic regeneration — not generative AI. Preserves numerical fidelity; presentational elements are redrawn.

04a · Graph Regeneration

Qwen3-VL extracts values, labels, chart type → serialized JSON → GPT-4o generates Matplotlib → executed on non-interactive backend. Colorblind-safe palettes applied.

Qwen3-VL-30B · GPT-4o · Matplotlib
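The serialized JSON that sits between extraction and plotting might look like the sketch below. The `ChartSpec` fields are hypothetical; the point is that the numbers travel as data and are checked before any plotting code is generated, which is what keeps the regenerated chart numerically faithful.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ChartSpec:
    """Hypothetical schema for the extracted chart data."""
    chart_type: str
    title: str
    x_label: str
    y_label: str
    categories: list[str]
    values: list[float]

    def validate(self) -> None:
        if len(self.categories) != len(self.values):
            raise ValueError("categories and values must have equal length")


def to_json(spec: ChartSpec) -> str:
    spec.validate()
    return json.dumps(asdict(spec), sort_keys=True)


def from_json(text: str) -> ChartSpec:
    spec = ChartSpec(**json.loads(text))
    spec.validate()
    return spec
```

For the rendering step, Matplotlib's non-interactive backend can be selected with `matplotlib.use("Agg")` before `pyplot` is imported, so generated plotting code runs without a display.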

04b · Generative Replacement

Gemini 2.5 Pro strips trademarks → descriptive prompt → Nano-banana-pro (primary) → GPT-image-1 (fallback) → DALL-E 3 (final). A GPT-4o-mini VLM judge validates the output; img2img pivot for persistent failures. Bria AI for commercially-safe mode.

Gemini 2.5 Pro · Nano-banana-pro · GPT-image-1 · Bria AI
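The primary, fallback, final ordering with a judge in the loop can be sketched generically. The generator and judge callables here are stand-ins, not real client code for any of the named models, and treating any exception as "try the next generator" is an assumption.

```python
from typing import Callable, Optional, Sequence

Generator = Callable[[str], bytes]  # prompt -> image bytes (hypothetical)
Judge = Callable[[bytes], bool]     # image bytes -> accepted?


def generate_with_fallback(
    prompt: str,
    generators: Sequence[tuple[str, Generator]],
    judge: Judge,
) -> Optional[tuple[str, bytes]]:
    """Try each generator in order; return the first output the judge accepts."""
    for name, generate in generators:
        try:
            image = generate(prompt)
        except Exception:
            continue  # assumed policy: any failure moves on to the next model
        if judge(image):
            return name, image
    return None  # caller would pivot to img2img at this point
```

Returning the generator name alongside the image makes it possible to record which model produced each replacement.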

✓ yes — retain

Attribution Extraction

If professor-provided caption exists (nearest-text heuristic + template matcher), inherit it. Otherwise compose from reverse-search metadata: host, license type, URL. Editable in frontend before PDF generation.

PyMuPDF · Heuristic matcher
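The inherit-or-compose rule might be expressed as below. The output format (`Source: … | License: … | URL`) is an assumption for illustration; only the three ingredients (host, license type, URL) come from the description above.

```python
from typing import Optional
from urllib.parse import urlparse


def compose_attribution(caption: Optional[str],
                        url: str,
                        license_type: Optional[str]) -> str:
    """Prefer the professor's caption; otherwise build one from metadata."""
    if caption and caption.strip():
        return caption.strip()  # inherit the existing caption unchanged
    host = urlparse(url).hostname or url
    parts = [f"Source: {host}"]
    if license_type:
        parts.append(f"License: {license_type}")
    parts.append(url)
    return " | ".join(parts)
```

Because the result is a plain string, it can be shown in the frontend for editing before it is written into the PDF.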

Image Replacement

Tier-1: direct XObject stream swap preserving Z-order and text overlays. Tier-2 fallback: background-camouflage overlay (perimeter pixel sampling) + centered refit. Duplicate tracking applies replacements document-wide.

PyMuPDF · XObjects · Letterboxing
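The Tier-2 fallback combines two small computations: picking a background color from the image's border and fitting the replacement inside the original rectangle. The sketch below assumes pixels are a row-major grid of RGB tuples and takes the most common border color as the background; the pipeline may sample or average differently.

```python
from collections import Counter

Pixel = tuple[int, int, int]
Rect = tuple[float, float, float, float]  # x0, y0, x1, y1


def perimeter_color(pixels: list[list[Pixel]]) -> Pixel:
    """Most common color along the outer border of the pixel grid."""
    height, width = len(pixels), len(pixels[0])
    border = list(pixels[0]) + list(pixels[-1])  # top and bottom rows
    border += [pixels[y][x]                       # left and right columns
               for y in range(1, height - 1)
               for x in (0, width - 1)]
    return Counter(border).most_common(1)[0][0]


def centered_fit(box: Rect, image_w: float, image_h: float) -> Rect:
    """Largest aspect-preserving rectangle centered inside box."""
    x0, y0, x1, y1 = box
    box_w, box_h = x1 - x0, y1 - y0
    scale = min(box_w / image_w, box_h / image_h)
    new_w, new_h = image_w * scale, image_h * scale
    left = x0 + (box_w - new_w) / 2
    top = y0 + (box_h - new_h) / 2
    return (left, top, left + new_w, top + new_h)
```

Filling the original rectangle with the sampled color and drawing the replacement into the fitted rectangle yields the letterboxed result without stretching the new image.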

Attribution Injection

3-tier spatial logic: (1) proximity placement below/above/left/right with collision detection against existing content rects; (2) vertical scaling + bottom safe-zone if congested; (3) batch grouping with Llama-3.2-11B descriptive captions to disambiguate stacked credits.

PyMuPDF · Llama-3.2-11B
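The first tier of the spatial logic, proximity placement with collision detection, can be sketched as follows. It assumes a top-left origin with y increasing downward (as in PyMuPDF page coordinates), axis-aligned rectangles, and a hypothetical `gap` between image and credit. Returning `None` stands for "fall through to the next tier".

```python
from typing import Optional

Rect = tuple[float, float, float, float]  # x0, y0, x1, y1


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def place_credit(image: Rect, credit_w: float, credit_h: float,
                 occupied: list[Rect], page: Rect,
                 gap: float = 2.0) -> Optional[Rect]:
    """Try below, above, left, right; return the first collision-free slot."""
    x0, y0, x1, y1 = image
    candidates = [
        (x0, y1 + gap, x0 + credit_w, y1 + gap + credit_h),  # below
        (x0, y0 - gap - credit_h, x0 + credit_w, y0 - gap),  # above
        (x0 - gap - credit_w, y0, x0 - gap, y0 + credit_h),  # left
        (x1 + gap, y0, x1 + gap + credit_w, y0 + credit_h),  # right
    ]
    for rect in candidates:
        on_page = (page[0] <= rect[0] and page[1] <= rect[1]
                   and rect[2] <= page[2] and rect[3] <= page[3])
        if on_page and not any(intersects(rect, o) for o in occupied):
            return rect
    return None  # congested: hand off to the scaling / safe-zone tier
```

The `occupied` list would hold the rectangles of existing text and images on the page, so a credit never lands on top of slide content.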

OCW-ready

Legally Compliant PDF

all images cleared · attributed · replaced · download-ready

Copyright Compliance Automation Pipeline