Automating Copyright Compliance for Open Courseware

Automating copyright compliance hero visual
Feb 24, 2025

Engineering

A 7-stage AI pipeline that turns a weeks-long legal workflow into a minutes-long process, with a human reviewer in the loop.

Coursetexts Team


Introduction

MIT OpenCourseWare runs on roughly $2.7M per year to maintain 2,300+ courses — about $1,170 per course annually, before a single image gets cleared [1]. That figure is the floor: it counts infrastructure and publishing without the legal review that copyright compliance demands.

A single course can contain hundreds of third-party images pulled from journals, textbooks, and news archives. Each one requires a researcher to reverse-search the source, determine license status, and either negotiate permission or redact the asset. Across a curriculum, that process takes weeks and requires dedicated legal staff.

Coursetexts publishes open courseware. When we hit 30 courses, copyright became a bottleneck. A professor's slide deck might have 200 images. Whether a photo of the Simpsons or a chart pulled from a scientific textbook, each image is a manual stop-and-search. At that pace, scaling to hundreds of courses meant either a large legal budget or a broken promise about openness.

We built a copyright pipeline that takes a PDF lecture deck and returns a fully compliant version. Every image analyzed for fair use under 17 U.S.C. § 107, copyrighted assets replaced with AI-generated equivalents, charts regenerated from extracted data, and attribution injected spatially into the document. Try it at copyright.coursetexts.org by uploading a PDF to the Image Copyright Analysis flow.


What makes copyright compliance hard?

How can universities and open educational resource providers share high-quality course material without navigating the maze of copyright permissions for every image, graph, and visual element? Manual copyright clearance is prohibitively expensive, requiring weeks of legal review per course; yet simply removing all potentially copyrighted images guts the value of many materials, especially in visual disciplines like art history, biology, or even engineering. Our pipeline automates the entire compliance workflow through a multi-stage process that treats each image as a distinct legal and pedagogical problem.

A graph of the process

generated with Claude

In a closed classroom, a professor can show any image under fair use: for purposes like education, the doctrine codified in 17 U.S.C. § 107 permits use of copyrighted material without permission. Publishing those same slides as open courseware (OCW) changes the calculus entirely, because digital — and thus global, public — distribution is not a classroom.

Fair use is surprisingly non-binary. Courts weigh four factors: (1) the purpose and character of the use — is it transformative? commercial?; (2) the nature of the copyrighted work — factual content gets more latitude than creative; (3) the amount used relative to the whole; and (4) the effect on the market for the original. The landmark Campbell v. Acuff-Rose (1994) Supreme Court decision shifted modern doctrine heavily toward factor 1, making "transformativeness" the dominant lens. Still, the determination requires case-by-case judgment. A graph from a Nature paper used in a biology lecture, given attribution in a non-commercial OCW context, is probably fine. A Getty Images photograph used decoratively, without attribution, in a globally distributed PDF, is probably not.

The traditional workflow encodes all of this as manual steps done by an employee:

  • Identify every image in every slide deck
  • Reverse-search each image to locate the original source
  • Determine license status (CC-BY, Public Domain, All Rights Reserved, unknown)
  • Apply fair use judgment under the four § 107 factors
  • Negotiate permissions, find replacements, or redact
  • Add attribution for retained images

Each step blocks on the previous, so the result is a linear, per-image workflow that a single human reviewer cannot parallelize. This is why MIT OCW explicitly notes that course packs containing proprietary content "cannot be provided under our license" — the clearance cost itself is considered to exceed the benefit of the materials existing outside their direct production pipeline [2].

Our approach: a 7-stage compliance pipeline

The Coursetexts copyright pipeline treats each image as an independent legal and pedagogical object. Images move through the seven stages in parallel, and a human reviewer sits between the automated judgment and the final download: they can override any decision, swap in a web result from the reverse search, or edit attribution text before committing.

01 — Image Extraction

The pipeline's first job is finding every image in a PDF, regardless of how it was embedded. This is harder than it sounds. PDFs do not have a flat image list — they store images as XObjects in a resource tree, sometimes nested inside other XObjects, sometimes as SVG elements, sometimes as inline streams inside page content operators. Missing an image at this stage creates a permanent blind spot.

We started with PyMuPDF's fitz library for extraction. It handles the common cases well but fails on images embedded as SVG or nested inside complex XObjects — exactly the kinds of embeddings that appear in professionally typeset lecture slides. So we evaluated object-detection models via LayoutParser (PubLayNet, HJDataset, PrimaLayout variants), which detect images by analyzing page renders rather than the PDF object tree. These failed differently: models like PubLayNet/mask_rcnn_X_101 would occasionally classify an entire slide as a single image region, and performance collapsed when images appeared near each other.

The solution is a hybrid. The current pipeline runs fitz first, collecting all extractable images and storing their perceptual hashes (pHash) and page numbers. In parallel, the full PDF renders as images and runs through a YOLOv10 model fine-tuned on DocLayout-YOLO — hosted on Modal for GPU acceleration — which returns bounding boxes for all detected image regions. We then match the YOLO detections against the fitz extractions page-by-page using hamming distance on pHash values; detections with hamming distance greater than 15 are treated as genuinely new finds not captured by fitz. Blank image detection and an aspect ratio filter handle edge cases.
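The page-wise matching step can be sketched as follows. This is a minimal stand-in that uses a simple average hash in place of pHash; the function names and the way images are passed in are illustrative, not the pipeline's actual code:

```python
from PIL import Image

def ahash(img, size=8):
    """Tiny average-hash stand-in for the pipeline's pHash: downscale to
    grayscale size x size, threshold at the mean, pack into a 64-bit int."""
    small = img.convert("L").resize((size, size), Image.LANCZOS)
    pixels = list(small.getdata())
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two packed hashes."""
    return bin(a ^ b).count("1")

NEW_FIND_THRESHOLD = 15  # hamming distance above which a detection counts as new

def new_finds(fitz_images, yolo_crops, threshold=NEW_FIND_THRESHOLD):
    """fitz_images / yolo_crops: PIL images from the same page.
    Returns the YOLO crops whose nearest fitz extraction is farther than
    `threshold` bits away, i.e. images fitz failed to extract."""
    fitz_hashes = [ahash(img) for img in fitz_images]
    out = []
    for crop in yolo_crops:
        h = ahash(crop)
        dists = [hamming(h, fh) for fh in fitz_hashes]
        if not dists or min(dists) > threshold:
            out.append(crop)
    return out
```

An identical crop matches its extraction at distance 0 and is dropped; a crop with no close extraction (or a page where fitz found nothing) is flagged as a new find.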

Implementation Note

The GitHub repo for DINO weights has rate limits. Load models directly from a self-hosted registry or use --trust-remote-code. Running sequential YOLO inference on a CPU is prohibitively slow; Modal's cold-start overhead is worth it for anything above ~10 pages.

Image extraction graph

Identified Images

source: copyright.coursetexts.org

02 — Reverse Search + License Attribution

For each extracted image, the pipeline needs to determine who owns it, and under what license. We do not rely on a single reverse image search call or unconstrained LLM inference. Instead, reverse search is a signal discovery step followed by deterministic verification, with LLMs invoked only when deterministic methods fail.

Images are uploaded to a public URL via a multi-host fallback (ensuring a stable link), then queried through Google and Bing reverse image search via the SERP API. Returned matches are classified into authority tiers that encode expected license reliability:

  • Tier 1 — Wikimedia Commons, Wikipedia, GitHub, Flickr, arXiv, Internet Archive
  • Tier 2 — NASA, NOAA, Nature, BBC, Getty Images, major institutional publishers
  • Tier 3 — .edu, .gov, .org domains, ResearchGate
  • Tier 4 — Pinterest, Reddit, Twitter (repost-centric; treated as weak signal only)

License extraction runs deterministic domain-specific extractors first — Wikimedia Commons license sections, Flickr license metadata APIs, GitHub LICENSE file fetches. If a license resolves with sufficient confidence, the pipeline exits early without touching an LLM. LLMs activate only on ambiguous results from Tier 1–2 sources.
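The tiering and early-exit logic reduces to a small lookup plus a deterministic-first resolver. This is a sketch under stated assumptions: the domain table, extractor registry, and function names below are hypothetical, not taken from the codebase:

```python
from urllib.parse import urlparse

# Hypothetical tier table mirroring the authority tiers described above
AUTHORITY_TIERS = {
    "commons.wikimedia.org": 1, "en.wikipedia.org": 1, "github.com": 1,
    "flickr.com": 1, "arxiv.org": 1, "archive.org": 1,
    "nasa.gov": 2, "noaa.gov": 2, "nature.com": 2, "gettyimages.com": 2,
    "pinterest.com": 4, "reddit.com": 4, "twitter.com": 4,
}

def tier_for(url):
    """Classify a reverse-search hit into an authority tier."""
    host = urlparse(url).netloc.lower().removeprefix("www.")
    if host in AUTHORITY_TIERS:
        return AUTHORITY_TIERS[host]
    if host.endswith((".edu", ".gov", ".org")):  # Tier 3 TLD heuristic
        return 3
    return 4

def resolve_license(matches, extractors, llm_fallback):
    """matches: list of result URLs, best first.
    extractors: {domain: callable} deterministic license extractors.
    llm_fallback: invoked only when deterministic extraction fails
    and strong (Tier 1-2) candidates remain."""
    for url in matches:
        host = urlparse(url).netloc.lower().removeprefix("www.")
        extractor = extractors.get(host)
        if extractor:
            resolved = extractor(url)
            if resolved is not None:
                return resolved  # early exit: no LLM call needed
    strong = [u for u in matches if tier_for(u) <= 2]
    return llm_fallback(strong) if strong else "unknown"
```

The key property is the early exit: a Wikimedia or Flickr hit that resolves deterministically never touches an LLM.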

To prevent misattribution across visually similar but legally distinct images (a common failure mode in reverse search), we compute DinoV2 embeddings for the top results and measure cosine similarity against the query image. Matches below a similarity threshold are discarded. The pipeline returns an explicit unknown rather than guessing — low-confidence candidates are dropped before attribution requirements apply.
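The similarity gate itself is plain cosine arithmetic. In this sketch the embeddings are assumed to be precomputed by the same encoder as the query (DinoV2 in the pipeline), and the threshold value is illustrative:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.80  # hypothetical cutoff; tuned per encoder in practice

def filter_matches(query_emb, candidates, threshold=SIMILARITY_THRESHOLD):
    """candidates: list of (url, embedding) pairs; embeddings are 1-D arrays.
    Returns (url, similarity) pairs that clear the cosine threshold.
    An empty result means the pipeline reports an explicit 'unknown'."""
    q = query_emb / np.linalg.norm(query_emb)
    kept = []
    for url, emb in candidates:
        sim = float(np.dot(q, emb / np.linalg.norm(emb)))
        if sim >= threshold:
            kept.append((url, sim))
    return kept
```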

Reverse search and attribution graph

A Graph of the proccess

source: copyright.coursetexts.org

03 — Fair Use Evaluation

Knowing an image's license is necessary but not sufficient. An All Rights Reserved image might still be usable under fair use; a Creative Commons image might have share-alike conditions that conflict with the target distribution. The fair use determination requires reading the image in context — against the slide it appears on, the course it belongs to, and the way it functions pedagogically.

Congress codified fair use in 17 U.S.C. § 107 as a four-factor balancing test. Courts since Campbell v. Acuff-Rose (1994) have weighted factor 1 — purpose and character, especially transformativeness — most heavily, but all four factors interact:

  • Factor 1 — Purpose & Character: non-commercial, transformative, educational use weighs strongly in favor
  • Factor 2 — Nature of the Work: factual/scientific images get more latitude than creative/expressive ones
  • Factor 3 — Amount & Substantiality: a full decorative photograph versus a cropped data figure are treated differently
  • Factor 4 — Market Effect: historically the factor courts weighted most heavily; does this use substitute for the market for the original?

We automate this evaluation with a VLM judge — currently Gemini 2.5 Pro — that receives the image, slide context text, reverse search metadata, and resolved license as joint input. The system prompt is grounded in the § 107 statutory language and Campbell doctrine, refined over many iterations. The judge returns per-factor scores with reasoning, a composite confidence score, and a binary fair-use determination. The human reviewer can override any verdict.
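One way to represent the judge's structured output is a per-factor score plus a weighted composite. The field names and weights below are hypothetical, shown only to make the shape of the verdict concrete; the real prompt and scoring scheme are not reproduced here:

```python
from dataclasses import dataclass

@dataclass
class FactorScore:
    score: float      # -1.0 (weighs against fair use) .. +1.0 (favors fair use)
    reasoning: str    # the VLM judge's per-factor explanation

@dataclass
class FairUseVerdict:
    purpose: FactorScore   # § 107 factor 1: purpose & character
    nature: FactorScore    # factor 2: nature of the work
    amount: FactorScore    # factor 3: amount & substantiality
    market: FactorScore    # factor 4: market effect
    confidence: float      # judge's self-reported confidence, 0..1

    # Hypothetical weighting: factor 1 dominates post-Campbell,
    # factor 4 remains heavily weighted.
    WEIGHTS = (0.40, 0.10, 0.15, 0.35)

    def composite(self):
        factors = (self.purpose, self.nature, self.amount, self.market)
        return sum(w * f.score for w, f in zip(self.WEIGHTS, factors))

    def is_fair_use(self):
        return self.composite() > 0
```

A structure like this makes the binary determination auditable: the human reviewer sees the per-factor reasoning, not just the verdict.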

Two paths: retain or replace

After fair use evaluation, the pipeline branches. Images that pass proceed to attribution extraction. Images that fail route to the replacement pipeline — which itself branches based on image type.

Two paths retain or replace graph


source: copyright.coursetexts.org

04a — Graph Regeneration

Charts and graphs occupy a legal gray zone. The underlying data — measurements, statistics, relationships — are facts and not copyrightable. However, presentational choices like color scheme, layout, axis styling, and legend formatting can carry copyright protection. Any replacement must therefore preserve the exact numerical values rather than just the visual trend: a generated image of "a bar chart showing increasing temperatures" is pedagogically useless if the values are wrong.

We treat graphs as a data extraction problem, not an image generation problem. Qwen3-VL-30B reads the chart and recovers precise values, axis labels, legend entries, data series names, and chart type — serialized into structured JSON. GPT-4o then converts that JSON into executable Matplotlib code, which runs on a non-interactive backend. The result is a programmatically exact reproduction with new styling. We apply colorblind-friendly palettes (Okabe-Ito) that visually differentiate the output from the source while preserving every data point.
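A minimal sketch of the JSON-to-Matplotlib step for a grouped bar chart. The `spec` shape below is a hypothetical stand-in for the structured JSON the VLM extraction step produces:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, as in the pipeline
import matplotlib.pyplot as plt

# Okabe-Ito colorblind-friendly palette
OKABE_ITO = ["#E69F00", "#56B4E9", "#009E73", "#F0E442",
             "#0072B2", "#D55E00", "#CC79A7", "#000000"]

def regenerate_bar_chart(spec, out_path):
    """spec: hypothetical extracted-chart JSON, e.g.
    {"title": ..., "x_label": ..., "y_label": ...,
     "categories": [...], "series": [{"name": ..., "values": [...]}]}
    Renders a restyled but numerically exact reproduction."""
    fig, ax = plt.subplots(figsize=(6, 4))
    n = len(spec["series"])
    width = 0.8 / n
    for i, series in enumerate(spec["series"]):
        xs = [j + i * width for j in range(len(spec["categories"]))]
        ax.bar(xs, series["values"], width=width,
               color=OKABE_ITO[i % len(OKABE_ITO)], label=series["name"])
    # Center category labels under each bar group
    ax.set_xticks([j + 0.4 - width / 2 for j in range(len(spec["categories"]))])
    ax.set_xticklabels(spec["categories"])
    ax.set_title(spec["title"])
    ax.set_xlabel(spec["x_label"])
    ax.set_ylabel(spec["y_label"])
    ax.legend()
    fig.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
```

Because the values come from the JSON rather than a generative model, every data point is preserved by construction; only the styling is new.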

Generative image replacement graph


source: copyright.coursetexts.org

04b — Generative Image Replacement

For non-chart images — photographs, illustrations, diagrams — that fail fair use, the pipeline generates a copyright-safe replacement that serves the same pedagogical function. The process has three parts: prompt extraction, image generation, and validation.

Gemini 2.5 Pro analyzes the original image and generates a descriptive text prompt, explicitly instructed to strip trademarks, logos, named individuals, and any identifiable copyrightable elements. The prompt targets the educational concept, not the specific creative expression.

Generation runs through a prioritized failover sequence: Nano-banana-pro (primary, best visual fidelity), GPT-image-1 (first fallback), DALL-E 3 (final high-reliability fallback). Aspect ratio is calculated from the source image bounding box, mapped to the closest ratio supported by the target model, and the output is resized to fit the original coordinates exactly. For institutions needing commercial clearance, a Bria AI mode replaces all generation with a model trained exclusively on licensed and public-domain images.
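The failover sequence and aspect-ratio mapping reduce to two small helpers. This is a sketch: the backend names are just labels for the priority order described above, and the supported-ratio table is illustrative:

```python
def generate_with_failover(prompt, backends):
    """backends: ordered list of (name, callable) pairs; each callable takes
    the prompt and returns image bytes, or raises on failure. Tries each
    backend in priority order and returns (name, image_bytes)."""
    errors = []
    for name, generate in backends:
        try:
            return name, generate(prompt)
        except Exception as exc:  # any backend failure triggers failover
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all backends failed: {errors}")

def closest_ratio(width, height, supported=((1, 1), (3, 2), (2, 3), (16, 9))):
    """Map the source bounding box to the nearest ratio the target model
    supports; the output is later resized to the original coordinates."""
    target = width / height
    return min(supported, key=lambda r: abs(r[0] / r[1] - target))
```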

Finally, a 4o-mini VLM judge validates each output for prompt adherence and contextual appropriateness. Failed validations escalate to GPT-image-1 for a second attempt; persistent failures pivot to an image-to-image workflow that preserves structural composition while removing copyrightable elements.

Image replacement pipeline graph


source: copyright.coursetexts.org

05 — Image Replacement

With replacement images in hand, we can now reconstruct and export the PDF. This is where the format's internal complexity surfaces again — simply drawing a new image on the page produces visible artifacts if the original image interacted with Z-order stacking, text overlays, or document-wide resource sharing.

We use a two-tier strategy. Tier 1 performs a direct XObject stream swap: the engine scans the PDF's internal resource tree to find the binary data stream corresponding to each detected bounding box and replaces it in place. This preserves the page's rendering instructions exactly — text overlays that sat on top of the original image remain on top, and Z-order is unchanged. Aspect ratio mismatches between source and replacement are resolved with letterboxing.

Tier 2 activates when an image is flattened into a vector drawing or the XObject stream is inaccessible. Rather than leaving a white rectangle, we sample pixels around the image perimeter to calculate the average surrounding color (detect_background_color), draw a matched rectangle, then center and fit the replacement within the cleared space.

The replacement engine also tracks duplicate image resources — a single XObject can be referenced on multiple pages. When the system detects that two images share a resource hash, it applies the same replacement decision to all instances.
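The perimeter-sampling idea behind detect_background_color can be sketched as follows, assuming the page has already been rendered to a PIL image; the signature and padding value are hypothetical:

```python
from PIL import Image

def detect_background_color(page_img, bbox, pad=3):
    """Average the pixels in a thin band around `bbox` (left, top, right,
    bottom, in pixels) to estimate the surrounding background color, so the
    cleared rectangle blends into the page instead of reading as white."""
    left, top, right, bottom = bbox
    px = page_img.convert("RGB").load()
    w, h = page_img.size
    samples = []
    for x in range(max(0, left - pad), min(w, right + pad)):
        for y in range(max(0, top - pad), min(h, bottom + pad)):
            inside = left <= x < right and top <= y < bottom
            if not inside:  # keep only the perimeter band
                samples.append(px[x, y])
    if not samples:
        return (255, 255, 255)  # degenerate bbox: fall back to white
    n = len(samples)
    return tuple(sum(c[i] for c in samples) // n for i in range(3))
```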

06 — Attribution Injection

Images that pass fair use, or that carry permissive licenses, need proper attribution. This is a spatial problem: the attribution text must land near its image without colliding with existing content, in a PDF that was never designed to accommodate extra text.

The system tries three strategies in order. First, it attempts proximity placement — iterating through candidate positions (below, above, left, right of the image bounding box) and validating each against two constraints: available margin width and collision detection against existing content rectangles extracted by PyMuPDF. Second, if the area around the image is too congested, the script compresses the existing page content with a scale_factor derived from the attribution text length and font size, creating a safe zone at the bottom of the page. Third, for pages with multiple images that all fail proximity placement, attributions batch into a stack at the bottom, disambiguated by short Llama-3.2-11B-generated captions that visually identify which credit belongs to which image.
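The first strategy, proximity placement with collision detection, can be sketched as a candidate loop over axis-aligned rectangles. The rectangle conventions, gap value, and function names here are hypothetical:

```python
def intersects(a, b):
    """Axis-aligned rectangle overlap; rects are (x0, y0, x1, y1)."""
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def place_attribution(img_rect, text_w, text_h, content_rects,
                      page_w, page_h, gap=4):
    """Try below / above / left / right of the image, in that order, and
    return the first candidate rect that stays on the page and collides
    with no existing content. Returns None when every position fails,
    signalling the caller to fall back to page compression or
    bottom-of-page stacking."""
    x0, y0, x1, y1 = img_rect
    candidates = [
        (x0, y1 + gap, x0 + text_w, y1 + gap + text_h),      # below
        (x0, y0 - gap - text_h, x0 + text_w, y0 - gap),      # above
        (x0 - gap - text_w, y0, x0 - gap, y0 + text_h),      # left
        (x1 + gap, y0, x1 + gap + text_w, y0 + text_h),      # right
    ]
    for rect in candidates:
        on_page = (rect[0] >= 0 and rect[1] >= 0
                   and rect[2] <= page_w and rect[3] <= page_h)
        if on_page and not any(intersects(rect, c) for c in content_rects):
            return rect
    return None
```

In the real pipeline the content rectangles come from PyMuPDF's extraction of existing page elements; here they are passed in directly.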

Marioverse image


source: copyright.coursetexts.org

Conclusion

The new pipeline runs live at copyright.coursetexts.org. Upload a PDF lecture deck and receive a compliant version in which every image is analyzed, every copyright holder is credited or the asset replaced, and the lecture is ready for distribution under an open license.

The broader infrastructure — ingestion engine, fair use judge, replacement pipeline — is being open-sourced. Watch the Coursetexts GitHub for the repository [4].

Need help contributing to open education?

If you're sitting on 50 courses you can't publish, email us: coursetexts.info@gmail.com

Join the Team

Coursetexts is a small team made up entirely of volunteers. The codebase is real, and the problem space — making the world's knowledge genuinely open — is one of the few engineering and design problems that is also a moral one. If you find this problem genuinely interesting, we want to talk to you!

[1] MIT OpenCourseWare annual operating cost (~$2.7M for 2,300+ courses) from MIT OCW fundraising pages. The cited figure is total operational cost including infrastructure, publishing, and rights clearance staffing — not a per-image clearance rate. Source: ocw.mit.edu/give. The $1,170/course/year figure is a simple division; actual per-course clearance labor varies significantly by discipline (art history >> computer science).

[2] MIT OCW's own FAQ states that course packs containing proprietary content "cannot be provided under our license." Source: mitocw.zendesk.com.

[3] 17 U.S.C. § 107 (fair use). Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994) — the decision that established transformativeness as the primary analytical lens for factor 1.

[4] Article written in collaboration with Aileen Luo.


Learn More

A group of volunteers. Coursetexts Engineering includes Eesha Ulhaq on frontend, Advikaa & Cherish on Copyright Engineering, Akshith Garapati as full-stack, and Aayush Gupta as manager.