This is Part 2. Part 1 covered how we parsed Maududi's tafsir into 6,235 verse blocks. This part covers what happened when we tried to teach from them.
The Retrieval Problem
A daily Quran study lesson needs context. Not just the verses assigned to that day, but the scholarly commentary that makes those verses comprehensible. A passage about patience might reference a passage about gratitude three surahs earlier. A legal ruling in Al-Baqarah might get its fullest explanation in a footnote attached to An-Nisa.
Feeding the model a flat dump of today's verses and their commentary would produce lessons. They would be shallow.
We built a four-layer retrieval system.
Layer 1: Deterministic lookup. Every lesson maps to a fixed range of verses. Day 1 covers Al-Fatihah 1:1-7. Day 47 covers Al-Baqarah 2:261-271. The primary verses and their commentary are retrieved directly from the database by verse range. No search involved. No ambiguity.
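In code, Layer 1 is a single range query. A minimal sketch, assuming a Postgres connection (Supabase is Postgres under the hood) and a hypothetical `verse_blocks` table; the real schema names may differ:

```python
def fetch_primary_context(cursor, surah: int, start: int, end: int) -> list:
    """Layer 1: deterministic retrieval by verse range. No search, no ranking."""
    cursor.execute(
        """
        SELECT surah, verse, verse_text, commentary
        FROM verse_blocks
        WHERE surah = %s AND verse BETWEEN %s AND %s
        ORDER BY verse
        """,
        (surah, start, end),
    )
    return cursor.fetchall()

# Day 47 covers Al-Baqarah 2:261-271:
# rows = fetch_primary_context(cur, surah=2, start=261, end=271)
```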
Layer 2: Semantic search with reciprocal rank fusion. Each lesson gets a thematic query derived from the day's content. That query runs against both the primary embeddings and the commentary-only embeddings we built in Phase 5. Two ranked lists come back. We fuse them with RRF, which gives disproportionate weight to chunks that rank highly in both lists rather than averaging their positions.
The result: when a lesson covers verses about divine testing, the semantic layer surfaces Maududi's commentary on Ayyub from elsewhere in the corpus, even though that passage shares no verse numbers with the day's assignment.
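RRF itself is only a few lines. A sketch; `k=60` is the conventional smoothing constant from the original RRF paper, not a value this post specifies:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of chunk IDs. Each chunk scores sum(1 / (k + rank))
    over the lists it appears in, so a chunk ranked highly in BOTH lists
    beats a chunk ranked highly in only one."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = rrf_fuse([primary_hits, commentary_only_hits])
```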
Layer 3: Cross-reference following. Maududi frequently references other parts of the Quran in his commentary. "See also the discussion in Surah Al-Anfal" or "This point is elaborated in Note 14 of Surah Al-A'raf." We extract these references and retrieve the target passages.
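The extraction step is pattern matching. A sketch built from the two reference styles quoted above; the real extractor presumably handles more variants:

```python
import re

# Patterns inferred from the quoted examples: "Surah Al-Anfal" and
# "Note 14 of Surah Al-A'raf".
SURAH_REF = re.compile(r"Surah\s+([A-Z][\w'\-]+(?:\s+[A-Z][\w'\-]+)*)")
NOTE_REF = re.compile(r"Note\s+(\d+)\s+of\s+Surah\s+([A-Z][\w'\-]+)")

def extract_cross_references(commentary: str) -> list[tuple]:
    """Pull surah and footnote references out of a commentary block
    so the target passages can be retrieved."""
    notes = [("note", int(m.group(1)), m.group(2))
             for m in NOTE_REF.finditer(commentary)]
    surahs = [("surah", m.group(1)) for m in SURAH_REF.finditer(commentary)]
    return notes + surahs  # dedup of overlapping matches is elided here
```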
Layer 4: Continuity matching. If yesterday's lesson ended mid-discussion, today's lesson needs to know where the discussion left off. This layer retrieves the tail end of the previous day's source material so the generation model can maintain narrative coherence across lessons.
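Mechanically this is the simplest layer. A sketch, assuming a hypothetical store of each day's assembled source text; the 2,000-character tail is an illustrative size, not the real one:

```python
def continuity_context(day: int, source_store: dict[int, str],
                       tail_chars: int = 2000) -> str:
    """Layer 4: return the tail of the previous day's source material so the
    generator knows where the discussion left off. Day 1 has no predecessor."""
    if day <= 1:
        return ""
    return source_store[day - 1][-tail_chars:]
```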
Four layers. Deterministic, semantic, cross-referential, and temporal. Each one exists because the others aren't sufficient alone.
The Generation Run
With retrieval working, we generated lessons.
Each lesson follows a fixed structure: a title, a reflection section grounded in the day's verses, a "connecting to yesterday" bridge, key themes, and a scholarly note drawn from Maududi's footnotes. The model receives the retrieved context and a structured prompt that constrains what it can and cannot do.
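That fixed structure translates directly into the shape the model must return. A sketch; the field names are illustrative, not the production schema:

```python
from dataclasses import dataclass

@dataclass
class Lesson:
    title: str
    reflection: str               # grounded in the day's verses
    connecting_to_yesterday: str  # the continuity bridge
    key_themes: list[str]
    scholarly_note: str           # drawn from Maududi's footnotes
```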
The constraint that matters most: every factual claim must be traceable to the source material provided. The model is not allowed to introduce Islamic rulings, historical claims, or interpretive positions that aren't present in Maududi's commentary for the verses being studied.
The first full generation run used Claude Sonnet. 365 lessons. It took about 12 hours. Most came back looking good. Then we ran the grounding audit.
The Grounding Audit
A grounding audit takes every generated lesson and checks each factual claim against the source material that was provided to the model during generation. The auditor is a separate model instance with a different prompt. Its job is adversarial: find claims that cannot be verified from the sources.
Each claim gets one of three labels: grounded (traceable to the source), ungrounded (not present in the source material), or ambiguous. The ratio of grounded claims to total claims produces a grounding score between 0 and 1.
We set the threshold at 0.85. Any lesson scoring below that gets flagged for review or regeneration.
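Scoring is then arithmetic. A sketch; counting ambiguous claims toward the denominator is our assumption about the details, as is passing a lesson with no checkable claims:

```python
PASS_THRESHOLD = 0.85

def grounding_score(labels: list[str]) -> float:
    """labels: 'grounded', 'ungrounded', or 'ambiguous', one per claim.
    Grounded claims over total claims, so ambiguous claims drag the
    score down without counting as outright failures."""
    if not labels:
        return 1.0  # assumption: no checkable claims means nothing to fail
    return labels.count("grounded") / len(labels)

def flag_for_review(labels: list[str]) -> bool:
    return grounding_score(labels) < PASS_THRESHOLD
```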
The first audit run returned results we didn't expect.
The source truncation problem. Our auditor had a 12,000-character limit on how much source material it could evaluate per lesson. For days covering long surahs with extensive commentary, the source was being silently cut off. The auditor was correctly flagging claims as ungrounded, because the source text backing those claims had been truncated before the auditor ever saw it.
The lessons were fine. The audit was wrong. We increased the source limit to 30,000 characters.
The continuity penalty. Every lesson has a "connecting to yesterday" section. The auditor was checking this section against today's source material. But the claims in that section reference yesterday's content, not today's. The auditor would correctly report "this claim about yesterday's verses is not present in today's sources" and dock the score.
We excluded the continuity section from the audit scope. It's a design feature, not a factual claim.
After both fixes, the clean pass rate jumped from around 70% to over 85%.
The Feedback Loop
Some lessons still failed. Day 127 scored 0.49. Day 40 scored 0.72. The auditor would flag specific claims with specific reasons: "The source material does not mention that this verse was revealed during the Battle of Badr" or "The characterisation of this ruling as 'the primary obligation' is not supported by the commentary provided."
We could have just regenerated those lessons and hoped for better output. Instead, we built a feedback loop.
When a lesson fails the grounding audit, the specific flags are extracted. On the next generation attempt, those flags are injected into the prompt. The model sees exactly which claims were rejected and why. The instruction is explicit: do not repeat these claims unless you can ground them more carefully in the source text.
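A sketch of the injection step; the flag fields and prompt wording are illustrative:

```python
def build_retry_prompt(base_prompt: str, flags: list[dict]) -> str:
    """Fold the auditor's rejections into the next generation attempt.
    Each flag carries the rejected claim and the auditor's reason."""
    if not flags:
        return base_prompt
    lines = [
        "The following claims were rejected by the grounding audit.",
        "Do not repeat them unless you can ground them more carefully",
        "in the source text provided:",
    ]
    for f in flags:
        lines.append(f"- CLAIM: {f['claim']}")
        lines.append(f"  REASON: {f['reason']}")
    return base_prompt + "\n\n" + "\n".join(lines)
```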
Day 127 went from 0.49 to 1.00 on the second attempt. The model had been paraphrasing loosely on the first try. When told which paraphrases were rejected, it stuck closer to the source language.
This pattern held across most regenerations. The feedback loop wasn't just a retry mechanism. It was a teaching signal.
Two Models, One Corpus
Not every lesson cooperated.
14 lessons failed to produce valid JSON on the first run. The output would be cut short or malformed. We traced this to a token limit that was too conservative for days with dense commentary. After increasing the generation ceiling from 2,048 to 4,096 tokens and improving the JSON parser to handle edge cases in markdown fencing, most of these resolved.
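The fencing fix is mundane but load-bearing: models wrap JSON in markdown fences, sometimes with prose around them. A sketch of a tolerant parser along the lines of what we shipped:

```python
import json
import re

# Matches a fenced block, with or without a "json" language tag.
FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)\s*`{3}", re.DOTALL)

def parse_lesson_json(raw: str) -> dict:
    """Accept bare JSON, fenced JSON, or JSON embedded in surrounding prose."""
    match = FENCE.search(raw)
    candidate = match.group(1) if match else raw
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Fall back to the outermost brace pair for prose-wrapped output.
        start, end = candidate.find("{"), candidate.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(candidate[start:end + 1])
```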
For the remaining failures, we switched to OpenAI's GPT-5.4.
This wasn't a philosophical choice. Claude Sonnet and GPT-5.4 produce comparable quality on this task. Both average grounding scores of around 0.94 across the full corpus. The difference is in their failure modes. Sonnet sticks closer to Maududi's original wording, which makes it easier to pass the grounding audit on the first try. It had an 86% clean pass rate. GPT-5.4 paraphrases more aggressively, which produces more natural-sounding prose but triggers more audit flags. It had a 64% clean pass rate.
We used Sonnet as the primary generator and GPT-5.4 as the fallback for lessons that failed JSON parsing or hit API errors on the Anthropic side. The grounding audit doesn't care which model produced the lesson. It applies the same standard to all of them.
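The routing logic is a plain try-then-fall-back chain, reusing parse_lesson_json from above. A sketch; the client wrappers are hypothetical stand-ins for the two SDK calls:

```python
import json

def call_sonnet(day: int, context: str) -> str:
    """Hypothetical wrapper around the Anthropic API; returns raw model text."""
    raise NotImplementedError

def call_gpt(day: int, context: str) -> str:
    """Hypothetical wrapper around the OpenAI API."""
    raise NotImplementedError

def generate_lesson(day: int, context: str) -> dict:
    """Primary generator with fallback. The grounding audit runs afterwards
    regardless of which model produced the lesson."""
    try:
        return parse_lesson_json(call_sonnet(day, context))
    except (json.JSONDecodeError, ConnectionError):
        # The real code catches the Anthropic SDK's error types here.
        return parse_lesson_json(call_gpt(day, context))
```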
The One That Wouldn't Pass
Day 304 covers Surah At-Talaq. Maududi's commentary on this surah deals with Islamic divorce law. The subject is inherently sensitive and the commentary is unusually precise in its legal language.
The auditor rejected every paraphrase. Not because the lesson was wrong, but because the auditor's standard for "grounded in the source" became impossibly strict when the source material contained legal rulings. Any rewording of a legal statement was flagged as a departure from the source.
We ran Day 304 through the feedback loop three times. Each attempt scored below 0.85. The lesson content was accurate. The audit calibration was the problem.
We accepted it. 364 out of 365 lessons pass the grounding audit at 0.85 or above. One lesson about divorce law doesn't. The content is reviewed and correct. The audit is overly strict on legal paraphrases. That's a known limitation, not a quality issue.
What 365 Lessons Look Like
The final corpus: 365 lessons covering the entire Quran. Each lesson is pre-generated. No AI runs at serving time. A user opening Day 1 gets a lesson that was generated, audited, reviewed, and stored in Supabase before the app shipped.
The generation cost for the full corpus was under $50. The grounding audit cost roughly the same. Total infrastructure cost to produce 365 lessons of scholar-grounded Quran study content: under $100.
That number matters because it means the product can be free. There is no per-user inference cost. No token metering. No subscription needed to offset API spend. The entire curriculum exists as static content in a database, retrieved and displayed.
The Line We Keep Drawing
Part 1 ended with a statement about misattributed verses. The same principle carried through here, but the shape of the problem changed.
In parsing, the risk was structural: a verse assigned to the wrong surah, a footnote attached to the wrong claim. Those are engineering bugs with engineering fixes.
In generation, the risk is interpretive: a model introducing a theological position that Maududi didn't hold, or flattening a nuanced legal ruling into a simple statement. These failures look correct to someone who doesn't know the source material. They read well. They feel authoritative. That's what makes them dangerous.
The grounding audit exists because the generation model cannot be trusted to stay within bounds on its own. The feedback loop exists because a single audit failure should produce a better attempt, not just a different one. The multi-model approach exists because reliability at scale requires redundancy.
Every layer in this system is a different way of asking the same question: can we trace this claim back to what Maududi actually wrote?
365 times, the answer is yes. That's the product.