We spent weeks trying to get a computer to read a book the way a scholar reads a book.
Not summarise it. Not extract keywords. Actually understand the structure — which commentary belongs to which verse, which footnote belongs to which claim, where one surah ends and the next begins — so that when a user asks about patience or tawakkul or the nature of the Day of Judgement, the right passage comes back, attributed correctly, in full.
This is the story of how that went.
The Source Material
Maududi's Towards Understanding the Quran is one of the most comprehensive English-language works of tafsir ever written. 30 volumes. 114 surahs. Thousands of footnotes. A lifetime of scholarship compressed into a single PDF.
The PDF was our raw material. Every phase of the pipeline had to treat it with the care it deserved — because in religious scholarship, a misattributed verse isn't a typo. It's a corruption of meaning.
That principle drove every engineering decision we made.
The PDF Would Not Cooperate
PDF extraction sounds like a solved problem. It is not.
Maududi's tafsir has a specific visual structure: verse translations appear in parenthesised markers like (2:93), followed by numbered commentary paragraphs. We wrote regex patterns to detect these boundaries and split the full text into individual verse blocks.
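A minimal sketch of what that boundary detection looks like, assuming the marker format is exactly (surah:verse); the real patterns handle more edge cases than this:

```python
import re

# Illustrative sketch: detect parenthesised verse markers like "(2:93)" and
# split a surah's raw text into per-verse blocks.
VERSE_MARKER = re.compile(r"\((\d{1,3}):(\d{1,3})\)")

def split_into_verse_blocks(surah_text: str) -> list[dict]:
    matches = list(VERSE_MARKER.finditer(surah_text))
    blocks = []
    for i, match in enumerate(matches):
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(surah_text)
        blocks.append({
            "surah": int(match.group(1)),
            "verse": int(match.group(2)),
            "raw_commentary": surah_text[start:end].strip(),
        })
    return blocks
```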
The first runs looked promising. Then we dug into the output.
386 verse blocks had truncated commentary. Not obviously truncated — the text just ended mid-sentence at surah boundaries. What was happening: when pdfplumber extracted text page by page, the last few verses of a surah would sometimes bleed into the start of the next surah's raw text block, before that surah's own header appeared.
Our parser was throwing that text away.
We added a spillover recovery pass. After splitting into surah blocks, we now look at the start of each block, extract any text that precedes that surah's own header, and attach it to the previous block as last_verse_suffix. Then, when parsing the last verse of a surah, we append the recovered suffix before cleaning.
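Roughly, the recovery pass works like this. The surah header pattern here is an assumption about what the extracted text looks like; the mechanism is what matters:

```python
import re

# Assumed header shape for illustration only.
SURAH_HEADER = re.compile(r"SURAH\s+\d+", re.IGNORECASE)

def recover_spillover(surah_blocks: list[dict]) -> None:
    for prev, curr in zip(surah_blocks, surah_blocks[1:]):
        header = SURAH_HEADER.search(curr["raw_text"])
        if header and header.start() > 0:
            # Text before this block's own header spilled over from the
            # previous surah: attach it there, then trim it here.
            prev["last_verse_suffix"] = curr["raw_text"][:header.start()].strip()
            curr["raw_text"] = curr["raw_text"][header.start():]
```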
That brought 386 truncated verses down to 18. The remaining 18 are genuinely short passages where Maududi wrote no further commentary.
Then came the Arabic
Maududi's PDF includes original Arabic text inline with the English commentary. We needed to strip it — we'd be sourcing clean Arabic separately from the quran.com API. But Arabic removal left artefacts: \n \n \n whitespace sequences where the Arabic had been. These weren't empty lines. They were lines containing only spaces. Standard newline normalisation didn't catch them.
The fix was a two-pass cleaning strategy. First pass strips the Arabic and cleans obvious junk. Second pass catches the whitespace artefacts that the first pass exposed. Both passes run on every verse's commentary and every surah's intro sections.
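A sketch of the two-pass idea, with the Unicode ranges and collapsing rules as stand-ins for the real cleaner:

```python
import re

# Pass one strips inline Arabic script; pass two removes the lines of bare
# spaces that pass one leaves behind, which plain newline normalisation missed.
ARABIC_SCRIPT = re.compile(r"[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF]+")
SPACE_ONLY_LINES = re.compile(r"^[ \t]+$", re.MULTILINE)

def clean_commentary(text: str) -> str:
    text = ARABIC_SCRIPT.sub("", text)        # pass 1: remove Arabic
    text = SPACE_ONLY_LINES.sub("", text)     # pass 2: lines containing only spaces
    text = re.sub(r"\n{3,}", "\n\n", text)    # collapse the resulting blank runs
    return text.strip()
```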
Then the footnote superscripts
Maududi's text uses inline superscript numbers to link claims in the commentary to their footnotes. Our cleaner used a regex lookbehind to identify and strip these — but the lookbehind only matched superscripts following letters, digits, closing brackets, and basic punctuation. We were missing 288 superscripts that appeared after question marks, curly quotes, colons, and en-dashes.
Each missed superscript left a stray number in the cleaned commentary. We extended the lookbehind to cover the full character set we found in the corpus. Then we discovered a handful of HTML <sup> tags in the source PDF as well, and handled those too.
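Something like this, where the exact character class is illustrative rather than the one we shipped:

```python
import re

# A footnote superscript survives extraction as a bare number directly after
# the character it annotates. The widened class adds question marks, curly
# quotes, colons and en-dashes to the letters, digits, brackets and basic
# punctuation the original pattern covered.
SUPERSCRIPT = re.compile(r"(?<=[A-Za-z0-9)\].,;:!?\u2019\u201d\u2013])\d{1,3}\b")
SUP_TAG = re.compile(r"<sup>\s*\d*\s*</sup>")  # stray HTML tags from the source

def strip_superscripts(text: str) -> str:
    text = SUP_TAG.sub("", text)
    return SUPERSCRIPT.sub("", text)
```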
By the end of this phase: 6,235 verse blocks, all 114 surahs, zero gaps, zero truncations that weren't genuinely empty.
The Model Nobody Saw Coming
We nearly skipped the data modelling phase entirely.
Defining the canonical shape that every piece of data would take as it flowed through the rest of the pipeline felt like documentation work. We were impatient to get to the Arabic enrichment.
That was the wrong instinct.
Without a canonical model, field names drift. One part of the pipeline calls it commentary, another calls it commentary_clean. One uses verse_start, another uses verse_number_start. You don't notice until the upload script fails at 2am because the key it's looking for doesn't exist.
We stopped. We defined QuranChunk as a Pydantic model with validators, specified the exact field names that would flow through every subsequent phase, and documented the contract between the parsed data, the enriched data, and the upload-ready data.
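A stripped-down sketch of what such a model looks like in Pydantic (v2 style). Only the field names already mentioned here are taken from the real contract; the rest are assumptions for illustration:

```python
from pydantic import BaseModel, field_validator

class QuranChunk(BaseModel):
    surah_number: int
    verse_number_start: int
    verse_number_end: int
    translation: str
    commentary: str
    footnotes: dict[int, str] = {}
    arabic_uthmani: str | None = None
    arabic_words: list[dict] | None = None

    @field_validator("surah_number")
    @classmethod
    def surah_in_range(cls, v: int) -> int:
        # The validator is where drift gets caught early, not at 2am.
        if not 1 <= v <= 114:
            raise ValueError("surah_number must be between 1 and 114")
        return v
```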
The audit we ran before the final phase found zero field mismatches between parsed and enriched data. That was because the data model existed.
We Had to Stop the Pipeline Mid-Run
Arabic enrichment fetches the original Uthmani script and word-level data from the quran.com API for every verse in the corpus. It's a long-running process — about 90 minutes at 0.6 seconds per verse to stay under the rate limit.
We started it. Then re-read the engineering spec.
The implementation we'd written was wrong. We'd used a different field structure, a shorter delay, and a different joining format for multi-verse ranges. The spec called for arabic_uthmani (verses joined with the end-of-verse marker) and arabic_words (a flat list with global position offsets and verse numbers). We had something close but not identical.
Close enough to run. Not close enough to be correct.
We stopped the process. Rewrote quran_api.py from scratch to match the spec exactly. Restarted. 6,235 verses enriched, 0 failures.
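For flavour, a sketch of the joining logic the spec describes. Here fetch_verse stands in for the quran.com client, and using the end-of-ayah character as the joiner is our assumption, not a documented API detail:

```python
import time

END_OF_VERSE = "\u06dd"  # ARABIC END OF AYAH — assumed joiner for multi-verse ranges

def enrich_range(verse_keys: list[str], fetch_verse) -> dict:
    """Build arabic_uthmani (joined verses) and arabic_words (flat list with
    global position offsets and verse keys) for a range of verses."""
    uthmani_parts, words, position = [], [], 0
    for key in verse_keys:
        verse = fetch_verse(key)                  # one API call per verse
        uthmani_parts.append(verse["text_uthmani"])
        for word in verse["words"]:
            words.append({
                "position": position,             # global offset across the range
                "verse_key": key,
                "text": word["text"],
            })
            position += 1
        time.sleep(0.6)                           # stay under the rate limit
    return {
        "arabic_uthmani": f" {END_OF_VERSE} ".join(uthmani_parts),
        "arabic_words": words,
    }
```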
The lesson wasn't about the specific fields. It was about the discipline of reading the spec before running, not after.
Two Embeddings Per Chunk
Each verse block gets embedded twice using OpenAI's text-embedding-3-large.
The primary embedding contains the full context: surah name, period, verse translation, and structured commentary. This is what drives general-purpose retrieval at lesson generation time.
The commentary embedding contains only Maududi's notes. This drives thematic search — when the lesson generator needs to find all the places Maududi discusses a concept like tawakkul or the nature of divine justice, the commentary vectors give a cleaner signal than the primary ones.
Where structured footnotes were available, we built the embedding text from them rather than from the raw commentary — "Note 4: This is the second prerequisite..." rather than "4. This is the second prerequisite...". A cleaner signal for the model.
We also added checkpoint recovery. Embedding 6,235 chunks twice, at 0.5 seconds per chunk, takes a couple of hours and costs around $0.43. An interruption mid-run without a checkpoint would mean re-spending that time and money from scratch. Every 50 chunks, the pipeline saves its progress. A restart picks up from where it left off.
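A sketch of that checkpoint loop, assuming hypothetical chunk fields (id, primary_text, commentary_text) and a local JSON file as the checkpoint store:

```python
import json
import os
import time

from openai import OpenAI

client = OpenAI()
CHECKPOINT = "embedding_checkpoint.json"  # hypothetical filename

def save_checkpoint(done: dict) -> None:
    with open(CHECKPOINT, "w") as f:
        json.dump(done, f)

def embed_all(chunks: list[dict]) -> dict:
    done = {}
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            done = json.load(f)                  # resume from the last checkpoint

    for i, chunk in enumerate(chunks, start=1):
        key = str(chunk["id"])                   # JSON keys are strings; stay consistent
        if key in done:
            continue                             # already embedded in a previous run
        resp = client.embeddings.create(
            model="text-embedding-3-large",
            input=[chunk["primary_text"], chunk["commentary_text"]],
        )
        done[key] = {
            "primary": resp.data[0].embedding,
            "commentary": resp.data[1].embedding,
        }
        if i % 50 == 0:                          # checkpoint every 50 chunks
            save_checkpoint(done)
        time.sleep(0.5)

    save_checkpoint(done)
    return done
```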
A final audit flagged one real bug: footnote keys are integers in memory but become strings when serialised to JSON. Our sort was lexicographic, not numeric. ["1", "10", "2"] instead of [1, 2, 10]. For verses with 10 or more footnotes, notes would embed in the wrong order.
We fixed the sort key and checked the checkpoint. Zero of the already-embedded chunks had footnotes numbered above 9. The bug had affected nothing. But it would have.
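The shape of the fix, in miniature:

```python
footnotes = {"1": "…", "10": "…", "2": "…"}  # keys become strings after a JSON round-trip

sorted(footnotes)            # ['1', '10', '2']  — lexicographic, wrong past note 9
sorted(footnotes, key=int)   # ['1', '2', '10']  — numeric, the fix
```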
What Scriptural Integrity Actually Costs
It costs slowness.
Every regex pattern in the parser was tested against the full 6,235-verse corpus before we moved on. Every cleaning pass was verified against sample outputs. When the enrichment returned unexpected results, we stopped and read the spec instead of shipping and hoping.
That kind of care doesn't feel like engineering progress. The line count isn't going up. No new features are landing. But the output is a dataset where every verse is correctly attributed, every piece of commentary is correctly assigned to its verse, and every Arabic word is correctly sourced from a verified API rather than hallucinated.
For a general-purpose app, a 0.1% error rate is acceptable noise. For a Quran study app, it is not. A user who learns a misattributed verse — who builds their understanding of Islam on a corrupted passage — hasn't been helped by the technology. They've been misled by it.
That's the line we were drawing. Every iteration in this pipeline was us redrawing it more carefully.
Where It Goes From Here
The embedded data is now in Supabase. The retrieval pipeline follows: four layers of search that combine deterministic lookup, semantic similarity, cross-reference following, and continuity matching. Then the guardrails layer, which audits every generated lesson against the source material before it's stored.
Then 365 lessons. Pre-generated. Never regenerated at runtime. Zero AI cost per user session after launch.
We wrote about that process in Part 2: How We Generated 365 Lessons From a Tafsir.