Quran corpus — research & build plan

GSD phase gate: There is no .gsd/ directory in this repo yet, so gsd headless query exits with No .gsd/ directory found. Run gsd once from the repo root to initialize, then use your template’s EXPECTED_PHASE checks. Until then, structured scope → plan → build phases are documented here only.

Hypothesis (cycle): Finishing the 114 surah fetch plus keeping surah-hashes.json + Quartz paths stable removes the largest blockers for Ayah embeds, Atlas extraction, and publish.

This note is the master plan for turning the vault’s Quranic layer into a complete, reproducible, searchable, and publishable corpus. It uses Obsidian wikilinks to Surahs, Atlas, Ayah, Juz, scripts under .dev/scripts/, and related notes so you can navigate the graph and hand phases to agents without re-explaining context.


Current inventory (what already exists)

| Layer           | Role                                                     | Where                                                                      |
|-----------------|----------------------------------------------------------|----------------------------------------------------------------------------|
| API client      | Shared httpx + retries for Quran.com v4                  | .dev/scripts/quran_api.py (used by fetch + generators)                     |
| Surah fetch     | Arabic + English + OpenFurqan links + ayah_header_lines  | .dev/scripts/fetch_quran.py → Surahs folder                                |
| Hash cache      | Per-file SHA-256 + fetch options to skip API             | surah-hashes.json                                                          |
| Ayah line index | CLI to refresh / extract ayah by line                    | .dev/scripts/quran_surah_index.py · .dev/scripts/quran_surah_lines.py      |
| Ayah notes      | 6,236 stubs embedding ### Ayah n from surah files        | Ayah index                                                                 |
| Juz pages       | 30 parts, API verse_mapping, embeds Ayah notes           | Juz index                                                                  |
| Atlas           | Divine names, people, places, books (surahs)             | Quran Atlas                                                                |
| Overview        | OpenFurqan, mushaf order, categories                     | Surahs (vault note)                                                        |

Gap (highest leverage): the 114 surahs are not all present as files yet—only a subset is fetched. Downstream embeds, entity extraction, and published HTML all depend on complete or explicitly scoped surah text first.


Target pipeline (end state)

flowchart LR
  F[Fetch surahs API] --> O[Organize paths + FM]
  O --> H[Hash + index JSON]
  O --> L[Ayah line index in FM]
  O --> A[Atlas entity notes]
  A --> C[Categorize + tag]
  C --> P[Publish Quartz]
  P --> V[Browser eval]

Each stage below lists inputs, outputs, tools, and Definition of Done (observable).


Phase A — Fetch (complete text)

Goal: Every surah 1…114 exists as Graphe/Quran/Surahs/Surah NNN - Name.md with consistent frontmatter and ayah_header_lines.

  • Inputs: translation_id, arabic_field, cache policy (see fetch script).
  • Outputs: 114 markdown files; updated surah-hashes.json.
  • Commands: uv run .dev/scripts/fetch_quran.py -f (or staged batches to respect API limits).
  • DoD: find Graphe/Quran/Surahs -name 'Surah *.md' | wc -l → 114; random spot-check of ayah_count vs API.
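The file count in the DoD says how many surahs exist but not which ones are missing. A minimal sketch of a per-number check, assuming filenames follow the `Surah NNN - Name.md` convention from the goal above; `missing_surahs` and `missing_in_dir` are hypothetical helpers, not functions from the repo scripts.

```python
import re
from pathlib import Path

# Matches e.g. "Surah 002 - Al-Baqarah.md" and captures the 3-digit number.
SURAH_FILE = re.compile(r"^Surah (\d{3}) - .+\.md$")


def missing_surahs(filenames, total=114):
    """Return the sorted surah numbers in 1..total that have no matching filename."""
    present = set()
    for name in filenames:
        match = SURAH_FILE.match(name)
        if match:
            present.add(int(match.group(1)))
    return [n for n in range(1, total + 1) if n not in present]


def missing_in_dir(surah_dir="Graphe/Quran/Surahs"):
    """Scan a directory on disk (default is the path named in this plan)."""
    return missing_surahs(p.name for p in Path(surah_dir).glob("Surah *.md"))
```

An empty list from `missing_in_dir()` corresponds to the "114 files" condition; a non-empty list names the surahs to pass to a staged fetch batch.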

Wikilinks: Surahs folder note · Surahs overview


Phase B — Organize (stable layout & naming)

Goal: One canonical tree; no duplicate “surah” stories.

  • Convention: Surah NNN - {name_simple}.md only under Surahs; [[Graphe/Quran/Ayah/Ayah|Ayah]] / Juz names stay Ayah SSS-AAA / Juz JJ.
  • Regenerate: uv run .dev/scripts/generate_quran_juz_ayah.py after any rename (uses quran_api + /chapters + /juzs).
  • DoD: No broken ![[Graphe/Quran/Surahs/...#ayah-n|Ayah n]] embeds in a sample of Ayah notes across all juz ranges.
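The embed check in the DoD can be approximated without opening Obsidian. A minimal sketch, assuming embeds use full vault paths without the `.md` extension (as in the DoD example) and that the caller supplies the set of existing note paths; `broken_embeds` is a hypothetical helper, and it only verifies that the target note exists, not that the `#ayah-n` heading is present.

```python
import re

# ![[target#heading|alias]] where heading and alias are optional.
EMBED = re.compile(r"!\[\[([^\]#|]+)(?:#([^\]|]+))?(?:\|[^\]]*)?\]\]")


def broken_embeds(note_text, existing_notes):
    """Return embed targets in note_text that are not in existing_notes.

    existing_notes: set of vault-relative note paths without ".md" (assumed layout).
    """
    broken = []
    for match in EMBED.finditer(note_text):
        target = match.group(1).strip()
        if target not in existing_notes:
            broken.append(target)
    return broken
```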

Phase C — Hash & integrity

Goal: Reproducible “what changed” for CI and agents.

  • Existing: surah-hashes.json entries (path, surah, sha256, translation options).
  • Extensions: optional global manifest (single JSON listing all surah hashes + generator versions) for diff in PRs.
  • DoD: Re-run fetch with no API change → no file write (hash unchanged); intentional edit → hash flips.
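The DoD behaviour (no write when content is unchanged, hash flip on an intentional edit) can be sketched as a hash-gated write. This is an illustration only: `write_if_changed` is a hypothetical helper, and the actual fetch script may compare against surah-hashes.json rather than re-hashing the file on disk.

```python
import hashlib
from pathlib import Path


def sha256_bytes(data: bytes) -> str:
    """Hex SHA-256 digest of raw bytes."""
    return hashlib.sha256(data).hexdigest()


def write_if_changed(path: Path, new_text: str) -> bool:
    """Write new_text only when its hash differs from the file on disk.

    Returns True when a write happened, False when the file was left alone.
    """
    new_bytes = new_text.encode("utf-8")
    if path.exists() and sha256_bytes(path.read_bytes()) == sha256_bytes(new_bytes):
        return False
    # Write bytes directly so the on-disk hash matches the computed one exactly.
    path.write_bytes(new_bytes)
    return True
```

Skipping the write also keeps file modification times stable, which avoids spurious diffs in tools that key on mtime.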

Wikilink: surah-hashes.json


Phase D — Index (machine + human)

Goal: Fast random access without loading huge files.

  • Per-surah FM: ayah_header_lines (line of each ### Ayah n) — maintained by fetch + quran_surah_index.py index.
  • Optional: byte-offset index in a sidecar if line-scan cost becomes an issue (future).
  • DoD: uv run .dev/scripts/quran_surah_index.py extract -f "…/Surah 002 - Al-Baqarah.md" -a 7 prints correct block on a fully fetched Baqarah.
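The line index makes extraction a slice rather than a scan. A minimal sketch of the idea, assuming `ayah_header_lines` is a list of 1-based line numbers in ayah order (the real frontmatter layout may differ); `extract_ayah` is a hypothetical function, not the CLI's implementation.

```python
def extract_ayah(lines, header_lines, ayah):
    """Return the text block for one ayah.

    lines: file content split into lines
    header_lines: 1-based line number of each "### Ayah n" header (assumed layout)
    ayah: 1-based ayah number
    """
    if not 1 <= ayah <= len(header_lines):
        raise ValueError(f"ayah {ayah} out of range 1..{len(header_lines)}")
    start = header_lines[ayah - 1] - 1  # convert to 0-based index
    # The block ends where the next header starts, or at end of file.
    end = header_lines[ayah] - 1 if ayah < len(header_lines) else len(lines)
    return "\n".join(lines[start:end]).rstrip()
```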

Wikilinks: Atlas (tooling section) references the same index idea.


Phase E — Atlas entity extraction

Goal: Atlas entity notes are populated from corpus-wide extraction with a balanced quality gate.

Implemented workflow (full corpus):

  1. Ontology lock - Atlas extraction now scans four families: Divine Names, People, Places, Books (scriptural books, not surah files).
  2. Candidate generation - run full scan over all 114 surahs:
uv run .dev/scripts/quran_entity_pipeline.py --all-surahs --write-sidecars --write-reports
  3. Confidence queue - emit summary + review queue:
uv run .dev/scripts/quran_entity_pipeline.py --all-surahs --write-summary --write-review-queue
  4. Balanced write-back - apply only high-confidence hits to Atlas notes via idempotent auto blocks:
uv run .dev/scripts/quran_entity_pipeline.py --all-surahs --apply-high
  5. Validation + regression - sidecar schema/path/ayah checks plus Surah 1 baseline comparison:
uv run .dev/scripts/quran_entity_pipeline.py --all-surahs --write-sidecars --validate
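The write-back step relies on two ideas: filter to high-confidence hits, then replace a delimited block so re-runs do not duplicate content. A minimal sketch under stated assumptions: the marker strings, the `confidence` field, and both function names are hypothetical and not taken from quran_entity_pipeline.py.

```python
import re

START = "<!-- auto:start -->"  # hypothetical marker, not the pipeline's real one
END = "<!-- auto:end -->"      # hypothetical marker
AUTO_BLOCK = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)


def high_confidence(hits):
    """Keep only hits labelled "high" (assumed field name and label)."""
    return [h for h in hits if h.get("confidence") == "high"]


def apply_auto_block(note: str, body: str) -> str:
    """Insert or replace the auto block; applying the same body twice is a no-op."""
    block = f"{START}\n{body}\n{END}"
    if AUTO_BLOCK.search(note):
        # Use a function replacement so backslashes in body are not interpreted.
        return AUTO_BLOCK.sub(lambda _m: block, note, count=1)
    sep = "" if note.endswith("\n") or not note else "\n"
    return f"{note}{sep}\n{block}\n"
```

Text outside the markers is never touched, so manual edits to an Atlas note survive repeated runs.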

Artifacts produced:

qmd-assisted review (semantic helper):

# Ensure qmd has the Quran collection once
qmd collection add "/Users/rmac/repos/GrapheLogos" --name graphelogos-quran --mask "Graphe/Quran/**/*.md"
 
# Build evidence for queued medium/low matches
uv run .dev/scripts/quran_entity_qmd_evidence.py --collection graphelogos-quran --mode search

Legacy pilot remains: uv run .dev/scripts/quran_entity_pilot.py -s 1 --write-report --write-sidecar (single-surah check).

Wikilinks: Divine Names · People · Places · Books


Phase F — Categorize & tag (corpus semantics)

Goal: Filter by Meccan/Medinan, theme, juz, hizb (optional), without duplicating the mushaf.

Implemented (surah-level): uv run .dev/scripts/quran_surah_metadata.py --write enriched all 114 surah frontmatter files with:

  • revelation_place: Meccan | Medinan (from Quran.com /api/v4/chapters)
  • revelation_order: <int> (chronological order 1–114; Surah 096 = 1, first revealed)

Script is idempotent; re-run is safe.

  • Sources: external datasets (Quran.com metadata, academic tables) or manual YAML in Graphe/Quran/meta/ (proposed).
  • Per-ayah: extend Ayah frontmatter with optional topics: [] once extraction is trusted.
  • DoD: Query (e.g. Dataview or rg) returns consistent results for one tag (e.g. juz-30) across all Ayah notes.

Wikilinks: Juz index (structural partition) · Surahs (categories) for conceptual framing


Phase G — Publish (Quartz → localhost) & visual eval

Goal: Render the Quran tree in a static site, then screenshot and evaluate UX (navigation, embeds, search).

Implemented: uv run .dev/scripts/quartz_build.py --content Graphe/Quran temporarily points .dev/quartz/content at the Quran tree, swaps in quartz.config.quran.ts (fast build: ignores Ayah/), runs Quartz, then restores the Torah symlink and quartz.config.ts. Use --include-ayah for all 6k+ ayah pages (slow). Deploy defaults to quran-graphe; override with --pages-project.

Commands (verified)

cd /Users/rmac/repos/GrapheLogos
 
# If Quartz fails with ENOTEMPTY on rmdir under public/ (mixed Torah+Quran leftovers):
rm -rf .dev/quartz/public
 
uv run .dev/scripts/quartz_build.py --content Graphe/Quran --serve
# Listen URL is usually http://localhost:8080 — if EADDRINUSE: kill $(lsof -ti :8080)

Screenshot / browser eval

  1. bunx agent-browser install (once). If open reports Browser not launched, use Playwright from the Quartz package:
cd .dev/quartz && npx playwright screenshot http://localhost:8080 /tmp/quran-quartz.png
  2. Smoke: curl -sI http://localhost:8080 → 200; manually check Quran home, this plan, a sample surah.

Eval notes (local run)

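| Check             | Result                                                    | Improvement                                                                |
|-------------------|-----------------------------------------------------------|----------------------------------------------------------------------------|
| Home / index      | Quran · GrapheLogos, explorer: Atlas / Juz / Surahs       | OK                                                                         |
| Graph view        | Often empty with partial corpus                           | Add links or tune Quartz graph when more surahs exist                      |
| Surah subset      | 114/114 files present                                     | Keep fetch reruns hash-aware; regenerate sidecars after content updates    |
| Git date warnings | Quartz warns “not yet tracked by git”                     | git add Graphe/Quran when ready                                            |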

DoD (publish slice): HTTP 200 on /; RESEARCH renders; build succeeds after public/ clean + free 8080.

Wikilinks: Quran home · Atlas · Ayah index · Juz index


Phase H — Structured build loop (GSD) alignment

The repo’s GSD workflow (gsd headless query, phases scope → … → done) is not initialized here until .gsd/ exists (gsd in project root). When you add it:

  1. Hypothesis for a cycle: e.g. “Completing fetch unblocks 90% of broken Ayah embeds.”
  2. Scope one gap (see table above).
  3. Research → Plan (single DoD) → Build → Test → Regression → Eval → Post-mortem → Log → Next paths.

Paste the phase verify bash blocks from your template at the top of each agent run; do not proceed on EXPECTED_PHASE mismatch.


Risk register

| Risk                             | Mitigation                                                |
|----------------------------------|-----------------------------------------------------------|
| API rate limits / 429            | fetch_quran delay + quran_api retries; batch fetch        |
| Huge repo (6k+ Ayah files)       | Git LFS optional; or generate Ayah on demand              |
| Quartz wikilink paths            | Align vault paths with Quartz baseUrl or use alias        |
| Entity extraction false positives | Human-in-the-loop; pilot surahs first                    |
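The rate-limit mitigation combines a delay with retries. A minimal sketch of exponential backoff, assuming the caller's fetch function raises an exception on HTTP 429; `RateLimited` and `with_retries` are hypothetical names and do not describe how quran_api.py is actually written.

```python
import time


class RateLimited(Exception):
    """Raised by the caller's fetch function on HTTP 429 (hypothetical)."""


def with_retries(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on RateLimited wait base_delay * 2**attempt and try again.

    The sleep function is injectable so the loop can be exercised without waiting.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
```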

Next actions (ranked)

Latest run (2026-03-22): Phase I search API - CF Pages Function /api/search deployed to qurangraphe.pages.dev; flex-offline and flex-api BM25 aligned (identical MRR); latency column added to eval; --quran-only flag scopes eval to quran corpus.

Phase I remaining:

  1. Report cosmetics - build_report() aggregate uses the full QUERIES list (shows 19 queries instead of 6 for --quran-only); empty groups still render; fix both to respect active_queries.
  2. qur-04/qur-05 MRR=0.00 — “Juz 30 short surahs” and “Moses Musa staff Pharaoh” score 0 on both flex-offline and flex-api; the Juz-30 page and Musa Atlas page aren’t surfacing via BM25 title+content. Investigate contentIndex.json entries for these slugs.
  3. Cold-start latency — first flex-api request costs ~1s (index fetch + build over 6.3MB); subsequent warm requests ~250-300ms. Consider pre-warming or a smaller dedicated search index.

Prior cycle:

  4. Review queue triage - process low candidates and promote confirmed aliases into Atlas frontmatter.
  5. Alias precision pass - reduce low-signal English triggers (god, lord, short terms) by adding Arabic aliases and tighter disambiguation.
  6. Phase F continuation - extend Ayah frontmatter with topics: [] once extraction is trusted; add juz tag to all Ayah notes.
  7. GSD in repo - run gsd from repo root so .gsd/ exists and §Phase H can be executed directly.

2026-03-22: /api/search live; BM25 pre-built cached index in Worker (no per-request rebuild); flex-offline/flex-api aligned; --quran-only eval scope; latency measurement added.

2026-03-20: atlas_kg + wikilinks all 4 families; CLAUDE.md; Phase F revelation metadata (114 surahs); noindex on 6,268 Ayah + Juz stubs.


QMD second pass (search index)

qmd indexes Graphe/Quran as collection graphelogos-quran and runs BM25 “gap probes” (fetch coverage, review queue, stubs, entity pipeline, etc.). Regenerate the report after major vault changes:

uv run .dev/scripts/quran_qmd_gap_pass.py

Output: qmd-pipeline-gaps.md. Hybrid qmd query is optional locally (needs LLM + unset CI); BM25 is CI-safe.

qmd entity-relationship pass: uv run .dev/scripts/quran_qmd_entity_extract.py → qmd-atlas-entity-graph (BM25 graphe:qmd_cooccurs triples). Review-queue evidence: quran_entity_qmd_evidence.py.



Phase I — Search API alignment (flex-offline = flex-api)

Goal: flex-offline and flex-api produce identical rankings, so the live /api/search endpoint is a faithful proxy for local BM25, and the eval table reports latency for both.

Problem diagnosis (2026-03-22)

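| Symptom                        | Root cause                                                                                                                    |
|--------------------------------|-------------------------------------------------------------------------------------------------------------------------------|
| Worker 503 error 1102          | bm25Search() rebuilds the inverted index (6,696 docs, 3.7M chars) on every request - exceeds CF Workers 10ms CPU limit       |
| flex-api title-only workaround | Changed entry.title only to stay under limit - now a different algorithm from flex-offline                                   |
| qur-02/qur-03 diverge          | Title-only BM25 ranks differently than title+content BM25                                                                    |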

Fix strategy

Worker: cache the pre-built inverted index (termDf, termPostings, docLengths, avgDl, N) alongside the raw JSON. Cold start builds once; warm requests only do the scoring step (cheap). This allows title+content BM25 to stay within limits on warm Workers.

flex-offline: already uses title+content in bm25_rank() via search_common.py - no change needed once Worker is fixed to match.

Alignment check: both use k1=1.5, b=0.75, both tokenize with [a-zA-Z0-9]+ lowercased, both score title+content concatenated.
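The build-once, score-many split described above can be sketched in Python. It uses the stated parameters (k1=1.5, b=0.75), the stated tokenizer, and title+content concatenation. The IDF formula (the non-negative Lucene-style variant), the de-duplication of query terms, and all function and key names are assumptions for illustration; they are not taken from search_common.py or functions/api/search.js.

```python
import math
import re
from collections import Counter, defaultdict

K1, B = 1.5, 0.75  # parameters named in the alignment check
TOKEN = re.compile(r"[a-zA-Z0-9]+")


def tokenize(text):
    return [t.lower() for t in TOKEN.findall(text)]


def build_index(docs):
    """docs: list of (slug, title, content). Expensive step: run once and cache."""
    postings = defaultdict(dict)  # term -> {doc_id: term frequency}
    doc_lengths = []
    for doc_id, (_slug, title, content) in enumerate(docs):
        tokens = tokenize(f"{title} {content}")
        doc_lengths.append(len(tokens))
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    n = len(docs)
    return {
        "slugs": [d[0] for d in docs],
        "postings": postings,
        "doc_lengths": doc_lengths,
        "avg_dl": sum(doc_lengths) / n if n else 0.0,
        "n": n,
    }


def search(index, query, limit=10):
    """Cheap step: score only documents that contain a query term."""
    scores = defaultdict(float)
    for term in set(tokenize(query)):  # assumption: repeated query terms count once
        posting = index["postings"].get(term)
        if not posting:
            continue
        df = len(posting)
        idf = math.log(1 + (index["n"] - df + 0.5) / (df + 0.5))  # assumed IDF variant
        for doc_id, tf in posting.items():
            norm = 1 - B + B * index["doc_lengths"][doc_id] / index["avg_dl"]
            scores[doc_id] += idf * tf * (K1 + 1) / (tf + K1 * norm)
    # Sort by score descending, then doc_id, so ties rank deterministically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(index["slugs"][d], s) for d, s in ranked[:limit]]
```

A deterministic tie-break matters for the alignment goal: two implementations with identical scores can still report different MRR if they order ties differently.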

Eval scope changes

  • Restrict eval to quran-only queries (qur-01 to qur-05, corpus graphelogos-quran) for the flex-offline/flex-api comparison - the API only covers the Quran corpus; Abraham/Torah/cross-scripture queries (corpus graphelogos) always return ERR and pollute the aggregate
  • Add latency column (wall-clock ms) for flex-offline and flex-api side-by-side
  • Target table format (quran-only slice):
| Query   | flex-offline MRR | flex-offline ms | flex-api MRR | flex-api ms |
|---------|-----------------|-----------------|--------------|-------------|
| qur-01  | 0.12            | 2               | 0.12         | 45          |
| qur-02  | 1.00            | 2               | 1.00         | 42          |
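The two per-endpoint columns in the target table come from a reciprocal-rank score and a wall-clock timer around each runner. A minimal sketch, assuming a runner is a callable that takes a query string and returns ranked slugs; `reciprocal_rank`, `timed`, and `evaluate` are hypothetical names, and only `latency_ms` mirrors the key named in the checklist for search_eval.py.

```python
import time


def reciprocal_rank(ranked_slugs, relevant):
    """1/position of the first relevant slug, or 0.0 when none is returned."""
    for position, slug in enumerate(ranked_slugs, start=1):
        if slug in relevant:
            return 1.0 / position
    return 0.0


def timed(runner, query):
    """Run one query and return (results, wall-clock latency in ms)."""
    start = time.perf_counter()
    results = runner(query)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return results, latency_ms


def evaluate(runner, queries):
    """queries: {query_id: (query_text, set_of_relevant_slugs)}.

    Returns per-query rows and the mean reciprocal rank over those queries only.
    """
    rows = {}
    for qid, (text, relevant) in queries.items():
        results, latency_ms = timed(runner, text)
        rows[qid] = {"rr": reciprocal_rank(results, relevant), "latency_ms": latency_ms}
    mrr = sum(r["rr"] for r in rows.values()) / len(rows) if rows else 0.0
    return rows, mrr
```

Because the mean is taken over the queries passed in, scoping the input to the quran-only slice keeps out-of-corpus errors from diluting the aggregate.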

Implementation checklist

  • functions/api/search.js - separate buildIndex(rawIndex) → cached _builtIndex; revert to title+content BM25; redeploy
  • search_eval.py - add latency_ms to result dict; time each runner; add latency column to report
  • search_eval.py - add --quran-only flag; run with --endpoints flex-offline,flex-api
  • verify: flex-api and flex-offline produce identical MRR on all quran queries
  • verify: Worker no longer 503s on warm requests (pre-cached index)
  • fix build_report() aggregate + group rendering to respect active_queries (not full QUERIES)
  • investigate qur-04 / qur-05 MRR=0.00 (check contentIndex entries for Juz-30 and Musa slugs)

Deploy rule (critical)

Always deploy qurangraphe from .dev/quartz/ so wrangler auto-detects the adjacent functions/ directory. Running wrangler pages deploy from the repo root strips the Worker (discovered 2026-03-22).

cd /Users/rmac/repos/GrapheLogos/.dev/quartz
bunx wrangler pages deploy /Users/rmac/repos/GrapheLogos/.dev/public/quran \
  --project-name qurangraphe --branch=main --commit-dirty=true

Current scores (baseline before fix, quran-only queries)

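| Query          | qmd-bm25 | flex-offline | flex-api (title-only, misaligned) |
|----------------|----------|--------------|-----------------------------------|
| qur-01 Fatihah | 1.00     | 0.12         | 0.12                              |
| qur-02 Qiyamah | 1.00     | 1.00         | 0.00                              |
| qur-03 Alafasy | 1.00     | 0.50         | 0.00                              |
| qur-04 Juz 30  | 0.33     | 0.00         | 1.00                              |
| qur-05 Moses   | 1.00     | 0.00         | 0.00                              |
| avg            | 0.87     | 0.32         | 0.22                              |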

After fix, flex-offline and flex-api columns should match exactly.

See also: search-eval-2026-03-22-0355 (latest run)


See also


Cycle goal: wire full Quran fetch, Atlas extraction, and Quartz proof — see §Phase G (commands + eval) and §Phase H (GSD).