Quran corpus — research & build plan

GSD phase gate: There is no .gsd/ directory in this repo yet, so gsd headless query exits with “No .gsd/ directory found”. Run gsd once from the repo root to initialize, then use your template’s EXPECTED_PHASE checks. Until then, the structured scope → plan → build phases are documented here only.

Hypothesis (cycle): Finishing the 114 surah fetch plus keeping surah-hashes.json + Quartz paths stable removes the largest blockers for Ayah embeds, Atlas extraction, and publish.

This note is the master plan for turning the vault’s Quranic layer into a complete, reproducible, searchable, and publishable corpus. It uses Obsidian wikilinks to Surahs, Atlas, Ayah, Juz, scripts under .dev/scripts/, and related notes so you can navigate the graph and hand phases to agents without re-explaining context.

Current inventory (what already exists)

| Layer | Role | Where |
|---|---|---|
| API client | Shared `httpx` + retries for Quran.com v4 | `.dev/scripts/quran_api.py` (used by fetch + generators) |
| Surah fetch | Arabic + English + OpenFurqan links + `ayah_header_lines` | `.dev/scripts/fetch_quran.py` → Surahs folder |
| Hash cache | Per-file SHA-256 + fetch options to skip API | surah-hashes.json |
| Ayah line index | CLI to refresh / extract ayah by line | `.dev/scripts/quran_surah_index.py` · `.dev/scripts/quran_surah_lines.py` |
| Ayah notes | 6,236 stubs embedding `### Ayah n` from surah files | Ayah index |
| Juz pages | 30 parts, API `verse_mapping`, embeds Ayah notes | Juz index |
| Atlas | Divine names, people, places, books (surahs) | Quran Atlas |
| Overview | OpenFurqan, mushaf order, categories | Surahs (vault note) |

Gap (highest leverage when this plan was drafted): not all 114 surahs were present as files; only a subset had been fetched. The Phase G eval notes now record 114/114 files present. Downstream embeds, entity extraction, and published HTML all depend on complete or explicitly scoped surah text first.

Target pipeline (end state)

flowchart LR
  F[Fetch surahs API] --> O[Organize paths + FM]
  O --> H[Hash + index JSON]
  O --> L[Ayah line index in FM]
  O --> A[Atlas entity notes]
  A --> C[Categorize + tag]
  C --> P[Publish Quartz]
  P --> V[Browser eval]

Each stage below lists inputs, outputs, tools, and Definition of Done (observable).

Phase A — Fetch (complete text)

Goal: Every surah 1…114 exists as Graphe/Quran/Surahs/Surah NNN - Name.md with consistent frontmatter and ayah_header_lines.

Inputs: translation_id, arabic_field, cache policy (see fetch script).
Outputs: 114 markdown files; updated surah-hashes.json.
Commands: uv run .dev/scripts/fetch_quran.py -f (or staged batches to respect API limits).
DoD: find Graphe/Quran/Surahs -name 'Surah *.md' | wc -l → 114; random spot-check ayah_count vs API.

Wikilinks: Surahs folder note · Surahs overview
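
The file-count DoD above only says how many files exist, not which surah numbers are missing. A minimal sketch of a coverage check, assuming file names follow the Surah NNN - Name.md convention; the helper names `missing_surahs` and `missing_in_folder` are hypothetical and not part of the existing scripts:

```python
import re
from pathlib import Path

# Assumed naming convention: "Surah NNN - Name.md" (three-digit surah number).
SURAH_FILE = re.compile(r"^Surah (\d{3}) - .+\.md$")


def missing_surahs(filenames, total=114):
    """Return the surah numbers in 1..total that have no matching file name."""
    present = set()
    for name in filenames:
        match = SURAH_FILE.match(name)
        if match:
            present.add(int(match.group(1)))
    return [n for n in range(1, total + 1) if n not in present]


def missing_in_folder(folder, total=114):
    """Convenience wrapper: scan a Surahs folder on disk."""
    names = (p.name for p in Path(folder).glob("Surah *.md"))
    return missing_surahs(names, total)
```

An empty list from `missing_in_folder("Graphe/Quran/Surahs")` would correspond to the 114-file DoD; a non-empty list names the surahs to fetch next.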

Phase B — Organize (stable layout & naming)

Goal: One canonical tree; no duplicate “surah” stories.

Convention: Surah NNN - {name_simple}.md only under Surahs; [[Graphe/Quran/Ayah/Ayah|Ayah]] / Juz names stay Ayah SSS-AAA / Juz JJ.
Regenerate: uv run .dev/scripts/generate_quran_juz_ayah.py after any rename (uses quran_api + /chapters + /juzs).
DoD: No broken ![[Graphe/Quran/Surahs/...#ayah-n|Ayah n]] embeds in a sample of Ayah notes across all juz ranges.
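
The naming convention above can be kept in one place so generators and fetchers cannot drift apart. A small sketch, assuming NNN, SSS-AAA and JJ mean zero-padded widths of 3, 3+3 and 2 digits; the function names are illustrative, not taken from the existing scripts:

```python
def surah_filename(number, name_simple):
    """File name for a surah note, e.g. surah 2 -> 'Surah 002 - Al-Baqarah.md'."""
    return f"Surah {number:03d} - {name_simple}.md"


def ayah_note_name(surah, ayah):
    """Note name for a single ayah stub (assumed pattern: Ayah SSS-AAA)."""
    return f"Ayah {surah:03d}-{ayah:03d}"


def juz_note_name(juz):
    """Note name for a juz page (assumed pattern: Juz JJ)."""
    return f"Juz {juz:02d}"
```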

Phase C — Hash & integrity

Goal: Reproducible “what changed” for CI and agents.

Existing: surah-hashes.json entries (path, surah, sha256, translation options).
Extensions: optional global manifest (single JSON listing all surah hashes + generator versions) for diff in PRs.
DoD: Re-run fetch with no API change → no file write (hash unchanged); intentional edit → hash flips.

Wikilink: surah-hashes.json
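
The DoD above (no write when nothing changed, hash flips on an intentional edit) can be illustrated with a hash-gated write. This is a sketch under assumptions: the real surah-hashes.json also records fetch options, and the `write_if_changed` helper and the cache shape used here are hypothetical:

```python
import hashlib
from pathlib import Path


def sha256_text(text):
    """SHA-256 hex digest of the UTF-8 encoding of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def write_if_changed(path, text, cache):
    """Write text only when its hash differs from the cached one.

    cache is a plain dict keyed by path string (hypothetical shape).
    Returns True when the file was written, False when it was skipped.
    """
    path = Path(path)
    digest = sha256_text(text)
    entry = cache.get(str(path))
    if entry and entry.get("sha256") == digest and path.exists():
        return False  # unchanged content: no file write
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    cache[str(path)] = {"sha256": digest}
    return True
```

Returning a boolean makes the "no file write" half of the DoD directly assertable in a test or a CI step.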

Phase D — Index (machine + human)

Goal: Fast random access without loading huge files.

Per-surah FM: ayah_header_lines (line of each ### Ayah n) — maintained by fetch + quran_surah_index.py index.
Optional: byte-offset index in a sidecar if line-scan cost becomes an issue (future).
DoD: uv run .dev/scripts/quran_surah_index.py extract -f "…/Surah 002 - Al-Baqarah.md" -a 7 prints correct block on a fully fetched Baqarah.

Wikilinks: Atlas (tooling section) references the same index idea.
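
To make the line-index idea concrete, here is a sketch of building a header-line index and slicing one ayah block out of a surah file. It assumes headers are exactly `### Ayah n` on their own line and that line numbers are 1-based; it is not the implementation in quran_surah_index.py:

```python
import re

AYAH_HEADER = re.compile(r"^### Ayah (\d+)\s*$")


def index_ayah_headers(lines):
    """Map ayah number -> 1-based line number of its '### Ayah n' header."""
    index = {}
    for lineno, line in enumerate(lines, start=1):
        match = AYAH_HEADER.match(line)
        if match:
            index[int(match.group(1))] = lineno
    return index


def extract_ayah(lines, index, ayah):
    """Return one ayah block: its header line up to the next header or EOF."""
    start = index[ayah]
    later = [ln for ln in index.values() if ln > start]
    end = min(later) - 1 if later else len(lines)
    return lines[start - 1:end]
```

With the index stored in frontmatter, a reader only needs the header line numbers to slice a block instead of re-scanning the file for headers.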

Phase E — Atlas entity extraction

Goal: Atlas entity notes are populated from corpus-wide extraction with a balanced quality gate.

Implemented workflow (full corpus):

Ontology lock — Atlas extraction now scans four families: Divine Names, People, Places, Books (scriptural books, not surah files).
Candidate generation — run full scan over all 114 surahs:

uv run .dev/scripts/quran_entity_pipeline.py --all-surahs --write-sidecars --write-reports

Confidence queue — emit summary + review queue:

uv run .dev/scripts/quran_entity_pipeline.py --all-surahs --write-summary --write-review-queue

Balanced write-back — apply only high confidence hits to Atlas notes via idempotent auto blocks:

uv run .dev/scripts/quran_entity_pipeline.py --all-surahs --apply-high

Validation + regression — sidecar schema/path/ayah checks plus Surah 1 baseline comparison:

uv run .dev/scripts/quran_entity_pipeline.py --all-surahs --write-sidecars --validate
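
The balanced write-back step relies on auto blocks being idempotent: applying the same hits twice must leave the note unchanged. A minimal sketch of that property follows. The marker strings, the hit dictionary shape, and the `apply_auto_block` name are all assumptions for illustration, and the sample references in any usage are placeholders rather than pipeline output:

```python
# Hypothetical markers delimiting the machine-managed region of an Atlas note.
START = "<!-- auto:entities:start -->"
END = "<!-- auto:entities:end -->"


def apply_auto_block(note_text, hits):
    """Replace (or append) the auto block with the high-confidence hits only."""
    kept = sorted({h["ref"] for h in hits if h["confidence"] == "high"})
    block = "\n".join([START, *[f"- {ref}" for ref in kept], END])
    if START in note_text and END in note_text:
        # Existing block: swap its contents, keep everything around it.
        head, rest = note_text.split(START, 1)
        _, tail = rest.split(END, 1)
        return head + block + tail
    # No block yet: append one at the end of the note.
    sep = "" if (not note_text or note_text.endswith("\n")) else "\n"
    return note_text + sep + block + "\n"
```

Because hand-written text outside the markers is never touched, medium and low confidence candidates can stay in the review queue without risk to the note body.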

Artifacts produced:

Per-surah reports: Graphe/Quran/Research/entities/entity-scan-surah-NNN.md
Sidecars (schema_version: 3): Graphe/Quran/meta/entities/surah-NNN.yaml
Corpus summary: entity-corpus-summary
Review queue: entity-review-queue
qmd evidence dossier: entity-review-qmd-evidence
Validation report: entity-validation-report

qmd-assisted review (semantic helper):

# Ensure qmd has the Quran collection once
qmd collection add "/Users/rmac/repos/GrapheLogos" --name graphelogos-quran --mask "Graphe/Quran/**/*.md"
 
# Build evidence for queued medium/low matches
uv run .dev/scripts/quran_entity_qmd_evidence.py --collection graphelogos-quran --mode search

Legacy pilot remains: uv run .dev/scripts/quran_entity_pilot.py -s 1 --write-report --write-sidecar (single-surah check).

Wikilinks: Divine Names · People · Places · Books

Phase F — Categorize & tag (corpus semantics)

Goal: Filter by Meccan/Medinan, theme, juz, hizb (optional), without duplicating the mushaf.

Implemented (surah-level): uv run .dev/scripts/quran_surah_metadata.py --write enriched all 114 surah frontmatter files with:

revelation_place: Meccan | Medinan (from Quran.com /api/v4/chapters)
revelation_order: <int> (chronological order 1–114; Surah 096 = 1, first revealed)

Script is idempotent; re-run is safe.
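
One way to get that idempotence is to upsert keys rather than append them. The sketch below handles only flat `key: value` lines in a leading `---` block and uses no YAML library; it is an illustration under those assumptions, not the logic of quran_surah_metadata.py:

```python
def upsert_frontmatter(text, updates):
    """Set scalar key: value pairs in a leading '---' block; safe to re-run."""
    lines = text.split("\n")
    if lines and lines[0] == "---" and "---" in lines[1:]:
        end = lines.index("---", 1)
        fm, body = lines[1:end], lines[end + 1:]
    else:
        fm, body = [], lines  # no frontmatter yet: create one
    remaining = dict(updates)
    out = []
    for line in fm:
        key = line.split(":", 1)[0].strip()
        # Only rewrite top-level keys; leave nested or list lines untouched.
        if key in remaining and not line.startswith((" ", "-")):
            out.append(f"{key}: {remaining.pop(key)}")
        else:
            out.append(line)
    out.extend(f"{k}: {v}" for k, v in remaining.items())
    return "\n".join(["---", *out, "---", *body])
```

Running it twice with the same `updates` yields byte-identical text, which also keeps the hash cache from flipping on a no-op re-run.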

Sources: external datasets (Quran.com metadata, academic tables) or manual YAML in Graphe/Quran/meta/ (proposed).
Per-ayah: extend Ayah frontmatter with optional topics: [] once extraction is trusted.
DoD: Query (e.g. Dataview or rg) returns consistent results for one tag (e.g. juz-30) across all Ayah notes.

Wikilinks: Juz index (structural partition) · Surahs (categories) for conceptual framing

Phase G — Publish (Quartz → localhost) & visual eval

Goal: Render the Quran tree in a static site, then screenshot and evaluate UX (navigation, embeds, search).

Implemented: uv run .dev/scripts/quartz_build.py --content Graphe/Quran temporarily points .dev/quartz/content at the Quran tree, swaps in quartz.config.quran.ts (fast build: ignores Ayah/), runs Quartz, then restores the Torah symlink and quartz.config.ts. Use --include-ayah for all 6k+ ayah pages (slow). Deploy defaults to quran-graphe; override with --pages-project.

Commands (verified)

cd /Users/rmac/repos/GrapheLogos
 
# If Quartz fails with ENOTEMPTY on rmdir under public/ (mixed Torah+Quran leftovers):
rm -rf .dev/quartz/public
 
uv run .dev/scripts/quartz_build.py --content Graphe/Quran --serve
# Listen URL is usually http://localhost:8080 — if EADDRINUSE: kill $(lsof -ti :8080)

Screenshot / browser eval

bunx agent-browser install (once). If open reports Browser not launched, use Playwright from the Quartz package:

cd .dev/quartz && npx playwright screenshot http://localhost:8080 /tmp/quran-quartz.png

Smoke: curl -sI http://localhost:8080 → 200; manually check Quran home, this plan, a sample surah.

Eval notes (local run)

| Check | Result | Improvement |
|---|---|---|
| Home / index | Quran · GrapheLogos, explorer: Atlas / Juz / Surahs | OK |
| Graph view | Often empty with partial corpus | Add links or tune Quartz graph when more surahs exist |
| Surah subset | 114/114 files present | Keep fetch reruns hash-aware; regenerate sidecars after content updates |
| Git date warnings | Quartz warns “not yet tracked by git” | `git add Graphe/Quran` when ready |

DoD (publish slice): HTTP 200 on /; RESEARCH renders; build succeeds after public/ clean + free 8080.

Wikilinks: Quran home · Atlas · Ayah index · Juz index

Phase H — Structured build loop (GSD) alignment

The repo’s GSD workflow (gsd headless query, phases scope → … → done) is not initialized here until .gsd/ exists (gsd in project root). When you add it:

Hypothesis for a cycle: e.g. “Completing fetch unblocks 90% of broken Ayah embeds.”
Scope one gap (see table above).
Research → Plan (single DoD) → Build → Test → Regression → Eval → Post-mortem → Log → Next paths.

Paste the phase verify bash blocks from your template at the top of each agent run; do not proceed on EXPECTED_PHASE mismatch.

Risk register

| Risk | Mitigation |
|---|---|
| API rate limits / 429 | `fetch_quran` delay + `quran_api` retries; batch fetch |
| Huge repo (6k+ Ayah files) | Git LFS optional; or generate Ayah on demand |
| Quartz wikilink paths | Align vault paths with Quartz `baseUrl` or use `alias` |
| Entity extraction false positives | Human-in-the-loop; pilot surahs first |
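
For the rate-limit row, the general shape of a retry with exponential backoff looks like the sketch below. It is transport-agnostic on purpose: `RateLimited`, `with_retries`, and the delay values are hypothetical, and the actual retry policy inside quran_api.py is not shown in this note:

```python
import time


class RateLimited(Exception):
    """Raised by the caller-supplied request function on an HTTP 429."""


def with_retries(request, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call request(); on RateLimited, wait base_delay * 2**attempt and retry."""
    for attempt in range(attempts):
        try:
            return request()
        except RateLimited:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.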

Next actions (ranked)

Latest run (2026-03-22): Build loop cycle complete - qmd-bm25 gate PASSES (MRR=0.772 >= 0.40); web comparison run; Quartz FlexSearch on live site nearly non-functional for Quran queries (flex-web MRR=0.053).

Phase J - Quartz FlexSearch repair (critical):

flex-web MRR=0.053 — Quartz FlexSearch on qurangraphe.pages.dev returns empty/wrong results for 4/5 quran queries (qur-01 Fatihah, qur-02 Qiyamah, qur-03 Alafasy, qur-05 Moses). Only qur-04 (Juz 30) works. Root cause unknown - candidates: FlexSearch index not built for Quran content, noindex tags on surah files blocking FlexSearch, or FlexSearch tokenizer failing on Arabic/transliteration content.
Corpus scope mismatch — flex-web only supports graphelogos-quran corpus; 14/19 eval queries (Abraham, Torah, cross-scripture) return ERR. Need either a full-corpus live endpoint or restrict eval to quran-only when running flex-web.
flex-api as replacement — our CF Pages /api/search correctly serves BM25 results (aligned with flex-offline). Consider wiring Quartz search UI to call /api/search instead of client-side FlexSearch.

Phase I remaining:

4. Report cosmetics: build_report() aggregate uses the full QUERIES list (shows 19 queries instead of 6 for --quran-only); empty groups still render; fix both to respect active_queries.
5. qur-04/qur-05 flex-offline MRR=0.00: “Juz 30 short surahs” and “Moses Musa staff Pharaoh” score 0 on flex-offline; investigate contentIndex.json entries for these slugs.
6. Cold-start latency: the first flex-api request costs ~1s (index fetch + build over 6.3MB); subsequent warm requests take ~250-300ms.

Prior cycle:

7. Review queue triage: process low candidates and promote confirmed aliases into Atlas frontmatter.
8. Alias precision pass: reduce low-signal English triggers (god, lord, short terms) by adding Arabic aliases and tighter disambiguation.
9. GSD in repo: run gsd from repo root so .gsd/ exists and §Phase H can be executed directly.

2026-03-22: Build loop cycle - gate PASSES qmd-bm25=0.772; web comparison reveals Quartz FlexSearch broken for Quran (flex-web=0.053); /api/search CF Function correct and aligned.
2026-03-22: /api/search live; BM25 pre-built cached index in Worker (no per-request rebuild); flex-offline/flex-api aligned; --quran-only eval scope; latency measurement added.
2026-03-20: atlas_kg + wikilinks all 4 families; CLAUDE.md; Phase F revelation metadata (114 surahs); noindex on 6,268 Ayah + Juz stubs.

QMD second pass (search index)

qmd indexes Graphe/Quran as collection graphelogos-quran and runs BM25 “gap probes” (fetch coverage, review queue, stubs, entity pipeline, etc.). Regenerate the report after major vault changes:

uv run .dev/scripts/quran_qmd_gap_pass.py

Output: qmd-pipeline-gaps.md. Hybrid qmd query is optional locally (needs LLM + unset CI); BM25 is CI-safe.

qmd entity–relationship pass: uv run .dev/scripts/quran_qmd_entity_extract.py → qmd-atlas-entity-graph (BM25 graphe:qmd_cooccurs triples). Review-queue evidence: quran_entity_qmd_evidence.py.

Phase I — Search API alignment (flex-offline = flex-api)

Goal: flex-offline and flex-api produce identical rankings so the live /api/search endpoint is a faithful proxy for the local BM25 - and latency of both is measured in the eval table.

Problem diagnosis (2026-03-22)

| Symptom | Root cause |
|---|---|
| Worker 503 error 1102 | `bm25Search()` rebuilds the inverted index (6,696 docs, 3.7M chars) on every request - exceeds CF Workers 10ms CPU limit |
| flex-api title-only workaround | Switched to scoring `entry.title` only to stay under the limit - this is now a different algorithm from flex-offline |
| qur-02/qur-03 diverge | Title-only BM25 ranks differently than title+content BM25 |

Fix strategy

Worker: cache the pre-built inverted index (termDf, termPostings, docLengths, avgDl, N) alongside the raw JSON. Cold start builds once; warm requests only do the scoring step (cheap). This allows title+content BM25 to stay within limits on warm Workers.

flex-offline: already uses title+content in bm25_rank() via search_common.py - no change needed once Worker is fixed to match.

Alignment check: both use k1=1.5, b=0.75, both tokenize with [a-zA-Z0-9]+ lowercased, both score title+content concatenated.
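
The build-once, score-many split described in the fix strategy can be sketched in Python using the parameters from the alignment check (k1=1.5, b=0.75, `[a-zA-Z0-9]+` lowercased, title+content concatenated). The IDF variant used here is an assumption, since this note does not state which one search_common.py and the Worker share, and the index layout is illustrative rather than the Worker's actual structure:

```python
import math
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"[a-zA-Z0-9]+")
K1, B = 1.5, 0.75


def tokenize(text):
    """ASCII alphanumeric tokens, lowercased (Arabic script yields no tokens)."""
    return [t.lower() for t in TOKEN.findall(text)]


def build_index(docs):
    """docs: {slug: {"title": str, "content": str}} -> reusable index (built once)."""
    postings = defaultdict(dict)  # term -> {slug: term frequency}
    doc_len = {}
    for slug, doc in docs.items():
        tokens = tokenize(doc["title"] + " " + doc["content"])
        doc_len[slug] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings[term][slug] = tf
    n = len(docs)
    avg_dl = sum(doc_len.values()) / n if n else 0.0
    return {"postings": dict(postings), "doc_len": doc_len, "avg_dl": avg_dl, "n": n}


def search(index, query, limit=10):
    """Score only the postings of the query terms (the cheap, per-request step)."""
    scores = defaultdict(float)
    for term in set(tokenize(query)):
        hits = index["postings"].get(term, {})
        if not hits:
            continue
        df = len(hits)
        # Assumed IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
        idf = math.log(1 + (index["n"] - df + 0.5) / (df + 0.5))
        for slug, tf in hits.items():
            norm = 1 - B + B * index["doc_len"][slug] / index["avg_dl"]
            scores[slug] += idf * tf * (K1 + 1) / (tf + K1 * norm)
    return sorted(scores, key=lambda s: (-scores[s], s))[:limit]
```

Two points follow from the sketch: the per-request cost depends on the query terms' postings rather than on corpus size, and an ASCII-only tokenizer silently drops Arabic text, so such content is only findable through transliteration or English.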

Eval scope changes

Restrict eval to quran-only queries (qur-01 to qur-05, corpus graphelogos-quran) for the flex-offline/flex-api comparison - the API only covers the Quran corpus; Abraham/Torah/cross-scripture queries (corpus graphelogos) always return ERR and pollute the aggregate
Add latency column (wall-clock ms) for flex-offline and flex-api side-by-side
Target table format (quran-only slice):

| Query   | flex-offline MRR | flex-offline ms | flex-api MRR | flex-api ms |
|---------|-----------------|-----------------|--------------|-------------|
| qur-01  | 0.12            | 2               | 0.12         | 45          |
| qur-02  | 1.00            | 2               | 1.00         | 42          |

Implementation checklist

functions/api/search.js - separate buildIndex(rawIndex) → cached _builtIndex; revert to title+content BM25; redeploy
search_eval.py - add latency_ms to result dict; time each runner; add latency column to report
search_eval.py - add --quran-only flag; run with --endpoints flex-offline,flex-api
verify: flex-api and flex-offline produce identical MRR on all quran queries
verify: Worker no longer 503s on warm requests (pre-cached index)
fix build_report() aggregate + group rendering to respect active_queries (not full QUERIES) — Phase J
investigate qur-04 / qur-05 MRR=0.00 (Juz-30 and Musa slugs missing/sparse in contentIndex) — Phase J
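
For the latency item in the checklist, a wall-clock wrapper around each runner is usually enough. A sketch, assuming a runner is any callable that takes a query and returns ranked results; `timed_run` and the result-dict keys other than latency_ms are hypothetical:

```python
import time


def timed_run(runner, query):
    """Run one endpoint runner and attach wall-clock latency in milliseconds."""
    start = time.perf_counter()
    results = runner(query)
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {"query": query, "results": results, "latency_ms": round(latency_ms, 1)}
```

`time.perf_counter()` is monotonic, so the measurement is not disturbed by system clock adjustments during a run.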

Deploy rule (critical)

Always deploy qurangraphe from .dev/quartz/ so wrangler auto-detects the adjacent functions/ directory. Running wrangler pages deploy from the repo root strips the Worker (discovered 2026-03-22).

cd /Users/rmac/repos/GrapheLogos/.dev/quartz
bunx wrangler pages deploy /Users/rmac/repos/GrapheLogos/.dev/public/quran \
  --project-name qurangraphe --branch=main --commit-dirty=true

3-way eval results (2026-03-22, quran-only queries)

python3 .dev/scripts/search_eval.py --endpoints bm25,flex-offline,flex-api --quran-only

| Query | qmd-bm25 | flex-offline | flex-api | flex-offline ms | flex-api ms |
|---|---|---|---|---|---|
| abr-05 Ibrahim patriarch | 1.00 | 0.00 | 0.00 | 683 | 413 |
| qur-01 Fatihah | 1.00 | 0.12 | 0.12 | 809 | 310 |
| qur-02 Qiyamah | 1.00 | 1.00 | 1.00 | 942 | 207 |
| qur-03 Alafasy | 1.00 | 0.50 | 0.50 | 833 | 189 |
| qur-04 Juz 30 | 0.33 | 0.00 | 0.00 | 754 | 255 |
| qur-05 Moses Musa | 1.00 | 0.00 | 0.00 | 592 | 211 |
| avg (quran-only) | 0.89 | 0.27 | 0.27 | ~769 | ~264 |

Result: flex-offline and flex-api are perfectly aligned (identical MRR every query). The CF Pages /api/search endpoint faithfully mirrors offline BM25. flex-api latency ~3x faster than flex-offline (no local index load; quran-only index is smaller).
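
For readers checking the table, the per-query figure is read here as a reciprocal rank (1 divided by the rank of the first relevant hit, 0 when none is found) and the bottom row as the mean over the six listed queries. That reading is an assumption about search_eval.py, and the helper names are illustrative:

```python
def reciprocal_rank(ranked_slugs, relevant):
    """1 / rank of the first relevant slug, or 0.0 when none is returned."""
    for rank, slug in enumerate(ranked_slugs, start=1):
        if slug in relevant:
            return 1.0 / rank
    return 0.0


def mean(values):
    """Arithmetic mean; 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


# Per-query values copied from the flex-offline column of the table above.
flex_offline = [0.00, 0.12, 1.00, 0.50, 0.00, 0.00]
print(round(mean(flex_offline), 2))  # 0.27
```

The same arithmetic on the qmd-bm25 column (1.00, 1.00, 1.00, 1.00, 0.33, 1.00) gives 0.89, matching the average row.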

See also: /tmp/search-3way-2026-03-21-2216.md

Baseline before fix (for reference)

| Query | qmd-bm25 | flex-offline | flex-api (title-only, was misaligned) |
|---|---|---|---|
| qur-02 Qiyamah | 1.00 | 1.00 | 0.00 |
| qur-03 Alafasy | 1.00 | 0.50 | 0.00 |
| qur-04 Juz 30 | 0.33 | 0.00 | 1.00 |

Phase J — Quartz FlexSearch diagnostics (2026-03-22)

Goal: Understand why Quartz client-side FlexSearch on qurangraphe.pages.dev fails for 4/5 quran benchmark queries.

Web comparison baseline

Run: just search-web → flex-offline vs flex-web (Playwright), 2026-03-22

| Query | flex-offline | flex-web (live Quartz) |
|---|---|---|
| qur-01 Fatihah opening chapter | MRR=0.12 | MRR=0.00 |
| qur-02 Day of Resurrection Qiyamah | MRR=1.00 | MRR=0.00 |
| qur-03 Alafasy recitation audio | MRR=0.50 | MRR=0.00 |
| qur-04 Juz 30 short surahs | MRR=0.00 | MRR=1.00 |
| qur-05 Moses Musa staff Pharaoh | MRR=0.00 | MRR=0.00 |
| avg (quran-only) | 0.32 | 0.20 |

Note: flex-web shows ERR for all 14 graphelogos corpus queries (Abraham, Torah, cross-scripture) - flex-web is scoped to qurangraphe.pages.dev only.

Hypotheses

| # | Hypothesis | How to test |
|---|---|---|
| J1 | ~~Surah files have `noindex: true`~~ | Checked: 0 surah files have noindex; Ayah stubs do - RULED OUT |
| J2 | FlexSearch index not built for Quran content (Quartz content dir scoping) | Check `quartz.config.quran.ts` `ignorePatterns` |
| J3 | Arabic + transliteration text breaks FlexSearch tokenizer | Compare English-only vs mixed content queries |
| J4 | Quartz build doesn’t include Quran surahs in `pagefind` index | Check built `public/quran/static/` for pagefind index |

Next steps

Check surah frontmatter for noindex: done under J1 (0 surah files carry noindex; only the Ayah stubs do), so this flag is not what hides surah pages from FlexSearch
Inspect public/quran/static/contentIndex.json - verify surah slugs present (flex-offline loads this; if slugs exist there but not in FlexSearch, it’s a FlexSearch build issue)
Check Quartz config ignorePatterns in quartz.config.quran.ts
If wiring to /api/search is the right path - add a custom search component to Quartz that calls the CF Pages Function instead of FlexSearch
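
The contentIndex.json inspection step can be scripted. The sketch below assumes the file is a JSON object keyed by slug whose entries carry a `content` string, which is how Quartz's content index is commonly laid out but should be confirmed against the built file. The helper names and the example slugs are hypothetical:

```python
import json


def load_content_index(path):
    """Load a built contentIndex.json (assumed: object keyed by slug)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def missing_slugs(content_index, expected_slugs):
    """Expected slugs that are absent from the index, in input order."""
    return [slug for slug in expected_slugs if slug not in content_index]


def empty_content_slugs(content_index):
    """Slugs that are present but whose 'content' field is empty or whitespace."""
    return sorted(
        slug
        for slug, entry in content_index.items()
        if not (entry.get("content") or "").strip()
    )
```

If the surah slugs are present with non-empty content yet flex-web still returns nothing, the fault is more likely in the client-side FlexSearch build or tokenization than in the emitted index.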
