Agentic-Service-Data-Eyond-Catalog

Rifqi Hafizuddin committed 21 days ago

Commit

f31f673

1 Parent(s): efc0c0a

[KM-553] initialize shared contract

Files changed (12)

REPO_CONTEXT.md +448 -0
src/catalog/introspect/database.py +198 -2
src/catalog/models.py +19 -1
src/catalog/pii_detector.py +28 -5
src/catalog/reader.py +20 -3
src/catalog/store.py +50 -4
src/catalog/validator.py +26 -1
src/db/postgres/init_db.py +1 -0
src/db/postgres/models.py +16 -0
src/query/ir/models.py +0 -1
src/query/ir/operators.py +13 -3
src/query/ir/validator.py +107 -6

REPO_CONTEXT.md ADDED

	@@ -0,0 +1,448 @@

+# Repo Context — Agentic Service Data Eyond Catalog
+Orientation file for future Claude Code sessions. Cross-reference `ARCHITECTURE.md` for the full design rationale and decision log.
+---
+## TL;DR
+FastAPI multi-agent backend for data analysis. Users upload documents and register databases / tabular files; they ask natural-language questions and get answers grounded in their data, streamed via SSE.
+The architecture has two paths:
+- **Unstructured** (PDF, DOCX, TXT) — dense similarity over prose chunks (PGVector).
+- **Structured** (databases, XLSX, CSV, Parquet) — a per-user **data catalog** describes what tables/columns exist; an LLM produces a **JSON IR** of intent; a deterministic Python compiler turns the IR into SQL or pandas; the executor runs it.
+The LLM produces *intent*, not query syntax. Deterministic code does the rest.
+The repo is a **scaffold** — folder structure and contracts (Pydantic shapes, ABC signatures, docstrings) are in place; most module bodies raise `NotImplementedError`. See *Implementation status* below.
+---
+## Stack
+- Python 3.12, FastAPI 0.115, uvicorn, sse-starlette
+- Async SQLAlchemy 2.0 + asyncpg (Postgres), psycopg3 (PGVector multi-statement workaround)
+- LangChain 0.3 + langchain-postgres (PGVector) + langchain-openai (Azure OpenAI GPT-4o + embeddings)
+- LangGraph 0.2 + langgraph-checkpoint-postgres
+- Redis 5 (response + retrieval cache)
+- Azure Blob Storage (uploads + Parquet)
+- pandas, pyarrow, polars-ready (deferred), sqlglot, pydantic v2, structlog, slowapi, langfuse
+- presidio-analyzer + spaCy `en_core_web_lg` (PII), pytesseract + pdf2image (PDF OCR)
+- DB connectors: psycopg2, pymysql, pymssql, sqlalchemy-bigquery, snowflake-sqlalchemy
+Run: `uv run --no-sync uvicorn main:app --host 0.0.0.0 --port 7860`. On Windows use `uv run --no-sync python run.py` (sets `WindowsSelectorEventLoopPolicy` for psycopg3 async).
+---
+## Top-level layout
+```
+main.py                — FastAPI app + middleware + router wiring + init_db() on startup
+run.py                 — Windows-safe local entry point
+ARCHITECTURE.md        — design intent (source of truth for shape + invariants)
+README.md
+Dockerfile             — python:3.12-slim, installs spaCy en_core_web_lg, tesseract, poppler
+pyproject.toml / uv.lock
+scripts/               — backfill scripts (build_initial_catalogs, enrich_all_sources)
+src/                   — all application code
+```
+---
+## src/ map
+### Core data shapes (only files with real content)
+| Path | Role |
+|---|---|
+| `catalog/models.py` | Pydantic: `Catalog → Source[] → Table[] → Column[]` |
+| `query/ir/models.py` | `QueryIR` (select / filters / group_by / order_by / limit) |
+| `query/ir/operators.py` | `ALLOWED_FILTER_OPS`, `ALLOWED_AGG_FNS`, `LIMIT_HARD_CAP=10000` |
+| `security/pii_patterns.py` | name patterns + email/phone regex for PII detection |
+### Catalog — identity layer for structured sources (Cs ∪ Ct)
+| Path | Role |
+|---|---|
+| `catalog/introspect/base.py` | `BaseIntrospector.introspect(location_ref) -> Source` |
+| `catalog/introspect/database.py` | `information_schema` + ~100 row sample → draft Source |
+| `catalog/introspect/tabular.py` | Parquet/CSV/XLSX header reader + sample (one Table per sheet for XLSX) |
+| `catalog/enricher.py` | one LLM call per source — adds AI descriptions at source/table/column |
+| `catalog/validator.py` | invariants beyond Pydantic shape (unique IDs, FK refs) |
+| `catalog/store.py` | persist as Postgres `jsonb` row keyed by user_id (`get/upsert/delete`) |
+| `catalog/reader.py` | load + filter catalog by source_hint (returns full catalog for ≤50 tables) |
+| `catalog/pii_detector.py` | flag PII columns at ingestion → suppresses `sample_values` |
+### Query — catalog-driven structured path
+| Path | Role |
+|---|---|
+| `query/service.py` | `QueryService.run(user_id, question, catalog) -> QueryResult` (top-level) |
+| `query/planner/service.py` | LLM call: question + catalog → QueryIR (structured output) |
+| `query/planner/prompt.py` | renders catalog into the planner prompt |
+| `query/ir/validator.py` | catalog-aware IR validation: column_ids exist, ops whitelisted, value_type matches data_type, limit ≤ cap |
+| `query/compiler/base.py` | `BaseCompiler.compile(ir) -> object` |
+| `query/compiler/sql.py` | IR → `(sql, params)`; identifiers from catalog, values parameterized |
+| `query/compiler/pandas.py` | IR → callable that runs against a DataFrame |
+| `query/executor/base.py` | `BaseExecutor.run(ir) -> QueryResult` (uniform across backends) |
+| `query/executor/db.py` | runs compiled SQL via asyncpg/pymysql in read-only txn (sqlglot second-line defence) |
+| `query/executor/tabular.py` | runs pandas/polars chain on a Parquet file (eager pandas → pyarrow pushdown → polars lazy by file size) |
+| `query/executor/dispatcher.py` | picks DB vs Tabular executor based on `source.source_type` of the IR's source |
+### Retrieval — unstructured path (Cu)
+| Path | Role |
+|---|---|
+| `retrieval/document.py` | `DocumentRetriever` over PGVector chunks |
+| `retrieval/router.py` | dispatches the `unstructured` route (the `chat` and `structured` routes do not pass through here) |
+### Agents — the four LLM call sites
+| Path | Role |
+|---|---|
+| `agents/intent_router.py` | classify message → `needs_search`, `source_hint ∈ {chat, unstructured, structured}` |
+| `agents/chatbot.py` | final answer formation (receives Cu chunks or QueryResult); SSE-streamed |
+(`CatalogEnricher` and `QueryPlanner` are the other two LLM call sites; they live under `catalog/` and `query/planner/` respectively.)
+### Pipelines — ingestion coordinators
+| Path | Role |
+|---|---|
+| `pipeline/orchestrator.py` | top-level: routes uploads / DB connects to the right pipeline |
+| `pipeline/structured_pipeline.py` | DB / tabular: introspect → enrich → validate → store |
+| `pipeline/document_pipeline.py` | unstructured: extract → chunk → embed → PGVector |
+| `pipeline/triggers.py` | event entry points called by API routes (`on_document_uploaded`, `on_db_registered`, …) |
+### Security — cross-cutting
+| Path | Role |
+|---|---|
+| `security/auth.py` | bcrypt password hash/verify, JWT encode/decode, get_user |
+| `security/credentials.py` | Fernet encrypt/decrypt for stored DB credentials |
+| `security/pii_patterns.py` | (already listed) |
+### API + infra + config
+| Path | Role |
+|---|---|
+| `api/v1/*.py` | FastAPI routers — thin endpoints delegating to `pipeline/triggers` and `query/service` |
+| `models/api/{catalog,chat,document}.py` | request/response Pydantic models |
+| `db/postgres/connection.py` | two async engines: `engine` (app) and `_pgvector_engine` (PGVector) |
+| `db/postgres/init_db.py` | startup: creates `vector` extension, all tables, HNSW + GIN indexes |
+| `db/postgres/models.py` | SQLAlchemy app tables (users, rooms, chat messages, …) |
+| `db/postgres/vector_store.py` | shared PGVector instance (collection `document_embeddings`) |
+| `db/redis/connection.py` | async Redis client |
+| `storage/az_blob/az_blob.py` | Azure Blob async wrapper (uploads + Parquet) |
+| `middlewares/{cors,logging,rate_limit}.py` | CORS allow-all (POC), structlog JSON, slowapi |
+| `observability/langfuse/langfuse.py` | trace helper |
+| `config/settings.py` | pydantic-settings; `.env` uses double-underscore aliases |
+| `config/env_constant.py` | env file path constant |
+| `config/prompts/*.md` | prompt templates: `intent_router`, `catalog_enricher`, `query_planner`, `chatbot_system`, `guardrails` |
+---
+## Core architectural decisions
+1. **Catalog as primary context, not retrieval.** For ≤50 tables (typical), the entire catalog is rendered into the planner prompt verbatim (~3–5k tokens). No vector search, no BM25, no top-k for structured data. Catalog-level retrieval (BM25 + table-level vectors with RRF) is the *deferred* upgrade for users with hundreds of tables.
+2. **JSON IR over raw SQL.** The planner LLM emits a Pydantic-validated intent, never a SQL string. The compiler is deterministic Python. Benefits: validatable before execution, dialect-portable (one IR → SQL of any dialect / pandas / polars), cheaper tokens, trivially testable without an LLM, and the LLM literally cannot emit invalid SQL syntax.
+3. **Deterministic compiler, not LLM SQL writer.** All actual query construction happens in pure code. Compiler bugs are reproducible and fixable. Same IR → same query.
+4. **Pipeline stage isolation.** Each stage (`IntentRouter`, `CatalogReader`, `QueryPlanner`, `IRValidator`, `QueryCompiler`, `QueryExecutor`, `ChatbotAgent`) is its own module with typed input and typed output. No god classes.
+5. **Minimal LLM surface.** Only four LLM call sites in the system:
+   - `CatalogEnricher` — once per source, **at ingestion** (not query time)
+   - `IntentRouter` — once per user message
+   - `QueryPlanner` — once per structured query
+   - `ChatbotAgent` — once per answer (formatting)
+6. **Three-way routing**: `chat` / `unstructured` / `structured`. The router commits to one path. Cross-source questions ("compare DB sales vs uploaded customer file") are handled inside the structured path because the planner sees Cs ∪ Ct in one prompt. **DB vs tabular is not a routing concern** — it's a per-source attribute (`source_type`) that only matters at execution time.
+7. **Stable IDs.** `source_id`, `table_id`, `column_id` are stable internal references. Renaming a column in the source DB does not invalidate cached IRs.
+8. **PII suppression at the boundary.** Columns flagged with `pii_flag=true` have `sample_values: null` — real PII never enters LLM prompts. Auto-detected at ingestion via name patterns + value regex (`security/pii_patterns.py`). When in doubt, flag — false positives cost nothing; false negatives leak data.
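As a rough illustration of decision 8, the sketch below flags a column when either its name or one of its sampled values matches a pattern, and drops `sample_values` for flagged columns. The two patterns and the `flag_and_suppress` helper are hypothetical stand-ins; the real patterns live in `security/pii_patterns.py` and may look quite different.

```python
import re
from typing import Any

# Hypothetical patterns for illustration only; these are NOT the contents
# of security/pii_patterns.py.
NAME_PATTERN = re.compile(r"(email|phone|ssn|passport|birth)", re.IGNORECASE)
EMAIL_VALUE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def flag_and_suppress(name: str, sample_values: list[Any]) -> dict[str, Any]:
    """Return a column dict with pii_flag set and samples dropped if flagged."""
    by_name = NAME_PATTERN.search(name) is not None
    by_value = any(
        isinstance(v, str) and EMAIL_VALUE.match(v) is not None
        for v in sample_values
    )
    flagged = by_name or by_value
    return {
        "name": name,
        "pii_flag": flagged,
        # Flagged columns never carry real values into an LLM prompt.
        "sample_values": None if flagged else sample_values,
    }
```

Checking both the name and the values follows the "when in doubt, flag" rule: a column called `contact` that holds email addresses is still caught by the value check.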
+---
+## End-to-end flows
+### Ingestion (when user uploads a file or connects a DB)
+```
+source upload / DB connect
+    │
+    ├── unstructured (pdf/docx/txt)
+    │     → DocumentPipeline: extract → chunk → embed → PGVector
+    │
+    └── structured (DB schema or tabular file)
+          → introspect (information_schema or file headers + sample rows)
+          → CatalogEnricher (1 LLM call per source — AI descriptions)
+          → CatalogValidator (Pydantic + unique-IDs + FK refs)
+          → CatalogStore.upsert(user_id jsonb row)
+```
+### Query (per user message)
+```
+user message
+    │
+    → Redis cache check (24h TTL)  ── miss ─→ continue
+    →
+    → IntentRouter LLM   →  needs_search? source_hint?
+    │
+    ├── chat          → ChatbotAgent → SSE stream
+    ├── unstructured  → DocumentRetriever (Cu) → ChatbotAgent → SSE stream
+    └── structured    →
+          CatalogReader.read(user_id, "structured")          # full Cs ∪ Ct
+              ↓
+          QueryPlanner LLM(question, catalog) → QueryIR
+              ↓
+          IRValidator.validate(ir, catalog)
+              (source_id ∈ catalog, table_id ∈ source, column_ids ∈ table,
+               ops/aggs whitelisted, value_type matches data_type, limit ≤ 10000)
+              fail → re-prompt planner with error context (max 3 retries)
+              ↓
+          ExecutorDispatcher.pick(ir)              # by source.source_type
+              ├─ DbExecutor       → SqlCompiler → sqlglot guard → asyncpg/pymysql
+              │                     (read-only txn, 30s timeout)
+              └─ TabularExecutor  → PandasCompiler → eager pandas (≤100 MB)
+                                    or pyarrow pushdown (100 MB–1 GB)
+                                    or polars lazy scan (>1 GB)
+              ↓
+          QueryResult
+              ↓
+          ChatbotAgent → SSE stream
+```
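The size-based choice in the `TabularExecutor` branch above could be expressed as a small pure function. This is a minimal sketch: `pick_tabular_backend`, the returned labels, the use of binary megabytes, and the handling of the exact boundaries are all assumptions, not taken from the repo.

```python
MB = 1024 * 1024
GB = 1024 * MB


def pick_tabular_backend(file_size_bytes: int) -> str:
    """Map a Parquet file size to one of the three strategies in the flow."""
    if file_size_bytes <= 100 * MB:
        return "pandas_eager"       # load the whole file into memory
    if file_size_bytes <= GB:
        return "pyarrow_pushdown"   # push column/filter selection into the read
    return "polars_lazy"            # lazy scan for files above 1 GB
```

Keeping the decision in a pure function means the thresholds can be tested without touching any files. The ownership table further down scopes the first implementation to eager pandas only, so the other two branches would initially be unused.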
+---
+## Catalog schema (per-user `jsonb` row)
+```
+Catalog
+├── user_id, schema_version, generated_at
+└── sources[]
+    └── Source { source_id, source_type, name, description, location_ref, updated_at }
+        └── tables[]
+            └── Table { table_id, name, description, row_count }
+                └── columns[]
+                    └── Column { column_id, name, data_type, description,
+                                  nullable, pii_flag, sample_values[]|null, stats|null }
+```
+`source_type ∈ {schema, tabular, unstructured}`.
+`data_type ∈ {int, decimal, string, datetime, date, bool, json}`.
+Deferred Column fields (add when justified): `description_human`, `synonyms[]`, `tags[]`, `primary_key`, `foreign_keys`, `unit`, `semantic_type`, `example_questions[]`, `schema_hash`, `enrichment_status`.
+---
+## JSON IR schema
+```jsonc
+{
+  "ir_version": "1.0",
+  "source_id":  "...",
+  "table_id":   "...",
+  "select": [
+    {"kind": "column", "column_id": "...", "alias": "..."},
+    {"kind": "agg",    "fn": "count|count_distinct|sum|avg|min|max",
+                       "column_id": "...?", "alias": "..."}
+  ],
+  "filters": [
+    {"column_id": "...",
+     "op":    "= | != | < | <= | > | >= | in | not_in | is_null | is_not_null | like | between",
+     "value": ...,
+     "value_type": "int|decimal|string|datetime|date|bool"}
+  ],
+  "group_by": ["column_id", ...],
+  "order_by": [{"column_id": "...", "dir": "asc|desc"}],
+  "limit": 100
+}
+```
+Single-table only in v1. `having`, `offset`, boolean filter trees, `distinct`, joins, window functions are deferred until user demand proves the limitation.
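To make the catalog-aware checks concrete, here is a hedged sketch of validating an IR dict against one table's columns. The constant names mirror `query/ir/operators.py`, but their contents here are copied from the schema above and may not match the file exactly. `validate_ir` and the strict `value_type == data_type` comparison are assumptions for illustration, not the repo's `IRValidator`.

```python
ALLOWED_FILTER_OPS = {
    "=", "!=", "<", "<=", ">", ">=", "in", "not_in",
    "is_null", "is_not_null", "like", "between",
}
ALLOWED_AGG_FNS = {"count", "count_distinct", "sum", "avg", "min", "max"}
LIMIT_HARD_CAP = 10000
_NULL_OPS = {"is_null", "is_not_null"}  # these ops carry no value to type-check


def validate_ir(ir: dict, columns: dict[str, str]) -> list[str]:
    """Return error messages; an empty list means the IR passed.

    `columns` maps column_id -> data_type for the table the IR targets.
    """
    errors: list[str] = []
    for item in ir.get("select", []):
        if item["kind"] == "agg" and item.get("fn") not in ALLOWED_AGG_FNS:
            errors.append(f"aggregate not allowed: {item.get('fn')}")
        col = item.get("column_id")
        if col is not None and col not in columns:
            errors.append(f"unknown column_id: {col}")
    for f in ir.get("filters", []):
        if f["op"] not in ALLOWED_FILTER_OPS:
            errors.append(f"filter op not allowed: {f['op']}")
        if f["column_id"] not in columns:
            errors.append(f"unknown column_id: {f['column_id']}")
        elif f["op"] not in _NULL_OPS and f.get("value_type") != columns[f["column_id"]]:
            errors.append(f"value_type mismatch on {f['column_id']}")
    if ir.get("limit", 0) > LIMIT_HARD_CAP:
        errors.append(f"limit exceeds {LIMIT_HARD_CAP}")
    return errors
```

Collecting every error instead of stopping at the first gives the planner retry prompt the full list to correct in one pass.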
+---
+## Implementation status
+`raise NotImplementedError` everywhere except the four files listed under *Core data shapes*. Every stub has a docstring describing inputs, outputs, and rules — those are the contract. When implementing, fill in the body; don't change the signature without updating `ARCHITECTURE.md`.
+Per `ARCHITECTURE.md §9`, the initial PR ships:
+| Item | Status |
+|---|---|
+| Catalog Pydantic models | ✓ done (`catalog/models.py`) |
+| JSON IR Pydantic models | ✓ done (`query/ir/models.py`) |
+| Catalog ingestion (introspect → enrich → validate → store) | stubs |
+| `IntentRouter` (3-way `source_hint`) | stub |
+| `CatalogReader` | stub |
+| `QueryPlanner` LLM call | stub |
+| IR validator | stub |
+**Output of PR 1**: a validated `QueryIR` object. Execution lands in PR 2 (compiler), PR 3 (executors), PR 4 (retry/self-correction), PR 5 (eval harness), PR 6 (auto PII tagging). Joins, schema drift detection, hybrid catalog search are explicitly later.
+---
+## Team — division of work
+The service is built by two engineers; many modules are source-type-agnostic and shared.
+- **DB** owns SQL paths: introspection, SQL compiler, DB executor, credential storage.
+- **TAB** owns tabular paths: CSV/XLSX/Parquet introspection, pandas compiler, tabular executor, blob/Parquet plumbing.
+- **B** = both — shared contracts and source-type-agnostic plumbing. Pair-program or split with explicit hand-off.
+### Step-by-step ownership
+| # | Step | File / area | Owner | Notes |
+|---|---|---|---|---|
+| 0 | **Lock contracts before coding** | — | B | See "Decisions to lock" below; block until aligned |
+| 1 | Catalog Pydantic models | `catalog/models.py` | B | Already done; only touch if both agree |
+| 2 | IR Pydantic models | `query/ir/models.py` | B | Already done; joins/window fns require joint sign-off |
+| 3 | IR operator whitelists | `query/ir/operators.py` | B | Already done; both compilers rely on these |
+| 4 | PII patterns / regex | `security/pii_patterns.py` | B | Already done; extend together as gaps appear |
+| **Ingestion — introspection** | | | | |
+| 5 | DB introspector (information_schema, sample, FKs) | `catalog/introspect/database.py` | DB | Use SQLAlchemy `inspect()`; dialect-aware quoting |
+| 6 | Tabular introspector (CSV/XLSX/Parquet headers + sample) | `catalog/introspect/tabular.py` | TAB | Each XLSX sheet → one Table |
+| 7 | `BaseIntrospector` ABC | `catalog/introspect/base.py` | B | Confirm signature returns the same `Source` shape |
+| **Ingestion — shared catalog plumbing** | | | | |
+| 8 | Catalog enricher + prompt | `catalog/enricher.py`, `config/prompts/catalog_enricher.md` | B | Whoever picks it up first; the other reviews. Prompt must work uniformly across source types |
+| 9 | Catalog validator | `catalog/validator.py` | B | Type-agnostic |
+| 10 | Catalog store (Postgres jsonb) | `catalog/store.py` | B | Recommend DB (Postgres expertise) |
+| 11 | Catalog reader | `catalog/reader.py` | B | Type-agnostic |
+| 12 | PII detector | `catalog/pii_detector.py` | B | Either; uses `pii_patterns.py` |
+| **Ingestion — pipelines** | | | | |
+| 13 | Structured pipeline (introspect → enrich → validate → store) | `pipeline/structured_pipeline.py` | B | Pair on this — calls both introspectors via dispatcher |
+| 14 | Triggers (`on_db_registered`, `on_tabular_uploaded`) | `pipeline/triggers.py` | B | Each owns their trigger function |
+| 15 | Ingestion orchestrator | `pipeline/orchestrator.py` | B | Routes by source_type; pair |
+| 16 | Document pipeline (PDF/DOCX/TXT) | `pipeline/document_pipeline.py` | TAB | Tabular-adjacent (file uploads) |
+| **Query — shared spine** | | | | |
+| 17 | IR validator (catalog-aware) | `query/ir/validator.py` | B | Recommend DB; both must agree on exact error messages so retry-prompt is consistent |
+| 18 | Planner LLM service | `query/planner/service.py` | B | Type-agnostic |
+| 19 | Planner prompt (catalog → text) | `query/planner/prompt.py`, `config/prompts/query_planner.md` | B | **Pair-program**. Must describe DB tables and tabular files in one consistent format |
+| 20 | Intent router (chat/unstructured/structured) | `agents/intent_router.py`, `config/prompts/intent_router.md` | B | Type-agnostic |
+| 21 | Executor base + `QueryResult` | `query/executor/base.py` | B | Lock the shape before either implements an executor |
+| 22 | Executor dispatcher | `query/executor/dispatcher.py` | B | Reads `source.source_type` from catalog; pair |
+| 23 | Compiler base ABC | `query/compiler/base.py` | B | Already done |
+| 24 | Top-level QueryService | `query/service.py` | B | Wires planner → validator → compiler → executor; pair |
+| **Query — DB path** | | | | |
+| 25 | SQL compiler (IR → SQL + params, per dialect) | `query/compiler/sql.py` | DB | Identifiers from catalog (quoted), values parameterized |
+| 26 | DB executor (asyncpg/pymysql, sqlglot guard, RO txn, 30s timeout) | `query/executor/db.py` | DB | |
+| 27 | Credential encryption (Fernet) | `security/credentials.py` | DB | Needed for stored user DB creds |
+| 28 | User-DB connection management | helper in pipelines | DB | engine_scope context manager pattern |
+| **Query — Tabular path** | | | | |
+| 29 | Pandas compiler (IR → callable on DataFrame) | `query/compiler/pandas.py` | TAB | Same IR, different backend |
+| 30 | Tabular executor (eager pandas first; pyarrow / polars later) | `query/executor/tabular.py` | TAB | Initial scope: eager pandas only |
+| 31 | Parquet upload/download + Azure Blob wrapper | `storage/az_blob/az_blob.py` (+ helper) | TAB | XLSX sheet → one Parquet per sheet (deterministic blob name) |
+| **Agents + chat** | | | | |
+| 32 | Chatbot agent + prompt | `agents/chatbot.py`, `config/prompts/chatbot_system.md` | B | Receives QueryResult or Cu chunks |
+| 33 | Guardrails prompt | `config/prompts/guardrails.md` | B | |
+| **API surface** | | | | |
+| 34 | DB client endpoints (register/ingest/list/delete) | `api/v1/db_client.py` | DB | |
+| 35 | Document/tabular upload endpoints | `api/v1/document.py` | TAB | |
+| 36 | Chat stream endpoint (SSE) | `api/v1/chat.py` | B | Dispatches both paths; pair |
+| 37 | Room / users endpoints | `api/v1/room.py`, `api/v1/users.py` | B | Whoever has bandwidth |
+| **Tests + eval** | | | | |
+| 38 | DB compiler golden tests (IR → SQL fixtures) | `tests/query/compiler/test_sql.py` | DB | Pure-Python, no LLM |
+| 39 | Pandas compiler golden tests (IR → expected DataFrame) | `tests/query/compiler/test_pandas.py` | TAB | Pure-Python, no LLM |
+| 40 | IR validator tests (catalog × IR error matrix) | `tests/query/ir/test_validator.py` | B | Each contributes test cases for their source type |
+| 41 | Planner eval (golden question → IR examples) | `tests/query/planner/` | B | Each contributes ~10 question→IR examples |
+| 42 | E2E smoke tests | `tests/e2e/` | B | Pair |
+### Decisions to lock before coding
+If made unilaterally these create silent contract drift. Lock them in a 30-min sync first.
+| Decision | Why it matters | Recommended call |
+|---|---|---|
+| `QueryResult` shape (current scaffold: `source_id, backend, rows, row_count, truncated, elapsed_ms, error`) | Both executors return this; chatbot consumes it | Lock as-is unless either side needs more (e.g. `column_types` for formatting) |
+| `Source.location_ref` format (`az_blob://...` vs `dbclient://{id}` etc.) | Dispatcher and executors both parse this | Pick a convention now; document in `catalog/models.py` docstring |
+| Where do user DB credentials live? | DB executor needs creds to run queries; Source has `location_ref` but creds are encrypted separately | Recommend: `location_ref="dbclient://{client_id}"`; executor looks up creds by ID |
+| How does dispatcher pick the executor? | Routes by `source.source_type` — but where does dispatcher get it (catalog reload, or IR carries it)? | Recommend: dispatcher takes `(Catalog, IR)`, looks up source by `IR.source_id` |
+| Joins in v1 IR? | Excluded per ARCHITECTURE.md §7. DB path is most affected — real DB use often needs joins. | Recommend: ship single-table; revisit in PR 2. **DB owner must accept the constraint or push back early** |
+| Planner prompt — render tabular vs DB sources uniformly | If described differently, planner gets confused | Pair-program. Render both as `Table: name (n rows) — Columns: ...` regardless of source_type |
+| Error contract — raise or return `QueryResult.error`? | Both executors must behave the same so chatbot branches consistently | Recommend: never raise from `executor.run()`; populate `QueryResult.error` |
+| PII handling for tabular `sample_values` | DB samples come from sampled rows (~100 per table); tabular samples come from file reads. The same `pii_flag` rule must apply on both sides | Confirm tabular introspector calls `pii_detector` |
+| Catalog refresh trigger (open question §3) | Affects both pipelines symmetrically | Default: rebuild on every upload/connect; defer auto-refresh |
+| `updated_at` semantics — per-Source vs per-Catalog | Affects how each pipeline writes | Recommend: per-Source `updated_at` + Catalog-level `generated_at` |
+| Dialect support scope for v1 | DB compiler must implement at least one dialect well | Recommend: Postgres first (matches app DB); MySQL second |
+| Test-fixture format for golden IRs | Both compilers test against golden IR → expected output | Recommend: shared `tests/fixtures/golden_irs.json`; each side adds expected SQL or DataFrame |
+| Logging conventions | structlog is already in place; both should log the same fields | Quick agreement: log `source_id`, `table_id`, `ir_version`, `elapsed_ms` |
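The "never raise from `executor.run()`" recommendation in the table above might look roughly like this. The `QueryResult` field names come from the scaffold list in the first row; the field types, the defaults, and the `run_safely` wrapper are assumptions made for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class QueryResult:
    source_id: str
    backend: str
    rows: list[dict[str, Any]] = field(default_factory=list)
    row_count: int = 0
    truncated: bool = False
    elapsed_ms: float = 0.0
    error: Optional[str] = None


def run_safely(
    source_id: str,
    backend: str,
    fetch: Callable[[], list[dict[str, Any]]],
    limit: int = 10000,
) -> QueryResult:
    """Run `fetch` and always return a QueryResult, even on failure."""
    started = time.perf_counter()
    try:
        rows = fetch()
    except Exception as exc:  # deliberate catch-all: failures become data
        elapsed = (time.perf_counter() - started) * 1000
        return QueryResult(source_id, backend, elapsed_ms=elapsed,
                           error=f"{type(exc).__name__}: {exc}")
    elapsed = (time.perf_counter() - started) * 1000
    truncated = len(rows) > limit
    rows = rows[:limit]
    return QueryResult(source_id, backend, rows=rows, row_count=len(rows),
                       truncated=truncated, elapsed_ms=elapsed)
```

With this shape the chatbot only ever branches on `result.error`, regardless of which executor produced the result.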
+### Working rhythm (suggested)
+1. **Day 1** — 30-min sync to lock the decisions table. PR any contract/docstring changes that fall out.
+2. **Week 1** — both build introspectors + agree on the planner prompt format. PR in parallel; review each other's.
+3. **Week 2** — DB builds SQL compiler + DB executor; TAB builds pandas compiler + tabular executor. Both write golden tests against shared IR fixtures.
+4. **Week 3** — pair on dispatcher, QueryService, and chat endpoint integration. End-to-end smoke test.
+5. **Ongoing** — short daily standup, mostly to flag IR-shape questions and catalog-field additions *before* either side implements against an unconfirmed contract.
+The biggest risk is **silent contract drift**: one side adds a `QueryResult` field or assumes a new IR op exists, the other ships without it, and integration breaks at the dispatcher. The step 0 contract lock and the shared golden-IR fixtures are what prevent that.
+### Onboarding to Claude Code
+If you're new to Claude Code, before you start:
+1. Read `ARCHITECTURE.md` end-to-end (~10 min) — this is the source of truth.
+2. Skim this file (`REPO_CONTEXT.md`) — find your section in the ownership table.
+3. Read your owned files' docstrings — every stub explains its contract.
+4. Open Claude Code in this repo. When you ask Claude to implement a stub:
+   - Reference the file path + the contract it should follow
+   - Point it at `ARCHITECTURE.md` section if relevant (e.g. §7 for IR validation)
+   - Ask it to write the test first (golden IR fixtures), then the implementation
+   - Always review the diff — don't auto-accept
+Useful slash commands while working: `/review` (PR review), `/security-review` (audit pending changes).
+---
+## Conventions & gotchas
+- **Async event loop on Windows**: `run.py` sets `WindowsSelectorEventLoopPolicy` because psycopg3 async needs it. Don't call `uvicorn` directly on Windows.
+- **Two Postgres engines**: `engine` (app tables) and `_pgvector_engine` (asyncpg with `prepared_statement_cache_size=0`) — the latter is required because PGVector emits `advisory_lock + CREATE EXTENSION` as a multi-statement string and asyncpg rejects multi-statement prepared queries. `init_db.py` creates the extension explicitly so `PGVector(create_extension=False)` skips that path.
+- **Read-only at every layer for user DBs**: IR validation + compiler whitelists + sqlglot SELECT-only check + read-only DB credentials + LIMIT enforcement + 30s timeout. Six layers; no single point of failure.
+- **Identifiers vs values**: identifiers (table/column names) come from the catalog and are inlined as quoted identifiers — they were verified at validation time so this is safe. Values from `IR.filters` are *always* parameterized, never inlined as strings.
+- **Credential encryption**: Fernet via `dataeyond__db__credential__key` env var; lives in `security/credentials.py`. Sensitive fields = `{"password", "service_account_json"}`.
+- **Settings env-var aliases**: `.env` uses double-underscore names (`azureai__api_key__4o`); `Settings` exposes them as `azureai_api_key_4o` via `Field(alias=...)`. Mind both forms when adding settings.
+- **Prompts**: `src/config/prompts/*.md` — most are placeholders ("to be written"). The system prompt + few-shots for each LLM call site live here, not inline in the agent code.
+- **No tests yet**: pytest-asyncio + ruff + mypy are in dev deps; create `tests/` when implementing PR 1. The IR validator and compiler should be the first targets — both are deterministic and testable without an LLM.
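The "identifiers vs values" rule above can be sketched as follows, assuming Postgres-style double-quoted identifiers and asyncpg-style `$n` placeholders. `quote_ident`, `compile_filters`, and the `_SQL_OPS` set are hypothetical names for this illustration, not the repo's `SqlCompiler`.

```python
_SQL_OPS = {"=", "!=", "<", "<=", ">", ">="}  # illustrative subset of the whitelist


def quote_ident(name: str) -> str:
    """Quote an identifier, doubling any embedded double quote."""
    return '"' + name.replace('"', '""') + '"'


def compile_filters(
    table: str, filters: list[tuple[str, str, object]]
) -> tuple[str, list[object]]:
    """Build SELECT ... WHERE from (column_name, op, value) triples.

    Table and column names are assumed to come from the validated catalog.
    Values never appear in the SQL text; they travel in `params`.
    """
    clauses: list[str] = []
    params: list[object] = []
    for column, op, value in filters:
        if op not in _SQL_OPS:
            raise ValueError(f"operator not allowed: {op}")
        params.append(value)
        clauses.append(f"{quote_ident(column)} {op} ${len(params)}")
    sql = f"SELECT * FROM {quote_ident(table)}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params
```

Because only catalog-verified names and whitelisted operators reach the SQL string, a hostile filter value has nowhere to land except the parameter list.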
+---
+## Recommended reading order
+1. `ARCHITECTURE.md` — design intent (the source of truth)
+2. `src/catalog/models.py` + `src/query/ir/models.py` — the two data shapes everything else moves between
+3. `src/query/ir/operators.py` + `src/security/pii_patterns.py` — the explicit whitelists / patterns
+4. Skim every `__init__.py`-level docstring under `src/catalog/`, `src/query/`, `src/agents/`, `src/pipeline/` — each describes the contract its module enforces
+5. `main.py` + `src/db/postgres/{connection,init_db}.py` — runtime bootstrap
+6. `ARCHITECTURE.md §10` — five open questions that haven't been decided yet
+---
+## Open questions (unresolved)
+From `ARCHITECTURE.md §10`:
+1. Catalog storage shape — JSON file per user vs Postgres `jsonb` row?
+2. Should the catalog also list unstructured files (with descriptions only) so the router has a unified view?
+3. Catalog refresh trigger — explicit "rebuild" button, on every upload, or background TTL?
+4. Confirm joins are out of initial IR scope?
+5. PII handling for `sample_values` — mask, synthesize, or skip?
+Settle these as PRs land — most won't block PR 1.
+---
+## Glossary
+- **Cu** — unstructured context (prose chunks)
+- **Cs** — schema context (DB tables/columns from catalog)
+- **Ct** — tabular context (file sheets/columns from catalog)
+- **IR** — intermediate representation (the JSON query shape)
+- **PII** — personally identifiable information
+- **ABC** — abstract base class

src/catalog/introspect/database.py CHANGED

@@ -3,14 +3,210 @@
 Reads information_schema for tables/columns/types, samples ~100 rows per table
 for `sample_values` and basic stats. Does NOT generate descriptions
 (that happens in CatalogEnricher).
 """
-from ..models import Source
 from .base import BaseIntrospector
 class DatabaseIntrospector(BaseIntrospector):
     """Connect to user DB → read information_schema → sample 100 rows/table."""
     async def introspect(self, location_ref: str) -> Source:
-        raise NotImplementedError

 Reads information_schema for tables/columns/types, samples ~100 rows per table
 for `sample_values` and basic stats. Does NOT generate descriptions
 (that happens in CatalogEnricher).
+Reuses Phase 1 utilities (`database_client_service`, `db_credential_encryption`,
+`db_pipeline_service.engine_scope`, `extractor.get_schema/profile_column/get_row_count`)
+to avoid reimplementation. The cleanup PR will move those into `security/` and
+`pipeline/db_pipeline/` respectively.
 """
+import asyncio
+import hashlib
+from datetime import UTC, datetime
+from decimal import Decimal
+from typing import Any
+from src.database_client.database_client_service import database_client_service
+from src.db.postgres.connection import AsyncSessionLocal
+from src.middlewares.logging import get_logger
+from src.pipeline.db_pipeline import db_pipeline_service
+from src.pipeline.db_pipeline.extractor import (
+    get_row_count,
+    get_schema,
+    profile_column,
+)
+from src.utils.db_credential_encryption import decrypt_credentials_dict
+from ..models import Column, ColumnStats, DataType, Source, Table
+from ..pii_detector import PIIDetector
 from .base import BaseIntrospector
+logger = get_logger("db_introspector")
+_DBCLIENT_PREFIX = "dbclient://"
+def _stable_id(prefix: str, *parts: str) -> str:
+    """Deterministic short ID from joined parts. Survives renames at the
+    `name` field while preserving identity for cached IRs.
+    Hash is non-cryptographic (identifier only).
+    """
+    h = hashlib.sha1(
+        "/".join(parts).encode("utf-8"), usedforsecurity=False
+    ).hexdigest()[:12]
+    return f"{prefix}{h}"
+def _map_sql_type(sql_type: str) -> DataType:
+    """Map a stringified SQLAlchemy type to a Catalog DataType.
+    Matches on substring of the SQLAlchemy type repr (e.g. 'INTEGER',
+    'TIMESTAMP', 'BOOLEAN'). Conservative — unknowns fall back to "string"
+    so the column is at least addressable.
+    """
+    s = sql_type.upper()
+    if "INT" in s:
+        return "int"
+    if "FLOAT" in s or "NUMERIC" in s or "DECIMAL" in s or "REAL" in s or "DOUBLE" in s:
+        return "decimal"
+    if "BOOL" in s:
+        return "bool"
+    if "TIMESTAMP" in s or "DATETIME" in s:
+        return "datetime"
+    if "DATE" in s:
+        return "date"
+    if "JSON" in s:
+        return "json"
+    return "string"
+def _normalize(v: Any) -> Any:
+    """Coerce non-JSON-native scalars (Decimal, numpy, datetime) to types
+    that survive the jsonb round-trip when the catalog is persisted.
+    """
+    if v is None:
+        return None
+    if isinstance(v, Decimal):
+        return float(v)
+    try:
+        import numpy as np
+        if isinstance(v, np.generic):
+            return v.item()
+    except ImportError:
+        pass
+    if isinstance(v, datetime):
+        return v.isoformat()
+    return v
 class DatabaseIntrospector(BaseIntrospector):
     """Connect to user DB → read information_schema → sample 100 rows/table."""
+    def __init__(self) -> None:
+        self._pii = PIIDetector()
     async def introspect(self, location_ref: str) -> Source:
+        if not location_ref.startswith(_DBCLIENT_PREFIX):
+            raise ValueError(
+                f"DatabaseIntrospector expects 'dbclient://...' location_ref, "
+                f"got {location_ref!r}"
+            )
+        client_id = location_ref[len(_DBCLIENT_PREFIX):]
+        if not client_id:
+            raise ValueError("location_ref is missing client_id after 'dbclient://'")
+        async with AsyncSessionLocal() as session:
+            client = await database_client_service.get(session, client_id)
+        if client is None:
+            raise ValueError(f"DatabaseClient {client_id!r} not found")
+        creds = decrypt_credentials_dict(client.credentials)
+        logger.info(
+            "introspecting db source",
+            client_id=client_id,
+            db_type=client.db_type,
+            name=client.name,
+        )
+        # SQLAlchemy inspect() + pandas read_sql are synchronous — run in a
+        # threadpool so the event loop stays free.
+        tables: list[Table] = await asyncio.to_thread(
+            self._introspect_sync, client.db_type, creds
+        )
+        return Source(
+            source_id=client_id,
+            source_type="schema",
+            name=client.name,
+            description="",
+            location_ref=location_ref,
+            updated_at=datetime.now(UTC),
+            tables=tables,
+        )
+    def _introspect_sync(self, db_type: str, creds: dict) -> list[Table]:
+        with db_pipeline_service.engine_scope(db_type, creds) as engine:
+            schema = get_schema(engine)
+            tables: list[Table] = []
+            for table_name, cols in schema.items():
+                try:
+                    row_count = get_row_count(engine, table_name)
+                except Exception as e:
+                    logger.error(
+                        "row_count failed; skipping table",
+                        table=table_name,
+                        error=str(e),
+                    )
+                    continue
+                columns: list[Column] = []
+                for col in cols:
+                    try:
+                        profile = profile_column(
+                            engine,
+                            table_name,
+                            col["name"],
+                            col.get("is_numeric", False),
+                            row_count,
+                        )
+                    except Exception as e:
+                        logger.error(
+                            "profile_column failed; skipping column",
+                            table=table_name,
+                            column=col["name"],
+                            error=str(e),
+                        )
+                        continue
+                    columns.append(self._to_column(table_name, col, profile))
+                tables.append(
+                    Table(
+                        table_id=_stable_id("t_", table_name),
+                        name=table_name,
+                        description="",
+                        row_count=row_count,
+                        columns=columns,
+                    )
+                )
+        return tables
+    def _to_column(
+        self, table_name: str, col: dict[str, Any], profile: dict[str, Any]
+    ) -> Column:
+        name = col["name"]
+        sample_values: list[Any] | None = [
+            _normalize(v) for v in (profile.get("sample_values") or [])
+        ] or None
+        column = Column(
+            column_id=_stable_id("c_", table_name, name),
+            name=name,
+            data_type=_map_sql_type(str(col["type"])),
+            description="",
+            nullable=True,  # nullable not surfaced by extractor; default permissive
+            pii_flag=False,
+            sample_values=sample_values,
+            stats=ColumnStats(
+                min=_normalize(profile.get("min")),
+                max=_normalize(profile.get("max")),
+                distinct_count=profile.get("distinct_count"),
+            ),
+        )
+        if self._pii.detect(column):
+            return column.model_copy(update={"pii_flag": True, "sample_values": None})
+        return column
+database_introspector = DatabaseIntrospector()
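
To illustrate how the `_stable_id` helper behaves, here is a minimal standalone sketch. The function is renamed `stable_id` and the table and column names are made up for the example. It checks determinism and the ID shape, and also shows one edge case that follows from joining parts with an unescaped `/`: two different part tuples can produce the same ID if a name itself contains `/`.

```python
import hashlib


def stable_id(prefix: str, *parts: str) -> str:
    """Standalone copy of the helper's logic: prefix + first 12 hex chars of SHA-1."""
    digest = hashlib.sha1(
        "/".join(parts).encode("utf-8"), usedforsecurity=False
    ).hexdigest()[:12]
    return f"{prefix}{digest}"


# Hypothetical table/column names, used only for illustration.
table_id = stable_id("t_", "orders")
column_id = stable_id("c_", "orders", "customer_email")

# Same inputs always give the same ID, across calls and processes.
assert table_id == stable_id("t_", "orders")
# IDs are the prefix plus 12 hex characters.
assert len(column_id) == len("c_") + 12

# Edge case: the "/" separator is not escaped, so both tuples join to "a/b/c".
collision = stable_id("c_", "a/b", "c") == stable_id("c_", "a", "b/c")
print(table_id, column_id, collision)
```

Whether the collision matters depends on whether real table or column names can contain `/`; for typical SQL identifiers it probably does not.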

src/catalog/models.py CHANGED

@@ -1,6 +1,25 @@
 """Pydantic models for the per-user data catalog (Cs + Ct).
 See ARCHITECTURE.md §6 for the full schema definition.
 """
 from datetime import datetime
@@ -8,7 +27,6 @@ from typing import Any, Literal
 from pydantic import BaseModel, Field
 SourceType = Literal["schema", "tabular", "unstructured"]
 DataType = Literal["int", "decimal", "string", "datetime", "date", "bool", "json"]

 """Pydantic models for the per-user data catalog (Cs + Ct).
 See ARCHITECTURE.md §6 for the full schema definition.
+Source.location_ref URI scheme
+------------------------------
+A `Source` is uniquely addressable by `location_ref`; introspectors and
+executors parse it to find the underlying data:
+  schema sources   → "dbclient://{database_client_id}"
+                     Resolves via `database_client_service.get(...)` which
+                     returns a `DatabaseClient` row whose Fernet-encrypted
+                     credentials are decrypted at runtime.
+  tabular sources  → "az_blob://{user_id}/{document_id}"
+                     The Source aggregates one or more sheets as Tables;
+                     each per-sheet Parquet blob is named via
+                     `parquet_service.parquet_blob_name(user_id, document_id, sheet_name)`,
+                     so executors derive the per-Table blob path from
+                     `Source.location_ref` plus `Table.name`.
+  unstructured     → reserved (deferred — see ARCHITECTURE.md §10 q2).
 """
 from datetime import datetime
 from pydantic import BaseModel, Field
 SourceType = Literal["schema", "tabular", "unstructured"]
 DataType = Literal["int", "decimal", "string", "datetime", "date", "bool", "json"]
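
The `location_ref` scheme described in the docstring above could be parsed along these lines. This is a sketch only: `parse_location_ref`, `DbClientRef`, and `BlobRef` are hypothetical names that do not appear in the commit, and the real introspectors may split the string differently.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class DbClientRef:
    """Hypothetical parsed form of 'dbclient://{database_client_id}'."""
    client_id: str


@dataclass(frozen=True)
class BlobRef:
    """Hypothetical parsed form of 'az_blob://{user_id}/{document_id}'."""
    user_id: str
    document_id: str


def parse_location_ref(location_ref: str) -> DbClientRef | BlobRef:
    scheme, sep, rest = location_ref.partition("://")
    if not sep or not rest:
        raise ValueError(f"malformed location_ref: {location_ref!r}")
    if scheme == "dbclient":
        return DbClientRef(client_id=rest)
    if scheme == "az_blob":
        user_id, slash, document_id = rest.partition("/")
        if not slash or not user_id or not document_id:
            raise ValueError(
                f"az_blob ref needs user_id/document_id: {location_ref!r}"
            )
        return BlobRef(user_id=user_id, document_id=document_id)
    # The unstructured scheme is reserved, so anything else is rejected.
    raise ValueError(f"unsupported scheme {scheme!r} in {location_ref!r}")


print(parse_location_ref("dbclient://abc123"))
print(parse_location_ref("az_blob://user-1/doc-9"))
```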

src/catalog/pii_detector.py CHANGED

@@ -1,16 +1,39 @@
 """PII auto-detection for catalog columns.
 When pii_flag is set True, sample_values is forced to None so real PII
-never enters LLM prompts.
-Patterns live in src/security/pii_patterns.py.
 """
 from .models import Column
 class PIIDetector:
-    """Marks columns as pii_flag=True when name/values look sensitive."""
     def detect(self, column: Column) -> bool:
-        raise NotImplementedError

 """PII auto-detection for catalog columns.
 When pii_flag is set True, sample_values is forced to None so real PII
+never enters LLM prompts. Patterns live in src/security/pii_patterns.py.
 """
+from src.security.pii_patterns import EMAIL_REGEX, PHONE_REGEX, PII_NAME_PATTERNS
 from .models import Column
 class PIIDetector:
+    """Marks columns as pii_flag=True when name or sampled values look sensitive.
+    Bias is intentional: false positives hide harmless sample values,
+    false negatives leak data. When unsure, flag.
+    """
     def detect(self, column: Column) -> bool:
+        if self._name_matches(column.name):
+            return True
+        if column.sample_values and self._values_match(column.sample_values):
+            return True
+        return False
+    @staticmethod
+    def _name_matches(name: str) -> bool:
+        lowered = name.lower()
+        return any(pat in lowered for pat in PII_NAME_PATTERNS)
+    @staticmethod
+    def _values_match(values: list) -> bool:
+        for v in values:
+            if v is None:
+                continue
+            s = str(v)
+            if EMAIL_REGEX.match(s) or PHONE_REGEX.match(s):
+                return True
+        return False

src/catalog/reader.py CHANGED

@@ -4,20 +4,37 @@ For typical users (≤50 tables), returns the FULL catalog with no slicing.
 Catalog-level search is added later if catalog grows past the limit.
 """
 from typing import Literal
 from .models import Catalog
 from .store import CatalogStore
 SourceHint = Literal["chat", "unstructured", "structured"]
 class CatalogReader:
-    """Loads the user's catalog and filters by source_hint."""
     def __init__(self, store: CatalogStore) -> None:
         self._store = store
     async def read(self, user_id: str, source_hint: SourceHint) -> Catalog:
-        raise NotImplementedError

 Catalog-level search is added later if catalog grows past the limit.
 """
+from datetime import UTC, datetime
 from typing import Literal
 from .models import Catalog
 from .store import CatalogStore
 SourceHint = Literal["chat", "unstructured", "structured"]
 class CatalogReader:
+    """Loads the user's catalog and filters by source_hint.
+    On miss, returns an empty Catalog (never raises) — query path is
+    responsible for handling "no data registered yet" gracefully.
+    Returned Catalog is always a copy; the underlying stored catalog
+    is never mutated.
+    """
     def __init__(self, store: CatalogStore) -> None:
         self._store = store
     async def read(self, user_id: str, source_hint: SourceHint) -> Catalog:
+        catalog = await self._store.get(user_id)
+        if catalog is None:
+            return Catalog(user_id=user_id, generated_at=datetime.now(UTC))
+        if source_hint == "chat":
+            filtered: list = []
+        elif source_hint == "structured":
+            filtered = [s for s in catalog.sources if s.source_type in {"schema", "tabular"}]
+        else:  # "unstructured"
+            filtered = [s for s in catalog.sources if s.source_type == "unstructured"]
+        return catalog.model_copy(update={"sources": filtered})
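
The `source_hint` filtering can be sketched without Pydantic or a database. The dataclasses below are simplified stand-ins for the real `Catalog` and `Source` models, and `dataclasses.replace` plays the role that `model_copy(update=...)` plays in the reader.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Source:
    """Simplified stand-in for the catalog Source model."""
    source_id: str
    source_type: str  # "schema" | "tabular" | "unstructured"


@dataclass(frozen=True)
class Catalog:
    """Simplified stand-in for the catalog Catalog model."""
    user_id: str
    sources: tuple[Source, ...] = ()


def filter_catalog(catalog: Catalog, source_hint: str) -> Catalog:
    if source_hint == "chat":
        kept: tuple[Source, ...] = ()
    elif source_hint == "structured":
        kept = tuple(
            s for s in catalog.sources if s.source_type in {"schema", "tabular"}
        )
    else:  # "unstructured"
        kept = tuple(s for s in catalog.sources if s.source_type == "unstructured")
    # Return a new object; the input catalog is left untouched.
    return replace(catalog, sources=kept)


full = Catalog(
    user_id="u1",
    sources=(
        Source("db1", "schema"),
        Source("xlsx1", "tabular"),
        Source("pdf1", "unstructured"),
    ),
)
structured = filter_catalog(full, "structured")
print([s.source_id for s in structured.sources])
```

One detail to keep in mind with the real models: Pydantic's `model_copy` is shallow unless `deep=True` is passed, so the `Source` objects inside the returned catalog are the same objects as in the catalog it was copied from.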

src/catalog/store.py CHANGED

@@ -4,17 +4,63 @@ Storage shape: one row per user in a `catalogs` table with columns
 (user_id PK, data jsonb, schema_version, generated_at, updated_at).
 """
 from .models import Catalog
 class CatalogStore:
-    """Read/write catalogs keyed by user_id."""
     async def get(self, user_id: str) -> Catalog | None:
-        raise NotImplementedError
     async def upsert(self, catalog: Catalog) -> None:
-        raise NotImplementedError
     async def delete(self, user_id: str) -> None:
-        raise NotImplementedError

 (user_id PK, data jsonb, schema_version, generated_at, updated_at).
 """
+from sqlalchemy import delete, func, select
+from sqlalchemy.dialects.postgresql import insert
+from src.db.postgres.connection import AsyncSessionLocal
+from src.db.postgres.models import Catalog as CatalogRow
+from src.middlewares.logging import get_logger
 from .models import Catalog
+logger = get_logger("catalog_store")
 class CatalogStore:
+    """Read/write catalogs keyed by user_id.
+    Each method opens its own AsyncSession. If callers later need
+    transactional coordination across multiple stores, these methods can be
+    refactored to accept an explicit AsyncSession in a later PR.
+    """
     async def get(self, user_id: str) -> Catalog | None:
+        async with AsyncSessionLocal() as session:
+            result = await session.execute(
+                select(CatalogRow.data).where(CatalogRow.user_id == user_id)
+            )
+            row = result.scalar_one_or_none()
+        if row is None:
+            return None
+        return Catalog.model_validate(row)
     async def upsert(self, catalog: Catalog) -> None:
+        payload = catalog.model_dump(mode="json")
+        async with AsyncSessionLocal() as session:
+            stmt = insert(CatalogRow).values(
+                user_id=catalog.user_id,
+                data=payload,
+                schema_version=catalog.schema_version,
+                generated_at=catalog.generated_at,
+            )
+            stmt = stmt.on_conflict_do_update(
+                index_elements=[CatalogRow.user_id],
+                set_={
+                    "data": stmt.excluded.data,
+                    "schema_version": stmt.excluded.schema_version,
+                    "generated_at": stmt.excluded.generated_at,
+                    # Column.onupdate is not applied on the ON CONFLICT path,
+                    # so updated_at has to be set explicitly.
+                    "updated_at": func.now(),
+                },
+            )
+            await session.execute(stmt)
+            await session.commit()
+        logger.info(
+            "catalog upserted",
+            user_id=catalog.user_id,
+            sources=len(catalog.sources),
+        )
     async def delete(self, user_id: str) -> None:
+        async with AsyncSessionLocal() as session:
+            await session.execute(delete(CatalogRow).where(CatalogRow.user_id == user_id))
+            await session.commit()
+        logger.info("catalog deleted", user_id=user_id)
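
The one-row-per-user upsert can be demonstrated end to end with the standard library. This sketch swaps in SQLite for Postgres and a TEXT column for `jsonb`, so it only illustrates the `INSERT ... ON CONFLICT ... DO UPDATE` pattern, not the SQLAlchemy code in the diff. The table layout is trimmed to three columns for brevity.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE catalogs ("
    " user_id TEXT PRIMARY KEY,"
    " data TEXT NOT NULL,"
    " schema_version TEXT NOT NULL)"
)


def upsert(user_id: str, data: dict, schema_version: str = "1.0") -> None:
    """Insert the row, or overwrite data/schema_version if user_id exists."""
    conn.execute(
        "INSERT INTO catalogs (user_id, data, schema_version) VALUES (?, ?, ?) "
        "ON CONFLICT(user_id) DO UPDATE SET "
        "data = excluded.data, schema_version = excluded.schema_version",
        (user_id, json.dumps(data), schema_version),
    )
    conn.commit()


upsert("u1", {"sources": []})
upsert("u1", {"sources": ["s1"]})  # second call updates instead of inserting

rows = conn.execute("SELECT user_id, data FROM catalogs").fetchall()
print(rows)
```

The `excluded` pseudo-table refers to the row that would have been inserted, which is the same role `stmt.excluded` plays in the SQLAlchemy statement.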

src/catalog/validator.py CHANGED

@@ -21,4 +21,29 @@ class CatalogValidator:
     """
     def validate(self, catalog: Catalog) -> None:
-        raise NotImplementedError

     """
     def validate(self, catalog: Catalog) -> None:
+        seen_sources: set[str] = set()
+        for source in catalog.sources:
+            if source.source_id in seen_sources:
+                raise CatalogValidationError(
+                    f"duplicate source_id {source.source_id!r} in catalog "
+                    f"for user_id={catalog.user_id!r}"
+                )
+            seen_sources.add(source.source_id)
+            seen_tables: set[str] = set()
+            for table in source.tables:
+                if table.table_id in seen_tables:
+                    raise CatalogValidationError(
+                        f"duplicate table_id {table.table_id!r} in source "
+                        f"{source.source_id!r}"
+                    )
+                seen_tables.add(table.table_id)
+                seen_columns: set[str] = set()
+                for column in table.columns:
+                    if column.column_id in seen_columns:
+                        raise CatalogValidationError(
+                            f"duplicate column_id {column.column_id!r} in table "
+                            f"{table.table_id!r} (source {source.source_id!r})"
+                        )
+                    seen_columns.add(column.column_id)
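
The uniqueness checks are scoped: `source_id` per catalog, `table_id` per source, `column_id` per table. The sketch below shows that scoping on plain tuples instead of the Pydantic models. `find_problem` is a hypothetical helper that returns a message where the real validator raises `CatalogValidationError`.

```python
from __future__ import annotations

# (source_id, [(table_id, [column_id, ...]), ...]); illustrative data only.
SourceSpec = tuple[str, list[tuple[str, list[str]]]]


def find_problem(sources: list[SourceSpec]) -> str | None:
    """Return a description of the first duplicate ID, or None if all are unique."""
    seen_sources: set[str] = set()
    for source_id, tables in sources:
        if source_id in seen_sources:
            return f"duplicate source_id {source_id!r}"
        seen_sources.add(source_id)
        seen_tables: set[str] = set()  # reset per source
        for table_id, columns in tables:
            if table_id in seen_tables:
                return f"duplicate table_id {table_id!r} in source {source_id!r}"
            seen_tables.add(table_id)
            seen_columns: set[str] = set()  # reset per table
            for column_id in columns:
                if column_id in seen_columns:
                    return (
                        f"duplicate column_id {column_id!r} in table "
                        f"{table_id!r} (source {source_id!r})"
                    )
                seen_columns.add(column_id)
    return None


# The same table_id under two different sources is allowed.
ok = [("src_a", [("t_1", ["c_1", "c_2"])]), ("src_b", [("t_1", ["c_1"])])]
# The same column_id twice within one table is not.
bad = [("src_a", [("t_1", ["c_1", "c_1"])])]

print(find_problem(ok))
print(find_problem(bad))
```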

src/db/postgres/init_db.py CHANGED

@@ -3,6 +3,7 @@
 from sqlalchemy import text
 from src.db.postgres.connection import engine, Base
 from src.db.postgres.models import (
+    Catalog,
     ChatMessage,
     DatabaseClient,
     Document,

src/db/postgres/models.py CHANGED

@@ -96,4 +96,20 @@ class DatabaseClient(Base):
     status = Column(String, nullable=False, default="active")  # active | inactive
     created_at = Column(DateTime(timezone=True), server_default=func.now())
     updated_at = Column(DateTime(timezone=True), onupdate=func.now())
+class Catalog(Base):
+    """Per-user data catalog stored as a single jsonb row.
+    `data` holds the full Pydantic Catalog (src/catalog/models.py:Catalog)
+    serialized via `model_dump(mode="json")`. Read path uses
+    `Catalog.model_validate(...)` to rehydrate.
+    """
+    __tablename__ = "catalogs"
+    user_id = Column(String, primary_key=True)
+    data = Column(JSONB, nullable=False)
+    schema_version = Column(String, nullable=False, default="1.0")
+    generated_at = Column(DateTime(timezone=True), server_default=func.now())
+    updated_at = Column(DateTime(timezone=True), onupdate=func.now())

src/query/ir/models.py CHANGED

@@ -10,7 +10,6 @@ from typing import Any, Literal
 from pydantic import BaseModel, Field
 FilterOp = Literal[
     "=", "!=", "<", "<=", ">", ">=",
     "in", "not_in", "is_null", "is_not_null",

src/query/ir/operators.py CHANGED

@@ -12,6 +12,16 @@ ALLOWED_AGG_FNS = frozenset({
 LIMIT_HARD_CAP = 10_000
-# Type compatibility: which value_types are valid for each column data_type.
-# To be filled with the explicit matrix when validator.py is implemented.
-TYPE_COMPATIBILITY: dict[str, frozenset[str]] = {}

 LIMIT_HARD_CAP = 10_000
+# Type compatibility: which value_types may appear in a FilterClause when the
+# referenced column has the given data_type. Numeric types are mutually
+# compatible (decimal literal against int column is fine). Date/datetime accept
+# string so the planner can emit ISO-8601 literals without mode juggling.
+TYPE_COMPATIBILITY: dict[str, frozenset[str]] = {
+    "int":      frozenset({"int", "decimal"}),
+    "decimal":  frozenset({"int", "decimal"}),
+    "string":   frozenset({"string"}),
+    "datetime": frozenset({"datetime", "date", "string"}),
+    "date":     frozenset({"date", "datetime", "string"}),
+    "bool":     frozenset({"bool"}),
+    "json":     frozenset({"string"}),
+}
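
A short lookup against the matrix above shows how a filter's `value_type` is checked against a column's `data_type`. The matrix is copied from the diff; `is_compatible` is a hypothetical helper that mirrors the `.get(..., frozenset())` lookup used in the IR validator.

```python
TYPE_COMPATIBILITY: dict[str, frozenset[str]] = {
    "int":      frozenset({"int", "decimal"}),
    "decimal":  frozenset({"int", "decimal"}),
    "string":   frozenset({"string"}),
    "datetime": frozenset({"datetime", "date", "string"}),
    "date":     frozenset({"date", "datetime", "string"}),
    "bool":     frozenset({"bool"}),
    "json":     frozenset({"string"}),
}


def is_compatible(column_data_type: str, value_type: str) -> bool:
    """True if a filter value of value_type may be compared to the column."""
    # Unknown column types get an empty set, so every value_type is rejected.
    return value_type in TYPE_COMPATIBILITY.get(column_data_type, frozenset())


print(is_compatible("int", "decimal"))   # numeric types are interchangeable
print(is_compatible("date", "string"))   # ISO-8601 string literal against a date
print(is_compatible("string", "int"))    # rejected
print(is_compatible("json", "json"))     # rejected: json columns accept only string
```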

src/query/ir/validator.py CHANGED

@@ -1,11 +1,20 @@
 """IRValidator — checks a QueryIR against a user's catalog.
-See ARCHITECTURE.md §7 for the validation rules.
-On failure, the planner is re-prompted with the error context (max 3 retries).
 """
-from ...catalog.models import Catalog
 from .models import QueryIR
 class IRValidationError(Exception):
@@ -20,9 +29,101 @@ class IRValidator:
     - table_id belongs to that source
     - every column_id exists in that table
     - every agg.fn and filter.op is whitelisted (see operators.py)
-    - value_type consistent with column.data_type
-    - limit positive int, ≤ hard cap
     """
     def validate(self, ir: QueryIR, catalog: Catalog) -> None:
-        raise NotImplementedError

 """IRValidator — checks a QueryIR against a user's catalog.
+See ARCHITECTURE.md §7 for the validation rules. On failure, the planner
+is re-prompted with the error context (max 3 retries) — error messages
+must therefore be specific enough that the LLM can self-correct.
 """
+from ...catalog.models import Catalog, Column, Source, Table
 from .models import QueryIR
+from .operators import (
+    ALLOWED_AGG_FNS,
+    ALLOWED_FILTER_OPS,
+    LIMIT_HARD_CAP,
+    TYPE_COMPATIBILITY,
+)
+_NULLARY_FILTER_OPS = frozenset({"is_null", "is_not_null"})
 class IRValidationError(Exception):
     - table_id belongs to that source
     - every column_id exists in that table
     - every agg.fn and filter.op is whitelisted (see operators.py)
+    - value_type consistent with column.data_type (TYPE_COMPATIBILITY)
+    - limit positive int, ≤ LIMIT_HARD_CAP
     """
     def validate(self, ir: QueryIR, catalog: Catalog) -> None:
+        source = self._find_source(catalog, ir.source_id)
+        table = self._find_table(source, ir.table_id)
+        columns_by_id: dict[str, Column] = {c.column_id: c for c in table.columns}
+        select_aliases: set[str] = set()
+        for i, item in enumerate(ir.select):
+            where = f"select[{i}]"
+            if item.kind == "column":
+                self._require_column(columns_by_id, item.column_id, where)
+            else:  # "agg"
+                if item.fn not in ALLOWED_AGG_FNS:
+                    raise IRValidationError(
+                        f"{where}.fn: must be in {sorted(ALLOWED_AGG_FNS)}, "
+                        f"got {item.fn!r}"
+                    )
+                if item.column_id is not None:
+                    self._require_column(columns_by_id, item.column_id, where)
+                elif item.fn != "count":
+                    raise IRValidationError(
+                        f"{where}.fn={item.fn!r} requires a column_id "
+                        "(only 'count' may omit it for COUNT(*))"
+                    )
+            if item.alias:
+                select_aliases.add(item.alias)
+        for i, f in enumerate(ir.filters):
+            where = f"filters[{i}]"
+            col = self._require_column(columns_by_id, f.column_id, where)
+            if f.op not in ALLOWED_FILTER_OPS:
+                raise IRValidationError(
+                    f"{where}.op: must be in {sorted(ALLOWED_FILTER_OPS)}, "
+                    f"got {f.op!r}"
+                )
+            if f.op not in _NULLARY_FILTER_OPS:
+                allowed = TYPE_COMPATIBILITY.get(col.data_type, frozenset())
+                if f.value_type not in allowed:
+                    raise IRValidationError(
+                        f"{where}: value_type {f.value_type!r} incompatible with "
+                        f"column.data_type {col.data_type!r} "
+                        f"(allowed: {sorted(allowed)})"
+                    )
+        for i, col_id in enumerate(ir.group_by):
+            self._require_column(columns_by_id, col_id, f"group_by[{i}]")
+        for i, ob in enumerate(ir.order_by):
+            if ob.column_id not in columns_by_id and ob.column_id not in select_aliases:
+                raise IRValidationError(
+                    f"order_by[{i}].column_id: {ob.column_id!r} not found in table "
+                    f"{ir.table_id!r} columns or select aliases "
+                    f"(known columns: {sorted(columns_by_id.keys())}, "
+                    f"aliases: {sorted(select_aliases)})"
+                )
+        if ir.limit is not None:
+            if ir.limit <= 0:
+                raise IRValidationError(f"limit must be positive, got {ir.limit}")
+            if ir.limit > LIMIT_HARD_CAP:
+                raise IRValidationError(
+                    f"limit {ir.limit} exceeds hard cap {LIMIT_HARD_CAP}"
+                )
+    @staticmethod
+    def _find_source(catalog: Catalog, source_id: str) -> Source:
+        for s in catalog.sources:
+            if s.source_id == source_id:
+                return s
+        raise IRValidationError(
+            f"source_id {source_id!r} not in catalog "
+            f"(known: {[s.source_id for s in catalog.sources]})"
+        )
+    @staticmethod
+    def _find_table(source: Source, table_id: str) -> Table:
+        for t in source.tables:
+            if t.table_id == table_id:
+                return t
+        raise IRValidationError(
+            f"table_id {table_id!r} not in source {source.source_id!r} "
+            f"(known: {[t.table_id for t in source.tables]})"
+        )
+    @staticmethod
+    def _require_column(
+        columns_by_id: dict[str, Column], col_id: str, where: str
+    ) -> Column:
+        col = columns_by_id.get(col_id)
+        if col is None:
+            raise IRValidationError(
+                f"{where}.column_id: {col_id!r} not in table "
+                f"(known: {sorted(columns_by_id.keys())})"
+            )
+        return col
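
The module docstring says the planner is re-prompted with the error context on failure, up to 3 retries. The planner itself is not in this diff, so the loop below is only a sketch of how that might look. `plan_with_retries`, `fake_plan`, and `fake_validate` are hypothetical, and "3 retries" is read here as one initial attempt plus three more, which is an assumption.

```python
from __future__ import annotations

from collections.abc import Callable

MAX_RETRIES = 3       # from the docstring: "max 3 retries"
LIMIT_HARD_CAP = 10_000


class IRValidationError(Exception):
    """Local stand-in for the validator's exception type."""


def plan_with_retries(
    plan: Callable[[str | None], dict],
    validate: Callable[[dict], None],
    max_retries: int = MAX_RETRIES,
) -> dict:
    """Call plan(), validate the result, and feed the error text back on failure."""
    error_context: str | None = None
    for _ in range(1 + max_retries):  # assumption: initial attempt + retries
        ir = plan(error_context)
        try:
            validate(ir)
        except IRValidationError as exc:
            error_context = str(exc)
            continue
        return ir
    raise IRValidationError(
        f"planner failed after {max_retries} retries: {error_context}"
    )


def fake_validate(ir: dict) -> None:
    if ir["limit"] > LIMIT_HARD_CAP:
        raise IRValidationError(
            f"limit {ir['limit']} exceeds hard cap {LIMIT_HARD_CAP}"
        )


calls: list[str | None] = []


def fake_plan(error_context: str | None) -> dict:
    # Pretend the LLM fixes its limit once it sees the error message.
    calls.append(error_context)
    return {"limit": 50_000} if error_context is None else {"limit": 100}


result = plan_with_retries(fake_plan, fake_validate)
print(result, calls)
```

This is why the validator's messages list the known IDs and allowed values: the error string is the only feedback the planner receives between attempts.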