Civic-SLM
Civic-SLM is a domain-specialized fine-tune of Qwen2.5-7B-Instruct for U.S. local-government documents — city, county, and township agendas, staff reports, comprehensive plans, minutes, ordinances, and municipal codes. Designed to power civic transparency tools across all 50 states.
Trained on a single Apple Silicon Mac via MLX-LM. Served on whatever runtime you like — MLX, Ollama, LM Studio, llama.cpp, or any OpenAI-compatible endpoint. Released as both MLX-q4 and GGUF Q5_K_M. Documents are crawled with browser-use — one small recipe per jurisdiction.
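Because serving is runtime-agnostic, any OpenAI-compatible client works. A minimal sketch, assuming an Ollama server on its default port and a hypothetical `civic-slm` model tag:

```python
# Query a served Civic-SLM build through any OpenAI-compatible endpoint.
# The base_url and model tag are assumptions: point them at whatever runtime
# (MLX server, Ollama, LM Studio, llama.cpp) is hosting the weights.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

resp = client.chat.completions.create(
    model="civic-slm",  # hypothetical tag; use your runtime's model name
    messages=[{
        "role": "user",
        "content": "What does the staff report recommend for the pier parking lot?",
    }],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```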
This project is open source under the MIT license and the source code is available here.
Why
Local government is where most public decisions actually get made, and the documents that drive those decisions — agendas, staff reports, minutes, ordinances — are mostly PDFs buried on legacy CMSes. General-purpose LLMs can read them, but they hallucinate specifics, miss citations, and don’t know the genre. Civic-SLM is a small, open, auditable model trained specifically on this corpus so it can ground answers in the source text, extract structured data from staff reports, and refuse when the context doesn’t support an answer.
Pipeline
- Crawl — one browser-use recipe per jurisdiction (San Clemente, CA ships as the demo; recipes are tiny and composable for any U.S. city, county, or township).
- Chunk — Pydantic-validated `DocumentChunk` schemas with provenance (see the schema sketch after this list).
- Synthesize — generate training pairs via the Anthropic SDK or a fully-local LLM backend (env-switchable; see the backend sketch after this list).
- Train — continued pre-training (CPT), supervised fine-tuning (SFT), and direct preference optimization (DPO) on MLX.
- Merge & quantize — final adapter merged and quantized to MLX-q4 and GGUF Q5_K_M.
- Eval — every stage reported to W&B and compared against the committed baselines.
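For a sense of what the chunk layer carries, here is a minimal Pydantic sketch; the field names are illustrative assumptions, not the repo's actual `DocumentChunk` definition:

```python
# Illustrative chunk schema: a validated text span plus the provenance
# needed to cite it. Field names are assumptions made for this sketch.
from datetime import date
from typing import Optional
from pydantic import BaseModel, Field

class DocumentChunk(BaseModel):
    chunk_id: str
    jurisdiction: str                # e.g. "san-clemente-ca"
    doc_type: str                    # agenda, staff_report, minutes, ordinance
    source_url: str                  # provenance: where the PDF was crawled from
    page: Optional[int] = None       # provenance: page within the source document
    published: Optional[date] = None
    text: str = Field(min_length=1)  # the chunk body itself
```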
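The Synthesize step's env-switchable backend can be sketched the same way: one `generate()` entry point that routes to the Anthropic SDK or a local OpenAI-compatible server based on an environment variable. The variable names and model ids below are assumptions, not the repo's config:

```python
# Route synthesis calls to Anthropic or a local LLM based on an env var.
# CIVIC_SLM_BACKEND, LOCAL_LLM_URL, and the model ids are assumed names.
import os

def generate(prompt: str) -> str:
    if os.environ.get("CIVIC_SLM_BACKEND", "anthropic") == "anthropic":
        import anthropic
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
        msg = client.messages.create(
            model="claude-3-5-sonnet-latest",  # assumed model id
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    from openai import OpenAI
    client = OpenAI(base_url=os.environ["LOCAL_LLM_URL"], api_key="unused")
    resp = client.chat.completions.create(
        model=os.environ.get("LOCAL_LLM_MODEL", "qwen2.5-7b-instruct"),
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```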
Eval-first
The training contract is simple: no training without a baseline. Four benchmarks run against base Qwen2.5-7B before any fine-tuning starts; those numbers are what every subsequent stage has to beat.
| Bench | What it measures | Scoring |
|---|---|---|
| civic_factuality | Q&A grounded in held-out docs | citation exact-match + word-overlap |
| refusal | refuses when context lacks the answer | refusal rate (regex + fallback judge) |
| structured_extraction | staff report → JSON | field-level F1 |
| side_by_side | open-ended municipal prompts vs. base 7B and 72B | Claude or local-LLM judge with A/B position swap |
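To make the first metric concrete, here is a toy version of the factuality score: exact match on the citation plus token overlap against the reference answer. This illustrates the idea, not the repo's scorer; the equal weighting is an assumption.

```python
# Toy factuality score: did the model cite the right source, and how much
# of the reference answer's vocabulary does its answer cover?
def word_overlap(answer: str, reference: str) -> float:
    ref = set(reference.lower().split())
    ans = set(answer.lower().split())
    return len(ans & ref) / len(ref) if ref else 0.0

def factuality_score(answer: str, citation: str,
                     gold_answer: str, gold_citation: str) -> float:
    cite_ok = 1.0 if citation.strip() == gold_citation.strip() else 0.0
    # Equal weighting of the two terms is an assumption for this sketch.
    return 0.5 * cite_ok + 0.5 * word_overlap(answer, gold_answer)
```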
Baseline numbers (Qwen2.5-7B-Instruct 4-bit, MLX)
| Bench | n | Mean score | Median score | Latency |
|---|---|---|---|---|
| factuality | 10 | 0.501 | 0.566 | 637 ms |
| refusal | 10 | 0.800 | 1.000 | 460 ms |
| extraction | 5 | 0.277 | 0.000 | 925 ms |
| side_by_side | — | — (pending 72B comparator) | — | — |
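The pending side_by_side bench swaps answer positions to control for judge position bias: each pair is judged twice with A and B exchanged, and only verdicts that survive the swap count. A sketch of that control, with `judge()` standing in for the Claude or local-LLM call:

```python
# Position-swapped A/B judging: a win only counts if the judge picks the
# same underlying answer in both orderings. judge() is an assumed helper
# that returns "A" or "B".
def side_by_side(prompt: str, ours: str, theirs: str, judge) -> str:
    first = judge(prompt, a=ours, b=theirs)
    second = judge(prompt, a=theirs, b=ours)  # positions swapped
    if first == "A" and second == "B":
        return "ours"
    if first == "B" and second == "A":
        return "theirs"
    return "tie"  # verdict flipped with position: no reliable signal
```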
Quickstart
uv sync --all-extras
uv run pytest # 42 tests across schema, ingest, scorers, synth, train, llm-backend
uv run civic-slm --help
The civic-slm umbrella CLI exposes every stage: doctor, crawl, eval run, eval side-by-side, and train cpt|sft|dpo. See the repo’s docs/USAGE.md for an end-to-end walkthrough and docs/RECIPES.md to add a new jurisdiction.
Status
In place:
- Scaffold, schemas, and ingestion (browser-use + San Clemente demo recipe + a template for any U.S. jurisdiction)
- 4-bench eval harness with committed baselines for factuality, refusal, and extraction
- Synth pipeline (Anthropic or fully-local backend)
- MLX training scripts (CPT/SFT/DPO), plus merge + quantize to MLX-q4 and GGUF Q5_K_M
- Runtime-agnostic serving

Next up: synth corpus and the first training pass.