Civic-SLM
Civic-SLM is a domain-specialized fine-tune of Qwen2.5-7B-Instruct for U.S. local-government documents — city, county, and township agendas, staff reports, comprehensive plans, minutes, ordinances, and municipal codes. Designed to power civic transparency tools across all 50 states.
Trained on a single Apple Silicon Mac via MLX-LM. Served on whatever runtime you like — MLX, Ollama, LM Studio, llama.cpp, or any OpenAI-compatible endpoint. Released as both MLX-q4 and GGUF Q5_K_M. Documents are crawled with browser-use — one small recipe per jurisdiction.
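Because serving is runtime-agnostic, any OpenAI-compatible client works. A minimal sketch, assuming an Ollama server on its default port and a hypothetical `civic-slm` model tag:

```python
# Query a served Civic-SLM build through any OpenAI-compatible endpoint.
# The base_url and model tag are assumptions: point them at whatever runtime
# (MLX server, Ollama, LM Studio, llama.cpp) is hosting the weights.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

resp = client.chat.completions.create(
    model="civic-slm",  # hypothetical tag; use your runtime's model name
    messages=[{
        "role": "user",
        "content": "What does the staff report recommend for the pier parking lot?",
    }],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```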
This project is open source under the MIT license and the source code is available here.
Why
Local government is where most public decisions actually get made, and the documents that drive those decisions — agendas, staff reports, minutes, ordinances — are mostly PDFs buried on legacy CMSes. General-purpose LLMs can read them, but they hallucinate specifics, miss citations, and don’t know the genre. Civic-SLM is a small, open, auditable model trained specifically on this corpus so it can ground answers in the source text, extract structured data from staff reports, and refuse when the context doesn’t support an answer.
Pipeline
- Crawl — one browser-use recipe per jurisdiction (San Clemente, CA ships as the demo; recipes are tiny and composable for any U.S. city, county, or township).
- Chunk — Pydantic-validated `DocumentChunk` schemas with provenance (see the schema sketch after this list).
- Synthesize — generate training pairs via the Anthropic SDK or a fully-local LLM backend (env-switchable; see the backend sketch after this list).
- Train — continued pre-training (CPT), supervised fine-tuning (SFT), and direct preference optimization (DPO) on MLX.
- Merge & quantize — final adapter merged and quantized to MLX-q4 and GGUF Q5_K_M.
- Eval — every stage reported to W&B and compared against the committed baselines.
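For a sense of what the chunk layer carries, here is a minimal Pydantic sketch; the field names are illustrative assumptions, not the repo's actual `DocumentChunk` definition:

```python
# Illustrative chunk schema: a validated text span plus the provenance
# needed to cite it. Field names are assumptions made for this sketch.
from datetime import date
from typing import Optional
from pydantic import BaseModel, Field

class DocumentChunk(BaseModel):
    chunk_id: str
    jurisdiction: str                # e.g. "san-clemente-ca"
    doc_type: str                    # agenda, staff_report, minutes, ordinance
    source_url: str                  # provenance: where the PDF was crawled from
    page: Optional[int] = None       # provenance: page within the source document
    published: Optional[date] = None
    text: str = Field(min_length=1)  # the chunk body itself
```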
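The Synthesize step's env-switchable backend can be sketched the same way: one `generate()` entry point that routes to the Anthropic SDK or a local OpenAI-compatible server based on an environment variable. The variable names and model ids below are assumptions, not the repo's config:

```python
# Route synthesis calls to Anthropic or a local LLM based on an env var.
# CIVIC_SLM_BACKEND, LOCAL_LLM_URL, and the model ids are assumed names.
import os

def generate(prompt: str) -> str:
    if os.environ.get("CIVIC_SLM_BACKEND", "anthropic") == "anthropic":
        import anthropic
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
        msg = client.messages.create(
            model="claude-3-5-sonnet-latest",  # assumed model id
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    from openai import OpenAI
    client = OpenAI(base_url=os.environ["LOCAL_LLM_URL"], api_key="unused")
    resp = client.chat.completions.create(
        model=os.environ.get("LOCAL_LLM_MODEL", "qwen2.5-7b-instruct"),
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```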
Eval-first
The training contract is simple: no training without a baseline. Four benchmarks run against base Qwen2.5-7B before any fine-tuning starts; those numbers are what every subsequent stage has to beat.
| Bench | What it measures | Scoring |
|---|---|---|
| civic_factuality | Q&A grounded in held-out docs | citation exact-match + word-overlap |
| refusal | refuses when context lacks the answer | refusal rate (regex + fallback judge) |
| structured_extraction | staff report → JSON | field-level F1 |
| side_by_side | open-ended municipal prompts vs. base 7B and 72B | Claude or local-LLM judge with A/B position swap |
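To make the first metric concrete, here is a toy version of the factuality score: exact match on the citation plus token overlap against the reference answer. This illustrates the idea, not the repo's scorer; the equal weighting is an assumption.

```python
# Toy factuality score: did the model cite the right source, and how much
# of the reference answer's vocabulary does its answer cover?
def word_overlap(answer: str, reference: str) -> float:
    ref = set(reference.lower().split())
    ans = set(answer.lower().split())
    return len(ans & ref) / len(ref) if ref else 0.0

def factuality_score(answer: str, citation: str,
                     gold_answer: str, gold_citation: str) -> float:
    cite_ok = 1.0 if citation.strip() == gold_citation.strip() else 0.0
    # Equal weighting of the two terms is an assumption for this sketch.
    return 0.5 * cite_ok + 0.5 * word_overlap(answer, gold_answer)
```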
Baseline numbers (Qwen2.5-7B-Instruct 4-bit, MLX)
| Bench | n | Mean score | Median score | Latency |
|---|---|---|---|---|
| factuality | 10 | 0.501 | 0.566 | 637 ms |
| refusal | 10 | 0.800 | 1.000 | 460 ms |
| extraction | 5 | 0.277 | 0.000 | 925 ms |
| side_by_side | — | — (pending 72B comparator) | — | — |
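The pending side_by_side bench swaps answer positions to control for judge position bias: each pair is judged twice with A and B exchanged, and only verdicts that survive the swap count. A sketch of that control, with `judge()` standing in for the Claude or local-LLM call:

```python
# Position-swapped A/B judging: a win only counts if the judge picks the
# same underlying answer in both orderings. judge() is an assumed helper
# that returns "A" or "B".
def side_by_side(prompt: str, ours: str, theirs: str, judge) -> str:
    first = judge(prompt, a=ours, b=theirs)
    second = judge(prompt, a=theirs, b=ours)  # positions swapped
    if first == "A" and second == "B":
        return "ours"
    if first == "B" and second == "A":
        return "theirs"
    return "tie"  # verdict flipped with position: no reliable signal
```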
Quickstart
uv sync --all-extras
uv run pytest # 42 tests across schema, ingest, scorers, synth, train, llm-backend
uv run civic-slm --help
The civic-slm umbrella CLI exposes every stage: doctor, crawl, eval run, eval side-by-side, and train cpt|sft|dpo. See the repo’s docs/USAGE.md for an end-to-end walkthrough and docs/RECIPES.md to add a new jurisdiction.
Status
In place:
- Scaffold, schemas, and ingestion (browser-use + San Clemente demo recipe + a template for any U.S. jurisdiction)
- 4-bench eval harness with committed baselines for factuality, refusal, and extraction
- Synth pipeline (Anthropic or fully-local backend)
- MLX training scripts (CPT/SFT/DPO), plus merge + quantize to MLX-q4 and GGUF Q5_K_M
- Runtime-agnostic serving

Next up: synth corpus and the first training pass.