A few weeks ago I shipped a phone app that lets residents measure the speed of cars on their street. The thing I learned building it wasn’t about traffic engineering. It was about how invisible local government is to the people it serves.

Every city has agendas, staff reports, comprehensive plans. Minutes from meetings that decided whether the stop sign goes in or doesn’t. Ordinances that shape what gets built across the street from your house. It’s all public, it’s all online, it’s all pretty much unreadable.

Not because it’s encrypted, but because it’s long. A typical city council packet for one Tuesday meeting is 400 pages of PDFs. The staff report on a single zoning variance is 30 pages. The comprehensive plan for a small California city runs over 600.

Does anyone read these? The people whose lives are most shaped by them read them least. The people who do read them are paid to!

I spent the last few Friday nights training a model to read them instead.

What I built

The project is called civic-slm. It’s a domain-specialized fine-tune of Qwen2.5-7B-Instruct, Alibaba’s open-weights small model, trained specifically on U.S. local-government documents: agendas, staff reports, comprehensive plans, minutes, ordinances, and municipal codes, with livestreams coming soon.

Trained on a single Apple Silicon Mac. The whole pipeline runs on a laptop using MLX, Apple’s machine learning framework.

It serves on whatever runtime you want: MLX, Ollama, LM Studio, llama.cpp. The release ships in two formats: MLX-q4 for Apple Silicon, GGUF Q5_K_M for everything else. Quantized to run on consumer hardware, and the 7B model is small enough to fit in 6GB of VRAM.
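If you want to poke at it locally, the whole loop is a few lines of mlx-lm. A minimal sketch, following the library’s published usage; the repo id is a placeholder for whatever name the release lands under:

```python
# Minimal local inference with mlx-lm (pip install mlx-lm).
# The repo id below is a placeholder, not the actual release name.
from mlx_lm import load, generate

model, tokenizer = load("civic-slm/civic-slm-7b-mlx-q4")

messages = [{"role": "user", "content": "What does a zoning variance staff report usually contain?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```

On non-Apple hardware, the GGUF build does the same job through Ollama, LM Studio, or llama.cpp.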

The pipeline is the boring part: crawl city websites with one tiny recipe per jurisdiction, chunk the documents, synthesize Q&A pairs, run continued pre-training, supervised fine-tuning, and direct preference optimization. Merge the adapter, quantize, evaluate.
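To make the tail of that concrete, here’s roughly what the fine-tune-to-release stages look like driven through the mlx-lm CLI. Paths and flags are illustrative (they vary by mlx-lm version), and this is a sketch, not the repo’s actual training script:

```python
# Roughly the SFT -> merge -> quantize tail, via the mlx-lm CLI.
# Flags vary by mlx-lm version and the paths are illustrative;
# this sketch is not the repo's actual training script.
import subprocess

BASE = "Qwen/Qwen2.5-7B-Instruct"

# Supervised fine-tuning on the synthesized Q&A pairs
# (expects train.jsonl / valid.jsonl under data/sft)
subprocess.run(["mlx_lm.lora", "--model", BASE, "--train",
                "--data", "data/sft", "--adapter-path", "adapters/sft"],
               check=True)

# Merge the LoRA adapter into the base weights
subprocess.run(["mlx_lm.fuse", "--model", BASE,
                "--adapter-path", "adapters/sft",
                "--save-path", "models/civic-slm-7b"],
               check=True)

# Quantize the merged model to 4-bit for consumer hardware
subprocess.run(["mlx_lm.convert", "--hf-path", "models/civic-slm-7b",
                "--mlx-path", "models/civic-slm-7b-q4", "-q"],
               check=True)
```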

The eval harness runs four benchmarks and the rule is no training without a baseline. The four numbers from base Qwen2.5-7B are the bars every fine-tuned version has to clear. Citation factuality measures whether the model can answer a question grounded in a held-out document and cite the source. Refusal measures whether it shuts up when the answer isn’t in the context — the most important behavior for a tool used in civic work. Structured extraction measures whether it can turn a staff report into clean JSON. Side-by-side pits it against the base 7B and a 72B comparator on open-ended municipal prompts.
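The gating logic itself is deliberately dumb. A sketch, with placeholder scores:

```python
# The "no training without a baseline" rule as code.
# Scores are placeholders; the real bars come from running the
# harness against base Qwen2.5-7B-Instruct before any training.
BASELINE = {
    "citation_factuality": 0.62,   # grounded answer with a correct source cite
    "refusal": 0.55,               # declines when the answer isn't in context
    "structured_extraction": 0.48, # staff report -> valid, correct JSON
    "side_by_side": 0.50,          # win rate vs. base 7B and 72B comparator
}

def clears_baseline(candidate: dict[str, float]) -> bool:
    """A fine-tune ships only if it meets or beats the base model everywhere."""
    return all(candidate[name] >= BASELINE[name] for name in BASELINE)
```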

Why a small model

Frontier models can read these documents. So why train a 7B model that, on most general-purpose benchmarks, will lose to every one of them?

Three reasons.

Privacy. The whole point of civic infrastructure is that it belongs to the public. A tool that helps residents engage with their local government should not require them to send the contents of their city’s documents and their questions about those documents to a third-party API. A small model running on a laptop or a Mac Mini in a city library has zero exfiltration surface. The data never leaves the building.

Cost. Frontier model API calls cost money per query. A small model running locally costs the electricity to run a laptop. If the goal is to make this available to every city in the country, the unit economics have to bottom out at zero per query. Local inference is the only architecture that gets there.

Specialization beats scale at narrow tasks. This is the part I think most people miss. A 7B model fine-tuned on the exact documents you’re going to query against will, on those documents, beat a 72B general-purpose model that has never seen them. Not on every benchmark, but on this benchmark. The job isn’t to know everything; it’s to know municipal-document-shaped things very well, and to refuse confidently when asked anything else.

This is the opposite of the bigger-is-better narrative that dominates the public AI conversation. Bigger is better when you don’t know what you’re going to ask. When you do know, when the domain is bounded, the documents are public, and hallucination is an unacceptable failure mode, I believe smaller and specialized wins.

I’ve been writing about this from the inside of an enterprise context for a while. Tokenmaxxing is the wrong way to measure AI value. Minutes Added to Workforce is closer to the right one. The civic case is the cleanest possible illustration of both. The right question isn’t “how big is the model” or “how many tokens did it generate.” The right question is “how many minutes of attention did this give back to a resident who needed to engage with their city, and what did they do with those minutes?”


The architecture decision underneath the architecture decision

Every recipe in civic-slm is a tiny YAML file. One per jurisdiction. The recipe describes how to crawl one city’s site: which links to follow, which document types to keep, where the agendas live, where the staff reports live.
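Here’s the shape of one, parsed the way the crawler might parse it. The field names and URLs are made up for this sketch; the real schema lives in docs/RECIPES.md:

```python
# An illustrative recipe, parsed with PyYAML (pip install pyyaml).
# Field names and URLs are made up for this sketch; the real schema
# is documented in docs/RECIPES.md.
import yaml

RECIPE = """
jurisdiction: San Clemente, CA
base_url: https://www.san-clemente.org
follow:
  - /government/agendas-and-minutes
  - /government/community-development
keep: [agenda, minutes, staff_report, ordinance]
formats: [pdf, html]
"""

recipe = yaml.safe_load(RECIPE)
print(f"{recipe['jurisdiction']}: crawl {len(recipe['follow'])} sections")
```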

The crawling is done with browser-use, an open-source library that lets a model drive a real browser. This matters because most municipal websites are not designed for scraping. They’re designed for someone clicking through. browser-use lets the crawler behave like a person navigating menus, opening calendars, downloading PDFs the way a resident would. Each recipe is a few dozen lines.
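A crawl step looks roughly like this, following browser-use’s published examples rather than the repo’s actual crawler. The task text and the model driving the browser are placeholders; the real crawler builds its task from the recipe:

```python
# A rough browser-use crawl step (pip install browser-use), following the
# library's published examples. The task text and the model driving the
# browser are placeholders; the real crawler builds its task from the recipe.
import asyncio
from browser_use import Agent
from langchain_openai import ChatOpenAI

async def crawl_agendas() -> None:
    agent = Agent(
        task=("Open the city's Agendas & Minutes page and download every "
              "city council agenda packet PDF from this year."),
        llm=ChatOpenAI(model="gpt-4o"),
    )
    await agent.run()

asyncio.run(crawl_agendas())
```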

Adding a new city is a pull request to a directory of YAML files; that’s the contribution model. It’s the same model as the agency directory in Slow Them Down: community-contributed civic infrastructure, not a SaaS product.

The synthesis pipeline that generates training Q&A from the crawled documents runs on either the Anthropic API or a local LLM, switchable via env var. If you have an API key, use it. If you don’t, point it at a local llama-server running Qwen2.5-72B and the whole pipeline is fully offline. Same architecture decision as the inference layer: privacy and cost are tunable, not assumed.
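The switch is a few lines. A sketch, assuming the Anthropic SDK’s standard ANTHROPIC_API_KEY convention on one side and llama-server’s OpenAI-compatible endpoint on the other; the repo’s actual env var and model names may differ:

```python
# A sketch of the backend switch. The Anthropic SDK reads ANTHROPIC_API_KEY
# by convention; the fallback assumes llama-server's OpenAI-compatible
# endpoint on localhost. Model names and the exact switch are illustrative.
import os

def synthesize(prompt: str) -> str:
    if os.environ.get("ANTHROPIC_API_KEY"):
        import anthropic
        client = anthropic.Anthropic()
        msg = client.messages.create(
            model="claude-3-5-sonnet-latest",  # illustrative model choice
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    # Fully offline: a local llama-server running Qwen2.5-72B
    from openai import OpenAI
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
    resp = client.chat.completions.create(
        model="qwen2.5-72b-instruct",  # whatever model the server loaded
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```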

Where it goes from here

Right now the project has the scaffold, the eval harness, the baselines on the base model, the synthesis pipeline, the training scripts, and the merge/quantize/release pipeline. The next milestone is the synth corpus and the first training pass. After that, a real fine-tune for one city, evaluated against the four benchmarks, released to Hugging Face.

After that, more cities. The model trained on San Clemente isn’t the model. The pipeline that produces a model for any U.S. city is the model.

The bigger question is whether this kind of thing actually changes anything about how residents engage with local government. I don’t know yet. I suspect the answer is “a little, in the cities where someone runs it, until it doesn’t.” The same way the speed-measurement app is one tool among many: not a replacement for showing up to a council meeting, but a way to make showing up more powerful.

What I’m sure of is this: the cost of trying these experiments has collapsed. Two years ago, training a domain-specialized 7B model on a laptop would have been a research project. Whatever the answer is to “does this change anything,” we’re going to find out faster than we used to, because anyone with a Mac and a weekend can run the experiment.

The repo is on GitHub. If you live in a U.S. city or county or township and want yours added, the recipe template is in docs/RECIPES.md. A recipe for a new jurisdiction is only a few dozen lines of YAML.

If you build something with it, I want to hear about it.


civic-slm is open source. The pipeline runs on any Apple Silicon Mac. The released models will work with MLX, Ollama, LM Studio, or llama.cpp… pick whichever runtime you like. This post connects to ideas in I Vibe Coded a Civic Tool, Tokenmaxxing, and Minutes Added to Workforce.