Why we built this
Indian addresses are notoriously unstructured. A single line can look like this:
FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029
House number, building name, street, locality, district, state, pincode — all jammed into one free-text string with no consistent formatting. Anyone building on Indian company registry data, bank KYC records, or delivery logistics runs into this constantly. We wanted a parser that turns strings like the above into clean, structured JSON:
{
"houseNumber": "FLAT NO.32",
"houseName": "UTTARA TOWERS",
"street": "MG ROAD",
"city": "GUWAHATI",
"district": "Kamrup",
"state": "AS",
"pincode": "781029",
"poi": null, "subsubLocality": null, "subLocality": null,
"locality": null, "village": null, "subDistrict": null
}
13 fields, always present, null when absent. We built the whole pipeline in-house and open-sourced every piece of it — model, code, and data. Here's how it went, including the parts that went wrong first.
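The "always present, null when absent" contract can also be enforced on the consumer side. The sketch below is a minimal illustration, not the library's actual code: the `normalize` helper is hypothetical, and treating empty strings as absent is an assumption.

```python
import json

# The 13 schema fields, in the order shown in the example above.
FIELDS = (
    "houseNumber", "houseName", "street", "city", "district", "state",
    "pincode", "poi", "subsubLocality", "subLocality", "locality",
    "village", "subDistrict",
)


def normalize(raw_json: str) -> dict:
    """Return a dict with exactly the 13 schema keys (hypothetical helper).

    Missing keys and empty strings become None; unknown keys are dropped.
    """
    parsed = json.loads(raw_json)
    record = {}
    for field in FIELDS:
        value = parsed.get(field)
        # Assumption: an empty or whitespace-only string counts as absent.
        if isinstance(value, str) and not value.strip():
            value = None
        record[field] = value
    return record


record = normalize('{"city": "GUWAHATI", "pincode": "781029", "street": ""}')
print(json.dumps(record, indent=2))
```

Downstream code can then index any of the 13 keys without guarding against a missing field.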
Step 1: Labeling 4.37M addresses without a labeling budget
We started with 4.37M raw addresses from two structurally different sources — Indian MCA (Ministry of Corporate Affairs) company registrations, and bank/business-correspondent branch records. Zero labels on any of it.
Manual labeling doesn't scale to that volume, so we built a layered pipeline instead:
- Rule-based tagging — regex patterns plus gazetteer cross-checks against India Post's official pincode registry (pincode → district/state lookup) give every record a confidence score. High-confidence records auto-accept as "silver" training labels.
- LLM-assisted labeling for the rest — batched calls to an LLM (via OpenRouter), with a system prompt that requires every extracted field value to be copied verbatim from the source text. If a value the model returns isn't found as a literal substring of the input, we drop it rather than trust it. This single constraint eliminated an entire class of hallucination risk.
- A small human-reviewed slice, used to check the LLM's accuracy before we trusted it at scale.
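The verbatim-copy rule from the LLM step is simple to express in code. This is a minimal sketch rather than the pipeline's actual implementation: the function name is hypothetical, and case-sensitive matching is an assumption.

```python
def keep_verbatim_fields(source: str, extracted: dict) -> dict:
    """Null out any field value that is not a literal substring of the source.

    Hypothetical helper; matching is case-sensitive by assumption.
    """
    kept = {}
    for field, value in extracted.items():
        if isinstance(value, str) and value and value in source:
            kept[field] = value
        else:
            # The value was paraphrased, expanded, or invented: drop it.
            kept[field] = None
    return kept


address = "FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029"
llm_output = {
    "houseNumber": "FLAT NO.32",
    "street": "Mahatma Gandhi Road",  # expanded by the model, not in the source
    "city": "GUWAHATI",
    "state": "Assam",                 # normalized by the model, not in the source
}
filtered = keep_verbatim_fields(address, llm_output)
print(filtered)
```

The check is cheap enough to run on every batch, and it fails closed: a value the model "improved" is discarded instead of being stored as a label.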
One domain quirk that mattered in practice: MCA addresses carry a machine-generated tail like "...Kamrup Unclassified AS 781029", where "Unclassified" is a fixed placeholder meaning that the Registrar of Companies recorded no sub-district classification. It is not a real place name. Early labeling runs had the LLM tagging "Unclassified" as a subDistrict value. We fixed it by explicitly teaching the model this convention in the prompt. Small fix, but exactly the kind of domain-specific knowledge no off-the-shelf address parser would know to apply.
Worth flagging clearly: designing the field taxonomy turned out to be harder than training the model. Our first pass used Google Maps' full geocoding component taxonomy — 35 field types. Too granular for any human reviewer to label consistently, and it showed in review quality. We collapsed it to the 13-field schema above, chosen specifically around what a human could actually apply without agonizing over edge cases.
Step 2: Fine-tuning
We fine-tuned Qwen/Qwen3-0.6B with LoRA, trained via Apple's MLX framework on an M4 Mac. mlx-lm's lora command was genuinely pleasant to work with on Apple Silicon — no CUDA/bitsandbytes environment wrangling.
rank=16, alpha=32, dropout=0.05
target_modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
16 of 28 layers fine-tuned, 2000 iterations, ~1.8 hours
Results on a 237-example held-out gold test set:
| Metric | Value |
|---|---|
| JSON parse rate | 100% |
| Mean per-field accuracy | 82.4% |
| Overall exact match (all fields) | 30.8% |
The gap between per-field accuracy and exact-match is the interesting part. Digging into disagreements, most of it isn't the model being wrong — it's schema ambiguity baked into the taxonomy itself. locality/subLocality/subsubLocality/village represent the same "named area, different granularity" concept, and even our gold labels are sometimes inconsistent about which bucket a given place name belongs in — we found gold records where the same string was labeled as both locality and village simultaneously. That's a taxonomy problem, not a model-capacity problem, and no amount of additional training data fixes it without a firmer labeling convention. We're treating this as an open item, not a hidden one.
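To see how per-field accuracy and exact match can diverge, here is a minimal scoring sketch. It assumes both metrics use plain equality on field values, which may differ from the actual evaluation script. The three-field schema and the records are toy data, chosen so that a locality/village swap costs two fields and the whole record.

```python
def score(predictions: list, gold: list, fields: tuple) -> dict:
    """Compute mean per-field accuracy and all-fields exact match (toy scorer)."""
    per_field = {}
    for field in fields:
        hits = sum(p.get(field) == g.get(field) for p, g in zip(predictions, gold))
        per_field[field] = hits / len(gold)
    exact_hits = sum(
        all(p.get(f) == g.get(f) for f in fields) for p, g in zip(predictions, gold)
    )
    return {
        "per_field": per_field,
        "mean_field_accuracy": sum(per_field.values()) / len(fields),
        "exact_match": exact_hits / len(gold),
    }


fields = ("city", "locality", "village")
gold = [
    {"city": "GUWAHATI", "locality": "DISPUR", "village": None},
    {"city": "PUNE", "locality": "KOTHRUD", "village": None},
    {"city": "JAIPUR", "locality": None, "village": "SANGANER"},
]
predictions = [
    {"city": "GUWAHATI", "locality": None, "village": "DISPUR"},    # bucket swap
    {"city": "PUNE", "locality": "KOTHRUD", "village": None},       # fully correct
    {"city": "JAIPUR", "locality": "SANGANER", "village": None},    # bucket swap
]
result = score(predictions, gold, fields)
print(result)
```

In this toy set the model finds the right string every time, yet two of three records fail exact match purely because of which bucket the string landed in.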
Step 3: Making it run anywhere, not just on Apple Silicon
This is where most of our actual debugging time went, and none of it was machine learning.
mlx-lm writes adapters in its own format, which the standard PEFT ecosystem can't load directly. To make the model usable on CUDA and CPU (not just Apple Silicon), we hand-derived the weight conversion:
# mlx-lm: lora_a [in_features, r], lora_b [r, out_features], used as x @ A @ B
# PEFT: lora_A.weight [r, in_features], lora_B.weight [out_features, r]
# So: peft_A = mlx_a.T, peft_B = mlx_b.T
We didn't trust our own derivation blindly — we verified it against mlx-lm's own fuse() source code (delta = (scale * lora_b.T) @ lora_a.T), then confirmed numerically: ran the same 15 addresses through both the original MLX adapter and our converted PEFT version. 13/15 identical outputs; the 2 mismatches landed exactly on already-known-ambiguous fields, consistent with floating-point differences between backends on a near-tied decision — not a conversion bug.
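The transpose rule can be sanity-checked on toy matrices before touching real weights. The sketch below is dependency-free on purpose and is not the conversion script itself: the sizes and scale are arbitrary, and it assumes the same scale factor is used on both sides. It compares the mlx-lm forward pass, the fused delta quoted above, and PEFT's delta weight (B @ A times the scaling).

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def scaled(m, s):
    return [[s * v for v in row] for row in m]


def rand(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
in_features, out_features, r, scale = 4, 3, 2, 2.0  # toy values, assumed

mlx_a = rand(in_features, r)     # mlx-lm layout: [in_features, r]
mlx_b = rand(r, out_features)    # mlx-lm layout: [r, out_features]
x = rand(1, in_features)

# mlx-lm adapter path: scale * (x @ A @ B)
mlx_out = scaled(matmul(matmul(x, mlx_a), mlx_b), scale)

# Fused delta as quoted from fuse(): (scale * lora_b.T) @ lora_a.T -> [out, in]
mlx_delta = matmul(scaled(transpose(mlx_b), scale), transpose(mlx_a))

# Converted PEFT weights and PEFT's delta weight: (B @ A) * scaling -> [out, in]
peft_A = transpose(mlx_a)        # [r, in_features]
peft_B = transpose(mlx_b)        # [out_features, r]
peft_delta = scaled(matmul(peft_B, peft_A), scale)

# A linear layer applies a weight W as x @ W.T
peft_out = matmul(x, transpose(peft_delta))

max_delta_diff = max(
    abs(p - m) for pr, mr in zip(peft_delta, mlx_delta) for p, m in zip(pr, mr)
)
max_out_diff = max(abs(p - m) for p, m in zip(peft_out[0], mlx_out[0]))
print(max_delta_diff, max_out_diff)
```

Both differences should sit at floating-point noise level. A mistake such as skipping one transpose usually shows up as a shape error or a large difference here, well before an end-to-end comparison.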
Step 4: Publishing, and the dependency-floor whack-a-mole
We published the model to Hugging Face in both formats (PEFT at the repo root, MLX in a subfolder), then packaged it as a pip-installable library: indian-address-parser on PyPI, source on GitHub.
Then real users installed it into their existing environments — Anaconda base environments specifically — and things broke in sequence:
- peft imports transformers.BloomPreTrainedModel, whose lazy-loading chain unconditionally does import tensorflow. In an environment with a mismatched TensorFlow/numpy/h5py install, that crashed everything before our code ever touched TensorFlow functionality. Fix: set USE_TF=0 before any transformers/peft import, so transformers' TF-detection short-circuits and skips that import path entirely.
- qwen3 model type not recognized. We bisected real PyPI releases and found transformers only added Qwen3 support at exactly version 4.51.0 (4.50.0: unsupported, 4.51.0: supported). Our original dependency floor (>=4.45.0) was loose enough that pip left an older, incompatible transformers version in place instead of upgrading it.
- hf_hub_download() got an unexpected keyword argument 'use_auth_token'. We traced this to peft<0.18.0 unconditionally passing use_auth_token=None into hf_hub_download, regardless of whether the caller ever asked for it. Recent huggingface_hub (1.x) dropped that long-deprecated kwarg entirely. We bisected peft's source across ten released versions to find the exact fix boundary (0.17.1: unconditional pass; 0.18.0: made conditional).
For each of these, we verified the fix against the actual reported failure — built a virtual environment pinned to the exact stale dependency combination from the bug report, installed our patched package, confirmed pip auto-upgraded correctly, and ran real inference before calling anything fixed.
Our takeaway, and something we now apply as a standing engineering practice: dependency floors need to be the actual verified minimum that works — not whatever happened to be installed during development. A loose floor doesn't fail on your machine. It fails silently on someone else's, months later, in a way that looks like your code is broken when it isn't.
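The floors belong in the package's dependency metadata, since that is what lets pip upgrade a stale install. A runtime guard can complement them by turning a confusing downstream error into a clear message. The sketch below is a hypothetical illustration, not the package's actual code. The floor values come from the bisection results above, and the version parser is deliberately simplified (packaging.version.Version is the more robust choice where that dependency is acceptable).

```python
import os
from importlib import metadata

# Must be set before transformers or peft is imported anywhere in the process.
os.environ["USE_TF"] = "0"

# Verified minimums from the bisections described above.
FLOORS = {"transformers": "4.51.0", "peft": "0.18.0"}


def version_tuple(version: str) -> tuple:
    """Parse leading numeric parts, e.g. '4.51.0' -> (4, 51, 0).

    Simplified: suffixes such as 'rc1' or '.dev0' are ignored.
    """
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '4.51' compares equal to '4.51.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def below_floor(installed: str, floor: str) -> bool:
    return version_tuple(installed) < version_tuple(floor)


def check_floors(floors: dict = FLOORS) -> list:
    """Return human-readable problems; an empty list means all floors are met."""
    problems = []
    for package, floor in floors.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed (need >={floor})")
            continue
        if below_floor(installed, floor):
            problems.append(f"{package} {installed} is older than required {floor}")
    return problems


for problem in check_floors():
    print("dependency problem:", problem)
```

Calling `check_floors()` at import time, after the environment variable is set and before the heavy imports, keeps the failure next to its cause.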
Step 5: Benchmarking honestly against an existing model
We compared our model against Shiprocket's open-tinybert-indian-address-ner — a 6-layer TinyBERT doing BIO-tagged token classification, architecturally very different from our 0.6B causal LM generating JSON, and using a different field taxonomy entirely.
We built an explicit field mapping covering the 9 conceptually overlapping fields (their house_details ↔ our houseNumber, road ↔ street, and so on) and scored both models against the same 237-example held-out set:
| Field | Ours | Shiprocket's |
|---|---|---|
| city | 91.3% | 17.4% |
| state | 96.2% | 41.5% |
| pincode | 100.0% | 69.2% |
| houseNumber | 84.5% | 27.1% |
We scored higher on every shared field — but Shiprocket's model is ~240x faster per address (19ms vs 4.6s). That's not a quality artifact; it's architecture. A 6-layer classifier doing a single forward pass will always beat autoregressive generation on raw throughput. If a use case needs high-volume, low-latency parsing over maximum accuracy, that's a legitimate reason to reach for the other model. We'd rather publish that tradeoff honestly than present a comparison that only cuts in our favor.
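The field mapping behind the comparison can be sketched as a rename step followed by per-field scoring. Only the two label pairs named above are filled in here; the other seven are left out rather than guessed. The sketch also assumes the token-classification output has already been grouped into label-to-text pairs and that scoring is exact string equality. The helper names and the sample records are illustrative.

```python
# Only the two pairs named in the text; the remaining seven are omitted on purpose.
THEIR_TO_OUR = {
    "house_details": "houseNumber",
    "road": "street",
}


def to_our_schema(their_entities: dict) -> dict:
    """Rename the other model's labels to our field names, dropping unmapped ones."""
    return {
        THEIR_TO_OUR[label]: text
        for label, text in their_entities.items()
        if label in THEIR_TO_OUR
    }


def field_accuracy(predictions: list, gold: list, field: str) -> float:
    """Share of records where the predicted value equals the gold value."""
    hits = sum(p.get(field) == g.get(field) for p, g in zip(predictions, gold))
    return hits / len(gold)


# Toy records for illustration only.
gold = [{"houseNumber": "FLAT NO.32", "street": "MG ROAD"}]
their_raw = [
    {"house_details": "FLAT NO.32", "road": "M G ROAD", "unmapped_label": "x"}
]
their_mapped = [to_our_schema(entities) for entities in their_raw]

for field in THEIR_TO_OUR.values():
    print(field, field_accuracy(their_mapped, gold, field))
```

Scoring both models through the same `field_accuracy` function on the same gold set keeps the comparison symmetric, so any difference comes from the predictions and not from the harness.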
Step 6: Publishing the data, responsibly
We also published the underlying data as two Hugging Face datasets:
- indian-addresses-raw: the full 4.37M-record unlabeled corpus
- indian-addresses-gold: 4,834 span-labeled training examples, the actual data our model was trained on
Before publishing the raw corpus, we caught something that needed careful handling: bank/BC address records are KYC-style data, and a subset embed real customer phone numbers and relational-name markers (S/O, D/O, W/O, and C/O, meaning "son of", "daughter of", "wife of", and "care of", all standard on Indian address forms). This is meaningfully different from MCA's superficially similar C/O <company director> convention, which is already public disclosure via MCA's own registry. We wrote a targeted redaction pass for the bank source specifically and verified it against the real corpus rather than assuming it worked, which caught a genuine false-positive collision ("Door No." matching our "D/O [name]" pattern) before it shipped. For the gold dataset, we went further: rather than redact text in place (which would shift the character offsets our span labels depend on, risking silent corruption), we dropped the small number of affected records instead.
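Both failure modes described here are easy to reproduce on a toy string. The patterns below are hypothetical stand-ins, not the redaction rules that actually shipped, and the sample name and phone number are invented. The loose pattern shows one way a "Door No." collision can arise, and the offset check shows why in-place redaction is unsafe for span-labeled data.

```python
import re

# Loose (hypothetical): the slash is optional, so "DOOR" reads as a "D/O" marker.
LOOSE = re.compile(r"([SDWC])/?O[\s.:]*[^,]+", re.IGNORECASE)

# Stricter (hypothetical): the slash is required and the marker stands alone.
STRICT = re.compile(r"\b([SDWC])/O\b[\s.:]*[^,]+", re.IGNORECASE)

# Assumed phone shape: a standalone 10-digit number starting with 6-9.
PHONE = re.compile(r"\b[6-9]\d{9}\b")


def redact(text: str, marker_pattern=STRICT) -> str:
    """Replace the name after a relational marker, then mask phone numbers."""
    text = marker_pattern.sub(r"\1/O [REDACTED]", text)
    return PHONE.sub("[PHONE]", text)


sample = "S/O RAM KUMAR, DOOR NO 12, MG ROAD, PH 9876543210"  # invented record

strict_result = redact(sample)
loose_result = redact(sample, LOOSE)
print(strict_result)
print(loose_result)  # the door number has been swallowed as a false positive

# A span label stored as character offsets into the original text.
start = sample.index("MG ROAD")
end = start + len("MG ROAD")
print(repr(sample[start:end]), "->", repr(strict_result[start:end]))
```

The last line prints two different strings: once the redacted text changes length, the stored offsets no longer select the labeled span, which is why dropping the affected gold records is the safer option.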
Try it
pip install indian-address-parser
from indian_address_parser import AddressParser
parser = AddressParser() # pulls weights from Hugging Face automatically
parser.parse("FLAT NO.32, UTTARA TOWERS, MG ROAD GUWAHATI , Kamrup Unclassified AS 781029")
Everything is open source under Apache 2.0:
- Model: gagan1985/qwen3-0.6b-indian-address-parser
- Code: github.com/innerkorehq/indian-address-parser
- Package: pypi.org/project/indian-address-parser
- Datasets: indian-addresses-gold / indian-addresses-raw
We built this because we needed it, and we're publishing all of it — including the parts that didn't go smoothly — because that's the kind of engineering we want to be known for at Innerkore. Feedback and contributions are welcome, especially on the locality/subLocality boundary ambiguity we flagged above. We have a hypothesis for a firmer labeling convention that might resolve it, but haven't yet tested whether it actually reduces disagreement or just relocates it.
