How Geocoding Works
Under the hood of a geocoding system: address parsing, standardisation, gazetteer indexing, candidate matching, street-segment interpolation, and confidence ranking. The article walks through the internal pipeline that converts a messy text query into a ranked list of coordinate candidates — and the algorithms (libpostal, Elasticsearch / spatial indices, TIGER address ranges) that make it work at scale.
By Steve K.
The pillar article, What Is Geocoding? (/learn/what-is-geocoding), covers what geocoding is and the practical landscape. This article goes one level deeper into the internals: how an incoming text query becomes a ranked list of coordinate candidates, and the algorithms behind each step.
The six-step pipeline
Modern geocoders — commercial (Google, Mapbox, HERE) and open-source (Pelias, Nominatim) — follow roughly the same six-step pipeline for forward geocoding:
- Parse the input string into address components.
- Standardise the components (abbreviations, casing, punctuation).
- Look up against the gazetteer index.
- Generate candidate matches.
- Rank candidates by confidence.
- Return the top N results.
Each step has well-known algorithms and trade-offs. The differences between providers are mostly in (a) the underlying data (proprietary vs OSM vs TIGER) and (b) the scoring heuristics in step 5 — not the pipeline structure.
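The six steps compose naturally as a chain of functions. The outline below is only a structural sketch: `run_pipeline`, its parameter names, and the toy stages passed to it are all hypothetical, not any provider's API.

```python
def run_pipeline(query, parse, standardise, lookup, generate, rank, top_n=5):
    """Chain the six forward-geocoding steps; every stage is injected by the caller."""
    components = parse(query)                  # step 1: parse
    canonical = standardise(components)        # step 2: standardise
    records = lookup(canonical)                # step 3: gazetteer lookup
    candidates = generate(canonical, records)  # step 4: candidate generation
    ranked = rank(canonical, candidates)       # step 5: ranking
    return ranked[:top_n]                      # step 6: return the top N

# Toy stages over a one-entry index, just to show the data flow.
toy_index = {"main street": [(40.0, -75.0)]}
result = run_pipeline(
    "Main St",
    parse=lambda q: {"street": q},
    standardise=lambda c: {"street": c["street"].lower().replace(" st", " street")},
    lookup=lambda c: toy_index.get(c["street"], []),
    generate=lambda c, recs: [{"coordinates": r, "confidence": 1.0} for r in recs],
    rank=lambda c, cands: sorted(cands, key=lambda x: -x["confidence"]),
)
```

Keeping each stage behind a plain function boundary is one way to swap data sources or ranking logic without touching the rest of the chain.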
Step 1: parsing
The input is a free-form text string. The parser breaks it into structured components:
Input: "1600 Pennsylvania Ave NW, Washington, DC 20500"
Output: {
  house_number: "1600",
  street: "Pennsylvania Ave NW",
  city: "Washington",
  state: "DC",
  postal_code: "20500",
  country: "United States" (inferred)
}
The parser handles:
- Common patterns: US-style (Number Street, City, State, ZIP), UK-style (Number, Street, Town, Postcode), Japanese-style, etc.
- Abbreviations: “Ave” vs “Avenue”, “St” vs “Street”.
- Case variations: lowercase, uppercase, mixed.
- Punctuation: commas vs no commas, trailing periods.
- Word order: most parsers handle some out-of-order components.
- Multiple languages: Japanese addresses written in Japanese characters, Russian in Cyrillic, etc.
The dominant open-source parser is libpostal, a C library whose statistical models are trained on OpenStreetMap data. It handles 60+ languages with state-of-the-art accuracy. Commercial geocoders typically use proprietary parsers, but capability comparable to libpostal is the baseline expectation.
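To make the input/output contract concrete, here is a deliberately naive parser for a single US-style pattern. It is not how libpostal works (libpostal uses trained statistical models, not a regular expression); `US_PATTERN` and `parse_us_address` are hypothetical names for illustration.

```python
import re

# One rigid pattern: "<number> <street>, <city>, <state> <zip>".
US_PATTERN = re.compile(
    r"^\s*(?P<house_number>\d+)\s+(?P<street>[^,]+),\s*"
    r"(?P<city>[^,]+),\s*(?P<state>[A-Za-z]{2})\s+(?P<postal_code>\d{5})\s*$"
)

def parse_us_address(text):
    """Return a dict of components, or None if the pattern does not match."""
    match = US_PATTERN.match(text)
    if match is None:
        return None
    parts = {name: value.strip() for name, value in match.groupdict().items()}
    parts["state"] = parts["state"].upper()
    parts["country"] = "United States"  # inferred from the pattern, not the input
    return parts

parsed = parse_us_address("1600 Pennsylvania Ave NW, Washington, DC 20500")
unparsed = parse_us_address("1600 penn ave dc")  # no commas, no ZIP: returns None
```

The second call shows why rule-based parsing breaks down quickly: the informal query has none of the delimiters the pattern relies on, which is the gap statistical parsers are meant to close.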
Step 2: standardisation
Once parsed, the components are normalised to a canonical form:
Input: { street: "Pennsylvania Ave NW" }
Output: { street: "Pennsylvania Avenue Northwest" }
Standardisation rules:
- Expand abbreviations: “Ave” → “Avenue”, “NW” → “Northwest”.
- Normalise case: title case for street names, uppercase for state codes.
- Remove punctuation: commas, periods, extra whitespace.
- Resolve aliases: “Wash DC” → “Washington DC”.
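Different standardisation libraries follow different rules. USPS publishes a strict standardisation specification for US addresses (Publication 28, Postal Addressing Standards; used by USPS Address Information Systems). Note that the USPS canonical form prefers the abbreviated suffixes and directionals (AVE, NW), whereas the example above expands them; what matters for matching is that the index and the query are normalised the same way. Most geocoders implement a similar standardisation step, though not always to USPS strictness.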
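The expansion rules above can be sketched with small lookup tables. The tables and the function name `standardise_street` are illustrative assumptions; real standardisers use much larger tables and locale-specific rules.

```python
import re

# Tiny illustrative tables; production tables are far larger.
STREET_TYPES = {"ave": "Avenue", "st": "Street", "blvd": "Boulevard", "rd": "Road"}
DIRECTIONS = {
    "n": "North", "s": "South", "e": "East", "w": "West",
    "ne": "Northeast", "nw": "Northwest", "se": "Southeast", "sw": "Southwest",
}

def standardise_street(street):
    """Expand abbreviations, drop commas and periods, and title-case the rest."""
    tokens = re.sub(r"[.,]", " ", street).split()
    last = len(tokens) - 1
    out = []
    for i, token in enumerate(tokens):
        key = token.lower()
        if key in DIRECTIONS and i in (0, last):
            out.append(DIRECTIONS[key])    # directionals only at either end
        elif key in STREET_TYPES and i > 0:
            out.append(STREET_TYPES[key])  # a leading "St" is left alone
        else:
            out.append(token.capitalize())
    return " ".join(out)

canonical = standardise_street("Pennsylvania Ave NW")
```

The position checks are a crude way to handle ambiguity: in "st. james st" the first token is part of the name and only the last one is a street type, so the function returns "St James Street".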
Step 3: gazetteer lookup
The standardised components are queried against the gazetteer index. Modern geocoders use:
- Text-search engines: Elasticsearch, Apache Solr, or custom Lucene-based indexes. The street name is indexed with n-gram tokenisation and synonym expansion; fuzzy matching handles typos.
- Spatial indexes: R-tree, KD-tree, or PostGIS spatial indexes for reverse geocoding (find the K addresses nearest a query point) and for spatial filtering (constrain forward matches to a country / region).
- Hierarchical indexes: countries contain states; states contain cities; cities contain streets; streets contain address ranges. The hierarchy lets the geocoder narrow down the search space quickly.
The underlying data typically comes from TIGER/Line for US streets (with address ranges), OpenStreetMap addr:* tags for global coverage, and commercial address layers (Mapbox, Google, HERE) for proprietary enhancements.
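The n-gram and fuzzy-matching idea behind the text index can be illustrated with a small in-memory trigram index scored by Jaccard similarity. This is a stand-in for illustration, not how Elasticsearch or Solr implement it; `TrigramIndex` and the `min_score` cut-off are assumptions.

```python
from collections import defaultdict

def trigrams(text):
    """Character trigrams of the lowercased text, padded so word edges count."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

class TrigramIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> names containing it
        self.grams = {}                   # name -> its trigram set

    def add(self, name):
        self.grams[name] = trigrams(name)
        for gram in self.grams[name]:
            self.postings[gram].add(name)

    def search(self, query, min_score=0.3):
        """Return (score, name) pairs, best first, using Jaccard similarity."""
        q = trigrams(query)
        hits = set()
        for gram in q:
            hits |= self.postings.get(gram, set())
        scored = [(len(q & self.grams[n]) / len(q | self.grams[n]), n) for n in hits]
        return sorted((s for s in scored if s[0] >= min_score), reverse=True)

index = TrigramIndex()
for street in ("Pennsylvania Avenue", "Pennington Road", "Main Street"):
    index.add(street)
matches = index.search("Pensylvania Avenue")  # note the typo in the query
```

Because a single dropped letter only disturbs a few trigrams, the misspelled query still overlaps heavily with "Pennsylvania Avenue" while "Pennington Road" falls below the cut-off.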
Step 4: candidate generation
The lookup returns a set of plausible matches. The candidate generator may produce multiple variants per match:
- Exact match: the input matches a database record exactly (rooftop precision).
- Parcel match: the input matches a property record but not a building (parcel-level precision).
- Street-segment interpolation: the input matches a street with an address range; the position is interpolated along the segment.
- City / region centroid: the input matches a city or region name; the centroid is returned as a fallback.
For a US address like “150 Main St”:
- If the geocoder has 150 Main St in its database, return the rooftop coordinate (high confidence).
- If not, but it has a record “Main St: 100–200 north side”, interpolate the position to roughly the midpoint of that segment (medium confidence).
- If nothing matches, return the city centroid (low confidence).
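The fallback order for “150 Main St” can be written as a simple cascade. The data structures (`rooftops`, `ranges`) and the straight-line interpolation between two segment endpoints are simplifying assumptions made for this sketch.

```python
def generate_candidate(house_number, street, rooftops, ranges, city_centroid):
    """Try rooftop, then address-range interpolation, then the city centroid."""
    # 1. Exact record: rooftop precision.
    if (house_number, street) in rooftops:
        return {"coordinates": rooftops[(house_number, street)],
                "accuracy_tier": "rooftop"}
    # 2. Address range: interpolate linearly between the segment endpoints.
    for seg in ranges.get(street, []):
        lo, hi = seg["from"], seg["to"]
        if lo <= house_number <= hi:
            t = 0.5 if hi == lo else (house_number - lo) / (hi - lo)
            (lat1, lon1), (lat2, lon2) = seg["start"], seg["end"]
            point = (lat1 + t * (lat2 - lat1), lon1 + t * (lon2 - lon1))
            return {"coordinates": point,
                    "accuracy_tier": "street-segment-interpolated"}
    # 3. Nothing matched: fall back to the centroid.
    return {"coordinates": city_centroid, "accuracy_tier": "city-centroid"}

rooftops = {(150, "Main Street"): (40.0005, -75.0)}
ranges = {"Main Street": [{"from": 100, "to": 200,
                           "start": (40.0, -75.0), "end": (40.001, -75.0)}]}
centroid = (40.01, -75.01)

exact = generate_candidate(150, "Main Street", rooftops, ranges, centroid)
interpolated = generate_candidate(175, "Main Street", rooftops, ranges, centroid)
fallback = generate_candidate(500, "Main Street", rooftops, ranges, centroid)
```

A fuller generator would return all three kinds of candidate and let the ranker decide; returning only the best tier keeps the sketch short.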
Step 5: ranking
Candidates are scored by a combination of factors:
- String match quality: exact string match raises confidence; partial / fuzzy match lowers it.
- Geographic context: if the request includes a proximity bias (the user's current location, the IP geolocation, a previous query area), candidates near that context are ranked higher.
- Source quality: rooftop > parcel > street-segment > region centroid.
- Address completeness: if the input includes a postcode and the candidate matches that postcode, score is higher.
- Recency: addresses recently updated in the data source are weighted slightly higher.
- Popularity: addresses with higher historical query frequency are weighted higher (the Empire State Building beats the Brooklyn 350 5th Ave for the “350 5th Ave NYC” query).
Each provider has proprietary scoring logic. Open-source geocoders (Pelias, Nominatim) publish their scoring rules; commercial geocoders treat them as trade secrets.
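One common shape for such a scorer is a weighted sum of normalised features. The weights, tier scores, and field names below are invented for illustration; no provider's actual values are implied.

```python
# Hypothetical weights and tier scores, chosen only to make the example concrete.
WEIGHTS = {
    "text": 0.35, "tier": 0.25, "postcode": 0.15,
    "proximity": 0.10, "popularity": 0.10, "recency": 0.05,
}
TIER_SCORE = {
    "rooftop": 1.0, "parcel": 0.8,
    "street-segment-interpolated": 0.6, "centroid": 0.2,
}

def score(candidate, query_postcode=None):
    """Weighted sum of features, each expected to lie between 0 and 1."""
    postcode_hit = (query_postcode is not None
                    and candidate.get("postal_code") == query_postcode)
    features = {
        "text": candidate["text_similarity"],
        "tier": TIER_SCORE[candidate["accuracy_tier"]],
        "postcode": 1.0 if postcode_hit else 0.0,
        "proximity": candidate.get("proximity", 0.0),
        "popularity": candidate.get("popularity", 0.0),
        "recency": candidate.get("recency", 0.0),
    }
    return sum(WEIGHTS[name] * value for name, value in features.items())

def rank(candidates, query_postcode=None):
    return sorted(candidates, key=lambda c: score(c, query_postcode), reverse=True)

# Two otherwise identical "350 5th Ave" candidates that differ only in popularity.
manhattan = {"name": "Manhattan", "text_similarity": 0.9,
             "accuracy_tier": "rooftop", "popularity": 0.9}
brooklyn = {"name": "Brooklyn", "text_similarity": 0.9,
            "accuracy_tier": "rooftop", "popularity": 0.1}
ordered = rank([brooklyn, manhattan])
```

With equal text and tier scores, the popularity term is the only tie-breaker, which mirrors the Empire State Building example above.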
Step 6: output
The geocoder returns the top N candidates (typically 5) with metadata:
[
  {
    address: "1600 Pennsylvania Avenue NW, Washington DC 20500",
    coordinates: [38.8977, -77.0365],
    confidence: 0.95,
    accuracy_tier: "rooftop",
    source: "TIGER + commercial enhancement"
  },
  {
    address: "1600 Pennsylvania Avenue, Wilmington DE 19801",
    coordinates: [39.7391, -75.5398],
    confidence: 0.32,
    accuracy_tier: "street-segment-interpolated",
    source: "TIGER"
  },
  ...
]
The caller can take the top candidate (most common), surface alternatives to the user when confidence is below a threshold (Coordinately's pattern), or use the full list for downstream processing (logistics, deduplication).
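The "accept or ask the user" pattern might look like the following on the caller's side. The threshold of 0.8, the status strings, and the function name are assumptions for this sketch; the article does not specify them.

```python
def choose_result(candidates, threshold=0.8, max_alternatives=3):
    """Accept the best candidate, or hand back alternatives for the user to pick."""
    if not candidates:
        return {"status": "no_match", "result": None, "alternatives": []}
    ordered = sorted(candidates, key=lambda c: c["confidence"], reverse=True)
    top = ordered[0]
    if top["confidence"] >= threshold:
        return {"status": "accepted", "result": top, "alternatives": []}
    # Below the threshold: do not guess, surface the best few options instead.
    return {"status": "needs_confirmation", "result": None,
            "alternatives": ordered[:max_alternatives]}

confident = choose_result([
    {"address": "1600 Pennsylvania Avenue NW, Washington DC 20500", "confidence": 0.95},
    {"address": "1600 Pennsylvania Avenue, Wilmington DE 19801", "confidence": 0.32},
])
ambiguous = choose_result([
    {"address": "Candidate A", "confidence": 0.55},
    {"address": "Candidate B", "confidence": 0.50},
])
```

Where the threshold sits is a product decision: a lower value means fewer prompts but more silent mismatches.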
Reverse-geocoding pipeline
For reverse geocoding, the pipeline collapses to:
- Spatial query: find addresses within a radius (e.g., 100 m) of the input point. Uses the R-tree or KD-tree spatial index.
- Distance ranking: closer addresses are ranked higher.
- Filter: exclude unsuitable matches (POIs vs addresses, commercial vs residential, depending on the API parameters).
- Output: return the top N with distance and confidence metadata.
No parsing, no standardisation — the input is precise. The pipeline is simpler but the edge cases are different (empty results in ocean, ambiguity in dense urban grids).
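The four reverse-geocoding steps can be sketched with a haversine distance and a linear scan. The scan is a stand-in for the R-tree or KD-tree query (fine for a handful of records, far too slow for a real gazetteer), and the record fields are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def reverse_geocode(lat, lon, records, radius_m=100, top_n=5, kinds=("address",)):
    """Filter by kind, keep records within the radius, return the nearest first."""
    hits = []
    for record in records:
        if record["kind"] not in kinds:
            continue  # e.g. skip POIs when only addresses are wanted
        distance = haversine_m(lat, lon, record["lat"], record["lon"])
        if distance <= radius_m:
            hits.append({**record, "distance_m": distance})
    hits.sort(key=lambda r: r["distance_m"])
    return hits[:top_n]

records = [
    {"name": "A", "kind": "address", "lat": 38.8977, "lon": -77.0365},
    {"name": "B", "kind": "address", "lat": 38.8980, "lon": -77.0365},
    {"name": "C", "kind": "poi", "lat": 38.8977, "lon": -77.0366},
    {"name": "D", "kind": "address", "lat": 38.9500, "lon": -77.0365},
]
nearby = reverse_geocode(38.8977, -77.0365, records)
in_ocean = reverse_geocode(0.0, -140.0, records)  # nothing within 100 m
```

The empty list for the mid-ocean point is the "empty results" edge case: callers need a defined behaviour for it rather than an error.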
Batch processing
For large-scale geocoding (millions of addresses), providers offer batch endpoints with different cost / latency profiles:
- Real-time batch: send individual requests at the per-second rate limit; suitable for thousands to tens of thousands of addresses.
- Bulk upload: upload a CSV; the provider returns results within minutes to hours. Suitable for millions of addresses. Lower per-record cost.
- Pre-geocoded datasets: some providers sell pre-geocoded address databases for offline use; suitable for very high volumes or where ToS allows.
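For the real-time batch option, the client usually has to pace its own requests. Below is a minimal throttling sketch; `geocode` is a caller-supplied function standing in for whatever single-address API is used, and the rate of 5 requests per second is an arbitrary assumption.

```python
import time

def geocode_batch(addresses, geocode, max_per_second=5,
                  sleep=time.sleep, clock=time.monotonic):
    """Call geocode() once per address, spacing calls to respect a rate limit."""
    min_interval = 1.0 / max_per_second
    results = []
    last_sent = None
    for address in addresses:
        if last_sent is not None:
            wait = min_interval - (clock() - last_sent)
            if wait > 0:
                sleep(wait)  # pause until the next request slot opens
        last_sent = clock()
        results.append((address, geocode(address)))
    return results

# Usage with a stand-in geocoder that just upper-cases its input.
demo = geocode_batch(["1 main st", "2 main st"], geocode=str.upper, max_per_second=1000)
```

Injecting `sleep` and `clock` keeps the function testable without real delays; a production client would also want retries and back-off on rate-limit responses.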
Coordinately's tools don't currently support batch geocoding; every request is real-time.
Open-source geocoders
Several open-source geocoders are widely used:
- Pelias: built on Elasticsearch, uses OSM + OpenAddresses + Who's On First + GeoNames as data. Modular pipeline; self-hosted; originally developed at Mapzen (which shut down in 2018) and now used by Geocode Earth, MapTiler, and others.
- Nominatim: built on PostgreSQL + PostGIS, uses OpenStreetMap exclusively. Used by the OSM Foundation's public service and by many self-hosted installations.
- Photon: built on Elasticsearch, uses OSM + Wikipedia. Faster than Nominatim for some queries but less feature-complete.
- libpostal: not a geocoder itself — just the address parser used by Pelias and others.
Self-hosting an open-source geocoder requires substantial infrastructure (per the Nominatim installation documentation, a full-planet Nominatim import needs on the order of 1 TB of SSD storage and 64–128 GB of RAM) but provides full control over the data, the pipeline, and privacy.
TIGER address ranges in detail
The US Census TIGER/Line dataset is the canonical source for US street-segment interpolation. Each street segment in TIGER has:
- Geometry: the polyline shape of the segment.
- From-address (left) and to-address (left): the range of addresses on the left side of the street as you traverse from one end to the other.
- From-address (right) and to-address (right): the same for the right side.
- Parity hints: which side has odd vs even addresses.
A query for “150 Main St” with a TIGER segment “Main St: left from 100 to 200, right from 101 to 199” would interpolate to roughly 50 % of the way from the segment's start to its end on the left side (because 150 is half-way between 100 and 200). The resulting coordinate is somewhere on the left side of the road, mid-block.
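The interpolation just described can be sketched as follows. Two simplifying assumptions: the polyline is in planar coordinates (metres), and the side is chosen by comparing the house number's parity with the parity of each range's start, instead of reading explicit parity fields. The function names are hypothetical, and the perpendicular offset to the side of the road is left out.

```python
import math

def point_along(polyline, t):
    """Point at fraction t (0..1) of the polyline's total length."""
    pieces = list(zip(polyline, polyline[1:]))
    lengths = [math.dist(a, b) for a, b in pieces]
    target = t * sum(lengths)
    for (a, b), length in zip(pieces, lengths):
        if length > 0 and target <= length:
            f = target / length
            return (a[0] + f * (b[0] - a[0]), a[1] + f * (b[1] - a[1]))
        target -= length
    return polyline[-1]  # guards against floating-point overshoot

def interpolate_address(house_number, polyline, left_range, right_range):
    """Return (side, fraction, point) or None if no range contains the number."""
    for side, (lo, hi) in (("left", left_range), ("right", right_range)):
        same_parity = house_number % 2 == lo % 2
        if same_parity and min(lo, hi) <= house_number <= max(lo, hi):
            t = 0.5 if hi == lo else (house_number - lo) / (hi - lo)
            return side, t, point_along(polyline, t)
    return None

# An L-shaped segment: 100 m east, then 100 m north.
segment = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0)]
even_side = interpolate_address(150, segment, left_range=(100, 200), right_range=(101, 199))
odd_side = interpolate_address(151, segment, left_range=(100, 200), right_range=(101, 199))
```

Measuring the fraction along the polyline's length (not between its endpoints) matters on curved or bent segments: number 150 lands on the corner at (100, 0), not on the diagonal between the two ends.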
TIGER is updated annually by the Census Bureau. New construction (subdivisions added in the last 2–3 years) may not be in TIGER yet; commercial geocoders fill the gap with their own field surveys.
Worked pipeline trace
A trace of the six-step pipeline for the query “1600 penn ave dc”:
1. Parse:
   - input: "1600 penn ave dc"
   - libpostal output: {
       house_number: "1600",
       street: "penn ave",
       state: "dc",
       country: (implicit "US" via state)
     }
2. Standardise:
   - "penn ave" → "Pennsylvania Avenue"
   - "dc" → "District of Columbia" → "Washington, DC"
3. Lookup:
   - Index query: street="Pennsylvania Avenue" AND state="DC"
   - Returns 3 segments in Washington DC
4. Candidate generation:
   - Segment 1 (Pennsylvania Avenue NW, 1600 block): rooftop match → 38.8977, -77.0365
   - Segment 2 (Pennsylvania Avenue SE, 1600 block): rooftop match → 38.8810, -76.9856
   - Segment 3 (other lesser candidate)
5. Rank:
   - "Pennsylvania Avenue NW" wins because:
     - "Pennsylvania Ave" is the famous one
     - Higher historical query frequency
     - White House is widely recognised
   - Confidence: 0.92 (high)
   - SE candidate: confidence 0.15 (low)
6. Output:
   [
     { address: "1600 Pennsylvania Ave NW, Washington DC", confidence: 0.92, ... },
     { address: "1600 Pennsylvania Ave SE, Washington DC", confidence: 0.15, ... }
   ]
The pipeline is deterministic in its structure but depends on the ranking heuristics to surface the right answer. “1600 penn ave dc” → White House requires the ranker to know that the NW address is much more famous than the SE one.
Common misconceptions
“Geocoding is just a database lookup.” The lookup is one of six steps. Parsing and standardisation are substantial work; ranking is what differentiates good geocoders from bad ones. The database lookup itself is the easiest part.
“All providers use the same algorithms.” The pipeline structure is similar, but specific algorithms vary substantially. Google's parser is proprietary; Mapbox uses some open-source components and proprietary ranking; Nominatim is fully open-source. The accuracy differences come from data quality and ranking sophistication, not the high-level architecture.
“libpostal solves address parsing.” libpostal is excellent — state-of-the-art for an open-source parser — but it still struggles with extreme edge cases (rural informal addresses, mixed-script multilingual inputs, addresses referencing non-standard landmarks). Commercial parsers may do better in specific languages or regions where they have more training data.
“You can build a geocoder in a weekend.” A toy geocoder over a small dataset, yes. A production geocoder that handles 99 % of real-world queries with sub-100 ms latency at thousands of QPS over global data is years of work. Companies like Mapbox have entire teams working on geocoding full-time.
“Street-segment interpolation is accurate.” It's approximate. Address numbers aren't always evenly spaced (some blocks have only odd numbers, some have gaps); interpolation assumes uniform spacing. Typical error: 20–50 m. Quality applications treat interpolation as a lower-tier result than rooftop and prefer the rooftop when available.
“Confidence scores are comparable across providers.” They're not. A Mapbox confidence of “high” is defined by Mapbox's own matching rules; Google's “ROOFTOP” is a location_type value describing the precision of the result, not a numeric confidence. The scores are useful within a single provider but not portable.
Related
- What Is Geocoding? (the pillar: geocoding in general)
- Forward vs Reverse Geocoding (the two directional pipelines)
- Address Standardization (the cleaning step in the pipeline; forthcoming)
- Geocoding Accuracy Levels (how candidate matches are tiered by precision; forthcoming)
- Methodology (how content is sourced and verified)
Frequently asked questions
What are the main steps in a forward geocoder?
Six steps. (1) Parsing: break the input string into address components (house number, street, city, postcode, country). (2) Standardisation: normalise abbreviations, casing, punctuation. (3) Lookup: query the address database for matching records. (4) Candidate generation: produce a list of plausible matches. (5) Ranking: score each candidate by string match, geographic context, and source quality. (6) Output: return the top-N candidates with metadata (accuracy tier, confidence, distance from any provided context point).
What is libpostal?
libpostal is an open-source C library that parses free-form address strings into components using machine-learning models trained on OpenStreetMap data. It handles addresses in 60+ languages and many international formats. libpostal is used as the address-parsing component in several open-source geocoders (including Pelias) and by some commercial systems. It's the standard answer to 'how do I parse an arbitrary address string in code?'.
What is street-segment interpolation?
When a geocoder doesn't have a specific address in its database but knows the street's address range, it interpolates the position. For example: a street segment is recorded as '100–200 Main St on the north side of the block.' A query for '150 Main St' is interpolated to roughly the midpoint of that segment, assuming addresses are evenly spaced along the segment. Accuracy: typically ±20–50 m (worse if addresses aren't evenly spaced; better if the geocoder has parcel boundary data). The TIGER/Line dataset is the canonical address-range source for US street-segment interpolation.
How does the gazetteer index work?
A gazetteer is a searchable index of geographic names and address components. Modern geocoders use text-search engines (Elasticsearch, Apache Solr, or custom indexes) to query the gazetteer with fuzzy matching, n-gram tokenisation, and synonym expansion. The street name 'Main St' is indexed with variations (Main Street, Main St., main, etc.); queries match against any variation. Spatial indexes (R-tree, KD-tree) handle the spatial side: for reverse geocoding, find the K addresses nearest the query point.
What makes a candidate match high-confidence?
Several factors. Exact string match on the city / state / ZIP raises confidence; partial matches lower it. Geographic context: if the user is in the contiguous United States (CONUS), a match there is preferred over an international one. Source quality: rooftop precision raises confidence; street-segment interpolation lowers it. Address completeness: if the input includes a postcode, a candidate matching that postcode is more likely correct. The exact scoring is proprietary to each geocoder; open-source geocoders (Pelias, Nominatim) publish their scoring rules.
Sources
- US Census: TIGER/Line address-range data documentation · https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html
- Pelias: Pelias open-source geocoder, pipeline reference · https://github.com/pelias/pelias
- OSM: Nominatim open-source geocoder, algorithm documentation · https://nominatim.org/release-docs/develop/
- libpostal: open-source address parser · https://github.com/openvenues/libpostal
Cite this article
APA format:
Steve K. (2026). How Geocoding Works. Coordinately. https://coordinately.org/learn/how-geocoding-works
BibTeX:
@misc{coordinately_howgeocodingworks_2026,
author = {K., Steve},
title = {How Geocoding Works},
year = {2026},
publisher = {Coordinately},
url = {https://coordinately.org/learn/how-geocoding-works},
note = {Accessed: 2026-06-05}
}