@ianthetechie
Contributor

Now that #7 is merged, I think we can start discussing other layers. This is one of the ones that initially drove that PR, and we're building this on our own internal pipelines already.

I think this is pretty straightforward, but I'll raise a few points to get more eyes on:

Convention for multilingual names

I expect this will come up in a lot of layers (highways, for example, don't currently have this but definitely need it), so we might want to put at least a little thought into it now.

Decision 1: is name a separate column or part of a map? The "name" is always the local name, and it (somewhat regrettably, for people like me :P) doesn't carry any language information. This makes it somewhat different from the other names. So I have kept name as a top-level column, and put the rest into multilingual_names. I think this acknowledges the difference without needing to decide on a well-known name, and hope there are no collisions (e.g. name:local is apparently a thing?).

Decision 2: what to use for the keys in the map? I just used the OSM tag name as the map key, but could just as easily argue for dropping the name: prefix so you could index by what you hope is an ISO language code 🤷.
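
To make the two decisions concrete, here is a minimal sketch of both key conventions side by side. It assumes the tags arrive as a plain dict; `split_names` and its `drop_prefix` flag are hypothetical names for illustration, not code from this PR.

```python
def split_names(tags: dict, drop_prefix: bool = True):
    """Return (name, names_map) from a flat OSM tag dict.

    `name` stays a top-level value (Decision 1); every `name:*` tag goes
    into the map, keyed with or without the `name:` prefix (Decision 2).
    """
    name = tags.get("name")
    names = {}
    for key, value in tags.items():
        if key.startswith("name:"):
            map_key = key[len("name:"):] if drop_prefix else key
            names[map_key] = value
    return name, names


# Illustrative tags only.
tags = {
    "name": "서울특별시",
    "name:en": "Seoul",
    "name:ja": "ソウル特別市",
    "admin_level": "4",
}

name, by_suffix = split_names(tags)                 # keys: "en", "ja"
_, by_full_tag = split_names(tags, drop_prefix=False)  # keys: "name:en", "name:ja"
```

Dropping the prefix gives keys that look like language codes, but nothing here validates that they actually are.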

Alt names

Alt names exist, and I haven't really given much thought to how we should handle those yet. I don't think we need to decide on that before merging this PR per se. But the above discussion is a prerequisite, since it should use similar naming/structure.

Anything I missed?

  • Have I missed any boundary values that would be useful to include?
  • Are there any other tags I missed? I think I have the important ones based on taginfo. Though I was surprised to see that less than half have a name tag 🤔

("type", pyarrow.string()),
("admin_level", pyarrow.string()),
("boundary", pyarrow.string()),
("place", pyarrow.string()),
Member

There’s a growing trend to use border_type=* instead of place=* on administrative boundaries. Some other keys are also used to break free of the globally harmonized place classification scheme, but border_type=* is the most common and globally distributed of them.

Contributor Author

Interesting! That would actually be very useful if it gets broader adoption. Particularly in areas like South Korea with wildly complex "official" admin_level rules (e.g., there is no one-to-one mapping, and there are specific exceptions for some areas), and the reality of tagging doesn't even follow that all the time.

Member

The wiki has a rundown of the alternatives. Of these, border_type=* is the oldest and the only one that global software makes use of (namely, Nominatim and openstreetmap-website). But it might be a good idea to try and normalize the other keys to it for the convenience of clients.

Contributor Author

Thanks for the rundown! I'm adding a comment to note this as a possibility for future improvement.


class BoundariesWriter(GeoParquetWriter):
    COLUMNS = [
        ("name", pyarrow.string()),
Member

e.g. name:local is apparently a thing?

If I’m not mistaken, this stems from the ancient debate about old_name=* versus name:old=* and so on. There isn’t much usage and most data consumers just ignore it. Incidentally, I think data consumers may use name:local or somesuch as a placeholder after replacing name with the requested language.

If you need a language code for consistency, mul is the official ISO 639 code for multilingual content, and und is the official code for content in an unknown language. Localization functionality in renderers, such as Mapbox/MapLibre and OSM Americana, uses mul with the rationale that the overall coverage is multilingual, while a different type of data consumer might go with und instead.

Contributor Author

Ahh interesting! I think I'll just leave it as name for now for the top level key.


try:
    self.append(
        "way" if o.from_way() else "relation",
Member

Though I was surprised to see that less than half have a name tag 🤔

This is because of lingering usage of boundary=* on ways. The figure is over 96% among relations only. For the most part, boundary=* on ways has been out of favor since old-style multipolygons were deprecated. However, it remains valid as part of the disputed boundary tagging scheme and technically when a boundary is isolated with no neighbors, enclaves, or exclaves. On that note, name:left=* and name:right=* may also be of interest.

Contributor Author

However, it remains valid as part of the disputed boundary tagging scheme and technically when a boundary is isolated with no neighbors, enclaves, or exclaves. On that note, name:left=* and name:right=* may also be of interest.

Good point on disputed boundaries... and other things. Apparently name can also be combined with arbitrary separators, and have a name:left / name:right 😂 This will take a bit more effort than just adding tags though.

  1. The boundaries layer as written currently only considers areas. I think name:left and name:right would only be for open ways tagging a specific border in isolation without the full geometry. What would be the most useful way to proceed in your opinion? Neither a linestring border nor the "left/right" convention are particularly friendly for data consumers so maybe we should put some thought into how to make this maximally useful (and I don't think I understand the usage / conventions well enough TBH).
  2. I had a look over the Disputed territories wiki page and it unhelpfully indicates that there were two proposals for how to tag these, both of which are rejected (though one is still used?). I also know the wiki is often the reflection of whoever last edited it and may not always portray reality :) Any suggestions on how to handle this would be very welcome.
  3. I don't understand your comment on isolated boundaries / enclaves. Can you elaborate?

Member

The boundaries layer as written currently only considers areas.

In that case, it doesn’t need to worry about name:left/right=*.

I don't understand your comment on isolated boundaries / enclaves. Can you elaborate?

I recall that some boundaries around islands in the ocean are still mapped as ways rather than relations. The U.S. community has made an effort to upgrade them all to relations, but I’m not sure about other parts of the world.

I had a look over the Disputed territories wiki page and it unhelpfully indicates that there were two proposals for how to tag these, both of which are rejected (though one is still used?).

Neither of those. The only tagging scheme being used for the things you’d care about is boundary=disputed along with disputed_by=*, claimed_by=*, and controlled_by=*.

Apparently name can also be combined with arbitrary separators

The semicolon is the only delimiter that reliably separates names globally. Other arbitrary separators like slashes, dashes, or spaces ostensibly matter in some regions, but in my experience you’ll find too many exceptions to make it worthwhile.
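
As a rough illustration of semicolon-only splitting, the sketch below leaves slashes, dashes, and spaces untouched. The function name is hypothetical, and dropping empty fragments (for a trailing semicolon, say) is an assumption of this sketch rather than something settled in this thread.

```python
from typing import List, Optional


def split_on_semicolons(value: Optional[str]) -> Optional[List[str]]:
    """Split a tag value on ';' only, trimming whitespace around each part.

    Returns None for a missing or empty value. Empty fragments are dropped,
    which is an assumption of this sketch.
    """
    if not value:
        return None
    parts = [part.strip() for part in value.split(";")]
    parts = [part for part in parts if part]
    return parts or None


two_names = split_on_semicolons("First Name; Second Name")
slash_kept = split_on_semicolons("Biel/Bienne")  # slash is not a delimiter here
missing = split_on_semicolons(None)
```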

res = {tag.k: tag.v for tag in tags if not tag.k.startswith("name:")}

# Special shape transformation for names
name_tags = {tag.k: tag.v for tag in tags if tag.k.startswith("name:")}
Member

Some keys like name:left=*, name:right=*, name:etymology=*, name:pronunciation=*, and name:genitive=* are not exactly “multilingual” names but rather properties of names. If you have the ability to store a more deeply nested structure than just a flat key-value store, it might be a good idea to somehow associate the etymologies and pronunciations with their respective names.

Contributor Author

I'm still figuring out how to approach name:left and name:right for this layer (discussed in another comment), but this is a good point about name:etymology and other tags! Are these tags only related to name? Or can they be namespaced like name:en:etymology? For this specific tag, I only found 2 usages, so it seems like this may only apply to name?

Member

name:etymology=* and name:pronunciation=* are properties of the name itself, while name:left=*, name:genitive=*, etc. are other names used in more specific contexts.

There’s a bit of inconsistency about whether to put the language code before or after pronunciation or etymology. And if I had known about it earlier, I would’ve chosen to popularize name:en-fonipa=*, name:en-fonxsamp=*, etc. according to the standard instead.
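
One way to picture the distinction is a sketch that sorts name:* subkeys into three buckets. The suffix sets below contain only the keys mentioned in this thread and are certainly incomplete; the function and bucket names are hypothetical. Because the language code can come before or after etymology/pronunciation, the check looks at every colon-separated part.

```python
# Only the suffixes named in this discussion; a real list would be longer.
NAME_PROPERTY_PARTS = {"etymology", "pronunciation"}   # properties of the name itself
CONTEXTUAL_NAME_PARTS = {"left", "right", "genitive"}  # other names for specific contexts


def classify_name_tags(tags: dict) -> dict:
    """Bucket name:* tags into languages, properties, and contextual names."""
    result = {"languages": {}, "properties": {}, "contextual": {}}
    for key, value in tags.items():
        if not key.startswith("name:"):
            continue
        suffix = key[len("name:"):]
        parts = set(suffix.split(":"))
        if parts & NAME_PROPERTY_PARTS:
            result["properties"][suffix] = value
        elif parts & CONTEXTUAL_NAME_PARTS:
            result["contextual"][suffix] = value
        else:
            # Assumed to be a language code; nothing validates that here.
            result["languages"][suffix] = value
    return result


classified = classify_name_tags({
    "name": "Example",
    "name:fr": "Exemple",
    "name:etymology": "illustrative value",
    "name:en:pronunciation": "illustrative value",
    "name:left": "Left Side",
})
```

A key such as name:en-fonipa would land in the languages bucket under this sketch, since it is syntactically a language tag with a variant subtag.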

@jake-low
Member

jake-low commented Sep 8, 2025

This is great! Thanks for putting up this PR. I think a boundaries layer is really useful. Personally, my main use case is to use boundaries as a join or clipping polygon for other data (e.g. grab the "California" boundary from the boundaries layer, then use it to select all buildings in California from the buildings layer). But I'm sure there are other use cases too (including cartography, where handling disputed boundaries thoughtfully is important).

I was recently playing around with a places layer (for OSM features tagged place=*, i.e. point geometries for cities, towns, etc), and one of the things I was experimenting with was multilingual names. I just put that code up in a PR (#19).

When working on multilingual names, I came to similar decisions as you, namely:

  1. I kept name as its own string column, but put all the other name:* tags in a names map. I chose names as the column name since it'll end up including multilingual names but also name:left, name:etymology, etc; it's up to data consumers to read the value(s) they want from the map.
  2. I did choose to drop the name: prefix from keys in the map, since it makes querying nicer (select tags.names.es instead of ... select tags.names["name:es"] I think?).

I imagine this same pattern would work well for any tag that has multilingual variants. I added alt_names and official_names maps in the places layer, just as a proof of concept. Once we figure out what pattern we want to use for these, we should document it in the (as-yet-to-be-created) layer schema docs.

Have I missed any boundary values that would be useful to include?

boundary=aboriginal_lands comes to mind (example) since in many cases these function as their own administrative units. I agree with your decision to exclude protected areas, census boundaries, etc from this layer though. I don't think all layers need to be 1:1 with a top-level tag. For example, I'm planning to create a parks layer that will include all leisure=park, leisure=nature_reserve, boundary=national_park, and boundary=protected_area elements. Other boundary types could also get their own layers if needed.

Are there any other tags I missed? I think I have the important ones based on taginfo.

Looks pretty complete to me. Maybe look into disputed, disputed_by, claimed_by, recognized_by, as I've seen those on disputed boundaries. Although TagInfo suggests they are more commonly used on ways, not relations.

@ianthetechie
Contributor Author

Thanks for the reviews!

I kept name as its own string column, but put all the other name:* tags in a names map. I chose names as the column name since it'll end up including multilingual names but also name:left, name:etymology, etc; it's up to data consumers to read the value(s) they want from the map.

I think @1ec5 brings up an interesting point about leveraging the rich column types, but I'm not sure what I'd call the name value itself ;) Your proposal neatly side-steps this dilemma and makes the keys correspond directly to the OSM tags, without necessarily implying language like mine did...

I'll think this over a bit more this week.

I did choose to drop the name: prefix from keys in the map, since it makes querying nicer (select tags.names.es instead of ... select tags.names["name:es"] I think?).

Yeah, on second thought, I think this works better (especially if the "parent" key is names).

I imagine this same pattern would work well for any tag that has multilingual variants. I added alt_names and official_names maps in the places layer, just as a proof of concept. Once we figure out what pattern we want to use for these, we should document it in the (as-yet-to-be-created) layer schema docs.

👍 I'll check out the implementation there.

boundary=aboriginal_lands comes to mind

Good call!

I agree with your decision to exclude protected areas, census boundaries, etc from this layer though. I don't think all layers need to be 1:1 with a top-level tag.

Yeah, my goal for this layer is to identify, in coarse terms, where you are on earth. (My specific use case is reverse geocoding.) That mostly means administrative boundaries, but this isn't always the case (boundary=place, for example). National parks, census areas, etc., aren't typically needed for the same kinds of things, and probably make more sense as separate layers.

maritime is sorta in the middle, since it's sort of administrative too. My understanding of the OSM usage is that it's typically used for things like EEZs and other jurisdictional things. It's not a requirement for my use case, but seemed close enough to include.
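
For readers skimming, the scoping described above could be sketched as a simple membership test. The set below is assembled from values mentioned in this conversation and is an assumption; the layer's actual filter may differ.

```python
# Assumed from this thread: administrative, place, maritime, plus the suggested
# aboriginal_lands. Not copied from the PR's code.
INCLUDED_BOUNDARY_VALUES = frozenset(
    {"administrative", "place", "maritime", "aboriginal_lands"}
)


def belongs_in_boundaries_layer(tags: dict) -> bool:
    """True if the boundary=* value is one this layer is meant to cover."""
    return tags.get("boundary") in INCLUDED_BOUNDARY_VALUES


is_admin = belongs_in_boundaries_layer({"boundary": "administrative", "admin_level": "4"})
is_park = belongs_in_boundaries_layer({"boundary": "national_park"})
```

Values like national_park or protected_area fall through, matching the idea that they belong in separate layers.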

@ianthetechie
Contributor Author

Following up on this after reviewing #19.

  • I think your method of extracting various map/sub-key fields is better than mine. Let's standardize on that approach across importers 👍
  • @1ec5 brings up some interesting points on semicolon-delimited names. This brings up a question... should the name (etc.) map value be a string which we leave up to the consumer to parse? Or should we also standardize on splitting names and such into an array? I lean toward parsing into an array, but that's not a super strong opinion.
  • I'll update this PR with the disputed tags. Thanks to both of you for pointing the way on this!

@1ec5
Member

1ec5 commented Sep 18, 2025

should the name (etc.) map value be a string which we leave up to the consumer to parse? Or should we also standardize on splitting names and such into an array? I lean toward parsing into an array, but that's not a super strong opinion.

Both approaches have precedent, but from my perspective having worked on osm-americana/openstreetmap-americana#670, an array representation would be much friendlier to consumers. Also, I recall that the Daylight distribution represented names as arrays; maybe Overture as well.

@jake-low jake-low mentioned this pull request Sep 18, 2025
@ianthetechie
Contributor Author

... an array representation would be much friendlier to consumers. Also, I recall that the Daylight distribution represented names as arrays; maybe Overture as well.

Agreed on friendliness! It looks like Overture just uses a single string per language (keyed similarly to how we do) but that doesn't make much difference to me :P https://docs.overturemaps.org/schema/reference/places/place/

I think I'll go with the same approach as Americana and Nominatim (semicolon-only splitting) for now. I looked at the Pelias code and they also do the same (semicolon splitting on name tags). We can improve and expand it later, and this is a good starting point. I am not applying it to all layers at the moment; just ones which I know have some precedent for using it. In particular, the disputed tag family uses semicolon delimited ISO country codes quite consistently.

Might be worth splitting out some of these "common tag sets" at some point as we build out more layer definitions. For example, names and wiki links. Maybe after another layer or 2 we can see what makes sense to have common defs for.

Finally, I think we might be able to automate away all or part of these column functions at some point by inspecting the schema... but let's get a few more examples first ;)

Comment on lines +27 to +30
    if s:
        return [value.strip() for value in s.split(";")]
    else:
        return None
Contributor Author

Upon further reflection, I realized that this is perhaps a bit inconsistent. We seem to encode empty maps as empty maps rather than null. Null values can be encoded more efficiently in parquet IIUC (https://parquet.apache.org/docs/file-format/nulls/), particularly for columns which are predominantly null, like most of the things where we have arrays.

I'm not sure if we need to actually make any change for map columns but I wanted to flag this for discussion either way.
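
If we did want consistency, one option would be to normalize empty containers to None before the row reaches the writer. This is only a sketch of that idea: `none_if_empty` is a hypothetical helper, and whether map columns should change at all is still the open question here.

```python
def none_if_empty(value):
    """Map an empty dict or list to None; leave every other value unchanged."""
    if isinstance(value, (dict, list)) and len(value) == 0:
        return None
    return value


# Illustrative row: an empty map and an already-null array column.
row = {"name": ["Example"], "names": {}, "alt_name": None}
normalized = {column: none_if_empty(value) for column, value in row.items()}
```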

Comment on lines +68 to +88
match column_name:
    case (
        "name"
        | "official_name"
        | "int_name"
        | "alt_name"
        | "disputed_by"
        | "claimed_by"
        | "controlled_by"
        | "recognized_by"
    ):
        # Multi-value fields
        return split_multi_value_field(tags.get(column_name))
    case "names" | "alt_names" | "official_names":
        # Prefixed maps
        return tags_with_prefix(
            f"{column_name[:-1]}:", tags, transform=split_multi_value_field
        )
    case _:
        # Single value fields
        return tags.get(column_name)
Contributor Author

I've made certain columns multi-value arrays per the discussion. If we agree on this, we'll want to update the settlements layer as well to match.
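
For anyone reading the hunk without the rest of the diff: the prefixed-map case derives the tag prefix by trimming the trailing "s" from the column name, so "alt_names" becomes "alt_name:". The helper itself isn't shown in this hunk, so below is a guess at what a function with that call shape might do, assuming tags is a plain dict. It is deliberately named `tags_with_prefix_sketch` to avoid implying it is the real implementation.

```python
from typing import Callable, Optional


def tags_with_prefix_sketch(
    prefix: str, tags: dict, transform: Optional[Callable] = None
) -> dict:
    """Collect tags whose key starts with `prefix`, keyed by the remainder.

    `transform`, if given, is applied to each value; values that transform
    to None are skipped (an assumption of this sketch).
    """
    result = {}
    for key, value in tags.items():
        if not key.startswith(prefix):
            continue
        new_value = transform(value) if transform else value
        if new_value is not None:
            result[key[len(prefix):]] = new_value
    return result


column_name = "names"
prefix = f"{column_name[:-1]}:"  # "names" -> "name:"

tags = {"name": "A", "name:en": "B; C", "alt_name:fr": "D"}
names = tags_with_prefix_sketch(
    prefix, tags, transform=lambda s: [v.strip() for v in s.split(";")]
)
```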

@jake-low
Member

I've been testing this PR with your latest changes, and I think it's great!

I built the boundaries dataset from north-america-latest.osm.pbf. It took about 3 hours on my laptop (using the --osmium-idx sparse_file_array option as I don't have enough RAM to use the default index type). The resulting dataset contains 66,404 rows and is 371M. I'm guessing that the full planet boundaries will be no more than an order of magnitude more than that.

[Screenshot: the exported GeoJSON file displayed in mapshaper.org]

I was initially hesitant about semicolon-splitting for names, since I wasn't sure about the ergonomics of having the name column be a varchar[]. But it's actually fine. Arrays are one-indexed in DuckDB's SQL dialect (maybe this is universal in SQL? IDK), so if you want strings, you can get the first value in the name array like this:

D select type, id, tags.name[1] as name from 'out/boundaries.parquet' limit 10;
┌──────────┬──────────┬────────────────────────────────────────┐
│   type   │    id    │                  name                  │
│ varchar  │  int64   │                varchar                 │
├──────────┼──────────┼────────────────────────────────────────┤
│ relation │ 17571170 │ Public Gardens                         │
│ way      │  3850270 │ NULL                                   │
│ way      │  3941174 │ Ojibwa Island                          │
│ relation │  9435730 │ Halifax Citadel National Historic Site │
│ relation │ 10705906 │ Algonquin Island                       │
│ way      │  9650702 │ NULL                                   │
│ way      │  9650877 │ NULL                                   │
│ way      │  9650887 │ NULL                                   │
│ way      │  9650894 │ NULL                                   │
│ way      │  9650898 │ NULL                                   │
├──────────┴──────────┴────────────────────────────────────────┤
│ 10 rows                                            3 columns │
└──────────────────────────────────────────────────────────────┘

...and pleasantly, it handles NULL values in the way you'd want.

If you want to export the results of a query to GeoJSON, you'll run into an error if you try to include an array-valued column:

D copy (
    from 'out/boundaries.parquet'
    select type, id, tags.name[1] as name, tags.alt_name as alt_name, geometry
  ) to 'test.geojson' with (format 'GDAL', driver 'GeoJSON');
Not implemented Error:
Unsupported field type

But you can fix this by wrapping that column in a call to the built-in json() function.

D copy (
    from 'out/boundaries.parquet'
    select type, id, tags.name[1] as name, json(tags.alt_name) as alt_name, geometry
  ) to 'test.geojson' with (format 'GDAL', driver 'GeoJSON');

This is how I produced the GeoJSON file shown in the mapshaper.org screenshot above. In that GeoJSON file, each feature's properties["alt_name"] is an array of strings (or null).
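
For consumers doing the same thing outside DuckDB, here is a small standard-library sketch of the resulting feature shape. The row and its geometry are made-up placeholders, and `to_feature` is a hypothetical helper. Note that Python lists are zero-indexed, unlike the one-indexed `tags.name[1]` in the queries above.

```python
import json


def to_feature(row: dict) -> dict:
    """Build a GeoJSON Feature dict: first name as a string, alt_name as an array."""
    names = row.get("name")
    return {
        "type": "Feature",
        "geometry": row["geometry"],
        "properties": {
            "type": row["type"],
            "id": row["id"],
            "name": names[0] if names else None,  # index 0 here, [1] in DuckDB
            "alt_name": row.get("alt_name"),      # list of strings or None
        },
    }


# Placeholder row; the id and geometry are invented for illustration.
row = {
    "type": "relation",
    "id": 1,
    "name": ["Example"],
    "alt_name": ["Alt A", "Alt B"],
    "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
}
feature = to_feature(row)
round_tripped = json.loads(json.dumps(feature))
```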

This seems like it's ready for merging to me. 🙂 As always, happy to gather feedback and make adjustments to the schema later on. Thanks for working on this layer; I'm excited to use it.

@jake-low jake-low merged commit 6a64801 into osmus:main Sep 26, 2025
@ianthetechie
Contributor Author

Rockin'! Yeah it's a few gigabytes and is even faster than that on my server using a filtered planet PBF.

I also had the same concern over a list of names initially but came to the same conclusion. DataFusion is similar in ergonomics.

Of course it's still an interesting question for data consumers how THEY should use multiple names, if at all, but I think making it explicit is in everyone's best interest :)

@ianthetechie ianthetechie deleted the boundaries-layer branch October 1, 2025 04:35