First pass at a boundary layer #18
    ("type", pyarrow.string()),
    ("admin_level", pyarrow.string()),
    ("boundary", pyarrow.string()),
    ("place", pyarrow.string()),
There’s a growing trend to use border_type=* instead of place=* on administrative boundaries. Some other keys are also used to break free of the globally harmonized place classification scheme, but border_type=* is the most common and globally distributed of them.
Interesting! That would actually be very useful if it gets broader adoption, particularly in areas like South Korea, which has wildly complex "official" admin_level rules (e.g., there is no one-to-one mapping, and there are specific exceptions for some areas) that real-world tagging doesn't even follow all the time.
The wiki has a rundown of the alternatives. Of these, border_type=* is the oldest and the only one that global software makes use of (namely, Nominatim and openstreetmap-website). But it might be a good idea to try and normalize the other keys to it for the convenience of clients.
Thanks for the rundown! I'm adding a comment to note this as a possibility for future improvement.
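For reference, the normalization idea floated above could look roughly like this. It is only a sketch: `normalized_border_type` is a hypothetical helper, and the fallback keys are illustrative placeholders rather than a vetted list (the real candidates would come from the wiki rundown).

```python
from typing import Mapping, Optional, Sequence

# Illustrative placeholders only (assumption); the real list of alternative
# keys should be taken from the wiki rundown mentioned in the thread.
FALLBACK_KEYS: Sequence[str] = ("admin_type", "designation")


def normalized_border_type(
    tags: Mapping[str, str],
    fallback_keys: Sequence[str] = FALLBACK_KEYS,
) -> Optional[str]:
    """Return border_type if set, else the first non-empty fallback key."""
    for key in ("border_type", *fallback_keys):
        value = tags.get(key, "").strip()
        if value:
            return value
    return None
```

`border_type` always wins when present, so existing tagging is never overridden by a fallback key.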
src/boundaries.py
Outdated
class BoundariesWriter(GeoParquetWriter):
    COLUMNS = [
        ("name", pyarrow.string()),
> e.g. name:local is apparently a thing?
If I’m not mistaken, this stems from the ancient debate about old_name=* versus name:old=* and so on. There isn’t much usage and most data consumers just ignore it. Incidentally, I think data consumers may use name:local or somesuch as a placeholder after replacing name with the requested language.
If you need a language code for consistency, mul is the official ISO 639 code for multilingual content, and und is the official code for content in an unknown language. Localization functionality in renderers, such as Mapbox/MapLibre and OSM Americana, uses mul with the rationale that the overall coverage is multilingual, while a different type of data consumer might go with und instead.
Ahh interesting! I think I'll just leave it as name for now for the top level key.
try:
    self.append(
        "way" if o.from_way() else "relation",
> Though I was surprised to see that less than half have a name tag 🤔
This is because of lingering usage of boundary=* on ways. The figure is over 96% among relations only. For the most part, boundary=* on ways has been out of favor since old-style multipolygons were deprecated. However, it remains valid as part of the disputed boundary tagging scheme and technically when a boundary is isolated with no neighbors, enclaves, or exclaves. On that note, name:left=* and name:right=* may also be of interest.
> However, it remains valid as part of the disputed boundary tagging scheme and technically when a boundary is isolated with no neighbors, enclaves, or exclaves. On that note, name:left=* and name:right=* may also be of interest.
Good point on disputed boundaries... and other things. Apparently name can also be combined with arbitrary separators, and have a name:left / name:right 😂 This will take a bit more effort than just adding tags though.
- The boundaries layer as written currently only considers areas. I think name:left and name:right would only be for open ways tagging a specific border in isolation without the full geometry. What would be the most useful way to proceed in your opinion? Neither a linestring border nor the "left/right" convention are particularly friendly for data consumers so maybe we should put some thought into how to make this maximally useful (and I don't think I understand the usage / conventions well enough TBH).
- I had a look over the Disputed territories wiki page and it unhelpfully indicates that there were two proposals for how to tag these, both of which are rejected (though one is still used?). I also know the wiki is often the reflection of whoever last edited it and may not always portray reality :) Any suggestions on how to handle this would be very welcome.
- I don't understand your comment on isolated boundaries / enclaves. Can you elaborate?
> The boundaries layer as written currently only considers areas.
In that case, it doesn’t need to worry about name:left/right=*.
> I don't understand your comment on isolated boundaries / enclaves. Can you elaborate?
I recall that some boundaries around islands in the ocean are still mapped as ways rather than relations. The U.S. community has made an effort to upgrade them all to relations, but I’m not sure about other parts of the world.
> I had a look over the Disputed territories wiki page and it unhelpfully indicates that there were two proposals for how to tag these, both of which are rejected (though one is still used?).
Neither of those. The only tagging scheme being used for the things you’d care about is boundary=disputed along with disputed_by=*, claimed_by=*, and controlled_by=*.
> Apparently name can also be combined with arbitrary separators
The semicolon is the only delimiter that reliably separates names globally. Other arbitrary separators like slashes, dashes, or spaces ostensibly matter in some regions, but in my experience you’ll find too many exceptions to make it worthwhile.
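To make the two points above concrete, here is a rough sketch of reading the boundary=disputed tag family with semicolon-only splitting. The helper names (`split_semicolons`, `disputed_parties`) are hypothetical and not part of this PR, and the sample values are placeholder codes.

```python
from typing import Dict, List, Mapping

# Keys named in the thread as the in-use disputed boundary scheme.
CLAIM_KEYS = ("disputed_by", "claimed_by", "controlled_by")


def split_semicolons(value: str) -> List[str]:
    """Split on semicolons only; slashes, dashes, and spaces are left alone."""
    return [part.strip() for part in value.split(";") if part.strip()]


def disputed_parties(tags: Mapping[str, str]) -> Dict[str, List[str]]:
    """Collect the claim-related tags of a boundary=disputed feature."""
    if tags.get("boundary") != "disputed":
        return {}
    return {key: split_semicolons(tags[key]) for key in CLAIM_KEYS if tags.get(key)}
```

For example, `disputed_parties({"boundary": "disputed", "claimed_by": "AA; BB"})` yields a single `claimed_by` entry holding the two codes, while a value like `"A / B"` stays one string.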
src/boundaries.py
Outdated
res = {tag.k: tag.v for tag in tags if not tag.k.startswith("name:")}

# Special shape transformation for names
name_tags = {tag.k: tag.v for tag in tags if tag.k.startswith("name:")}
Some keys like name:left=*, name:right=*, name:etymology=*, name:pronunciation=*, and name:genitive=* are not exactly “multilingual” names but rather properties of names. If you have the ability to store a more deeply nested structure than just a flat key-value store, it might be a good idea to somehow associate the etymologies and pronunciations with their respective names.
I'm still figuring out how to approach name:left and name:right for this layer (discussed in another comment), but this is a good point about name:etymology and other tags! Are these tags only related to name? Or can they be namespaced like name:en:etymology? For this specific tag, I only found 2 usages, so it seems like this may only apply to name?
name:etymology=* and name:pronunciation=* are properties of the name itself, while name:left=*, name:genitive=*, etc. are other names used in more specific contexts.
There’s a bit of inconsistency about whether to put the language code before or after pronunciation or etymology. And if I had known about it earlier, I would’ve chosen to popularize name:en-fonipa=*, name:en-fonxsamp=*, etc. according to the standard instead.
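One possible way to act on this distinction is to bucket the name:* keys before building the map. The sketch below is an assumption-laden illustration: `classify_name_tags` and the two suffix sets are hypothetical, cover only the keys mentioned in this thread, and do not handle the name:en-fonipa style.

```python
from typing import Dict, Mapping, Tuple

# Properties of a name (assumption: only the two mentioned in the thread).
PROPERTY_PARTS = {"etymology", "pronunciation"}
# Other names used in more specific contexts.
CONTEXT_PARTS = {"left", "right", "genitive"}

Buckets = Tuple[Dict[str, str], Dict[str, str], Dict[str, str]]


def classify_name_tags(tags: Mapping[str, str]) -> Buckets:
    """Split name:* tags into (language variants, contextual names, properties)."""
    languages: Dict[str, str] = {}
    context_names: Dict[str, str] = {}
    properties: Dict[str, str] = {}
    for key, value in tags.items():
        if not key.startswith("name:"):
            continue
        parts = key.split(":")[1:]
        # The language code may come before or after the property part.
        if any(part in PROPERTY_PARTS for part in parts):
            properties[key] = value
        elif parts[0] in CONTEXT_PARTS:
            context_names[key] = value
        else:
            languages[key] = value
    return languages, context_names, properties
```

Checking every key segment, rather than only the first, is what lets both name:etymology and name:en:etymology land in the properties bucket.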
This is great! Thanks for putting up this PR. I think a boundaries layer is really useful.
Personally, my main use case is to use boundaries as a join or clipping polygon for other data (e.g. grab the "California" boundary from the
I was recently playing around with a
When working on multilingual names, I came to similar decisions as you, namely:
I imagine this same pattern would work well for any tag that has multilingual variants. I added
Looks pretty complete to me. Maybe look into
Thanks for the reviews!
I think @1ec5 brings up an interesting point about leveraging the rich column types, but I'm not sure what I'd call the name value itself ;) Your proposal neatly side-steps this dilemma and makes the keys correspond directly to the OSM tags, without necessarily implying language like mine did... I'll think this over a bit more this week.
Yeah, on second thought, I think this works better (especially if the "parent" key is
👍 I'll check out the implementation there.
Good call!
Yeah, my goal for this layer is to identify, in coarse terms, where you are on earth. (My specific use case is reverse geocoding.) That mostly means administrative boundaries, but this isn't always the case (
Following up on this after reviewing #19.
Both approaches have precedent, but from my perspective having worked on osm-americana/openstreetmap-americana#670, an array representation would be much friendlier to consumers. Also, I recall that the Daylight distribution represented names as arrays; maybe Overture as well.
Agreed on friendliness! It looks like Overture just uses a single string per language (keyed similarly to how we do) but that doesn't make much difference to me :P https://docs.overturemaps.org/schema/reference/places/place/
I think I'll go with the same approach as Americana and Nominatim (semicolon-only splitting) for now. I looked at the Pelias code and they also do the same (semicolon splitting on name tags). We can improve and expand it later, and this is a good starting point.
I am not applying it to all layers at the moment; just ones which I know have some precedent for using it. In particular, the disputed tag family uses semicolon delimited ISO country codes quite consistently.
Might be worth splitting out some of these "common tag sets" at some point as we build out more layer definitions. For example, names and wiki links. Maybe after another layer or 2 we can see what makes sense to have common defs for.
Finally, I think we might be able to automate away all or part of these
if s:
    return [value.strip() for value in s.split(";")]
else:
    return None
Upon further reflection, I realized that this is perhaps a bit inconsistent. We seem to encode empty maps as empty maps rather than null. Null values can be encoded more efficiently in parquet IIUC (https://parquet.apache.org/docs/file-format/nulls/), particularly for columns which are predominantly null, like most of the things where we have arrays.
I'm not sure if we need to actually make any change for map columns but I wanted to flag this for discussion either way.
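If we did want to be consistent about nulls, one option is to normalize "nothing here" to None for both arrays and maps before writing. The sketch below is not the PR's code: `split_multi_value` is a variant of the snippet above that also drops empty segments, and `none_if_empty` is a hypothetical helper.

```python
from typing import Dict, List, Optional, TypeVar

V = TypeVar("V")


def split_multi_value(s: Optional[str]) -> Optional[List[str]]:
    """Semicolon-split a tag value; missing or all-blank input becomes None."""
    if not s:
        return None
    values = [value.strip() for value in s.split(";")]
    values = [value for value in values if value]  # drop empty segments
    return values or None


def none_if_empty(mapping: Dict[str, V]) -> Optional[Dict[str, V]]:
    """Encode an empty map as None so it is written as a null value."""
    return mapping or None
```

With this, an array column never carries an empty list and a map column never carries an empty map, so the two kinds of column treat absence the same way.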
match column_name:
    case (
        "name"
        | "official_name"
        | "int_name"
        | "alt_name"
        | "disputed_by"
        | "claimed_by"
        | "controlled_by"
        | "recognized_by"
    ):
        # Multi-value fields
        return split_multi_value_field(tags.get(column_name))
    case "names" | "alt_names" | "official_names":
        # Prefixed maps
        return tags_with_prefix(
            f"{column_name[:-1]}:", tags, transform=split_multi_value_field
        )
    case _:
        # Single value fields
        return tags.get(column_name)
I've made certain columns multi-value arrays per the discussion. If we agree on this, we'll want to update the settlements layer as well to match.
Rockin'! Yeah it's a few gigabytes and is even faster than that on my server using a filtered planet PBF. I also had the same concern over a list of names initially but came to the same conclusion. DataFusion is similar in ergonomics. Of course it's still an interesting question for data consumers how THEY should use multiple names, if at all, but I think making it explicit is in everyone's best interest :)
Now that #7 is merged, I think we can start discussing other layers. This is one of the ones that initially drove that PR, and we're building this on our own internal pipelines already.
I think this is pretty straightforward, but I'll raise a few points to get more eyes on:
Convention for multilingual names
I expect this will come up in a lot of layers (highways, for example, don't currently have multilingual names but definitely need them), so we might want to put at least a little thought into it now.
Decision 1: is name a separate column or part of a map? The "name" is always the local name, and it (somewhat regrettably, for people like me :P) doesn't carry any language information. This makes it somewhat different from the other names. So I have kept name as a top-level column, and put the rest into multilingual_names. I think this acknowledges the difference without needing to decide on a well-known name, and hope there are no collisions (e.g. name:local is apparently a thing?).
Decision 2: what to use for the keys in the map? I just used the OSM tag name as the map key, but could just as easily argue for dropping the name: prefix so you could index by what you hope is an ISO language code 🤷.
Alt names
Alt names exist, and I haven't really given much thought to how we should handle those yet. I don't think we need to decide on that before merging this PR per se. But the above discussion is a prerequisite, since it should use similar naming/structure.
Anything I missed?
boundary values that would be useful to include?
Though I was surprised to see that less than half have a name tag 🤔
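As a rough illustration of the two decisions in the description, the sketch below keeps name as its own value and collects the name:* tags into a map, with a flag for the Decision 2 alternative. `split_names` and `strip_prefix` are hypothetical names for this example, not what the PR implements.

```python
from typing import Dict, Mapping, Optional, Tuple

NAME_PREFIX = "name:"


def split_names(
    tags: Mapping[str, str], strip_prefix: bool = False
) -> Tuple[Optional[str], Dict[str, str]]:
    """Return (name, multilingual_names) from an OSM tag mapping."""
    name = tags.get("name")  # Decision 1: the local name stays top-level
    multilingual_names: Dict[str, str] = {}
    for key, value in tags.items():
        if key.startswith(NAME_PREFIX):
            # Decision 2: full OSM tag as key, or just the suffix.
            map_key = key[len(NAME_PREFIX):] if strip_prefix else key
            multilingual_names[map_key] = value
    return name, multilingual_names
```

With `strip_prefix=True`, a key such as name:local ends up as plain "local" next to the language codes, which is the kind of collision the description worries about.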