Conversation
4cb6c02 to 93fc4d4 (Compare)
This commit deals with two types of error. First, it adds a Sentry log when a file is not found on S3. Second, it catches any exceptions raised when calling the code which gets the boundary review response, logs them to Sentry, but still returns a response. This seemed better than raising a 5xx in this situation, where a client might want the rest of the information in the response even if the boundary review part failed.
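Roughly, the shape of that second change is something like this (a sketch only; the handler and helper names here are illustrative, not the real ones in this codebase):

import sentry_sdk

def add_boundary_reviews(resp, uprn):
    # Boundary review data is additive: if the lookup blows up, report it to
    # Sentry and return the rest of the response rather than a 5xx.
    try:
        resp["boundary_reviews"] = get_boundary_reviews(uprn)  # hypothetical helper
    except Exception as ex:
        sentry_sdk.capture_exception(ex)
        resp["boundary_reviews"] = []
    return resp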
93fc4d4 to aa12f29 (Compare)
3d4a92d to 7904efb (Compare)
7904efb to 05a8eb4 (Compare)
        data = self.get_data_for_uprn()
        return self.query_to_dict(data)

    def data_quality_check(self, postcode_df):
This now happens inside every request/response cycle that looks for static data.
I don't think it will add much, but it feels like the wrong place for it.
Would it be better to add some checks to the state machine that run at the end of each run and do data quality assurance?
I think it is fine to run consistency checks on every request. That is what we are doing on WDIV. I definitely think we should also try to prevent errors at write time. However, as long as we're using a format that doesn't allow us to enforce constraints I think we also need to be defensive when we consume the data. This is one of the reasons I think there is mileage in looking at something like SQLite/DuckDB as the static data format. It would allow us to enforce some constraints and "trust" the data a bit more in the client code.
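To illustrate the sort of constraint a SQLite-backed format could enforce at write time (a sketch; the table and column names are made up):

import sqlite3

conn = sqlite3.connect("static_data.sqlite")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS addresses (
        uprn TEXT PRIMARY KEY,            -- duplicate UPRNs rejected at write time
        postcode TEXT NOT NULL,
        addressbase_source TEXT NOT NULL
    )
    """
)
# A duplicate insert raises sqlite3.IntegrityError when the data is built,
# instead of producing bad data we have to defend against when we read it.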
Anyway, yes. Let's check the data here before we consume it.
I think the other failure modes we should care about here are:
- The Postcode we are fetching does exist (we got a response from WDIV) but it is not in the parquet file.
- The UPRN we are fetching does exist (we got a response from WDIV) but it is not in the parquet file.
I think I would also want a notification in Sentry if either of those things happens, because it means our data is out of sync (although we can serve the "there are no applicable boundary reviews to this query" response to the user). Are those two cases captured anywhere?
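For illustration, checks for those two cases might look something like this (a sketch; the function and column names are assumptions, not the actual code):

import polars as pl
import sentry_sdk

def check_wdiv_result_in_parquet(postcode_df: pl.DataFrame, postcode, uprn=None):
    # WDIV knows about this postcode/UPRN, so the parquet file should too.
    if postcode_df.filter(pl.col("postcode") == postcode).is_empty():
        # Keep the message static so Sentry groups these into one issue.
        sentry_sdk.capture_message("Postcode known to WDIV but missing from parquet", level="warning")
        return False
    if uprn is not None and postcode_df.filter(pl.col("uprn") == uprn).is_empty():
        sentry_sdk.capture_message("UPRN known to WDIV but missing from parquet", level="warning")
        return False
    return True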
This isn't something we can do without changing the lambda that writes outcode parquet files in data baker. Specifically these lines:
if has_any_non_null_filter_column:
    print(
        f"At least one UPRN in {outcode} has data in {filter_column}, writing a file with data"
    )
    outcode_df.sort(by=["postcode", "uprn"])
    outcode_df.write_parquet(outcode_path)
else:
    print(
        f"No {filter_column} for any address in {outcode}, writing an empty file"
    )
    polars.DataFrame().write_parquet(outcode_path)

It will mean writing loads of files with empty columns, but can go with that if you think it's worth it.
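The change being discussed here would presumably mean collapsing that branch so the full frame is always written, something like this (a sketch, not the actual data baker code):

# Always write the full outcode frame, even when filter_column is entirely null,
# so consumers can tell "this UPRN has no boundary review" apart from
# "this UPRN is missing from our data".
outcode_df = outcode_df.sort(by=["postcode", "uprn"])
outcode_df.write_parquet(outcode_path)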
I think one of us isn't understanding the other on this.
Let's have a look at this one on a call together.
d2c7a10 to f5256bb (Compare)
self.BOUNDARY_REVIEWS_ENABLED = True
self.BOUNDARY_REVIEWS_DATA_KEY_PREFIX = os.environ.get(
    "BOUNDARY_REVIEWS_DATA_KEY_PREFIX",
    "current_boundary_reviews_parquet/",
| "current_boundary_reviews_parquet/", | |
| "addressbase/production/current_boundary_reviews_parquet/", |
I think this is why this was failing when you deployed it to dev.
There was a problem hiding this comment.
I don't think this is quite what is needed. I had to add f5256bb to get it working on dev. Having just the folder name means I can run

LOCAL_STATIC_DATA_PATH=~/cloud/aws/democracy-club/pollingstations.private.data python run_local_api.py --function voting_information --port 8000

to have it work locally.
Hmm, so I am running this locally with
S3_CLIENT_ENABLED=1 AWS_PROFILE=wdiv-prod WDIV_API_KEY=[redacted] python run_local_api.py --function voting_information
to query the real s3 bucket from my local copy.
For me to get this to work, I have to set BOUNDARY_REVIEWS_DATA_KEY_PREFIX to addressbase/production/current_boundary_reviews_parquet/.
With the default, every request throws FileNotFoundError.
| "boundary_review_id": "123", | ||
| "boundary_review_details": { |
A bit of bikeshedding on the API response format:
Why different keys for boundary_review_id/boundary_review_details?
Why not
"boundary_reviews": [
{
"id": "123",
"consultation_url": "https://example.com/boundary-review",
"effective_date": "2026-05-07",
"legislation_title": "The Test Boundary Order 2025",
"organisation_name": "Test Council",
"organisation_official_name": "Test Council Official",
"organisation_gss": "E09000033"
"boundary_changes": [
...
]
}
]
I think that would be neater if it makes no odds either way.
Also, shall we just put in a blank placeholder for ballots: [] in the relevant place even though we can't populate it yet?
I'd done it like that because that's how I'd ended up structuring the query in data baker. It's not too hard to change it to what you suggest, which I agree is neater. ae2ee92
OK, so fundamentally the approach here is: We solve this in the schema in this app rather than in the underlying data. I guess there's something about the way we're assembling the data that makes that representation more convenient there?
sentry_sdk.capture_exception(
    ex, context={"s3_key": key, "bucket": bucket}
)
I've not tested this, but does passing a context= kwarg to capture_exception() here work?
I think you might need to set context like this now?
Can we double-check this on dev?
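For reference, a sketch of the kind of change being suggested, assuming a current sentry_sdk where capture_exception() accepts **scope_kwargs (contexts=, tags=, etc.) rather than a context= kwarg:

import sentry_sdk

try:
    df = read_parquet_from_s3(bucket, key)  # hypothetical call that raises FileNotFoundError
except FileNotFoundError as ex:
    # Attach the context on the current scope before capturing...
    sentry_sdk.set_context("s3", {"key": key, "bucket": bucket})
    sentry_sdk.capture_exception(ex)
    # ...or pass it via the scope kwargs instead:
    # sentry_sdk.capture_exception(ex, contexts={"s3": {"key": key, "bucket": bucket}})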
I went off the docs here. They say you can either set scope or scope_kwargs (as described in Scope.update_from_kwargs). But I will deploy to dev and check.
..or you can set up your local copy with a sentry DSN. That will make it easier to deliberately trigger exceptions.
if not fixture_data:
    return pl.DataFrame()
Hmm. My gut instinct reading this is that if I write test code that is trying to fetch a fixture that doesn't exist, that seems like it should raise an exception rather than silently returning an empty DataFrame.
I did this because it mirrors what load_fixture does. Essentially this helper is a wrapper around load_fixture to return a DataFrame rather than some JSON. This line isn't really doing anything, so it can be deleted, but I thought it made the behaviour more obvious. If I delete it then, if the fixture doesn't exist, load_fixture will return [] (and pl.DataFrame().equals(pl.DataFrame([])) is True).
I could change load_fixture:
@@ -12,7 +12,7 @@ def load_fixture(testname, fixture, api_version="v1"):
         dirname / api_version / "test_data" / testname / f"{fixture}.json"
     )
     if not file_path.exists():
-        return []
+        raise FileNotFoundError(f"Could not find fixture:{fixture} at {file_path}")
     with file_path.open("r") as f:
         return json.loads(f.read())

but that breaks some other tests.
OK. I feel like that underlying behaviour in load_fixture() is probably wrong/unhelpful, and if there are specific requests we want to mock as returning [] we should explicitly write files to disk containing []. But I think pulling that thread right this second is a distraction from the core thing we're trying to accomplish in this PR.
Let's leave this for now, but I would like to revisit what load_fixture() is doing here.
f6cfc71 to 451388a (Compare)
451388a to 55b9daa (Compare)
    and not resp["address_picker"]
):
    try:
        resp["boundary_reviews"] = None
| resp["boundary_reviews"] = None | |
| resp["boundary_reviews"] = [] |
class DuplicateUPRNError(ValueError):
    def __init__(self, postcode, uprns):
        message = (
            f"Duplicate UPRNs found for postcode {postcode}: {sorted(uprns)}"
        )
        super().__init__(message)


class MultipleAddressbaseSourceError(ValueError):
    def __init__(self, postcode, sources):
        message = f"Multiple addressbase sources found for postcode {postcode}: {sources}"
        super().__init__(message)
Including the postcode/UPRN in the exception message means Sentry probably won't group all DuplicateUPRNErrors together.
i.e. if we throw this 500 times for different UPRNs then Sentry will probably consider that 500 completely different issues instead of 1 issue with 500 events.
This will be very annoying.
There are multiple ways to skin this cat. One of them is to set a fingerprint based on exception class. So you can do something like this:
https://docs.sentry.io/platforms/python/usage/sdk-fingerprinting/#group-errors-more-aggressively
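A minimal sketch of that docs approach, assuming the fingerprint is set in a before_send hook passed to sentry_sdk.init() (DuplicateUPRNError here is the exception class added in this PR):

import sentry_sdk

def before_send(event, hint):
    # Group every DuplicateUPRNError into a single Sentry issue, regardless of
    # the postcode/UPRNs baked into the message.
    if "exc_info" in hint:
        _, exc_value, _ = hint["exc_info"]
        if isinstance(exc_value, DuplicateUPRNError):
            event["fingerprint"] = ["duplicate-uprn-error"]
    return event

sentry_sdk.init(dsn="...", before_send=before_send)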
..or in WDIV, I set the fingerprints at log time e.g:
Another way to do it is something like
class DuplicateUPRNError(ValueError):
    def __init__(self, postcode, uprns):
        self.postcode = postcode
        self.uprns = sorted(uprns)
        # static — Sentry groups on this
        super().__init__("Duplicate UPRNs found")

    def __str__(self):
        # human-readable for local dev
        return f"Duplicate UPRNs found for postcode {self.postcode}: {self.uprns}"

# ..and when we raise it (inside whatever method currently raises this, where
# self.postcode and duplicate_uprns are in scope):
with sentry_sdk.push_scope() as scope:
    # attach extras for sentry
    scope.set_extra("postcode", self.postcode.with_space)
    scope.set_extra("uprns", duplicate_uprns)
    raise DuplicateUPRNError(postcode=self.postcode.with_space, uprns=duplicate_uprns)

I've not tested that code, but it should be.. roughly right. Can you try setting up a sentry DSN locally and have a go with one or other of these approaches?