
Boundary changes endpoint. #629

Open
GeoWill wants to merge 18 commits into master from boundary_changes

Conversation

@GeoWill
Contributor

@GeoWill GeoWill commented Jan 13, 2026

No description provided.

This commit deals with two types of error.
First, it adds a Sentry log when a file is not found on S3.
Second, it catches any exceptions raised when calling the code which gets
the boundary review response. It then logs this to Sentry, but still
returns a response. This seemed better than raising a 5xx in this
situation, where a client might want the rest of the information in the
response, even if the boundary change lookup failed.
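
A minimal sketch of that pattern, with the call site and response dict as hypothetical stand-ins for the real code:

import sentry_sdk


def add_boundary_reviews(resp: dict, uprn: str) -> dict:
    # Sketch of the error handling described above: if building the boundary
    # review section fails, log to Sentry and return the rest of the response.
    try:
        resp["boundary_reviews"] = get_boundary_review_response(uprn)  # hypothetical call
    except Exception as ex:
        sentry_sdk.capture_exception(ex)
        resp["boundary_reviews"] = None
    return resp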
@GeoWill GeoWill marked this pull request as ready for review February 3, 2026 13:14
@GeoWill GeoWill changed the title WiP models for boundary changes endpoint. Boundary changes endpoint. Feb 3, 2026
        data = self.get_data_for_uprn()
        return self.query_to_dict(data)

    def data_quality_check(self, postcode_df):
Contributor Author

This now happens inside every request/response cycle that looks for static data.
I don't think it will add much overhead, but it feels like the wrong place for it.

Would it be better to try to add some checks to the state machine that run at the end of each run and do data quality assurance?

Member

I think it is fine to run consistency checks on every request. That is what we are doing on WDIV. I definitely think we should also try to prevent errors at write time. However, as long as we're using a format that doesn't allow us to enforce constraints I think we also need to be defensive when we consume the data. This is one of the reasons I think there is mileage in looking at something like SQLite/DuckDB as the static data format. It would allow us to enforce some constraints and "trust" the data a bit more in the client code.

Anyway, yes. Let's check the data here before we consume it.

I think the other failure modes we should care about here are:

  • The Postcode we are fetching does exist (we got a response from WDIV) but it is not in the parquet file.
  • The UPRN we are fetching does exist (we got a response from WDIV) but it is not in the parquet file.

I think I would also want a notification in Sentry if either of those things happen because it means our data is out of sync (although we can serve the "there are no applicable boundary reviews to this query" response to the user). Are those two cases captured anywhere?
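
A rough sketch of what capturing those two cases could look like, assuming the outcode parquet is loaded into a polars DataFrame with postcode/uprn columns (the function and column names are assumptions):

import polars as pl
import sentry_sdk


def check_wdiv_result_in_parquet(postcode: str, uprn: str, outcode_df: pl.DataFrame) -> None:
    # WDIV answered for this postcode/UPRN, so we expect matching rows in the
    # parquet data; notify Sentry if they are missing so we know data is out of sync.
    if outcode_df.filter(pl.col("postcode") == postcode).is_empty():
        with sentry_sdk.push_scope() as scope:
            scope.set_extra("postcode", postcode)
            sentry_sdk.capture_message("Postcode returned by WDIV but missing from parquet")
    elif outcode_df.filter(pl.col("uprn") == uprn).is_empty():
        with sentry_sdk.push_scope() as scope:
            scope.set_extra("uprn", uprn)
            sentry_sdk.capture_message("UPRN returned by WDIV but missing from parquet")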

Contributor Author

This isn't something we can do without changing the lambda that writes outcode parquet files in data baker. Specifically these lines:

        if has_any_non_null_filter_column:
            print(
                f"At least one UPRN in {outcode} has data in {filter_column}, writing a file with data"
            )
            outcode_df.sort(by=["postcode", "uprn"])
            outcode_df.write_parquet(outcode_path)
        else:
            print(
                f"No {filter_column} for any address in {outcode}, writing an empty file"
            )
            polars.DataFrame().write_parquet(outcode_path)

It will mean writing loads of files with empty columns, but we can go with that if you think it's worth it.
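
For illustration only, a hedged sketch of the alternative being weighed up here (always writing the full-schema frame, which is not what the lambda currently does):

# Always write the outcode frame with its full schema, even when filter_column
# is null for every row. An "empty" outcode then still lists its postcodes and
# UPRNs, so the API can tell "not in our data" apart from "no applicable reviews".
outcode_df = outcode_df.sort(by=["postcode", "uprn"])
outcode_df.write_parquet(outcode_path)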

Member

I think one of us isn't understanding the other on this.
Let's have a look at this one on a call together.

self.BOUNDARY_REVIEWS_ENABLED = True
self.BOUNDARY_REVIEWS_DATA_KEY_PREFIX = os.environ.get(
    "BOUNDARY_REVIEWS_DATA_KEY_PREFIX",
    "current_boundary_reviews_parquet/",
Member

Suggested change
-    "current_boundary_reviews_parquet/",
+    "addressbase/production/current_boundary_reviews_parquet/",

I think this is why this was failing when you deployed it to dev.

Contributor Author

I don't think this is quite what is needed. I had to add f5256bb to get it working on dev. Having just the folder name means I can run

LOCAL_STATIC_DATA_PATH=~/cloud/aws/democracy-club/pollingstations.private.data python run_local_api.py --function voting_information --port 8000

to have it work locally.

Member

Hmm, so I am running this locally with

S3_CLIENT_ENABLED=1 AWS_PROFILE=wdiv-prod WDIV_API_KEY=[redacted] python run_local_api.py --function voting_information

to query the real s3 bucket from my local copy.

For me to get this to work, I have to make the BOUNDARY_REVIEWS_DATA_KEY_PREFIX equal to addressbase/production/current_boundary_reviews_parquet/

With the default, every request throws FileNotFoundError

Comment on lines 214 to 215
"boundary_review_id": "123",
"boundary_review_details": {
Member

A bit of bikeshedding on the API response format:
Why different keys for boundary_review_id/boundary_review_details?

Why not

  "boundary_reviews": [
    {
      "id": "123",
      "consultation_url": "https://example.com/boundary-review",
      "effective_date": "2026-05-07",
      "legislation_title": "The Test Boundary Order 2025",
      "organisation_name": "Test Council",
      "organisation_official_name": "Test Council Official",
      "organisation_gss": "E09000033"
      "boundary_changes": [
        ...
      ]
    }
  ]

I think that would be neater if it makes no odds either way.

Also, shall we just put in a blank placeholder for ballots: [] in the relevant place even though we can't populate it yet?

Contributor Author

I'd done it like that because that's how I'd ended up structuring the query in data baker. It's not too hard to change it to what you suggest, which I agree is neater. ae2ee92

Member

OK, so fundamentally the approach here is: We solve this in the schema in this app rather than in the underlying data. I guess there's something about the way we're assembling the data that makes that representation more convenient there?

Comment on lines +135 to +137
sentry_sdk.capture_exception(
    ex, context={"s3_key": key, "bucket": bucket}
)
Member

I've not tested this, but does passing a context= kwarg to capture_exception() here work?
I think you might need to set context like this now?
Can we double-check this on dev?

https://github.com/DemocracyClub/UK-Polling-Stations/blob/76fd6a1947c1f6df6a140386f5677eb5e769cb87/polling_stations/apps/data_finder/helpers/baked_data_helper.py#L75-L91
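
For comparison, a hedged sketch of the scope-based form (read_parquet_from_s3 is a hypothetical stand-in, not the helper in this PR):

import sentry_sdk

try:
    df = read_parquet_from_s3(bucket, key)  # hypothetical helper
except FileNotFoundError as ex:
    with sentry_sdk.push_scope() as scope:
        scope.set_context("s3", {"s3_key": key, "bucket": bucket})
        sentry_sdk.capture_exception(ex)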

Contributor Author
@GeoWill GeoWill Feb 17, 2026

I went off the docs here. They say you can either set scope or scope_kwargs (as described in Scope.update_from_kwargs). But I will deploy to dev and check.
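
If the kwarg route is what ends up being used, a hedged sketch assuming the kwargs are forwarded to Scope.update_from_kwargs (which takes contexts, plural, rather than context):

sentry_sdk.capture_exception(
    ex, contexts={"s3": {"s3_key": key, "bucket": bucket}}
)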

Member

..or you can set up your local copy with a sentry DSN. That will make it easier to deliberately trigger exceptions.
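
A minimal local-dev sketch, assuming a SENTRY_DSN environment variable (the variable name is an assumption):

import os

import sentry_sdk

# Initialise the SDK only when a DSN is provided, so deliberately triggered
# exceptions show up in a personal Sentry project during local testing.
if dsn := os.environ.get("SENTRY_DSN"):
    sentry_sdk.init(dsn=dsn)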

Comment on lines +78 to +79
if not fixture_data:
    return pl.DataFrame()
Member

Hmm. My gut instinct reading this is that if I write test code that is trying to fetch a fixture that doesn't exist, that seems like it should raise an exception rather than silently returning an empty DataFrame.

Contributor Author

I did this because it mirrors what load_fixture does. Essentially this helper is a wrapper around load_fixture that returns a DataFrame rather than some JSON. This line isn't really doing anything, so it can be deleted, but I thought it made the behaviour more obvious. If I delete it, then if the fixture doesn't exist load_fixture will return [] (and pl.DataFrame().equals(pl.DataFrame([])) is True).

I could change load_fixture:

@@ -12,7 +12,7 @@ def load_fixture(testname, fixture, api_version="v1"):
         dirname / api_version / "test_data" / testname / f"{fixture}.json"
     )
     if not file_path.exists():
-        return []
+        raise FileNotFoundError(f"Could not find fixture:{fixture} at {file_path}")
     with file_path.open("r") as f:
         return json.loads(f.read())

but that breaks some other tests.
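
For context, a hedged sketch of the wrapper being described, with the path handling copied loosely from the diff above (names are assumptions):

import json
from pathlib import Path

import polars as pl

dirname = Path(__file__).parent  # assumption: mirrors what load_fixture uses


def load_fixture_df(testname, fixture, api_version="v1"):
    # Wrapper around the JSON fixture loader that returns a polars DataFrame,
    # raising when the fixture is missing instead of silently returning [].
    file_path = dirname / api_version / "test_data" / testname / f"{fixture}.json"
    if not file_path.exists():
        raise FileNotFoundError(f"Could not find fixture: {fixture} at {file_path}")
    with file_path.open("r") as f:
        return pl.DataFrame(json.loads(f.read()))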

Member

OK. I feel like that underlying behaviour in load_fixture() is probably wrong/unhelpful, and if there are specific requests we want to mock as returning [] we should explicitly write files to disk containing [], but I think pulling that thread right this second is a distraction from the core thing we're trying to accomplish in this PR.

Let's leave this for now, but I would like to revisit what load_fixture() is doing here.


    and not resp["address_picker"]
):
    try:
        resp["boundary_reviews"] = None
Member

Suggested change
-        resp["boundary_reviews"] = None
+        resp["boundary_reviews"] = []

Comment on lines +205 to +218
class DuplicateUPRNError(ValueError):
    def __init__(self, postcode, uprns):
        message = (
            f"Duplicate UPRNs found for postcode {postcode}: {sorted(uprns)}"
        )
        super().__init__(message)


class MultipleAddressbaseSourceError(ValueError):
    def __init__(self, postcode, sources):
        message = f"Multiple addressbase sources found for postcode {postcode}: {sources}"
        super().__init__(message)


Member

Including the postcode/UPRN in the exception message means Sentry probably won't group all DuplicateUPRNErrors together,
i.e. if we throw this 500 times for different UPRNs then Sentry will probably consider that 500 completely different issues instead of 1 issue with 500 events.
This will be very annoying.

There are multiple ways to skin this cat. One of them is to set a fingerprint based on exception class. So you can do something like this:

https://docs.sentry.io/platforms/python/usage/sdk-fingerprinting/#group-errors-more-aggressively
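
A rough sketch of that first option, using a before_send hook as the linked docs describe (untested):

import sentry_sdk


def before_send(event, hint):
    # Group every DuplicateUPRNError under one fingerprint so Sentry shows a
    # single issue with many events instead of one issue per postcode/UPRN.
    if "exc_info" in hint:
        exc_type, exc_value, tb = hint["exc_info"]
        if exc_type.__name__ == "DuplicateUPRNError":
            event["fingerprint"] = ["duplicate-uprn-error"]
    return event


sentry_sdk.init(dsn="...", before_send=before_send)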

...or in WDIV, I set the fingerprints at log time, e.g.:

https://github.com/DemocracyClub/UK-Polling-Stations/blob/b0a46a1df2591d9280b7b1d2f93e3ed32223b4c0/polling_stations/apps/data_finder/helpers/baked_data_helper.py#L160-L190

Another way to do it is something like

class DuplicateUPRNError(ValueError):
    def __init__(self, postcode, uprns):
        self.postcode = postcode
        self.uprns = sorted(uprns)

        # static — Sentry groups on this
        super().__init__("Duplicate UPRNs found")

    def __str__(self):
        # human-readable for local dev
        return f"Duplicate UPRNs found for postcode {self.postcode}: {self.uprns}"


# ..and when we raise it:
with sentry_sdk.push_scope() as scope:
    # attach extras for sentry
    scope.set_extra("postcode", self.postcode.with_space)
    scope.set_extra("uprns", duplicate_uprns)
    raise DuplicateUPRNError(postcode=self.postcode.with_space, uprns=duplicate_uprns)

I've not tested that code, but it should be roughly right. Can you try setting up a Sentry DSN locally and have a go with one or other of these approaches?
