Conversation
4cb6c02 to 93fc4d4 (Compare)
This commit deals with two types of error. First, it adds a Sentry log when a file is not found on S3. Second, it catches any exceptions raised when calling the code which gets the boundary review response, logs them to Sentry, but still returns a response. This seemed better than raising a 5xx in this situation, where a client might want the rest of the information in the response even if the boundary review part failed.
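Roughly, the shape of that second change is something like this (a sketch only; the handler and helper names here are illustrative, not the real ones in this codebase):

import sentry_sdk

def add_boundary_reviews(resp, uprn):
    # Boundary review data is additive: if the lookup blows up, report it to
    # Sentry and return the rest of the response rather than a 5xx.
    try:
        resp["boundary_reviews"] = get_boundary_reviews(uprn)  # hypothetical helper
    except Exception as ex:
        sentry_sdk.capture_exception(ex)
        resp["boundary_reviews"] = []
    return resp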
93fc4d4 to aa12f29 (Compare)
3d4a92d to 7904efb (Compare)
7904efb to 05a8eb4 (Compare)
        data = self.get_data_for_uprn()
        return self.query_to_dict(data)

    def data_quality_check(self, postcode_df):
This now happens inside every request/response cycle that looks for static data.
I don't think it will add much, but it feels like the wrong place for it.
Would it be better to add some checks to the state machine that run at the end of each run and do data quality assurance?
I think it is fine to run consistency checks on every request. That is what we are doing on WDIV. I definitely think we should also try to prevent errors at write time. However, as long as we're using a format that doesn't allow us to enforce constraints I think we also need to be defensive when we consume the data. This is one of the reasons I think there is mileage in looking at something like SQLite/DuckDB as the static data format. It would allow us to enforce some constraints and "trust" the data a bit more in the client code.
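To illustrate the sort of constraint a SQLite-backed format could enforce at write time (a sketch; the table and column names are made up):

import sqlite3

conn = sqlite3.connect("static_data.sqlite")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS addresses (
        uprn TEXT PRIMARY KEY,            -- duplicate UPRNs rejected at write time
        postcode TEXT NOT NULL,
        addressbase_source TEXT NOT NULL
    )
    """
)
# A duplicate insert raises sqlite3.IntegrityError when the data is built,
# instead of producing bad data we have to defend against when we read it.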
Anyway, yes. Let's check the data here before we consume it.
I think the other failure modes we should care about here are:
- The Postcode we are fetching does exist (we got a response from WDIV) but it is not in the parquet file.
- The UPRN we are fetching does exist (we got a response from WDIV) but it is not in the parquet file.
I think I would also want a notification in Sentry if either of those things happens, because it means our data is out of sync (although we can serve the "there are no applicable boundary reviews to this query" response to the user). Are those two cases captured anywhere?
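For illustration, checks for those two cases might look something like this (a sketch; the function and column names are assumptions, not the actual code):

import polars as pl
import sentry_sdk

def check_wdiv_result_in_parquet(postcode_df: pl.DataFrame, postcode, uprn=None):
    # WDIV knows about this postcode/UPRN, so the parquet file should too.
    if postcode_df.filter(pl.col("postcode") == postcode).is_empty():
        # Keep the message static so Sentry groups these into one issue.
        sentry_sdk.capture_message("Postcode known to WDIV but missing from parquet", level="warning")
        return False
    if uprn is not None and postcode_df.filter(pl.col("uprn") == uprn).is_empty():
        sentry_sdk.capture_message("UPRN known to WDIV but missing from parquet", level="warning")
        return False
    return True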
This isn't something we can do without changing the lambda that writes outcode parquet files in data baker. Specifically these lines:
if has_any_non_null_filter_column:
    print(
        f"At least one UPRN in {outcode} has data in {filter_column}, writing a file with data"
    )
    outcode_df.sort(by=["postcode", "uprn"])
    outcode_df.write_parquet(outcode_path)
else:
    print(
        f"No {filter_column} for any address in {outcode}, writing an empty file"
    )
    polars.DataFrame().write_parquet(outcode_path)

It will mean writing loads of files with empty columns, but can go with that if you think it's worth it.
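The change being discussed here would presumably mean collapsing that branch so the full frame is always written, something like this (a sketch, not the actual data baker code):

# Always write the full outcode frame, even when filter_column is entirely null,
# so consumers can tell "this UPRN has no boundary review" apart from
# "this UPRN is missing from our data".
outcode_df = outcode_df.sort(by=["postcode", "uprn"])
outcode_df.write_parquet(outcode_path)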
I think one of us isn't understanding the other on this.
Let's have a look at this one on a call together.
d2c7a10 to f5256bb (Compare)
self.BOUNDARY_REVIEWS_ENABLED = True
self.BOUNDARY_REVIEWS_DATA_KEY_PREFIX = os.environ.get(
    "BOUNDARY_REVIEWS_DATA_KEY_PREFIX",
    "current_boundary_reviews_parquet/",
| "current_boundary_reviews_parquet/", | |
| "addressbase/production/current_boundary_reviews_parquet/", |
I think this is why this was failing when you deployed it to dev.
There was a problem hiding this comment.
I don't think this is quite what is needed. I had to add f5256bb to get it working on dev. Having just the folder name means I can run

LOCAL_STATIC_DATA_PATH=~/cloud/aws/democracy-club/pollingstations.private.data python run_local_api.py --function voting_information --port 8000

to have it work locally.
Hmm, so I am running this locally with
S3_CLIENT_ENABLED=1 AWS_PROFILE=wdiv-prod WDIV_API_KEY=[redacted] python run_local_api.py --function voting_information
to query the real s3 bucket from my local copy.
For me to get this to work, I have to set BOUNDARY_REVIEWS_DATA_KEY_PREFIX to addressbase/production/current_boundary_reviews_parquet/.
With the default, every request throws FileNotFoundError.
| "boundary_review_id": "123", | ||
| "boundary_review_details": { |
A bit of bikeshedding on the API response format:
Why different keys for boundary_review_id/boundary_review_details?
Why not
"boundary_reviews": [
{
"id": "123",
"consultation_url": "https://example.com/boundary-review",
"effective_date": "2026-05-07",
"legislation_title": "The Test Boundary Order 2025",
"organisation_name": "Test Council",
"organisation_official_name": "Test Council Official",
"organisation_gss": "E09000033"
"boundary_changes": [
...
]
}
]
I think that would be neater if it makes no odds either way.
Also, shall we just put in a blank placeholder for ballots: [] in the relevant place even though we can't populate it yet?
I'd done it like that because that's how I'd ended up structuring the query in data baker. It's not too hard to change it to what you suggest, which I agree is neater. ae2ee92
OK, so fundamentally the approach here is: We solve this in the schema in this app rather than in the underlying data. I guess there's something about the way we're assembling the data that makes that representation more convenient there?
sentry_sdk.capture_exception(
    ex, context={"s3_key": key, "bucket": bucket}
)
I've not tested this, but does passing a context= kwarg to capture_exception() here work?
I think you might need to set context like this now?
Can we double-check this on dev?
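For reference, a sketch of the kind of change being suggested, assuming a current sentry_sdk where capture_exception() accepts **scope_kwargs (contexts=, tags=, etc.) rather than a context= kwarg:

import sentry_sdk

try:
    df = read_parquet_from_s3(bucket, key)  # hypothetical call that raises FileNotFoundError
except FileNotFoundError as ex:
    # Attach the context on the current scope before capturing...
    sentry_sdk.set_context("s3", {"key": key, "bucket": bucket})
    sentry_sdk.capture_exception(ex)
    # ...or pass it via the scope kwargs instead:
    # sentry_sdk.capture_exception(ex, contexts={"s3": {"key": key, "bucket": bucket}})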
I went off the docs here. They say you can either set scope or scope_kwargs (as described in Scope.update_from_kwargs). But I will deploy to dev and check.
..or you can set up your local copy with a sentry DSN. That will make it easier to deliberately trigger exceptions.
if not fixture_data:
    return pl.DataFrame()
Hmm. My gut instinct reading this is that if I write test code that is trying to fetch a fixture that doesn't exist, that seems like it should raise an exception rather than silently returning an empty DataFrame.
I did this because it mirrors what load_fixture does. Essentially this helper is a wrapper around load_fixture to return a DataFrame rather than some JSON. This line isn't really doing anything, so it can be deleted, but I thought it made the behaviour more obvious. If I delete it then, if the fixture doesn't exist, load_fixture will return [] (and pl.DataFrame().equals(pl.DataFrame([])) is True).
I could change load_fixture:
@@ -12,7 +12,7 @@ def load_fixture(testname, fixture, api_version="v1"):
         dirname / api_version / "test_data" / testname / f"{fixture}.json"
     )
     if not file_path.exists():
-        return []
+        raise FileNotFoundError(f"Could not find fixture:{fixture} at {file_path}")
     with file_path.open("r") as f:
         return json.loads(f.read())

but that breaks some other tests.
OK. I feel like that underlying behaviour in load_fixture() is probably wrong/unhelpful, and if there are specific requests we want to mock as returning [] we should explicitly write files to disk containing []. But I think pulling that thread right this second is a distraction from the core thing we're trying to accomplish in this PR.
Let's leave this for now, but I would like to revisit what load_fixture() is doing here.
f6cfc71 to 451388a (Compare)
451388a to 55b9daa (Compare)
    and not resp["address_picker"]
):
    try:
        resp["boundary_reviews"] = None
| resp["boundary_reviews"] = None | |
| resp["boundary_reviews"] = [] |
class DuplicateUPRNError(ValueError):
    def __init__(self, postcode, uprns):
        message = (
            f"Duplicate UPRNs found for postcode {postcode}: {sorted(uprns)}"
        )
        super().__init__(message)


class MultipleAddressbaseSourceError(ValueError):
    def __init__(self, postcode, sources):
        message = f"Multiple addressbase sources found for postcode {postcode}: {sources}"
        super().__init__(message)
Including the postcode/UPRN in the exception message means Sentry probably won't group all DuplicateUPRNErrors together.
i.e. if we throw this 500 times for different UPRNs then Sentry will probably consider that 500 completely different issues instead of 1 issue with 500 events.
This will be very annoying.
There are multiple ways to skin this cat. One of them is to set a fingerprint based on exception class. So you can do something like this:
https://docs.sentry.io/platforms/python/usage/sdk-fingerprinting/#group-errors-more-aggressively
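A minimal sketch of that docs approach, assuming the fingerprint is set in a before_send hook passed to sentry_sdk.init() (DuplicateUPRNError here is the exception class added in this PR):

import sentry_sdk

def before_send(event, hint):
    # Group every DuplicateUPRNError into a single Sentry issue, regardless of
    # the postcode/UPRNs baked into the message.
    if "exc_info" in hint:
        _, exc_value, _ = hint["exc_info"]
        if isinstance(exc_value, DuplicateUPRNError):
            event["fingerprint"] = ["duplicate-uprn-error"]
    return event

sentry_sdk.init(dsn="...", before_send=before_send)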
..or in WDIV, I set the fingerprints at log time e.g:
Another way to do it is something like
class DuplicateUPRNError(ValueError):
    def __init__(self, postcode, uprns):
        self.postcode = postcode
        self.uprns = sorted(uprns)
        # static — Sentry groups on this
        super().__init__("Duplicate UPRNs found")

    def __str__(self):
        # human-readable for local dev
        return f"Duplicate UPRNs found for postcode {self.postcode}: {self.uprns}"

# ..and when we raise it (inside whatever method currently raises this, where
# self.postcode and duplicate_uprns are in scope):
with sentry_sdk.push_scope() as scope:
    # attach extras for sentry
    scope.set_extra("postcode", self.postcode.with_space)
    scope.set_extra("uprns", duplicate_uprns)
    raise DuplicateUPRNError(postcode=self.postcode.with_space, uprns=duplicate_uprns)

I've not tested that code, but it should be.. roughly right. Can you try setting up a sentry DSN locally and have a go with one or other of these approaches?