-
Notifications
You must be signed in to change notification settings - Fork 0
Phase 2 of API improvements #93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughAdds a semaphore-based ConcurrencyLimiter; tightens ParamValidator with a pagination-depth guard and facet validation; rewrites QueryBuilder to use multi_match and set track_total_hits to 10000; updates many tests (route timeouts, typed ActorSystem, ConcurrencyLimiter tests); fixes test file reading to use UTF-8 and close sources. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~30 minutes
Pre-merge checks and finishing touches❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/test/scala/dpla/api/helpers/FileReader.scala (1)
8-10: Resource handling is correct;Usingcould simplify (if Scala 2.13+).The explicit UTF‑8 codec and
try/finallywithbuffered.close()look correct and fix the resource‑leak risk. If the project is on Scala 2.13+, you could optionally rewrite this withscala.util.Usingfor a more idiomatic one-liner, but it’s not required.src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (1)
184-194: Consider removing explicit cast.The
.asInstanceOf[JsArray]cast is safe but could be avoided by using the type system.- private def filterQuery(filters: Seq[Filter]): JsArray = - filters - .map(filter => { - JsObject( - "term" -> JsObject( - filter.fieldName -> filter.value.toJson - ) - ) - }) - .toJson - .asInstanceOf[JsArray] + private def filterQuery(filters: Seq[Filter]): JsArray = + JsArray(filters.map { filter => + JsObject( + "term" -> JsObject( + filter.fieldName -> filter.value.toJson + ) + ) + }: _*)src/main/scala/dpla/api/v2/search/ConcurrencyLimiter.scala (1)
55-72: Narrow exception handling when evaluatingfand consider documenting interruption semanticsTwo small robustness points in
apply:
- When evaluating
f, you currently catchThrowable. In Scala it’s more idiomatic to catchNonFatalso that truly fatal errors (e.g.,VirtualMachineError,ThreadDeath) are not transformed into failedFutures:+import scala.util.control.NonFatal ... - } else { - try { - val future = f - future.andThen { case _ => semaphore.release() }(ec) - } catch { - case e: Throwable => - semaphore.release() - Future.failed(e) - } - } + } else { + try { + val future = f + future.andThen { case _ => semaphore.release() }(ec) + } catch { + case NonFatal(e) => + semaphore.release() + Future.failed(e) + } + }
- If the thread is interrupted while waiting in
semaphore.tryAcquire(timeoutSeconds, TimeUnit.SECONDS), anInterruptedExceptionwill currently bubble out ofapplyinstead of becoming aFuturefailure (and the interrupt flag isn’t restored). It might be worth deciding and documenting whether you want to:
- Let that propagate as-is (and mention it in the Scaladoc), or
- Catch it and convert to a failed
Future(possibly re‑interrupting the thread).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
src/main/scala/dpla/api/v2/search/ConcurrencyLimiter.scala(1 hunks)src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala(18 hunks)src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala(11 hunks)src/test/scala/dpla/api/helpers/FileReader.scala(1 hunks)src/test/scala/dpla/api/v2/endToEnd/InvalidParamsTest.scala(8 hunks)src/test/scala/dpla/api/v2/endToEnd/MapperFailureTest.scala(4 hunks)src/test/scala/dpla/api/v2/httpHeadersAndMethods/HeaderAuthorizationTest.scala(2 hunks)src/test/scala/dpla/api/v2/httpHeadersAndMethods/PermittedMediaTypesTest.scala(2 hunks)src/test/scala/dpla/api/v2/httpHeadersAndMethods/ResponseHeadersTest.scala(2 hunks)src/test/scala/dpla/api/v2/search/ConcurrencyLimiterTest.scala(1 hunks)src/test/scala/dpla/api/v2/search/paramValidators/ItemParamValidatorTest.scala(2 hunks)src/test/scala/dpla/api/v2/search/paramValidators/ParamValidatorTest.scala(27 hunks)src/test/scala/dpla/api/v2/search/queryBuilders/QueryBuilderTest.scala(23 hunks)
🧰 Additional context used
🧬 Code graph analysis (9)
src/test/scala/dpla/api/v2/httpHeadersAndMethods/PermittedMediaTypesTest.scala (2)
src/test/scala/dpla/api/v2/search/MockEbookSearch.scala (1)
MockEbookSearch(11-57)src/test/scala/dpla/api/v2/registry/MockEbookRegistry.scala (1)
MockEbookRegistry(11-30)
src/test/scala/dpla/api/v2/search/ConcurrencyLimiterTest.scala (1)
src/main/scala/dpla/api/v2/search/ConcurrencyLimiter.scala (2)
availablePermits(78-78)ConcurrencyLimitExceeded(84-90)
src/test/scala/dpla/api/v2/endToEnd/MapperFailureTest.scala (1)
src/test/scala/dpla/api/v2/registry/MockEbookRegistry.scala (1)
MockEbookRegistry(11-30)
src/test/scala/dpla/api/v2/search/queryBuilders/QueryBuilderTest.scala (3)
src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala (5)
search(19-33)search(35-37)search(39-41)search(43-46)search(48-51)src/main/scala/dpla/api/v2/search/mappings/JsonFieldReader.scala (6)
readObject(16-29)readString(34-37)readObjectArray(31-34)readInt(40-43)readStringArray(37-40)readBoolean(43-49)src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (2)
query(128-161)filterQuery(184-205)
src/test/scala/dpla/api/v2/endToEnd/InvalidParamsTest.scala (6)
src/test/scala/dpla/api/v2/registry/MockEbookRegistry.scala (1)
MockEbookRegistry(11-30)src/test/scala/dpla/api/v2/registry/MockSmrRegistry.scala (1)
MockSmrRegistry(10-29)src/test/scala/dpla/api/v2/search/MockEbookSearch.scala (1)
MockEbookSearch(11-57)src/test/scala/dpla/api/v2/smr/MockSmrRequestHandler.scala (1)
MockSmrRequestHandler(8-42)src/main/scala/dpla/api/v2/smr/SmrProtocol.scala (1)
SmrProtocol(6-44)src/main/scala/dpla/api/v2/smr/SmrParamValidator.scala (1)
validService(85-90)
src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala (2)
src/main/scala/dpla/api/v2/search/SearchProtocol.scala (14)
search(61-64)search(66-70)search(72-75)search(77-80)search(82-86)search(88-91)search(93-97)search(99-104)search(106-109)search(111-115)search(117-121)search(123-127)search(129-132)search(134-138)src/main/scala/dpla/api/v2/search/models/FieldDefinitions.scala (5)
TextField(7-7)IntField(9-9)URLField(11-11)DateField(13-13)WildcardField(17-17)
src/test/scala/dpla/api/v2/httpHeadersAndMethods/HeaderAuthorizationTest.scala (2)
src/test/scala/dpla/api/v2/search/MockEbookSearch.scala (1)
MockEbookSearch(11-57)src/test/scala/dpla/api/v2/registry/MockEbookRegistry.scala (1)
MockEbookRegistry(11-30)
src/test/scala/dpla/api/v2/httpHeadersAndMethods/ResponseHeadersTest.scala (3)
src/test/scala/dpla/api/v2/registry/MockEbookRegistry.scala (1)
MockEbookRegistry(11-30)src/test/scala/dpla/api/v2/registry/MockPssRegistry.scala (1)
MockPssRegistry(11-30)src/test/scala/dpla/api/v2/search/MockEbookSearch.scala (1)
MockEbookSearch(11-57)
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (1)
src/main/scala/dpla/api/v2/search/models/FieldDefinitions.scala (1)
getElasticSearchField(64-67)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Agent
🔇 Additional comments (26)
src/test/scala/dpla/api/v2/endToEnd/MapperFailureTest.scala (1)
17-18: Good addition of test timeout.The explicit 3-second timeout prevents tests from hanging indefinitely and is appropriate for mock-based end-to-end tests.
Also applies to: 33-34
src/test/scala/dpla/api/v2/httpHeadersAndMethods/HeaderAuthorizationTest.scala (1)
17-18: LGTM! Route test timeout added for test stability.The
RouteTestTimeout(3.seconds)addition helps prevent test flakiness by providing an explicit timeout. This is a good practice for Akka HTTP route tests.Also applies to: 33-34
src/test/scala/dpla/api/v2/httpHeadersAndMethods/PermittedMediaTypesTest.scala (1)
17-18: LGTM! Consistent timeout configuration with other test files.The route test timeout addition follows the same pattern established in other test files in this PR.
Also applies to: 33-34
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (3)
110-110: Verifytrack_total_hitsvalue aligns with pagination depth.The value
10000is a reasonable optimization to avoid counting all documents. However, given thatParamValidatornow limits pagination to 1000 items (fromValue + pageSize > 1000), you could consider lowering this to1000or keeping it at10000for safety margin. The current value works correctly but is 10x the maximum accessible records.
163-178: Good migration fromquery_stringtomulti_match.Using
multi_matchwithbest_fieldsandlenient: trueis a safer and more predictable approach thanquery_string. This avoids potential Lucene query syntax injection issues while maintaining effective full-text search capability.
113-123: Verify reduced field set meets search requirements.The field set has been curated to high-value fields with reasonable boost weights. Ensure that removing previously included fields doesn't negatively impact search recall for your use cases.
src/test/scala/dpla/api/v2/httpHeadersAndMethods/ResponseHeadersTest.scala (1)
20-21: LGTM! Consistent test infrastructure updates.Route test timeout added following the same pattern as other test files in this PR.
Also applies to: 38-39
src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala (3)
125-132: Breaking change: Stricter pagination and facet limits.The limits have been significantly reduced:
maxFacetSize: 2000 → 200maxPage: 100 → 10maxPageSize: 500 → 100Ensure API consumers are aware of these changes, as existing integrations using higher values will now receive capped results or errors.
249-256: Good addition: Requiring query or filter with facets.This validation prevents expensive facet-only queries that scan the entire index. This is a sensible performance safeguard.
431-439: Good improvement: Empty string handling.Treating empty/whitespace-only values as absent parameters (
None) is a sensible approach that prevents unnecessary validation errors.src/test/scala/dpla/api/v2/search/queryBuilders/QueryBuilderTest.scala (3)
273-277: LGTM: Updated track_total_hits expectation from Boolean to Int.The test now expects
track_total_hitsto be10000instead of a boolean value, aligning with theQueryBuilder.scalachanges that limit total hit tracking for performance optimization.
287-311: LGTM: Tests updated to usemulti_matchinstead ofquery_string.The keyword query tests now correctly traverse the
multi_matchpath instead ofquery_string, reflecting the query builder's switch tomulti_matchqueries. The assertions forquery,fields,operator, andlenientare properly adjusted.
930-937: LGTM: Filter query path updated to match new query structure.The filter query test now reads from
"query", "bool", "filter"path directly, consistent with the simplified filter structure in the updatedQueryBuilder.scala.src/test/scala/dpla/api/v2/endToEnd/InvalidParamsTest.scala (1)
35-40: LGTM: Typed actor system integration and route timeout added.The addition of
typedSystemandcreateActorSystem()override enables proper typed/classic actor system interop, and the 3-secondRouteTestTimeoutprovides reasonable test timing. This pattern is consistent with other test files updated in this PR.src/test/scala/dpla/api/v2/search/paramValidators/ItemParamValidatorTest.scala (1)
36-40: LGTM: Test updated to include required query parameter.The addition of
"q" -> "test"aligns with the new validation rule that requires a query or filter when facets are specified (as tested inParamValidatorTest.scalaat lines 322-326). The test still correctly verifies that non-applicable DPLA Map fields are ignored.src/test/scala/dpla/api/v2/search/paramValidators/ParamValidatorTest.scala (4)
321-326: LGTM: New test validates facets require query or filter.This test ensures that requests with facets but no accompanying query (
q) or filter are rejected. This prevents expensive facet-only aggregation queries, which is a sensible performance guard.
353-360: LGTM: Updated max facet_size expectation to 200.The expected maximum
facet_sizeis now 200 (reduced from 2000), reflecting the stricter limits in the updatedParamValidator.scala.
581-592: LGTM: Pagination validation tests updated for stricter limits.The tests now verify:
- Too-large page values (600) result in
InvalidSearchParamsrather than capping to max- Deep pagination beyond 1000 results (page=50 × page_size=50 = 2500) is rejected
These stricter limits protect against expensive deep-pagination queries on Elasticsearch.
619-624: LGTM: Updated page_size validation to reject rather than cap.The test now expects
InvalidSearchParamsfor excessively largepage_sizevalues instead of silently capping to the maximum. This explicit rejection provides clearer feedback to API consumers.src/test/scala/dpla/api/v2/search/ConcurrencyLimiterTest.scala (4)
1-22: LGTM: Well-structured test setup with appropriate async configuration.The test class properly mixes in
ScalaFuturesandEventuallyfor async testing, with a reasonablePatienceConfig(5s timeout, 50ms polling interval).
82-119: LGTM: Thorough concurrency limit enforcement test.This test properly validates:
- Two permits are acquired when two Futures are in-flight
- A third request fails with
ConcurrencyLimitExceededafter timeout- Permits are released when the original Futures complete
Using
Promiseto control Future completion timing is the correct approach for testing concurrency limits.
146-162: LGTM: Constructor validation tests for edge cases.Good coverage of invalid constructor arguments (zero and negative values for both
maxConcurrentandtimeoutSeconds), ensuring the limiter fails fast with clear errors.
63-80: Permit release on Future construction throw is synchronous and safe.The implementation at lines 64-71 of
ConcurrencyLimiter.scalaconfirms the permit is released synchronously: whenval future = fthrows (line 65), the catch block at line 68-70 immediately callssemaphore.release()before returning the failed Future. The test assertion at line 79 is correct and not flaky.src/main/scala/dpla/api/v2/search/ConcurrencyLimiter.scala (3)
6-24: Docs correctly call out blocking behavior and operational guidanceThe Scaladoc explicitly flags the blocking
tryAcquiresemantics and suggests using a dedicated dispatcher / tuning parameters, which matches the implementation below and should help prevent misuse in Akka/HTTP paths.
25-38: Constructor validation and semaphore setup look correctThe
requireguards onmaxConcurrentandtimeoutSecondsand the correspondingSemaphore(maxConcurrent)initialization are consistent with the intended concurrency cap and error reporting; no changes needed here.
81-90: Exception type is informative and matches limiter configuration
ConcurrencyLimitExceededcarries bothmaxConcurrentandtimeoutSecondsand renders a clear message, which should be helpful in logs and client error handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements Phase 2 of API improvements focused on search performance optimization and request limiting. Key changes include migrating from Elasticsearch's query_string to multi_match queries, enforcing stricter pagination limits to prevent deep pagination, requiring queries or filters for facet requests, and introducing a new ConcurrencyLimiter class for managing concurrent request loads.
Key Changes
- Search query optimization: Replaced
query_stringwithmulti_matchfor keyword queries, reduced searchable fields from 27 to 8 high-value fields, and simplified filter query structure - Pagination limits: Reduced max page from 100 to 10, max page_size from 500 to 100, max facet_size from 2000 to 200, and added validation to reject requests exceeding 1000 total results (from + size)
- Concurrency control: Added new
ConcurrencyLimiterclass with comprehensive test coverage to manage concurrent request loads using semaphores
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala | Migrated from query_string to multi_match queries, reduced searchable fields, simplified filter structure, and changed track_total_hits from boolean to 10000 |
| src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala | Added deep pagination validation, facet requirement checks, reduced max pagination limits, and added empty string handling in parameter validation |
| src/main/scala/dpla/api/v2/search/ConcurrencyLimiter.scala | New class implementing semaphore-based concurrency limiting for Futures with timeout support |
| src/test/scala/dpla/api/v2/search/ConcurrencyLimiterTest.scala | Comprehensive test coverage for ConcurrencyLimiter including edge cases and failure scenarios |
| src/test/scala/dpla/api/v2/search/queryBuilders/QueryBuilderTest.scala | Updated tests to validate multi_match queries, new filter structure, and track_total_hits value |
| src/test/scala/dpla/api/v2/search/paramValidators/ParamValidatorTest.scala | Updated tests to validate new pagination limits, facet requirements, and deep pagination rejection |
| src/test/scala/dpla/api/v2/search/paramValidators/ItemParamValidatorTest.scala | Updated test to require query parameter with facets |
| src/test/scala/dpla/api/v2/httpHeadersAndMethods/*.scala | Added route test timeouts and reformatted for consistency |
| src/test/scala/dpla/api/v2/endToEnd/*.scala | Added route test timeouts and reformatted for consistency |
| src/test/scala/dpla/api/helpers/FileReader.scala | Improved resource handling with try-finally and explicit UTF-8 encoding |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala
Outdated
Show resolved
Hide resolved
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala
Outdated
Show resolved
Hide resolved
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala
Outdated
Show resolved
Hide resolved
…cala Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…or.scala Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (1)
196-274: Add defensive error handling to basic field query branch for consistencyThe basic field query branch (else block, lines 268-274) silently falls back to
keywordQueryFieldswhengetElasticSearchField()returnsNone, whereas the.before/.afterand exact-match branches throw aRuntimeExceptionfor unrecognized field names. Although upstream validation inParamValidator.scalaensures that only valid, searchable field names reachsingleFieldQuery()(viasearchableDataFields.flatMap(getValidFieldQuery(...))), the inconsistent error handling makes the code less defensive and harder to debug if upstream validation changes or if this method is called directly in the future.For consistency and defensive programming, adopt the same pattern as other branches:
- } else { - // Basic field query - val fields: Seq[String] = - Seq(getElasticSearchField(fieldQuery.fieldName)).flatten - val obj: JsObject = keywordQuery(fieldQuery.value, fields) - Seq(obj) - } + } else { + // Basic field query + val field: String = getElasticSearchField(fieldQuery.fieldName) + .getOrElse( + throw new RuntimeException( + "Unrecognized field name: " + fieldQuery.fieldName + ) + ) + val obj: JsObject = keywordQuery(fieldQuery.value, Seq(field)) + Seq(obj) + }This aligns all branches to fail fast on invalid field names rather than silently changing query semantics.
🧹 Nitpick comments (2)
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (2)
180-195: Avoid unsafe cast infilterQuery
filterQuerycurrently relies on.toJson.asInstanceOf[JsArray]for the sequence of term filters, which is brittle and ties correctness to the implementation detail thatSeqalways serializes toJsArray.You can build the
JsArraydirectly and drop the cast:- private def filterQuery(filters: Seq[Filter]): JsArray = - filters - .map(filter => { - JsObject( - "term" -> JsObject( - filter.fieldName -> filter.value.toJson - ) - ) - }) - .toJson - .asInstanceOf[JsArray] + private def filterQuery(filters: Seq[Filter]): JsArray = + JsArray( + filters.map { filter => + JsObject( + "term" -> JsObject( + filter.fieldName -> filter.value.toJson + ) + ) + }: _* + )This is safer and keeps the return type explicit.
276-283: Minor naming nit:stripLeadingAndTrainingQuotationMarksThe method name still says
Trainingwhile the doc comment (Lines 276-278) correctly says “leading and trailing quotation marks.” Consider renaming the method tostripLeadingAndTrailingQuotationMarksfor clarity the next time you touch this area.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala(18 hunks)src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala(11 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala
🧰 Additional context used
🧬 Code graph analysis (1)
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (1)
src/main/scala/dpla/api/v2/search/models/FieldDefinitions.scala (1)
getElasticSearchField(64-67)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Prepare
🔇 Additional comments (6)
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (6)
16-43: Message routing inapplyis clear and side‑effect usage is confinedThe actor behavior cleanly delegates to the next phase for search, fetch (single vs multi), and random queries, and uses
Behaviors.same/Behaviors.unhandledappropriately. No issues spotted here.
67-93: Random query composition with optional filters looks correctThe new
composeRandomQueryimplementation correctly wraps optional filters in abool.filterand nests that underfunction_score, while keeping the unfiltered case minimal. The use ofsize = 1matches the intent of a single random hit. No functional problems here.
96-111: Bool query assembly and top‑level search structure are soundThe refactored
composeSearchQuery/querypair correctly handle:
- Pagination (
from/size),- Combining keyword and field queries into a bool
must/shoulddepending onop,- Attaching filters as a separate
filterclause,- Falling back to
match_allwhen both terms and filters are absent,- Adding
track_total_hits = 10000to cap expensive hit tracking.The structure matches typical ES expectations and looks consistent with the rest of the builder.
Also applies to: 128-161
113-123: Keyword multi‑match configuration and curated field list look reasonableThe reduced
keywordQueryFieldsset with boosts, combined withmulti_matchusingbest_fields,operator = "AND", andlenient = true, forms a coherent default keyword query strategy. ThetargetFieldsfallback ensures calls without explicit fields still use this curated set.Also applies to: 163-178
285-381: Facet (aggs) composition for geo and date fields looks consistentThe
aggsbuilder correctly distinguishes:
- Spatial facets using
geo_distancebuckets with a generatedrangesarray (Lines 302-307),- Date facets using a filtered
date_histogramwith month/year granularity and corresponding formats andgtewindows (Lines 327-340),- Regular term facets via
termson exact‑match fields.The control flow and error handling for unknown facet names are consistent with the rest of the file. No issues stood out here.
383-447: Sort and field retrieval defaults are reasonable
- The
sortfunction cleanly handles geo sort, field‑based sort, and falls back to_docwhen there is no query/sort, which is appropriate for “no‑order” browsing.fieldRetrievalreturningSeq("*").toJsonwhen no fields are specified gives a sensible default of retrieving the full source.No functional concerns in these sections.
Also applies to: 449-453
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/test/scala/dpla/api/v2/search/models/DPLAMAPFieldsTest.scala (1)
62-75: Consider verifying exact field mappings.The test only checks that the count of mapped fields matches the count of keyword fields. If one field fails to map and is silently dropped, the test wouldn't catch it. Consider using
contain allElementsOfto verify that all expected fields are present.Apply this diff to make the test more precise:
"have searchable keywordQuery fields defined in ES mapping" in { val keywordFields = Seq( "sourceResource.title", "sourceResource.subject.name", "sourceResource.description", "sourceResource.creator", "sourceResource.collection.title", "dataProvider.name", "provider.name" ) val mapped = keywordFields.flatMap(tester.getElasticSearchField) - mapped.size shouldBe keywordFields.size + mapped should have size keywordFields.size + // Optionally verify all fields mapped successfully + keywordFields.foreach { field => + tester.getElasticSearchField(field) should not be empty + } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala(11 hunks)src/test/scala/dpla/api/v2/search/models/DPLAMAPFieldsTest.scala(1 hunks)src/test/scala/dpla/api/v2/search/queryBuilders/QueryBuilderTest.scala(23 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/test/scala/dpla/api/v2/search/models/DPLAMAPFieldsTest.scala (1)
src/main/scala/dpla/api/v2/search/models/FieldDefinitions.scala (2)
getElasticSearchField(64-67)getElasticSearchExactMatchField(75-81)
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (1)
src/main/scala/dpla/api/v2/search/models/FieldDefinitions.scala (1)
getElasticSearchField(64-67)
src/test/scala/dpla/api/v2/search/queryBuilders/QueryBuilderTest.scala (3)
src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala (5)
search(19-33)search(35-37)search(39-41)search(43-46)search(48-51)src/main/scala/dpla/api/v2/search/mappings/JsonFieldReader.scala (6)
readObject(16-29)readString(34-37)readObjectArray(31-34)readInt(40-43)readStringArray(37-40)readBoolean(43-49)src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (2)
query(127-160)filterQuery(183-205)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Agent
🔇 Additional comments (11)
src/test/scala/dpla/api/v2/search/models/DPLAMAPFieldsTest.scala (1)
77-91: LGTM!The test correctly verifies that exact match fields use not_analyzed subfields. The use of
foreachensures each field is tested individually, which provides clear failure messages.src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (7)
113-122: LGTM!The reduced set of keyword query fields with appropriate boost weights aligns well with cultural heritage search patterns. Prioritizing title (^2) over other fields is a sensible default.
162-177: LGTM!The migration from
query_stringtomulti_matchsimplifies the keyword query behavior and makes it more user-friendly by removing advanced Lucene query syntax. Thelenientflag appropriately handles field type mismatches.
183-193: LGTM!Changing the return type to
JsArraycorrectly reflects Elasticsearch's bool filter syntax, where filters are an array of query clauses.
195-204: LGTM!The updated comment correctly describes the multi_match query behavior and clearly distinguishes it from the exact-match "term" query.
205-275: LGTM!The improved error messages and the fix for the function name typo (
stripLeadingAndTrailingQuotationMarks) enhance code clarity. The logic for handling range queries, exact matches, and basic field queries is correct.
277-284: LGTM!The function name typo has been corrected (
stripLeadingAndTrailingQuotationMarks), and the comment accurately describes the behavior.
110-110: Pagination depth is already limited by existing parameter validation.The
ParamValidatorenforcesmaxPage = 10andmaxPageSize = 100, which limits the maximum offset to 900 and total accessible results to approximately 1,000. This is more restrictive than thetrack_total_hits = 10000setting, so users cannot request results beyond what is already enforced by the parameter validation layer. No additional validation is needed.src/test/scala/dpla/api/v2/search/queryBuilders/QueryBuilderTest.scala (3)
273-277: LGTM!The test correctly updated to expect
track_total_hitsas an integer value of 10000, matching the QueryBuilder implementation change.
287-320: LGTM!The tests correctly updated to read
multi_matchquery structures instead ofquery_string, and the operator field name change fromdefault_operatortooperatormatches Elasticsearch's multi_match syntax.
404-509: LGTM!Excellent comprehensive test coverage for the quote stripping behavior. The tests cover all relevant edge cases: properly quoted strings, strings with internal quotes, strings with only leading or trailing quotes, empty quoted strings, and unquoted strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| def validIntWithRange(intString: String, param: String): Int = { | ||
| val (min: Int, max: Int) = param match { | ||
| case "facet_size" => (minFacetSize, maxFacetSize) | ||
| case "page" => (minPage, maxPage) | ||
| case "page_size" => (minPageSize, maxPageSize) | ||
| case _ => (0, 2147483647) | ||
| case "page" => (minPage, maxPage) | ||
| case "page_size" => (minPageSize, maxPageSize) | ||
| case _ => (0, 2147483647) | ||
| } | ||
|
|
||
| val rule = s"$param must be an integer between 0 and $max" |
Copilot
AI
Dec 9, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The max limits were reduced (e.g., maxPageSize from 500 to 100), and corresponding tests now expect rejection of over-limit values (see ParamValidatorTest.scala lines 619-624). However, the logic at line 502 (unchanged in this PR) still returns max instead of throwing an exception when values exceed the limit. This creates inconsistent behavior where the tests expect rejection but the implementation clamps values. The unchanged line 502 should also be updated to throw ValidationException(rule) to match the new expected behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot open a new pull request to apply changes based on this feedback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@copilot open a new pull request to apply changes based on this feedback
…ing (#94) Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: moltude <1419766+moltude@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala (3)
125-133: Pagination limits and depth guard interplayThe combination of
maxPage = 10,maxPageSize = 100, and thefromValue + requestedPageSize > 1000guard means that, with current constants, no valid combination can actually exceed the 1000 window—the guard is effectively a future‑proof safety net.To avoid drift if either the ES
index.max_result_windowor these limits change later, consider either:
- deriving the
1000constant from shared configuration, or- adding a short comment near
maxPage/maxPageSizeexplaining that they’re chosen sofrom + size <= 1000, and that the guard is an extra safety check.Also applies to: 238-247
152-158: Fetch ID count capped bymaxPageSizemay deserve its own constantReusing
maxPageSizeto limit the number ofids in a fetch (ids.size > maxPageSize) is functionally fine, but it couples two concerns: search page size and fetch batch size. If you ever want to change one without the other, this will be surprising.Consider introducing a dedicated constant (e.g.,
maxFetchIds) or at least a short comment explaining that the fetch ID limit is intentionally tied tomaxPageSize.Also applies to: 167-170
490-499:validIntWithRangeerror message hard-codes lower bound 0
validIntWithRangecomputes(min, max)per param (e.g.,pageuses(minPage, maxPage)whereminPage = 1), but therulestring always says:val rule = s"$param must be an integer between 0 and $max"So for
page, an input of0or a negative value will yield an error message that mentions0as the lower bound, even though the actual check isint < min.Consider interpolating
mininstead of0:- val rule = s"$param must be an integer between 0 and $max" + val rule = s"$param must be an integer between $min and $max"This keeps behavior the same while making validation errors clearer to API consumers.
Also applies to: 502-507
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala(18 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala (1)
src/main/scala/dpla/api/v2/search/models/FieldDefinitions.scala (5)
TextField(7-7)IntField(9-9)URLField(11-11)DateField(13-13)WildcardField(17-17)
| val filter = | ||
| getValidFilter(rawParams) | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Facet guard can be bypassed via empty filter values
The new facet guard:
val hasFacets = facets.nonEmptyval hasQuery = q.isDefined || fieldQueries.nonEmptyval hasFilter = filter.isDefinedif (hasFacets && !hasQuery && !hasFilter) throw ...
assumes that hasFilter implies at least one effective filter value. However:
getValidFiltersplits on":"/"AND"and producesvalues: Array[String],- each
valueis passed throughgetValid, and getValidnow drops empty strings (case (_, v) if v.trim.isEmpty => None).
This means a request like filter=type: (no value) yields values = Array(""), which becomes Seq.empty after getValid drops the empty value. The result is filter = Some(Seq.empty):
filter.isDefinedistrue→hasFilter == true,- but there are actually no filters applied downstream.
That allows facet requests with no real query or filter to pass the new guard, undermining its intent and potentially re‑enabling high‑cardinality facet requests you were trying to constrain.
One way to tighten this up is to treat “all values empty” as an invalid filter rather than as Some(Nil). For example, you can enforce this inside getValidFilter:
- private def getValidFilter(
- rawParams: Map[String, String]
- ): Option[Seq[Filter]] =
- rawParams.get("filter").map { filter =>
- val fieldName = filter
- .split(":", 2)
- .headOption
- .getOrElse(throw ValidationException(s"$filter is not a valid filter"))
-
- val values = filter
- .split(":", 2)
- .lastOption
- .getOrElse(throw ValidationException(s"$filter is not a valid filter"))
- .split("AND")
- .map(_.trim)
-
- if (searchableDataFields.contains(fieldName)) {
- val validationMethod = getValidationMethod(fieldName)
-
- values.flatMap(value => {
- val params = Map(fieldName -> value)
- getValid(params, fieldName, validationMethod)
- .map(Filter(fieldName, _))
- })
- } else {
- throw ValidationException(s"$fieldName is not a valid filter field")
- }
- }
+ private def getValidFilter(
+ rawParams: Map[String, String]
+ ): Option[Seq[Filter]] =
+ rawParams.get("filter").map { filter =>
+ val fieldName = filter
+ .split(":", 2)
+ .headOption
+ .map(_.trim)
+ .getOrElse(throw ValidationException(s"$filter is not a valid filter"))
+
+ val rawValues = filter
+ .split(":", 2)
+ .lastOption
+ .getOrElse(throw ValidationException(s"$filter is not a valid filter"))
+ .split("AND")
+ .map(_.trim)
+
+ val values = rawValues.filter(_.nonEmpty)
+ if (values.isEmpty)
+ throw ValidationException(s"$filter is not a valid filter")
+
+ if (searchableDataFields.contains(fieldName)) {
+ val validationMethod = getValidationMethod(fieldName)
+
+ values.flatMap { value =>
+ val params = Map(fieldName -> value)
+ getValid(params, fieldName, validationMethod)
+ .map(Filter(fieldName, _))
+ }
+ } else {
+ throw ValidationException(s"$fieldName is not a valid filter field")
+ }
+ }This keeps the new “ignore empty values” behavior in getValid for other params, but makes empty filters explicitly invalid and ensures hasFilter truly implies at least one applied filter.
Also applies to: 249-256, 338-367, 431-439
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/test/scala/dpla/api/v2/search/paramValidators/ParamValidatorTest.scala (2)
297-304: Facet tests correctly enforce q/filter requirement; consider adding filter-only coverageThe facet tests now (a) require
facetsto be paired with a primary search param (viaq) and (b) explicitly rejectfacetswith no query or filter, which matches the tightened validation behavior. To fully exercise the new rule, it would be useful to add a case wherefacetsis combined withfilterbut noq, to lock in that this combination remains accepted.Example sketch, placed alongside the existing facet tests:
"facet validator" should { @@ "accept valid param" in { @@ } @@ + "accept facets with filter but no query" in { + val given = "provider.@id" + val filter = "provider.@id:http://dp.la/api/contributor/lc" + val params = Map("facets" -> given, "filter" -> filter) + ebookParamValidator ! RawSearchParams(params, replyProbe.ref) + val msg = interProbe.expectMessageType[ValidSearchParams] + msg.params.facets shouldEqual Some(Seq("provider.@id")) + } + "reject unfacetable field" in {Also applies to: 313-320, 322-326
353-358: Pagination guard tests align with new behavior; rename specs for clarityThe facet-size and pagination tests now correctly treat oversize values as invalid and the new “reject deep pagination beyond 1000 results” case exercises the depth guard nicely. The only mismatch is that the spec descriptions at Line [579] and Line [617] still say “default to max if param is too large” even though the expectations now assert invalid input; renaming them would keep the tests self-documenting and consistent with the facet-size test.
Suggested rename:
- "default to max if param is too large" in { + "reject param if too large" in { @@ - "default to max if param is too large" in { + "reject param if too large" in {Also applies to: 556-563, 579-584, 586-590, 617-622
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/test/scala/dpla/api/v2/search/paramValidators/ParamValidatorTest.scala(27 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/test/scala/dpla/api/v2/search/paramValidators/ParamValidatorTest.scala (1)
src/main/scala/dpla/api/v2/search/SearchProtocol.scala (1)
Random(24-27)
🔇 Additional comments (3)
src/test/scala/dpla/api/v2/search/paramValidators/ParamValidatorTest.scala (3)
13-16: Readability-only reflow looks goodRewrapping the class header, long strings, and map literals over multiple lines improves readability without changing behavior; no issues here.
Also applies to: 52-53, 74-79, 434-438, 768-768, 823-824, 834-835, 850-851, 859-860, 869-870
123-124: Consistent field-value extraction across validatorsThese assertions now all use the same pattern to pull out the stored value for each field, which keeps the tests uniform and easier to scan; behavior is unchanged and looks correct.
Also applies to: 135-136, 147-148, 177-178, 214-215, 244-245, 403-404, 443-444, 473-474, 510-511, 642-643, 761-762, 798-799
875-887: Filter validator assertions are precise and robustThese filter tests now assert both the parsed field name and value for the first clause, which tightens the contract without over-coupling to internal representation. This is a clear improvement with no downsides.
Also applies to: 889-900
Summary by CodeRabbit
New Features
Bug Fixes
Changes
Tests
✏️ Tip: You can customize this high-level summary in your review settings.