Skip to content

Conversation

@moltude
Copy link
Contributor

@moltude moltude commented Dec 9, 2025

Summary by CodeRabbit

  • New Features

    • Added configurable concurrency limiting with timeout-based rejection.
  • Bug Fixes

    • Prevented pagination requests that would exceed 1000 total items.
    • Ensured file reads use explicit UTF-8 and are safely closed.
  • Changes

    • Search now uses multi-match keyword queries and caps tracked hits at 10,000.
    • Tightened pagination validations and query/facet assembly.
  • Tests

    • Added concurrency limiter tests, expanded parameter-validation tests, and introduced route test timeouts for stability.

✏️ Tip: You can customize this high-level summary in your review settings.

@moltude moltude requested a review from Copilot December 9, 2025 13:59
@coderabbitai
Copy link

coderabbitai bot commented Dec 9, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a semaphore-based ConcurrencyLimiter; tightens ParamValidator with a pagination-depth guard and facet validation; rewrites QueryBuilder to use multi_match and set track_total_hits to 10000; updates many tests (route timeouts, typed ActorSystem, ConcurrencyLimiter tests); fixes test file reading to use UTF-8 and close sources.

Changes

Cohort / File(s) Summary
New Concurrency Control
src/main/scala/dpla/api/v2/search/ConcurrencyLimiter.scala
Adds ConcurrencyLimiter class using a Semaphore to cap concurrent Future executions; constructor validates inputs; apply acquires permit with timeout, executes f, releases permit on completion or synchronous exception; exposes availablePermits; adds ConcurrencyLimitExceeded exception.
Parameter Validation
src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala
Adds pagination-depth guard rejecting requests where from + requestedPageSize > 1000; refactors param extraction and enforces that facet requests require a query or filter; various internal formatting/refactors.
Query Builder
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala
Replaces query_string with multi_match (lenient) for keyword queries, reduces/reorders keyword fields, changes track_total_hits from boolean to integer 10000, changes some filter composition to JsArray, renames quote-strip helper, and adjusts agg/facet and single-field query handling.
Test helpers (IO)
src/test/scala/dpla/api/helpers/FileReader.scala
Use explicit UTF-8 when reading files and ensure BufferedSource is closed via try/finally.
End-to-end tests
src/test/scala/dpla/api/v2/endToEnd/InvalidParamsTest.scala, src/test/scala/dpla/api/v2/endToEnd/MapperFailureTest.scala
Add implicit RouteTestTimeout, integrate typed ActorSystem accessor (typedSystem) and override createActorSystem() for test compatibility; formatting and timeout config changes.
HTTP header/method tests
src/test/scala/dpla/api/v2/httpHeadersAndMethods/HeaderAuthorizationTest.scala, src/test/scala/dpla/api/v2/httpHeadersAndMethods/PermittedMediaTypesTest.scala, src/test/scala/dpla/api/v2/httpHeadersAndMethods/ResponseHeadersTest.scala
Introduce implicit RouteTestTimeout(3.seconds) and typed system support; formatting/constructor-call adjustments.
ParamValidator & QueryBuilder tests
src/test/scala/dpla/api/v2/search/paramValidators/ParamValidatorTest.scala, src/test/scala/dpla/api/v2/search/paramValidators/ItemParamValidatorTest.scala, src/test/scala/dpla/api/v2/search/queryBuilders/QueryBuilderTest.scala
Add facet-validation test; update expectations for page/facet limits and deep-pagination rejection; update tests to read multi_match nodes and integer track_total_hits; adjust JSON traversal/assertions to match query builder changes.
ConcurrencyLimiter tests
src/test/scala/dpla/api/v2/search/ConcurrencyLimiterTest.scala
New test suite covering permit acquisition/release on success and failure, synchronous exception handling, concurrency/timeouts and reuse of permits, and constructor parameter validation and exception contents.
DPLA MAP fields tests
src/test/scala/dpla/api/v2/search/models/DPLAMAPFieldsTest.scala
Adds tests confirming keyword-backed fields have searchable mappings and that exact-match fields include not_analyzed subfields.
Misc test formatting
src/test/... (various)
Multiple tests reformatted (multiline imports/constructors, assertion style changes) without semantic changes beyond the additions above.
Build / manifest
manifest_file, build.sbt
Minor manifest/build references included in diff summary (no functional details).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

  • Areas needing focused review:
    • QueryBuilder: verify multi_match substitution, lenient settings, field list reduction, and track_total_hits = 10000 produce intended ES queries and preserve scoring/filters.
    • ParamValidator: confirm pagination-depth guard and facet-requirement semantics and error messages match API expectations.
    • ConcurrencyLimiter: ensure permit acquisition/release on all paths (successful Future, failed Future, and synchronous exceptions) and correctness of timeout failure behavior and exception contents.
    • Tests: ensure RouteTestTimeout and typed ActorSystem changes don't introduce flakiness; update expected JSON paths/assertions to match QueryBuilder changes.

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title check ❓ Inconclusive The title 'Phase 2 of API improvements' is vague and generic, using non-descriptive terms that do not clearly convey the specific changes in this changeset. Replace with a more specific title that describes the primary changes, such as 'Add concurrency limiter, refine query validation and search query structure' or similar.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
✨ Finishing touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/api-phase-2

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@moltude moltude marked this pull request as ready for review December 9, 2025 14:00
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
src/test/scala/dpla/api/helpers/FileReader.scala (1)

8-10: Resource handling is correct; Using could simplify (if Scala 2.13+).

The explicit UTF‑8 codec and try/finally with buffered.close() look correct and fix the resource‑leak risk. If the project is on Scala 2.13+, you could optionally rewrite this with scala.util.Using for a more idiomatic one-liner, but it’s not required.

src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (1)

184-194: Consider removing explicit cast.

The .asInstanceOf[JsArray] cast is safe but could be avoided by using the type system.

-  private def filterQuery(filters: Seq[Filter]): JsArray =
-    filters
-      .map(filter => {
-        JsObject(
-          "term" -> JsObject(
-            filter.fieldName -> filter.value.toJson
-          )
-        )
-      })
-      .toJson
-      .asInstanceOf[JsArray]
+  private def filterQuery(filters: Seq[Filter]): JsArray =
+    JsArray(filters.map { filter =>
+      JsObject(
+        "term" -> JsObject(
+          filter.fieldName -> filter.value.toJson
+        )
+      )
+    }: _*)
src/main/scala/dpla/api/v2/search/ConcurrencyLimiter.scala (1)

55-72: Narrow exception handling when evaluating f and consider documenting interruption semantics

Two small robustness points in apply:

  1. When evaluating f, you currently catch Throwable. In Scala it’s more idiomatic to catch NonFatal so that truly fatal errors (e.g., VirtualMachineError, ThreadDeath) are not transformed into failed Futures:
+import scala.util.control.NonFatal
 ...
-    } else {
-      try {
-        val future = f
-        future.andThen { case _ => semaphore.release() }(ec)
-      } catch {
-        case e: Throwable =>
-          semaphore.release()
-          Future.failed(e)
-      }
-    }
+    } else {
+      try {
+        val future = f
+        future.andThen { case _ => semaphore.release() }(ec)
+      } catch {
+        case NonFatal(e) =>
+          semaphore.release()
+          Future.failed(e)
+      }
+    }
  1. If the thread is interrupted while waiting in semaphore.tryAcquire(timeoutSeconds, TimeUnit.SECONDS), an InterruptedException will currently bubble out of apply instead of becoming a Future failure (and the interrupt flag isn’t restored). It might be worth deciding and documenting whether you want to:
    • Let that propagate as-is (and mention it in the Scaladoc), or
    • Catch it and convert to a failed Future (possibly re‑interrupting the thread).
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5d79d40 and 8f9a684.

📒 Files selected for processing (13)
  • src/main/scala/dpla/api/v2/search/ConcurrencyLimiter.scala (1 hunks)
  • src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala (18 hunks)
  • src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (11 hunks)
  • src/test/scala/dpla/api/helpers/FileReader.scala (1 hunks)
  • src/test/scala/dpla/api/v2/endToEnd/InvalidParamsTest.scala (8 hunks)
  • src/test/scala/dpla/api/v2/endToEnd/MapperFailureTest.scala (4 hunks)
  • src/test/scala/dpla/api/v2/httpHeadersAndMethods/HeaderAuthorizationTest.scala (2 hunks)
  • src/test/scala/dpla/api/v2/httpHeadersAndMethods/PermittedMediaTypesTest.scala (2 hunks)
  • src/test/scala/dpla/api/v2/httpHeadersAndMethods/ResponseHeadersTest.scala (2 hunks)
  • src/test/scala/dpla/api/v2/search/ConcurrencyLimiterTest.scala (1 hunks)
  • src/test/scala/dpla/api/v2/search/paramValidators/ItemParamValidatorTest.scala (2 hunks)
  • src/test/scala/dpla/api/v2/search/paramValidators/ParamValidatorTest.scala (27 hunks)
  • src/test/scala/dpla/api/v2/search/queryBuilders/QueryBuilderTest.scala (23 hunks)
🧰 Additional context used
🧬 Code graph analysis (9)
src/test/scala/dpla/api/v2/httpHeadersAndMethods/PermittedMediaTypesTest.scala (2)
src/test/scala/dpla/api/v2/search/MockEbookSearch.scala (1)
  • MockEbookSearch (11-57)
src/test/scala/dpla/api/v2/registry/MockEbookRegistry.scala (1)
  • MockEbookRegistry (11-30)
src/test/scala/dpla/api/v2/search/ConcurrencyLimiterTest.scala (1)
src/main/scala/dpla/api/v2/search/ConcurrencyLimiter.scala (2)
  • availablePermits (78-78)
  • ConcurrencyLimitExceeded (84-90)
src/test/scala/dpla/api/v2/endToEnd/MapperFailureTest.scala (1)
src/test/scala/dpla/api/v2/registry/MockEbookRegistry.scala (1)
  • MockEbookRegistry (11-30)
src/test/scala/dpla/api/v2/search/queryBuilders/QueryBuilderTest.scala (3)
src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala (5)
  • search (19-33)
  • search (35-37)
  • search (39-41)
  • search (43-46)
  • search (48-51)
src/main/scala/dpla/api/v2/search/mappings/JsonFieldReader.scala (6)
  • readObject (16-29)
  • readString (34-37)
  • readObjectArray (31-34)
  • readInt (40-43)
  • readStringArray (37-40)
  • readBoolean (43-49)
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (2)
  • query (128-161)
  • filterQuery (184-205)
src/test/scala/dpla/api/v2/endToEnd/InvalidParamsTest.scala (6)
src/test/scala/dpla/api/v2/registry/MockEbookRegistry.scala (1)
  • MockEbookRegistry (11-30)
src/test/scala/dpla/api/v2/registry/MockSmrRegistry.scala (1)
  • MockSmrRegistry (10-29)
src/test/scala/dpla/api/v2/search/MockEbookSearch.scala (1)
  • MockEbookSearch (11-57)
src/test/scala/dpla/api/v2/smr/MockSmrRequestHandler.scala (1)
  • MockSmrRequestHandler (8-42)
src/main/scala/dpla/api/v2/smr/SmrProtocol.scala (1)
  • SmrProtocol (6-44)
src/main/scala/dpla/api/v2/smr/SmrParamValidator.scala (1)
  • validService (85-90)
src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala (2)
src/main/scala/dpla/api/v2/search/SearchProtocol.scala (14)
  • search (61-64)
  • search (66-70)
  • search (72-75)
  • search (77-80)
  • search (82-86)
  • search (88-91)
  • search (93-97)
  • search (99-104)
  • search (106-109)
  • search (111-115)
  • search (117-121)
  • search (123-127)
  • search (129-132)
  • search (134-138)
src/main/scala/dpla/api/v2/search/models/FieldDefinitions.scala (5)
  • TextField (7-7)
  • IntField (9-9)
  • URLField (11-11)
  • DateField (13-13)
  • WildcardField (17-17)
src/test/scala/dpla/api/v2/httpHeadersAndMethods/HeaderAuthorizationTest.scala (2)
src/test/scala/dpla/api/v2/search/MockEbookSearch.scala (1)
  • MockEbookSearch (11-57)
src/test/scala/dpla/api/v2/registry/MockEbookRegistry.scala (1)
  • MockEbookRegistry (11-30)
src/test/scala/dpla/api/v2/httpHeadersAndMethods/ResponseHeadersTest.scala (3)
src/test/scala/dpla/api/v2/registry/MockEbookRegistry.scala (1)
  • MockEbookRegistry (11-30)
src/test/scala/dpla/api/v2/registry/MockPssRegistry.scala (1)
  • MockPssRegistry (11-30)
src/test/scala/dpla/api/v2/search/MockEbookSearch.scala (1)
  • MockEbookSearch (11-57)
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (1)
src/main/scala/dpla/api/v2/search/models/FieldDefinitions.scala (1)
  • getElasticSearchField (64-67)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Agent
🔇 Additional comments (26)
src/test/scala/dpla/api/v2/endToEnd/MapperFailureTest.scala (1)

17-18: Good addition of test timeout.

The explicit 3-second timeout prevents tests from hanging indefinitely and is appropriate for mock-based end-to-end tests.

Also applies to: 33-34

src/test/scala/dpla/api/v2/httpHeadersAndMethods/HeaderAuthorizationTest.scala (1)

17-18: LGTM! Route test timeout added for test stability.

The RouteTestTimeout(3.seconds) addition helps prevent test flakiness by providing an explicit timeout. This is a good practice for Akka HTTP route tests.

Also applies to: 33-34

src/test/scala/dpla/api/v2/httpHeadersAndMethods/PermittedMediaTypesTest.scala (1)

17-18: LGTM! Consistent timeout configuration with other test files.

The route test timeout addition follows the same pattern established in other test files in this PR.

Also applies to: 33-34

src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (3)

110-110: Verify track_total_hits value aligns with pagination depth.

The value 10000 is a reasonable optimization to avoid counting all documents. However, given that ParamValidator now limits pagination to 1000 items (fromValue + pageSize > 1000), you could consider lowering this to 1000 or keeping it at 10000 for safety margin. The current value works correctly but is 10x the maximum accessible records.


163-178: Good migration from query_string to multi_match.

Using multi_match with best_fields and lenient: true is a safer and more predictable approach than query_string. This avoids potential Lucene query syntax injection issues while maintaining effective full-text search capability.


113-123: Verify reduced field set meets search requirements.

The field set has been curated to high-value fields with reasonable boost weights. Ensure that removing previously included fields doesn't negatively impact search recall for your use cases.

src/test/scala/dpla/api/v2/httpHeadersAndMethods/ResponseHeadersTest.scala (1)

20-21: LGTM! Consistent test infrastructure updates.

Route test timeout added following the same pattern as other test files in this PR.

Also applies to: 38-39

src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala (3)

125-132: Breaking change: Stricter pagination and facet limits.

The limits have been significantly reduced:

  • maxFacetSize: 2000 → 200
  • maxPage: 100 → 10
  • maxPageSize: 500 → 100

Ensure API consumers are aware of these changes, as existing integrations using higher values will now receive capped results or errors.


249-256: Good addition: Requiring query or filter with facets.

This validation prevents expensive facet-only queries that scan the entire index. This is a sensible performance safeguard.


431-439: Good improvement: Empty string handling.

Treating empty/whitespace-only values as absent parameters (None) is a sensible approach that prevents unnecessary validation errors.

src/test/scala/dpla/api/v2/search/queryBuilders/QueryBuilderTest.scala (3)

273-277: LGTM: Updated track_total_hits expectation from Boolean to Int.

The test now expects track_total_hits to be 10000 instead of a boolean value, aligning with the QueryBuilder.scala changes that limit total hit tracking for performance optimization.


287-311: LGTM: Tests updated to use multi_match instead of query_string.

The keyword query tests now correctly traverse the multi_match path instead of query_string, reflecting the query builder's switch to multi_match queries. The assertions for query, fields, operator, and lenient are properly adjusted.


930-937: LGTM: Filter query path updated to match new query structure.

The filter query test now reads from "query", "bool", "filter" path directly, consistent with the simplified filter structure in the updated QueryBuilder.scala.

src/test/scala/dpla/api/v2/endToEnd/InvalidParamsTest.scala (1)

35-40: LGTM: Typed actor system integration and route timeout added.

The addition of typedSystem and createActorSystem() override enables proper typed/classic actor system interop, and the 3-second RouteTestTimeout provides reasonable test timing. This pattern is consistent with other test files updated in this PR.

src/test/scala/dpla/api/v2/search/paramValidators/ItemParamValidatorTest.scala (1)

36-40: LGTM: Test updated to include required query parameter.

The addition of "q" -> "test" aligns with the new validation rule that requires a query or filter when facets are specified (as tested in ParamValidatorTest.scala at lines 322-326). The test still correctly verifies that non-applicable DPLA Map fields are ignored.

src/test/scala/dpla/api/v2/search/paramValidators/ParamValidatorTest.scala (4)

321-326: LGTM: New test validates facets require query or filter.

This test ensures that requests with facets but no accompanying query (q) or filter are rejected. This prevents expensive facet-only aggregation queries, which is a sensible performance guard.


353-360: LGTM: Updated max facet_size expectation to 200.

The expected maximum facet_size is now 200 (reduced from 2000), reflecting the stricter limits in the updated ParamValidator.scala.


581-592: LGTM: Pagination validation tests updated for stricter limits.

The tests now verify:

  1. Too-large page values (600) result in InvalidSearchParams rather than capping to max
  2. Deep pagination beyond 1000 results (page=50 × page_size=50 = 2500) is rejected

These stricter limits protect against expensive deep-pagination queries on Elasticsearch.


619-624: LGTM: Updated page_size validation to reject rather than cap.

The test now expects InvalidSearchParams for excessively large page_size values instead of silently capping to the maximum. This explicit rejection provides clearer feedback to API consumers.

src/test/scala/dpla/api/v2/search/ConcurrencyLimiterTest.scala (4)

1-22: LGTM: Well-structured test setup with appropriate async configuration.

The test class properly mixes in ScalaFutures and Eventually for async testing, with a reasonable PatienceConfig (5s timeout, 50ms polling interval).


82-119: LGTM: Thorough concurrency limit enforcement test.

This test properly validates:

  1. Two permits are acquired when two Futures are in-flight
  2. A third request fails with ConcurrencyLimitExceeded after timeout
  3. Permits are released when the original Futures complete

Using Promise to control Future completion timing is the correct approach for testing concurrency limits.


146-162: LGTM: Constructor validation tests for edge cases.

Good coverage of invalid constructor arguments (zero and negative values for both maxConcurrent and timeoutSeconds), ensuring the limiter fails fast with clear errors.


63-80: Permit release on Future construction throw is synchronous and safe.

The implementation at lines 64-71 of ConcurrencyLimiter.scala confirms the permit is released synchronously: when val future = f throws (line 65), the catch block at line 68-70 immediately calls semaphore.release() before returning the failed Future. The test assertion at line 79 is correct and not flaky.

src/main/scala/dpla/api/v2/search/ConcurrencyLimiter.scala (3)

6-24: Docs correctly call out blocking behavior and operational guidance

The Scaladoc explicitly flags the blocking tryAcquire semantics and suggests using a dedicated dispatcher / tuning parameters, which matches the implementation below and should help prevent misuse in Akka/HTTP paths.


25-38: Constructor validation and semaphore setup look correct

The require guards on maxConcurrent and timeoutSeconds and the corresponding Semaphore(maxConcurrent) initialization are consistent with the intended concurrency cap and error reporting; no changes needed here.


81-90: Exception type is informative and matches limiter configuration

ConcurrencyLimitExceeded carries both maxConcurrent and timeoutSeconds and renders a clear message, which should be helpful in logs and client error handling.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements Phase 2 of API improvements focused on search performance optimization and request limiting. Key changes include migrating from Elasticsearch's query_string to multi_match queries, enforcing stricter pagination limits to prevent deep pagination, requiring queries or filters for facet requests, and introducing a new ConcurrencyLimiter class for managing concurrent request loads.

Key Changes

  • Search query optimization: Replaced query_string with multi_match for keyword queries, reduced searchable fields from 27 to 8 high-value fields, and simplified filter query structure
  • Pagination limits: Reduced max page from 100 to 10, max page_size from 500 to 100, max facet_size from 2000 to 200, and added validation to reject requests exceeding 1000 total results (from + size)
  • Concurrency control: Added new ConcurrencyLimiter class with comprehensive test coverage to manage concurrent request loads using semaphores

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala Migrated from query_string to multi_match queries, reduced searchable fields, simplified filter structure, and changed track_total_hits from boolean to 10000
src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala Added deep pagination validation, facet requirement checks, reduced max pagination limits, and added empty string handling in parameter validation
src/main/scala/dpla/api/v2/search/ConcurrencyLimiter.scala New class implementing semaphore-based concurrency limiting for Futures with timeout support
src/test/scala/dpla/api/v2/search/ConcurrencyLimiterTest.scala Comprehensive test coverage for ConcurrencyLimiter including edge cases and failure scenarios
src/test/scala/dpla/api/v2/search/queryBuilders/QueryBuilderTest.scala Updated tests to validate multi_match queries, new filter structure, and track_total_hits value
src/test/scala/dpla/api/v2/search/paramValidators/ParamValidatorTest.scala Updated tests to validate new pagination limits, facet requirements, and deep pagination rejection
src/test/scala/dpla/api/v2/search/paramValidators/ItemParamValidatorTest.scala Updated test to require query parameter with facets
src/test/scala/dpla/api/v2/httpHeadersAndMethods/*.scala Added route test timeouts and reformatted for consistency
src/test/scala/dpla/api/v2/endToEnd/*.scala Added route test timeouts and reformatted for consistency
src/test/scala/dpla/api/helpers/FileReader.scala Improved resource handling with try-finally and explicit UTF-8 encoding

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

moltude and others added 2 commits December 9, 2025 07:08
…cala

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…or.scala

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (1)

196-274: Add defensive error handling to basic field query branch for consistency

The basic field query branch (else block, lines 268-274) silently falls back to keywordQueryFields when getElasticSearchField() returns None, whereas the .before/.after and exact-match branches throw a RuntimeException for unrecognized field names. Although upstream validation in ParamValidator.scala ensures that only valid, searchable field names reach singleFieldQuery() (via searchableDataFields.flatMap(getValidFieldQuery(...))), the inconsistent error handling makes the code less defensive and harder to debug if upstream validation changes or if this method is called directly in the future.

For consistency and defensive programming, adopt the same pattern as other branches:

-    } else {
-      // Basic field query
-      val fields: Seq[String] =
-        Seq(getElasticSearchField(fieldQuery.fieldName)).flatten
-      val obj: JsObject = keywordQuery(fieldQuery.value, fields)
-      Seq(obj)
-    }
+    } else {
+      // Basic field query
+      val field: String = getElasticSearchField(fieldQuery.fieldName)
+        .getOrElse(
+          throw new RuntimeException(
+            "Unrecognized field name: " + fieldQuery.fieldName
+          )
+        )
+      val obj: JsObject = keywordQuery(fieldQuery.value, Seq(field))
+      Seq(obj)
+    }

This aligns all branches to fail fast on invalid field names rather than silently changing query semantics.

🧹 Nitpick comments (2)
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (2)

180-195: Avoid unsafe cast in filterQuery

filterQuery currently relies on .toJson.asInstanceOf[JsArray] for the sequence of term filters, which is brittle and ties correctness to the implementation detail that Seq always serializes to JsArray.

You can build the JsArray directly and drop the cast:

-  private def filterQuery(filters: Seq[Filter]): JsArray =
-    filters
-      .map(filter => {
-        JsObject(
-          "term" -> JsObject(
-            filter.fieldName -> filter.value.toJson
-          )
-        )
-      })
-      .toJson
-      .asInstanceOf[JsArray]
+  private def filterQuery(filters: Seq[Filter]): JsArray =
+    JsArray(
+      filters.map { filter =>
+        JsObject(
+          "term" -> JsObject(
+            filter.fieldName -> filter.value.toJson
+          )
+        )
+      }: _*
+    )

This is safer and keeps the return type explicit.


276-283: Minor naming nit: stripLeadingAndTrainingQuotationMarks

The method name still says Training while the doc comment (Lines 276-278) correctly says “leading and trailing quotation marks.” Consider renaming the method to stripLeadingAndTrailingQuotationMarks for clarity the next time you touch this area.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f9a684 and 0bbdc68.

📒 Files selected for processing (2)
  • src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala (18 hunks)
  • src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (11 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala
🧰 Additional context used
🧬 Code graph analysis (1)
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (1)
src/main/scala/dpla/api/v2/search/models/FieldDefinitions.scala (1)
  • getElasticSearchField (64-67)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Prepare
🔇 Additional comments (6)
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (6)

16-43: Message routing in apply is clear and side‑effect usage is confined

The actor behavior cleanly delegates to the next phase for search, fetch (single vs multi), and random queries, and uses Behaviors.same / Behaviors.unhandled appropriately. No issues spotted here.


67-93: Random query composition with optional filters looks correct

The new composeRandomQuery implementation correctly wraps optional filters in a bool.filter and nests that under function_score, while keeping the unfiltered case minimal. The use of size = 1 matches the intent of a single random hit. No functional problems here.


96-111: Bool query assembly and top‑level search structure are sound

The refactored composeSearchQuery / query pair correctly handle:

  • Pagination (from/size),
  • Combining keyword and field queries into a bool must/should depending on op,
  • Attaching filters as a separate filter clause,
  • Falling back to match_all when both terms and filters are absent,
  • Adding track_total_hits = 10000 to cap expensive hit tracking.

The structure matches typical ES expectations and looks consistent with the rest of the builder.

Also applies to: 128-161


113-123: Keyword multi‑match configuration and curated field list look reasonable

The reduced keywordQueryFields set with boosts, combined with multi_match using best_fields, operator = "AND", and lenient = true, forms a coherent default keyword query strategy. The targetFields fallback ensures calls without explicit fields still use this curated set.

Also applies to: 163-178


285-381: Facet (aggs) composition for geo and date fields looks consistent

The aggs builder correctly distinguishes:

  • Spatial facets using geo_distance buckets with a generated ranges array (Lines 302-307),
  • Date facets using a filtered date_histogram with month/year granularity and corresponding formats and gte windows (Lines 327-340),
  • Regular term facets via terms on exact‑match fields.

The control flow and error handling for unknown facet names are consistent with the rest of the file. No issues stood out here.


383-447: Sort and field retrieval defaults are reasonable

  • The sort function cleanly handles geo sort, field‑based sort, and falls back to _doc when there is no query/sort, which is appropriate for “no‑order” browsing.
  • fieldRetrieval returning Seq("*").toJson when no fields are specified gives a sensible default of retrieving the full source.

No functional concerns in these sections.

Also applies to: 449-453

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/test/scala/dpla/api/v2/search/models/DPLAMAPFieldsTest.scala (1)

62-75: Consider verifying exact field mappings.

The test only checks that the count of mapped fields matches the count of keyword fields. If one field fails to map and is silently dropped, the test wouldn't catch it. Consider using contain allElementsOf to verify that all expected fields are present.

Apply this diff to make the test more precise:

     "have searchable keywordQuery fields defined in ES mapping" in {
       val keywordFields = Seq(
         "sourceResource.title",
         "sourceResource.subject.name",
         "sourceResource.description",
         "sourceResource.creator",
         "sourceResource.collection.title",
         "dataProvider.name",
         "provider.name"
       )
 
       val mapped = keywordFields.flatMap(tester.getElasticSearchField)
-      mapped.size shouldBe keywordFields.size
+      mapped should have size keywordFields.size
+      // Optionally verify all fields mapped successfully
+      keywordFields.foreach { field =>
+        tester.getElasticSearchField(field) should not be empty
+      }
     }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0bbdc68 and 0e2a1b7.

📒 Files selected for processing (3)
  • src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (11 hunks)
  • src/test/scala/dpla/api/v2/search/models/DPLAMAPFieldsTest.scala (1 hunks)
  • src/test/scala/dpla/api/v2/search/queryBuilders/QueryBuilderTest.scala (23 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
src/test/scala/dpla/api/v2/search/models/DPLAMAPFieldsTest.scala (1)
src/main/scala/dpla/api/v2/search/models/FieldDefinitions.scala (2)
  • getElasticSearchField (64-67)
  • getElasticSearchExactMatchField (75-81)
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (1)
src/main/scala/dpla/api/v2/search/models/FieldDefinitions.scala (1)
  • getElasticSearchField (64-67)
src/test/scala/dpla/api/v2/search/queryBuilders/QueryBuilderTest.scala (3)
src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala (5)
  • search (19-33)
  • search (35-37)
  • search (39-41)
  • search (43-46)
  • search (48-51)
src/main/scala/dpla/api/v2/search/mappings/JsonFieldReader.scala (6)
  • readObject (16-29)
  • readString (34-37)
  • readObjectArray (31-34)
  • readInt (40-43)
  • readStringArray (37-40)
  • readBoolean (43-49)
src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (2)
  • query (127-160)
  • filterQuery (183-205)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Agent
🔇 Additional comments (11)
src/test/scala/dpla/api/v2/search/models/DPLAMAPFieldsTest.scala (1)

77-91: LGTM!

The test correctly verifies that exact match fields use not_analyzed subfields. The use of foreach ensures each field is tested individually, which provides clear failure messages.

src/main/scala/dpla/api/v2/search/queryBuilders/QueryBuilder.scala (7)

113-122: LGTM!

The reduced set of keyword query fields with appropriate boost weights aligns well with cultural heritage search patterns. Prioritizing title (^2) over other fields is a sensible default.


162-177: LGTM!

The migration from query_string to multi_match simplifies the keyword query behavior and makes it more user-friendly by removing advanced Lucene query syntax. The lenient flag appropriately handles field type mismatches.


183-193: LGTM!

Changing the return type to JsArray correctly reflects Elasticsearch's bool filter syntax, where filters are an array of query clauses.


195-204: LGTM!

The updated comment correctly describes the multi_match query behavior and clearly distinguishes it from the exact-match "term" query.


205-275: LGTM!

The improved error messages and the fix for the function name typo (stripLeadingAndTrailingQuotationMarks) enhance code clarity. The logic for handling range queries, exact matches, and basic field queries is correct.


277-284: LGTM!

The function name typo has been corrected (stripLeadingAndTrailingQuotationMarks), and the comment accurately describes the behavior.


110-110: Pagination depth is already limited by existing parameter validation.

The ParamValidator enforces maxPage = 10 and maxPageSize = 100, which limits the maximum offset to 900 and total accessible results to approximately 1,000. This is more restrictive than the track_total_hits = 10000 setting, so users cannot request results beyond what is already enforced by the parameter validation layer. No additional validation is needed.

src/test/scala/dpla/api/v2/search/queryBuilders/QueryBuilderTest.scala (3)

273-277: LGTM!

The test correctly updated to expect track_total_hits as an integer value of 10000, matching the QueryBuilder implementation change.


287-320: LGTM!

The tests correctly updated to read multi_match query structures instead of query_string, and the operator field name change from default_operator to operator matches Elasticsearch's multi_match syntax.


404-509: LGTM!

Excellent comprehensive test coverage for the quote stripping behavior. The tests cover all relevant edge cases: properly quoted strings, strings with internal quotes, strings with only leading or trailing quotes, empty quoted strings, and unquoted strings.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 489 to 497
def validIntWithRange(intString: String, param: String): Int = {
val (min: Int, max: Int) = param match {
case "facet_size" => (minFacetSize, maxFacetSize)
case "page" => (minPage, maxPage)
case "page_size" => (minPageSize, maxPageSize)
case _ => (0, 2147483647)
case "page" => (minPage, maxPage)
case "page_size" => (minPageSize, maxPageSize)
case _ => (0, 2147483647)
}

val rule = s"$param must be an integer between 0 and $max"
Copy link

Copilot AI Dec 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The max limits were reduced (e.g., maxPageSize from 500 to 100), and corresponding tests now expect rejection of over-limit values (see ParamValidatorTest.scala lines 619-624). However, the logic at line 502 (unchanged in this PR) still returns max instead of throwing an exception when values exceed the limit. This creates inconsistent behavior where the tests expect rejection but the implementation clamps values. The unchanged line 502 should also be updated to throw ValidationException(rule) to match the new expected behavior.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot open a new pull request to apply changes based on this feedback

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot open a new pull request to apply changes based on this feedback

…ing (#94)

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: moltude <1419766+moltude@users.noreply.github.com>
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala (3)

125-133: Pagination limits and depth guard interplay

The combination of maxPage = 10, maxPageSize = 100, and the fromValue + requestedPageSize > 1000 guard means that, with current constants, no valid combination can actually exceed the 1000 window—the guard is effectively a future‑proof safety net.

To avoid drift if either the ES index.max_result_window or these limits change later, consider either:

  • deriving the 1000 constant from shared configuration, or
  • adding a short comment near maxPage/maxPageSize explaining that they’re chosen so from + size <= 1000, and that the guard is an extra safety check.

Also applies to: 238-247


152-158: Fetch ID count capped by maxPageSize may deserve its own constant

Reusing maxPageSize to limit the number of ids in a fetch (ids.size > maxPageSize) is functionally fine, but it couples two concerns: search page size and fetch batch size. If you ever want to change one without the other, this will be surprising.

Consider introducing a dedicated constant (e.g., maxFetchIds) or at least a short comment explaining that the fetch ID limit is intentionally tied to maxPageSize.

Also applies to: 167-170


490-499: validIntWithRange error message hard-codes lower bound 0

validIntWithRange computes (min, max) per param (e.g., page uses (minPage, maxPage) where minPage = 1), but the rule string always says:

val rule = s"$param must be an integer between 0 and $max"

So for page, an input of 0 or a negative value will yield an error message that mentions 0 as the lower bound, even though the actual check is int < min.

Consider interpolating min instead of 0:

-    val rule = s"$param must be an integer between 0 and $max"
+    val rule = s"$param must be an integer between $min and $max"

This keeps behavior the same while making validation errors clearer to API consumers.

Also applies to: 502-507

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0e2a1b7 and de04ff6.

📒 Files selected for processing (1)
  • src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala (18 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/main/scala/dpla/api/v2/search/paramValidators/ParamValidator.scala (1)
src/main/scala/dpla/api/v2/search/models/FieldDefinitions.scala (5)
  • TextField (7-7)
  • IntField (9-9)
  • URLField (11-11)
  • DateField (13-13)
  • WildcardField (17-17)

Comment on lines +210 to +212
val filter =
getValidFilter(rawParams)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Facet guard can be bypassed via empty filter values

The new facet guard:

  • val hasFacets = facets.nonEmpty
  • val hasQuery = q.isDefined || fieldQueries.nonEmpty
  • val hasFilter = filter.isDefined
  • if (hasFacets && !hasQuery && !hasFilter) throw ...

assumes that hasFilter implies at least one effective filter value. However:

  • getValidFilter splits on ":"/"AND" and produces values: Array[String],
  • each value is passed through getValid, and
  • getValid now drops empty strings (case (_, v) if v.trim.isEmpty => None).

This means a request like filter=type: (no value) yields values = Array(""), which becomes Seq.empty after getValid drops the empty value. The result is filter = Some(Seq.empty):

  • filter.isDefined is truehasFilter == true,
  • but there are actually no filters applied downstream.

That allows facet requests with no real query or filter to pass the new guard, undermining its intent and potentially re‑enabling high‑cardinality facet requests you were trying to constrain.

One way to tighten this up is to treat “all values empty” as an invalid filter rather than as Some(Nil). For example, you can enforce this inside getValidFilter:

-  private def getValidFilter(
-      rawParams: Map[String, String]
-  ): Option[Seq[Filter]] =
-    rawParams.get("filter").map { filter =>
-      val fieldName = filter
-        .split(":", 2)
-        .headOption
-        .getOrElse(throw ValidationException(s"$filter is not a valid filter"))
-
-      val values = filter
-        .split(":", 2)
-        .lastOption
-        .getOrElse(throw ValidationException(s"$filter is not a valid filter"))
-        .split("AND")
-        .map(_.trim)
-
-      if (searchableDataFields.contains(fieldName)) {
-        val validationMethod = getValidationMethod(fieldName)
-
-        values.flatMap(value => {
-          val params = Map(fieldName -> value)
-          getValid(params, fieldName, validationMethod)
-            .map(Filter(fieldName, _))
-        })
-      } else {
-        throw ValidationException(s"$fieldName is not a valid filter field")
-      }
-    }
+  private def getValidFilter(
+      rawParams: Map[String, String]
+  ): Option[Seq[Filter]] =
+    rawParams.get("filter").map { filter =>
+      val fieldName = filter
+        .split(":", 2)
+        .headOption
+        .map(_.trim)
+        .getOrElse(throw ValidationException(s"$filter is not a valid filter"))
+
+      val rawValues = filter
+        .split(":", 2)
+        .lastOption
+        .getOrElse(throw ValidationException(s"$filter is not a valid filter"))
+        .split("AND")
+        .map(_.trim)
+
+      val values = rawValues.filter(_.nonEmpty)
+      if (values.isEmpty)
+        throw ValidationException(s"$filter is not a valid filter")
+
+      if (searchableDataFields.contains(fieldName)) {
+        val validationMethod = getValidationMethod(fieldName)
+
+        values.flatMap { value =>
+          val params = Map(fieldName -> value)
+          getValid(params, fieldName, validationMethod)
+            .map(Filter(fieldName, _))
+        }
+      } else {
+        throw ValidationException(s"$fieldName is not a valid filter field")
+      }
+    }

This keeps the new “ignore empty values” behavior in getValid for other params, but makes empty filters explicitly invalid and ensures hasFilter truly implies at least one applied filter.

Also applies to: 249-256, 338-367, 431-439

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/test/scala/dpla/api/v2/search/paramValidators/ParamValidatorTest.scala (2)

297-304: Facet tests correctly enforce q/filter requirement; consider adding filter-only coverage

The facet tests now (a) require facets to be paired with a primary search param (via q) and (b) explicitly reject facets with no query or filter, which matches the tightened validation behavior. To fully exercise the new rule, it would be useful to add a case where facets is combined with filter but no q, to lock in that this combination remains accepted.

Example sketch, placed alongside the existing facet tests:

   "facet validator" should {
@@
     "accept valid param" in {
@@
     }
@@
+    "accept facets with filter but no query" in {
+      val given = "provider.@id"
+      val filter = "provider.@id:http://dp.la/api/contributor/lc"
+      val params = Map("facets" -> given, "filter" -> filter)
+      ebookParamValidator ! RawSearchParams(params, replyProbe.ref)
+      val msg = interProbe.expectMessageType[ValidSearchParams]
+      msg.params.facets shouldEqual Some(Seq("provider.@id"))
+    }
+
     "reject unfacetable field" in {

Also applies to: 313-320, 322-326


353-358: Pagination guard tests align with new behavior; rename specs for clarity

The facet-size and pagination tests now correctly treat oversize values as invalid and the new “reject deep pagination beyond 1000 results” case exercises the depth guard nicely. The only mismatch is that the spec descriptions at Line [579] and Line [617] still say “default to max if param is too large” even though the expectations now assert invalid input; renaming them would keep the tests self-documenting and consistent with the facet-size test.

Suggested rename:

-    "default to max if param is too large" in {
+    "reject param if too large" in {
@@
-    "default to max if param is too large" in {
+    "reject param if too large" in {

Also applies to: 556-563, 579-584, 586-590, 617-622

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between de04ff6 and d14bcec.

📒 Files selected for processing (1)
  • src/test/scala/dpla/api/v2/search/paramValidators/ParamValidatorTest.scala (27 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/test/scala/dpla/api/v2/search/paramValidators/ParamValidatorTest.scala (1)
src/main/scala/dpla/api/v2/search/SearchProtocol.scala (1)
  • Random (24-27)
🔇 Additional comments (3)
src/test/scala/dpla/api/v2/search/paramValidators/ParamValidatorTest.scala (3)

13-16: Readability-only reflow looks good

Rewrapping the class header, long strings, and map literals over multiple lines improves readability without changing behavior; no issues here.

Also applies to: 52-53, 74-79, 434-438, 768-768, 823-824, 834-835, 850-851, 859-860, 869-870


123-124: Consistent field-value extraction across validators

These assertions now all use the same pattern to pull out the stored value for each field, which keeps the tests uniform and easier to scan; behavior is unchanged and looks correct.

Also applies to: 135-136, 147-148, 177-178, 214-215, 244-245, 403-404, 443-444, 473-474, 510-511, 642-643, 761-762, 798-799


875-887: Filter validator assertions are precise and robust

These filter tests now assert both the parsed field name and value for the first clause, which tightens the contract without over-coupling to internal representation. This is a clear improvement with no downsides.

Also applies to: 889-900

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants