
Conversation

@hablutzel1
Contributor

Deploying the project to AWS with the default configuration results in a slow and randomly failing API, as can be reproduced with the following simple script:

#!/bin/bash
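# Repeatedly calls the MPIC API and prints the latency of each request.
# Note: gdate (GNU date, e.g. from coreutils on macOS) is used for millisecond-precision timestamps.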

API_URL="https://zounb7fdwc.execute-api.us-east-2.amazonaws.com/v1/mpic"
API_KEY="xxx"

while true; do
  start_time=$(gdate +%s%3N)
  RESPONSE=$(curl --silent --location "$API_URL" \
    --header 'Content-Type: application/json' \
    --header 'Accept: application/json' \
    --header "x-api-key: $API_KEY" \
    --data "{
            \"check_type\": \"caa\",
            \"domain_or_ip_target\": \"example.org\"
          }"
  )
  end_time=$(gdate +%s%3N)
  duration=$((end_time - start_time))
  echo "Response took ${duration}ms: ${RESPONSE:0:100}..."
done

That produces output like the following:

$ ./reproduce_timeout_problem.sh 
Response took 11671ms: {"mpic_completed":true,"request_orchestratio...
Response took 2151ms: {"mpic_completed":true,"request_orchestration...
Response took 5453ms: {"mpic_completed":true,"request_orchestration...
Response took 28451ms: {"mpic_completed":true,"request_orchestratio...
Response took 26961ms: {"mpic_completed":true,"request_orchestratio...
Response took 29623ms: {"message": "Endpoint request timed out"}...

It can also be observed that the coordinator Lambda is almost always using close to 100% of its memory:

$ aws logs filter-log-events --log-group-name '/aws/lambda/open_mpic_lambda_coordinator_826858333' | grep "Max Memory Used"
            "message": "REPORT RequestId: 510a9d2c-a968-4598-a6d8-befb041dc95f\tDuration: 1700.98 ms\tBilled Duration: 1701 ms\tMemory Size: 128 MB\tMax Memory Used: 124 MB\t\n",
            "message": "REPORT RequestId: f36fde4d-2573-48c0-9726-d352d8455283\tDuration: 1582.41 ms\tBilled Duration: 1583 ms\tMemory Size: 128 MB\tMax Memory Used: 126 MB\t\n",
            "message": "REPORT RequestId: 36fecd8c-dbc3-433b-b2ac-ca1952e9a09b\tDuration: 1569.24 ms\tBilled Duration: 1570 ms\tMemory Size: 128 MB\tMax Memory Used: 128 MB\t\n",
            "message": "REPORT RequestId: 3dd7166c-4387-4b1b-9087-9a788c0de17c\tDuration: 1804.33 ms\tBilled Duration: 1805 ms\tMemory Size: 128 MB\tMax Memory Used: 128 MB\t\n",
            "message": "REPORT RequestId: 99b6cc3d-7d1e-47ca-bcd9-4ef84ecf25eb\tDuration: 7349.00 ms\tBilled Duration: 7350 ms\tMemory Size: 128 MB\tMax Memory Used: 128 MB\t\n",
            "message": "REPORT RequestId: f2665534-7abb-44f6-9a73-e9f02b39c501\tDuration: 63057.97 ms\tBilled Duration: 60000 ms\tMemory Size: 128 MB\tMax Memory Used: 128 MB\t\n",

But if the Lambda memory is doubled to 256 MB, the following memory consumption is observed and the API becomes stable:

$ aws logs filter-log-events --log-group-name '/aws/lambda/open_mpic_lambda_coordinator_826858333' | grep "Max Memory Used"
...
            "message": "REPORT RequestId: ef34d358-0d32-4f21-8d04-b94336a49c92\tDuration: 661.62 ms\tBilled Duration: 662 ms\tMemory Size: 256 MB\tMax Memory Used: 218 MB\t\n",
            "message": "REPORT RequestId: bdfee13f-2dbe-4acc-a8e0-c819dd17b13c\tDuration: 674.05 ms\tBilled Duration: 675 ms\tMemory Size: 256 MB\tMax Memory Used: 218 MB\t\n",
            "message": "REPORT RequestId: 88cbaf53-ea55-4b0b-a83e-637927545fe6\tDuration: 685.87 ms\tBilled Duration: 686 ms\tMemory Size: 256 MB\tMax Memory Used: 219 MB\t\n",
            "message": "REPORT RequestId: 8bb729d7-8182-43d9-919e-a705047434c9\tDuration: 723.59 ms\tBilled Duration: 724 ms\tMemory Size: 256 MB\tMax Memory Used: 219 MB\t\n",
            "message": "REPORT RequestId: 4d5f89c4-d569-4eaa-8145-72cbd63ba03d\tDuration: 349.70 ms\tBilled Duration: 350 ms\tMemory Size: 256 MB\tMax Memory Used: 219 MB\t\n",
            "message": "REPORT RequestId: c5d2ce6b-cd47-4e8d-8127-ada9cde7fe4b\tDuration: 241.34 ms\tBilled Duration: 242 ms\tMemory Size: 256 MB\tMax Memory Used: 219 MB\t\n",

Now, ~219 MB out of 256 MB (~86%) might still be dangerously close to the limit, so you might want to increase the default even further for safety (e.g. to accommodate future growth of the codebase).
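As a stopgap before a new default ships, an already-deployed coordinator's memory can also be raised in place with the AWS CLI. This is only a sketch; the function name below is taken from the log group shown above and will differ per deployment:

# Raise the coordinator Lambda's memory allocation in place
# (function name taken from the log group above; adjust for your deployment).
aws lambda update-function-configuration \
  --function-name open_mpic_lambda_coordinator_826858333 \
  --memory-size 256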

@birgelee
Member

Thanks for bringing this to our attention. We intend to address this, and we appreciate your help with the AWS tuning. We had also noticed some timeouts, particularly with high remote-perspective counts.

@birgelee
Member

birgelee commented Apr 1, 2025

Thanks for contributing this. I confirmed it passes all integration tests, and it fixed a bug we previously had where integration tests would sometimes fail.

I did take the liberty of doubling the memory to 512 MB, as I feel 80% memory pressure is not good and could be exceeded if even more perspectives were added.
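For anyone checking a deployment, the configured memory size can be confirmed with a quick AWS CLI query. A minimal sketch, reusing the function name from the logs above (adjust as needed):

# Confirm the configured memory size of the coordinator Lambda.
aws lambda get-function-configuration \
  --function-name open_mpic_lambda_coordinator_826858333 \
  --query 'MemorySize'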

@birgelee birgelee merged commit 8fb72c0 into open-mpic:main Apr 1, 2025
1 check passed
@hablutzel1 hablutzel1 deleted the coordinator-memory branch April 20, 2025 22:21