feat: Upgrade AWS SDK from v1 to v2#930
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: BATMAN-JD The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
/test integration-test |
|
Not sure what is happening with PROW but I am unable to do the integration tests. It fails to setup namespace and or returns kube-api server issues. Will wait until later today to run again. |
|
/test integration-test |
|
I had claude run a security review on this PR as well. Here are the results: Overall Assessment: ✅ APPROVED FOR MERGE No critical security vulnerabilities found. The SDK v2 upgrade maintains strong security posture with several improvements. Key Findings: Critical Issues: ❌ None` |
AlexSmithGH
left a comment
There was a problem hiding this comment.
Going to stop doing my review here and let you tackle these set of comments throughout the entire PR before I continue.
My last comment in the tree is to highlight the importance of why we shouldn't hard code expected outputs. For most services, that aren't returning a generic error response, such as Unauthorized/AccessDenied, the error will be specified in the service's type package.
Additionally, input parameter values for API calls should also be sourced from enums or other static sources and not hard coded when it's an AWS managed field.
|
/test integration-test |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #930 +/- ##
==========================================
+ Coverage 46.24% 46.43% +0.19%
==========================================
Files 46 46
Lines 6799 6758 -41
==========================================
- Hits 3144 3138 -6
+ Misses 3323 3294 -29
+ Partials 332 326 -6
🚀 New features to boost your workflow:
|
AlexSmithGH
left a comment
There was a problem hiding this comment.
Further review comments on nil pointer dereferences. I will stop the review here so you can update this everywhere. Be very pessimistic on logical changes to the code outside of updates to error handling, and client instantiations.
87f73e8 to
7c1b51e
Compare
|
I tried running integration tests in a different DRAFT PR and ran into the same error. I believe since it is flaky we have a PROW issue. I saw this on Friday morning, was able to run no problem Friday evening, then since feedback changes I have not been able to run again. Here is the matching line of problems in the other PR that matches this one, keep in mind that other draft PR has no aws-sdk-v2 changes from this. make[1]: Entering directory '/go/src/github.com/openshift/aws-account-operator' |
7c1b51e to
ee682f0
Compare
|
/test integration-test |
ee682f0 to
061b8a4
Compare
|
Going to spend some time on a separate branch and possibly PR to investigate PROW integration tests. Please hold off on this PR for now |
This commit upgrades the AWS SDK from v1 (github.com/aws/aws-sdk-go) to v2 (github.com/aws/aws-sdk-go-v2) across the entire codebase. ## Changes ### SDK v2 Migration (36 files) - Imports: Updated all AWS SDK imports to v2 packages - Context Parameters: Added context.TODO() to all AWS API calls - Error Handling: Converted awserr.Error to smithy.APIError with errors.As() - Error Codes: Changed aerr.Code() to aerr.ErrorCode() - Type Changes: Converted pointer slices to value slices (e.g., []*iam.Role → []iamtypes.Role) - Pointer Derefs: Removed * where SDK v2 returns values instead of pointers ### Master Features Preserved - PauseReconciliationAnnotation - Pause reconciliation feature - shard-name configuration - Reading from configMap for AWS account tagging - Finalizer retry logic - 5 retries with conflict handling ### Bug Fix Fixed pre-existing goroutine channel deadlock in ec2.go InitializeRegion(): - HandleServiceQuotaRequests error path was returning without sending to ec2Errors channel - This caused parent goroutine to hang indefinitely waiting for channel message - Added proper error channel send before return (lines 183-186) ## Testing All integration tests passing (5/5): - test_fake_accountclaim.sh - test_accountpool_size.sh - test_sts_accountclaim.sh - test_nonccs_account_creation.sh - test_aws_ou_logic.sh Real AWS operations validated with SDK v2. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit completes the AWS SDK v2 upgrade by fixing all remaining unit test failures. The key changes ensure 100% test compatibility with SDK v2 while maintaining all existing functionality. Changes: - Fixed LogAwsError in pkg/utils/utils.go to use smithy.APIError instead of awserr.Error for SDK v2 compatibility - Converted totalaccountwatcher_test.go to SDK v2 with proper context parameters and type updates - Fixed account_controller_test.go by removing incorrect GetUser mock expectations that were causing test failures - Added proper GetCallerIdentity expectations for SDK v2 (requires context parameter) Test Results: - Unit tests: 49/49 passing (100%) - Integration tests: 5/5 passing (100%) - All existing functionality preserved - All master branch features maintained The issue with the two failing tests was that extra GetUser mock expectations were incorrectly added during conversion. In SDK v1, the tests only used GetCallerIdentity for the aao-root user, but our initial SDK v2 conversion added GetUser calls that didn't exist in master. This caused gomock to match the wrong expectations when using gomock.Any() for all parameters. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updates UBI minimal base image from 9.7-1769056855 to 9.7-1770203734 as required by boilerplate validation. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Lint fixes: - Convert localmetrics to use smithy.APIError instead of awserr.Error - Update integration test to use SDK v2 IAM client and types - Fix deprecated EndpointResolverWithOptions usage in awsclient - Simplify loop in awsfederatedrole controller - Use tagged switch in account controller error handling - Add missing aws-sdk-go-v2/config dependency Validate fixes: - Regenerate files with CI tool versions (controller-gen v0.16.4) - Update openapi.go to use full module paths - Update CRD files to match CI controller-gen version - Add blank lines in mock files per CI mockgen format All files generated with: make container-generate Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit addresses reviewer feedback on the AWS SDK v2 upgrade: Type-Safe Error Handling (9 files): - Convert string-based error checking to typed exceptions - Use errors.As() with service-specific exception types - Pattern: aerr.ErrorCode() == "ExceptionName" -> errors.As(err, &typedException) - Updated files: account_controller.go, cases.go, iam.go, region_enablement.go, service_quota.go, organizational_units.go, reuse.go, awsfederatedrole_controller.go, pkg/awsclient/iam.go SDK v2 Best Practices: - Use typed constants: InstanceStateNameRunning, ArchitectureTypeX8664 - Return actual error codes instead of generic strings in getBuildIAMUserErrorReason - Remove misleading comments about ResourceAlreadyExistsException (never valid for Account service) Bug Fixes: - Revert unintended behavior change in service quota error handling (ec2.go) All changes validated: - 104 unit tests passing - 5/5 integration tests passing Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace unsafe direct pointer dereferences with aws.ToString() for nil-safe access: - Line 350: Log message when attaching admin policy - Line 355: Error message for policy attachment failure - Line 360: Log message when creating secrets - Line 374: Error message for access key rotation failure - Line 575: Error log when deleting IAM access keys This prevents potential nil pointer panics and follows AWS SDK v2 best practices. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Check for EntityAlreadyExistsException BEFORE the generic error code switch statement, not in an else if afterwards. This prevents the switch's default case from catching and returning before the typed exception check can be reached. The typed exception is checked first for robust error handling, with a fallback case in the switch for generic error code compatibility. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
061b8a4 to
9dfd332
Compare
|
/test integration-test |
|
PR is ready again. Fixed the integration test failures that were causing all of these reruns #934 We should be good to move forward now. |
AlexSmithGH
left a comment
There was a problem hiding this comment.
Looking good! Few minor comments on some patterns we should change.
controllers/awsfederatedaccountaccess/awsfederatedaccountaccess_controller.go
Outdated
Show resolved
Hide resolved
controllers/awsfederatedaccountaccess/awsfederatedaccountaccess_controller.go
Outdated
Show resolved
Hide resolved
This commit addresses all code review feedback from AlexSmithGH:
- Source error codes from AWS SDK v2 packages instead of hardcoded strings
- account_controller_test.go: Use ErrorCode() method on exception types
to get canonical error strings from organizationstypes package
- All error code constants now sourced via ErrorCode() method
- Replace hardcoded status strings with package constants
- account_controller.go: Use organizationstypes constants for
CreateAccountState (FAILED, IN_PROGRESS) and CreateAccountFailureReason
- region_enablement.go: Use accounttypes constants for RegionOptStatus
- service_quota.go: Use servicequotastypes constants for RequestStatus
- Remove redundant error handling fallback cases
- iam.go: Remove EntityAlreadyExists fallback (handled by typed check)
- organizational_units.go: Remove AccountNotFoundException and
ConcurrentModificationException fallback cases (handled by typed checks)
- Simplify error handling patterns
- byoc.go: Convert if-else chain to switch statement for error handling
- awsfederatedaccountaccess_controller.go: Remove unnecessary
smithy.APIError checks, use only typed EntityAlreadyExistsException
- Inline constants where appropriate
- sts_assume_role.go: Use aws.Int32(3600) directly instead of variable
- Add missing aws import
- Fix test mock errors
- organizational_units_test.go: Use typed organizationstypes exceptions
instead of GenericAPIError in test mocks
All unit tests (42/42) and integration tests (5/5) passing.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
Looks like we are having issues with Quay right now. Will run these again later |
|
/test all |
3a81748 to
7ce3135
Compare
Update UBI minimal base image to latest version. Keep boilerplate at v8.3.2 to match available images in CI infrastructure. Changes: - UBI minimal: 9.7-1770267347 -> 9.7-1771346502 - Boilerplate remains at v8.3.2 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
b2f4905 to
dcd5c20
Compare
|
/test integration-test |
|
integration test failed from no cluster in pool ready to claim. Will retry later |
|
/test integration-test |
|
@BATMAN-JD: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |

What is being added?
This commit upgrades the AWS SDK from v1 (github.com/aws/aws-sdk-go) to v2 (github.com/aws/aws-sdk-go-v2) across the entire codebase.
SDK v2 Migration
Changes
SDK v2 Migration
github.com/aws/aws-sdk-go-v2/...)context.TODO()to all AWS API callserrors.As()(e.g.,*iamtypes.EntityAlreadyExistsException,*organizationstypes.DuplicateOrganizationalUnitException)smithy.APIErrorfor compatibilityaerr.Code()toaerr.ErrorCode()[]*iam.Role → []iamtypes.Role)aws.ToString(),aws.ToInt32(), etc. instead of direct dereference for nil-safe accessec2types.InstanceStateNameRunning,ec2types.ArchitectureTypeX8664)Infrastructure Improvements
PROW CI Stability: Fixed "namespaces ci-op-XXXXX not found" errors affecting PROW integration tests. Root cause: PROW sets NAMESPACE environment variable to ephemeral test namespace (e.g., ci-op-f7vd6446), but Makefile was using ${NAMESPACE} which pointed to the wrong namespace. Since oc process --local strips namespace metadata from templates, resources attempted creation in PROW's test namespace instead of aws-account-operator, causing failures. Solution: Introduced OPERATOR_NAMESPACE variable (defaults to aws-account-operator) and added explicit --namespace=${OPERATOR_NAMESPACE} flags to oc apply commands. Also added namespace readiness polling (up to 20 seconds) as additional safety measure to ensure namespace is Active before resource creation.
Checklist before requesting review
Steps To Manually Test
Please provide us steps to reproduce the scenarios needed to test this. If an integration test is provided, let us know how to run it. If not, enumerate the steps to validate this does what it's supposed to.
Ref SREP-1187