Skip to content

Comments

feat: Upgrade AWS SDK from v1 to v2#930

Open
BATMAN-JD wants to merge 10 commits intoopenshift:masterfrom
BATMAN-JD:upgrade-aws-sdk-v2
Open

feat: Upgrade AWS SDK from v1 to v2#930
BATMAN-JD wants to merge 10 commits intoopenshift:masterfrom
BATMAN-JD:upgrade-aws-sdk-v2

Conversation

@BATMAN-JD
Copy link
Contributor

@BATMAN-JD BATMAN-JD commented Feb 6, 2026

What is being added?

This commit upgrades the AWS SDK from v1 (github.com/aws/aws-sdk-go) to v2 (github.com/aws/aws-sdk-go-v2) across the entire codebase.

SDK v2 Migration

Changes

SDK v2 Migration

  • Imports: Updated all AWS SDK imports to v2 packages (github.com/aws/aws-sdk-go-v2/...)
  • Context Parameters: Added context.TODO() to all AWS API calls
  • Type-Safe Error Handling: Use service-specific typed exceptions with errors.As() (e.g.,
    *iamtypes.EntityAlreadyExistsException, *organizationstypes.DuplicateOrganizationalUnitException)
    • Check typed exceptions first, then fall back to generic smithy.APIError for compatibility
    • Properly ordered error handling (typed checks before switch statements)
  • Error Codes: Changed aerr.Code() to aerr.ErrorCode()
  • Type Changes: Converted pointer slices to value slices (e.g., []*iam.Role → []iamtypes.Role)
  • Safe Pointer Handling: Use aws.ToString(), aws.ToInt32(), etc. instead of direct dereference for nil-safe access
  • Typed Constants: Use SDK v2 constants (e.g., ec2types.InstanceStateNameRunning, ec2types.ArchitectureTypeX8664)

Infrastructure Improvements

PROW CI Stability: Fixed "namespaces ci-op-XXXXX not found" errors affecting PROW integration tests. Root cause: PROW sets NAMESPACE environment variable to ephemeral test namespace (e.g., ci-op-f7vd6446), but Makefile was using ${NAMESPACE} which pointed to the wrong namespace. Since oc process --local strips namespace metadata from templates, resources attempted creation in PROW's test namespace instead of aws-account-operator, causing failures. Solution: Introduced OPERATOR_NAMESPACE variable (defaults to aws-account-operator) and added explicit --namespace=${OPERATOR_NAMESPACE} flags to oc apply commands. Also added namespace readiness polling (up to 20 seconds) as additional safety measure to ensure namespace is Active before resource creation.

Checklist before requesting review

  • I have tested this locally
  • I have included unit tests
  • I have updated any corresponding documentation

Steps To Manually Test

Please provide us steps to reproduce the scenarios needed to test this. If an integration test is provided, let us know how to run it. If not, enumerate the steps to validate this does what it's supposed to.

  1. Start the operator
  2. Run the thing
  3. Observe X in the spec for the thing
  4. Clean up the thing

Ref SREP-1187

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 6, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: BATMAN-JD
Once this PR has been reviewed and has the lgtm label, please assign jharrington22 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@BATMAN-JD
Copy link
Contributor Author

/test integration-test

@BATMAN-JD
Copy link
Contributor Author

BATMAN-JD commented Feb 6, 2026

Not sure what is happening with PROW but I am unable to do the integration tests. It fails to setup namespace and or returns kube-api server issues. Will wait until later today to run again.

@BATMAN-JD
Copy link
Contributor Author

/test integration-test

@BATMAN-JD
Copy link
Contributor Author

I had claude run a security review on this PR as well. Here are the results:
`Security Review Results 🔒

Overall Assessment: ✅ APPROVED FOR MERGE

No critical security vulnerabilities found. The SDK v2 upgrade maintains strong security posture with several improvements.

Key Findings:

Critical Issues: ❌ None`

Copy link
Contributor

@AlexSmithGH AlexSmithGH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Going to stop doing my review here and let you tackle these set of comments throughout the entire PR before I continue.

My last comment in the tree is to highlight the importance of why we shouldn't hard code expected outputs. For most services, that aren't returning a generic error response, such as Unauthorized/AccessDenied, the error will be specified in the service's type package.

Additionally, input parameter values for API calls should also be sourced from enums or other static sources and not hard coded when it's an AWS managed field.

@BATMAN-JD
Copy link
Contributor Author

/test integration-test

@codecov-commenter
Copy link

codecov-commenter commented Feb 11, 2026

Codecov Report

❌ Patch coverage is 40.95450% with 532 lines in your changes missing coverage. Please review.
✅ Project coverage is 46.43%. Comparing base (7ba7c8a) to head (dcd5c20).

Files with missing lines Patch % Lines
pkg/awsclient/client.go 12.60% 207 Missing and 1 partial ⚠️
...ountaccess/awsfederatedaccountaccess_controller.go 27.50% 53 Missing and 5 partials ⚠️
controllers/account/service_quota.go 11.32% 45 Missing and 2 partials ⚠️
controllers/account/region_enablement.go 15.38% 43 Missing and 1 partial ⚠️
controllers/accountclaim/reuse.go 45.71% 35 Missing and 3 partials ⚠️
pkg/awsclient/iam.go 23.40% 35 Missing and 1 partial ⚠️
...rs/awsfederatedrole/awsfederatedrole_controller.go 0.00% 27 Missing ⚠️
controllers/account/cases.go 16.66% 20 Missing ⚠️
controllers/account/account_controller.go 62.50% 18 Missing ⚠️
controllers/account/iam.go 82.75% 8 Missing and 2 partials ⚠️
... and 6 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #930      +/-   ##
==========================================
+ Coverage   46.24%   46.43%   +0.19%     
==========================================
  Files          46       46              
  Lines        6799     6758      -41     
==========================================
- Hits         3144     3138       -6     
+ Misses       3323     3294      -29     
+ Partials      332      326       -6     
Files with missing lines Coverage Δ
controllers/account/byoc.go 67.34% <100.00%> (-0.17%) ⬇️
pkg/awsclient/sts/sts_assume_role.go 76.47% <100.00%> (ø)
pkg/awsclient/tags.go 100.00% <100.00%> (ø)
pkg/localmetrics/localmetrics.go 54.74% <100.00%> (+0.16%) ⬆️
pkg/utils/utils.go 80.00% <100.00%> (+0.16%) ⬆️
main.go 0.00% <0.00%> (ø)
...ollers/validation/account_validation_controller.go 45.31% <84.21%> (+1.17%) ⬆️
...ontrollers/accountclaim/accountclaim_controller.go 54.07% <60.00%> (-0.31%) ⬇️
pkg/totalaccountwatcher/totalaccountwatcher.go 63.39% <50.00%> (-0.25%) ⬇️
controllers/account/ec2.go 62.01% <88.67%> (ø)
... and 11 more
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@BATMAN-JD
Copy link
Contributor Author

BATMAN-JD commented Feb 11, 2026

looks like the same PROW issues as last week. We see this error. Will retry in the morning. I did not see this prior to attempting to migrate to aws sdk v2, I am wondering if we have some race condition with more dependencies. If it fails again tomorrow I will maybe try a wait or a check in the bootstrap to ensure namespace exists. Kinda suspicious with all the integration tests I have done over the last couple weeks I never witnessed this error in PROW until we introduced aws-sdk-v2. Seems flaky, but did self resolve last Friday. More to come.
image

Copy link
Contributor

@AlexSmithGH AlexSmithGH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Further review comments on nil pointer dereferences. I will stop the review here so you can update this everywhere. Be very pessimistic on logical changes to the code outside of updates to error handling, and client instantiations.

@BATMAN-JD
Copy link
Contributor Author

BATMAN-JD commented Feb 12, 2026

I tried running integration tests in a different DRAFT PR and ran into the same error. I believe since it is flaky we have a PROW issue. I saw this on Friday morning, was able to run no problem Friday evening, then since feedback changes I have not been able to run again. Here is the matching line of problems in the other PR that matches this one, keep in mind that other draft PR has no aws-sdk-v2 changes from this.

make[1]: Entering directory '/go/src/github.com/openshift/aws-account-operator'
boilerplate/openshift/golang-osd-operator/standard.mk:107: Setting GOEXPERIMENT=boringcrypto - this generally causes builds to fail unless building inside the provided Dockerfile. If building locally consider calling 'go build .'
Create aws-account-operator namespace
Error from server (NotFound): namespaces "aws-account-operator" not found
namespace/aws-account-operator created
Create aws-account-operator CRDs
customresourcedefinition.apiextensions.k8s.io/accountclaims.aws.managed.openshift.io created
customresourcedefinition.apiextensions.k8s.io/accountpools.aws.managed.openshift.io created
customresourcedefinition.apiextensions.k8s.io/accounts.aws.managed.openshift.io created
customresourcedefinition.apiextensions.k8s.io/awsfederatedaccountaccesses.aws.managed.openshift.io created
customresourcedefinition.apiextensions.k8s.io/awsfederatedroles.aws.managed.openshift.io created
Create zero size account pool
Error from server (NotFound): error when creating "STDIN": namespaces "ci-op-ty3dkhk7" not found
make[1]: *** [Makefile:108: predeploy-aws-account-operator] Error 1
make[1]: Leaving directory /go/src/github.com/openshift/aws-account-operator

@BATMAN-JD
Copy link
Contributor Author

/test integration-test

@BATMAN-JD
Copy link
Contributor Author

Going to spend some time on a separate branch and possibly PR to investigate PROW integration tests. Please hold off on this PR for now

BATMAN-JD and others added 8 commits February 13, 2026 12:09
This commit upgrades the AWS SDK from v1 (github.com/aws/aws-sdk-go) to v2 (github.com/aws/aws-sdk-go-v2) across the entire codebase.

## Changes

### SDK v2 Migration (36 files)
- Imports: Updated all AWS SDK imports to v2 packages
- Context Parameters: Added context.TODO() to all AWS API calls
- Error Handling: Converted awserr.Error to smithy.APIError with errors.As()
- Error Codes: Changed aerr.Code() to aerr.ErrorCode()
- Type Changes: Converted pointer slices to value slices (e.g., []*iam.Role → []iamtypes.Role)
- Pointer Derefs: Removed * where SDK v2 returns values instead of pointers

### Master Features Preserved
- PauseReconciliationAnnotation - Pause reconciliation feature
- shard-name configuration - Reading from configMap for AWS account tagging
- Finalizer retry logic - 5 retries with conflict handling

### Bug Fix
Fixed pre-existing goroutine channel deadlock in ec2.go InitializeRegion():
- HandleServiceQuotaRequests error path was returning without sending to ec2Errors channel
- This caused parent goroutine to hang indefinitely waiting for channel message
- Added proper error channel send before return (lines 183-186)

## Testing

All integration tests passing (5/5):
- test_fake_accountclaim.sh
- test_accountpool_size.sh
- test_sts_accountclaim.sh
- test_nonccs_account_creation.sh
- test_aws_ou_logic.sh

Real AWS operations validated with SDK v2.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit completes the AWS SDK v2 upgrade by fixing all remaining
unit test failures. The key changes ensure 100% test compatibility with
SDK v2 while maintaining all existing functionality.

Changes:
- Fixed LogAwsError in pkg/utils/utils.go to use smithy.APIError instead
  of awserr.Error for SDK v2 compatibility
- Converted totalaccountwatcher_test.go to SDK v2 with proper context
  parameters and type updates
- Fixed account_controller_test.go by removing incorrect GetUser mock
  expectations that were causing test failures
- Added proper GetCallerIdentity expectations for SDK v2 (requires
  context parameter)

Test Results:
- Unit tests: 49/49 passing (100%)
- Integration tests: 5/5 passing (100%)
- All existing functionality preserved
- All master branch features maintained

The issue with the two failing tests was that extra GetUser mock
expectations were incorrectly added during conversion. In SDK v1,
the tests only used GetCallerIdentity for the aao-root user, but
our initial SDK v2 conversion added GetUser calls that didn't exist
in master. This caused gomock to match the wrong expectations when
using gomock.Any() for all parameters.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updates UBI minimal base image from 9.7-1769056855 to 9.7-1770203734 as
required by boilerplate validation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Lint fixes:
- Convert localmetrics to use smithy.APIError instead of awserr.Error
- Update integration test to use SDK v2 IAM client and types
- Fix deprecated EndpointResolverWithOptions usage in awsclient
- Simplify loop in awsfederatedrole controller
- Use tagged switch in account controller error handling
- Add missing aws-sdk-go-v2/config dependency

Validate fixes:
- Regenerate files with CI tool versions (controller-gen v0.16.4)
- Update openapi.go to use full module paths
- Update CRD files to match CI controller-gen version
- Add blank lines in mock files per CI mockgen format

All files generated with: make container-generate

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit addresses reviewer feedback on the AWS SDK v2 upgrade:

Type-Safe Error Handling (9 files):
- Convert string-based error checking to typed exceptions
- Use errors.As() with service-specific exception types
- Pattern: aerr.ErrorCode() == "ExceptionName" -> errors.As(err, &typedException)
- Updated files: account_controller.go, cases.go, iam.go, region_enablement.go,
  service_quota.go, organizational_units.go, reuse.go, awsfederatedrole_controller.go,
  pkg/awsclient/iam.go

SDK v2 Best Practices:
- Use typed constants: InstanceStateNameRunning, ArchitectureTypeX8664
- Return actual error codes instead of generic strings in getBuildIAMUserErrorReason
- Remove misleading comments about ResourceAlreadyExistsException (never valid for Account service)

Bug Fixes:
- Revert unintended behavior change in service quota error handling (ec2.go)

All changes validated:
- 104 unit tests passing
- 5/5 integration tests passing

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replace unsafe direct pointer dereferences with aws.ToString() for nil-safe access:
- Line 350: Log message when attaching admin policy
- Line 355: Error message for policy attachment failure
- Line 360: Log message when creating secrets
- Line 374: Error message for access key rotation failure
- Line 575: Error log when deleting IAM access keys

This prevents potential nil pointer panics and follows AWS SDK v2 best practices.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Check for EntityAlreadyExistsException BEFORE the generic error code
switch statement, not in an else if afterwards. This prevents the
switch's default case from catching and returning before the typed
exception check can be reached.

The typed exception is checked first for robust error handling, with
a fallback case in the switch for generic error code compatibility.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@BATMAN-JD
Copy link
Contributor Author

/test integration-test

@BATMAN-JD
Copy link
Contributor Author

PR is ready again. Fixed the integration test failures that were causing all of these reruns #934 We should be good to move forward now.

Copy link
Contributor

@AlexSmithGH AlexSmithGH left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good! Few minor comments on some patterns we should change.

This commit addresses all code review feedback from AlexSmithGH:

- Source error codes from AWS SDK v2 packages instead of hardcoded strings
  - account_controller_test.go: Use ErrorCode() method on exception types
    to get canonical error strings from organizationstypes package
  - All error code constants now sourced via ErrorCode() method

- Replace hardcoded status strings with package constants
  - account_controller.go: Use organizationstypes constants for
    CreateAccountState (FAILED, IN_PROGRESS) and CreateAccountFailureReason
  - region_enablement.go: Use accounttypes constants for RegionOptStatus
  - service_quota.go: Use servicequotastypes constants for RequestStatus

- Remove redundant error handling fallback cases
  - iam.go: Remove EntityAlreadyExists fallback (handled by typed check)
  - organizational_units.go: Remove AccountNotFoundException and
    ConcurrentModificationException fallback cases (handled by typed checks)

- Simplify error handling patterns
  - byoc.go: Convert if-else chain to switch statement for error handling
  - awsfederatedaccountaccess_controller.go: Remove unnecessary
    smithy.APIError checks, use only typed EntityAlreadyExistsException

- Inline constants where appropriate
  - sts_assume_role.go: Use aws.Int32(3600) directly instead of variable
  - Add missing aws import

- Fix test mock errors
  - organizational_units_test.go: Use typed organizationstypes exceptions
    instead of GenericAPIError in test mocks

All unit tests (42/42) and integration tests (5/5) passing.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@BATMAN-JD
Copy link
Contributor Author

Looks like we are having issues with Quay right now. Will run these again later

@BATMAN-JD
Copy link
Contributor Author

/test all

@BATMAN-JD BATMAN-JD force-pushed the upgrade-aws-sdk-v2 branch 2 times, most recently from 3a81748 to 7ce3135 Compare February 19, 2026 17:33
Update UBI minimal base image to latest version. Keep boilerplate
at v8.3.2 to match available images in CI infrastructure.

Changes:
- UBI minimal: 9.7-1770267347 -> 9.7-1771346502
- Boilerplate remains at v8.3.2

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@BATMAN-JD
Copy link
Contributor Author

/test integration-test

@BATMAN-JD
Copy link
Contributor Author

integration test failed from no cluster in pool ready to claim. Will retry later

@BATMAN-JD
Copy link
Contributor Author

/test integration-test

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Feb 19, 2026

@BATMAN-JD: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants