Fix issue with reconciliation of resources using UseAsync=true #562
+18
−0
Description
This PR fixes a critical issue, #561, where resources configured with UseAsync=true create duplicate cloud resources after provider pod restarts or Kubernetes cluster backup/restore operations (e.g., Velero).
Problem
Resources with UseAsync = true store Terraform workspace state in ephemeral pod storage (/tmp/<workspace-id>/). When the provider pod restarts:
external-name with new resource ID, orphaning the original resource
Reproduction
Solution
When an async resource has the external-create-succeeded annotation (indicating prior successful creation) but workspace state is missing, use Import instead of Refresh to reconstruct state directly from the cloud provider API.
Code Changes
File: pkg/controller/external.go
Added logic in the Observe() function before calling Refresh():
Required import added:
"github.com/crossplane/crossplane-runtime/v2/pkg/meta"
How It Works
external-create-succeeded annotation
Import() to reconstruct Terraform state from the cloud provider API using the external-name as resource ID
Why This Works
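The choice between Import and Refresh described above can be sketched as a standalone Go function. This is an illustration only, not the code added to pkg/controller/external.go: the Operation type, the chooseObserveOperation helper, and the workspaceStateExists flag are hypothetical names introduced here, and the annotation keys are assumed to be the ones crossplane-runtime uses for external-name and external-create-succeeded.

```go
package main

// Annotation keys, assumed to match those defined by crossplane-runtime.
const (
	annotationExternalName            = "crossplane.io/external-name"
	annotationExternalCreateSucceeded = "crossplane.io/external-create-succeeded"
)

// Operation is a hypothetical label for the call Observe would make next.
type Operation string

const (
	OpRefresh Operation = "Refresh"
	OpImport  Operation = "Import"
)

// chooseObserveOperation is a hypothetical helper, not part of upjet.
// It returns OpImport only when all of the following hold:
//   - the resource is configured with UseAsync = true,
//   - the external-create-succeeded annotation is present,
//   - an external-name is available to use as the resource ID,
//   - the local Terraform workspace state is missing.
//
// In every other case it falls back to OpRefresh.
func chooseObserveOperation(useAsync bool, annotations map[string]string, workspaceStateExists bool) Operation {
	_, created := annotations[annotationExternalCreateSucceeded]
	externalName := annotations[annotationExternalName]

	if useAsync && created && externalName != "" && !workspaceStateExists {
		return OpImport
	}
	return OpRefresh
}
```

With both annotations present and no workspace state, the helper selects Import. A resource whose workspace state still exists, or one that was never created, keeps following the Refresh path.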
external-name annotation persists in Kubernetes (not lost on pod restart)
external-create-succeeded annotation persists in Kubernetes (backed up by Velero)
Testing
Tested successfully with provider-ovh managing OVH Managed Kubernetes Clusters (a resource with UseAsync = true).
Test Scenarios
2. Verify creation succeeded
3. Delete provider pod
4. Wait for pod restart
5. Check for duplicates
✅ external-name unchanged
✅ Resource synced
2. Backup with Velero
3. Reset Kubernetes cluster
4. Restore with Velero
5. Check for duplicates
✅ external-name preserved
✅ Resource synced with existing cloud resource
2. Drain/cordon node A
3. Pod reschedules to node B
4. Check for duplicates
✅ Resource remains synced
Debug Logs Confirming Fix
After applying the fix, provider logs show:
Impact
UseAsync = true that have been previously created
Alternative Solutions Considered
1. PersistentVolume for Terraform Workspaces
Rejected: Adds infrastructure complexity, requires storage provisioning, doesn't scale well with multiple provider instances
2. Store tfstate in Kubernetes Secrets
Rejected: Large state files could exceed the Kubernetes Secret size limit (1 MiB), and frequent updates raise performance concerns
3. Disable UseAsync
Rejected: Removes async operation tracking capability, breaks long-running operations
4. Velero Filesystem Backup
Rejected: Only solves Velero restore case, doesn't help with pod restarts or node failures
5. Pre-Create Existence Check
Partially Rejected: Doesn't handle all edge cases, Import is more robust and already well-tested
Additional Notes
Provider Implementation Considerations
When using this fix, ensure your provider's GetIDFn configurations handle empty externalName values correctly during initial resource creation. This prevents incomplete IDs (e.g., service_name/ instead of service_name/resource_id) from being set in tfstate before resource creation completes.
Related Issues
This fix addresses duplicate resource creation issues that have been reported by multiple users of upjet-based providers, particularly for resources requiring long-running operations such as:
Checklist
References