Fix issue with reconciliation of resources using UseAsync=true #562

fabiencastarede · 2025-12-19T11:22:54Z

Description

This PR fixes a critical issue #561 where resources configured with UseAsync=true create duplicate cloud resources after provider pod restarts or Kubernetes cluster backup/restore operations (e.g., Velero).

Problem

Resources with UseAsync = true store Terraform workspace state in ephemeral pod storage (/tmp/<workspace-id>/). When the provider pod restarts:

Workspace state is lost
Provider treats the resource as not existing in Terraform state
Creates a duplicate resource in the cloud provider
Updates external-name with new resource ID, orphaning the original resource

Reproduction

# 1. Create an async resource
apiVersion: kube.ovh.example.io/v1alpha1
kind: Cluster
metadata:
  name: test-cluster
spec:
  forProvider:
    serviceName: "my-project"
    region: "EU-WEST-PAR"

# 2. Wait for creation
kubectl wait --for=condition=Ready cluster/test-cluster

# 3. Note the external-name (resource ID)
kubectl get cluster test-cluster -o jsonpath='{.metadata.annotations.crossplane\.io/external-name}'
# Output: abc123-original-id

# 4. Restart provider pod
kubectl delete pod -n crossplane-system -l pkg.crossplane.io/provider=provider-ovh

# 5. Bug: external-name changes, duplicate created
kubectl get cluster test-cluster -o jsonpath='{.metadata.annotations.crossplane\.io/external-name}'
# Output: xyz789-new-duplicate-id
# Result: Two clusters now exist in cloud provider!

Solution

When an async resource has the external-create-succeeded annotation (indicating prior successful creation) but workspace state is missing, use Import instead of Refresh to reconstruct state directly from the cloud provider API.

Code Changes

File: pkg/controller/external.go

Added logic in the Observe() function before calling Refresh():

// For async resources that were previously created, use Import instead
// of Refresh if the resource has been successfully created before.
// This prevents duplicate resource creation after provider pod restarts
// when the ephemeral workspace state in /tmp is lost.
// The external-create-succeeded annotation persists in Kubernetes and
// indicates the resource was successfully created or imported previously.
if e.config.UseAsync && meta.GetExternalName(tr) != "" {
    annotations := tr.GetAnnotations()
    if _, hasCreateSucceeded := annotations["crossplane.io/external-create-succeeded"]; hasCreateSucceeded {
        e.logger.Debug("Using Import instead of Refresh for async resource with external-create-succeeded annotation",
            "external-name", meta.GetExternalName(tr))
        return e.Import(ctx, tr)
    }
    e.logger.Debug("Async resource missing external-create-succeeded annotation, using Refresh",
        "external-name", meta.GetExternalName(tr),
        "annotations", annotations)
} else {
    e.logger.Debug("Not using Import fallback",
        "useAsync", e.config.UseAsync,
        "externalName", meta.GetExternalName(tr))
}

Required import added:

"github.com/crossplane/crossplane-runtime/v2/pkg/meta"

How It Works

Detection: Checks if resource is async AND has external-create-succeeded annotation
Import: Uses Import() to reconstruct Terraform state from cloud provider API using the external-name as resource ID
State Reconstruction: Import queries cloud provider API and rebuilds the tfstate file
Normal Flow: After Import succeeds, reconciliation continues normally with proper state

Why This Works

✅ external-name annotation persists in Kubernetes (not lost on pod restart)
✅ external-create-succeeded annotation persists in Kubernetes (backed up by Velero)
✅ Import reconstructs state directly from cloud provider API
✅ No duplicate resources created
✅ No manual intervention required
✅ Works with Velero backup/restore
✅ Minimal code change, low risk
✅ Only affects async resources with confirmed prior creation

Testing

Tested successfully with provider-ovh managing OVH Managed Kubernetes Clusters (a resource with UseAsync = true).

Test Scenarios

Scenario	Steps	Result
Provider pod restart	1. Create async resource 2. Verify creation succeeded 3. Delete provider pod 4. Wait for pod restart 5. Check for duplicates	✅ No duplicate created ✅ external-name unchanged ✅ Resource synced
Velero backup/restore	1. Create async resource 2. Backup with Velero 3. Reset Kubernetes cluster 4. Restore with Velero 5. Check for duplicates	✅ No duplicate created ✅ external-name preserved ✅ Resource synced with existing cloud resource
Node failure	1. Create async resource on node A 2. Drain/cordon node A 3. Pod reschedules to node B 4. Check for duplicates	✅ No duplicate created ✅ Resource remains synced

Debug Logs Confirming Fix

After applying the fix, provider logs show:

Using Import instead of Refresh for async resource with external-create-succeeded annotation external-name=abc123-original-id
Successfully imported resource into Terraform state
Resource observation completed, upToDate=true

Impact

Severity: Critical bug fix for async resources
Scope: Only affects resources with UseAsync = true that have been previously created
Risk: Low - only changes behavior when workspace state is missing AND resource has external-create-succeeded annotation
Breaking Changes: None
Performance: Minimal - Import is only used when state is missing (pod restart scenario)

Alternative Solutions Considered

1. PersistentVolume for Terraform Workspaces

Rejected: Adds infrastructure complexity, requires storage provisioning, doesn't scale well with multiple provider instances

2. Store tfstate in Kubernetes Secrets

Rejected: Large state files could exceed secret size limits (1MB), performance concerns with frequent updates

3. Disable UseAsync

Rejected: Removes async operation tracking capability, breaks long-running operations

4. Velero Filesystem Backup

Rejected: Only solves Velero restore case, doesn't help with pod restarts or node failures

5. Pre-Create Existence Check

Partially Rejected: Doesn't handle all edge cases, Import is more robust and already well-tested

Additional Notes

Provider Implementation Considerations

When using this fix, ensure your provider's GetIDFn configurations handle empty externalName values correctly during initial resource creation:

GetIDFn: func(ctx context.Context, externalName string, parameters map[string]any, providerConfig map[string]any) (string, error) {
    // Return empty string if external-name is not set yet (resource being created)
    if externalName == "" {
        return "", nil
    }
    // ... construct composite ID for Import
    return fmt.Sprintf("%s/%s", serviceName, externalName), nil
}

This prevents incomplete IDs (e.g., service_name/ instead of service_name/resource_id) from being set in tfstate before resource creation completes.

Related Issues

This fix addresses duplicate resource creation issues that have been reported by multiple users of upjet-based providers, particularly for resources requiring long-running operations such as:

Managed Kubernetes clusters
RDS database instances
Virtual machines with custom images
Any resource with complex provisioning workflows

Checklist

Code follows the project's style guidelines
Self-review of code completed
Comments added to explain complex logic
Changes tested with real provider (provider-ovh)
All test scenarios passing (pod restart, Velero restore, node failure)
No breaking changes introduced
Logging added for debugging

References

Crossplane managed resources: https://docs.crossplane.io/latest/concepts/managed-resources/
Crossplane external-name annotation: https://docs.crossplane.io/latest/concepts/managed-resources/#external-name
Terraform Import: https://developer.hashicorp.com/terraform/cli/import
Upjet architecture: https://github.com/crossplane/upjet/blob/main/docs/architecture.md

When a provider pod restarts, the ephemeral Terraform workspace state stored in /tmp/<uid>/ is lost. For resources with UseAsync=true, this causes the Refresh operation to fail to detect existing resources, triggering duplicate resource creation. This fix detects async resources that were previously created by checking for the crossplane.io/external-create-succeeded annotation and uses Import instead of Refresh. Import reconstructs state directly from the cloud provider API, avoiding the duplicate creation issue. Signed-off-by: Fabien Castarède <fcastarede@waadoo.cloud>

Signed-off-by: Fabien Castarède <fcastarede@waadoo.cloud>

fabiencastarede requested review from erhancagirici, sergenyalcin and ulucinar as code owners December 19, 2025 11:22

fabiencastarede added 3 commits December 19, 2025 12:23

Add debug logging to async resource Import fallback logic

d14aeba

Signed-off-by: Fabien Castarède <fcastarede@waadoo.cloud>

Add missing meta package import

6797ec8

Signed-off-by: Fabien Castarède <fcastarede@waadoo.cloud>

fabiencastarede force-pushed the fix-async-res-reconcile branch from 981b4a3 to 6797ec8 Compare December 19, 2025 11:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix issue with reconciliation of resources using UseAsync=true #562

Fix issue with reconciliation of resources using UseAsync=true #562

fabiencastarede commented Dec 19, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Fix issue with reconciliation of resources using UseAsync=true #562

Are you sure you want to change the base?

Fix issue with reconciliation of resources using UseAsync=true #562

Conversation

fabiencastarede commented Dec 19, 2025

Description

Problem

Reproduction

Solution

Code Changes

How It Works

Why This Works

Testing

Test Scenarios

Debug Logs Confirming Fix

Impact

Alternative Solutions Considered

1. PersistentVolume for Terraform Workspaces

2. Store tfstate in Kubernetes Secrets

3. Disable UseAsync

4. Velero Filesystem Backup

5. Pre-Create Existence Check

Additional Notes

Provider Implementation Considerations

Related Issues

Checklist

References

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant