# Terraform Infrastructure as Code Troubleshooting Guide

## Problem Index

| Problem | Symptoms | Severity | Solution Summary |
|--------|----------|----------|------------------|
| State Lock Conflict | Error: `Error acquiring the state lock` | High | Wait for the running operation to finish, or force-unlock a stale lock with caution |
| Configuration Drift | `terraform plan` shows changes not made in code | Medium | Accept the change in configuration or re-apply to revert it; import resources created outside Terraform |
| Undefined Variable | Error: `Reference to undeclared input variable` | Medium | Declare a `variable` block, then set its value via `default`, `-var`, or a `.tfvars` file |
| Dependency Cycle Detected | Error: `Cycle: aws_instance.a, aws_security_group.b` | High | Make references one-way or move rules into separate resources (`depends_on` cannot break a cycle) |
| Provider Authentication Failure | Error: `No valid credential sources found` (AWS provider) | High | Supply valid credentials via environment variables, a CLI profile, or the provider block |

## Problem Details

### Problem 1: State Lock Conflict

**Symptoms**
- Error message: `Error acquiring the state lock`
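- Behavior: `terraform plan` or `terraform apply` fails immediately, or keeps retrying until the `-lock-timeout` period expires if one is set
- Context: Occurs when another Terraform operation currently holds the lock, or when a previous operation was interrupted or crashed and left a stale lock in the backend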

**Root Cause**
Terraform uses state locking to prevent concurrent modifications to infrastructure state. If an operation fails abruptly (e.g., process killed, network timeout), the lock may not be released automatically.

**Solution**
1. Identify the lock holder from the `Lock Info` block in the error output (`ID`, `Who`, `Operation`, `Created`). With the S3 + DynamoDB backend, the lock is also visible as an item in the DynamoDB lock table.
2. If you are certain no other operation is running, force unlock using:
   ```bash
   terraform force-unlock <LOCK_ID>
   ```
   Replace `<LOCK_ID>` with the ID shown in the error message
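3. To prevent recurrence, make sure only one operation runs against a given workspace at a time, for example by running `apply` only from a single CI pipeline. This applies to every remote backend, including S3 and OSS.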
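
Because `force-unlock` needs the exact lock ID, it can be convenient to extract it from saved error output rather than copying it by hand. The following is a minimal sketch, assuming the error contains a `Lock Info` block with an `ID:` line as Terraform 1.x prints it (sometimes prefixed with `│` box-drawing characters). `extract_lock_id` is a hypothetical helper, not a Terraform command.

```bash
#!/usr/bin/env bash

# extract_lock_id (hypothetical helper): reads Terraform error output on stdin
# and prints the token that follows the first "ID:" field, which is the lock ID
# in the "Lock Info" block of an "Error acquiring the state lock" message.
extract_lock_id() {
  awk '{
    for (i = 1; i < NF; i++) {
      if ($i == "ID:") { print $(i + 1); exit }
    }
  }'
}

# Example usage (force-unlock still asks for confirmation before acting):
#   terraform plan -no-color 2> plan-error.log
#   terraform force-unlock "$(extract_lock_id < plan-error.log)"
```

Splitting on whitespace instead of matching a fixed prefix keeps the helper working whether or not the diagnostic box characters are present. Check the `Who` and `Created` fields before unlocking.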

**Verification**
- Run `terraform plan` again — it should proceed without lock errors
- Confirm no other team members are actively applying changes to avoid race conditions

### Problem 2: Configuration Drift

**Symptoms**
- Error message: None directly, but `terraform plan` shows changes not made in code
- Behavior: Resources appear to have been modified outside Terraform (e.g., manually in the cloud console)
- Context: Common after manual interventions or external automation tools modify cloud resources

**Root Cause**
Terraform assumes full control of managed resources. Any out-of-band changes cause "drift" between actual infrastructure and Terraform’s state file.

**Solution**
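1. Review the drift. On Terraform 0.15.4 and later, `terraform plan -refresh-only` lists only the differences between state and the real infrastructure; a normal `terraform plan` shows what Terraform would change to match the configuration:
   ```bash
   terraform plan -refresh-only
   ```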
2. If the change is intentional and safe, update your configuration to match reality
3. If the change is accidental, revert it manually or let Terraform overwrite it during `apply`
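4. To bring a resource that was created outside Terraform under management, first add a matching `resource` block to the configuration, then import it into state: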
   ```bash
   terraform import <resource_address> <cloud_resource_id>
   ```

**Verification**
- After `terraform apply`, run `terraform plan` again — it should show "No changes"
- Use `terraform show` to confirm state matches expected configuration

### Problem 3: Undefined Variable

**Symptoms**
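- Error message: `Reference to undeclared input variable` (detail: `An input variable with the name "instance_type" has not been declared`)
- Behavior: `terraform validate`, `terraform plan`, or `terraform apply` fails while evaluating the configuration
- Context: Occurs when an expression references `var.<name>` but no matching `variable` block exists in the same module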

**Root Cause**
Terraform requires all variables used in expressions to be explicitly declared with a `variable` block, even if passed via command line or `tfvars`.

**Solution**
1. Declare the missing variable in a `.tf` file (e.g., `variables.tf`):
   ```hcl
   variable "instance_type" {
     description = "EC2 instance type"
     type        = string
     default     = "t3.micro"
   }
   ```
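2. Once the variable is declared, override its default at runtime if needed (passing `-var` for a variable with no `variable` block fails with `Value for undeclared variable`):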
   ```bash
   terraform plan -var="instance_type=t3.small"
   ```

**Verification**
- `terraform validate` returns success
- `terraform plan` completes without variable declaration errors

### Problem 4: Dependency Cycle Detected

**Symptoms**
- Error message: `Cycle: aws_instance.web, aws_security_group.allow_web`
- Behavior: `terraform plan` fails during graph construction
- Context: Happens when two or more resources reference each other in a circular manner

**Root Cause**
Terraform builds a dependency graph to determine execution order. Circular references make ordering impossible.

**Solution**
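1. Find the circular reference. A typical case is a security group rule that uses an attribute of an instance (for example, `aws_instance.web.private_ip`) while that same instance lists the security group in `vpc_security_group_ids`.
2. Keep dependencies one-way: the instance may reference the security group, but the security group must not reference the instance. `depends_on` cannot break a cycle; it only adds dependency edges.
   ```hcl
   resource "aws_security_group" "allow_web" {
     # Do not reference aws_instance.web.* here; that would recreate the cycle
     ingress {
       from_port   = 80
       to_port     = 80
       protocol    = "tcp"
       cidr_blocks = ["0.0.0.0/0"]
     }
   }

   resource "aws_instance" "web" {
     ami           = var.ami_id # assumes an "ami_id" variable is declared elsewhere
     instance_type = "t3.micro"
     # One-way dependency: the instance depends on the security group
     vpc_security_group_ids = [aws_security_group.allow_web.id]
   }
   ```
3. If a rule genuinely needs an instance attribute, move that rule into a standalone resource such as `aws_security_group_rule` or `aws_vpc_security_group_ingress_rule`, so the group and the instance are created first and the rule is created after both.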

**Verification**
- `terraform plan` completes successfully
- `terraform graph | dot -Tpng > graph.png` renders the dependency graph (requires Graphviz) for an optional visual check that no cycles remain

### Problem 5: Provider Authentication Failure

**Symptoms**
- Error message: `No valid credential sources found` (AWS provider); other providers, such as Alibaba Cloud, report their own authentication errors
- Behavior: `terraform init` succeeds, but `plan` or `apply` fails during provider initialization
- Context: Occurs when the cloud provider plugin cannot authenticate to the target platform

**Root Cause**
The Terraform provider lacks valid credentials. This may be due to missing environment variables, unset CLI profiles, or incorrect provider configuration.

**Solution**
1. Set credentials via environment variables (example for Alibaba Cloud):
   ```bash
   export ALICLOUD_ACCESS_KEY="your-access-key"
   export ALICLOUD_SECRET_KEY="your-secret-key"
   export ALICLOUD_REGION="cn-hangzhou"
   ```
2. Or configure credentials in the provider block:
   ```hcl
   provider "alicloud" {
     access_key = var.alicloud_access_key
     secret_key = var.alicloud_secret_key
     region     = var.region
   }
   ```
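3. Never hardcode keys in `.tf` files or commit them to version control. Declare credential variables with `sensitive = true` and supply their values through `TF_VAR_<name>` environment variables or a secrets manager.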
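
A preflight check can replace a vague provider error with an explicit message before Terraform runs. This is a minimal sketch, assuming credentials come from the Alibaba Cloud environment variables in step 1; `require_env` is a hypothetical helper, and the variable names should be adapted to your provider.

```bash
#!/usr/bin/env bash

# require_env (hypothetical helper): returns non-zero if any of the named
# environment variables is unset or empty. Only names are printed, never
# values, so secrets do not end up in CI logs.
require_env() {
  local name missing=0
  for name in "$@"; do
    # printenv sees only exported variables, i.e. what Terraform will inherit
    if [ -z "$(printenv "$name")" ]; then
      echo "missing required environment variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example usage:
#   require_env ALICLOUD_ACCESS_KEY ALICLOUD_SECRET_KEY ALICLOUD_REGION && terraform plan
```

Using `printenv` rather than plain shell expansion is a deliberate choice: a variable set without `export` is invisible to the Terraform process, so it is reported as missing here too.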

**Verification**
- `terraform plan` connects to the provider and lists pending changes
- No authentication-related errors appear in output

## FAQ

**Q: How do I enable debug logging in Terraform?**  
A: Set the `TF_LOG` environment variable to `TRACE`, `DEBUG`, `INFO`, `WARN`, or `ERROR` (`TRACE` is the most verbose). For example:  
```bash
export TF_LOG=DEBUG
terraform apply
```  
Logs are written to stderr by default. To write them to a file (`TF_LOG` must also be set):  
```bash
export TF_LOG_PATH=tf-debug.log
```
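
Exporting `TF_LOG` leaves debug logging on for every later command in the shell. A minimal sketch that scopes it to a single run instead, assuming a per-run timestamped log file is acceptable; `tf_debug` is a hypothetical wrapper name.

```bash
#!/usr/bin/env bash

# tf_debug (hypothetical wrapper): runs one Terraform command with DEBUG
# logging sent to a timestamped file. The variables are set only for that
# command, not exported to the rest of the shell session.
tf_debug() {
  local log_file rc
  log_file="tf-debug-$(date +%Y%m%d-%H%M%S).log"
  TF_LOG=DEBUG TF_LOG_PATH="$log_file" terraform "$@"
  rc=$?
  echo "Terraform debug log: $log_file" >&2
  return "$rc"
}

# Example usage:
#   tf_debug apply
```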

**Q: What permissions are required to run Terraform against a cloud environment?**  
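A: Terraform needs permissions to create, read (describe/list), update, and delete every resource it manages. It also needs access to state storage (e.g., `s3:ListBucket`, `s3:GetObject`, and `s3:PutObject` on the state bucket, or the OSS equivalents) and to the locking mechanism (e.g., `dynamodb:GetItem`, `dynamodb:PutItem`, and `dynamodb:DeleteItem` on the DynamoDB lock table). Grant only what these operations require (least privilege).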

**Q: How can I detect configuration drift automatically?**  
A: Run `terraform plan -detailed-exitcode` on a schedule in CI/CD against your main branch. If it reports changes even though the code has not changed, drift exists. `terraform refresh` is deprecated (since Terraform 0.15.4) in favor of `terraform plan -refresh-only` and `terraform apply -refresh-only`, and a normal `plan` already refreshes state before comparing. Dedicated drift-detection tools can also compare live resources against Terraform state.
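
`terraform plan -detailed-exitcode` makes this check scriptable: it exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. A minimal CI sketch built on those exit codes; `check_drift` and the `plan-output.txt` file name are hypothetical.

```bash
#!/usr/bin/env bash

# check_drift (hypothetical helper): runs a non-interactive plan and
# classifies the result using -detailed-exitcode:
#   0 = no changes, 1 = error, 2 = changes present (drift if the code is unchanged)
check_drift() {
  local rc
  terraform plan -detailed-exitcode -input=false -no-color > plan-output.txt
  rc=$?
  case "$rc" in
    0) echo "no drift detected" ;;
    2) echo "drift or pending changes detected; see plan-output.txt" ;;
    *) echo "terraform plan failed with exit code $rc" ;;
  esac
  return "$rc"
}

# Example CI step (run against the main branch so code changes are not mistaken for drift):
#   check_drift || exit 1
```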

**Q: Why does `terraform init` fail with module download errors?**  
A: This usually indicates network issues, an invalid module `source` address, or missing credentials for a private registry. Verify the module source (e.g., `source = "registry.terraform.io/..."`), confirm the machine can reach the source host, and authenticate to private registries with `terraform login <hostname>` or a `credentials` block in the CLI configuration file.

**Q: How do I roll back a failed `terraform apply`?**  
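A: Terraform has no transactional rollback. Changes are applied resource by resource, so anything completed before the failure stays in place, and re-running `terraform apply` attempts the remaining changes. To roll back, revert the configuration to the last known-good commit and apply it. Restoring an older state snapshot does not change real infrastructure; reserve it for recovering a corrupted or lost state file (for example, from a versioned S3 state bucket). Always back up state before major changes, and never commit `.tfstate` files to version control, since they can contain secrets.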
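
A pre-change backup can be scripted around `terraform state pull`, which prints the current state from whichever backend is configured. A minimal sketch; `backup_state` and the `state-backups` directory are hypothetical, and because state can contain secrets, the backup directory should be access-controlled and excluded from version control.

```bash
#!/usr/bin/env bash

# backup_state (hypothetical helper): writes the current state to a
# timestamped file and prints its path. Fails, and removes the partial
# file, if the pull fails or returns nothing.
backup_state() {
  local backup_dir backup_file
  backup_dir="${1:-state-backups}"
  mkdir -p "$backup_dir" || return 1
  backup_file="$backup_dir/terraform-$(date +%Y%m%d-%H%M%S).tfstate"
  if ! terraform state pull > "$backup_file"; then
    echo "terraform state pull failed" >&2
    rm -f "$backup_file"
    return 1
  fi
  if [ ! -s "$backup_file" ]; then
    echo "state is empty; no backup written" >&2
    rm -f "$backup_file"
    return 1
  fi
  echo "$backup_file"
}

# Example usage before a major change:
#   backup_state && terraform apply
```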