# ECS Network Troubleshooting Guide

## Problem Index

| Problem | Symptom | Severity | Solution Summary |
|---------|---------|----------|------------------|
| External Ping Failure on Windows | `ping` returns "General failure" | High | Configure default gateway, reset TCP/IP stack, disable interfering security software |
| Auxiliary Private IP Breaks Public Access | Cannot access internet after adding auxiliary private IP on Windows | High | Use Netsh with `skipassource=true` to prevent auxiliary IP from being used as source |
| rp_filter Causes Packet Loss | Network connectivity issues or dropped packets in multi-NIC setups | High | Set `rp_filter=2` (loose mode) in `/etc/sysctl.conf` and apply with `sysctl -p` |
| Public IP Unreachable | Cannot ping or connect to ECS instance public IP | High | Verify instance state, security group rules, firewall settings, and bandwidth usage |
| Port Accessible but Service Unreachable | Instance pingable but specific ports time out or refuse connection | Medium | Check security group inbound rules, OS firewall, and whether service is listening on correct interface |
| IPv6 Connectivity Failure | Cannot ping or reach ECS instance via IPv6 address | Medium | Enable IPv6 in VPC, configure security group rules for ICMP-IPv6, verify gateway and routing |
| GRE Tunnel Database Timeout | Database queries fail over self-built GRE tunnel with large result sets | Medium | Use iptables to clamp TCP MSS to PMTU with `--clamp-mss-to-pmtu` |
| Insufficient Network Buffer Size | Remote connection fails due to `net.core.optmem_max` too low | Medium | Increase `net.core.optmem_max` via sysctl and persist in `/etc/sysctl.conf` |
| Softnet Backlog Overflow | High packet drop rate shown in `/proc/net/softnet_stat` | High | Increase `net.core.netdev_max_backlog` and enable RPS for better CPU distribution |
| TCP Socket Buffer Overrun | `packet pruned from receive queue because of socket buffer overrun` in `netstat -s` | Medium | Increase `net.core.rmem_max` and `net.ipv4.tcp_rmem`, or optimize application read rate |

## Problem Details

### Problem 1: External Ping Failure on Windows ("General Failure")

**Symptoms**
- Error message: `General failure`
- Behavior: `ping` command to external addresses (e.g., `8.8.8.8`) fails immediately with "General failure"
- Context: Occurs on Windows ECS instances, often after security software installation or network configuration changes

**Root Cause**
- Third-party antivirus or firewall software may block ICMP traffic
- Missing or incorrect default gateway configuration
- Corrupted TCP/IP stack or Winsock catalog
- Local Security Policy or Routing and Remote Access service interference

**Solution**
1. Verify default gateway is configured:
   ```powershell
   ipconfig /all
   ```
2. If missing, add persistent default route:
   ```powershell
   route -p add 0.0.0.0 mask 0.0.0.0 <default_gateway>
   ```
3. Reset TCP/IP stack and Winsock:
   ```powershell
   netsh int ip reset
   netsh winsock reset
   ```
4. Reboot the instance
5. Temporarily disable third-party security software to test

**Verification**
- After reboot, run:
  ```powershell
  ping 8.8.8.8
  ```
- Expected: Successful replies with round-trip times

### Problem 2: Auxiliary Private IP Breaks Public Access on Windows

**Symptoms**
- Error message: none specific; the public network is simply unreachable
- Behavior: After configuring an auxiliary private IP, outbound internet access fails
- Context: Windows Server 2008+ with multiple private IPs on primary NIC

**Root Cause**
- Windows selects the source IP based on longest prefix match with next-hop gateway
- Auxiliary private IP may have longer prefix match than primary IP, causing it to be used as source
- Since auxiliary IPs are not associated with public NAT/EIP, outbound traffic fails

**Solution**
1. Remove existing auxiliary IP via GUI or PowerShell
2. Re-add auxiliary IP with `skipassource=true`:
   ```powershell
   Netsh int ipv4 add address 'Ethernet' 192.168.1.252 255.255.255.0 skipassource=true
   ```
   (Replace `'Ethernet'` with actual adapter name and IP/subnet as needed)

**Verification**
- Test internet access:
  ```powershell
  ping www.aliyun.com
  ```
- Confirm source IP used is the primary private IP (not auxiliary) via packet capture or logging

### Problem 3: rp_filter Causes Packet Loss

**Symptoms**
- Error message: none by default; with `net.ipv4.conf.all.log_martians=1`, the kernel logs `martian source` entries for packets it drops
- Behavior: Intermittent connectivity, high packet loss, or complete failure in multi-NIC or load-balanced environments
- Context: Linux instances with multiple network interfaces or using LVS-DR mode

**Root Cause**
- `rp_filter` (Reverse Path Filter) in strict mode (`=1`) drops packets if incoming interface doesn't match route back to source
- Asymmetric routing (common in multi-homed setups) triggers this behavior
- Default setting may be too restrictive for cloud networking topologies

**Solution**
1. Temporarily set loose mode for all interfaces:
   ```bash
   sudo sysctl -w net.ipv4.conf.all.rp_filter=2
   sudo sysctl -w net.ipv4.conf.default.rp_filter=2
   ```
2. Make permanent by editing `/etc/sysctl.conf`:
   ```bash
   net.ipv4.conf.all.rp_filter = 2
   net.ipv4.conf.default.rp_filter = 2
   ```
3. Apply changes:
   ```bash
   sudo sysctl -p
   ```
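The kernel applies the larger of the `conf/all` and per-interface `rp_filter` values when it validates a packet's source, which is why setting `all` to 2 is enough to put every interface in loose mode. The sketch below reports the effective value per interface; the function names `effective_rp_filter` and `report_rp_filter` are illustrative and not part of any standard tool.

```bash
# Effective rp_filter for one interface: the larger of the two values wins.
# $1 = conf/all value, $2 = per-interface value
effective_rp_filter() {
    if [ "$1" -gt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Print the effective value for every interface on this host.
report_rp_filter() {
    all=$(cat /proc/sys/net/ipv4/conf/all/rp_filter)
    for dir in /proc/sys/net/ipv4/conf/*/; do
        iface=$(basename "$dir")
        case "$iface" in all|default) continue ;; esac
        echo "$iface: $(effective_rp_filter "$all" "$(cat "${dir}rp_filter")")"
    done
}
```

Running `report_rp_filter` after the change should show 2 for every interface; a value of 1 means strict mode is still in effect there.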

**Verification**
- Monitor packet loss:
  ```bash
  ping -c 100 <target>
  ```
- Check whether the reverse path filter is still dropping packets:
  ```bash
  nstat -az | grep -i IPReversePathFilter
  ```
- Expected: The `TcpExtIPReversePathFilter` counter stops increasing and ping results stay stable

### Problem 4: Public IP Unreachable

**Symptoms**
- Behavior: Cannot ping or establish TCP connections to ECS instance public IP
- Context: After instance creation, security group changes, or during suspected attacks

**Root Cause**
- Instance not in `Running` state
- Missing or restrictive security group inbound rules (ICMP/TCP)
- Instance OS firewall blocking traffic
- Public bandwidth exhausted or instance in blackhole due to attack
- Network ACL on the vSwitch blocking traffic

**Solution**
1. Confirm instance status is `Running` in ECS console
2. Verify public IP assignment (EIP or assigned public IP)
3. Check security group inbound rules allow:
   - Protocol: `ICMP` for ping
   - Protocol: `TCP` on required ports (e.g., 22, 80, 443)
   - Source: Your IP or appropriate CIDR (avoid `0.0.0.0/0` in production)
4. Log in via VNC and check OS firewall:
   - Linux: `firewall-cmd --list-all` or `ufw status`
   - Windows: Windows Defender Firewall with Advanced Security
5. Check CloudMonitor for public bandwidth usage; sustained usage at the purchased bandwidth cap (for example, 1024 Kbit/s on a 1 Mbit/s plan) means the link is saturated
6. Review Security Center for attack alerts or blackhole notifications

**Verification**
- From external client:
  ```bash
  ping <public_ip>
  telnet <public_ip> <port>
  ```
- Expected: Ping replies and successful TCP handshake

### Problem 5: Port Accessible but Service Unreachable

**Symptoms**
- Error message: `Connection timed out` or `Connection refused`
- Behavior: `ping` succeeds but `telnet <ip> <port>` fails
- Context: After deploying web server, database, or custom application

**Root Cause**
- `Connection timed out`: Traffic blocked by security group or OS firewall (no response)
- `Connection refused`: Service not running or not listening on expected interface/port
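The distinction above can be turned into a first triage step. This is a minimal sketch that only inspects the client-side error text; `classify_connect_error` is a hypothetical helper name, and the hints simply restate the two root causes.

```bash
# Map a client error message to the layer that most likely caused it.
# $1 = error text printed by the client (telnet, curl, ssh, ...)
classify_connect_error() {
    case "$1" in
        *"timed out"*) echo "filtered: check security group, network ACL, and OS firewall" ;;
        *"refused"*)   echo "closed: check that the service is running and bound to the right address" ;;
        *)             echo "unknown: inspect the full client output" ;;
    esac
}
```

For example, `classify_connect_error "connect to host 203.0.113.10 port 22: Connection refused"` points at the service rather than at a firewall.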

**Solution**
1. Verify security group allows inbound on target port
2. Log in to instance and check service status:
   ```bash
   systemctl status <service>
   ```
3. Confirm service listens on correct interface:
   ```bash
   netstat -an | grep <PORT>
   # Should show 0.0.0.0:<PORT> or :::<PORT>, not 127.0.0.1:<PORT>
   ```
4. Check OS firewall:
   - For firewalld:
     ```bash
     firewall-cmd --zone=public --add-port=<PORT>/tcp --permanent
     firewall-cmd --reload
     ```
   - For ufw:
     ```bash
     sudo ufw allow <PORT>/tcp
     sudo ufw reload
     ```
   - For Windows: Enable the corresponding inbound rule in Windows Defender Firewall with Advanced Security
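The listening-interface check can be scripted. The sketch below assumes the output format of `ss -lnt` (state in column 1, local `address:port` in column 4) and reports whether a port is bound only to loopback; `listen_scope` is an illustrative name.

```bash
# Classify how a TCP port is bound, based on `ss -lnt` output on stdin.
# $1 = port number
listen_scope() {
    awk -v port="$1" '
        $1 == "LISTEN" {
            addr = $4
            # Split at the last colon so IPv6 addresses such as [::]:22 stay intact.
            i = length(addr)
            while (i > 0 && substr(addr, i, 1) != ":") i--
            if (substr(addr, i + 1) != port) next
            host = substr(addr, 1, i - 1)
            found = 1
            if (host !~ /^127\./ && host != "[::1]") wide = 1
        }
        END {
            if (!found)    print "not-listening"
            else if (wide) print "non-loopback"
            else           print "loopback-only"
        }'
}
```

A typical call is `ss -lnt | listen_scope 3306`. A result of `loopback-only` means remote clients get `Connection refused` even when the security group and firewall allow the port.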

**Verification**
- From external client:
  ```bash
  telnet <public_ip> <port>
  ```
- Expected: Connection established (blank screen for telnet)

### Problem 6: IPv6 Connectivity Failure

**Symptoms**
- Behavior: Cannot ping or connect to ECS instance via IPv6 address
- Context: After enabling IPv6 in VPC and assigning IPv6 address to instance

**Root Cause**
- IPv6 not enabled at VPC or vSwitch level
- Missing security group rules for ICMPv6 or target protocol
- No IPv6 default route configured in instance
- Instance OS not configured to use IPv6 gateway

**Solution**
1. In ECS console, ensure IPv6 is enabled for VPC and vSwitch
2. Add security group inbound rule:
   - Protocol: `All ICMP-IPv6`
   - Source: Your IPv6 CIDR or `::/0` (temporarily for testing)
3. Log in via VNC and verify IPv6 configuration:
   ```bash
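   # Look up the primary NIC's MAC address first; the paths below need it
   MAC=$(curl -s http://100.100.100.200/latest/meta-data/mac)
   curl http://100.100.100.200/latest/meta-data/network/interfaces/macs/${MAC}/ipv6s
   curl http://100.100.100.200/latest/meta-data/network/interfaces/macs/${MAC}/ipv6-gateway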
   ```
4. Check IPv6 routing:
   ```bash
   ip -6 route show
   # Should include default route via IPv6 gateway
   ```
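The routing check can be automated by extracting the gateway from the route table. A minimal sketch, assuming the usual `default via <gateway> dev <iface> ...` line format of `ip -6 route show`; `ipv6_default_gateway` is an illustrative name.

```bash
# Print the IPv6 default gateway from `ip -6 route show` output on stdin.
# Exits non-zero when no "default via ..." route is present.
ipv6_default_gateway() {
    awk '$1 == "default" && $2 == "via" { print $3; found = 1; exit }
         END { exit !found }'
}
```

Used as `ip -6 route show | ipv6_default_gateway`, the printed address can be compared with the `ipv6-gateway` value from the metadata service; a non-zero exit status means the default route is missing.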

**Verification**
- From external IPv6-capable client:
  ```bash
  ping6 <ecs_ipv6_address>
  ```
- Expected: Successful replies

### Problem 7: GRE Tunnel Database Timeout

**Symptoms**
- Error message: `Connection Timeout`
- Behavior: Large database queries fail over self-built GRE tunnel, small queries succeed
- Context: ECS connected to remote database via GRE tunnel

**Root Cause**
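- GRE encapsulation adds 24 bytes (a 20-byte outer IPv4 header plus a 4-byte GRE header), so the tunnel MTU is smaller than the path MTU (1476 bytes on a 1500-byte path)
- Full-size TCP segments carrying large result sets exceed the tunnel MTU; with the DF (Don't Fragment) bit set they cannot be fragmented and are dropped
- If the ICMP "fragmentation needed" replies are filtered along the path, the sender never reduces its segment size
- TCP retransmissions eventually time out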

**Solution**
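1. On the tunnel endpoint (ECS instance), clamp the TCP MSS for forwarded traffic (use the `OUTPUT` chain instead of `FORWARD` for connections that originate on the instance itself):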
   ```bash
   sudo iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu
   ```
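2. Alternatively, set an explicit MSS value (e.g., 1436 for GRE over a 1500-byte path: 1500 minus 24 bytes of GRE overhead minus 40 bytes of IP and TCP headers); replace `gre0` with the actual tunnel interface name:
   ```bash
   sudo iptables -t mangle -A POSTROUTING -o gre0 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1436
   ```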
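An explicit MSS for a GRE tunnel follows from header arithmetic. A minimal sketch, assuming a basic 4-byte GRE header without key, checksum, or sequence options, a 20-byte outer IPv4 header, and 40 bytes of inner IPv4 and TCP headers without options; `gre_tcp_mss` is an illustrative helper name.

```bash
# Largest TCP MSS that fits through a basic GRE-over-IPv4 tunnel.
# $1 = MTU of the underlying path in bytes
gre_tcp_mss() {
    path_mtu=$1
    gre_overhead=24     # 20-byte outer IPv4 header + 4-byte GRE header
    tcpip_headers=40    # 20-byte inner IPv4 header + 20-byte TCP header
    echo $(( path_mtu - gre_overhead - tcpip_headers ))
}
```

`gre_tcp_mss 1500` prints `1436`. Tunnels that use GRE keys or run over links with a smaller MTU need a correspondingly lower value.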

**Verification**
- Run large query again
- Expected: Query completes successfully without timeout

### Problem 8: Insufficient Network Buffer Size

**Symptoms**
- Error message: `ConnectionFailed`
- Behavior: SSH or other remote connections fail intermittently
- Context: High-concurrency or high-bandwidth workloads on Linux instances

**Root Cause**
- `net.core.optmem_max` (the maximum ancillary buffer size allowed per socket) is set too low
- Under heavy load, buffers fill up, causing connection failures
- Default value insufficient for modern workloads

**Solution**
1. Connect via VNC (since SSH may be unavailable)
2. Check current value:
   ```bash
   sudo sysctl net.core.optmem_max
   ```
3. Increase value (e.g., to 65536):
   ```bash
   echo "net.core.optmem_max = 65536" | sudo tee -a /etc/sysctl.conf
   ```
4. Apply immediately:
   ```bash
   sudo sysctl -p
   ```
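Appending to `/etc/sysctl.conf` repeatedly leaves duplicate entries, and the last one silently wins. The sketch below replaces an existing assignment instead of stacking a new one; `set_sysctl_conf` is a hypothetical helper, and it assumes plain `key = value` lines.

```bash
# Set "key = value" in a sysctl-style file, removing any earlier assignment.
# $1 = file, $2 = key, $3 = value
set_sysctl_conf() {
    file=$1 key=$2 value=$3
    [ -f "$file" ] || : > "$file"
    tmp=$(mktemp) || return 1
    # Keep every line except existing assignments of exactly this key.
    awk -v k="$key" '
        { line = $0; sub(/^[ \t]+/, "", line) }
        index(line, k) == 1 && substr(line, length(k) + 1) ~ /^[ \t]*=/ { next }
        { print }
    ' "$file" > "$tmp" || { rm -f "$tmp"; return 1; }
    printf '%s = %s\n' "$key" "$value" >> "$tmp"
    # Write back through the original file so its owner and mode are preserved.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

Run as root, `set_sysctl_conf /etc/sysctl.conf net.core.optmem_max 65536` followed by `sysctl -p` has the same effect as the steps above and can be repeated safely.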

**Verification**
- Attempt SSH connection again
- Expected: Stable connection establishment

### Problem 9: Softnet Backlog Overflow

**Symptoms**
- Behavior: High packet drop rate under network load
- Context: High-throughput applications (e.g., video streaming, large file transfers)

**Root Cause**
- `net.core.netdev_max_backlog` too low for incoming packet rate
- CPU cannot process packets fast enough, causing queue overflow
- Drops recorded in `/proc/net/softnet_stat`
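The counters in `/proc/net/softnet_stat` are hexadecimal, one row per CPU, which makes them hard to read at a glance. The sketch below converts the first three columns (processed, dropped, time squeezed) to decimal; `softnet_summary` is an illustrative name, and it reads from stdin so it can be tried on saved samples.

```bash
# Convert softnet_stat rows (hex, one per CPU) into readable decimal counters.
softnet_summary() {
    cpu=0
    while read -r processed dropped squeezed _rest; do
        echo "CPU$cpu processed=$((0x$processed)) dropped=$((0x$dropped)) squeezed=$((0x$squeezed))"
        cpu=$((cpu + 1))
    done
}
```

On a live system, run `softnet_summary < /proc/net/softnet_stat`. A `dropped` value that keeps growing on one CPU suggests the backlog limit or the CPU distribution is the bottleneck.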

**Solution**
1. Monitor drops:
   ```bash
   # Counters are hexadecimal; column 2 is dropped packets, column 3 is time squeezed
   watch -d 'awk "{print \"CPU\" (NR-1) \": dropped=\" \$2 \", squeezed=\" \$3}" /proc/net/softnet_stat'
   ```
2. Increase backlog size:
   ```bash
   sudo sysctl -w net.core.netdev_max_backlog=5000
   echo "net.core.netdev_max_backlog = 5000" | sudo tee -a /etc/sysctl.conf
   ```
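3. Enable RPS to distribute processing across CPUs (the hexadecimal mask `ff` selects CPUs 0-7; adjust the mask and `eth0` to match the instance):
   ```bash
   echo ff | sudo tee /sys/class/net/eth0/queues/rx-0/rps_cpus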
   ```
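The value written to `rps_cpus` is a hexadecimal bitmask with one bit per CPU. A minimal sketch for building a mask that covers the first N CPUs, assuming N is at most 32 (larger systems use comma-separated 32-bit groups, which this does not handle); `rps_mask` is an illustrative name.

```bash
# Hex bitmask selecting CPUs 0 .. N-1 for rps_cpus.
# $1 = number of CPUs to include (1-32)
rps_mask() {
    printf '%x\n' $(( (1 << $1) - 1 ))
}
```

For an 8-vCPU instance, `rps_mask 8` prints `ff`; `rps_mask "$(nproc)"` builds a mask for all CPUs on the instance.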

**Verification**
- Under load, observe `dropped` counter in `softnet_stat`
- Expected: Drop count stabilizes or decreases significantly

### Problem 10: TCP Socket Buffer Overrun

**Symptoms**
- Error message: `packet pruned from receive queue because of socket buffer overrun`
- Behavior: Throughput drops and the retransmission rate rises because pruned segments must be resent
- Context: High-speed networks with bursty traffic

**Root Cause**
- Application reads from socket slower than data arrives
- Receive buffer fills up, kernel drops new packets
- Auto-tuning insufficient for workload pattern

**Solution**
1. Diagnose with:
   ```bash
   netstat -s | grep -i "pruned"
   ```
2. Increase max buffer size:
   ```bash
   echo "net.core.rmem_max = 134217728" | sudo tee -a /etc/sysctl.conf
   echo "net.ipv4.tcp_rmem = 4096 87380 134217728" | sudo tee -a /etc/sysctl.conf
   ```
3. Apply:
   ```bash
   sudo sysctl -p
   ```
4. Alternatively, optimize the application to read faster, or set `SO_RCVBUF` explicitly (note that this disables receive buffer auto-tuning for that socket)

**Verification**
- After change, monitor pruned packets:
  ```bash
  netstat -s | grep -i "pruned"
  ```
- Expected: Count stops increasing during normal operation
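Because the counter is cumulative since boot, what matters is whether it grows between two samples. The sketch below extracts the value from `netstat -s` output and computes the change over an interval; `pruned_count` and `pruned_delta` are illustrative names.

```bash
# Extract the pruned-packet counter from `netstat -s` output on stdin.
# Prints 0 when the line is absent (the kernel omits zero counters).
pruned_count() {
    awk '/pruned from receive queue because of socket buffer overrun/ { n = $1 }
         END { print n + 0 }'
}

# Growth of the counter over an interval.
# $1 = seconds to wait between the two samples
pruned_delta() {
    before=$(netstat -s | pruned_count)
    sleep "$1"
    after=$(netstat -s | pruned_count)
    echo $(( after - before ))
}
```

Running `pruned_delta 60` under normal load should print `0` once the buffers are large enough.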

## FAQ

**Q: How do I check if my ECS instance security group is blocking traffic?**
A: In the ECS console, navigate to the instance > Security Groups > Inbound Rules. Ensure there's a rule allowing your protocol (e.g., TCP, ICMP) from your source IP. For testing, you can temporarily allow `0.0.0.0/0`, but restrict it in production.

**Q: What permissions are needed to manage ECS network configurations?**
A: You need the `ecs:DescribeInstances`, `ecs:ModifyInstanceNetworkSpec`, and `ecs:DescribeSecurityGroups` permissions for basic network operations. For EIP binding, add `vpc:AssociateEipAddress`. Full network management requires `AliyunECSFullAccess` or equivalent custom policy.

**Q: How do I enable debug logging for network issues on Linux?**
A: Use `tcpdump` to capture packets: `sudo tcpdump -i eth0 -s 0 -w debug.cap`. For kernel-level diagnostics, check `dmesg` output and `/proc/net/softnet_stat`. Enable detailed TCP stats with `netstat -s` and monitor socket states with `ss -tan`.

**Q: What are the common causes of timeout errors when connecting to ECS?**
A: Timeouts usually indicate traffic is being silently dropped. Common causes include: restrictive security group rules, OS firewall blocking (without sending RST), network ACLs on the vSwitch, or the instance being in a blackhole due to detected attacks. Always verify each layer from client to application.

**Q: How do I roll back a failed network configuration change?**
A: For kernel parameters, revert entries in `/etc/sysctl.conf` and run `sysctl -p`. For Windows NIC changes, use Device Manager to uninstall and reinstall the network adapter. For security groups, restore previous rules via console or CLI. Always test changes in staging first and keep backups of configuration files.