Linux运维故障排查实战:33个常见问题解决方案与排查思路 前言 在Linux运维工作中,故障排查是最考验技术功底的技能之一。一个优秀的运维工程师不仅要能快速定位问题,还要有系统性的排查思路和丰富的实战经验。本文总结了33个生产环境中最常见的故障案例,提供详细的排查思路和解决方案,帮助运维人员提升故障处理能力。
一、故障排查基本思路 1.1 故障排查的基本原则
保持冷静 :遇到故障不要慌张,按步骤系统性排查
收集信息 :详细了解故障现象、发生时间、影响范围
查看日志 :日志是故障排查的第一手资料
分层排查 :从应用层到系统层,从软件到硬件
记录过程 :详细记录排查过程,便于总结和分享
验证修复 :修复后要充分验证,确保问题彻底解决
1.2 故障排查工具箱 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 uname -a uptime whoami id date ps aux top htop systemctl status journalctl ping telnet ss -tuln netstat -i traceroute df -h du -sh lsof fuser free -h vmstat iostat sar
二、系统启动和服务类故障 2.1 系统无法启动 故障现象 :服务器开机后无法正常启动到登录界面
排查思路 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 init=/bin/bash single fsck /dev/sda1 fsck -y /dev/sda1 cat /etc/fstabgrub2-mkconfig -o /boot/grub2/grub.cfg dracut --force
解决方案 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 fsck -y /dev/sda1 vi /etc/fstab grub2-install /dev/sda grub2-mkconfig -o /boot/grub2/grub.cfg touch /.autorelabelreboot
2.2 服务无法启动 故障现象 :systemctl start service_name 失败
排查步骤 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 systemctl status nginx systemctl is-enabled nginx journalctl -u nginx -f journalctl -u nginx --since "2024-01-01 00:00:00" nginx -t apachectl configtest ss -tuln | grep :80 lsof -i :80 ls -la /etc/nginx/nginx.confls -la /var/log/nginx/getenforce sestatus audit2why < /var/log/audit/audit.log
解决方案 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 nginx -t kill -9 $(lsof -t -i:80)chown -R nginx:nginx /var/log/nginx/chmod 755 /var/log/nginx/setsebool -P httpd_can_network_connect 1 restorecon -R /etc/nginx/
2.3 Shell脚本不执行 故障现象 :脚本文件存在但无法执行
排查方法 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ls -la script.shfile script.sh hexdump -C script.sh | head head -1 script.shbash -n script.sh sh -x script.sh which bashwhich sh
解决方案 :
1 2 3 4 5 6 7 8 9 10 11 12 13 chmod +x script.shdos2unix script.sh /bin/bash script.sh
三、网络连接类故障 3.1 SSH连接缓慢 故障现象 :SSH登录需要等待很长时间
排查步骤 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 time nslookup client_iptime dig -x client_ipgrep -E "UseDNS|GSSAPIAuthentication" /etc/ssh/sshd_config ping -c 5 client_ip mtr client_ip tail -f /var/log/securejournalctl -u sshd -f
解决方案 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 echo "UseDNS no" >> /etc/ssh/sshd_configecho "GSSAPIAuthentication no" >> /etc/ssh/sshd_configsystemctl restart sshd Host * GSSAPIAuthentication no UseDNS no
3.2 网络不通 故障现象 :无法访问外网或内网服务
排查思路 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 ip addr show ifconfig ip route show route -n cat /etc/resolv.confnslookup google.com iptables -L -n firewall-cmd --list-all ping 8.8.8.8 ping gateway_ip telnet target_host 80 systemctl status NetworkManager systemctl status network
解决方案 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 systemctl restart NetworkManager systemctl restart network nmcli con show nmcli con up connection_name echo "nameserver 8.8.8.8" > /etc/resolv.confecho "nameserver 114.114.114.114" >> /etc/resolv.conffirewall-cmd --zone=public --add-port=80/tcp --permanent firewall-cmd --reload
3.3 端口无法访问 故障现象 :服务正常运行但端口无法访问
排查方法 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ss -tuln | grep :80 netstat -tuln | grep :80 ps aux | grep nginx systemctl status nginx iptables -L -n | grep 80 firewall-cmd --list-ports getenforce getsebool -a | grep http telnet localhost 80 curl -I http://localhost
解决方案 :
1 2 3 4 5 6 7 8 9 10 11 firewall-cmd --zone=public --add-port=80/tcp --permanent firewall-cmd --reload iptables -I INPUT -p tcp --dport 80 -j ACCEPT service iptables save setsebool -P httpd_can_network_connect 1 semanage port -a -t http_port_t -p tcp 8080
四、磁盘和文件系统故障 4.1 磁盘空间不足 故障现象 :系统提示磁盘空间不足,应用无法写入文件
排查步骤 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 df -hdf -i du -sh /* | sort -hrfind / -type f -size +100M -exec ls -lh {} \; du -h --max-depth=1 / | sort -hrdu -sh /var/log/*ls -lah /var/log/du -sh /tmp/*du -sh /var/tmp/*
解决方案 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 journalctl --vacuum-time=7d journalctl --vacuum-size=100M find /var/log -name "*.log" -mtime +30 -delete find /var/log -name "*.log.*" -mtime +7 -delete find /tmp -type f -mtime +7 -delete find /var/tmp -type f -mtime +30 -delete yum clean all apt-get clean find / -name "core.*" -delete lvextend -L +10G /dev/mapper/centos-root xfs_growfs /
4.2 文件系统只读 故障现象 :系统提示”Read-only file system”
排查方法 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 mount | grep "ro," cat /proc/mountsdmesg | grep -i error journalctl | grep -i "read-only" smartctl -a /dev/sda badblocks -v /dev/sda1 tail -f /var/log/messages
解决方案 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 mount -o remount,rw / umount /dev/sda1 fsck -y /dev/sda1 mount /dev/sda1 /mnt fsck -y /dev/sda1 reboot
4.3 文件删除但空间未释放 故障现象 :删除大文件后磁盘空间没有释放
排查方法 :
1 2 3 4 5 6 7 8 9 lsof | grep deleted lsof +L1 lsof | grep deleted | awk '{print $2}' | sort -u ls -la /proc/PID/fd/
解决方案 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 systemctl restart service_name kill -USR1 PID kill -HUP PID lsof | grep deleted exec 3>&- > /path/to/large/file truncate -s 0 /path/to/large/file
五、内存和CPU故障 5.1 内存不足 故障现象 :系统响应缓慢,出现OOM错误
排查步骤 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 free -h cat /proc/meminfops aux --sort =-%mem | head -10 top -o %MEM dmesg | grep -i "killed process" journalctl | grep -i "out of memory" grep -i "killed process" /var/log/messages swapon -s cat /proc/swapsvalgrind --tool=memcheck --leak-check=full program
解决方案 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 echo 1 > /proc/sys/vm/drop_cachesecho 2 > /proc/sys/vm/drop_cachesecho 3 > /proc/sys/vm/drop_cachesecho 10 > /proc/sys/vm/swappinessecho 'vm.swappiness = 10' >> /etc/sysctl.confdd if =/dev/zero of=/swapfile bs=1M count=2048chmod 600 /swapfilemkswap /swapfile swapon /swapfile echo '/swapfile swap swap defaults 0 0' >> /etc/fstabsystemctl restart high_memory_service export JAVA_OPTS="-Xmx2g -Xms1g"
5.2 CPU使用率过高 故障现象 :系统负载过高,响应缓慢
排查方法 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 top htop uptime ps aux --sort =-%cpu | head -10 pidstat -u 1 10 perf top perf record -g ./program perf report strace -c -p PID strace -p PID cat /proc/interruptswatch -n 1 cat /proc/interrupts
解决方案 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 renice -10 PID renice 10 PID cpulimit -p PID -l 50 echo PID > /sys/fs/cgroup/cpu/limited/cgroup.procsecho 50000 > /sys/fs/cgroup/cpu/limited/cpu.cfs_quota_usworker_processes auto; worker_cpu_affinity auto;
六、日志和监控故障 6.1 日志文件过大 故障现象 :日志文件占用大量磁盘空间
处理方法 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 find /var/log -type f -size +100M -exec ls -lh {} \; du -sh /var/log/* | sort -hrwatch -n 1 "du -sh /var/log/messages" cat > /etc/logrotate.d/custom << EOF /var/log/application.log { daily rotate 7 compress delaycompress missingok notifempty copytruncate } EOF logrotate -f /etc/logrotate.d/custom journalctl --vacuum-time=7d journalctl --vacuum-size=100M
6.2 系统时间不同步 故障现象 :系统时间与实际时间不符
解决方案 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 date timedatectl status timedatectl set-timezone Asia/Shanghai ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtimetimedatectl set-ntp true systemctl enable chronyd systemctl start chronyd ntpdate -s time.nist.gov chrony sources -v echo "server 0.centos.pool.ntp.org iburst" >> /etc/chrony.confsystemctl restart chronyd
七、权限和安全故障 7.1 权限拒绝错误 故障现象 :Permission denied错误
排查方法 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 ls -la file_or_directorynamei -l /path/to/file id usernamegroups usernamegetenforce ls -Z file_or_directoryaudit2why < /var/log/audit/audit.log getfacl file_or_directory
解决方案 :
1 2 3 4 5 6 7 8 9 10 11 12 13 chmod 755 file_or_directorychown user:group file_or_directoryusermod -a -G groupname username restorecon -R /path/to/directory chcon -t httpd_exec_t /path/to/filesetfacl -m u:username:rwx file_or_directory
7.2 sudo权限问题 故障现象 :sudo命令执行失败
排查步骤 :
1 2 3 4 5 6 7 8 9 10 11 sudo -lvisudo -c tail -f /var/log/securejournalctl | grep sudo groups usernameid username
解决方案 :
1 2 3 4 5 6 7 8 9 10 11 12 usermod -a -G wheel username usermod -a -G sudo username visudo username ALL=(ALL) ALL chmod 440 /etc/sudoerschown root:root /etc/sudoers
八、应用服务故障 8.1 数据库连接失败 故障现象 :应用无法连接数据库
排查步骤 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 systemctl status mysqld systemctl status postgresql ss -tuln | grep :3306 ss -tuln | grep :5432 mysql -h localhost -u root -p psql -h localhost -U postgres tail -f /var/log/mysqld.logtail -f /var/log/postgresql/postgresql.logfirewall-cmd --list-ports getsebool -a | grep mysql
解决方案 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 systemctl start mysqld systemctl enable mysqld systemctl stop mysqld mysqld_safe --skip-grant-tables & mysql -u root UPDATE mysql.user SET authentication_string=PASSWORD('newpassword' ) WHERE User='root' ; FLUSH PRIVILEGES; firewall-cmd --zone=public --add-port=3306/tcp --permanent firewall-cmd --reload
8.2 Web服务无法访问 故障现象 :网站无法正常访问
排查流程 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 systemctl status nginx systemctl status httpd nginx -t apachectl configtest ss -tuln | grep :80 ss -tuln | grep :443 tail -f /var/log/nginx/error.logtail -f /var/log/httpd/error_logcurl -I http://localhost wget --spider http://localhost
解决方案 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 nginx -t systemctl reload nginx lsof -i :80 kill -9 PIDchown -R nginx:nginx /var/www/htmlchmod -R 755 /var/www/htmlfirewall-cmd --zone=public --add-service=http --permanent firewall-cmd --zone=public --add-service=https --permanent firewall-cmd --reload
九、性能相关故障 9.1 系统响应缓慢 排查思路 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 uptime top htop vmstat 1 10 iostat -x 1 10 free -h cat /proc/meminfoss -s netstat -i df -hdu -sh /*
优化方案 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 echo 3 > /proc/sys/vm/drop_cachesecho 10 > /proc/sys/vm/swappinessecho deadline > /sys/block/sda/queue/schedulerrenice -10 $(pgrep important_process) systemctl disable unnecessary_service systemctl stop unnecessary_service
9.2 网络延迟高 诊断方法 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ping -c 10 target_host mtr target_host traceroute target_host tracepath target_host ip addr show ip route show iftop nload bandwidth nslookup target_host dig target_host
优化措施 :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 echo 'net.ipv4.tcp_fin_timeout = 30' >> /etc/sysctl.confecho 'net.ipv4.tcp_keepalive_time = 1200' >> /etc/sysctl.confecho 'net.core.rmem_max = 134217728' >> /etc/sysctl.confecho 'net.core.wmem_max = 134217728' >> /etc/sysctl.confsysctl -p echo "nameserver 8.8.8.8" > /etc/resolv.confecho "nameserver 114.114.114.114" >> /etc/resolv.confethtool -G eth0 rx 4096 tx 4096 ethtool -K eth0 gso on gro on tso on
十、故障预防和监控 10.1 建立监控体系 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 LOGFILE="/var/log/system_monitor.log" DATE=$(date '+%Y-%m-%d %H:%M:%S' ) CPU_USAGE=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | cut -d'%' -f1) MEM_USAGE=$(free | grep Mem | awk '{printf "%.2f", $3/$2 * 100.0}' ) DISK_USAGE=$(df -h / | awk 'NR==2 {print $5}' | cut -d'%' -f1) LOAD_AVG=$(uptime | awk -F'load average:' '{print $2}' ) echo "$DATE CPU:${CPU_USAGE} % MEM:${MEM_USAGE} % DISK:${DISK_USAGE} % LOAD:${LOAD_AVG} " >> $LOGFILE if (( $(echo "$CPU_USAGE > 80 " | bc -l) )); then echo "WARNING: High CPU usage: ${CPU_USAGE} %" | mail -s "CPU Alert" admin@example.com fi if (( $(echo "$MEM_USAGE > 85 " | bc -l) )); then echo "WARNING: High memory usage: ${MEM_USAGE} %" | mail -s "Memory Alert" admin@example.com fi if (( $(echo "$DISK_USAGE > 90 " | bc -l) )); then echo "WARNING: High disk usage: ${DISK_USAGE} %" | mail -s "Disk Alert" admin@example.com fi
10.2 自动化故障处理 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 check_service () { local service_name=$1 if ! systemctl is-active --quiet $service_name ; then echo "$(date) : $service_name is down, restarting..." >> /var/log/auto_recovery.log systemctl restart $service_name if systemctl is-active --quiet $service_name ; then echo "$(date) : $service_name restarted successfully" >> /var/log/auto_recovery.log else echo "$(date) : Failed to restart $service_name " >> /var/log/auto_recovery.log echo "CRITICAL: Failed to restart $service_name " | mail -s "Service Alert" admin@example.com fi fi } check_service "nginx" check_service "mysqld" check_service "sshd" find /tmp -type f -mtime +1 -delete find /var/log -name "*.log" -size +100M -exec truncate -s 50M {} \;
10.3 故障预防检查清单 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 echo "=== System Health Check Report ===" echo "Date: $(date) " echo echo "1. System Information:" uname -auptime echo echo "2. Disk Space:" df -hecho echo "3. Memory Usage:" free -h echo echo "4. CPU Load:" top -bn1 | head -5 echo echo "5. Network Connections:" ss -s echo echo "6. Critical Services:" for service in sshd nginx mysqld; do if systemctl is-active --quiet $service ; then echo "$service : Running" else echo "$service : Stopped" fi done echo echo "7. Recent Errors:" journalctl --since "1 hour ago" --priority=err --no-pager | tail -10 echo echo "8. Security Check:" echo "Failed login attempts:" lastb | head -5 echo echo "=== End of Report ==="
十一、总结和最佳实践 11.1 故障排查最佳实践
建立标准化流程 :制定详细的故障排查SOP
完善监控体系 :建立全面的系统监控和告警机制
定期健康检查 :定期执行系统健康检查脚本
文档化管理 :详细记录故障处理过程和解决方案
知识库建设 :建立故障案例知识库,便于快速查询
技能提升 :定期进行故障演练,提升团队技能
11.2 预防措施
定期备份 :建立完善的数据备份策略
容量规划 :合理规划系统资源,避免资源不足
安全加固 :定期进行安全检查和加固
版本管理 :建立配置文件版本管理机制
测试验证 :变更前在测试环境充分验证
11.3 工具推荐
监控工具 :Zabbix、Prometheus、Nagios
日志分析 :ELK Stack、Fluentd
性能分析 :perf、strace、tcpdump
自动化 :Ansible、Puppet、SaltStack
文档管理 :GitLab、Confluence
通过系统性的故障排查方法和预防措施,可以大大提升Linux系统的稳定性和可靠性。记住,优秀的运维工程师不仅要能快速解决问题,更要能预防问题的发生。
本文总结了Linux运维中最常见的33个故障案例,提供了详细的排查思路和解决方案。建议运维人员收藏备用,并结合实际环境进行实践验证。