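Linux Operations Troubleshooting in Practice: 33 Common Problems, Solutions, and Diagnostic Approaches

Preface

Troubleshooting is one of the skills that most tests an engineer's technical depth in Linux operations work. A good operations engineer needs to locate problems quickly, and also needs a systematic diagnostic method backed by hands-on experience. This article collects 33 of the most common failure cases from production environments and gives a diagnostic approach and solution for each, to help operations staff handle incidents more effectively.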

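I. Basic Troubleshooting Approach

1.1 Basic Principles of Troubleshooting

Stay calm: do not panic when a failure occurs; work through it step by step
Gather information: find out exactly what the symptom is, when it started, and what it affects
Read the logs: logs are the primary source of evidence
Work layer by layer: from the application down to the system, from software down to hardware
Record the process: write down each step so it can be reviewed and shared later
Verify the fix: after repairing, test thoroughly to confirm the problem is really gone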

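1.2 The Troubleshooting Toolbox

# System information
uname -a          # kernel and system information
uptime            # uptime and load averages
whoami            # current user
id                # user ID and group membership
date              # system time

# Processes and services
ps aux            # list all processes
top               # live process monitor
htop              # friendlier process monitor
systemctl status  # service status
journalctl        # systemd journal

# Network diagnostics
ping              # reachability test
telnet            # TCP port connectivity test
ss -tuln          # listening TCP/UDP sockets
netstat -i        # network interface statistics
traceroute        # route tracing

# Disks and filesystems
df -h             # filesystem usage
du -sh            # directory size
lsof              # list open files
fuser             # show which processes use a file

# Memory and CPU
free -h           # memory usage
vmstat            # virtual memory statistics
iostat            # I/O statistics
sar               # system activity report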
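
These commands are usually run together in the first minutes of an incident. As a rough sketch only (the function names `run_if_present` and `collect_snapshot` are made up for this example, and the selection of commands is just one reasonable choice), a small wrapper can capture a labeled snapshot and skip tools that are not installed:

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a tool only if it is installed, and label its output.
run_if_present() {
    local tool=$1
    if command -v "$tool" >/dev/null 2>&1; then
        echo "== $*"
        "$@" 2>&1 | head -n 20   # cap the output so the snapshot stays readable
    else
        echo "== $tool: not installed, skipped"
    fi
}

# Hypothetical snapshot: a subset of the toolbox, using non-interactive flags only.
collect_snapshot() {
    run_if_present uname -a
    run_if_present uptime
    run_if_present df -h
    run_if_present free -h
    run_if_present ss -tuln
    run_if_present vmstat 1 3
}
```

Calling `collect_snapshot > /tmp/snapshot.txt` before changing anything gives a baseline to compare against after the fix.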

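II. System Boot and Service Failures

2.1 System Fails to Boot

Symptom: after power-on, the server never reaches the login prompt

Diagnostic approach:

# 1. Check the hardware
# - Look at the server's status LEDs
# - Verify that memory and disks are healthy
# - Review the BIOS/UEFI settings

# 2. Enter rescue mode
# At boot, press e to edit the GRUB entry and append to the end of the linux line:
init=/bin/bash
# or
single
# With init=/bin/bash the root filesystem is mounted read-only; remount it before editing files:
mount -o remount,rw /

# 3. Check the filesystem (the target must be unmounted or mounted read-only)
fsck /dev/sda1
fsck -y /dev/sda1  # answer yes to every repair prompt
# For XFS, use xfs_repair /dev/sda1 instead of fsck

# 4. Check the fstab configuration
cat /etc/fstab
# Comment out any mount entry that is causing the failure

# 5. Regenerate the GRUB configuration
grub2-mkconfig -o /boot/grub2/grub.cfg

# 6. Rebuild the initramfs
dracut --force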
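Solutions:

# Common fixes
# 1. Repair filesystem errors
fsck -y /dev/sda1

# 2. Repair fstab errors
vi /etc/fstab
# Comment out or correct the faulty mount entries

# 3. Reinstall GRUB
grub2-install /dev/sda
grub2-mkconfig -o /boot/grub2/grub.cfg

# 4. Fix SELinux labeling problems (triggers a full relabel on the next boot)
touch /.autorelabel
reboot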

2.2 Service Fails to Start

Symptom: systemctl start service_name fails

Diagnostic steps:

# 1. Check the service status
systemctl status nginx
systemctl is-enabled nginx

# 2. Read the service logs
journalctl -u nginx -f
journalctl -u nginx --since "2024-01-01 00:00:00"

# 3. Validate the configuration files
nginx -t  # test the nginx configuration
apachectl configtest  # test the Apache configuration

# 4. Check whether the port is already in use
ss -tuln | grep :80
lsof -i :80

# 5. Check file permissions
ls -la /etc/nginx/nginx.conf
ls -la /var/log/nginx/

# 6. Check SELinux
getenforce
sestatus
audit2why < /var/log/audit/audit.log
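Solutions:

# 1. Fix configuration errors
nginx -t  # follow the error messages it prints

# 2. Resolve a port conflict
# Stop the process listening on the port; try SIGTERM first and use kill -9 only if it will not exit
kill $(lsof -t -iTCP:80 -sTCP:LISTEN)

# 3. Fix permissions
chown -R nginx:nginx /var/log/nginx/
chmod 755 /var/log/nginx/

# 4. Resolve SELinux denials
setsebool -P httpd_can_network_connect 1
restorecon -R /etc/nginx/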
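
A plain substring match on `:80` also matches `:8080` or `:8000`, which can suggest a conflict that does not exist. The sketch below is one way to test for an exact port; `port_listening` is a hypothetical helper name, and it assumes the input comes from `ss -Htln` (no header, TCP listeners, numeric), where the fourth field is the local address and port.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a socket in `ss -Htln` output listens on exactly this port.
# Reads the ss output on stdin; field 4 is "Local Address:Port".
port_listening() {
    awk -v port="$1" '
        { n = split($4, parts, ":"); if (parts[n] == port) found = 1 }
        END { exit !found }
    '
}
```

Typical use would be `ss -Htln | port_listening 80 && echo "port 80 is already taken"`.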

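2.3 Shell Script Will Not Run

Symptom: the script file exists but cannot be executed

Diagnostic steps:

# 1. Check file permissions
ls -la script.sh

# 2. Check the file format (look for CRLF line endings or a BOM)
file script.sh
hexdump -C script.sh | head

# 3. Check the shebang
head -1 script.sh

# 4. Check the syntax, then trace execution
bash -n script.sh  # parse only, do not run
bash -x script.sh  # run and print each command as it executes

# 5. Check interpreter paths
which bash
which sh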

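Solutions:

# 1. Add execute permission
chmod +x script.sh

# 2. Convert line endings (Windows to Linux)
dos2unix script.sh

# 3. Fix the shebang (it must be the first line of the script)
#!/bin/bash
# or
#!/usr/bin/env bash

# 4. Run the script through the interpreter explicitly
/bin/bash script.sh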
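
The three most common causes (missing execute bit, missing shebang, Windows line endings) can be checked in one pass. This is a minimal sketch; the function name `diagnose_script` and the keywords it prints are invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one keyword per problem found in a script file.
diagnose_script() {
    local file=$1 cr
    cr=$(printf '\r')                      # a literal carriage return
    [ -f "$file" ] || { echo "missing"; return 1; }
    [ -x "$file" ] || echo "not-executable"
    head -n 1 "$file" | grep -q '^#!' || echo "no-shebang"
    if grep -q "$cr" "$file"; then
        echo "crlf-line-endings"           # the fix is dos2unix
    fi
    return 0
}
```

An empty result from `diagnose_script script.sh` means none of the three checks fired, so the cause is more likely a syntax error or a wrong interpreter path.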

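III. Network Connectivity Failures

3.1 Slow SSH Connections

Symptom: SSH logins hang for a long time before the prompt appears

Diagnostic steps:

# 1. Check DNS resolution (reverse lookup of the client address)
time nslookup client_ip
time dig -x client_ip

# 2. Check the SSH server configuration
grep -E "UseDNS|GSSAPIAuthentication" /etc/ssh/sshd_config

# 3. Check network latency
ping -c 5 client_ip
mtr client_ip

# 4. Check the SSH logs
tail -f /var/log/secure
journalctl -u sshd -f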

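Solutions:

# 1. Disable reverse DNS lookups
# sshd uses the first value it finds for each keyword, so edit the existing line instead of appending a duplicate
# (if the keyword is absent from the file, add the line by hand)
sed -i 's/^#\?UseDNS.*/UseDNS no/' /etc/ssh/sshd_config

# 2. Disable GSSAPI authentication
sed -i 's/^#\?GSSAPIAuthentication.*/GSSAPIAuthentication no/' /etc/ssh/sshd_config

# 3. Validate the configuration, then restart the SSH service
sshd -t && systemctl restart sshd

# 4. Client-side tuning
# ~/.ssh/config (UseDNS is a server-only option and is rejected by the client)
Host *
    GSSAPIAuthentication no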
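
Because sshd takes the first value it reads for each keyword, a setting appended to the end of sshd_config has no effect when an earlier uncommented line already sets it. The sketch below shows that rule on a flat file; `effective_sshd_option` is a hypothetical name, and it deliberately ignores `Match` blocks and `Include` files, for which `sshd -T` is the authoritative check.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first uncommented value of a keyword
# from sshd_config text on stdin (keywords are case-insensitive).
effective_sshd_option() {
    awk -v key="$1" 'tolower($1) == tolower(key) { print $2; exit }'
}
```

For example, `effective_sshd_option UseDNS < /etc/ssh/sshd_config` prints the value sshd would pick from the main file, or nothing when the keyword is unset.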

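3.2 Network Unreachable

Symptom: external or internal services cannot be reached

Diagnostic approach:

# 1. Check the network interfaces
ip addr show
ifconfig  # legacy net-tools equivalent

# 2. Check the routing table
ip route show
route -n  # legacy net-tools equivalent

# 3. Check the DNS configuration
cat /etc/resolv.conf
nslookup google.com

# 4. Check the firewall
iptables -L -n
firewall-cmd --list-all

# 5. Test connectivity
ping 8.8.8.8
ping gateway_ip
telnet target_host 80

# 6. Check the network services
systemctl status NetworkManager
systemctl status network  # legacy network-scripts service on older RHEL/CentOS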

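Solutions:

# 1. Restart the network service
systemctl restart NetworkManager
# or
systemctl restart network

# 2. Bring the connection back up
nmcli con show
nmcli con up connection_name

# 3. Fix DNS
# If NetworkManager or systemd-resolved manages /etc/resolv.conf, these edits will be overwritten; set DNS there instead
echo "nameserver 8.8.8.8" > /etc/resolv.conf
echo "nameserver 114.114.114.114" >> /etc/resolv.conf

# 4. Open the required port in the firewall
firewall-cmd --zone=public --add-port=80/tcp --permanent
firewall-cmd --reload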

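3.3 Port Not Reachable

Symptom: the service is running but its port cannot be reached

Diagnostic steps:

# 1. Check that the service is listening (and on which address)
ss -tuln | grep :80
netstat -tuln | grep :80

# 2. Check the process state
ps aux | grep nginx
systemctl status nginx

# 3. Check the firewall
iptables -L -n | grep 80
firewall-cmd --list-ports

# 4. Check SELinux
getenforce
getsebool -a | grep http

# 5. Test locally
telnet localhost 80
curl -I http://localhost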

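Solutions:

# 1. Open the port in firewalld
firewall-cmd --zone=public --add-port=80/tcp --permanent
firewall-cmd --reload

# 2. Configure iptables
iptables -I INPUT -p tcp --dport 80 -j ACCEPT
service iptables save  # requires the iptables-services package on RHEL/CentOS 7 and later

# 3. Configure SELinux
setsebool -P httpd_can_network_connect 1
semanage port -a -t http_port_t -p tcp 8080  # if the port is already defined under another type, use -m instead of -a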

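IV. Disk and Filesystem Failures

4.1 Disk Space Exhausted

Symptom: the system reports that the disk is full and applications cannot write files

Diagnostic steps:

# 1. Check filesystem usage
df -h
df -i  # check inode usage

# 2. Find large files
du -sh /* | sort -hr
find / -type f -size +100M -exec ls -lh {} \;

# 3. Find large directories
du -h --max-depth=1 / | sort -hr

# 4. Inspect log files
du -sh /var/log/*
ls -lah /var/log/

# 5. Inspect temporary files
du -sh /tmp/*
du -sh /var/tmp/*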

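Solutions:

# 1. Clean up log files
journalctl --vacuum-time=7d
journalctl --vacuum-size=100M

# Remove old logs (run each find without -delete first to review the matches)
find /var/log -type f -name "*.log" -mtime +30 -delete
find /var/log -type f -name "*.log.*" -mtime +7 -delete

# 2. Clean up temporary files
find /tmp -type f -mtime +7 -delete
find /var/tmp -type f -mtime +30 -delete

# 3. Clear the package cache
yum clean all     # RHEL/CentOS
apt-get clean     # Debian/Ubuntu

# 4. Remove core dump files
# "core.*" also matches ordinary files such as core.py, so list the candidates and review them before deleting
find / -xdev -type f -name "core.[0-9]*" -size +1M -exec ls -lh {} \;

# 5. Extend the disk
# LVM extension
lvextend -L +10G /dev/mapper/centos-root
xfs_growfs /  # XFS
# or resize2fs /dev/mapper/centos-root  # ext4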
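
Reading `df` output by eye works for one host; for a quick scripted check, the usage column can be compared against a limit. A minimal sketch, assuming POSIX-format input from `df -P` and mount points without spaces; `over_threshold` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `df -P` output on stdin and print every
# mount point whose usage is at or above the given percentage.
over_threshold() {
    awk -v limit="$1" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)              # "95%" -> "95"
            if (use + 0 >= limit + 0) print $6, use "%"
        }
    '
}
```

`df -P | over_threshold 90` lists the filesystems to look at first; `df -Pi | over_threshold 90` does the same for inode usage.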

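4.2 Filesystem Is Read-Only

Symptom: the system reports "Read-only file system"

Diagnostic steps:

# 1. Check the mount state (match the ro option itself, not errors=remount-ro)
mount | grep -E '[(,]ro[,)]'
cat /proc/mounts

# 2. Look for filesystem errors
dmesg | grep -i error
journalctl | grep -i "read-only"

# 3. Check disk health
smartctl -a /dev/sda
badblocks -v /dev/sda1  # read-only scan; can take a long time on large disks

# 4. Watch the system log
tail -f /var/log/messages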

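Solutions:

# 1. Remount read-write
mount -o remount,rw /

# 2. Check and repair the filesystem
# Unmount the filesystem first
umount /dev/sda1
# Check and repair
fsck -y /dev/sda1
# Mount it again
mount /dev/sda1 /mnt

# 3. If it is the root filesystem
# Boot into single-user mode
# by adding this to the GRUB boot parameters: single
# then run
fsck -y /dev/sda1
reboot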

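4.3 File Deleted but Space Not Freed

Symptom: disk space is not released after a large file is deleted

Diagnostic steps:

# 1. List files that are deleted but still held open
lsof | grep deleted
lsof +L1

# 2. Identify the processes holding them
lsof | grep deleted | awk '{print $2}' | sort -u

# 3. Inspect the process's file descriptors
ls -la /proc/PID/fd/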

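Solutions:

# 1. Restart the process holding the file
systemctl restart service_name

# 2. Signal the process to reopen its log files
kill -USR1 PID  # nginx
kill -HUP PID   # rsyslog

# 3. Reclaim the space without restarting the process
# Find the PID and the descriptor number of the deleted file
lsof | grep deleted
# Truncate the deleted file through the process's descriptor
# (exec 3>&- would only close descriptor 3 of the current shell, not of the target process)
: > /proc/PID/fd/FD  # replace PID and FD with the values found above

# 4. In future, empty a large file instead of deleting it
> /path/to/large/file
truncate -s 0 /path/to/large/file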
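
When `lsof` is not installed, the same information is available from `/proc`. The sketch below lists the descriptors of one process that still point at deleted files; `list_deleted_fds` is a hypothetical name, and it relies on Linux appending " (deleted)" to the link target under `/proc/PID/fd`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "FD TARGET" for every descriptor of a process
# that still references a deleted file. Needs permission to read /proc/PID/fd.
list_deleted_fds() {
    local pid=$1 fd target
    for fd in /proc/"$pid"/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)') echo "${fd##*/} $target" ;;
        esac
    done
}
```

The descriptor number it prints is the `FD` to use when truncating through `/proc/PID/fd/FD`.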

V. Memory and CPU Failures

5.1 Out of Memory

Symptom: the system responds slowly and OOM errors appear

Diagnostic steps:

# 1. Check memory usage
free -h
cat /proc/meminfo

# 2. Check per-process memory usage
ps aux --sort=-%mem | head -10
top -o %MEM

# 3. Check the OOM killer logs
dmesg | grep -i "killed process"
journalctl | grep -i "out of memory"
grep -i "killed process" /var/log/messages

# 4. Check swap usage
swapon -s
cat /proc/swaps

# 5. Analyze memory leaks
valgrind --tool=memcheck --leak-check=full program

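Solutions:

# 1. Drop caches (temporary relief only; the kernel already reclaims cache under memory pressure)
sync                               # flush dirty pages first
echo 1 > /proc/sys/vm/drop_caches  # page cache
echo 2 > /proc/sys/vm/drop_caches  # dentries and inodes
echo 3 > /proc/sys/vm/drop_caches  # both

# 2. Adjust the swap policy
echo 10 > /proc/sys/vm/swappiness
echo 'vm.swappiness = 10' >> /etc/sysctl.conf

# 3. Add swap space
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile swap swap defaults 0 0' >> /etc/fstab

# 4. Restart the process that is using too much memory
systemctl restart high_memory_service

# 5. Tune the application configuration
# For example, adjust the Java heap size
export JAVA_OPTS="-Xmx2g -Xms1g"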
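
The "free" figure alone tends to overstate memory pressure, because page cache is reclaimable; `MemAvailable` in `/proc/meminfo` is the kernel's own estimate of what new workloads can use. A small sketch that turns it into a percentage (`mem_available_pct` is a hypothetical name, and the input is meminfo-format text on stdin):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print MemAvailable as a percentage of MemTotal,
# reading /proc/meminfo-format text on stdin.
mem_available_pct() {
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END { if (total > 0) printf "%.1f\n", avail * 100 / total }
    '
}
```

Run it as `mem_available_pct < /proc/meminfo`; a low value that persists is a stronger signal of real pressure than a low "free" column.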

5.2 High CPU Usage

Symptom: system load is too high and responses are slow

Diagnostic steps:

# 1. Check overall CPU usage
top
htop
uptime

# 2. Check per-process CPU usage
ps aux --sort=-%cpu | head -10
pidstat -u 1 10

# 3. Find CPU hot spots
perf top
perf record -g ./program
perf report

# 4. Inspect system calls
strace -c -p PID
strace -p PID

# 5. Inspect interrupts
cat /proc/interrupts
watch -n 1 cat /proc/interrupts

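Solutions:

# 1. Adjust process priority (a higher nice value means a lower priority)
renice -n 10 -p PID   # lower the priority
renice -n -10 -p PID  # raise the priority (requires root)

# 2. Limit a process's CPU usage
cpulimit -p PID -l 50  # cap CPU usage at 50%

# 3. Limit resources with cgroups (cgroup v1 paths; on cgroup v2 write "50000 100000" to cpu.max instead)
mkdir -p /sys/fs/cgroup/cpu/limited
echo 50000 > /sys/fs/cgroup/cpu/limited/cpu.cfs_quota_us  # 50% of one CPU with the default 100000 us period
echo PID > /sys/fs/cgroup/cpu/limited/cgroup.procs

# 4. Tune the application configuration
# For example, adjust the number of nginx worker processes
worker_processes auto;
worker_cpu_affinity auto;

# 5. Analyze and optimize the code
# Use profiling tools to find the performance bottleneck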
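
The CFS quota is expressed in microseconds of CPU time per scheduling period, so a percentage has to be converted before it is written to the cgroup. A worked sketch of that arithmetic (the name `cfs_quota_us` is hypothetical; 100000 us is the kernel's default period):

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert "percent of one CPU" into a CFS quota in microseconds.
# quota = percent / 100 * period; values above 100 allow more than one CPU.
cfs_quota_us() {
    local percent=$1 period_us=${2:-100000}
    echo $(( percent * period_us / 100 ))
}
```

So `cfs_quota_us 50` gives the quota for half a CPU, and `cfs_quota_us 150` gives the quota for one and a half CPUs under the default period.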

VI. Logging and Monitoring Failures

6.1 Oversized Log Files

Symptom: log files take up a large amount of disk space

How to handle it:

# 1. Find large log files
find /var/log -type f -size +100M -exec ls -lh {} \;
du -sh /var/log/* | sort -hr

# 2. Watch log growth in real time
watch -n 1 "du -sh /var/log/messages"

# 3. Configure log rotation
cat > /etc/logrotate.d/custom << 'EOF'
/var/log/application.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    copytruncate
}
EOF

# 4. Force a rotation by hand
logrotate -f /etc/logrotate.d/custom

# 5. Clean up old journal entries
journalctl --vacuum-time=7d
journalctl --vacuum-size=100M

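6.2 System Time Out of Sync

Symptom: the system time does not match the actual time

Solutions:

# 1. Check the current time
date
timedatectl status

# 2. Set the time zone
timedatectl set-timezone Asia/Shanghai
ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

# 3. Enable NTP synchronization
timedatectl set-ntp true
systemctl enable chronyd
systemctl start chronyd

# 4. Synchronize the time manually
ntpdate -s time.nist.gov  # legacy tool; on chrony systems prefer chronyc makestep
chronyc sources -v        # the client command is chronyc, not chrony

# 5. Configure an NTP server
echo "server 0.centos.pool.ntp.org iburst" >> /etc/chrony.conf
systemctl restart chronyd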

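VII. Permission and Security Failures

7.1 Permission Denied Errors

Symptom: "Permission denied" errors

Diagnostic steps:

# 1. Check file permissions (namei shows every component of the path)
ls -la file_or_directory
namei -l /path/to/file

# 2. Check the user and groups
id username
groups username

# 3. Check SELinux
getenforce
ls -Z file_or_directory
audit2why < /var/log/audit/audit.log

# 4. Check ACLs
getfacl file_or_directory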

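Solutions:

# 1. Change file permissions and ownership
chmod 755 file_or_directory
chown user:group file_or_directory

# 2. Add the user to a group (takes effect at the user's next login)
usermod -a -G groupname username

# 3. Fix the SELinux context
restorecon -R /path/to/directory
chcon -t httpd_exec_t /path/to/file  # not persistent across a relabel; use semanage fcontext for a permanent rule

# 4. Set an ACL
setfacl -m u:username:rwx file_or_directory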
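
A file with correct permissions is still unreachable if any parent directory lacks the execute (search) bit for the user. The sketch below prints mode and ownership for each component from the file up to `/`, similar in spirit to `namei -l`; `walk_path` is a hypothetical name, and it assumes GNU `stat` and `readlink -f`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print mode, owner and group for a path and each of its parents.
walk_path() {
    local p
    p=$(readlink -f -- "$1") || return 1
    case $p in
        /*) ;;                 # continue only with an absolute path
        *) return 1 ;;
    esac
    while :; do
        stat -c '%A %U:%G %n' "$p"
        [ "$p" = "/" ] && break
        p=$(dirname "$p")
    done
}
```

In the output of `walk_path /var/www/html/index.html`, look for the first directory whose mode has no `x` for the relevant user class.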

7.2 sudo Permission Problems

Symptom: sudo commands fail

Diagnostic steps:

# 1. Check the sudo configuration
sudo -l
visudo -c

# 2. Read the sudo logs
tail -f /var/log/secure
journalctl | grep sudo

# 3. Check the user's groups
groups username
id username

Solutions:

# 1. Add the user to the sudo group
usermod -a -G wheel username  # CentOS/RHEL
usermod -a -G sudo username   # Ubuntu/Debian

# 2. Edit the sudoers file
visudo
# Add:
username ALL=(ALL) ALL

# 3. Fix the sudoers file permissions
chmod 440 /etc/sudoers
chown root:root /etc/sudoers

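VIII. Application Service Failures

8.1 Database Connection Failure

Symptom: the application cannot connect to the database

Diagnostic steps:

# 1. Check the database service status
systemctl status mysqld
systemctl status postgresql

# 2. Check the database listening ports
ss -tuln | grep :3306
ss -tuln | grep :5432

# 3. Test a database connection
mysql -h localhost -u root -p
psql -h localhost -U postgres

# 4. Check the database logs
tail -f /var/log/mysqld.log
tail -f /var/log/postgresql/postgresql.log

# 5. Check the firewall and SELinux
firewall-cmd --list-ports
getsebool -a | grep mysql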

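Solutions:

# 1. Start the database service
systemctl start mysqld
systemctl enable mysqld

# 2. Reset the database password
# MySQL 5.7.6 and later (the PASSWORD() function was removed in MySQL 8.0)
systemctl stop mysqld
mysqld_safe --skip-grant-tables &
mysql -u root
FLUSH PRIVILEGES;
ALTER USER 'root'@'localhost' IDENTIFIED BY 'newpassword';
# Afterwards, stop mysqld_safe and start the service normally

# 3. Fix the database configuration
# Review the my.cnf configuration file
# Make sure bind-address is set correctly

# 4. Open the firewall port
firewall-cmd --zone=public --add-port=3306/tcp --permanent
firewall-cmd --reload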

8.2 Web Service Unreachable

Symptom: the website cannot be accessed normally

Diagnostic workflow:

# 1. Check the web service status
systemctl status nginx
systemctl status httpd

# 2. Validate the configuration files
nginx -t
apachectl configtest

# 3. Check the listening ports
ss -tuln | grep :80
ss -tuln | grep :443

# 4. Check the log files
tail -f /var/log/nginx/error.log
tail -f /var/log/httpd/error_log

# 5. Test local access
curl -I http://localhost
wget --spider http://localhost

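Solutions:

# 1. Fix configuration errors
nginx -t  # fix the configuration according to the messages
systemctl reload nginx

# 2. Resolve a port conflict
lsof -i :80
kill PID  # use kill -9 only if the process ignores SIGTERM

# 3. Fix permissions (755 for directories, 644 for files, so content is not marked executable)
chown -R nginx:nginx /var/www/html
find /var/www/html -type d -exec chmod 755 {} +
find /var/www/html -type f -exec chmod 644 {} +

# 4. Allow the service through the firewall
firewall-cmd --zone=public --add-service=http --permanent
firewall-cmd --zone=public --add-service=https --permanent
firewall-cmd --reload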

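IX. Performance Failures

9.1 Slow System Response

Diagnostic approach:

# 1. Check the system load
uptime
top
htop

# 2. Analyze I/O wait
vmstat 1 10
iostat -x 1 10

# 3. Check memory usage
free -h
cat /proc/meminfo

# 4. Analyze network state
ss -s
netstat -i

# 5. Check disk usage
df -h
du -sh /*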

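Optimizations:

# 1. Drop the system caches (temporary relief; flush dirty pages first)
sync
echo 3 > /proc/sys/vm/drop_caches

# 2. Reduce swap usage
echo 10 > /proc/sys/vm/swappiness

# 3. Tune the I/O scheduler
cat /sys/block/sda/queue/scheduler  # list the schedulers this kernel offers
echo deadline > /sys/block/sda/queue/scheduler  # on multi-queue kernels the name is mq-deadline

# 4. Raise the priority of an important process
renice -n -10 -p $(pgrep important_process)

# 5. Stop and disable unnecessary services
systemctl disable unnecessary_service
systemctl stop unnecessary_service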

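9.2 High Network Latency

Diagnostic steps:

# 1. Measure network latency
ping -c 10 target_host
mtr target_host

# 2. Analyze the network path
traceroute target_host
tracepath target_host

# 3. Check the network configuration
ip addr show
ip route show

# 4. Monitor network traffic
iftop
nload
sar -n DEV 1 5  # per-interface throughput from sysstat

# 5. Check DNS resolution
nslookup target_host
dig target_host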

Optimizations:

# 1. Tune TCP parameters
echo 'net.ipv4.tcp_fin_timeout = 30' >> /etc/sysctl.conf
echo 'net.ipv4.tcp_keepalive_time = 1200' >> /etc/sysctl.conf
echo 'net.core.rmem_max = 134217728' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' >> /etc/sysctl.conf
sysctl -p

# 2. Tune the DNS configuration
# If NetworkManager or systemd-resolved manages /etc/resolv.conf, set DNS there instead
echo "nameserver 8.8.8.8" > /etc/resolv.conf
echo "nameserver 114.114.114.114" >> /etc/resolv.conf

# 3. Adjust NIC parameters
ethtool -g eth0  # show the supported ring sizes before changing them
ethtool -G eth0 rx 4096 tx 4096
ethtool -K eth0 gso on gro on tso on
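
To compare latency before and after a tuning change, it helps to reduce `ping` output to a single number. A minimal sketch, assuming the summary-line format printed by iputils (`rtt min/avg/max/mdev = ...`) or BSD ping (`round-trip min/avg/max/stddev = ...`); `ping_avg_ms` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the average round-trip time (ms) from ping output on stdin.
# Splitting the summary line on "/" puts the average in field 5.
ping_avg_ms() {
    awk -F'/' '/^(rtt|round-trip) / { print $5 }'
}
```

For example, `ping -c 10 target_host | ping_avg_ms` can be run before and after a change and the two values recorded side by side.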

X. Failure Prevention and Monitoring

10.1 Building a Monitoring System

# 1. System monitoring script
#!/bin/bash
# system_monitor.sh

LOGFILE="/var/log/system_monitor.log"
DATE=$(date '+%Y-%m-%d %H:%M:%S')

# CPU usage: 100 minus the idle percentage from top's summary line
# (the original took only the user-time field, which understates the load)
CPU_USAGE=$(LC_ALL=C top -bn1 | awk -F',' '/Cpu\(s\)/ { for (i = 1; i <= NF; i++) if ($i ~ /id/) { gsub(/[^0-9.]/, "", $i); printf "%.1f", 100 - $i } }')

# Memory usage (used / total)
MEM_USAGE=$(free | awk '/^Mem:/ {printf "%.2f", $3/$2 * 100.0}')

# Disk usage of the root filesystem (-P keeps each entry on one line)
DISK_USAGE=$(df -P / | awk 'NR==2 {print $5}' | tr -d '%')

# Load averages
LOAD_AVG=$(uptime | awk -F'load average: ' '{print $2}')

# Write the log entry
echo "$DATE CPU:${CPU_USAGE}% MEM:${MEM_USAGE}% DISK:${DISK_USAGE}% LOAD:${LOAD_AVG}" >> "$LOGFILE"

# Alert checks (require bc and a configured mail command)
if (( $(echo "$CPU_USAGE > 80" | bc -l) )); then
    echo "WARNING: High CPU usage: ${CPU_USAGE}%" | mail -s "CPU Alert" admin@example.com
fi

if (( $(echo "$MEM_USAGE > 85" | bc -l) )); then
    echo "WARNING: High memory usage: ${MEM_USAGE}%" | mail -s "Memory Alert" admin@example.com
fi

if (( $(echo "$DISK_USAGE > 90" | bc -l) )); then
    echo "WARNING: High disk usage: ${DISK_USAGE}%" | mail -s "Disk Alert" admin@example.com
fi
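
The alert checks depend on `bc`, which is missing from many minimal installations. As one possible alternative (not part of the script itself; `exceeds` is a hypothetical name), the floating-point comparison can be done with awk, which is present on practically every system:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when value is strictly greater than limit.
# Both arguments may be decimals; awk does the floating-point comparison.
exceeds() {
    awk -v value="$1" -v limit="$2" 'BEGIN { exit !(value + 0 > limit + 0) }'
}
```

A check then reads `if exceeds "$CPU_USAGE" 80; then ...; fi`, with no arithmetic expansion around a subshell.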

10.2 Automated Failure Handling

# 2. Automatic recovery script
#!/bin/bash
# auto_recovery.sh

# Check a service and restart it if it is down
check_service() {
    local service_name=$1
    if ! systemctl is-active --quiet "$service_name"; then
        echo "$(date): $service_name is down, restarting..." >> /var/log/auto_recovery.log
        systemctl restart "$service_name"
        if systemctl is-active --quiet "$service_name"; then
            echo "$(date): $service_name restarted successfully" >> /var/log/auto_recovery.log
        else
            echo "$(date): Failed to restart $service_name" >> /var/log/auto_recovery.log
            echo "CRITICAL: Failed to restart $service_name" | mail -s "Service Alert" admin@example.com
        fi
    fi
}

# Check the critical services
check_service "nginx"
check_service "mysqld"
check_service "sshd"

# Clean up temporary files
find /tmp -type f -mtime +1 -delete

# Trim oversized log files
# Note: truncate keeps the first (oldest) 50M and discards the newest entries; logrotate is the safer choice
find /var/log -type f -name "*.log" -size +100M -exec truncate -s 50M {} \;
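
A single restart attempt can fail for transient reasons, such as a dependency that is still starting, while unlimited restarts can hide a real fault. One way to sit between the two is a bounded retry; the sketch below is an illustration only, and `retry` is a hypothetical helper rather than part of the script above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping DELAY seconds
# between tries. Succeeds as soon as the command succeeds; fails after the last try.
retry() {
    local attempts=$1 delay=$2 n=1
    shift 2
    until "$@"; do
        [ "$n" -ge "$attempts" ] && return 1
        n=$((n + 1))
        sleep "$delay"
    done
}
```

In a recovery script this could wrap the post-restart check, for example `retry 3 5 systemctl is-active --quiet nginx`, so that the alert is sent only after three failed checks.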

10.3 Failure Prevention Checklist

# 3. System health check script
#!/bin/bash
# health_check.sh

echo "=== System Health Check Report ==="
echo "Date: $(date)"
echo

# 1. Basic system information
echo "1. System Information:"
uname -a
uptime
echo

# 2. Disk space check
echo "2. Disk Space:"
df -h
echo

# 3. Memory usage check
echo "3. Memory Usage:"
free -h
echo

# 4. CPU load check
echo "4. CPU Load:"
top -bn1 | head -5
echo

# 5. Network connection check
echo "5. Network Connections:"
ss -s
echo

# 6. Service status check
echo "6. Critical Services:"
for service in sshd nginx mysqld; do
    if systemctl is-active --quiet "$service"; then
        echo "$service: Running"
    else
        echo "$service: Stopped"
    fi
done
echo

# 7. Recent errors in the journal
echo "7. Recent Errors:"
journalctl --since "1 hour ago" --priority=err --no-pager | tail -10
echo

# 8. Security check (lastb requires root)
echo "8. Security Check:"
echo "Failed login attempts:"
lastb | head -5
echo

echo "=== End of Report ==="

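XI. Summary and Best Practices

11.1 Troubleshooting Best Practices

Standardize the process: write a detailed troubleshooting SOP
Improve monitoring: build comprehensive system monitoring and alerting
Run regular health checks: execute the system health check script on a schedule
Document everything: record how each failure was handled and resolved
Build a knowledge base: keep a searchable collection of failure cases
Improve skills: run regular failure drills to strengthen the team

11.2 Preventive Measures

Regular backups: establish a sound data backup strategy
Capacity planning: plan system resources so they do not run short
Security hardening: carry out regular security checks and hardening
Version control: keep configuration files under version management
Test and verify: validate changes thoroughly in a test environment before rollout

11.3 Recommended Tools

Monitoring: Zabbix, Prometheus, Nagios
Log analysis: ELK Stack, Fluentd
Performance analysis: perf, strace, tcpdump
Automation: Ansible, Puppet, SaltStack
Documentation: GitLab, Confluence

A systematic troubleshooting method combined with preventive measures can greatly improve the stability and reliability of Linux systems. A good operations engineer not only resolves problems quickly but also prevents them from happening in the first place.

This article has summarized 33 of the most common failure cases in Linux operations, with a diagnostic approach and solution for each. Keep it at hand for reference, and validate each step against your own environment before relying on it.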