http://c20031776.blog.163.com/blog/static/6847162520134154549681/
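Last night I received a server alert: the process count showed 655 and kept growing, and the load was climbing as well. Logging in to the server got no response, and the Dell remote console was also unresponsive, yet the server still answered ping and the website was still reachable. To avoid affecting service, I rebooted the server as an emergency measure. The results of today's investigation are as follows.
[Figure: monitoring screenshot]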
Server log:
Aug 16 22:48:20 lvs02 kernel: INFO: task sshpass:23238 blocked for more than 120 seconds.
Aug 16 22:48:20 lvs02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 16 22:48:20 lvs02 kernel: sshpass D 0000000000000003 0 23238 23237 0x00000080
Aug 16 22:48:20 lvs02 kernel: ffff880041233a78 0000000000000046 0000000000000000 ffff88004122a040
Aug 16 22:48:20 lvs02 kernel: ffff8800412339f8 ffffffff8126a4c1 ffff88003f4d6cc0 ffff88003f4d6f88
Aug 16 22:48:20 lvs02 kernel: ffff88004122a5f8 ffff880041233fd8 000000000000f4e8 ffff88004122a5f8
Aug 16 22:48:20 lvs02 kernel: Call Trace:
Aug 16 22:48:20 lvs02 kernel: [<ffffffff8126a4c1>] ? cpumask_any_but+0x31/0x50
Aug 16 22:48:20 lvs02 kernel: [<ffffffff814ed975>] schedule_timeout+0x215/0x2e0
Aug 16 22:48:20 lvs02 kernel: [<ffffffff810519c3>] ? __wake_up+0x53/0x70
Aug 16 22:48:20 lvs02 kernel: [<ffffffff814ed5f3>] wait_for_common+0x123/0x180
Aug 16 22:48:20 lvs02 kernel: [<ffffffff8105e7f0>] ? default_wake_function+0x0/0x20
Aug 16 22:48:20 lvs02 kernel: [<ffffffff814ed70d>] wait_for_completion+0x1d/0x20
Aug 16 22:48:20 lvs02 kernel: [<ffffffff8108b521>] flush_cpu_workqueue+0x61/0x90
Aug 16 22:48:20 lvs02 kernel: [<ffffffff8108b5d0>] ? wq_barrier_func+0x0/0x20
Aug 16 22:48:20 lvs02 kernel: [<ffffffff8108bdf4>] flush_workqueue+0x54/0x80
Aug 16 22:48:20 lvs02 kernel: [<ffffffff8108be35>] flush_scheduled_work+0x15/0x20
Aug 16 22:48:20 lvs02 kernel: [<ffffffff81313efc>] tty_ldisc_release+0x3c/0x90
Aug 16 22:48:20 lvs02 kernel: [<ffffffff8130e18b>] tty_release_dev+0x40b/0x5e0
Aug 16 22:48:20 lvs02 kernel: [<ffffffff81132ece>] ? __dec_zone_page_state+0x2e/0x30
Aug 16 22:48:20 lvs02 kernel: [<ffffffff8130e37e>] tty_release+0x1e/0x30
Aug 16 22:48:20 lvs02 kernel: [<ffffffff81177e35>] __fput+0xf5/0x210
Aug 16 22:48:20 lvs02 kernel: [<ffffffff81177f75>] fput+0x25/0x30
Aug 16 22:48:20 lvs02 kernel: [<ffffffff811739bd>] filp_close+0x5d/0x90
Aug 16 22:48:20 lvs02 kernel: [<ffffffff8106c53f>] put_files_struct+0x7f/0xf0
Aug 16 22:48:20 lvs02 kernel: [<ffffffff8106c603>] exit_files+0x53/0x70
Aug 16 22:48:20 lvs02 kernel: [<ffffffff8106e6a5>] do_exit+0x185/0x860
Aug 16 22:48:20 lvs02 kernel: [<ffffffff8106dd3e>] ? sys_wait4+0xae/0x100
Aug 16 22:48:20 lvs02 kernel: [<ffffffff810d4692>] ? audit_syscall_entry+0x272/0x2a0
Aug 16 22:48:20 lvs02 kernel: [<ffffffff8106edd8>] do_group_exit+0x58/0xd0
Aug 16 22:48:20 lvs02 kernel: [<ffffffff8106ee67>] sys_exit_group+0x17/0x20
Aug 16 22:48:20 lvs02 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
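sshpass is the password-authentication process used by the server's scheduled update script. The log above shows that the sshpass process was blocked for more than 120 seconds. My assessment is that this was caused either by a brief network interruption at the data center, or by the synchronization between memory buffers and disk I/O taking longer than 120 seconds. Earlier this week we received a notice from the data center about a network cutover scheduled for 8-17, 00:00-05:00.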
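To make reports like the one above easier to triage, here is a minimal Python sketch that pulls the task name, PID, timeout, and call-trace function names out of hung-task log text. The function name `parse_hung_tasks` and both regular expressions are illustrative assumptions written against the log format shown in this post, not part of any existing tool. `SAMPLE` is an abbreviated excerpt of the log above.

```python
import re

# Header line written by the kernel's hung-task detector.
HUNG_RE = re.compile(
    r"INFO: task (?P<name>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

# One call-trace frame, e.g. "[<ffffffff814ed975>] schedule_timeout+0x215/0x2e0".
# A "?" before the symbol marks a frame the kernel is not sure about.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def parse_hung_tasks(log_text):
    """Return one dict per hung-task report found in log_text."""
    reports = []
    current = None
    for line in log_text.splitlines():
        header = HUNG_RE.search(line)
        if header:
            current = {
                "task": header["name"],
                "pid": int(header["pid"]),
                "timeout_secs": int(header["secs"]),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        # Skip frames marked "?" so only the reliable call chain is kept.
        if frame and current is not None and not frame["unreliable"]:
            current["frames"].append(frame["func"])
    return reports


# Abbreviated excerpt of the log shown in this post.
SAMPLE = """\
Aug 16 22:48:20 lvs02 kernel: INFO: task sshpass:23238 blocked for more than 120 seconds.
Aug 16 22:48:20 lvs02 kernel: Call Trace:
Aug 16 22:48:20 lvs02 kernel: [<ffffffff8126a4c1>] ? cpumask_any_but+0x31/0x50
Aug 16 22:48:20 lvs02 kernel: [<ffffffff814ed975>] schedule_timeout+0x215/0x2e0
Aug 16 22:48:20 lvs02 kernel: [<ffffffff814ed5f3>] wait_for_common+0x123/0x180
Aug 16 22:48:20 lvs02 kernel: [<ffffffff8108bdf4>] flush_workqueue+0x54/0x80
Aug 16 22:48:20 lvs02 kernel: [<ffffffff81313efc>] tty_ldisc_release+0x3c/0x90
"""

if __name__ == "__main__":
    for report in parse_hung_tasks(SAMPLE):
        print(report["task"], report["pid"], " <- ".join(report["frames"]))
```

Listing the reliable frames in order shows which kernel path the task was waiting in, which can help when weighing the candidate causes mentioned above.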
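Material I found online describes a fix for the memory-buffer/disk-I/O synchronization problem. Rationale: Linux lets up to 40% of available memory be used as system cache for data waiting to be written to disk (the exact default depends on the kernel version). When that data is flushed, syncing up to 40% of memory to disk can exceed the 120 s timeout, so the limit is lowered from 40% to 10% to avoid the timeout.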
Add the following line to /etc/sysctl.conf, then run sysctl -p to apply it without rebooting:
vm.dirty_ratio=10
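To get a feel for what the change from 40% to 10% means in absolute terms, the following Python sketch estimates how much unwritten data the ratio allows. It is a rough illustration only: the helper names are hypothetical, the figures in `SAMPLE_MEMINFO` are made up, and approximating "available memory" as MemFree plus Cached is an assumption of this sketch, since the kernel's own accounting differs in detail.

```python
def parse_meminfo_kib(meminfo_text):
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kB; fields without a unit are kept as raw numbers.
    """
    values = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def estimate_dirtyable_kib(meminfo):
    """Approximate the memory the ratio applies to (assumption: free + page cache)."""
    return meminfo.get("MemFree", 0) + meminfo.get("Cached", 0)


def dirty_limit_kib(dirtyable_kib, dirty_ratio_percent):
    """Amount of unwritten data allowed at the given vm.dirty_ratio, in KiB."""
    return dirtyable_kib * dirty_ratio_percent // 100


# Made-up figures for illustration; not taken from the server in this post.
SAMPLE_MEMINFO = """\
MemTotal:       16000000 kB
MemFree:         4000000 kB
Cached:          6000000 kB
"""

if __name__ == "__main__":
    info = parse_meminfo_kib(SAMPLE_MEMINFO)
    dirtyable = estimate_dirtyable_kib(info)
    for ratio in (40, 10):
        print(f"vm.dirty_ratio={ratio}: up to {dirty_limit_kib(dirtyable, ratio)} KiB of unwritten data")
```

On a live system the same functions could be fed the contents of /proc/meminfo, with the current ratio read from /proc/sys/vm/dirty_ratio instead of being hard-coded.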
May 15 13:14:20 localhost kernel: INFO: task sshd:18132 blocked for more than 120 seconds.
May 15 13:14:20 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 15 13:14:20 localhost kernel: sshd D 0000000000000000 0 18132 2067 0x00000080
May 15 13:14:20 localhost kernel: ffff88042edcf768 0000000000000082 0000000000000000 ffff88042edcf758
May 15 13:14:20 localhost kernel: ffffea0009b47070 ffffea0009b47048 0000000100000002 ffff88042edcf8c0
May 15 13:14:20 localhost kernel: ffff8802ee5b7a78 ffff88042edcffd8 000000000000f4e8 ffff8802ee5b7a78
May 15 13:14:20 localhost kernel: Call Trace:
May 15 13:14:20 localhost kernel: [<ffffffff814ed8a5>] schedule_timeout+0x215/0x2e0
May 15 13:14:20 localhost kernel: [<ffffffff814ed523>] wait_for_common+0x123/0x180
May 15 13:14:20 localhost kernel: [<ffffffff8105fa50>] ? default_wake_function+0x0/0x20
May 15 13:14:20 localhost kernel: [<ffffffff814ed63d>] wait_for_completion+0x1d/0x20
May 15 13:14:20 localhost kernel: [<ffffffff8108bcc7>] flush_work+0x77/0xc0
May 15 13:14:20 localhost kernel: [<ffffffff8108b730>] ? wq_barrier_func+0x0/0x20
May 15 13:14:20 localhost kernel: [<ffffffff8108bee4>] flush_delayed_work+0x54/0x70
May 15 13:14:20 localhost kernel: [<ffffffff81314c75>] tty_flush_to_ldisc+0x15/0x20
May 15 13:14:20 localhost kernel: [<ffffffff8130f7b7>] n_tty_poll+0x67/0x1d0
May 15 13:14:20 localhost kernel: [<ffffffff8130b28a>] tty_poll+0x8a/0xa0
May 15 13:14:20 localhost kernel: [<ffffffff8118b6e2>] do_select+0x392/0x6b0
May 15 13:14:20 localhost kernel: [<ffffffff8118ba00>] ? __pollwait+0x0/0xf0
May 15 13:14:20 localhost kernel: [<ffffffff8118baf0>] ? pollwake+0x0/0x60
May 15 13:14:20 localhost kernel: [<ffffffff8118baf0>] ? pollwake+0x0/0x60
May 15 13:14:20 localhost kernel: [<ffffffff8118baf0>] ? pollwake+0x0/0x60
May 15 13:14:20 localhost kernel: [<ffffffff8118baf0>] ? pollwake+0x0/0x60
May 15 13:14:20 localhost kernel: [<ffffffff814ef4fb>] ? _spin_unlock_bh+0x1b/0x20
May 15 13:14:20 localhost kernel: [<ffffffff8141cd2e>] ? release_sock+0xce/0xe0
May 15 13:14:20 localhost kernel: [<ffffffff8146fb9c>] ? tcp_sendmsg+0x73c/0xa10
May 15 13:14:20 localhost kernel: [<ffffffff8141bd9e>] ? sock_aio_write+0x15e/0x170
May 15 13:14:20 localhost kernel: [<ffffffff8118c4ea>] core_sys_select+0x18a/0x2c0
May 15 13:14:20 localhost kernel: [<ffffffff813108cc>] ? n_tty_read+0x3dc/0x970
May 15 13:14:20 localhost kernel: [<ffffffff81090bf0>] ? autoremove_wake_function+0x0/0x40
May 15 13:14:20 localhost kernel: [<ffffffff81218fcf>] ? selinux_file_permission+0xbf/0x150
May 15 13:14:20 localhost kernel: [<ffffffff8118c877>] sys_select+0x47/0x110
May 15 13:14:20 localhost kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b