I am trying to find out why my server's load average reaches 150, and at times 200.
This is an Ubuntu server running as a VMware ESXi virtual machine with 8 CPUs and 26 GB of RAM. It is a mail server (Kerio Mailserver) with about 500 user accounts. The load gets very high during business hours (8:30 AM to 6:30 PM). Kerio Mailserver is an Exchange-like mail server, and users synchronize Outlook with it through a connector.
Below are the results of some of the commands I ran while looking for the cause of the high load.
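For reference, these are roughly the invocations I used for the outputs below; the 2-second iostat interval is an assumption based on the timestamps:

# interactive process/CPU view
top
# extended per-device I/O statistics, timestamped, every 2 seconds (sysstat)
iostat -x -t 2
# dump all kernel tunables
sysctl -a
# protocol-level counters
netstat -s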
top:
top - 16:25:32 up 46 days, 20:59, 2 users, load average: 178.44, 164.61, 156.84
Tasks: 241 total, 1 running, 240 sleeping, 0 stopped, 0 zombie
%Cpu(s): 45.8 us, 53.6 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.5 si, 0.0 st
KiB Mem: 24689476 total, 24234796 used, 454680 free, 675324 buffers
KiB Swap: 23436284 total, 763136 used, 22673148 free. 14616960 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1253 root 20 0 13.889g 6.918g 22680 S 99.3 29.4 190621:37 mailserver
64 root 20 0 0 0 0 S 0.2 0.0 257:29.22 kswapd0
30482 root 20 0 25036 3160 2516 R 0.1 0.0 0:01.33 top
8227 root 20 0 318856 12232 11288 S 0.0 0.0 15:39.14 smbd
1 root 20 0 36408 3144 1884 S 0.0 0.0 0:06.40 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.49 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 2:32.48 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
7 root 20 0 0 0 0 S 0.0 0.0 81:48.05 rcu_sched
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
9 root rt 0 0 0 0 S 0.0 0.0 0:05.96 migration/0
10 root rt 0 0 0 0 S 0.0 0.0 0:12.15 watchdog/0
11 root rt 0 0 0 0 S 0.0 0.0 0:11.93 watchdog/1
12 root rt 0 0 0 0 S 0.0 0.0 0:06.02 migration/1
13 root 20 0 0 0 0 S 0.0 0.0 2:17.19 ksoftirqd/1
14 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0
15 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0H
16 root rt 0 0 0 0 S 0.0 0.0 0:10.84 watchdog/2
17 root rt 0 0 0 0 S 0.0 0.0 0:06.29 migration/2
18 root 20 0 0 0 0 S 0.0 0.0 2:22.20 ksoftirqd/2
19 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/2:0
20 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/2:0H
21 root rt 0 0 0 0 S 0.0 0.0 0:10.75 watchdog/3
22 root rt 0 0 0 0 S 0.0 0.0 0:06.34 migration/3
23 root 20 0 0 0 0 S 0.0 0.0 2:07.07 ksoftirqd/3
24 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kworker/3:0
25 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/3:0H
26 root rt 0 0 0 0 S 0.0 0.0 0:11.49 watchdog/4
27 root rt 0 0 0 0 S 0.0 0.0 0:06.34 migration/4
28 root 20 0 0 0 0 S 0.0 0.0 1:50.66 ksoftirqd/4
30 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/4:0H
31 root rt 0 0 0 0 S 0.0 0.0 0:11.48 watchdog/5
32 root rt 0 0 0 0 S 0.0 0.0 0:06.45 migration/5
33 root 20 0 0 0 0 S 0.0 0.0 2:04.74 ksoftirqd/5
34 root 20 0 0 0 0 S 0.0 0.0 1:30.98 kworker/5:0
35 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/5:0H
36 root rt 0 0 0 0 S 0.0 0.0 0:11.22 watchdog/6
37 root rt 0 0 0 0 S 0.0 0.0 0:06.40 migration/6
38 root 20 0 0 0 0 S 0.0 0.0 2:23.44 ksoftirqd/6
40 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/6:0H
41 root rt 0 0 0 0 S 0.0 0.0 0:11.06 watchdog/7
42 root rt 0 0 0 0 S 0.0 0.0 0:06.50 migration/7
43 root 20 0 0 0 0 S 0.0 0.0 2:06.70 ksoftirqd/7
45 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/7:0H
46 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kdevtmpfs
47 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 netns
48 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 perf
49 root 20 0 0 0 0 S 0.0 0.0 0:08.92 khungtaskd
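One thing worth noting about this snapshot: the load average counts runnable and uninterruptible (D-state) threads, while top's task summary counts processes, so a load of ~178 alongside "1 running" usually means a single heavily multithreaded process (here most likely the mailserver) has many threads busy or blocked. A quick check, assuming procps ps:

# list threads (LWPs) currently running (R) or in uninterruptible sleep (D)
ps -eLo state,pid,lwp,wchan:32,comm | awk '$1 ~ /^[RD]/'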
iostat:
Linux 4.4.0-124-generic (mardom-mail) 08/06/2018 _x86_64_ (8 CPU)
08/06/2018 04:26:33 PM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.86 88.05 95.18 148.71 1809.49 2014.93 31.36 1.03 4.21 6.02 3.05 0.82 20.08
08/06/2018 04:26:35 PM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 1.50 164.00 255.00 234.00 3744.00 2028.00 23.61 2.06 4.21 6.00 2.26 1.35 66.00
08/06/2018 04:26:37 PM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.00 160.50 211.00 25.50 3164.00 1188.00 36.80 1.30 5.45 5.90 1.80 2.88 68.00
08/06/2018 04:26:39 PM
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdb 0.50 207.00 129.00 63.50 2808.00 1942.00 49.35 1.16 6.06 8.48 1.13 3.35 64.40
*sdb is the device that stores the email data*
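With await around 4-6 ms and %util peaking near 68%, sdb looks busy but not obviously saturated. To keep watching just that device during the busy window (assuming a reasonably recent sysstat, where -y suppresses the since-boot summary):

# extended stats for sdb only, every 2 seconds, first since-boot report omitted
iostat -dxy 2 sdb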
sysctl:
fs.file-max = 2097152
vm.swappiness = 10
vm.dirty_ratio = 60
vm.dirty_background_ratio = 2
net.ipv4.tcp_synack_retries = 2
net.ipv4.ip_local_port_range = 2000 65535
net.ipv4.tcp_rfc1337 = 1
net.ipv4.tcp_fin_timeout = 15
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 15
net.core.rmem_default = 31457280
net.core.rmem_max = 12582912
net.core.wmem_default = 31457280
net.core.wmem_max = 12582912
net.ipv4.tcp_max_syn_backlog = 4096
net.core.somaxconn = 4096
net.core.netdev_max_backlog = 65536
net.core.optmem_max = 25165824
net.ipv4.tcp_mem = 65536 131072 262144
net.ipv4.udp_mem = 65536 131072 262144
net.ipv4.tcp_rmem = 8192 87380 16777216
net.ipv4.udp_rmem_min = 16384
net.ipv4.tcp_wmem = 8192 65536 16777216
net.ipv4.udp_wmem_min = 16384
net.ipv4.tcp_max_tw_buckets = 1440000
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_rmem = 20240 87380 16582912
net.ipv4.tcp_wmem = 20240 87380 16582912
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_no_metrics_save = 1
net.core.netdev_max_backlog = 8000
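Note that net.ipv4.tcp_rmem, net.ipv4.tcp_wmem, and net.core.netdev_max_backlog each appear twice in this list with different values; when these come from a sysctl configuration file, the last occurrence wins, so it is worth confirming what the kernel is actually using:

# print the currently effective values of the duplicated keys
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.core.netdev_max_backlog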
netstat -s: This is part of the "netstat -s" command output.
TcpExt:
147823 SYN cookies sent
90353 SYN cookies received
649892 invalid SYN cookies received
35112 resets received for embryonic SYN_RECV sockets
598 packets pruned from receive queue because of socket buffer overrun
1248 ICMP packets dropped because they were out-of-window
1750991 TCP sockets finished time wait in fast timer
115289 TCP sockets finished time wait in slow timer
198679 passive connections rejected because of time stamp
313672 packets rejects in established connections because of timestamp
30746108 delayed acks sent
52041 delayed acks further delayed because of locked socket
Quick ack mode was activated 3386942 times
708473 times the listen queue of a socket overflowed
1060673 SYNs to LISTEN sockets dropped
3845729 packets directly queued to recvmsg prequeue.
356530 bytes directly in process context from backlog
352524121 bytes directly received in process context from prequeue
477210889 packet headers predicted
269376 packets header predicted and directly queued to user
863932645 acknowledgments not containing data payload received
1164901897 predicted acknowledgments
5140 times recovered from packet loss due to fast retransmit
3122668 times recovered from packet loss by selective acknowledgements
17 bad SACK blocks received
Detected reordering 2283 times using FACK
Detected reordering 2792 times using SACK
Detected reordering 79 times using reno fast retransmit
Detected reordering 7587 times using time stamp
11657 congestion windows fully recovered without slow start
7574 congestion windows partially recovered using Hoe heuristic
113083 congestion windows recovered without slow start by DSACK
1749877 congestion windows recovered without slow start after partial ack
TCPLostRetransmit: 505007
446 timeouts after reno fast retransmit
69083 timeouts after SACK recovery
37057 timeouts in loss state
10089153 fast retransmits
193583 forward retransmits
842022 retransmits in slow start
5758947 other TCP timeouts
TCPLossProbes: 16066532
TCPLossProbeRecovery: 213644
551 classic Reno fast retransmits failed
81107 SACK retransmits failed
74 times receiver scheduled too late for direct processing
9695 packets collapsed in receive queue due to low socket buffer
3556447 DSACKs sent for old packets
99902 DSACKs sent for out of order packets
11030608 DSACKs received
47032 DSACKs for out of order packets received
2171835 connections reset due to unexpected data
1370307 connections reset due to early user close
1984601 connections aborted due to timeout
TCPSACKDiscard: 52
TCPDSACKIgnoredOld: 38928
TCPDSACKIgnoredNoUndo: 4151872
TCPSpuriousRTOs: 50607
TCPSackShifted: 11911393
TCPSackMerged: 12335047
TCPSackShiftFallback: 12395942
IPReversePathFilter: 1
TCPReqQFullDoCookies: 161423
TCPRetransFail: 123
TCPRcvCoalesce: 306299180
TCPOFOQueue: 6851749
TCPOFOMerge: 94743
TCPChallengeACK: 74466
TCPSYNChallenge: 6119
TCPFastOpenCookieReqd: 3
TCPSpuriousRtxHostQueues: 1761
TCPAutoCorking: 59800735
TCPFromZeroWindowAdv: 52396
TCPToZeroWindowAdv: 52396
TCPWantZeroWindowAdv: 416614
TCPSynRetrans: 2205685
TCPOrigDataSent: -230466592
TCPHystartTrainDetect: 950609
TCPHystartTrainCwnd: 17827670
TCPHystartDelayDetect: 649377
TCPHystartDelayCwnd: 27358928
TCPACKSkippedSynRecv: 227586
TCPACKSkippedPAWS: 32292
TCPACKSkippedSeq: 30707
TCPACKSkippedFinWait2: 7
TCPACKSkippedTimeWait: 48
TCPACKSkippedChallenge: 1494
TCPWinProbe: 279293
TCPKeepAlive: 1685
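Two of the counters above stand out: "708473 times the listen queue of a socket overflowed" and "1060673 SYNs to LISTEN sockets dropped". A simple way to watch the accept queues live while the load is high:

# on LISTEN sockets, Recv-Q is the current accept queue depth and Send-Q the configured backlog
watch -n 1 ss -lnt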
"netstat -nat | wc -l"を実行すると、2988個の接続が発生します。
As you can see in the top output, the mailserver process consumes more than 90% CPU. Could this be what is causing the high load?
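One way to tell whether that single process is actually saturating several cores is to look at its threads; PID 1253 is taken from the top output above:

# one batch snapshot of the mailserver's threads, sorted by CPU usage
top -b -H -n 1 -p 1253 | head -40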
**Update:** I found that when I blocked connections to the server from a particular VLAN (a VLAN with roughly 60+ systems), the load started to drop. Is this a network-related problem, or simply the server's overall capacity?
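If anyone wants more data on this, here is a rough sketch of how I could identify the noisiest hosts on that VLAN; the interface name eth0 and the subnet 192.168.60.0/24 are placeholders, not my actual values:

# capture 2000 packets to/from the suspect VLAN and count packets per source IP
tcpdump -nn -i eth0 -c 2000 net 192.168.60.0/24 2>/dev/null | awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head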