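I am running Debian Bullseye on a (used) bare-metal server and I am seeing occasional crashes (3 times in 8 days) that I cannot explain. I have not found a way to reproduce them either, at least not from outside the system, it seems.
In all three cases, the following happened:
- The system was (really) idle.
- The kernel log contains a stack trace with the error
NETDEV WATCHDOG: enp0s31f6 (e1000e): transmit queue 0 timed out
and no older messages leading up to it (the gap to the previous message is typically several hours).
- After that, the message
e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
repeats every 10 to 20 seconds.
- At this point the network was down and the machine was unreachable, so I triggered a hardware reset (several times) to get it working again.
This time I first tried to find out whether I could reset the network via the console. (I did not try unloading and re-inserting the driver module, though, and I do not know whether that would help.) That did not seem to be a productive effort.
Can anyone offer advice on how to debug this when it happens again, how to reproduce the problem, and how to get the machine working again without a hardware reset?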
Log file
(This is from the first occurrence. The logs from all three are identical, or at least very similar.)
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.662109] ------------[ cut here ]------------
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.662249] NETDEV WATCHDOG: enp0s31f6 (e1000e): transmit queue 0 timed out
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.662401] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:467 dev_watchdog+0x260/0x270
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.662554] Modules linked in: dm_mod xt_nat vhost_net vhost vhost_iotlb tap tun xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_
ipv6 nf_defrag_ipv4 nft_counter nf_tables nfnetlink bridge stp llc intel_rapl_msr intel_rapl_common intel_pmc_core_pltdrv intel_pmc_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel evdev kvm irqbypass rapl intel_cstate intel_uncore wdat_wdt intel_pch_thermal
watchdog ee1004 serio_raw ie31200_edac acpi_pad button drm fuse configfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid0 multip
ath linear raid1 md_mod crc32_pclmul crc32c_intel ahci xhci_pci ghash_clmulni_intel xhci_hcd libahci nvme e1000e libata aesni_intel usbcore libaes crypto_simd scsi_mod nvme_core ptp psmouse pps_core cryptd glue_helper t10_pi i2c_i801 crc_t10dif
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.663385] crct10dif_generic i2c_smbus crct10dif_pclmul crct10dif_common wmi usb_common video
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.664310] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.10.0-21-amd64 #1 Debian 5.10.162-1
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.664461] Hardware name: FUJITSU /D3417-B2, BIOS V5.0.0.12 R1.27.0.SR.1 for D3417-B2x 06/10/2020
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.664630] RIP: 0010:dev_watchdog+0x260/0x270
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.664747] Code: eb a9 48 8b 1c 24 c6 05 c7 16 0d 01 01 48 89 df e8 b5 73 fa ff 44 89 e9 48 89 de 48 c7 c7 08 b8 b6 91 48 89 c2 e8 da a0 14 00 <0f> 0b eb 86 66 66 2e 0f 1f 84 00 00 00 00 00 90 0f 1f 44 00 00 41
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.664968] RSP: 0018:ffffbb7e40128eb0 EFLAGS: 00010282
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665088] RAX: 0000000000000000 RBX: ffff920c20740000 RCX: 000000000000083f
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665234] RDX: 0000000000000000 RSI: 00000000000000f6 RDI: 000000000000083f
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665381] RBP: ffff920c207403dc R08: 0000000000000000 R09: ffffbb7e40128cd0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665532] R10: ffffbb7e40128cc8 R11: ffffffff920cb6a8 R12: ffff920b4143c080
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665681] R13: 0000000000000000 R14: ffff920c20740480 R15: 0000000000000001
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665832] FS: 0000000000000000(0000) GS:ffff921a2e440000(0000) knlGS:0000000000000000
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.665985] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666101] CR2: 000000c0002f9000 CR3: 0000000c9480a001 CR4: 00000000003726e0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666249] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666394] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666537] Call Trace:
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666646] <IRQ>
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666754] ? pfifo_fast_enqueue+0x150/0x150
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666868] call_timer_fn+0x27/0x100
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.666988] __run_timers.part.0+0x1d9/0x250
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667106] ? ktime_get+0x35/0xa0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667223] ? lapic_next_deadline+0x28/0x40
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667340] ? clockevents_program_event+0x8a/0xf0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667462] run_timer_softirq+0x26/0x50
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667536] __do_softirq+0xc2/0x279
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667610] asm_call_irq_on_stack+0xf/0x20
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667684] </IRQ>
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667755] do_softirq_own_stack+0x37/0x50
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667830] irq_exit_rcu+0x92/0xc0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667904] sysvec_apic_timer_interrupt+0x36/0x80
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.667980] asm_sysvec_apic_timer_interrupt+0x12/0x20
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668057] RIP: 0010:cpuidle_enter_state+0xc7/0x350
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668133] Code: 8b 3d dd 71 f4 6e e8 b8 9a 9f ff 49 89 c5 0f 1f 44 00 00 31 ff e8 29 a6 9f ff 45 84 ff 0f 85 fe 00 00 00 fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 0a 01 00 00 49 63 c6 4c 2b 2c 24 48 8d 14 40 48 8d
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668256] RSP: 0018:ffffbb7e400c3ea8 EFLAGS: 00000246
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668334] RAX: ffff921a2e473c40 RBX: 0000000000000006 RCX: 000000000000001f
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668425] RDX: 0000000000000000 RSI: 0000000021c15a3d RDI: 0000000000000000
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668517] RBP: ffff921a2e47e800 R08: 00007429fb821b6a R09: 0000000000000001
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668608] R10: 0000000000000000 R11: 0000000000002b55 R12: ffffffff921aea80
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668700] R13: 00007429fb821b6a R14: 0000000000000006 R15: 0000000000000000
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668792] ? cpuidle_enter_state+0xb7/0x350
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668867] cpuidle_enter+0x29/0x40
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.668941] do_idle+0x1f3/0x2b0
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.669015] cpu_startup_entry+0x19/0x20
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.669089] secondary_startup_64_no_verify+0xb0/0xbb
Feb 21 06:34:54 Debian-1106-bullseye-amd64-base kernel: [127723.669166] ---[ end trace 4e1f5ac6215c3384 ]---
Hardware information
# lspci -vvvv -s 0000:00:1f.6
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-LM (rev 31)
Subsystem: Fujitsu Technology Solutions Ethernet Connection (2) I219-LM
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 126
IOMMU group: 8
Region 0: Memory at ef200000 (32-bit, non-prefetchable) [size=128K]
Capabilities: [c8] Power Management version 3
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: 00000000fee002b8 Data: 0000
Capabilities: [e0] PCI Advanced Features
AFCap: TP+ FLR+
AFCtrl: FLR-
AFStatus: TP-
Kernel driver in use: e1000e
Kernel modules: e1000e
uname -a
Linux Debian-1106-bullseye-amd64-base 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux
About the kernel package
apt show linux-image-5.10.0-21-amd64
Package: linux-image-5.10.0-21-amd64
Version: 5.10.162-1
Built-Using: linux (= 5.10.162-1)
Priority: optional
Section: kernel
Source: linux-signed-amd64 (5.10.162+1)
Maintainer: Debian Kernel Team <[email protected]>
Installed-Size: 318 MB
Depends: kmod, linux-base (>= 4.3~), initramfs-tools (>= 0.120+deb8u2) | linux-initramfs-tool
Recommends: firmware-linux-free, apparmor
Suggests: linux-doc-5.10, debian-kernel-handbook, grub-pc | grub-efi-amd64 | extlinux
Conflicts: linux-image-5.10.0-21-amd64-unsigned
Breaks: fwupdate (<< 12-7), initramfs-tools (<< 0.120+deb8u2), wireless-regdb (<< 2019.06.03-1~), xserver-xorg-input-vmmouse (<< 1:13.0.99)
Replaces: linux-image-5.10.0-21-amd64-unsigned
Homepage: https://www.kernel.org/
Download-Size: 55.5 MB
APT-Manual-Installed: no
APT-Sources: http://security.debian.org/debian-security bullseye-security/main amd64 Packages
Description: Linux 5.10 for 64-bit PCs (signed)
The Linux kernel 5.10 and modules for use on PCs with AMD64, Intel 64 or
VIA Nano processors.
.
The kernel image and modules are signed for use with Secure Boot.
Answer 1
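One of the first things to try when you see TX timeouts is to disable TSO (note the uppercase -K, which changes a setting; lowercase -k only displays the current settings):
sudo ethtool -K enp0s31f6 tso off
I would also like to know whether
ethtool -S enp0s31f6
shows any errors, in particular odd values for counters such as tx_tcp_seg_failed compared with tx_tcp_seg_good.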
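To confirm that the TSO change actually took effect, the feature list printed by lowercase `ethtool -k` can be checked. The following is only a minimal sketch: `tso_state` is a hypothetical helper name, and it assumes the feature is reported on a line of the form `tcp-segmentation-offload: on` or `tcp-segmentation-offload: off`.

```shell
# Hypothetical helper: read `ethtool -k <iface>` output on stdin and
# print the state ("on" or "off") of TCP segmentation offload.
# Assumes a line like "tcp-segmentation-offload: off" is present.
tso_state() {
    awk -F': *' '
        $1 ~ /^[ \t]*tcp-segmentation-offload$/ {
            # Keep only the first word, dropping suffixes like "[fixed]".
            split($2, words, " ")
            print words[1]
            exit
        }
    '
}
```

It could be used as `ethtool -k enp0s31f6 | tso_state`. Settings made with `ethtool -K` do not survive a reboot, so the check is worth repeating after the next restart.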
If it is an interrupt problem (which would surprise me), you can always try disabling MSI or MSI-X by loading the driver with the IntMode= parameter. See the kernel documentation.
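As a rough sketch of how the IntMode= parameter could be made persistent, the helper below prints a modprobe.d options line. The function name and the file name in the usage note are hypothetical, and the mode numbers (0 = legacy interrupts, 1 = MSI, 2 = MSI-X) are my reading of the e1000e driver documentation, so it is worth confirming with `modinfo -p e1000e` that your driver build actually exposes the parameter.

```shell
# Hypothetical helper: print a modprobe.d options line for e1000e.
# Assumed mode numbers (check the driver docs for your kernel):
#   0 = legacy interrupts, 1 = MSI, 2 = MSI-X
e1000e_intmode_conf() {
    case "$1" in
        0|1|2)
            printf 'options e1000e IntMode=%s\n' "$1"
            ;;
        *)
            echo "usage: e1000e_intmode_conf 0|1|2" >&2
            return 1
            ;;
    esac
}
```

For example, `e1000e_intmode_conf 0 | sudo tee /etc/modprobe.d/e1000e-intmode.conf` would write the line to a file of your choosing. The option only applies the next time the module is loaded, and if the driver is loaded from the initramfs, that image probably needs regenerating as well.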
For what it is worth, here is the output on an I219 running e1000e. If any of the statistics below are non-zero for you but zero for me, I would dig into why that statistic is increasing.
$ ethtool -S enp0s31f6 | grep tx_
tx_packets: 133102433
tx_bytes: 178802443357
tx_broadcast: 163
tx_multicast: 5121
tx_errors: 0
tx_dropped: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 0
tx_restart_queue: 0
tx_tcp_seg_good: 20245901
tx_tcp_seg_failed: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
tx_smbus: 0
tx_dma_failed: 0
tx_hwtstamp_timeouts: 0
tx_hwtstamp_skipped: 0
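To make that comparison quicker, a small filter can list only the tx_ counters that are non-zero on your machine while skipping the plain traffic counters that are expected to grow. This is just a sketch: `nonzero_tx_counters` is a made-up name, and the exclusion list is simply the set of counters that are non-zero in my output above.

```shell
# Hypothetical helper: read `ethtool -S <iface>` output on stdin and
# print every tx_ counter with a non-zero value, except the ordinary
# traffic counters (packets, bytes, broadcast, multicast, tcp_seg_good).
nonzero_tx_counters() {
    awk -F': *' '
        { gsub(/^[ \t]+/, "", $1) }   # strip the leading indentation
        $1 ~ /^tx_/ &&
        $1 !~ /^tx_(packets|bytes|broadcast|multicast|tcp_seg_good)$/ &&
        $2 + 0 != 0 {
            print $1 ": " $2
        }
    '
}
```

Run it as `ethtool -S enp0s31f6 | nonzero_tx_counters`. No output means every remaining tx_ counter is zero, as in my case; a non-zero tx_timeout_count would be unsurprising given the watchdog messages in the question.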