radeon error: GPU lockup: ring 0 stalled for more than x msec

I have a computer with a fresh install of Debian Buster. The GPU is a Radeon FirePro W2100. After a few hours of use, the machine suddenly froze, the display switched to "white noise", and the machine became unusable.

The logs show many errors like the following:

kernel: radeon 0000:65:00.0: ring 0 stalled for more than 10240msec
kernel: radeon 0000:65:00.0: GPU lockup (current fence id 0x0000000000039bff last fence id 0x0000000000039c42 on ring 0)
kernel: radeon 0000:65:00.0: failed to get a new IB (-35)
kernel: [drm:ffffffff816219d0] *ERROR* Couldn't update BO_VA (-35)
kernel: radeon 0000:65:00.0: failed to get a new IB (-35)

And then:

kernel: radeon 0000:65:00.0: ring 0 stalled for more than 10032msec
kernel: radeon 0000:65:00.0: GPU lockup (current fence id 0x0000000000039bff last fence id 0x0000000000039c42 on ring 0)
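For anyone triaging a log like this, here is a rough Python sketch that pulls the stall duration and the two fence ids out of these messages. The regular expressions are written against the two message shapes quoted above only. The function name `summarize` and the `pending` field are illustrative names of mine, not anything from the driver. As far as I understand it, "current fence id" is the last fence the GPU signalled and "last fence id" is the last one submitted, so the difference is treated here as the amount of outstanding work; take that reading as an assumption.

```python
import re

# Patterns match only the two message shapes quoted in the log above.
STALL_RE = re.compile(r"radeon (\S+): ring (\d+) stalled for more than (\d+)msec")
LOCKUP_RE = re.compile(
    r"radeon (\S+): GPU lockup \(current fence id (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+) on ring (\d+)\)"
)


def summarize(lines):
    """Return (stalls, lockups) extracted from an iterable of log lines."""
    stalls, lockups = [], []
    for line in lines:
        m = STALL_RE.search(line)
        if m:
            stalls.append(
                {"device": m.group(1), "ring": int(m.group(2)), "msec": int(m.group(3))}
            )
            continue
        m = LOCKUP_RE.search(line)
        if m:
            current = int(m.group(2), 16)
            last = int(m.group(3), 16)
            lockups.append(
                {
                    "device": m.group(1),
                    "ring": int(m.group(4)),
                    "current": current,
                    "last": last,
                    # Assumption: submitted-but-unsignalled fences.
                    "pending": last - current,
                }
            )
    return stalls, lockups


# Two lines copied from the log in the question.
SAMPLE = [
    "kernel: radeon 0000:65:00.0: ring 0 stalled for more than 10240msec",
    "kernel: radeon 0000:65:00.0: GPU lockup (current fence id 0x0000000000039bff "
    "last fence id 0x0000000000039c42 on ring 0)",
]

if __name__ == "__main__":
    stalls, lockups = summarize(SAMPLE)
    for s in stalls:
        print(f"{s['device']} ring {s['ring']}: stalled > {s['msec']} msec")
    for l in lockups:
        print(f"{l['device']} ring {l['ring']}: {l['pending']} fences outstanding")
```

To run it on a real machine, the output of `journalctl -k` or `dmesg` can be fed in line by line instead of `SAMPLE`.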

What do these errors mean, and how can I resolve them?

Is this a hardware problem or a software problem?

Answer 1

I ran into "radeon 0000:04:00.0: ring 0 stalled for more than 10240msec" on my [AMD/ATI] RV620 GL [FirePro 2450] under Ubuntu 20.04.5 LTS, a few minutes after launching the Opera web browser. Firefox and other programs are fine; only Opera triggers the problem.

[128524.943553] radeon 0000:04:00.0: ring 0 stalled for more than 10240msec
[128524.943565] radeon 0000:04:00.0: GPU lockup (current fence id 0x000000000029caf6 last fence id 0x000000000029cafc on ring 0)
[128524.955392] radeon 0000:04:00.0: Saved 185 dwords of commands on ring 0.
[128524.955409] radeon 0000:04:00.0: GPU softreset: 0x00000009
[128524.955413] radeon 0000:04:00.0:   R_008010_GRBM_STATUS      = 0xA2303030
[128524.955417] radeon 0000:04:00.0:   R_008014_GRBM_STATUS2     = 0x00000003
[128524.955420] radeon 0000:04:00.0:   R_000E50_SRBM_STATUS      = 0x200010C0
[128524.955423] radeon 0000:04:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[128524.955426] radeon 0000:04:00.0:   R_008678_CP_STALLED_STAT2 = 0x00008002
[128524.955429] radeon 0000:04:00.0:   R_00867C_CP_BUSY_STAT     = 0x00008086
[128524.955432] radeon 0000:04:00.0:   R_008680_CP_STAT          = 0x80018645
[128524.955435] radeon 0000:04:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[128525.013038] radeon 0000:04:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEF
[128525.013097] radeon 0000:04:00.0: SRBM_SOFT_RESET=0x00000100
[128525.015187] radeon 0000:04:00.0:   R_008010_GRBM_STATUS      = 0xA0003030
[128525.015191] radeon 0000:04:00.0:   R_008014_GRBM_STATUS2     = 0x00000003
[128525.015195] radeon 0000:04:00.0:   R_000E50_SRBM_STATUS      = 0x200080C0
[128525.015198] radeon 0000:04:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[128525.015201] radeon 0000:04:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[128525.015204] radeon 0000:04:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[128525.015207] radeon 0000:04:00.0:   R_008680_CP_STAT          = 0x80100000
[128525.015210] radeon 0000:04:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[128525.015220] radeon 0000:04:00.0: GPU reset succeeded, trying to resume
[128525.031584] [drm] PCIE gen 2 link speeds already enabled
[128525.034184] [drm] PCIE GART of 512M enabled (table at 0x0000000000142000).
[128525.034222] radeon 0000:04:00.0: WB enabled
[128525.034224] radeon 0000:04:00.0: fence driver on ring 0 use gpu addr 0x0000000010000c00
[128525.034579] radeon 0000:04:00.0: fence driver on ring 5 use gpu addr 0x00000000000521d0
[128525.034797] debugfs: File 'radeon_ring_gfx' in directory '0' already present!
[128525.066237] [drm] ring test on 0 succeeded in 1 usecs
[128525.066242] debugfs: File 'radeon_ring_uvd' in directory '0' already present!
[128525.240884] [drm] ring test on 5 succeeded in 1 usecs
[128525.240893] [drm] UVD initialized successfully.
[128535.695467] radeon 0000:04:00.0: ring 0 stalled for more than 10456msec
[128535.695479] radeon 0000:04:00.0: GPU lockup (current fence id 0x000000000029caf8 last fence id 0x000000000029cafc on ring 0)
[128535.697433] [drm:r600_ib_test [radeon]] *ERROR* radeon: fence wait failed (-35).
[128535.697551] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on GFX ring (-35).
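The log above contains two register dumps, one before the soft reset and one after it. As a small aid for reading it, the following Python sketch splits the dumps and lists which registers changed. It assumes the dump format shown above ("R_<offset>_<NAME> = 0x..."), and the helper names `split_dumps` and `changed` are made up for illustration. It does not try to interpret the bits; that would need the hardware register documentation.

```python
import re

# Matches entries such as "R_008010_GRBM_STATUS      = 0xA2303030".
REG_RE = re.compile(r"(R_[0-9A-F]{6}_\w+)\s*=\s*(0x[0-9A-Fa-f]+)")


def split_dumps(lines):
    """Split register lines into (before, after) dicts.

    The first time a register name is seen it goes into `before`;
    any later occurrence overwrites the entry in `after`.
    """
    before, after = {}, {}
    for line in lines:
        m = REG_RE.search(line)
        if not m:
            continue
        name, value = m.group(1), int(m.group(2), 16)
        if name in before:
            after[name] = value
        else:
            before[name] = value
    return before, after


def changed(before, after):
    """Return {name: (old, new)} for registers whose value differs."""
    return {
        name: (before[name], after[name])
        for name in after
        if name in before and before[name] != after[name]
    }


# A subset of the lines from the log above, before and after the reset.
SAMPLE = [
    "[128524.955413] radeon 0000:04:00.0:   R_008010_GRBM_STATUS      = 0xA2303030",
    "[128524.955426] radeon 0000:04:00.0:   R_008678_CP_STALLED_STAT2 = 0x00008002",
    "[128524.955435] radeon 0000:04:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57",
    "[128525.015187] radeon 0000:04:00.0:   R_008010_GRBM_STATUS      = 0xA0003030",
    "[128525.015201] radeon 0000:04:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000",
    "[128525.015210] radeon 0000:04:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57",
]

if __name__ == "__main__":
    before, after = split_dumps(SAMPLE)
    for name, (old, new) in changed(before, after).items():
        print(f"{name}: 0x{old:08X} -> 0x{new:08X}")
```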

Answer 2

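This may actually be a hardware fault. I get this on my PC when playing games on an AMD ATI Radeon HD 8670 GPU under Arch Linux with kernel 6.3.1-zen1-1-zen. It is an HP EliteDesk, for reference. I tried dropping the kernel to the latest LTS and the LTS before that (5.10, iirc), but the crash still happens after a few minutes of gaming.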

I happen to have a Dell home server running the same OS and kernel (Arch with zen), and it has an AMD ATI Radeon HD 8570 GPU. It is essentially the same card, with slightly less onboard DDR5, iirc.

So I swapped the graphics cards (the 8570 is now in the HP motherboard and the 8670 is in the Dell), and I have no problems playing games on the 8570.

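So with all the same hardware/software/firmware/drivers, the 8570 works but the 8670 does not. All I did was swap the cards; I did not have to reinstall drivers or anything else. For reference, the games used to run fine on the 8670, so I think the card is gradually failing.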

I know hardware faults are rare, but if this is not one, I do not know what is. Sorry to possibly be the bearer of bad news. I do not use my home server for gaming, so the swap works for me.

This is one of the dmesg logs from the 8670 crashing in my HP:

...
[32776.529276] radeon 0000:0b:00.0: ring 0 stalled for more than 28224msec
[32776.529282] radeon 0000:0b:00.0: GPU lockup (current fence id 0x0000000000108667 last fence id 0x00000000001086ba on ring 0)
[32776.673264] radeon 0000:0b:00.0: ring 3 stalled for more than 28228msec
[32776.673268] radeon 0000:0b:00.0: GPU lockup (current fence id 0x00000000000380db last fence id 0x0000000000038154 on ring 3)
[32777.033251] radeon 0000:0b:00.0: ring 0 stalled for more than 28728msec
[32777.033259] radeon 0000:0b:00.0: GPU lockup (current fence id 0x0000000000108667 last fence id 0x00000000001086bb on ring 0)
[32777.177236] radeon 0000:0b:00.0: ring 3 stalled for more than 28732msec
[32777.177240] radeon 0000:0b:00.0: GPU lockup (current fence id 0x00000000000380db last fence id 0x0000000000038156 on ring 3)
[32777.537217] radeon 0000:0b:00.0: ring 0 stalled for more than 29232msec
[32777.537221] radeon 0000:0b:00.0: GPU lockup (current fence id 0x0000000000108667 last fence id 0x00000000001086bc on ring 0)
[32777.681206] radeon 0000:0b:00.0: ring 3 stalled for more than 29236msec
[32777.681209] radeon 0000:0b:00.0: GPU lockup (current fence id 0x00000000000380db last fence id 0x0000000000038159 on ring 3)
[32778.041191] radeon 0000:0b:00.0: ring 0 stalled for more than 29736msec
[32778.041194] radeon 0000:0b:00.0: GPU lockup (current fence id 0x0000000000108667 last fence id 0x00000000001086bd on ring 0)
[32778.185183] radeon 0000:0b:00.0: ring 3 stalled for more than 29740msec
[32778.185186] radeon 0000:0b:00.0: GPU lockup (current fence id 0x00000000000380db last fence id 0x000000000003815a on ring 3)
[32779.776047] BUG: unable to handle page fault for address: ffffbdd0c13e9ffc
[32779.776052] #PF: supervisor read access in kernel mode
[32779.776054] #PF: error_code(0x0000) - not-present page
[32779.776055] PGD 100000067 P4D 100000067 PUD 0 
[32779.776058] Oops: 0000 [#1] PREEMPT SMP NOPTI
[32779.776061] CPU: 8 PID: 157222 Comm: openmw Tainted: G S                 6.1.12-zen1-1-zen #1 f86a89fe584efe7bcf920c69db3728bed4671799
[32779.776064] Hardware name: HP HP EliteDesk 705 G5 SFF/8618, BIOS R09 Ver. 02.02.02 11/15/2019
[32779.776065] RIP: 0010:radeon_ring_backup+0xc2/0x160 [radeon]
[32779.776196] Code: 49 c1 e6 02 4c 89 f7 e8 9c cc ab f5 49 89 45 00 48 89 c2 48 85 c0 74 5f 48 8b 4b 10 41 8d 47 01 45 89 ff 23 43 5c 4a 8d 34 b9 <8b> 36 89 32 41 83 fc 01 74 29 ba 04 00 00 00 eb 04 48 8b 4b 10 8d
[32779.776197] RSP: 0018:ffffbdcccfc5bbd8 EFLAGS: 00010246
[32779.776199] RAX: 0000000000000000 RBX: ffff9460e434d620 RCX: ffffbdccc13ea000
[32779.776201] RDX: ffff9465dbd00000 RSI: ffffbdd0c13e9ffc RDI: 00000000000392d7
[32779.776202] RBP: ffff9460e434d600 R08: 00000000000392d0 R09: 0000000000000006
[32779.776203] R10: fffff6a4d96f4000 R11: 000000000000577f R12: 000000000003dd71
[32779.776204] R13: ffffbdcccfc5bc50 R14: 00000000000f75c4 R15: 00000000ffffffff
[32779.776205] FS:  00007fbd98eb96c0(0000) GS:ffff94677ec00000(0000) knlGS:0000000000000000
[32779.776207] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[32779.776208] CR2: ffffbdd0c13e9ffc CR3: 0000000490706000 CR4: 0000000000350ee0
[32779.776210] Call Trace:
[32779.776212]  <TASK>
[32779.776213]  radeon_gpu_reset+0xf7/0x2f0 [radeon de372908aa1ea62ea129bf192d817412c67e128b]
[32779.776243]  radeon_gem_wait_idle_ioctl+0xb8/0x100 [radeon de372908aa1ea62ea129bf192d817412c67e128b]
[32779.776273]  ? radeon_gem_busy_ioctl+0xb0/0xb0 [radeon de372908aa1ea62ea129bf192d817412c67e128b]
[32779.776302]  drm_ioctl_kernel+0xcd/0x170
[32779.776306]  drm_ioctl+0x1eb/0x450
[32779.776308]  ? radeon_gem_busy_ioctl+0xb0/0xb0 [radeon de372908aa1ea62ea129bf192d817412c67e128b]
[32779.776337]  radeon_drm_ioctl+0x4d/0x80 [radeon de372908aa1ea62ea129bf192d817412c67e128b]
[32779.776364]  __x64_sys_ioctl+0x94/0xd0
[32779.776369]  do_syscall_64+0x5f/0x90
[32779.776373]  ? do_syscall_64+0x6b/0x90
[32779.776375]  ? syscall_exit_to_user_mode+0x2c/0x1d0
[32779.776378]  ? syscall_exit_to_user_mode+0x2c/0x1d0
[32779.776380]  ? do_syscall_64+0x6b/0x90
[32779.776382]  ? syscall_exit_to_user_mode+0x2c/0x1d0
[32779.776384]  ? do_syscall_64+0x6b/0x90
[32779.776385]  ? do_syscall_64+0x6b/0x90
[32779.776387]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[32779.776390] RIP: 0033:0x7fbdb591553f
[32779.776418] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[32779.776420] RSP: 002b:00007fbd98eb80f0 EFLAGS: 00200246 ORIG_RAX: 0000000000000010
[32779.776422] RAX: ffffffffffffffda RBX: 00007fbd7d74eb80 RCX: 00007fbdb591553f
[32779.776423] RDX: 00007fbd98eb8190 RSI: 0000000040086464 RDI: 0000000000000010
[32779.776425] RBP: 00007fbd98eb8190 R08: 0000000000000000 R09: ffffffffffffffff
[32779.776426] R10: 0000000000000000 R11: 0000000000200246 R12: 0000000040086464
[32779.776427] R13: 0000000000000010 R14: 000055d27885abd0 R15: 000055d278a375d8
[32779.776429]  </TASK>
[32779.776430] Modules linked in: rfcomm xt_nat veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink br_netfilter bridge stp llc rpcsec_gss_krb5 rpcrdma rdma_cm iw_cm nfsv4 ib_cm dns_resolver ib_core nfs fscache wireguard netfs curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel overlay cmac algif_hash algif_skcipher af_alg bnep isofs cdrom amdgpu gpu_sched drm_buddy squashfs vfat fat iwlmvm mac80211 snd_hda_codec_conexant snd_hda_codec_generic libarc4 ledtrig_audio snd_hda_codec_hdmi intel_rapl_msr radeon snd_hda_intel intel_rapl_common btusb edac_mce_amd btrtl snd_intel_dspcfg btbcm snd_intel_sdw_acpi drm_ttm_helper kvm_amd snd_hda_codec btintel iwlwifi snd_hda_core hp_wmi btmtk ttm snd_hwdep sparse_keymap kvm platform_profile wmi_bmof sp5100_tco bluetooth snd_pcm irqbypass r8169 ucsi_acpi drm_display_helper video cfg80211 psmouse rapl typec_ucsi pcspkr snd_timer realtek k10temp i2c_piix4 ecdh_generic cec
[32779.776479]  ipmi_devintf typec snd mdio_devres soundcore ipmi_msghandler ip6t_REJECT rfkill libphy roles nf_reject_ipv6 joydev wmi mousedev gpio_amdpt xt_hl gpio_generic acpi_cpufreq ip6_tables ip6t_rt mac_hid ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_multiport nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables libcrc32c nfnetlink nfsd auth_rpcgss nfs_acl lockd grace sg crypto_user sunrpc loop fuse ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 dm_crypt cbc encrypted_keys trusted asn1_encoder tee usbhid uas usb_storage dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel serio_raw polyval_clmulni atkbd polyval_generic gf128mul libps2 ghash_clmulni_intel vivaldi_fmap sha512_ssse3 nvme aesni_intel crypto_simd nvme_core ccp cryptd xhci_pci i8042 xhci_pci_renesas nvme_common serio
[32779.776522] CR2: ffffbdd0c13e9ffc
[32779.776523] ---[ end trace 0000000000000000 ]---
[32779.776524] RIP: 0010:radeon_ring_backup+0xc2/0x160 [radeon]
[32779.776554] Code: 49 c1 e6 02 4c 89 f7 e8 9c cc ab f5 49 89 45 00 48 89 c2 48 85 c0 74 5f 48 8b 4b 10 41 8d 47 01 45 89 ff 23 43 5c 4a 8d 34 b9 <8b> 36 89 32 41 83 fc 01 74 29 ba 04 00 00 00 eb 04 48 8b 4b 10 8d
[32779.776555] RSP: 0018:ffffbdcccfc5bbd8 EFLAGS: 00010246
[32779.776557] RAX: 0000000000000000 RBX: ffff9460e434d620 RCX: ffffbdccc13ea000
[32779.776558] RDX: ffff9465dbd00000 RSI: ffffbdd0c13e9ffc RDI: 00000000000392d7
[32779.776559] RBP: ffff9460e434d600 R08: 00000000000392d0 R09: 0000000000000006
[32779.776560] R10: fffff6a4d96f4000 R11: 000000000000577f R12: 000000000003dd71
[32779.776561] R13: ffffbdcccfc5bc50 R14: 00000000000f75c4 R15: 00000000ffffffff
[32779.776562] FS:  00007fbd98eb96c0(0000) GS:ffff94677ec00000(0000) knlGS:0000000000000000
[32779.776563] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[32779.776565] CR2: ffffbdd0c13e9ffc CR3: 0000000490706000 CR4: 0000000000350ee0
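One thing visible in the log above is that, for each ring, the "current fence id" never moves between reports while the "last fence id" keeps growing. A short Python sketch to check this mechanically on a longer log could look like the following. The name `fence_progress` and the report keys are my own, and the reading of "current" as the last signalled fence and "last" as the last submitted fence is my understanding of the message rather than something stated in the log.

```python
import re
from collections import defaultdict

LOCKUP_RE = re.compile(
    r"GPU lockup \(current fence id (0x[0-9a-fA-F]+) "
    r"last fence id (0x[0-9a-fA-F]+) on ring (\d+)\)"
)


def fence_progress(lines):
    """Per ring, report how far the two fence ids moved across all reports."""
    per_ring = defaultdict(list)
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            ring = int(m.group(3))
            per_ring[ring].append((int(m.group(1), 16), int(m.group(2), 16)))

    report = {}
    for ring, samples in per_ring.items():
        currents = [current for current, _ in samples]
        lasts = [last for _, last in samples]
        report[ring] = {
            "reports": len(samples),
            # 0 here means the ring made no progress between reports.
            "current_advanced": currents[-1] - currents[0],
            "last_advanced": lasts[-1] - lasts[0],
        }
    return report


# First and last lockup lines per ring, copied from the log above.
SAMPLE = [
    "[32776.529282] radeon 0000:0b:00.0: GPU lockup (current fence id 0x0000000000108667 last fence id 0x00000000001086ba on ring 0)",
    "[32776.673268] radeon 0000:0b:00.0: GPU lockup (current fence id 0x00000000000380db last fence id 0x0000000000038154 on ring 3)",
    "[32778.041194] radeon 0000:0b:00.0: GPU lockup (current fence id 0x0000000000108667 last fence id 0x00000000001086bd on ring 0)",
    "[32778.185186] radeon 0000:0b:00.0: GPU lockup (current fence id 0x00000000000380db last fence id 0x000000000003815a on ring 3)",
]

if __name__ == "__main__":
    for ring, info in sorted(fence_progress(SAMPLE).items()):
        print(f"ring {ring}: {info}")
```

A ring whose `current_advanced` stays at 0 across many reports is stuck rather than merely slow, which matches what the log shows before the page fault in `radeon_ring_backup`.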
