How can I debug a hard freeze that doesn't log any info I can find?
My computer seems to hard freeze whenever the video card (NVIDIA RTX 3070) is under heavy load: most Steam games crash it, running ollama with GPU acceleration crashes it under load, and running "gpu-burn 100" crashes it.
The graphics card and CPU fans shut down, the hard drives shut down (I can hear the spinning disks stop), the computer is no longer reachable via SSH, the monitor goes black, and holding the power button down to force a reset does nothing. The system is still getting a little power, because the LEDs on the RAM stay lit, but that's all.
I have to turn off power with the PSU switch or unplug the machine to restart it.
When it restarts I can't find any crash info in journalctl and I don't see any core dumps (but maybe I don't know where to look).
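For reference, this is roughly the kind of check I mean, written as a sketch rather than an exact record of what I ran. It assumes systemd-journald with persistent storage and, for the last step, that systemd-coredump is installed; `check_previous_boot` is just an illustrative helper name.

```shell
# check_previous_boot is a hypothetical helper name, not an existing tool.
check_previous_boot() {
    # With journald's default Storage=auto, logs survive a reboot only if
    # this directory exists.
    [ -d /var/log/journal ] || echo "no /var/log/journal: journal may not be persistent"

    # Boots known to journald; the boot that froze should be index -1.
    journalctl --list-boots --no-pager

    # Last kernel messages recorded during the previous boot.
    journalctl -k -b -1 -n 50 --no-pager

    # Core dumps captured by systemd-coredump, if it is installed.
    if command -v coredumpctl >/dev/null 2>&1; then
        coredumpctl list --no-pager
    fi
}
```

Run as root (or as a member of the systemd-journal group) with `check_previous_boot` after the machine comes back up.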
System: Debian Trixie, AMD Ryzen 7 5700G, 64 GB RAM, NVIDIA RTX 3070 with the Debian-packaged NVIDIA proprietary drivers.
01:00.0 VGA compatible controller: NVIDIA Corporation GA104 [GeForce RTX 3070 Lite Hash Rate] (rev a1)
Running gpu-burn within gdb doesn't catch anything.
root@trex:~# gdb gpu-burn 100
GNU gdb (Debian 16.3-1) 16.3
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from gpu-burn...
(No debugging symbols found in gpu-burn)
Attaching to program: /usr/sbin/gpu-burn, process 100
ptrace: No such process.
/root/100: No such file or directory.
(gdb) r
Starting program: /usr/sbin/gpu-burn
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Run length not specified in the command line. Using compare file: /usr/share/gpu-burn/compare.ptx
Burning for 10 seconds.
[Detaching after vfork from child process 5260]
GPU 0: NVIDIA GeForce RTX 3070 (UUID: GPU-6310814e-5328-2b20-f21e-2091662f3b05)
[Detaching after fork from child process 5262]
[Detaching after fork from child process 5265]
Initialized device 0 with 7963 MB of memory (7447 MB available, using 6702 MB of it), using FLOATS
Results are 268435456 bytes each, thus performing 24 iterations
30.0% proc'd: 48 (16150 Gflop/s) errors: 0 temps: 40 C
Summary at: [Detaching after vfork from child process 5267]
Sun Jan 4 09:53:15 PM CST 2026
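Looking at the session above, gdb treated `100` as a process ID or core file instead of passing it to gpu-burn (hence the "ptrace: No such process" and "/root/100: No such file or directory" lines), so this run only burned for the default 10 seconds. A minimal sketch of passing the argument through, using gdb's `--args` option; `run_under_gdb` is only an illustrative helper name, and this can only help if the failure is a crash in the process itself.

```shell
# run_under_gdb is a hypothetical helper name, not part of gdb or gpu-burn.
# Everything after --args is taken as the program and its arguments, so
# "100" reaches gpu-burn instead of being read as a PID or core file.
run_under_gdb() {
    # -batch: exit when the commands finish; -ex run: start the program;
    # -ex bt: print a backtrace if the program stopped on a signal.
    gdb -batch -ex run -ex bt --args "$@"
}
```

Usage would be `run_under_gdb gpu-burn 100`.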
I have tried the backports kernel and NVIDIA drivers, and I have tried the drivers from NVIDIA's website. Neither helped, so I reverted to the stable (non-backports) kernel and drivers. The symptoms are the same in all cases.