Hey everyone,

I’m running into a frustrating issue and could use some guidance on how to pinpoint the faulty component.

My system completely locks up every few hours. It’s not just a DE crash; the entire machine becomes unresponsive. The mouse and keyboard are completely dead (no cursor movement, Caps Lock key doesn’t toggle). I’ve tried waiting 10-15 minutes to see if it recovers, but it never does.

REISUB does not work. Holding Alt + SysRq and pressing the keys in order does nothing. The only way out is a hard reset using the case button.
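
One thing worth ruling out before blaming hardware: many distros restrict SysRq by default, so some or all REISUB keys may be ignored even on a healthy kernel. The sketch below decodes the `kernel.sysrq` value using the bit meanings from the kernel's SysRq documentation; the function names `sysrq_allows` and `report_reisub` are made up for this example.

```shell
# Decode kernel.sysrq to see which REISUB keys the kernel will honour.
# Bit values: 4 = keyboard control (R), 64 = signal processes (E, I),
# 16 = sync (S), 32 = remount read-only (U), 128 = reboot (B).
sysrq_allows() {
    # $1 = value of kernel.sysrq, $2 = bit for the function group
    value=$1; bit=$2
    [ "$value" -eq 1 ] && return 0        # 1 enables every function
    [ $(( value & bit )) -ne 0 ]
}

report_reisub() {
    value=$1
    for entry in "R:4" "E:64" "I:64" "S:16" "U:32" "B:128"; do
        key=${entry%%:*}; bit=${entry##*:}
        if sysrq_allows "$value" "$bit"; then
            echo "$key allowed"
        else
            echo "$key blocked"
        fi
    done
}

# Report for the running system, if the sysctl is readable.
if [ -r /proc/sys/kernel/sysrq ]; then
    report_reisub "$(cat /proc/sys/kernel/sysrq)"
fi
```

If keys show as blocked, `sudo sysctl kernel.sysrq=1` enables all of them until the next reboot. If everything is already allowed and REISUB still does nothing, a hard lockup or hardware fault becomes more likely.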

The last time this happened, I ended up buying components for a new computer and replaced them one by one until I found the faulty one. I’d rather try a more targeted approach this time. Though if it takes too much effort, I do have another computer I can fall back on.

Any advice on how to diagnose this efficiently? Logs to check, stress tests to run, or hardware to suspect first?

Thanks in advance!

  • brownmustardminion@lemmy.ml
    2 months ago

    I’ve had these issues during high-intensity GPU usage on an NVIDIA GPU. Those are the only times REISUB didn’t work and I had to do a hard reset.

    Not much I can contribute other than don’t rule out a nvidia driver problem.

  • shoveler@piefed.social
    2 months ago

    This does sound like a hardware problem:

    1. Find the motherboard brand/model # and memory brand/model.
    2. Check the motherboard manufacturer’s support page for memory that is listed as compatible with that specific motherboard. Also check their forum for similar problems and solutions.
    3. Unplug all non-critical peripherals (might be a driver issue).
    4. Swap the two (or more) memory sticks and see if that changes the freezing somehow.
    5. Check the motherboard manufacturer’s site for an updated BIOS, especially if the new BIOS addresses memory concerns.
    6. If the BIOS doesn’t have a memory test, try MemTest86+.
    7. If it’s not the hardware, the BIOS, or the BIOS settings, boot a live USB stick and see if the problem still exists. It might be a corrupt install somewhere: back up your data and install a different distro, on a different drive if available, or put the boot drive (or a backup of it) in a different machine.
    8. Dig into the logs, as mentioned elsewhere.
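
For step 1, the board and BIOS identifiers can usually be read from sysfs without opening the case. This is a minimal sketch that assumes the standard DMI directory `/sys/class/dmi/id`; the function name `board_info` is hypothetical, and the memory part requires root plus the `dmidecode` tool.

```shell
# Print board and BIOS identifiers from the DMI sysfs directory.
# The directory can be overridden, which is handy for testing.
board_info() {
    dmi_dir=${1:-/sys/class/dmi/id}
    for field in board_vendor board_name bios_version bios_date; do
        if [ -r "$dmi_dir/$field" ]; then
            printf '%s: %s\n' "$field" "$(cat "$dmi_dir/$field")"
        fi
    done
}

board_info

# Memory module details need root and dmidecode; skipped otherwise.
if command -v dmidecode >/dev/null 2>&1 && [ "$(id -u)" -eq 0 ]; then
    dmidecode -t memory | grep -E 'Manufacturer|Part Number|Size|Speed' || true
fi
```

The `bios_version` and `bios_date` lines are the ones to compare against the manufacturer's download page in step 5.
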
  • lattrommi@lemmy.ml
    2 months ago

    Others mentioned that it sounds hardware related. That reminded me that you should visually inspect all ports, especially unused ones. I had a system locking up constantly once, and it turned out that debris in a USB port on one of the monitors was causing a short. It’s something that won’t necessarily cause logged errors, and it can cause seemingly random behaviour that will have you chasing problems that don’t exist. Debris in USB ports and DVI plugs especially can be hard to notice because of how they are constructed. Audio jack holes too. Make sure there are no breaks in any cords. It seems too simple to be true, and then it happens and you feel foolish for not having thought of it.

  • Kory@lemmy.ml
    2 months ago

    First I always check with sudo journalctl -r (newest entries first). Check journalctl --help for more options, or narrow the time window with sudo journalctl --since "2015-06-26 23:15:00" --until "2015-06-26 23:20:00". Then search for the errors online or come back with more questions.
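
After a hard reset, journalctl can only show the boot that froze if the journal is stored on disk. A quick check, assuming journald's default `Storage=auto` setting (under which logs persist only when `/var/log/journal` exists); the function name `journal_is_persistent` is made up for this sketch.

```shell
# With Storage=auto, journald keeps logs across reboots only if this
# directory exists. The path argument is overridable for testing.
journal_is_persistent() {
    journal_dir=${1:-/var/log/journal}
    [ -d "$journal_dir" ]
}

if journal_is_persistent; then
    echo "journal is persistent; recent boots:"
    if command -v journalctl >/dev/null 2>&1; then
        journalctl --list-boots --no-pager 2>/dev/null | tail -n 5
    fi
else
    echo "no /var/log/journal: logs from the frozen boot are lost on reset"
    echo "to enable: sudo mkdir -p /var/log/journal && sudo systemctl restart systemd-journald"
fi
```

Once earlier boots are listed, `journalctl -k -b -1 -p warning` shows kernel messages of warning level and above from the previous boot.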

  • isgleas@lemmy.ml
    2 months ago

    I suggest you install/enable sysstat if you have not done that already. With those metrics you will have a good starting point for which resource may be the culprit next time it happens, and it will help you pinpoint whether there is a hardware-related issue.

    Do you have kdump enabled? If so, you can try to force a core dump when the system freezes, so you can later analyze what the issue is. This path is harder to follow, as you will need to analyze the dump, but it will help you identify issues not only on the hardware side but on the software side as well.

    Both tools are our bread and butter for RCAs/postmortems
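
Since the keyboard is dead during the freeze, forcing a dump by hand may not be possible. One option, offered here as an assumption rather than something the commenter described, is to let the kernel's lockup detectors panic on their own so that kdump fires. The sketch only reads the current state; `kdump_armed` is a hypothetical helper name.

```shell
# Report whether a crash kernel is loaded (that is, whether kdump is armed).
kdump_armed() {
    state_file=${1:-/sys/kernel/kexec_crash_loaded}
    [ -r "$state_file" ] && [ "$(cat "$state_file")" = "1" ]
}

if kdump_armed; then
    echo "kdump: crash kernel loaded"
else
    echo "kdump: not armed"
fi

# Lockup-related sysctls. For the *_panic knobs, 1 means the kernel
# panics on a detected lockup, which is what triggers a kdump capture.
for knob in nmi_watchdog hardlockup_panic softlockup_panic; do
    f=/proc/sys/kernel/$knob
    if [ -r "$f" ]; then
        echo "kernel.$knob = $(cat "$f")"
    fi
done
```

Setting `kernel.hardlockup_panic=1` and `kernel.softlockup_panic=1` with `sysctl` is the usual way to turn a silent lockup into a panic. For sysstat, `sar -u`, `sar -r` and `sar -q` report CPU, memory and load history; the location of the daily data files varies by distro.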

  • kyub@discuss.tchncs.de
    2 months ago

    There was, or maybe still is, a specific bug with earlier Ryzen CPU generations that causes system freezes when the CPU enters the C6 power-saving state. If that applies to your machine, try the kernel parameter processor.max_cstate=5 for a while. I had one PC build where this 100% resolved the system freezing after several minutes. Note though that this is ~6-year-old info, and the issue might have been resolved in the meantime. But it’s probably worth a try if you have an older Ryzen CPU.
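
To confirm whether the C-state limit is actually in effect after a reboot, something like the following can help. It assumes the parameter in question is the ACPI processor driver's `processor.max_cstate`; the helper `has_kernel_param` is made up for this sketch.

```shell
# Return success if the given parameter appears on a kernel command line.
has_kernel_param() {
    # $1 = kernel command line, $2 = parameter name (without "=value")
    for word in $1; do
        case $word in
            "$2"|"$2"=*) return 0 ;;
        esac
    done
    return 1
}

if [ -r /proc/cmdline ]; then
    if has_kernel_param "$(cat /proc/cmdline)" processor.max_cstate; then
        echo "processor.max_cstate is set on the kernel command line"
    else
        echo "processor.max_cstate is not set"
    fi
fi

# The value the driver is actually using, when the module exposes it.
if [ -r /sys/module/processor/parameters/max_cstate ]; then
    echo "effective max_cstate: $(cat /sys/module/processor/parameters/max_cstate)"
fi
```

On GRUB-based distros the parameter usually goes into `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub`, followed by regenerating the GRUB config; other bootloaders differ.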

  • RedstoneValley@sh.itjust.works
    2 months ago

    I had this happen when a game at random times filled up the available memory so quickly that it froze completely before any OOM watchdog could catch it.

    • Minnels@lemmy.zip
      2 months ago

      Same happened to me. I just made the swap file really big and haven’t had the problem since. The default was 4 GB; 16 GB was better, and now at 32 GB it is all good. I have 32 GB of RAM.
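
To see how much RAM and swap the machine currently has before resizing anything, `/proc/meminfo` can be read directly. A small sketch; the helper name `meminfo_kib` is hypothetical.

```shell
# Print one field from a meminfo-style file, in KiB.
meminfo_kib() {
    # $1 = field name (for example MemTotal), $2 = file (default /proc/meminfo)
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

if [ -r /proc/meminfo ]; then
    ram=$(meminfo_kib MemTotal)
    swap=$(meminfo_kib SwapTotal)
    echo "RAM:  $(( ${ram:-0} / 1024 )) MiB"
    echo "Swap: $(( ${swap:-0} / 1024 )) MiB"
fi
```

If memory exhaustion is the suspect, `journalctl -k -b -1 | grep -i 'out of memory'` may show whether the OOM killer fired during the previous boot, although a machine that freezes fast enough may never get to log it.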

  • devtoolkit_api@discuss.tchncs.de (banned from community)
    2 months ago

    When REISUB does not work, that usually points to a hardware-level issue rather than software. Here is my debugging checklist for hard freezes:

    Step 1: Rule out RAM

    • Boot a live USB and run memtest86+ overnight. Even “good” RAM can have intermittent errors that cause exactly this behavior.

    Step 2: Check thermals

    • Install lm-sensors and run sensors before/during heavy loads
    • Also check GPU temps if you have a dedicated GPU: nvidia-smi, or for AMD: cat /sys/class/drm/card0/device/hwmon/hwmon*/temp1_input (the value is in millidegrees Celsius)
    • A CPU that overheats faster than throttling can compensate will freeze instantly
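
Because the freeze happens hours in, a one-off look at sensors may miss it. A rough sketch of a logger that writes readings to disk so the last values before a freeze survive the hard reset. It assumes the standard hwmon sysfs files, which report millidegrees Celsius; `millideg_to_c` and `log_temps` are names invented for this example.

```shell
# Convert a sysfs reading in millidegrees Celsius to whole degrees.
millideg_to_c() {
    echo $(( $1 / 1000 ))
}

# Append one timestamped line per readable sensor file to a log.
log_temps() {
    # $1 = output file; remaining arguments = sysfs temperature files
    out=$1; shift
    for f in "$@"; do
        [ -r "$f" ] || continue
        printf '%s %s %sC\n' "$(date +%FT%T)" "$f" "$(millideg_to_c "$(cat "$f")")" >> "$out"
    done
    sync   # flush to disk so the last reading survives a hard reset
}

# Example loop, left commented out so pasting the block has no side effects:
# while true; do log_temps "$HOME/temps.log" /sys/class/hwmon/hwmon*/temp*_input; sleep 10; done
```

After the next freeze, the tail of the log shows whether temperatures were climbing right before the machine stopped.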

    Step 3: GPU driver

    • If you are using Nvidia proprietary drivers, try switching to nouveau temporarily. Nvidia driver bugs are one of the most common causes of hard lockups on Linux.
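    • Check dmesg | grep -i nvidia or dmesg | grep -i gpu after reboot. Note that dmesg only covers the current boot; for kernel messages from the boot that froze, use journalctl -k -b -1 (this needs a persistent journal).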
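
A slightly more targeted filter than grepping for "nvidia" or "gpu" is to look for the strings GPU drivers tend to log when they fault. The patterns below are assumptions based on commonly seen messages (NVIDIA "NVRM: Xid" reports, "fallen off the bus", amdgpu timeouts), not an exhaustive list, and `filter_gpu_errors` is a hypothetical helper.

```shell
# Keep only log lines that look like GPU driver faults (reads stdin).
filter_gpu_errors() {
    grep -E 'NVRM: Xid|fallen off the bus|amdgpu.*timeout' || true
}

# Scan the kernel log of the previous boot, if journalctl is available.
if command -v journalctl >/dev/null 2>&1; then
    journalctl -k -b -1 --no-pager 2>/dev/null | filter_gpu_errors
fi
```

No output does not clear the GPU: a hard lockup can stop the machine before anything reaches the disk.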

    Step 4: Kernel logs from previous boot

    • journalctl -b -1 -p err: shows errors from the previous boot, i.e. the one that ended in the freeze
    • journalctl -b -1 | tail -100: last 100 lines before the crash, often reveals the culprit

    Step 5: SSH test

    • Set up SSH from another device. Next time it freezes, try to SSH in. If SSH works but display is dead = GPU/display issue. If SSH also fails = kernel panic or hardware.

    The SSH test is the single most diagnostic thing you can do: it tells you immediately whether the kernel is still alive.
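
The SSH test can be scripted from the second machine so it is ready the moment the freeze happens. This sketch adds a ping check as an extra signal, which is an assumption on top of the checklist: a host that answers ping but refuses SSH suggests the kernel's network stack is alive while userspace is stuck. `classify_freeze` and `probe` are invented names, and `ping -W` is assumed to take seconds, as in Linux iputils.

```shell
# Turn the two probe results into a rough diagnosis.
classify_freeze() {
    # $1 = "yes"/"no": host answered ping; $2 = "yes"/"no": SSH login worked
    case "$1/$2" in
        yes/yes) echo "kernel and userspace alive: suspect GPU or display stack" ;;
        yes/no)  echo "kernel still answers ping but userspace is stuck" ;;
        no/*)    echo "no network response: kernel panic, hard lockup or hardware" ;;
        *)       echo "unknown"; return 1 ;;
    esac
}

# Probe a host by name; SSH uses key-based login so it never waits for a password.
probe() {
    host=$1
    ping_ok=no; ssh_ok=no
    ping -c 1 -W 2 "$host" >/dev/null 2>&1 && ping_ok=yes
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true >/dev/null 2>&1 && ssh_ok=yes
    classify_freeze "$ping_ok" "$ssh_ok"
}

# Usage from a second machine (the hostname is a placeholder):
# probe frozen-box
```

Key-based SSH login has to be set up and tested beforehand, otherwise `BatchMode=yes` makes the SSH step fail even on a healthy machine.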