Unraid freezes, CPU stall, kernel bug

Wirman · 21.10.2022

Hi everyone,

I already asked this question in the Unraid forum, but nobody there could really help me. Maybe one of you has an idea or has seen a similar error before.
Here is the link to the thread:

[6.10.3] Unraid freeze, only way unclean shutdown

Hi, i have a problem, every once in a while (about 3-4 times a month), my Unraid System freezes. Freeze means: I am not able to remotely connect to the WebGUI, and the VMs and Dockers are also not reachable. If i put on the monitor connected to the server, it does not recognize the source. If i r...

forums.unraid.net

Short summary: my Unraid server always hangs after several days of uptime. Once that happens, I can no longer access the system, although it is still running. When I look at the logs, I see the following error repeating in an endless loop:

Code:

Oct 10 21:43:39 jarvis kernel: rcu: INFO: rcu_sched self-detected stall on CPU
Oct 10 21:43:39 jarvis kernel: rcu:     12-....: (60000 ticks this GP) idle=03f/1/0x4000000000000000 softirq=10533211/10533211 fqs=14583
Oct 10 21:43:39 jarvis kernel:     (t=60000 jiffies g=36180177 q=58701)
Oct 10 21:43:39 jarvis kernel: NMI backtrace for cpu 12
Oct 10 21:43:39 jarvis kernel: CPU: 12 PID: 16137 Comm: kworker/u40:6 Tainted: G        W  O      5.15.46-Unraid #1
Oct 10 21:43:39 jarvis kernel: Hardware name: Gigabyte Technology Co., Ltd. W480M VISION W/W480M VISION W, BIOS F21 11/23/2021
Oct 10 21:43:39 jarvis kernel: Workqueue: events_power_efficient gc_worker [nf_conntrack]
Oct 10 21:43:39 jarvis kernel: Call Trace:
Oct 10 21:43:39 jarvis kernel: <IRQ>
Oct 10 21:43:39 jarvis kernel: dump_stack_lvl+0x46/0x5a
Oct 10 21:43:39 jarvis kernel: nmi_cpu_backtrace+0xae/0xd2
Oct 10 21:43:39 jarvis kernel: ? lapic_can_unplug_cpu+0x93/0x93
Oct 10 21:43:39 jarvis kernel: nmi_trigger_cpumask_backtrace+0x56/0xd3
Oct 10 21:43:39 jarvis kernel: rcu_dump_cpu_stacks+0xc3/0xea
Oct 10 21:43:39 jarvis kernel: rcu_sched_clock_irq+0x22d/0x631
Oct 10 21:43:39 jarvis kernel: ? trigger_load_balance+0x7a/0x292
Oct 10 21:43:39 jarvis kernel: ? tick_sched_do_timer+0x3e/0x3e
Oct 10 21:43:39 jarvis kernel: update_process_times+0x8c/0xab
Oct 10 21:43:39 jarvis kernel: tick_sched_timer+0x38/0x65
Oct 10 21:43:39 jarvis kernel: __hrtimer_run_queues+0xf8/0x18a
Oct 10 21:43:39 jarvis kernel: hrtimer_interrupt+0x92/0x160
Oct 10 21:43:39 jarvis kernel: __sysvec_apic_timer_interrupt+0x96/0xdb
Oct 10 21:43:39 jarvis kernel: sysvec_apic_timer_interrupt+0x61/0x7d
Oct 10 21:43:39 jarvis kernel: </IRQ>
Oct 10 21:43:39 jarvis kernel: <TASK>
Oct 10 21:43:39 jarvis kernel: asm_sysvec_apic_timer_interrupt+0x12/0x20
Oct 10 21:43:39 jarvis kernel: RIP: 0010:gc_worker+0x140/0x30b [nf_conntrack]
Oct 10 21:43:39 jarvis kernel: Code: 00 0f 9e c1 e9 83 01 00 00 41 8b 45 08 48 8b 15 b1 04 eb e1 29 d0 85 c0 7f 10 4c 89 ef 41 ff c6 e8 57 fb ff ff e9 d8 00 00 00 <41> 8b 45 08 48 8b 15 90 04 eb e1 29 d0 ba 00 00 00 00 0f 48 c2 ba
Oct 10 21:43:39 jarvis kernel: RSP: 0018:ffffc9000293fe48 EFLAGS: 00000202
Oct 10 21:43:39 jarvis kernel: RAX: 000000000ed17370 RBX: 00000000000493df RCX: ffff8881ad600000
Oct 10 21:43:39 jarvis kernel: RDX: 000000010ed39b66 RSI: ffffc9000293fe54 RDI: ffff8885c23d6a48
Oct 10 21:43:39 jarvis kernel: RBP: 0000000000015b2d R08: 0000000000000000 R09: 0000000080190017
Oct 10 21:43:39 jarvis kernel: R10: ffff88810319e3c0 R11: ffff88810319e3c0 R12: ffffffffa036d620
Oct 10 21:43:39 jarvis kernel: R13: ffff8885c23d6a00 R14: 0000000000000030 R15: ffff8885c23d6a48
Oct 10 21:43:39 jarvis kernel: ? gc_worker+0xb2/0x30b [nf_conntrack]
Oct 10 21:43:39 jarvis kernel: process_one_work+0x195/0x27a
Oct 10 21:43:39 jarvis kernel: worker_thread+0x19c/0x240
Oct 10 21:43:39 jarvis kernel: ? rescuer_thread+0x28b/0x28b
Oct 10 21:43:39 jarvis kernel: kthread+0xdc/0xe3
Oct 10 21:43:39 jarvis kernel: ? set_kthread_struct+0x32/0x32
Oct 10 21:43:39 jarvis kernel: ret_from_fork+0x1f/0x30
Oct 10 21:43:39 jarvis kernel: </TASK>
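
To see which CPU IDs the stalls land on across a longer syslog, the "self-detected stall" marker and the "NMI backtrace for cpu N" line that follows it can be tallied. This is only a minimal sketch: the function name and the idea of feeding it saved log lines are assumptions, and the sample lines are copied from the excerpt above.

```python
import re
from collections import Counter

STALL_MARKER = "self-detected stall on CPU"
BACKTRACE_RE = re.compile(r"NMI backtrace for cpu (\d+)")


def count_stalls_per_cpu(lines):
    """Count RCU stall reports per CPU ID (hypothetical helper)."""
    counts = Counter()
    pending = False  # True after a stall marker, until its backtrace line is seen
    for line in lines:
        if STALL_MARKER in line:
            pending = True
            continue
        match = BACKTRACE_RE.search(line)
        if pending and match:
            counts[int(match.group(1))] += 1
            pending = False
    return counts


# Sample lines taken verbatim from the log excerpt in this post.
sample = [
    "Oct 10 21:43:39 jarvis kernel: rcu: INFO: rcu_sched self-detected stall on CPU",
    "Oct 10 21:43:39 jarvis kernel:     (t=60000 jiffies g=36180177 q=58701)",
    "Oct 10 21:43:39 jarvis kernel: NMI backtrace for cpu 12",
]

print(dict(count_stalls_per_cpu(sample)))
```

In practice the lines would come from the saved syslog file instead of the `sample` list, for example by iterating over an open file object.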

At the last freeze, I got the following warning 13 hours before the freeze:

Code:

Oct 17 00:27:59 jarvis kernel: WARNING: CPU: 4 PID: 27239 at arch/x86/kvm/mmu/mmu.c:3835 kvm_mmu_page_fault+0x1e8/0x4b5 [kvm]

and then finally:

Code:

Oct 17 13:32:38 jarvis kernel: Invalid SPTE change: cannot replace a present leaf
Oct 17 13:32:38 jarvis kernel: SPTE with another present leaf SPTE mapping a
Oct 17 13:32:38 jarvis kernel: different PFN!
Oct 17 13:32:38 jarvis kernel: as_id: 0 gfn: 127644 old_spte: ffff8881057ddd58 new_spte: 6000003f6e44b77 level: 1
Oct 17 13:32:38 jarvis kernel: ------------[ cut here ]------------
Oct 17 13:32:38 jarvis kernel: kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:446!
Oct 17 13:32:38 jarvis kernel: invalid opcode: 0000 [#1] SMP NOPTI
Oct 17 13:32:38 jarvis kernel: CPU: 8 PID: 29233 Comm: CPU 0/KVM Tainted: G        W  O      5.15.46-Unraid #1
Oct 17 13:32:38 jarvis kernel: Hardware name: Gigabyte Technology Co., Ltd. W480M VISION W/W480M VISION W, BIOS F21 11/23/2021
Oct 17 13:32:38 jarvis kernel: RIP: 0010:__handle_changed_spte+0x113/0x42d [kvm]

After that, the CPU stall error came back.

I had a look at the kernel code, but I can't make sense of it:

Code:

/**
 * __handle_changed_spte - handle bookkeeping associated with an SPTE change
 * @kvm: kvm instance
 * @as_id: the address space of the paging structure the SPTE was a part of
 * @gfn: the base GFN that was mapped by the SPTE
 * @old_spte: The value of the SPTE before the change
 * @new_spte: The value of the SPTE after the change
 * @level: the level of the PT the SPTE is part of in the paging structure
 * @shared: This operation may not be running under the exclusive use of
 *        the MMU lock and the operation must synchronize with other
 *        threads that might be modifying SPTEs.
 *
 * Handle bookkeeping that might result from the modification of a SPTE.
 * This function must be called for all TDP SPTE modifications.
 */
static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
                  u64 old_spte, u64 new_spte, int level,
                  bool shared)
{
    bool was_present = is_shadow_present_pte(old_spte);
    bool is_present = is_shadow_present_pte(new_spte);
    bool was_leaf = was_present && is_last_spte(old_spte, level);
    bool is_leaf = is_present && is_last_spte(new_spte, level);
    bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);

    WARN_ON(level > PT64_ROOT_MAX_LEVEL);
    WARN_ON(level < PG_LEVEL_4K);
    WARN_ON(gfn & (KVM_PAGES_PER_HPAGE(level) - 1));

    /*
     * If this warning were to trigger it would indicate that there was a
     * missing MMU notifier or a race with some notifier handler.
     * A present, leaf SPTE should never be directly replaced with another
     * present leaf SPTE pointing to a different PFN. A notifier handler
     * should be zapping the SPTE before the main MM's page table is
     * changed, or the SPTE should be zeroed, and the TLBs flushed by the
     * thread before replacement.
     */
    if (was_leaf && is_leaf && pfn_changed) {
        pr_err("Invalid SPTE change: cannot replace a present leaf\n"
               "SPTE with another present leaf SPTE mapping a\n"
               "different PFN!\n"
               "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
               as_id, gfn, old_spte, new_spte, level);

        /*
         * Crash the host to prevent error propagation and guest data
         * corruption.
         */
        BUG();
    }
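
The `pr_err` format string in the excerpt prints `gfn`, `old_spte` and `new_spte` with `%llx`, so those values in the log line are hexadecimal, while `as_id` and `level` are decimal. A small sketch for pulling the fields out of the logged line follows; the function name and the regular expression are illustrative assumptions, and the shift by 12 assumes the usual 4 KiB base page size on x86.

```python
import re

# Mirrors the field order of the pr_err format string quoted above.
SPTE_LINE_RE = re.compile(
    r"as_id: (?P<as_id>\d+) gfn: (?P<gfn>[0-9a-f]+) "
    r"old_spte: (?P<old_spte>[0-9a-f]+) new_spte: (?P<new_spte>[0-9a-f]+) "
    r"level: (?P<level>\d+)"
)


def parse_spte_line(line):
    """Return the logged fields as integers, or None if the line does not match."""
    match = SPTE_LINE_RE.search(line)
    if match is None:
        return None
    return {
        "as_id": int(match["as_id"]),          # printed with %d
        "gfn": int(match["gfn"], 16),          # printed with %llx
        "old_spte": int(match["old_spte"], 16),
        "new_spte": int(match["new_spte"], 16),
        "level": int(match["level"]),          # printed with %d
    }


# Log line copied from this post.
line = (
    "Oct 17 13:32:38 jarvis kernel: as_id: 0 gfn: 127644 "
    "old_spte: ffff8881057ddd58 new_spte: 6000003f6e44b77 level: 1"
)
fields = parse_spte_line(line)

# Guest frame number -> guest physical address, assuming 4 KiB pages.
guest_phys_addr = fields["gfn"] << 12
print(hex(fields["gfn"]), hex(guest_phys_addr), fields["level"])
```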

What is an SPTE? Is this more likely a software or a hardware fault? What could help? Or who could help, who knows their way around the kernel?

Thanks in advance and best regards
Wirman

System:
Unraid system: Unraid server Plus, version 6.10.3
Motherboard: Gigabyte Technology Co., Ltd. - W480M VISION W
Processor: Intel® Xeon® W-1290E CPU @ 3.50GHz
Memory: 32 GB ECC

i-B4se · 21.10.2022

Have you enabled Powertop?
I have an AMD system, but I had a few problems with Powertop.

Wirman · 21.10.2022

I installed it just to try it out, but it is not in autostart, so it was not running during the freezes.
At the moment only 3 Docker containers (Jellyfin, JDownloader2, Grocy) and 3 VMs (HomeAssistant, Windows 10 BlueIris, Xubuntu with PythonBot) are running.
And during the first freezes, Grocy, BlueIris and Xubuntu did not exist yet. The tip from the Unraid forum, to check whether the error also occurs in safe mode, is hard for me to follow, because the error sometimes only shows up after 15 days and I need HomeAssistant in particular.

i-B4se · 21.10.2022

Wirman wrote:
I installed it just to try it out, but it is not in autostart, so it was not running during the freezes.

What do you mean by installed?
Where is the file located, or what did you install it with?
If the Powertop package is in /boot/extra, it gets installed on every reboot.

Wirman · 21.10.2022

I don't have a /boot/extra folder. But when I type powertop in the terminal, it starts Powertop 2.13.

Edit: Ah, grep -r "powertop" /boot/ tells me that it was installed as part of the Nerdpack plugin.

warchild · 21.10.2022

Well, as long as you don't run "powertop --auto-tune", it shouldn't really matter whether powertop is installed or not. It doesn't do anything on its own.

@Wirman: the log mentions "kvm", were the VMs running during the freezes?
It's not a solution, but to narrow down the problem, maybe leave the VMs off for a while and see whether the error occurs again.

Wirman · 21.10.2022

Yes, the VMs were running at the time of the freeze. The only VM that could then be the problem would be the HomeAssistant VM. But I need it; besides the NAS, it is one of the main reasons for the server.
Home Assistant can be migrated fairly easily with a backup, right? I could possibly run HA on a Raspberry Pi 3 as a stopgap, or set up the VM from scratch.

The error always contains a CPU ID. Is it identical to the IDs that Unraid uses? So far, the "stall" has occurred on IDs 6, 8 and 12. However, only 8 is assigned to the HA VM. No active VMs run on 6 and 12.

I would now isolate the CPUs so that each VM and each Docker container is assigned to specific CPU IDs and none of them is used twice. Then it should be possible to attribute the error unambiguously, right?
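
Before applying such a pinning plan, it can be checked for CPU IDs that are assigned more than once. The sketch below is only an illustration: the function name is made up, and apart from CPU 8 for the HA VM, every CPU assignment in the example plan is hypothetical and not taken from the actual server.

```python
from collections import defaultdict


def find_shared_cpus(pinning):
    """Return {cpu_id: [names]} for every CPU assigned to more than one VM or container."""
    users = defaultdict(list)
    for name, cpus in pinning.items():
        for cpu in cpus:
            users[cpu].append(name)
    return {cpu: names for cpu, names in sorted(users.items()) if len(names) > 1}


# Hypothetical plan; only "HomeAssistant on CPU 8" comes from the thread.
pinning = {
    "HomeAssistant": {8},
    "BlueIris": {2, 3},
    "Xubuntu": {4},
    "Jellyfin": {3, 5},
}

print(find_shared_cpus(pinning))
```

An empty result means no CPU ID is used twice, so a stall on a given CPU can be tied to exactly one VM or container in the plan.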

warchild · 21.10.2022

Wirman wrote:
The error always contains a CPU ID. Is it identical to the IDs that Unraid uses? So far, the "stall" has occurred on IDs 6, 8 and 12. However, only 8 is assigned to the HA VM. No active VMs run on 6 and 12.

I would now isolate the CPUs so that each VM and each Docker container is assigned to specific CPU IDs and none of them is used twice. Then it should be possible to attribute the error unambiguously, right?

I'm no expert there either, but it sounds plausible at first glance.

What I also found while quickly googling were references to too little RAM.
Could that be a problem for you?

i-B4se · 21.10.2022

Have you run Memtest86?
It is also on the Unraid USB stick and can be selected in the boot menu at startup.

Wirman · 21.10.2022

I'll run it; a parity check is running at the moment.
I have 32 GB of RAM, which should be enough. Unraid reports 40% utilization; HA and the Windows Blue Iris VM are assigned 4096M each, the Xubuntu VM 2048M. If one VM ran into a memory leak, the whole system shouldn't freeze, should it?
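
As a rough cross-check of those numbers, the VM assignments can be added up and compared with the host RAM. This is a back-of-the-envelope sketch that assumes "M" means MiB and treats the 32 GB as 32 GiB.

```python
# VM memory assignments as stated in the post, assumed to be in MiB.
vm_memory_mib = {
    "HomeAssistant": 4096,
    "Windows 10 BlueIris": 4096,
    "Xubuntu": 2048,
}

host_memory_mib = 32 * 1024  # assumption: 32 GB of RAM counted as 32 GiB

total_vm_mib = sum(vm_memory_mib.values())
vm_share = total_vm_mib / host_memory_mib

print(f"{total_vm_mib} MiB assigned to VMs, {vm_share:.2%} of host RAM")
```

Under these assumptions the three VMs together can claim at most 10240 MiB, which leaves most of the host RAM for Unraid and the Docker containers.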
