
Kernel panic Debian 4.9.30-2+deb9u5

Posted: 14 Sep 2018, 09:44
by aftar
Good day, colleagues!
Please help; this has worn out my nerves and broken my brain.
The server crashes periodically, even at night, when the load is minimal.

Server: HP DL380 G6
Network card: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
Port interrupts are spread across every other core; RPS is enabled for rx-0 of the PPPoE interface.
Spoiler
DEV=eth0-TxRx-0 00000001 -> 1
DEV=eth0-TxRx-1 00000004 -> 4
DEV=eth0-TxRx-2 00000010 -> 10
DEV=eth0-TxRx-3 00000040 -> 40
DEV=eth0-TxRx-4 00000100 -> 100
DEV=eth0-TxRx-5 00000400 -> 400
DEV=eth0-TxRx-6 00000001 -> 1
DEV=eth0-TxRx-7 00000004 -> 4
DEV=eth0-TxRx-8 00000010 -> 10
DEV=eth0-TxRx-9 00000040 -> 40
DEV=eth0-TxRx-10 00000100 -> 100
DEV=eth0-TxRx-11 00000400 -> 400
DEV=eth0 00000001 -> 1
DEV=eth2-TxRx-0 00000002 -> 2
DEV=eth2-TxRx-1 00000008 -> 8
DEV=eth2-TxRx-2 00000020 -> 20
DEV=eth2-TxRx-3 00000080 -> 80
DEV=eth2-TxRx-4 00000200 -> 200
DEV=eth2-TxRx-5 00000800 -> 800
DEV=eth2-TxRx-6 00000002 -> 2
DEV=eth2-TxRx-7 00000008 -> 8
DEV=eth2-TxRx-8 00000020 -> 20
DEV=eth2-TxRx-9 00000080 -> 80
DEV=eth2-TxRx-10 00000200 -> 200
DEV=eth2-TxRx-11 00000800 -> 800
DEV=eth2 00000002 -> 2
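
To make the hex masks above easier to read, here is a minimal sketch that decodes an affinity mask into CPU numbers. The function name `mask_to_cpus` is made up for illustration, and it assumes a single hex word without commas (fewer than 64 CPUs).

```shell
#!/bin/bash
# Decode a hex CPU mask (as used by smp_affinity or rps_cpus) into CPU numbers.
# Hypothetical helper; assumes one hex word with no comma-separated groups.
mask_to_cpus() {
    local mask=$((16#$1))   # parse the argument as hexadecimal
    local cpu=0 cpus=""
    while (( mask > 0 )); do
        if (( mask & 1 )); then
            cpus+="${cpus:+ }$cpu"   # append, space-separated
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    echo "$cpus"
}
```

For example, `mask_to_cpus 00000400` gives CPU 10. Read this way, the eth0 queues sit on the even CPUs (0, 2, 4, 6, 8, 10) and the eth2 queues on the odd ones (1, 3, 5, 7, 9, 11).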

Code:

#!/bin/bash
DEV=eth2

echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 2048 > /sys/class/net/$DEV/queues/rx-0/rps_flow_cnt
echo aaa > /sys/class/net/$DEV/queues/rx-0/rps_cpus
echo 8192 > /proc/sys/net/core/flow_limit_table_len
echo ffff > /proc/sys/net/core/flow_limit_cpu_bitmap
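
The `rps_cpus` value in the script is a hex bitmask of the CPUs allowed to process packets for that queue. A minimal sketch of building such a mask from a CPU list follows; `cpus_to_mask` is a hypothetical helper, and it assumes CPU numbers below 63.

```shell
#!/bin/bash
# Build a hex CPU mask from a list of CPU numbers.
# Hypothetical helper; assumes CPU numbers small enough for one 64-bit word.
cpus_to_mask() {
    local mask=0 cpu
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))   # set the bit for this CPU
    done
    printf '%x\n' "$mask"
}
```

`cpus_to_mask 1 3 5 7 9 11` yields `aaa`, so the mask in the script steers RPS for rx-0 to the odd-numbered CPUs.
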
Running on it: accel-ppp (PPPoE, 1300 sessions) + NAT + OSPF (quagga) + iptables + tc

Added to GRUB: nox2apic intremap=off intel_idle.max_cstate=0 processor.max_cstate=1
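
A quick way to confirm that the running kernel actually booted with these parameters is to look for them in `/proc/cmdline`. The sketch below is only an illustration; `check_cmdline` is a made-up name, and the file path is passed as an argument so it can be tried on a saved copy.

```shell
#!/bin/bash
# Report which of the given parameters appear in a kernel command line file.
# Hypothetical helper; first argument is the file to read (normally /proc/cmdline).
check_cmdline() {
    local cmdline_file=$1; shift
    local cmdline param
    cmdline=$(cat "$cmdline_file") || return 1
    for param in "$@"; do
        # Pad with spaces so only whole parameters match
        case " $cmdline " in
            *" $param "*) echo "present: $param" ;;
            *)            echo "missing: $param" ;;
        esac
    done
}
```

Usage would be along the lines of `check_cmdline /proc/cmdline nox2apic intremap=off intel_idle.max_cstate=0 processor.max_cstate=1`.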

Code:

lsb_release -a
Distributor ID:	Debian
Description:	Debian GNU/Linux 9.5 (stretch)
Release:	9.5
Codename:	stretch

Code:

uname -a
Linux nas 4.9.0-3-amd64 #1 SMP Debian 4.9.30-2+deb9u5 (2017-09-19) x86_64 GNU/Linux
Spoiler
Sep 14 03:53:45 nas kernel: [576373.703909] INFO: rcu_sched self-detected stall on CPU
Sep 14 03:53:45 nas kernel: [576373.705175] 5-...: (981 ticks this GP) idle=2f5/140000000000001/0 softirq=68543340/68543340 fqs=3
Sep 14 03:53:45 nas kernel: [576373.706615] (t=5250 jiffies g=35865711 c=35865710 q=69345)
Sep 14 03:53:45 nas kernel: [576373.708038] rcu_sched kthread starved for 1575 jiffies! g35865711 c35865710 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x0
Sep 14 03:53:45 nas kernel: [576373.709614] rcu_sched R running task 0 8 2 0x00000000
Sep 14 03:53:45 nas kernel: [576373.709617] ffff9bfac4c4a400 0000000000000000 ffff9bfacda58e40 ffff9bfedf718240
Sep 14 03:53:45 nas kernel: [576373.709619] ffff9bfecd19a040 ffffbabac316fdb0 ffffffff984016d3 ffffbabac316fde0
Sep 14 03:53:45 nas kernel: [576373.709621] 0000000108959fc1 ffff9bfedf718240 0000000000000009 ffff9bfacda58e40
Sep 14 03:53:45 nas kernel: [576373.709623] Call Trace:
Sep 14 03:53:45 nas kernel: [576373.709630] [<ffffffff984016d3>] ? __schedule+0x233/0x6d0
Sep 14 03:53:45 nas kernel: [576373.709632] [<ffffffff98401ba2>] ? schedule+0x32/0x80
Sep 14 03:53:45 nas kernel: [576373.709633] [<ffffffff98404eae>] ? schedule_timeout+0x17e/0x310
Sep 14 03:53:45 nas kernel: [576373.709637] [<ffffffff97ee3e50>] ? del_timer_sync+0x50/0x50
Sep 14 03:53:45 nas kernel: [576373.709639] [<ffffffff97edd605>] ? rcu_gp_kthread+0x505/0x850
Sep 14 03:53:45 nas kernel: [576373.709642] [<ffffffff97eb8799>] ? __wake_up_common+0x49/0x80
Sep 14 03:53:45 nas kernel: [576373.709643] [<ffffffff97edd100>] ? rcu_note_context_switch+0xe0/0xe0
Sep 14 03:53:45 nas kernel: [576373.709645] [<ffffffff97e965d7>] ? kthread+0xd7/0xf0
Sep 14 03:53:45 nas kernel: [576373.709647] [<ffffffff97e96500>] ? kthread_park+0x60/0x60
Sep 14 03:53:45 nas kernel: [576373.709648] [<ffffffff984065f5>] ? ret_from_fork+0x25/0x30
Sep 14 03:53:45 nas kernel: [576373.709650] Task dump for CPU 5:
Sep 14 03:53:45 nas kernel: [576373.709650] kworker/5:0 R running task 0 3413 2 0x00000008
Sep 14 03:53:45 nas kernel: [576373.709661] Workqueue: events_long gc_worker [nf_conntrack]
Sep 14 03:53:45 nas kernel: [576373.709662] ffffffff98b13580 ffffffff97ea3bcb 0000000000000005 ffffffff98b13580
Sep 14 03:53:45 nas kernel: [576373.709664] ffffffff97f7a4b6 ffff9bfedf698fc0 ffffffff98a4a6c0 0000000000000000
Sep 14 03:53:45 nas kernel: [576373.709665] ffffffff98b13580 00000000ffffffff ffffffff97edee04 0000000000a2b6d1
Sep 14 03:53:45 nas kernel: [576373.709667] Call Trace:
Sep 14 03:53:45 nas kernel: [576373.709667] <IRQ>
Sep 14 03:53:45 nas kernel: [576373.709670] [<ffffffff97ea3bcb>] ? sched_show_task+0xcb/0x130
Sep 14 03:53:45 nas kernel: [576373.709672] [<ffffffff97f7a4b6>] ? rcu_dump_cpu_stacks+0x92/0xb2
Sep 14 03:53:45 nas kernel: [576373.709673] [<ffffffff97edee04>] ? rcu_check_callbacks+0x754/0x8a0
Sep 14 03:53:45 nas kernel: [576373.709675] [<ffffffff97eed0c3>] ? update_wall_time+0x473/0x790
Sep 14 03:53:45 nas kernel: [576373.709677] [<ffffffff97ef48c0>] ? tick_sched_handle.isra.12+0x50/0x50
Sep 14 03:53:45 nas kernel: [576373.709678] [<ffffffff97ee5718>] ? update_process_times+0x28/0x50
Sep 14 03:53:45 nas kernel: [576373.709679] [<ffffffff97ef4890>] ? tick_sched_handle.isra.12+0x20/0x50
Sep 14 03:53:45 nas kernel: [576373.709680] [<ffffffff97ef48f8>] ? tick_sched_timer+0x38/0x70
Sep 14 03:53:45 nas kernel: [576373.709682] [<ffffffff97ee60fc>] ? __hrtimer_run_queues+0xdc/0x240
Sep 14 03:53:45 nas kernel: [576373.709683] [<ffffffff97ee67cc>] ? hrtimer_interrupt+0x9c/0x1a0
Sep 14 03:53:45 nas kernel: [576373.709684] [<ffffffff98408ca9>] ? smp_apic_timer_interrupt+0x39/0x50
Sep 14 03:53:45 nas kernel: [576373.709687] [<ffffffffc04928e0>] ? nf_nat_l3proto_register+0x70/0x70 [nf_nat]
Sep 14 03:53:45 nas kernel: [576373.709688] [<ffffffff98407fc2>] ? apic_timer_interrupt+0x82/0x90
Sep 14 03:53:45 nas kernel: [576373.709689] <EOI>
Sep 14 03:53:45 nas kernel: [576373.709691] [<ffffffffc04928e0>] ? nf_nat_l3proto_register+0x70/0x70 [nf_nat]
Sep 14 03:53:45 nas kernel: [576373.709693] [<ffffffff97ec0f02>] ? native_queued_spin_lock_slowpath+0x112/0x190
Sep 14 03:53:45 nas kernel: [576373.709694] [<ffffffff98406018>] ? _raw_spin_lock_bh+0x28/0x30
Sep 14 03:53:45 nas kernel: [576373.709696] [<ffffffffc0492a64>] ? nf_nat_cleanup_conntrack+0xb4/0x1e0 [nf_nat]
Sep 14 03:53:45 nas kernel: [576373.709701] [<ffffffffc07d96b3>] ? __nf_ct_ext_destroy+0x43/0x60 [nf_conntrack]
Sep 14 03:53:45 nas kernel: [576373.709704] [<ffffffffc07d02d0>] ? nf_conntrack_free+0x20/0x50 [nf_conntrack]
Sep 14 03:53:45 nas kernel: [576373.709707] [<ffffffffc07d10da>] ? gc_worker+0xba/0x190 [nf_conntrack]
Sep 14 03:53:45 nas kernel: [576373.709709] [<ffffffff97e90384>] ? process_one_work+0x184/0x410
Sep 14 03:53:45 nas kernel: [576373.709710] [<ffffffff97e9065d>] ? worker_thread+0x4d/0x480
Sep 14 03:53:45 nas kernel: [576373.709711] [<ffffffff97e90610>] ? process_one_work+0x410/0x410
Sep 14 03:53:45 nas kernel: [576373.709714] [<ffffffff97e7bb0a>] ? do_group_exit+0x3a/0xa0
Sep 14 03:53:45 nas kernel: [576373.709715] [<ffffffff97e965d7>] ? kthread+0xd7/0xf0
Sep 14 03:53:45 nas kernel: [576373.709716] [<ffffffff97e96500>] ? kthread_park+0x60/0x60
Sep 14 03:53:45 nas kernel: [576373.709718] [<ffffffff984065f5>] ? ret_from_fork+0x25/0x30
Any ideas on where to dig, please? :roll:
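
The task dump for CPU 5 shows the conntrack `gc_worker` waiting on a spinlock inside `nf_nat_cleanup_conntrack`. One thing that might be worth looking at, as a guess rather than a diagnosis, is how full the conntrack table is. A minimal sketch follows; `conntrack_usage` is a hypothetical helper, and it assumes the `nf_conntrack` module is loaded so that the two sysctl files exist.

```shell
#!/bin/bash
# Print current conntrack table usage against its configured maximum.
# Hypothetical helper; the directory argument is optional and exists only
# so the function can be tried against a copy of the files.
conntrack_usage() {
    local dir=${1:-/proc/sys/net/netfilter}
    local count max
    count=$(cat "$dir/nf_conntrack_count") || return 1
    max=$(cat "$dir/nf_conntrack_max") || return 1
    echo "conntrack entries: $count / $max ($(( count * 100 / max ))%)"
}
```

Running `conntrack_usage` periodically (for instance from cron) and comparing the numbers with the time of the stall would show whether the table size is related at all.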

Re: Kernel panic Debian 4.9.30-2+deb9u5

Posted: 16 Sep 2018, 15:06
by dimka88
Good day.
1. If HT is enabled, try not to pin interrupts to those cores.
2. This bit of magic has very often helped:

Code:

echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance >/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
echo performance >/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
echo performance >/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
...
and so on for each core.
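
The per-core lines above can be folded into a loop. This is a minimal sketch: `set_governor` is a made-up name, and the optional second argument only exists so the loop can be tried against a scratch directory instead of the real sysfs tree.

```shell
#!/bin/bash
# Write the given cpufreq governor to every CPU that exposes a cpufreq directory.
# Hypothetical helper; second argument overrides the sysfs CPU directory.
set_governor() {
    local governor=$1
    local base=${2:-/sys/devices/system/cpu}
    local f
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue   # glob matched nothing: cpufreq not available
        echo "$governor" > "$f"
    done
}
```

As root, `set_governor performance` would then cover all cores in one go. If no `cpufreq` directories exist (for example when frequency scaling is handled entirely by the BIOS), the loop does nothing.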

Re: Kernel panic Debian 4.9.30-2+deb9u5

Posted: 17 Sep 2018, 04:30
by aftar
Thank you, Dmitry!
HT is disabled, and so is virtualization; all CPU and memory settings are set to max performance.

I've enabled the magic; we'll keep an eye on it.