accelppp exits unexpectedly with log error md:epoll_wait

Post by exopedro » 23 Jun 2020, 16:19

Hi,

We are hitting an error with the latest stable version of accel-ppp (1.12.0) at the nonprofit ISP exo.cat.

The same accel-ppp server uses L2TP for subscribers on the wifi network and PPPoE for subscribers on the FTTH network (for the latter we use the vlan_mon custom kernel module from accel-ppp).

The problem is that accel-ppp suddenly exits and systemd restarts it, so we suffer a very short outage. These days people notice it because they are heavily using their connection for videoconferencing tools.

It is difficult to see the error because debug.log is overwritten after accel-ppp exits, and we still don't know how to make it append instead. Looking at the logs, we saw that some exits in the code are not reported. So we instrumented all exits this way [1], and we capture stderr from accel-ppp in systemd this way [2].
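
A minimal sketch of this kind of exit tracing, for illustration only: the helper names `format_exit_trace` and `TRACE_AND_EXIT` are hypothetical and are not taken from accel-ppp or from the patch in [1]. It records the file, function, and line automatically, and takes the error number as an argument because some calls (`pthread_create`, for example) return the error instead of setting `errno`.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not part of accel-ppp: format one trace line
 * naming the call site and the error, so the message cannot drift
 * from the function it was pasted into. Returns snprintf's result. */
static int format_exit_trace(char *buf, size_t len, const char *file,
                             const char *func, int line, int err)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%s:%d: %s(): exiting: %s (errno %d)",
                    file, line, func, strerror(err), err);
}

/* Print the trace line to stderr, then terminate with _exit().
 * `err` is passed explicitly: use errno for system calls, or the
 * return value for pthread-style functions. */
#define TRACE_AND_EXIT(err, status)                                   \
    do {                                                              \
        char trace_buf_[256];                                         \
        format_exit_trace(trace_buf_, sizeof(trace_buf_), __FILE__,   \
                          __func__, __LINE__, (err));                 \
        fprintf(stderr, "%s\n", trace_buf_);                          \
        _exit(status);                                                \
    } while (0)
```

With this, a call site would read `TRACE_AND_EXIT(errno, -1);`. Since stderr is unbuffered by default, the message is written even though `_exit()` skips the stdio flush.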

Our error is:

`md.c: Debug-eXO: (). Function md_thread(). n < 0. errno != EINTR. Exiting` in our code (we only added a message to stderr before each exit) [3]

which corresponds to this line in the official code [4]

accel-ppp is fine most of the time according to `accel-cmd show stat` [5]

This is our config file, /etc/accel-ppp.conf [6]

We build accel-ppp with this install script [7] on top of Debian 10 stable, running in a KVM virtual machine in a Proxmox (proxmox.org) cluster whose resources are shared with other, unrelated services. The VM has 4 vCPUs and 2 GB of RAM.

The backbone interface has these ethtool options [8]

Output of `cat /proc/interrupts` [9]

The load average is, in general: 0.00, 0.00, 0.00

With `htop`, all 4 vCPUs are generally around 0-4%, and some of them occasionally reach 10%. accel-ppp uses between 0 and 1%. The busiest process is the Prometheus node exporter, which sometimes reaches that 10%; the Prometheus scrape_interval is 5s. The server is monitored with the node exporter and a Grafana dashboard, so if you want a specific metric we can get it or post a screenshot.

We think that the service requires more CPU time, so we would like to:

- increase CPU units [10]
- run `ethtool -G ens18 rx 4096 tx 4096` as suggested in [11], but it fails with "cannot set device ring parameters: operation not supported"

It would be nice to have a stress-test ("torture") testbed to reproduce this in the lab and try out parameters more freely. Does something like that exist? Any help is appreciated.

Thanks,
Pedro

[1] https://gitlab.com/guifi-exo/accel-ppp/ ... 38af112409
[2] (1) /usr/sbin/accel-pppd without `-d` (2) `Type=simple` https://gitlab.com/guifi-exo/accel-ppp/ ... pp.service
[3] https://gitlab.com/guifi-exo/accel-ppp/ ... n/md.c#L80
[4] https://github.com/xebd/accel-ppp/blob/ ... n/md.c#L77
[5] http://paste.debian.net/1153497/
[6] http://paste.debian.net/1153496/
[7] https://gitlab.com/guifi-exo/accel-ppp/ ... git-exo.sh
[8] (with proxmox queues=4 in a virtio iface)
auto ens18
iface ens18 inet manual
pre-up /sbin/ethtool -K ens18 tx off rx off && /sbin/ethtool -L ens18 combined 4
[9] http://paste.debian.net/1153498/
[10] https://serverfault.com/questions/20547 ... mox/364809
[11] viewtopic.php?f=9&t=2689#p7110


Post by exopedro » 06 Jul 2020, 17:53

We followed all the steps for monitoring the error from the link you provided.

To get the backtrace (a step that is not documented in the accel-ppp docs) we used the following script (latest version):

#!/bin/sh

# put this in tmux
# thanks pespin

outfile="/root/accelppp_gdb.log"

# generic debug in accelppp -> src https://accel-ppp.readthedocs.io/en/latest/debugging/index.html
# about attach -> src http://sourceware.org/gdb/onlinedocs/gdb/Attach.html

# gdb can only attach to accel-pppd as root; fail with a non-zero status otherwise
if [ "$(id -u)" -ne 0 ]; then
  echo "Please run as root" >&2
  exit 1
fi

log_header="================== $(date +'%Y-%m-%d-%H-%M-%S-%N') =================="
echo "$log_header" >> "$outfile"

# appropriate order to send stderr to stdout, and stdout to append file
#   -> src https://stackoverflow.com/questions/876239/how-to-redirect-and-append-both-stdout-and-stderr-to-a-file-with-bash/876242#876242
gdb -ex 'set breakpoint pending on' -ex 'set confirm off' \
  -ex 'set pagination off' -ex 'b _exit' -ex 'c' -ex 'bt full' \
  -ex 'q' -p "$(pidof accel-pppd)" >> "$outfile" 2>&1

The Debug-eXO messages come from lines like this one that we added (shown in full detail):

fprintf(stderr, "md.c: Debug-eXO: (). Function md_run(). pthread_create(&md_thr, NULL, md_thread, NULL) => 0. Exiting. errno = %s\n", strerror(errno));

We hit the corresponding message in md_thread(), as shown in the systemd journal (with our version of the systemd service):

Jul 05 18:10:12 bng accelppp-exo[20044]: md.c: Debug-eXO: (). Function md_thread(). n < 0. errno = Bad file descriptor
Jul 05 18:10:14 bng systemd[1]: accel-ppp.service: Main process exited, code=exited, status=255/EXCEPTION
Jul 05 18:10:14 bng systemd[1]: accel-ppp.service: Failed with result 'exit-code'.
Jul 05 18:10:14 bng systemd[1]: accel-ppp.service: Service RestartSec=100ms expired, scheduling restart.
Jul 05 18:10:14 bng systemd[1]: accel-ppp.service: Scheduled restart job, restart counter is at 5.
Jul 05 18:10:14 bng systemd[1]: Stopped Accel-PPP.
Jul 05 18:10:14 bng systemd[1]: Started Accel-PPP.

accel-pppd backtrace:

================== 2020-07-03-10-25-00-413179131 ==================
GNU gdb (Debian 8.2.1-2+b3) 8.2.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 20044
[New LWP 20045]
[New LWP 20046]
[New LWP 20047]
[New LWP 20048]
[New LWP 20049]
[New LWP 20050]
[New LWP 20051]
[New LWP 20052]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fb197f154cc in __GI___sigtimedwait (set=set@entry=0x7ffc8bee2740, info=info@entry=0x7ffc8bee2650, timeout=timeout@entry=0x0)
    at ../sysdeps/unix/sysv/linux/sigtimedwait.c:29
29      ../sysdeps/unix/sysv/linux/sigtimedwait.c: No such file or directory.
Breakpoint 1 at 0x7fb197fa39a0: file ../sysdeps/unix/sysv/linux/_exit.c, line 27.
Continuing.

Thread 5 "accel-pppd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fb1968f0700 (LWP 20048)]
0x00007fb197e5efa3 in find_pd (ses=0x7fb17c00bcc8) at /usr/local/src/accel-ppp/accel-pppd/radius/radius.c:731
731             list_for_each_entry(pd, &ses->pd_list, entry) {
#0  0x00007fb197e5efa3 in find_pd (ses=0x7fb17c00bcc8) at /usr/local/src/accel-ppp/accel-pppd/radius/radius.c:731
        pd = 0x0
        rpd = 0x7fb18c00f1c8
#1  0x00007fb197e5e9b2 in ses_finishing (ses=0x7fb17c00bcc8) at /usr/local/src/accel-ppp/accel-pppd/radius/radius.c:608
        rpd = 0x55d16afcf670
        fr6 = 0x7fb197a88526 <ev_ses_finishing+39>
        fr = 0x7fb18c0386b8
#2  0x00007fb1984cfb44 in triton_event_fire (ev_id=3, arg=0x7fb17c00bcc8) at /usr/local/src/accel-ppp/accel-pppd/triton/event.c:103
        ev = 0x55d16afcf670
        h = 0x55d16afddde0
#3  0x000055d169e5f300 in ap_session_terminate (ses=0x7fb17c00bcc8, cause=1, hard=0) at /usr/local/src/accel-ppp/accel-pppd/session.c:302
No locals.
#4  0x000055d169e68b92 in lcp_recv (h=0x7fb18c0386e0) at /usr/local/src/accel-ppp/accel-pppd/ppp/ppp_lcp.c:817
        hdr = 0x7fb178051d68
        lcp = 0x7fb18c0386b8
        r = 32689
        term_msg = 0x55d16afcf028 "\310\360\374j\321U"
#5  0x000055d169e63811 in ppp_chan_read (h=0x7fb17c00bdf8) at /usr/local/src/accel-ppp/accel-pppd/ppp/ppp.c:423
        ppp = 0x7fb17c00bcc8
        ppp_h = 0x7fb18c0386e0
        proto = 49185
#6  0x00007fb1984cbcaa in ctx_thread (ctx=0x7fb17804e138) at /usr/local/src/accel-ppp/accel-pppd/triton/triton.c:251
        h = 0x7fb18c021d38
        t = 0x7fb17c00bcc8
        call = 0x7fb197e92b00 <l2tp_ctx_switch>
        tt = 140400266361928
        events = 1
#7  0x00007fb1984cba56 in triton_thread (thread=0x55d16b0125f8) at /usr/local/src/accel-ppp/accel-pppd/triton/triton.c:192
        set = {__val = {516, 0 <repeats 15 times>}}
        sig = 10
        need_free = 0
        stack = 0x0
#8  0x00007fb19849bfa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
        ret = <optimized out>
        pd = <optimized out>
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140400711894784, -8357117975546649785, 140722656126414, 140722656126415, 140400711894784, 0, 8315510127604179783, 8315542574019420999}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#9  0x00007fb197fd64cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
No locals.
Detaching from program: /usr/sbin/accel-pppd, process 20044
[Inferior 1 (process 20044) detached]
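
To help read frame #0: the sketch below shows how a `list_for_each_entry`-style loop walks an intrusive list, assuming accel-ppp's macro behaves like the Linux kernel one it is named after. All type and function names here (`pd_item`, `session`, `find_item`, `list_for_each_entry_t`) are hypothetical stand-ins. It does not diagnose this crash; it only shows that the loop line itself reads `head->next` and each entry's `next` pointer, so an invalid session pointer or a damaged link would fault there.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal intrusive list in the style of the Linux kernel's list.h.
 * Assumption: accel-ppp's list_for_each_entry behaves like this. */
struct list_head { struct list_head *next, *prev; };

#define list_entry(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* The kernel macro uses typeof(); here the type is passed explicitly
 * to stay within standard C. */
#define list_for_each_entry_t(pos, head, type, member)            \
    for ((pos) = list_entry((head)->next, type, member);          \
         &(pos)->member != (head);                                \
         (pos) = list_entry((pos)->member.next, type, member))

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/* Hypothetical stand-ins for the session and its private-data list. */
struct pd_item { int key; struct list_head entry; };
struct session { struct list_head pd_list; };

/* Same shape as a find_pd-style lookup: each step dereferences
 * ses->pd_list or pos->entry.next, so a freed session or a corrupted
 * link faults on the loop line itself. */
static struct pd_item *find_item(struct session *ses, int key)
{
    struct pd_item *pos;

    assert(ses != NULL);
    list_for_each_entry_t(pos, &ses->pd_list, struct pd_item, entry) {
        if (pos->key == key)
            return pos;
    }
    return NULL;
}
```

On a healthy list the loop ends when it comes back around to the head, which is why an empty list simply returns `NULL`.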

Post by dimka88 » 09 Jul 2020, 21:06

Hi, have you had a chance to build and run accel-ppp from the master branch (https://github.com/accel-ppp/accel-ppp)?
Did you save debug logs?
