radius module hangs - stops sending requests

Radius related questions
Post Reply
zash
Posts: 12
Joined: 03 Apr 2016, 22:17

radius module hangs - stops sending requests

Post by zash »

Hi.

- Debian 8 Jessie Linux pppoe 4.4.23 #1 SMP Fri Oct 7 12:23:46 CEST 2016 x86_64 GNU/Linux
- accel-ppp version 1.11.0 amd64
- freeradius 3.0.12

Wondering if anyone experiences anything similar with their accel-ppp installations. Occasionally (every few weeks) randomly on one of my accel-ppp installations radius communication stops completely working:
The radius server itself is fine since other NASes work without problems. It only happens on my accel-ppp NASes.

1) Accel-ppp stops sending accounting updates to radius ( There's no radius packets being sent out at all from accel-ppp - tested with tcpdump )
2) New connected sessions get stuck forever like this:

Code: Select all

#accel-ppp show sessions
|          | 24:a4:3c:62:20:9 | 0.0.0.0       |              | pppoe |      | start  | 00:10:34
I can terminate it by 'accel-ppp terminate csid 24:a4:3c:62:20:9''
3) CoA works, I can change parameters of existing connections.

After enabling verbose logging and log level=5, during connection from mac '24:a4:3c:62:20':

Code: Select all

#/var/log/accel-ppp/auth-fail.log
[2017-01-19 15:47:52]: debug: eth5: lcp_layer_init
[2017-01-19 15:47:52]: debug: eth5: auth_layer_init
[2017-01-19 15:47:52]: debug: eth5: ccp_layer_init
[2017-01-19 15:47:52]: debug: eth5: ipcp_layer_init
[2017-01-19 15:47:52]: debug: eth5: ipv6cp_layer_init
[2017-01-19 15:47:52]: debug: eth5: ppp establishing
[2017-01-19 15:47:52]: debug: eth5: lcp_layer_start
[2017-01-19 15:47:52]:  info: eth5: send [LCP ConfReq id=1 <auth CHAP-md5> <mru 1492> <magic 4d74d70d>]
[2017-01-19 15:47:52]:  info: eth5: recv [LCP ConfReq id=d <mru 1492> <magic d78417f1>]
[2017-01-19 15:47:52]:  info: eth5: send [LCP ConfAck id=d ]
[2017-01-19 15:47:55]: debug: eth5: fsm timeout 9
[2017-01-19 15:47:55]:  info: eth5: send [LCP ConfReq id=1 <auth CHAP-md5> <mru 1492> <magic 4d74d70d>]
[2017-01-19 15:47:55]:  info: eth5: recv [LCP ConfAck id=1 <auth CHAP-md5> <mru 1492> <magic 4d74d70d>]
[2017-01-19 15:47:55]: debug: eth5: lcp_layer_started
[2017-01-19 15:47:55]: debug: eth5: auth_layer_start
[2017-01-19 15:47:55]:  info: eth5: send [CHAP Challenge id=1 <f610cf2387e0cc656bcca2654e27945>]
[2017-01-19 15:47:55]: debug: eth5: recv [LCP EchoReq id=0 <magic d78417f1>]
[2017-01-19 15:47:55]: debug: eth5: send [LCP EchoRep id=0 <magic 4d74d70d>]
[2017-01-19 15:47:55]:  info: eth5: recv [CHAP Response id=1 <f4b0fc83d84296dddfe6be9d42be6b>, name="user573"]
[2017-01-19 15:47:55]: debug: eth5: radius(1): queue 0x7f890019ee38
[2017-01-19 15:47:56]: debug: eth5: recv [LCP EchoReq id=1 <magic d78417f1>]
[2017-01-19 15:47:56]: debug: eth5: send [LCP EchoRep id=1 <magic 4d74d70d>]
[2017-01-19 15:47:57]: debug: eth5: recv [LCP EchoReq id=2 <magic d78417f1>]
[2017-01-19 15:47:57]: debug: eth5: send [LCP EchoRep id=2 <magic 4d74d70d>]
[2017-01-19 15:47:58]: debug: eth5: recv [LCP EchoReq id=3 <magic d78417f1>]
[2017-01-19 15:47:58]: debug: eth5: send [LCP EchoRep id=3 <magic 4d74d70d>]
[2017-01-19 15:47:59]: debug: eth5: recv [LCP EchoReq id=4 <magic d78417f1>]
[2017-01-19 15:47:59]: debug: eth5: send [LCP EchoRep id=4 <magic 4d74d70d>]
[2017-01-19 15:48:00]: debug: eth5: recv [LCP EchoReq id=5 <magic d78417f1>]
[2017-01-19 15:48:00]: debug: eth5: send [LCP EchoRep id=5 <magic 4d74d70d>]
[2017-01-19 15:48:01]: debug: eth5: recv [LCP EchoReq id=6 <magic d78417f1>]
[2017-01-19 15:48:01]: debug: eth5: send [LCP EchoRep id=6 <magic 4d74d70d>]
[2017-01-19 15:48:02]: debug: eth5: recv [LCP EchoReq id=7 <magic d78417f1>]
[2017-01-19 15:48:02]: debug: eth5: send [LCP EchoRep id=7 <magic 4d74d70d>]
[2017-01-19 15:48:03]: debug: eth5: recv [LCP EchoReq id=8 <magic d78417f1>]
[2017-01-19 15:48:03]: debug: eth5: send [LCP EchoRep id=8 <magic 4d74d70d>]

#/var/log/accel-ppp/accel-ppp.log
[2017-01-19 15:47:52]: debug: ppp226: send [LCP EchoReq id=4f <magic 08c82c8e>]
[2017-01-19 15:47:52]: debug: ppp1314: recv [LCP EchoRep id=53 <magic c2c65a45>]
[2017-01-19 15:47:52]: debug: ppp664: recv [LCP EchoRep id=a2 <magic d0cbb492>]
[2017-01-19 15:47:52]:  info: eth5: recv [PPPoE PADI 24:a4:3c:62:20:95 => ff:ff:ff:ff:ff:ff sid=0000 <Service-Name net> <Host-Uniq 000056d2>]
[2017-01-19 15:47:52]:  info: eth5: send [PPPoE PADO 00:1c:21:b1:3c:79 => 24:a4:3c:62:20:95 sid=0000 <AC-Name accel-ppp> <Service-Name net> <AC-Cookie $
[2017-01-19 15:47:52]: debug: ppp661: send [LCP EchoReq id=a2 <magic 5963e94c>]
[2017-01-19 15:47:52]: debug: ppp661: recv [LCP EchoRep id=a2 <magic f2728c1e>]
[2017-01-19 15:47:52]: debug: ppp40: send [LCP EchoReq id=b3 <magic 38479dba>]
[2017-01-19 15:47:52]: debug: ppp260: send [LCP EchoReq id=f1 <magic 38bebedd>]
[2017-01-19 15:47:52]: debug: ppp660: send [LCP EchoReq id=a2 <magic 4de64be7>]
[2017-01-19 15:47:52]:  info: eth5: recv [PPPoE PADR 24:a4:3c:62:20:95 => 00:1c:21:b1:3c:79 sid=0000 <Service-Name net> <Host-Uniq 000056d2> <AC-Cookie$
[2017-01-19 15:47:52]:  info: eth5: send [PPPoE PADS 00:1c:21:b1:3c:79 => 24:a4:3c:62:20:95 sid=1201 <AC-Name accel-ppp> <Service-Name net> <Host-Uniq $
[2017-01-19 15:47:52]: debug: eth5: lcp_layer_init
[2017-01-19 15:47:52]: debug: eth5: auth_layer_init
[2017-01-19 15:47:52]: debug: eth5: ccp_layer_init
[2017-01-19 15:47:52]: debug: eth5: ipcp_layer_init
[2017-01-19 15:47:52]: debug: eth5: ipv6cp_layer_init
[2017-01-19 15:47:52]: debug: eth5: ppp establishing
[2017-01-19 15:47:52]: debug: eth5: lcp_layer_start
[2017-01-19 15:47:52]:  info: eth5: send [LCP ConfReq id=1 <auth CHAP-md5> <mru 1492> <magic 4d74d70d>]
[2017-01-19 15:47:52]: debug: ppp660: recv [LCP EchoRep id=a2 <magic 175cb2c8>]
[2017-01-19 15:47:52]: debug: ppp221: send [LCP EchoReq id=c2 <magic 7ee7e4f2>]
[2017-01-19 15:47:52]: debug: ppp40: recv [LCP EchoRep id=b3 <magic 1cbc1d82>]
[2017-01-19 15:47:52]: debug: ppp260: recv [LCP EchoRep id=f1 <magic 747b8cfb>]
[2017-01-19 15:47:52]: debug: ppp7: recv [LCP EchoRep id=42 <magic 2dbf16cd>]
[2017-01-19 15:47:52]:  info: eth5: recv [LCP ConfReq id=d <mru 1492> <magic d78417f1>]
[2017-01-19 15:47:52]:  info: eth5: send [LCP ConfAck id=d ]
[2017-01-19 15:47:52]: debug: ppp232: send [LCP EchoReq id=95 <magic 04abb9e1>]
[2017-01-19 15:47:52]: debug: ppp866: send [LCP EchoReq id=f0 <magic 5bafd71f>]
There's a large queue request count for radius that never drops.
It is probably the problem. Radius requests are queued, and never dequeued - causing no requests to go out at all.

accel-ppp show stat

Code: Select all

accel-ppp# show stat
uptime: 34.15:50:25
cpu: 1%
mem(rss/virt): 44208/511724 kB
core:
  mempool_allocated: 6913660
  mempool_available: 962036
  thread_count: 4
  thread_active: 1
  context_count: 422
  context_sleeping: 0
  context_pending: 0
  md_handler_count: 3486
  md_handler_pending: 0
  timer_count: 1170
  timer_pending: 0
sessions:
  starting: 0
  active: 411
  finishing: 0
pppoe:
  starting: 0
  active: 411
  delayed PADO: 0
  recv PADI: 468476
  drop PADI: 0
  sent PADO: 347042
  recv PADR(dup): 40068(98)
  sent PADS: 37035
  filtered: 0
radius(1, 10.1.1.2):
  state: active
  fail count: 0
  request count: 50
  queue length: 751
  auth sent: 55539
  auth lost(total/5m/1m): 27924/0/0
  auth avg query time(5m/1m): 0/0 ms
  acct sent: 29663
  acct lost(total/5m/1m): 744/0/0
  acct avg query time(5m/1m): 0/0 ms
  interim sent: 3825115
  interim lost(total/5m/1m): 3106/0/0
  interim avg query time(5m/1m): 0/0 ms
  
I can "unblock" the radius module by changing the "req-limit" radius setting from 50 to 60 in accel-ppp.conf and reloading the config with accel-ppp reload, then the queue length gets back to 0 and everything starts working again.

This looks suspicious:
https://sourceforge.net/p/accel-ppp/cod ... req.c#l355
and this:
https://sourceforge.net/p/accel-ppp/cod ... cct.c#l386
and this: https://sourceforge.net/p/accel-ppp/cod ... erv.c#l376

Code: Select all

	if (req->try++ == conf_max_try) {
	  ...

Code: Select all

  if (req->try == conf_max_try) {
    ...
It probably should be using

Code: Select all

if (... >= conf_max_try)
condition in those locations. If for whatever reason the req->try variable gets bigger then the radius request queue will never be cleared.

accel-ppp.conf used:

Code: Select all

[modules]
log_file
#log_syslog
pppoe
auth_chap_md5
radius
shaper
#connlimit

[core]
log-error=/var/log/accel-ppp/core.log
thread-count=4

[common]
#single-session=deny

[ppp]
verbose=1
min-mtu=1280
max-mtu=1492
#mtu=1492
#mru=1492
ipv4=require
ipv6=deny
ipv6-intf-id=0:0:0:1
ipv6-peer-intf-id=0:0:0:2
ipv6-accept-peer-intf-id=1
lcp-echo-interval=12
lcp-echo-timeout=60
unit-cache=1
mppe=deny
ccp=0

[pppoe]
verbose=1
called-sid=mac
interface=eth5
interface=eth5.2011
pado-delay=0,50:500,60:550,70:600,80:650,90:700,100:750,110:800,120:850,130:900,140:950,150:1000,160:1050,170:1100,180:1150,190:1200,200:1250,210:1300,220:1350,230:1400,240:1450,250:1500,260:1550,270:1600,280:1650,290:1700,300:1750,310:1800,320:1850,330:1900,340:1950,350:2000,360:2050,370:2100,380:2150,390:2200,400:2250,410:2300,420:2350,430:2400,440:2450,450:2500,460:2550,470:2600,480:2650,490:2700,500:2750,510:2800,520:2850,530:2900,540:2950,550:3000
mppe=deny

[dns]
dns1=8.8.4.4
dns2=8.8.8.8

[radius]
nas-identifier=accel-ppp
nas-ip-address=10.1.1.2
gw-ip-address=10.1.1.1
server=10.1.1.2,4LTPv4BJ8gfdgffg324243oFDosdf,auth-port=1812,acct-port=1813,req-limit=50,fail-timeout=0,max-fail=10,weight=1
dae-server=10.1.1.2:3799,hFdxbwfnLDFFDSfs234oiOQdefdLS
acct-interim-interval=600
acct-timeout=0
verbose=0

[log]
log-file=/var/log/accel-ppp/accel-ppp.log
log-emerg=/var/log/accel-ppp/emerg.log
log-fail-file=/var/log/accel-ppp/auth-fail.log
#log-debug=/dev/stdout
#syslog=accel-pppd,daemon
#log-tcp=127.0.0.1:3000
copy=1
#color=1
#per-user-dir=per_user
#per-session-dir=per_session
#per-session=1
level=3

[shaper]
time-range=1,11:00-23:00
time-range=2,23:00-11:00
attr=Filter-Id
up-limiter=police
down-limiter=tbf
verbose=1
#down-limiter=htb
#leaf-qdisc=fq_codel 10240 1024 5 100 1514 ecn

[cli]
verbose=1
telnet=10.1.1.2:5010
tcp=127.0.0.1:2001
password=qPpLgfdFSfdsfds32423122Fgfd
history-file=/dev/null

[snmp]
master=0
agent-name=accel-ppp

[connlimit]
limit=10/min
burst=3
timeout=60
kompex
Posts: 5
Joined: 30 Jun 2017, 07:00

Re: radius module hangs - stops sending requests

Post by kompex »

edit: apperently using connlimit module fixes this issue
edsonag
Posts: 19
Joined: 22 Mar 2018, 20:30

Re: radius module hangs - stops sending requests

Post by edsonag »

Hi,
I am having a problem similar to that reported in this post.
There are connections that start as "start" in the state field, and do not evolve to "active".
I changed all the settings described here and it did not work.
One detail is that it happened only with d-link branded routers.
Any new suggestions?

Thanks
Post Reply