Mass disconnects under high load

marekm
Posts: 9
Joined: 09 Jun 2015, 11:01

Mass disconnects under high load

Post by marekm » 01 Jul 2020, 09:44

I'm using accel-ppp 1.12.0 that comes as part of VyOS 1.2.5 - running on PC Engines apu4d4.
Works fine most of the time, but I was getting occasional mass disconnects of PPPoE clients.
One thing suggested by the VyOS people that helped a bit (but not completely yet) was to set [radius] acct-timeout=0, so a session is not dropped when we don't get a response to an interim update. Better to lose an accounting record sometimes than to disconnect the customer, which is what happens with the default acct-timeout=3.
But I still managed to accidentally trigger the issue again, by rebooting some device in the network that caused some clients to disconnect.
Those disconnected clients that were behind the rebooted device have "Acct-Terminate-Cause = Lost-Carrier" - as expected (we didn't get LCP Echo responses from them, because the network between us and them was down for a minute or two).
Then they all try to connect again at roughly the same time, causing a spike in CPU load. This, in turn, causes many more clients (completely unrelated, in different parts of the network - but not all of them) to disconnect, this time with "Acct-Terminate-Cause = User-Request".
I suspect this time accel-ppp is too busy with all the new connections, and fails to respond to LCP Echo from existing connected clients.
This causes clients to disconnect and then connect again, which increases CPU load further, and so on - resulting in even more disconnects, a runaway condition.
It recovers after some time, and everything works again. Some clients stay connected for days; I suspect those have longer LCP timeouts (I have no control over that - customers use all sorts of different SOHO routers, mostly cheap ones).
Has anyone else seen similar issues?

dimka88
Posts: 722
Joined: 13 Oct 2014, 05:51

Re: Mass disconnects under high load

Post by dimka88 » 02 Jul 2020, 14:34

Hello @marekm, I think you need to debug this situation in detail using the log files. I'm not sure, but can you try adding `[ppp] unit-cache=n`?
https://accel-ppp.readthedocs.io/en/lat ... n/ppp.html

hashbang
Posts: 77
Joined: 12 Jul 2015, 10:28

Re: Mass disconnects under high load

Post by hashbang » 06 Jul 2020, 11:39

Hi,
looks like we are in the same boat. I'm experiencing the same problem, but on a Dell server with 2 x Xeon 2650.
Can you check your uptime with accel-cmd show stat when this happens? I've seen the service restart under heavy load caused by network flaps.
Thanks and regards

marekm
Posts: 9
Joined: 09 Jun 2015, 11:01

Re: Mass disconnects under high load

Post by marekm » 07 Jul 2020, 12:05

It's a bit difficult to debug without making customers angry, especially now that everyone is working remotely etc.
VyOS people also suggested the unit-cache option, we'll see.

In the meantime I'm also seeing another issue: adaptive LCP echo sometimes has the unwanted effect of keeping alive stale PPPoE sessions that were long gone at the client. The other end (various cheap SOHO routers I have no control over) keeps reconnecting, and this counts as peer activity that prevents the server from closing the stale session. It becomes especially funny when you have a failover/load-balanced setup (it worked with a pair of MikroTik PPPoE servers) and both servers appear to have sessions open to the same client with the same IP, which is then redistributed by OSPF or iBGP, messing it all up. So the adaptive LCP echo code probably should check that the client activity really belongs to the same session we think we still have open.

marekm
Posts: 9
Joined: 09 Jun 2015, 11:01

Re: Mass disconnects under high load

Post by marekm » 21 Jul 2020, 20:32

After some more tuning... we'll see. Testing is not easy in a production setup, as I don't want to make the customers angrier than they already are.

unit-cache= - disabled again. I suspect (not confirmed yet) it might have a bad effect on dynamic routing protocols: I redistribute connected PPPoE routes via iBGP, and routes to interfaces that were disconnected and gone are still present in the routing table. I see different routes to the same IP where only one is correct, but the wrong one is selected.

thread-count=4 - had to set this manually, because the VyOS config script divided the number of cores by 2 (probably to account for Intel HT, but here on AMD I have no HT, just 4 real cores). I just hope it will be a bit faster to destroy lots of ppp interfaces using 4 cores than it was with 2.

lcp-echo-interval=10, lcp-echo-failure=5 - no adaptive LCP echo as that seems to keep alive sessions that are long gone at the client

I also hacked the FreeRADIUS config to set NAS-Port-Id = "ppp-%{Stripped-User-Name}"; this should also avoid having multiple ppp interfaces with the same IP.

It would be nice for accel-ppp to have an option like MikroTik's one-session-per-host: "Allow only one session per host (determined by MAC address). If a host tries to establish a new session, the old one will be closed." (single-session=replace is determined by username; sometimes this results in repeatedly disconnecting and reconnecting when we think the session is still open but it is long gone at the client).
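
For anyone following along, the accel-ppp settings discussed in this thread might look roughly like the fragment below in accel-ppp.conf. This is only a sketch: the option names and values come from the posts above, but the section placement ([core] for thread-count, [ppp] for the LCP options, [radius] for acct-timeout) is my reading of the accel-ppp documentation and should be checked against your version. On VyOS the file is produced by the VyOS config scripts, so the values would normally be set through the VyOS configuration rather than edited by hand.

```ini
# Sketch only - verify section names against the accel-ppp docs for your version.

[core]
# 4 real cores on the apu4d4, no HT
thread-count=4

[ppp]
# Fixed (non-adaptive) LCP echo: roughly 10 s x 5 missed replies before a
# dead peer is dropped. Assumption: lcp-echo-timeout is the option that
# enables adaptive LCP echo, so it is deliberately left out here.
lcp-echo-interval=10
lcp-echo-failure=5
# unit-cache is left unset (disabled again, see above)

[radius]
# Do not drop the session when an interim accounting update gets no response
acct-timeout=0
```

With these values a session whose peer has silently gone away should be cleaned up after about 50 seconds of unanswered echoes, instead of being kept alive by unrelated activity from the same client.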

dimka88
Posts: 722
Joined: 13 Oct 2014, 05:51

Re: Mass disconnects under high load

Post by dimka88 » 24 Jul 2020, 11:36

Hi @marekm, you can add a feature request on the accel-ppp Phabricator: https://phabricator.accel-ppp.org/
