Mass disconnects under high load
Posted: 01 Jul 2020, 09:44
I'm using accel-ppp 1.12.0, which ships with VyOS 1.2.5, running on a PC Engines apu4d4.
Works fine most of the time, but I was getting occasional mass disconnects of PPPoE clients.
One thing the VyOS people suggested, which helped but did not fully fix it, was to set acct-timeout=0 in the [radius] section, so a session is no longer dropped when an interim accounting update gets no response (better to lose an accounting record now and then than to disconnect the customer, which is what happens with the default acct-timeout=3).
But I still managed to accidentally trigger the issue again, by rebooting some device in the network that caused some clients to disconnect.
Those disconnected clients that were behind the rebooted device have "Acct-Terminate-Cause = Lost-Carrier" - as expected (we didn't get LCP Echo responses from them, because the network between us and them was down for a minute or two).
Then they all try to connect again at roughly the same time, causing a spike in CPU load. This, in turn, causes many more clients (completely unrelated, in different parts of the network - but not all of them) to disconnect, this time with "Acct-Terminate-Cause = User-Request".
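To tell the two waves apart, it may help to count accounting Stop records per minute and per Acct-Terminate-Cause. Below is a minimal Python sketch of that idea. The function name tally_by_minute and the (timestamp, cause) tuple format are my own assumptions for illustration, and the sample records are made up; how the records are extracted from the RADIUS server is not shown.

```python
from collections import Counter, defaultdict
from datetime import datetime


def tally_by_minute(records):
    """Count terminate causes per minute.

    `records` is an iterable of (datetime, cause) tuples; this input
    format is hypothetical, not something accel-ppp or RADIUS emits.
    """
    buckets = defaultdict(Counter)
    for stamp, cause in records:
        # Truncate the timestamp to the start of its minute.
        minute = stamp.replace(second=0, microsecond=0)
        buckets[minute][cause] += 1
    return dict(buckets)


if __name__ == "__main__":
    # Made-up sample values, only to show the call.
    sample = [
        (datetime(2020, 7, 1, 9, 0, 5), "Lost-Carrier"),
        (datetime(2020, 7, 1, 9, 0, 40), "Lost-Carrier"),
        (datetime(2020, 7, 1, 9, 2, 10), "User-Request"),
    ]
    for minute, causes in sorted(tally_by_minute(sample).items()):
        print(minute.isoformat(), dict(causes))
```

If the suspicion is right, a burst of Lost-Carrier records should show up first, followed a minute or two later by a burst of User-Request records.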
I suspect that at this point accel-ppp is too busy with all the new connections and fails to respond in time to LCP Echo requests from clients that are still connected.
Those clients then disconnect and reconnect, which increases the CPU load further and causes even more disconnects - a runaway condition.
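The feedback loop I suspect can be illustrated with a toy model in Python. This is only a sketch of the hypothesis, not a description of how accel-ppp actually schedules work: the function simulate, the per-tick capacity, the cost of an echo versus a new connection, and the share of "fragile" clients with short LCP timeouts are all invented numbers. The model shows how the cascade starts and sustains itself; it does not try to reproduce the eventual recovery.

```python
def simulate(total=1000, initial_drop=200, capacity=300.0,
             echo_cost=0.1, connect_cost=2.0,
             fragile_fraction=0.5, ticks=20):
    """Toy model of a reconnect storm.

    Returns the number of connected clients after each tick.
    All parameters are hypothetical and chosen only for illustration.
    """
    connected = total - initial_drop
    reconnecting = initial_drop
    history = []
    for _ in range(ticks):
        # Work demanded this tick: echoes are cheap, new sessions are not.
        load = connected * echo_cost + reconnecting * connect_cost
        # Share of the work that does not fit (0.0 while load <= capacity).
        overload = max(0.0, 1.0 - capacity / load) if load > 0 else 0.0
        # Reconnecting clients that get through this tick.
        served = int(reconnecting * (1.0 - overload))
        # Connected clients with short LCP timeouts that give up.
        dropped = int(connected * fragile_fraction * overload)
        connected += served - dropped
        reconnecting += dropped - served
        history.append(connected)
    return history


if __name__ == "__main__":
    print("small outage:", simulate(initial_drop=50)[:5])
    print("large outage:", simulate(initial_drop=200)[:5])
```

With these made-up numbers, a small outage is absorbed in one tick, while a larger one pushes the load over capacity, so clients that were never affected by the outage start dropping too.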
It recovers after some time, and everything works again. Some clients stay connected for days; I suspect those have longer LCP timeouts (I have no control over that - customers use all sorts of different SOHO routers, mostly cheap ones).
Has anyone else seen similar issues?