PPPoE session still open after client connects to different server
Posted: 11 Aug 2020, 22:36
I've run into a situation that looks like a bug to me. It happens in a redundant / load-balancing setup with two (or more) PPPoE servers on the same network. I never saw it during the long period when I ran a similar setup with two MikroTik PPPoE servers on the same network, with mostly the same set of a few hundred clients.
Most of the time it works fine, but a few clients seem to be stuck connected to both PPPoE servers at the same time. Both servers periodically send LCP echo requests to the client, and both get LCP echo replies. This continues indefinitely (for hours): even though the client really has a PPPoE session with only one server, it still responds to LCP echo requests for the no-longer-active session, keeping it alive at the other server. Since I assign a static IP to each PPPoE username, the same IP ends up on both PPPoE servers, which "redistribute connected" into dynamic routing protocols (previously OSPF, now iBGP following the recent BCP documents), and this causes all sorts of problems for that client (they report the Internet disappearing frequently, some web pages failing to load, etc., as part of the traffic goes through the inactive PPPoE session).
When I looked at the traffic with tcpdump, I saw these suspicious things:
- magic number is the same in LCP echo replies to both servers
- session IDs have only 10 bits really used, low 6 bits always zero
Looking at the code, I see the check for a looped-back link is done as in RFC 1661, but this part does not seem to be implemented:
"Reception of a Magic-Number other than the negotiated local Magic-Number, the peer's negotiated Magic-Number, or zero if the peer didn't negotiate one, indicates a link which has been (mis)configured for communications with a different peer."
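To make the quoted rule concrete, here is a minimal sketch of how a received Magic-Number in an LCP echo reply could be classified. This is not accel-ppp code; the function name, parameter names and return strings are made up for illustration, and `None` stands for "Magic-Number not negotiated". Note that this check can only catch the stale session if the client picks a different magic number for its new session than the one it negotiated with the old server, which is an assumption about the client.

```python
from typing import Optional


def classify_echo_magic(received: int,
                        local_magic: Optional[int],
                        peer_magic: Optional[int]) -> str:
    """Classify the Magic-Number field of a received LCP Echo-Reply.

    Hypothetical helper following the RFC 1661 wording:
    - equal to our own negotiated magic  -> link is looped back
    - equal to the peer's negotiated magic (or zero if the peer
      did not negotiate one)             -> expected peer
    - anything else                      -> link talks to a different peer
    """
    if local_magic is not None and received == local_magic:
        return "looped-back"
    expected = peer_magic if peer_magic is not None else 0
    if received == expected:
        return "ok"
    return "different-peer"


if __name__ == "__main__":
    # The old server negotiated 0xAAAA0001 with the client, but the client
    # now answers with the magic of its new session on the other server.
    print(classify_echo_magic(0xBBBB0002, 0x11112222, 0xAAAA0001))
```

A server that gets "different-peer" for a session could then terminate it instead of treating the echo reply as proof that the session is alive.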
Also, the probability of a duplicate session ID (with only 10 of 16 bits used) is quite high. I'm not sure whether this is also a buggy client (it should probably respond only to LCP echo requests from the MAC address of the server with which it has an active session, and also check the magic number), but we can't really change the clients (customers are free to use all sorts of their own cheap routers as long as they do PPPoE). Perhaps it could at least be fixed at the accel-ppp end. MikroTik seems to get this right (of course I haven't seen the code, but I ran such a redundant setup for a few years and never saw a stuck duplicate PPPoE session on two servers that wouldn't clear by itself).
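As a rough back-of-the-envelope check of the session ID point: assuming (and this is only an assumption, I have not verified how the IDs are actually allocated) that each server picks session IDs uniformly at random and independently, the chance that a client's new session gets the same ID as its stale session on the other server is 1 in 2^bits per reconnect. The function name below is made up for illustration.

```python
def collision_probability(id_bits: int, reconnects: int) -> float:
    """Chance that at least one of `reconnects` independent reconnect
    events lands on the same session ID as the stale session, assuming
    IDs are uniform over 2**id_bits values (an idealisation)."""
    per_event = 1.0 / (2 ** id_bits)
    return 1.0 - (1.0 - per_event) ** reconnects


if __name__ == "__main__":
    for bits in (10, 16):
        print(bits, "bits:",
              "per reconnect =", collision_probability(bits, 1),
              "over 1000 reconnects =", collision_probability(bits, 1000))
```

Under that assumption, 10 effective bits make a match 64 times more likely per reconnect than 16 bits would, and over around a thousand reconnects across all clients a match somewhere becomes more likely than not.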