Packet loss when using bird with large internet tables and accel-pppd.

kompex
Posts: 5
Joined: 30 Jun 2017, 07:00

Packet loss when using bird with large internet tables and accel-pppd.

Post by kompex » 06 Jul 2022, 11:55

Hi,

Some background:

I have an 8 core CPU.
I have the usual optimizations applied on the machine, such as spectre/meltdown/retpoline mitigations disabled and some NIC offloads disabled, as per the guide in the accel-ppp docs.
I have 6 RSS queues on the NIC, assigned to cores 0 to 5 via /proc/irq/XXX/smp_affinity_list
I have RPS enabled to spread PPPoE load across CPUs 0 to 5
Accel-pppd is using 6 threads, which I have pinned to cores 0-5 via taskset -cp 0-5 $(pgrep accel-pppd).
Bird is pinned to the last *unused* CPU core: taskset -cp 7 $(pidof bird)
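For illustration, the pinning steps above might look roughly like this as a script. The IRQ numbers (120-125) are placeholders for whatever `grep enp1s0f1 /proc/interrupts` actually shows, and all of the writes need root:

```shell
# Pin the 6 NIC RSS queue IRQs to cores 0-5 (IRQ numbers are placeholders;
# find the real ones with: grep enp1s0f1 /proc/interrupts)
core=0
for irq in 120 121 122 123 124 125; do
    echo "$core" > "/proc/irq/$irq/smp_affinity_list"
    core=$((core + 1))
done

# Enable RPS on every RX queue so PPPoE load is spread across cores 0-5
# (mask 0x3f = binary 111111 = CPUs 0-5)
for q in /sys/class/net/enp1s0f1/queues/rx-*/rps_cpus; do
    echo 3f > "$q"
done

# Pin accel-pppd worker threads to cores 0-5 and bird to core 7
taskset -cp 0-5 "$(pgrep accel-pppd)"
taskset -cp 7 "$(pidof bird)"
```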

At this point, I just distribute client ip addresses via Bird and receive default routes.

Everything works perfectly with about 1500-2000 PPPoE interfaces and 0% packet loss.

Since I have two BGP routers upstream of my PPPoE servers, I wanted to receive full internet tables (just /24..8 prefixes) so that packets would usually exit the PPPoE servers towards the correct BGP router, since the servers would then have the exact prefix and path. With only default routes, packets exit via a router chosen by a calculated hash, effectively at random, jumping between the two.
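To illustrate the difference, the two situations correspond to kernel routing entries roughly like these (addresses are placeholders; this is only a sketch of the behaviour described, not the poster's actual config):

```shell
# Two equal-cost default routes: the kernel hashes each flow and picks
# one of the two upstream BGP routers, effectively at random per flow.
ip route add default \
    nexthop via 192.0.2.1 weight 1 \
    nexthop via 192.0.2.2 weight 1

# With full tables, an exact prefix learned from one router wins by
# longest-prefix match, so traffic for that prefix always exits via
# the router that advertised it, regardless of the ECMP hash.
ip route add 198.51.100.0/24 via 192.0.2.1
```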

So with that in place I notice that Bird's CPU utilization periodically jumps to 100% for a short moment; before, it was maybe 20%. This is expected, since it has much more to calculate, and should not be a problem because I pinned it to a dedicated core for route calculation.

BUT at the same time, the accel-pppd process also jumps to about 80% CPU utilization for a short moment, which surprises me.

This causes measurable packet loss, not huge, but about 0.05% to 0.1% across the entire machine for all PPPoE clients.

The question is: why would accel-pppd have high CPU spikes in tandem with Bird calculating routes, especially since I have Bird and accel-pppd pinned to different CPU cores?

Of course, when I leave just default routes in Bird, accel-pppd doesn't show any CPU spikes while route calculation takes place.

It is as if some interface scanning/route calculation performed by Bird is locking Accel-pppd for a brief moment when the route count is very large.

Anything I can do to optimize/prevent this? The machine is powerful enough and should run with no problem even with the entire internet routing table since Bird is fast at it and I set a dedicated core for doing route calculations, but unfortunately for some reason it causes accel-pppd CPU spikes as well.

I use Kernel 4.19, Bird 2.0.10 and latest accel-pppd from Git on Debian 11.

Bird config (since it's more relevant in this case than the standard accel-pppd config):


protocol kernel {
    merge paths yes;
    learn yes;
    ipv4 {
        import all;
        export all;
    };
}

protocol device {
}

protocol direct {
    ipv4;
}

protocol ospf {
    tick 2;
    ecmp yes;
    merge external yes;

    ipv4 {
        export none;
        import all;
    };

    area 0.0.0.0 {

        interface "dummy0" {
            stub;
        };

        interface "enp1s0f1" {
            dead count 4;
        };
    };
}

protocol bgp rr_pppoe_serv1_to_bgp1 {
    local as 1111;
    neighbor <loopback ip of bgp1> as 1111;
    source address <loopback ip of pppoe serv 1>;

    ipv4 {
        next hop self;
        export filter {
            if source = RTS_DEVICE || source = RTS_STATIC_DEVICE || source = RTS_STATIC || source = RTS_INHERIT then {
                accept;
            }
            reject;
        };
        import all;
    };
}

protocol bgp rr_pppoe_to_bgp2 {
    local as 1111;
    neighbor <loopback ip of bgp2> as 1111;
    source address <loopback ip of pppoe serv>;

    ipv4 {
        next hop self;
        export filter {
            if source = RTS_DEVICE || source = RTS_STATIC_DEVICE || source = RTS_STATIC || source = RTS_INHERIT then {
                accept;
            }
            reject;
        };
        import all;
    };
}



dimka88
Posts: 823
Joined: 13 Oct 2014, 05:51
Contact:

Re: Packet loss when using bird with large internet tables and accel-pppd.

Post by dimka88 » 09 Jul 2022, 10:35

Hi @kompex, did you apply the modifications to reduce systemd ifquery overhead?
https://accel-ppp.readthedocs.io/en/lat ... timization
https://accel-ppp.readthedocs.io/en/lat ... imizations

Try increasing the device protocol's scan time in the bird config to 100 or higher.
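For example, in the bird config this would be (100 seconds here is just the suggested starting value):

```
protocol device {
    scan time 100;    # rescan interfaces every 100 seconds
}
```

A longer scan interval means Bird walks the (very large) interface list less often, which is relevant here given the 1500-2000 PPPoE interfaces.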

