2013-06-07 12:06:34

by Vitaly V. Bursov

Subject: Scaling problem with a lot of AF_PACKET sockets on different interfaces

Hello,

I have a Linux router with a lot of interfaces (hundreds or
thousands of VLANs) and an application that creates an AF_PACKET
socket per interface and bind()s each socket to its interface.

Each socket also has a BPF filter attached.
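
For reference, the per-interface setup looks roughly like the sketch below
(a minimal sketch only: the function name, the ETH_P_IP protocol choice and
the trivial "accept everything" filter are placeholders, not the actual
application code):

/* Minimal sketch: one AF_PACKET socket bound to a single interface with
 * a placeholder BPF filter attached.  The real application repeats this
 * per VLAN interface and uses a real (DHCP) filter. */
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <linux/filter.h>
#include <sys/socket.h>

static int open_bound_socket(const char *ifname)
{
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
        if (fd < 0)
                return -1;

        struct sockaddr_ll sll;
        memset(&sll, 0, sizeof(sll));
        sll.sll_family   = AF_PACKET;
        sll.sll_protocol = htons(ETH_P_IP);
        sll.sll_ifindex  = if_nametoindex(ifname);
        if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
                close(fd);
                return -1;
        }

        /* Placeholder filter: return the whole packet unconditionally. */
        struct sock_filter code[] = {
                { BPF_RET | BPF_K, 0, 0, 0xffffffff },
        };
        struct sock_fprog prog = { .len = 1, .filter = code };
        if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
                       &prog, sizeof(prog)) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}

Each such bound socket registers a 0800/packet_rcv entry for its device in
the kernel's ptype hash, which is exactly what the /proc/net/ptype listing
further down shows.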

The problem is observed on linux-3.8.13, but as far as I can see
from the source, the latest version behaves the same way.

I noticed that the box has strange performance problems, with
most of the CPU time spent in __netif_receive_skb:
86.15% [k] __netif_receive_skb
1.41% [k] _raw_spin_lock
1.09% [k] fib_table_lookup
0.99% [k] local_bh_enable_ip

and this is the assembly with the "hot spot":
│ shr $0x8,%r15w
│ and $0xf,%r15d
0.00 │ shl $0x4,%r15
│ add $0xffffffff8165ec80,%r15
│ mov (%r15),%rax
0.09 │ mov %rax,0x28(%rsp)
│ mov 0x28(%rsp),%rbp
0.01 │ sub $0x28,%rbp
│ jmp 5c7
1.72 │5b0: mov 0x28(%rbp),%rax
0.05 │ mov 0x18(%rsp),%rbx
0.00 │ mov %rax,0x28(%rsp)
0.03 │ mov 0x28(%rsp),%rbp
5.67 │ sub $0x28,%rbp
1.71 │5c7: lea 0x28(%rbp),%rax
1.73 │ cmp %r15,%rax
│ je 640
1.74 │ cmp %r14w,0x0(%rbp)
│ jne 5b0
81.36 │ mov 0x8(%rbp),%rax
2.74 │ cmp %rax,%r8
│ je 5eb
1.37 │ cmp 0x20(%rbx),%rax
│ je 5eb
1.39 │ cmp %r13,%rax
│ jne 5b0
0.04 │5eb: test %r12,%r12
0.04 │ je 6f4
│ mov 0xc0(%rbx),%eax
│ mov 0xc8(%rbx),%rdx
│ testb $0x8,0x1(%rdx,%rax,1)
│ jne 6d5

This corresponds to:

net/core/dev.c:
type = skb->protocol;
list_for_each_entry_rcu(ptype,
                &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
        if (ptype->type == type &&
            (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
             ptype->dev == orig_dev)) {
                if (pt_prev)
                        ret = deliver_skb(skb, pt_prev, orig_dev);
                pt_prev = ptype;
        }
}

This works perfectly OK until there are a lot of AF_PACKET sockets, since
each such socket adds an entry to the ptype list:

# cat /proc/net/ptype
Type Device Function
0800 eth2.1989 packet_rcv+0x0/0x400
0800 eth2.1987 packet_rcv+0x0/0x400
0800 eth2.1986 packet_rcv+0x0/0x400
0800 eth2.1990 packet_rcv+0x0/0x400
0800 eth2.1995 packet_rcv+0x0/0x400
0800 eth2.1997 packet_rcv+0x0/0x400
.......
0800 eth2.1004 packet_rcv+0x0/0x400
0800 ip_rcv+0x0/0x310
0011 llc_rcv+0x0/0x3a0
0004 llc_rcv+0x0/0x3a0
0806 arp_rcv+0x0/0x150

And this obviously results in a huge performance penalty.

ptype_all, by the looks of it, has the same problem.

Probably one way to fix this is to perform interface name matching in the
af_packet handler, but there could be other cases and other protocols.

Ideas are welcome :)

--
Thanks
Vitaly


2013-06-07 12:41:17

by Mike Galbraith

Subject: Re: Scaling problem with a lot of AF_PACKET sockets on different interfaces

(CC's net-fu dojo)

On Fri, 2013-06-07 at 14:56 +0300, Vitaly V. Bursov wrote:
> Hello,
>
[...]
> Ideas are welcome :)
>

2013-06-07 13:05:39

by Daniel Borkmann

Subject: Re: Scaling problem with a lot of AF_PACKET sockets on different interfaces

On 06/07/2013 02:41 PM, Mike Galbraith wrote:
> (CC's net-fu dojo)
>
> On Fri, 2013-06-07 at 14:56 +0300, Vitaly V. Bursov wrote:
[...]
>> Ideas are welcome :)

Probably, that depends on _your scenario_ and/or BPF filter, but would it be
an alternative if you have only a few packet sockets (maybe one pinned to each
cpu) and cluster/load-balance them together via packet fanout? (Where you
bind the socket to ifindex 0, so that you get traffic from all devs...) That
would at least avoid that "hot spot", and you could post-process the interface
via sockaddr_ll. But I'd agree that this will not solve the actual problem you've
observed. ;-)
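
For illustration, a minimal sketch of the setup described above: a few
packet sockets bound to ifindex 0 and clustered into one fanout group.
The function name, the group id 42 and the PACKET_FANOUT_CPU policy are
placeholders chosen for the sketch, not something prescribed in the thread:

/* Sketch: a packet socket bound to ifindex 0 (all devices), joined into
 * a PACKET_FANOUT group; create one per CPU and reuse the same group id. */
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <sys/socket.h>

static int open_fanout_socket(int group_id)
{
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
        if (fd < 0)
                return -1;

        struct sockaddr_ll sll;
        memset(&sll, 0, sizeof(sll));
        sll.sll_family   = AF_PACKET;
        sll.sll_protocol = htons(ETH_P_IP);
        sll.sll_ifindex  = 0;            /* 0 = receive from all interfaces */
        if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
                close(fd);
                return -1;
        }

        /* Join the fanout group; PACKET_FANOUT_CPU keeps packets on the
         * CPU that received them. */
        int fanout_arg = group_id | (PACKET_FANOUT_CPU << 16);
        if (setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
                       &fanout_arg, sizeof(fanout_arg)) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}

The originating interface of each packet is then available in the
sockaddr_ll filled in by recvfrom() on these sockets, as Daniel notes.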

2013-06-07 13:31:43

by David Laight

Subject: RE: Scaling problem with a lot of AF_PACKET sockets on different interfaces

> > I have a Linux router with a lot of interfaces (hundreds or
> > thousands of VLANs) and an application that creates AF_PACKET
> > socket per interface and bind()s sockets to interfaces.
...
> > I noticed that box has strange performance problems with
> > most of the CPU time spent in __netif_receive_skb:
> > 86.15% [k] __netif_receive_skb
> > 1.41% [k] _raw_spin_lock
> > 1.09% [k] fib_table_lookup
> > 0.99% [k] local_bh_enable_ip
...
> > This corresponds to:
> >
> > net/core/dev.c:
> > type = skb->protocol;
> > list_for_each_entry_rcu(ptype,
> > &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
> > if (ptype->type == type &&
> > (ptype->dev == null_or_dev || ptype->dev == skb->dev ||
> > ptype->dev == orig_dev)) {
> > if (pt_prev)
> > ret = deliver_skb(skb, pt_prev, orig_dev);
> > pt_prev = ptype;
> > }
> > }
> >
> > Which works perfectly OK until there are a lot of AF_PACKET sockets, since
> > the socket adds a protocol to ptype list:

Presumably the 'ethertype' is the same for all the sockets?
(And probably the '& PTYPE_HASH_MASK' doesn't separate it from 0800
or 0806 (IIRC IP and ARP).)

How often is that deliver_skb() inside the loop called?
If the code could be arranged so that the scan loop didn't contain
a function call then the loop code would be a lot faster since
the compiler can cache values in registers.
While that would speed the code up somewhat, there would still be a
significant cost to iterate 1000+ times.

Looks like the ptype_base[] should be per 'dev'?
Or just put entries where ptype->dev != null_or_dev on a per-interface
list and do two searches?

David


2013-06-07 13:54:53

by Eric Dumazet

Subject: RE: Scaling problem with a lot of AF_PACKET sockets on different interfaces

On Fri, 2013-06-07 at 14:30 +0100, David Laight wrote:

> Looks like the ptype_base[] should be per 'dev'?
> Or just put entries where ptype->dev != null_or_dev on a per-interface
> list and do two searches?

Yes, but then we would have two searches instead of one in fast path.

ptype_base[] is currently 16 slots, 256 bytes on x86_64.
Presumably the per device list could be a single list, instead of a hash
table, but still...

If the application creating hundreds or thousands of AF_PACKET sockets is
a single process, I really question why a single AF_PACKET socket was not
chosen.

We now have a FANOUT capability on AF_PACKET, so it is scalable to
millions of packets per second.

I would rather try this way before adding yet another section to
__netif_receive_skb().


2013-06-07 14:09:49

by David Laight

Subject: RE: Scaling problem with a lot of AF_PACKET sockets on different interfaces

> > Looks like the ptype_base[] should be per 'dev'?
> > Or just put entries where ptype->dev != null_or_dev on a per-interface
> > list and do two searches?
>
> Yes, but then we would have two searches instead of one in fast path.

Usually it would be empty - so the search would be very quick!

David


2013-06-07 14:17:38

by Vitaly V. Bursov

Subject: Re: Scaling problem with a lot of AF_PACKET sockets on different interfaces

07.06.2013 16:05, Daniel Borkmann wrote:
[...]
>
> Probably, that depends on _your scenario_ and/or BPF filter, but would it be
> an alternative if you have only a few packet sockets (maybe one pinned to each
> cpu) and cluster/load-balance them together via packet fanout? (Where you
> bind the socket to ifindex 0, so that you get traffic from all devs...) That
> would at least avoid that "hot spot", and you could post-process the interface
> via sockaddr_ll. But I'd agree that this will not solve the actual problem you've
> observed. ;-)

I wasn't aware of the ifindex 0 thing, it can help, thanks! Of course, if it
works for me (the application is a custom DHCP server) it will surely
increase the overhead of BPF (I don't need to tap the traffic from all
interfaces), and there are VLANs, bridges and bonds - likely the server will
receive the same packets multiple times, and replies must be sent too...
but it still should be faster.
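
On the "replies must be sent too" part: even with sockets bound to
ifindex 0, a reply can be sent out of the interface a request arrived on by
filling sll_ifindex from the received sockaddr_ll. A minimal sketch only
(the handler name is a placeholder and the received frame is echoed back as
a stand-in; building a real DHCP reply is omitted):

/* Sketch: receive on a socket bound to ifindex 0, then send a reply out
 * of the interface the packet arrived on. */
#include <string.h>
#include <sys/types.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <sys/socket.h>

static void handle_one(int fd)
{
        unsigned char frame[2048];
        struct sockaddr_ll from;
        socklen_t fromlen = sizeof(from);

        ssize_t len = recvfrom(fd, frame, sizeof(frame), 0,
                               (struct sockaddr *)&from, &fromlen);
        if (len < 0)
                return;

        /* from.sll_ifindex identifies the ingress interface. */
        struct sockaddr_ll to;
        memset(&to, 0, sizeof(to));
        to.sll_family   = AF_PACKET;
        to.sll_protocol = htons(ETH_P_IP);
        to.sll_ifindex  = from.sll_ifindex;   /* reply on the same interface */

        /* With SOCK_RAW the buffer must already contain the full Ethernet
         * header; the address only selects the output device. */
        sendto(fd, frame, len, 0, (struct sockaddr *)&to, sizeof(to));
}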

I just checked isc-dhcpd-V3.1.3 running on multiple interfaces
(another system with 2.6.32):
$ cat /proc/net/ptype
Type Device Function
ALL eth0 packet_rcv_spkt+0x0/0x190
ALL eth0.10 packet_rcv_spkt+0x0/0x190
ALL eth0.11 packet_rcv_spkt+0x0/0x190
....

As I understand it, this will hit this code:
list_for_each_entry_rcu(ptype, &ptype_all, list) {
        if (!ptype->dev || ptype->dev == skb->dev) {
                if (pt_prev)
                        ret = deliver_skb(skb, pt_prev, orig_dev);
                pt_prev = ptype;
        }
}
which scales the same.

Thanks.

2013-06-07 14:30:35

by Eric Dumazet

Subject: RE: Scaling problem with a lot of AF_PACKET sockets on different interfaces

On Fri, 2013-06-07 at 15:09 +0100, David Laight wrote:
> > > Looks like the ptype_base[] should be per 'dev'?
> > > Or just put entries where ptype->dev != null_or_dev on a per-interface
> > > list and do two searches?
> >
> > Yes, but then we would have two searches instead of one in fast path.
>
> Usually it would be empty - so the search would be very quick!

quick + quick + quick + quick + quick == not so quick ;)

Plus adding more code for /proc/net/packet

Plus adding the xmit side (AF_PACKET captures receive and xmit)


2013-06-07 14:33:22

by Daniel Borkmann

Subject: Re: Scaling problem with a lot of AF_PACKET sockets on different interfaces

On 06/07/2013 04:17 PM, Vitaly V. Bursov wrote:
> 07.06.2013 16:05, Daniel Borkmann wrote:
[...]
>
> I wasn't aware of the ifindex 0 thing, it can help, thanks! Of course, if it
> works for me (the application is a custom DHCP server) it will surely
> increase the overhead of BPF (I don't need to tap the traffic from all
> interfaces), and there are VLANs, bridges and bonds - likely the server will
> receive the same packets multiple times, and replies must be sent too...
> but it still should be faster.

Well, as already said, if you use a fanout socket group, then you won't receive the
_exact_ same packet twice. Rather, packets are balanced by different policies among
the packet sockets in that group. What you could do is have a single (JITed) BPF
filter for all those sockets that lets the needed packets pass; you can then read
the interface a packet came from via sockaddr_ll and process it further in your
fast path (or drop it depending on the iface). There's also a BPF extension
(BPF_S_ANC_IFINDEX) that lets you load the ifindex of the skb into the BPF accumulator,
so you could also filter early on a range of ifindexes there (in combination with
binding the sockets to ifindex 0). Probably that could work.
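
For illustration, a classic BPF program using that ancillary ifindex load
might look like the sketch below. From user space the load is expressed as
an absolute load from SKF_AD_OFF + SKF_AD_IFINDEX; the ifindex range
100..199 and the array names are made-up placeholders:

/* Sketch: drop packets whose ingress ifindex is outside [100, 200). */
#include <linux/filter.h>

static struct sock_filter ifindex_filter[] = {
        /* A = skb->dev->ifindex (ancillary load) */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_IFINDEX),
        /* if (A < 100)  goto drop */
        BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, 100, 0, 2),
        /* if (A >= 200) goto drop */
        BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, 200, 1, 0),
        /* accept: pass the whole packet */
        BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
        /* drop */
        BPF_STMT(BPF_RET | BPF_K, 0),
};

static struct sock_fprog ifindex_prog = {
        .len    = sizeof(ifindex_filter) / sizeof(ifindex_filter[0]),
        .filter = ifindex_filter,
};

A real filter would combine this with the actual DHCP matching; this only
shows the early per-ifindex cut mentioned above.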

2013-06-10 06:34:19

by Vitaly V. Bursov

Subject: Re: Scaling problem with a lot of AF_PACKET sockets on different interfaces

07.06.2013 17:33, Daniel Borkmann wrote:
[...]

Thanks everybody, this should help a lot.

--
Vitaly