2022-07-29 10:45:38

by Alexander Mikhalitsyn

Subject: [PATCH 0/2] neighbour: fix possible DoS due to net iface start/stop loop

Dear friends,

Recently, an OpenVZ user reported network availability issues with some
of their containers. It was discovered that the cause was the absence of
ARP replies from the Host Node to requests for container IPs.

Of course, we started with tcpdump analysis and noticed that the ARP
requests successfully arrive at the problematic node's external
interface. So something was wrong on the kernel side.

I played a lot with arping and perf trying to understand what was
happening. The key observation was that we experience issues only with
ARP requests arriving as broadcasts (skb->pkt_type == PACKET_BROADCAST);
for packets with skb->pkt_type == PACKET_HOST everything works flawlessly.

Let me show a small piece of code:

static int arp_process(struct sock *sk, struct sk_buff *skb)
...
	if (NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED ||
	    skb->pkt_type == PACKET_HOST ||
	    NEIGH_VAR(in_dev->arp_parms, PROXY_DELAY) == 0) { // reply instantly
		arp_send_dst(ARPOP_REPLY, ETH_P_ARP,
			     sip, dev, tip, sha,
			     dev->dev_addr, sha,
			     reply_dst);
	} else {
		pneigh_enqueue(&arp_tbl, // reply with delay
			       in_dev->arp_parms, skb);
		goto out_free_dst;
	}

The problem is that for PACKET_BROADCAST packets we delay the reply and
use pneigh_enqueue(). For some reason, the queued packets were lost
almost all the time! The reason for this behaviour is the
pneigh_queue_purge() function, which cleans up the whole queue and is
called every time any network device in the system goes link down:

neigh_ifdown -> pneigh_queue_purge

Now imagine a node with 500+ containers running microservices, some of
which are buggy and constantly restarting... In this case, the
pneigh_queue_purge() function is called very frequently.

This problem is reproducible only with the so-called "host routed"
setup. The classical bridge + veth scheme is not affected.

Minimal reproducer

Suppose we have the network 172.29.1.1/16 brd 172.29.255.255
and a free-to-use IP; let it be 172.29.128.3.

1. Network configuration. I'm showing the minimal configuration; it makes little sense
as both veth devices stay in the same net namespace, but for demonstration and simplicity's sake it's okay.

ip l a veth31427 type veth peer name veth314271
ip l s veth31427 up
ip l s veth314271 up

# setup static arp entry and publish it
arp -Ds -i br0 172.29.128.3 veth31427 pub
# setup static route for this address
route add 172.29.128.3/32 dev veth31427

2. "attacker" side (kubernetes pod with buggy microservice :) )

unshare -n
ip l a type veth
ip l s veth0 up
ip l s veth1 up
for i in {1..100000}; do ip link set veth0 down; sleep 0.01; ip link set veth0 up; done

This will totally block ARP replies for the 172.29.128.3 address. Just try:
# arping -I eth0 172.29.128.3 -c 4

Our proposal is simple:
1. Let's clean up the queue partially: remove only the skbs related to the
net namespace of the adapter whose link went down.

2. Let's account the proxy_queue limit properly per device. The current
limiting does not look fully correct, because we compare the per-device
configurable limit with the "global" qlen of the proxy_queue.

Thanks,
Alex

Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Yajun Deng <[email protected]>
Cc: Roopa Prabhu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Denis V. Lunev <[email protected]>
Cc: Alexey Kuznetsov <[email protected]>
Cc: Konstantin Khorenko <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: Andrey Zhadchenko <[email protected]>
Cc: Alexander Mikhalitsyn <[email protected]>
Cc: [email protected]

Alexander Mikhalitsyn (1):
neighbour: make proxy_queue.qlen limit per-device

Denis V. Lunev (1):
neigh: fix possible DoS due to net iface start/stop loop

include/net/neighbour.h | 1 +
net/core/neighbour.c | 43 +++++++++++++++++++++++++++++++++--------
2 files changed, 36 insertions(+), 8 deletions(-)

--
2.36.1


2022-07-29 10:45:45

by Alexander Mikhalitsyn

Subject: [PATCH 1/2] neigh: fix possible DoS due to net iface start/stop loop

From: "Denis V. Lunev" <[email protected]>

Normal processing of an ARP request (usually an Ethernet broadcast
packet) coming to the host looks like the following:
* the packet reaches the arp_process() call and is passed through the
routing procedure
* the request is put into the queue using pneigh_enqueue() if the
corresponding ARP record is not local (the common case for container
records on the host)
* the request is processed by a timer (within 80 jiffies by default) and
an ARP reply is sent from the same arp_process() via the
NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED condition (the flag is set
inside pneigh_enqueue())

And here comes the problem. The Linux kernel calls pneigh_queue_purge(),
which destroys the whole queue of ARP requests, on ANY network interface
start/stop event through __neigh_ifdown().

This was not actually a problem in the original world, as network
interface start/stop was accessible to the host 'root' only, who could
do more destructive things anyway. But the world has changed and Linux
containers are now available. A container's 'root' has access to this
API and should be considered an untrusted user in the hosting world.

Thus there is an attack vector against other containers on the node
when a container's root endlessly starts/stops interfaces. We have
observed such a situation on a real production node: a docker container
was doing exactly this, and other containers on the node became
inaccessible.

The proposed patch does a very simple thing: in pneigh_queue_purge() it
drops only the packets from the namespace in which the network interface
state change was detected. This is enough to prevent the problem for the
whole node while preserving the original semantics of the code.

Investigated-by: Alexander Mikhalitsyn <[email protected]>
Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Yajun Deng <[email protected]>
Cc: Roopa Prabhu <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Alexey Kuznetsov <[email protected]>
Cc: Alexander Mikhalitsyn <[email protected]>
Cc: Konstantin Khorenko <[email protected]>
Cc: [email protected]
Signed-off-by: Denis V. Lunev <[email protected]>
---
net/core/neighbour.c | 22 +++++++++++++++-------
1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 54625287ee5b..213ec0be800b 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -307,14 +307,24 @@ static int neigh_del_timer(struct neighbour *n)
 	return 0;
 }
 
-static void pneigh_queue_purge(struct sk_buff_head *list)
+static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
 {
+	unsigned long flags;
 	struct sk_buff *skb;
 
-	while ((skb = skb_dequeue(list)) != NULL) {
-		dev_put(skb->dev);
-		kfree_skb(skb);
+	spin_lock_irqsave(&list->lock, flags);
+	skb = skb_peek(list);
+	while (skb) {
+		struct sk_buff *skb_next = skb_peek_next(skb, list);
+
+		if (!net || net_eq(dev_net(skb->dev), net)) {
+			__skb_unlink(skb, list);
+			dev_put(skb->dev);
+			kfree_skb(skb);
+		}
+		skb = skb_next;
 	}
+	spin_unlock_irqrestore(&list->lock, flags);
 }
 
 static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev,
@@ -386,8 +396,7 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev,
 	neigh_flush_dev(tbl, dev, skip_perm);
 	pneigh_ifdown_and_unlock(tbl, dev);
 
-	del_timer_sync(&tbl->proxy_timer);
-	pneigh_queue_purge(&tbl->proxy_queue);
+	pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
 	return 0;
 }
 
@@ -1787,7 +1796,7 @@ int neigh_table_clear(int index, struct neigh_table *tbl)
 	cancel_delayed_work_sync(&tbl->managed_work);
 	cancel_delayed_work_sync(&tbl->gc_work);
 	del_timer_sync(&tbl->proxy_timer);
-	pneigh_queue_purge(&tbl->proxy_queue);
+	pneigh_queue_purge(&tbl->proxy_queue, NULL);
 	neigh_ifdown(tbl, NULL);
 	if (atomic_read(&tbl->entries))
 		pr_crit("neighbour leakage\n");
--
2.36.1

2022-08-01 15:20:42

by David Ahern

Subject: Re: [PATCH 1/2] neigh: fix possible DoS due to net iface start/stop loop

On 7/29/22 4:35 AM, Alexander Mikhalitsyn wrote:
> The patch proposed doing very simple thing. It drops only packets from

it does 2 things - adds a namespace check and a performance-based change
in the way the list is walked.

> the same namespace in the pneigh_queue_purge() where network interface
> state change is detected. This is enough to prevent the problem for the
> whole node preserving original semantics of the code.
>


> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 54625287ee5b..213ec0be800b 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c

> @@ -386,8 +396,7 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev,
> neigh_flush_dev(tbl, dev, skip_perm);
> pneigh_ifdown_and_unlock(tbl, dev);
>
> - del_timer_sync(&tbl->proxy_timer);

why are you removing this line too?

> - pneigh_queue_purge(&tbl->proxy_queue);
> + pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
> return 0;
> }
>

2022-08-01 16:27:12

by Denis V. Lunev

Subject: Re: [PATCH 1/2] neigh: fix possible DoS due to net iface start/stop loop

On 01.08.2022 17:08, David Ahern wrote:
> On 7/29/22 4:35 AM, Alexander Mikhalitsyn wrote:
>> The patch proposed doing very simple thing. It drops only packets from
> it does 2 things - adds a namespace check and a performance based change
> with the way the list is walked.
I believe no. All items are removed before the patch and
all items are walked after the patch, similar O(n) operation.


>> the same namespace in the pneigh_queue_purge() where network interface
>> state change is detected. This is enough to prevent the problem for the
>> whole node preserving original semantics of the code.
>>
>
>> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
>> index 54625287ee5b..213ec0be800b 100644
>> --- a/net/core/neighbour.c
>> +++ b/net/core/neighbour.c
>> @@ -386,8 +396,7 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev,
>> neigh_flush_dev(tbl, dev, skip_perm);
>> pneigh_ifdown_and_unlock(tbl, dev);
>>
>> - del_timer_sync(&tbl->proxy_timer);
> why are you removing this line too?

Because we have packets in the queue after the purge and they have to be
processed.
Maybe we should rather do

if (skb_queue_empty(&tbl->proxy_queue))
   del_timer_sync(&tbl->proxy_timer);

after the purge. That could be more correct.

>> - pneigh_queue_purge(&tbl->proxy_queue);
>> + pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
>> return 0;
>> }
>>


2022-08-10 16:25:59

by Alexander Mikhalitsyn

Subject: [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop

Dear friends,

Recently, an OpenVZ user reported network availability issues with some
of their containers. It was discovered that the cause was the absence of
ARP replies from the Host Node to requests for container IPs.

Of course, we started with tcpdump analysis and noticed that the ARP
requests successfully arrive at the problematic node's external
interface. So something was wrong on the kernel side.

I played a lot with arping and perf trying to understand what was
happening. The key observation was that we experience issues only with
ARP requests arriving as broadcasts (skb->pkt_type == PACKET_BROADCAST);
for packets with skb->pkt_type == PACKET_HOST everything works flawlessly.

Let me show a small piece of code:

static int arp_process(struct sock *sk, struct sk_buff *skb)
...
	if (NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED ||
	    skb->pkt_type == PACKET_HOST ||
	    NEIGH_VAR(in_dev->arp_parms, PROXY_DELAY) == 0) { // reply instantly
		arp_send_dst(ARPOP_REPLY, ETH_P_ARP,
			     sip, dev, tip, sha,
			     dev->dev_addr, sha,
			     reply_dst);
	} else {
		pneigh_enqueue(&arp_tbl, // reply with delay
			       in_dev->arp_parms, skb);
		goto out_free_dst;
	}

The problem is that for PACKET_BROADCAST packets we delay the reply and
use pneigh_enqueue(). For some reason, the queued packets were lost
almost all the time! The reason for this behaviour is the
pneigh_queue_purge() function, which cleans up the whole queue and is
called every time any network device in the system goes link down:

neigh_ifdown -> pneigh_queue_purge

Now imagine a node with 500+ containers running microservices, some of
which are buggy and constantly restarting... In this case, the
pneigh_queue_purge() function is called very frequently.

This problem is reproducible only with the so-called "host routed"
setup. The classical bridge + veth scheme is not affected.

Minimal reproducer

Suppose we have the network 172.29.1.1/16 brd 172.29.255.255
and a free-to-use IP; let it be 172.29.128.3.

1. Network configuration. I'm showing the minimal configuration; it makes little sense
as both veth devices stay in the same net namespace, but for demonstration and simplicity's sake it's okay.

ip l a veth31427 type veth peer name veth314271
ip l s veth31427 up
ip l s veth314271 up

# setup static arp entry and publish it
arp -Ds -i br0 172.29.128.3 veth31427 pub
# setup static route for this address
route add 172.29.128.3/32 dev veth31427

2. "attacker" side (kubernetes pod with buggy microservice :) )

unshare -n
ip l a type veth
ip l s veth0 up
ip l s veth1 up
for i in {1..100000}; do ip link set veth0 down; sleep 0.01; ip link set veth0 up; done

This will totally block ARP replies for the 172.29.128.3 address. Just try:
# arping -I eth0 172.29.128.3 -c 4

Our proposal is simple:
1. Let's clean up the queue partially: remove only the skbs related to the
net namespace of the adapter whose link went down.

2. Let's account the proxy_queue limit properly per device. The current
limiting does not look fully correct, because we compare the per-device
configurable limit with the "global" qlen of the proxy_queue.

Thanks,
Alex

v2:
- only ("neigh: fix possible DoS due to net iface start/stop") is changed
do del_timer_sync() if queue is empty after pneigh_queue_purge()

Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Yajun Deng <[email protected]>
Cc: Roopa Prabhu <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Denis V. Lunev <[email protected]>
Cc: Alexey Kuznetsov <[email protected]>
Cc: Konstantin Khorenko <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: Andrey Zhadchenko <[email protected]>
Cc: Alexander Mikhalitsyn <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Denis V. Lunev <[email protected]>
Signed-off-by: Alexander Mikhalitsyn <[email protected]>

Alexander Mikhalitsyn (1):
neighbour: make proxy_queue.qlen limit per-device

Denis V. Lunev (1):
neigh: fix possible DoS due to net iface start/stop loop

include/net/neighbour.h | 1 +
net/core/neighbour.c | 46 +++++++++++++++++++++++++++++++++--------
2 files changed, 38 insertions(+), 9 deletions(-)

--
2.36.1

2022-08-10 16:46:13

by Alexander Mikhalitsyn

Subject: [PATCH v2 1/2] neigh: fix possible DoS due to net iface start/stop loop

From: "Denis V. Lunev" <[email protected]>

Normal processing of an ARP request (usually an Ethernet broadcast
packet) coming to the host looks like the following:
* the packet reaches the arp_process() call and is passed through the
routing procedure
* the request is put into the queue using pneigh_enqueue() if the
corresponding ARP record is not local (the common case for container
records on the host)
* the request is processed by a timer (within 80 jiffies by default) and
an ARP reply is sent from the same arp_process() via the
NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED condition (the flag is set
inside pneigh_enqueue())

And here comes the problem. The Linux kernel calls pneigh_queue_purge(),
which destroys the whole queue of ARP requests, on ANY network interface
start/stop event through __neigh_ifdown().

This was not actually a problem in the original world, as network
interface start/stop was accessible to the host 'root' only, who could
do more destructive things anyway. But the world has changed and Linux
containers are now available. A container's 'root' has access to this
API and should be considered an untrusted user in the hosting world.

Thus there is an attack vector against other containers on the node
when a container's root endlessly starts/stops interfaces. We have
observed such a situation on a real production node: a docker container
was doing exactly this, and other containers on the node became
inaccessible.

The proposed patch does a very simple thing: in pneigh_queue_purge() it
drops only the packets from the namespace in which the network interface
state change was detected. This is enough to prevent the problem for the
whole node while preserving the original semantics of the code.

v2:
- do del_timer_sync() if queue is empty after pneigh_queue_purge()

Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Yajun Deng <[email protected]>
Cc: Roopa Prabhu <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Alexey Kuznetsov <[email protected]>
Cc: Alexander Mikhalitsyn <[email protected]>
Cc: Konstantin Khorenko <[email protected]>
Cc: [email protected]
Cc: [email protected]
Investigated-by: Alexander Mikhalitsyn <[email protected]>
Signed-off-by: Denis V. Lunev <[email protected]>
---
net/core/neighbour.c | 25 +++++++++++++++++--------
1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 54625287ee5b..19d99d1eff53 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -307,14 +307,23 @@ static int neigh_del_timer(struct neighbour *n)
 	return 0;
 }
 
-static void pneigh_queue_purge(struct sk_buff_head *list)
+static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
 {
+	unsigned long flags;
 	struct sk_buff *skb;
 
-	while ((skb = skb_dequeue(list)) != NULL) {
-		dev_put(skb->dev);
-		kfree_skb(skb);
+	spin_lock_irqsave(&list->lock, flags);
+	skb = skb_peek(list);
+	while (skb != NULL) {
+		struct sk_buff *skb_next = skb_peek_next(skb, list);
+		if (net == NULL || net_eq(dev_net(skb->dev), net)) {
+			__skb_unlink(skb, list);
+			dev_put(skb->dev);
+			kfree_skb(skb);
+		}
+		skb = skb_next;
 	}
+	spin_unlock_irqrestore(&list->lock, flags);
 }
 
 static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev,
@@ -385,9 +394,9 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev,
 	write_lock_bh(&tbl->lock);
 	neigh_flush_dev(tbl, dev, skip_perm);
 	pneigh_ifdown_and_unlock(tbl, dev);
-
-	del_timer_sync(&tbl->proxy_timer);
-	pneigh_queue_purge(&tbl->proxy_queue);
+	pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
+	if (skb_queue_empty_lockless(&tbl->proxy_queue))
+		del_timer_sync(&tbl->proxy_timer);
 	return 0;
 }
 
@@ -1787,7 +1796,7 @@ int neigh_table_clear(int index, struct neigh_table *tbl)
 	cancel_delayed_work_sync(&tbl->managed_work);
 	cancel_delayed_work_sync(&tbl->gc_work);
 	del_timer_sync(&tbl->proxy_timer);
-	pneigh_queue_purge(&tbl->proxy_queue);
+	pneigh_queue_purge(&tbl->proxy_queue, NULL);
 	neigh_ifdown(tbl, NULL);
 	if (atomic_read(&tbl->entries))
 		pr_crit("neighbour leakage\n");
--
2.36.1

2022-08-11 15:04:26

by Alexander Mikhalitsyn

Subject: Re: [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop

Hi, Jakub

On Thu, Aug 11, 2022 at 5:46 PM Jakub Kicinski <[email protected]> wrote:
>
> On Wed, 10 Aug 2022 19:08:38 +0300 Alexander Mikhalitsyn wrote:
> > include/net/neighbour.h | 1 +
> > net/core/neighbour.c | 46 +++++++++++++++++++++++++++++++++--------
> > 2 files changed, 38 insertions(+), 9 deletions(-)
>
> Which tree are these based on? They don't seem to apply cleanly

It's based on the 5.19 tree, but I can easily resend it based on net-next.

Regards,
Alex

2022-08-11 15:20:13

by Jakub Kicinski

Subject: Re: [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop

On Thu, 11 Aug 2022 17:51:32 +0300 Alexander Mikhalitsyn wrote:
> On Thu, Aug 11, 2022 at 5:46 PM Jakub Kicinski <[email protected]> wrote:
> > On Wed, 10 Aug 2022 19:08:38 +0300 Alexander Mikhalitsyn wrote:
> > > include/net/neighbour.h | 1 +
> > > net/core/neighbour.c | 46 +++++++++++++++++++++++++++++++++--------
> > > 2 files changed, 38 insertions(+), 9 deletions(-)
> >
> > Which tree are these based on? They don't seem to apply cleanly
>
> It's based on the 5.19 tree, but I can easily resend it based on net-next.

netdev/net would be the most appropriate tree for a fix.
Not that it differs much from net-next at this stage of
the merge window.

2022-08-11 15:21:52

by Jakub Kicinski

Subject: Re: [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop

On Wed, 10 Aug 2022 19:08:38 +0300 Alexander Mikhalitsyn wrote:
> include/net/neighbour.h | 1 +
> net/core/neighbour.c | 46 +++++++++++++++++++++++++++++++++--------
> 2 files changed, 38 insertions(+), 9 deletions(-)

Which tree are these based on? They don't seem to apply cleanly

2022-08-11 15:30:45

by Alexander Mikhalitsyn

Subject: [PATCH v3 0/2] neighbour: fix possible DoS due to net iface start/stop loop

Dear friends,

Recently, an OpenVZ user reported network availability issues with some
of their containers. It was discovered that the cause was the absence of
ARP replies from the Host Node to requests for container IPs.

Of course, we started with tcpdump analysis and noticed that the ARP
requests successfully arrive at the problematic node's external
interface. So something was wrong on the kernel side.

I played a lot with arping and perf trying to understand what was
happening. The key observation was that we experience issues only with
ARP requests arriving as broadcasts (skb->pkt_type == PACKET_BROADCAST);
for packets with skb->pkt_type == PACKET_HOST everything works flawlessly.

Let me show a small piece of code:

static int arp_process(struct sock *sk, struct sk_buff *skb)
...
	if (NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED ||
	    skb->pkt_type == PACKET_HOST ||
	    NEIGH_VAR(in_dev->arp_parms, PROXY_DELAY) == 0) { // reply instantly
		arp_send_dst(ARPOP_REPLY, ETH_P_ARP,
			     sip, dev, tip, sha,
			     dev->dev_addr, sha,
			     reply_dst);
	} else {
		pneigh_enqueue(&arp_tbl, // reply with delay
			       in_dev->arp_parms, skb);
		goto out_free_dst;
	}

The problem is that for PACKET_BROADCAST packets we delay the reply and
use pneigh_enqueue(). For some reason, the queued packets were lost
almost all the time! The reason for this behaviour is the
pneigh_queue_purge() function, which cleans up the whole queue and is
called every time any network device in the system goes link down:

neigh_ifdown -> pneigh_queue_purge

Now imagine a node with 500+ containers running microservices, some of
which are buggy and constantly restarting... In this case, the
pneigh_queue_purge() function is called very frequently.

This problem is reproducible only with the so-called "host routed"
setup. The classical bridge + veth scheme is not affected.

Minimal reproducer

Suppose we have the network 172.29.1.1/16 brd 172.29.255.255
and a free-to-use IP; let it be 172.29.128.3.

1. Network configuration. I'm showing the minimal configuration; it makes little sense
as both veth devices stay in the same net namespace, but for demonstration and simplicity's sake it's okay.

ip l a veth31427 type veth peer name veth314271
ip l s veth31427 up
ip l s veth314271 up

# setup static arp entry and publish it
arp -Ds -i br0 172.29.128.3 veth31427 pub
# setup static route for this address
route add 172.29.128.3/32 dev veth31427

2. "attacker" side (kubernetes pod with buggy microservice :) )

unshare -n
ip l a type veth
ip l s veth0 up
ip l s veth1 up
for i in {1..100000}; do ip link set veth0 down; sleep 0.01; ip link set veth0 up; done

This will totally block ARP replies for the 172.29.128.3 address. Just try:
# arping -I eth0 172.29.128.3 -c 4

Our proposal is simple:
1. Let's clean up the queue partially: remove only the skbs related to the
net namespace of the adapter whose link went down.

2. Let's account the proxy_queue limit properly per device. The current
limiting does not look fully correct, because we compare the per-device
configurable limit with the "global" qlen of the proxy_queue.

Thanks,
Alex

v2:
- only ("neigh: fix possible DoS due to net iface start/stop") is changed
do del_timer_sync() if queue is empty after pneigh_queue_purge()

v3:
- rebase to net tree

Cc: "David S. Miller" <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Jakub Kicinski <[email protected]>
Cc: Paolo Abeni <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: David Ahern <[email protected]>
Cc: Yajun Deng <[email protected]>
Cc: Roopa Prabhu <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Denis V. Lunev <[email protected]>
Cc: Alexey Kuznetsov <[email protected]>
Cc: Konstantin Khorenko <[email protected]>
Cc: Pavel Tikhomirov <[email protected]>
Cc: Andrey Zhadchenko <[email protected]>
Cc: Alexander Mikhalitsyn <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Denis V. Lunev <[email protected]>
Signed-off-by: Alexander Mikhalitsyn <[email protected]>

Alexander Mikhalitsyn (1):
neighbour: make proxy_queue.qlen limit per-device

Denis V. Lunev (1):
neigh: fix possible DoS due to net iface start/stop loop

include/net/neighbour.h | 1 +
net/core/neighbour.c | 46 +++++++++++++++++++++++++++++++++--------
2 files changed, 38 insertions(+), 9 deletions(-)

--
2.36.1

2022-08-11 15:45:21

by Alexander Mikhalitsyn

Subject: Re: [PATCH v2 0/2] neighbour: fix possible DoS due to net iface start/stop loop

On Thu, Aug 11, 2022 at 5:53 PM Jakub Kicinski <[email protected]> wrote:
>
> On Thu, 11 Aug 2022 17:51:32 +0300 Alexander Mikhalitsyn wrote:
> > On Thu, Aug 11, 2022 at 5:46 PM Jakub Kicinski <[email protected]> wrote:
> > > On Wed, 10 Aug 2022 19:08:38 +0300 Alexander Mikhalitsyn wrote:
> > > > include/net/neighbour.h | 1 +
> > > > net/core/neighbour.c | 46 +++++++++++++++++++++++++++++++++--------
> > > > 2 files changed, 38 insertions(+), 9 deletions(-)
> > >
> > > Which tree are these based on? They don't seem to apply cleanly
> >
> > It's based on the 5.19 tree, but I can easily resend it based on net-next.
>
> netdev/net would be the most appropriate tree for a fix.
> Not that it differs much from net-next at this stage of
> the merge window.

Yes, thanks Jakub. I'm a newbie here ;) Sorry for the inconvenience.

Will rebase and send patches soon.

2022-08-15 09:59:41

by Christian Brauner

Subject: Re: [PATCH v2 1/2] neigh: fix possible DoS due to net iface start/stop loop

On Wed, Aug 10, 2022 at 07:08:39PM +0300, Alexander Mikhalitsyn wrote:
> From: "Denis V. Lunev" <[email protected]>
>
> Normal processing of ARP request (usually this is Ethernet broadcast
> packet) coming to the host is looking like the following:
> * the packet comes to arp_process() call and is passed through routing
> procedure
> * the request is put into the queue using pneigh_enqueue() if
> corresponding ARP record is not local (common case for container
> records on the host)
> * the request is processed by timer (within 80 jiffies by default) and
> ARP reply is sent from the same arp_process() using
> NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED condition (flag is set inside
> pneigh_enqueue())
>
> And here the problem comes. Linux kernel calls pneigh_queue_purge()
> which destroys the whole queue of ARP requests on ANY network interface
> start/stop event through __neigh_ifdown().
>
> This is actually not a problem within the original world as network
> interface start/stop was accessible to the host 'root' only, which
> could do more destructive things. But the world is changed and there
> are Linux containers available. Here container 'root' has an access
> to this API and could be considered as untrusted user in the hosting
> (container's) world.
>
> Thus there is an attack vector to other containers on node when
> container's root will endlessly start/stop interfaces. We have observed
> similar situation on a real production node when docker container was
> doing such activity and thus other containers on the node become not
> accessible.
>
> The patch proposed doing very simple thing. It drops only packets from
> the same namespace in the pneigh_queue_purge() where network interface
> state change is detected. This is enough to prevent the problem for the
> whole node preserving original semantics of the code.

This is how I'd do it as well.

>
> v2:
> - do del_timer_sync() if queue is empty after pneigh_queue_purge()
>
> Cc: "David S. Miller" <[email protected]>
> Cc: Eric Dumazet <[email protected]>
> Cc: Jakub Kicinski <[email protected]>
> Cc: Paolo Abeni <[email protected]>
> Cc: Daniel Borkmann <[email protected]>
> Cc: David Ahern <[email protected]>
> Cc: Yajun Deng <[email protected]>
> Cc: Roopa Prabhu <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Alexey Kuznetsov <[email protected]>
> Cc: Alexander Mikhalitsyn <[email protected]>
> Cc: Konstantin Khorenko <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Investigated-by: Alexander Mikhalitsyn <[email protected]>
> Signed-off-by: Denis V. Lunev <[email protected]>
> ---
> net/core/neighbour.c | 25 +++++++++++++++++--------
> 1 file changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 54625287ee5b..19d99d1eff53 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -307,14 +307,23 @@ static int neigh_del_timer(struct neighbour *n)
> return 0;
> }
>
> -static void pneigh_queue_purge(struct sk_buff_head *list)
> +static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
> {
> + unsigned long flags;
> struct sk_buff *skb;
>
> - while ((skb = skb_dequeue(list)) != NULL) {
> - dev_put(skb->dev);
> - kfree_skb(skb);
> + spin_lock_irqsave(&list->lock, flags);

I'm a bit surprised to see a spinlock held around a while loop walking a
linked list but that seems to be quite common in this file. I take it
the lists are guaranteed to be short.

> + skb = skb_peek(list);
> + while (skb != NULL) {
> + struct sk_buff *skb_next = skb_peek_next(skb, list);
> + if (net == NULL || net_eq(dev_net(skb->dev), net)) {
> + __skb_unlink(skb, list);
> + dev_put(skb->dev);
> + kfree_skb(skb);
> + }
> + skb = skb_next;
> }
> + spin_unlock_irqrestore(&list->lock, flags);
> }
>
> static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev,
> @@ -385,9 +394,9 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev,
> write_lock_bh(&tbl->lock);
> neigh_flush_dev(tbl, dev, skip_perm);
> pneigh_ifdown_and_unlock(tbl, dev);
> -
> - del_timer_sync(&tbl->proxy_timer);
> - pneigh_queue_purge(&tbl->proxy_queue);
> + pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
> + if (skb_queue_empty_lockless(&tbl->proxy_queue))
> + del_timer_sync(&tbl->proxy_timer);
> return 0;
> }
>
> @@ -1787,7 +1796,7 @@ int neigh_table_clear(int index, struct neigh_table *tbl)
> cancel_delayed_work_sync(&tbl->managed_work);
> cancel_delayed_work_sync(&tbl->gc_work);
> del_timer_sync(&tbl->proxy_timer);
> - pneigh_queue_purge(&tbl->proxy_queue);
> + pneigh_queue_purge(&tbl->proxy_queue, NULL);
> neigh_ifdown(tbl, NULL);
> if (atomic_read(&tbl->entries))
> pr_crit("neighbour leakage\n");
> --
> 2.36.1
>

2022-08-15 10:59:35

by patchwork-bot+netdevbpf

[permalink] [raw]
Subject: Re: [PATCH v3 0/2] neighbour: fix possible DoS due to net iface start/stop loop

Hello:

This series was applied to netdev/net.git (master)
by David S. Miller <[email protected]>:

On Thu, 11 Aug 2022 18:20:10 +0300 you wrote:
> Dear friends,
>
> Recently one of the OpenVZ users reported that they had issues with network
> availability of some containers. It was discovered that the reason was the
> absence of ARP replies from the host node to requests about container IPs.
>
> Of course, we started with tcpdump analysis and noticed that ARP requests
> successfully reached the problematic node's external interface. So, something
> was wrong on the kernel side.
>
> [...]

Here is the summary with links:
- [v3,1/2] neigh: fix possible DoS due to net iface start/stop loop
https://git.kernel.org/netdev/net/c/66ba215cb513
- [v3,2/2] neighbour: make proxy_queue.qlen limit per-device
https://git.kernel.org/netdev/net/c/0ff4eb3d5ebb

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html


2022-08-15 11:02:03

by Denis V. Lunev

[permalink] [raw]
Subject: Re: [PATCH v2 1/2] neigh: fix possible DoS due to net iface start/stop loop

On 15.08.2022 11:44, Christian Brauner wrote:
> On Wed, Aug 10, 2022 at 07:08:39PM +0300, Alexander Mikhalitsyn wrote:
>> From: "Denis V. Lunev" <[email protected]>
>>
>> Normal processing of an ARP request (usually an Ethernet broadcast
>> packet) arriving at the host looks like the following:
>> * the packet reaches the arp_process() call and is passed through the
>> routing procedure
>> * the request is put into the queue using pneigh_enqueue() if the
>> corresponding ARP record is not local (the common case for container
>> records on the host)
>> * the request is processed by a timer (within 80 jiffies by default) and
>> the ARP reply is sent from the same arp_process() using the
>> NEIGH_CB(skb)->flags & LOCALLY_ENQUEUED condition (the flag is set inside
>> pneigh_enqueue())
>>
>> And here comes the problem. The Linux kernel calls pneigh_queue_purge(),
>> which destroys the whole queue of ARP requests, on ANY network interface
>> start/stop event via __neigh_ifdown().
>>
>> This was not a problem in the original world, as network interface
>> start/stop was accessible to the host 'root' only, who could do far more
>> destructive things. But the world has changed, and Linux containers are
>> now available. A container's 'root' has access to this API and must be
>> considered an untrusted user in the hosting (container) world.
>>
>> Thus there is an attack vector against other containers on the node: a
>> container's root can endlessly start and stop interfaces. We have
>> observed such a situation on a real production node, where a docker
>> container was doing exactly that, and the other containers on the node
>> became inaccessible.
>>
>> The proposed patch does a very simple thing: it drops only packets from
>> the same namespace in pneigh_queue_purge() when a network interface
>> state change is detected. This is enough to prevent the problem for the
>> whole node while preserving the original semantics of the code.
> This is how I'd do it as well.
>
>> v2:
>> - do del_timer_sync() if queue is empty after pneigh_queue_purge()
>>
>> Cc: "David S. Miller" <[email protected]>
>> Cc: Eric Dumazet <[email protected]>
>> Cc: Jakub Kicinski <[email protected]>
>> Cc: Paolo Abeni <[email protected]>
>> Cc: Daniel Borkmann <[email protected]>
>> Cc: David Ahern <[email protected]>
>> Cc: Yajun Deng <[email protected]>
>> Cc: Roopa Prabhu <[email protected]>
>> Cc: Christian Brauner <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Cc: Alexey Kuznetsov <[email protected]>
>> Cc: Alexander Mikhalitsyn <[email protected]>
>> Cc: Konstantin Khorenko <[email protected]>
>> Cc: [email protected]
>> Cc: [email protected]
>> Investigated-by: Alexander Mikhalitsyn <[email protected]>
>> Signed-off-by: Denis V. Lunev <[email protected]>
>> ---
>> net/core/neighbour.c | 25 +++++++++++++++++--------
>> 1 file changed, 17 insertions(+), 8 deletions(-)
>>
>> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
>> index 54625287ee5b..19d99d1eff53 100644
>> --- a/net/core/neighbour.c
>> +++ b/net/core/neighbour.c
>> @@ -307,14 +307,23 @@ static int neigh_del_timer(struct neighbour *n)
>> return 0;
>> }
>>
>> -static void pneigh_queue_purge(struct sk_buff_head *list)
>> +static void pneigh_queue_purge(struct sk_buff_head *list, struct net *net)
>> {
>> + unsigned long flags;
>> struct sk_buff *skb;
>>
>> - while ((skb = skb_dequeue(list)) != NULL) {
>> - dev_put(skb->dev);
>> - kfree_skb(skb);
>> + spin_lock_irqsave(&list->lock, flags);
> I'm a bit surprised to see a spinlock held around a while loop walking a
> linked list but that seems to be quite common in this file. I take it
> the lists are guaranteed to be short.
Within the current code the size of the list is 64 packets at most
(the same spinlock is held during packet processing).

These semantics change in the next patch from Alexander, though,
where the limit becomes 64 packets per interface.

Den

>> + skb = skb_peek(list);
>> + while (skb != NULL) {
>> + struct sk_buff *skb_next = skb_peek_next(skb, list);
>> + if (net == NULL || net_eq(dev_net(skb->dev), net)) {
>> + __skb_unlink(skb, list);
>> + dev_put(skb->dev);
>> + kfree_skb(skb);
>> + }
>> + skb = skb_next;
>> }
>> + spin_unlock_irqrestore(&list->lock, flags);
>> }
>>
>> static void neigh_flush_dev(struct neigh_table *tbl, struct net_device *dev,
>> @@ -385,9 +394,9 @@ static int __neigh_ifdown(struct neigh_table *tbl, struct net_device *dev,
>> write_lock_bh(&tbl->lock);
>> neigh_flush_dev(tbl, dev, skip_perm);
>> pneigh_ifdown_and_unlock(tbl, dev);
>> -
>> - del_timer_sync(&tbl->proxy_timer);
>> - pneigh_queue_purge(&tbl->proxy_queue);
>> + pneigh_queue_purge(&tbl->proxy_queue, dev_net(dev));
>> + if (skb_queue_empty_lockless(&tbl->proxy_queue))
>> + del_timer_sync(&tbl->proxy_timer);
>> return 0;
>> }
>>
>> @@ -1787,7 +1796,7 @@ int neigh_table_clear(int index, struct neigh_table *tbl)
>> cancel_delayed_work_sync(&tbl->managed_work);
>> cancel_delayed_work_sync(&tbl->gc_work);
>> del_timer_sync(&tbl->proxy_timer);
>> - pneigh_queue_purge(&tbl->proxy_queue);
>> + pneigh_queue_purge(&tbl->proxy_queue, NULL);
>> neigh_ifdown(tbl, NULL);
>> if (atomic_read(&tbl->entries))
>> pr_crit("neighbour leakage\n");
>> --
>> 2.36.1
>>