While this backport proposal is based on the 4.4.y stable tree, it
might also apply in some form to any stable tree which backported
05c0b86b96: "ipv6: frags: rewrite ip6_expire_frag_queue()"
While that made ip6_expire_frag_queue() similar to ip_expire(),
it did not follow the additional changes to ip_expire() which
were also backported:
fa0f527358: "ip: use rb trees for IP frag queue."
a4fd284a1f: "ip: process in-order fragments efficiently"
The former of the two not only adds handling for rb trees, but
also modifies ip_expire() to take the first skb off the queue
before using it for sending the ICMP message. This also got
rid of the need to protect the skb by incrementing its reference
count (that reference count protection is what causes the crash
in ip6_expire_frag_queue()).
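For illustration, the change in pattern looks roughly like this
(condensed from patch 0004 below, shown for the IPv6 side; the
IPv4 change in fa0f527358 is analogous):

	/* before: skb stays on the queue, guarded by an extra reference */
	head = fq->q.fragments;
	if (!(fq->q.flags & INET_FRAG_FIRST_IN) || !head)
		goto out;
	head->dev = dev;
	skb_get(head);
	spin_unlock(&fq->q.lock);
	icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);

	/* after: take the skb off the queue, no extra reference needed */
	if (!(fq->q.flags & INET_FRAG_FIRST_IN))
		goto out;
	head = inet_frag_pull_head(&fq->q);
	if (!head)
		goto out;
	head->dev = dev;
	spin_unlock(&fq->q.lock);
	icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);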
My first approach was to do those changes in ip6_expire_frag_queue(),
but only the former of the two can be done without problems. The
latter uses code which is only locally defined in ipv4/ip_fragment.c.
This was changed upstream in 5.1 when moving code around to be shared:
c23f35d19d: "net: IP defrag: encapsulate rbtree defrag code into
callable functions"
While backporting that, I found two other changes which sound like
one might want them backported, too. Maybe even more so since the
second one ("ip: fail fast on IP defrag errors") is already partially
included in the backport of "net: ipv4: do not handle duplicate
fragments as overlapping".
Though I do realize that "net: IP defrag: encapsulate rbtree
defrag code into callable functions" is rather large and may for
that reason not qualify as a stable backport.
So I would like to ask what the net developers think about
this.
Thanks,
Stefan
0001: v4.20: ipv4: ipv6: netfilter: Adjust the frag mem limit when
truesize changes
0002: v4.20: ip: fail fast on IP defrag errors
0003: v5.1 : net: IP defrag: encapsulate rbtree defrag code into
callable functions
0004: n/a : ipv6: frags: Use inet_frag_pull_head() in
ip6_expire_frag_queue()
Jiri Wiesner (1):
ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
Peter Oskolkov (2):
ip: fail fast on IP defrag errors
net: IP defrag: encapsulate rbtree defrag code into callable functions
Stefan Bader (1):
ipv6: frags: Use inet_frag_pull_head() in ip6_expire_frag_queue()
include/net/inet_frag.h | 16 +-
net/ipv4/inet_fragment.c | 293 +++++++++++++++++++++++
net/ipv4/ip_fragment.c | 294 +++---------------------
net/ipv6/netfilter/nf_conntrack_reasm.c | 8 +-
net/ipv6/reassembly.c | 20 +-
5 files changed, 359 insertions(+), 272 deletions(-)
--
2.17.1
From: Jiri Wiesner <[email protected]>
The *_frag_reasm() functions are susceptible to miscalculating the byte
count of packet fragments in case the truesize of a head buffer changes.
The truesize member may be changed by the call to skb_unclone(), leaving
the fragment memory limit counter unbalanced even if all fragments are
processed. This miscalculation goes unnoticed as long as the network
namespace which holds the counter is not destroyed.
Should an attempt be made to destroy a network namespace that holds an
unbalanced fragment memory limit counter the cleanup of the namespace
never finishes. The thread handling the cleanup gets stuck in
inet_frags_exit_net() waiting for the percpu counter to reach zero. The
thread is usually in running state with a stacktrace similar to:
PID: 1073 TASK: ffff880626711440 CPU: 1 COMMAND: "kworker/u48:4"
#5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
#6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
#7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
#8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
#9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
#10 [ffff880621563e38] process_one_work at ffffffff81096f14
It is not possible to create new network namespaces, and processes
that call unshare() end up being stuck in uninterruptible sleep state
waiting to acquire the net_mutex.
The bug was observed in the IPv6 netfilter code by Per Sundstrom.
I thank him for his analysis of the problem. The parts of this patch
that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
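Condensed, the fix applied below to each of the *_frag_reasm()
functions follows this pattern (with q being the frag queue of the
respective protocol):

	delta = -head->truesize;	/* truesize before skb_unclone() */

	/* Head of list must not be cloned. */
	if (skb_unclone(head, GFP_ATOMIC))
		goto out_nomem;

	delta += head->truesize;	/* pick up any change by skb_unclone() */
	if (delta)
		add_frag_mem_limit(q->net, delta);	/* rebalance the counter */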
Signed-off-by: Jiri Wiesner <[email protected]>
Reported-by: Per Sundstrom <[email protected]>
Acked-by: Peter Oskolkov <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(backported from commit ebaf39e6032faf77218220707fc3fa22487784e0)
[smb: context adjustments in net/ipv6/netfilter/nf_conntrack_reasm.c]
Signed-off-by: Stefan Bader <[email protected]>
---
net/ipv4/ip_fragment.c | 7 +++++++
net/ipv6/netfilter/nf_conntrack_reasm.c | 8 +++++++-
net/ipv6/reassembly.c | 8 +++++++-
3 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 2c75330b1c71..5387e6ab78d7 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -519,6 +519,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
struct rb_node *rbn;
int len;
int ihlen;
+ int delta;
int err;
u8 ecn;
@@ -560,10 +561,16 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
if (len > 65535)
goto out_oversize;
+ delta = - head->truesize;
+
/* Head of list must not be cloned. */
if (skb_unclone(head, GFP_ATOMIC))
goto out_nomem;
+ delta += head->truesize;
+ if (delta)
+ add_frag_mem_limit(qp->q.net, delta);
+
/* If the first fragment is fragmented itself, we split
* it to two chunks: the first with data and paged part
* and the second, holding only fragments. */
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 664c84e47bab..e5d06ceb2c0c 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -353,7 +353,7 @@ static struct sk_buff *
nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
{
struct sk_buff *fp, *op, *head = fq->q.fragments;
- int payload_len;
+ int payload_len, delta;
u8 ecn;
inet_frag_kill(&fq->q);
@@ -374,12 +374,18 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct net_device *dev)
goto out_oversize;
}
+ delta = - head->truesize;
+
/* Head of list must not be cloned. */
if (skb_unclone(head, GFP_ATOMIC)) {
pr_debug("skb is cloned but can't expand head");
goto out_oom;
}
+ delta += head->truesize;
+ if (delta)
+ add_frag_mem_limit(fq->q.net, delta);
+
/* If the first fragment is fragmented itself, we split
* it to two chunks: the first with data and paged part
* and the second, holding only fragments. */
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index ec917f58d105..020cf70273c9 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -343,7 +343,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
{
struct net *net = container_of(fq->q.net, struct net, ipv6.frags);
struct sk_buff *fp, *head = fq->q.fragments;
- int payload_len;
+ int payload_len, delta;
unsigned int nhoff;
int sum_truesize;
u8 ecn;
@@ -384,10 +384,16 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
if (payload_len > IPV6_MAXPLEN)
goto out_oversize;
+ delta = - head->truesize;
+
/* Head of list must not be cloned. */
if (skb_unclone(head, GFP_ATOMIC))
goto out_oom;
+ delta += head->truesize;
+ if (delta)
+ add_frag_mem_limit(fq->q.net, delta);
+
/* If the first fragment is fragmented itself, we split
* it to two chunks: the first with data and paged part
* and the second, holding only fragments. */
--
2.17.1
From: Peter Oskolkov <[email protected]>
The current behavior of IP defragmentation is inconsistent:
- some overlapping/wrong length fragments are dropped without
affecting the queue;
- most overlapping fragments cause the whole frag queue to be dropped.
This patch brings consistency: if a bad fragment is detected,
the whole frag queue is dropped. Two major benefits:
- fail fast: corrupted frag queues are cleared immediately, instead of
by timeout;
- testing of overlapping fragments is now much easier: any kind of
random fragment length mutation now leads to the frag queue being
discarded (IP packet dropped); before this patch, some overlaps were
"corrected", with tests not seeing expected packet drops.
Note that in one case (see "if (end&7)" conditional) the current
behavior is preserved as there are concerns that this could be
legitimate padding.
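Condensed from the hunks below, the error paths in ip_frag_queue()
now fall into three cases:

	goto discard_qp;	/* corrupted fragment: drop the whole queue */
	goto overlap;		/* overlapping fragment: count it, drop the queue */
	goto err;		/* exact duplicate: drop only this skb */

overlap:
	IP_INC_STATS_BH(net, IPSTATS_MIB_REASM_OVERLAPS);
discard_qp:
	inet_frag_kill(&qp->q);
err:
	kfree_skb(skb);
	return err;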
Signed-off-by: Peter Oskolkov <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(backported from commit 0ff89efb524631ac9901b81446b453c29711c376)
[smb: context adjustments and ignoring those changes already done
in backport for "net: ipv4: do not handle duplicate fragments
as overlapping"]
Signed-off-by: Stefan Bader <[email protected]>
---
net/ipv4/ip_fragment.c | 19 +++++++++++--------
1 file changed, 11 insertions(+), 8 deletions(-)
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 5387e6ab78d7..a53652c8c0fd 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -382,7 +382,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
*/
if (end < qp->q.len ||
((qp->q.flags & INET_FRAG_LAST_IN) && end != qp->q.len))
- goto err;
+ goto discard_qp;
qp->q.flags |= INET_FRAG_LAST_IN;
qp->q.len = end;
} else {
@@ -394,20 +394,20 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
if (end > qp->q.len) {
/* Some bits beyond end -> corruption. */
if (qp->q.flags & INET_FRAG_LAST_IN)
- goto err;
+ goto discard_qp;
qp->q.len = end;
}
}
if (end == offset)
- goto err;
+ goto discard_qp;
err = -ENOMEM;
if (!pskb_pull(skb, skb_network_offset(skb) + ihl))
- goto err;
+ goto discard_qp;
err = pskb_trim_rcsum(skb, end - offset);
if (err)
- goto err;
+ goto discard_qp;
/* Note : skb->rbnode and skb->dev share the same location. */
dev = skb->dev;
@@ -434,7 +434,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
/* This is the common case: skb goes to the end. */
/* Detect and discard overlaps. */
if (offset < prev_tail->ip_defrag_offset + prev_tail->len)
- goto discard_qp;
+ goto overlap;
if (offset == prev_tail->ip_defrag_offset + prev_tail->len)
ip4_frag_append_to_last_run(&qp->q, skb);
else
@@ -457,7 +457,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
end <= skb1_run_end)
goto err; /* No new data, potential duplicate */
else
- goto discard_qp; /* Found an overlap */
+ goto overlap; /* Found an overlap */
} while (*rbn);
/* Here we have parent properly set, and rbn pointing to
* one of its NULL left/right children. Insert skb.
@@ -494,15 +494,18 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
skb->_skb_refdst = 0UL;
err = ip_frag_reasm(qp, skb, prev_tail, dev);
skb->_skb_refdst = orefdst;
+ if (err)
+ inet_frag_kill(&qp->q);
return err;
}
skb_dst_drop(skb);
return -EINPROGRESS;
+overlap:
+ IP_INC_STATS_BH(net, IPSTATS_MIB_REASM_OVERLAPS);
discard_qp:
inet_frag_kill(&qp->q);
- IP_INC_STATS_BH(net, IPSTATS_MIB_REASM_OVERLAPS);
err:
kfree_skb(skb);
return err;
--
2.17.1
With the frag code shared between IPv4 and IPv6 it is now possible
to use inet_frag_pull_head() in ip6_expire_frag_queue() in order
to properly extract the first skb from the frag queue.
Since this really takes the skb out of the list, it no longer
needs to be protected against removal (which implicitly fixes
a crash that would happen somewhere in the call to icmp6_send()
when the skb's user count is required to be 1).
kernel BUG at linux-4.4.0/net/core/skbuff.c:1207!
RIP: 0010:[<ffffffff81740953>]
[<ffffffff81740953>] pskb_expand_head+0x243/0x250
[<ffffffff81740e50>] __pskb_pull_tail+0x50/0x350
[<ffffffff8183939a>] _decode_session6+0x26a/0x400
[<ffffffff817ec719>] __xfrm_decode_session+0x39/0x50
[<ffffffff818239d0>] icmpv6_route_lookup+0xf0/0x1c0
[<ffffffff81824421>] icmp6_send+0x5e1/0x940
[<ffffffff8183d431>] icmpv6_send+0x21/0x30
[<ffffffff8182b500>] ip6_expire_frag_queue+0xe0/0x120
Signed-off-by: Stefan Bader <[email protected]>
---
net/ipv6/reassembly.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 020cf70273c9..2bd45abc2516 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -110,16 +110,14 @@ void ip6_expire_frag_queue(struct net *net, struct frag_queue *fq)
IP6_INC_STATS_BH(net, __in6_dev_get(dev), IPSTATS_MIB_REASMTIMEOUT);
/* Don't send error if the first segment did not arrive. */
- head = fq->q.fragments;
- if (!(fq->q.flags & INET_FRAG_FIRST_IN) || !head)
+ if (!(fq->q.flags & INET_FRAG_FIRST_IN))
+ goto out;
+
+ head = inet_frag_pull_head(&fq->q);
+ if (!head)
goto out;
- /* But use as source device on which LAST ARRIVED
- * segment was received. And do not use fq->dev
- * pointer directly, device might already disappeared.
- */
head->dev = dev;
- skb_get(head);
spin_unlock(&fq->q.lock);
icmpv6_send(head, ICMPV6_TIME_EXCEED, ICMPV6_EXC_FRAGTIME, 0);
--
2.17.1
From: Peter Oskolkov <[email protected]>
This is a refactoring patch: without changing runtime behavior,
it moves rbtree-related code from IPv4-specific files/functions
into .h/.c defrag files shared with IPv6 defragmentation code.
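With the rbtree handling factored out, a protocol's defrag code
reduces to roughly the following calls (sketched from the IPv4 side
below, where qp is the ipq; the intended IPv6 users follow the same
shape):

	/* queueing one fragment */
	err = inet_frag_queue_insert(&qp->q, skb, offset, end);
	if (err)	/* IPFRAG_DUP or IPFRAG_OVERLAP */
		goto insert_error;

	/* reassembly, once all fragments have arrived */
	reasm_data = inet_frag_reasm_prepare(&qp->q, skb, prev_tail);
	if (!reasm_data)
		goto out_nomem;
	/* ... protocol-specific length checks ... */
	inet_frag_reasm_finish(&qp->q, skb, reasm_data);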
Signed-off-by: Peter Oskolkov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Florian Westphal <[email protected]>
Cc: Tom Herbert <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
(backported from commit c23f35d19db3b36ffb9e04b08f1d91565d15f84f)
[smb: open code skb_mark_not_on_list(), context adjustments, use of
IP_INC_STATS_BH instead of __IP_INC_STATS]
Signed-off-by: Stefan Bader <[email protected]>
---
include/net/inet_frag.h | 16 ++-
net/ipv4/inet_fragment.c | 293 +++++++++++++++++++++++++++++++++++++++
net/ipv4/ip_fragment.c | 288 ++++----------------------------------
3 files changed, 332 insertions(+), 265 deletions(-)
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 6260ec146142..7c8b06302ade 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -75,8 +75,8 @@ struct inet_frag_queue {
struct timer_list timer;
spinlock_t lock;
atomic_t refcnt;
- struct sk_buff *fragments; /* Used in IPv6. */
- struct rb_root rb_fragments; /* Used in IPv4. */
+ struct sk_buff *fragments; /* used in 6lopwpan IPv6. */
+ struct rb_root rb_fragments; /* Used in IPv4/IPv6. */
struct sk_buff *fragments_tail;
struct sk_buff *last_run_head;
ktime_t stamp;
@@ -152,4 +152,16 @@ static inline void add_frag_mem_limit(struct netns_frags *nf, long val)
extern const u8 ip_frag_ecn_table[16];
+/* Return values of inet_frag_queue_insert() */
+#define IPFRAG_OK 0
+#define IPFRAG_DUP 1
+#define IPFRAG_OVERLAP 2
+int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
+ int offset, int end);
+void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
+ struct sk_buff *parent);
+void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
+ void *reasm_data);
+struct sk_buff *inet_frag_pull_head(struct inet_frag_queue *q);
+
#endif
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index c03e5f5859e1..5c167efd54bd 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -24,6 +24,62 @@
#include <net/sock.h>
#include <net/inet_frag.h>
#include <net/inet_ecn.h>
+#include <net/ip.h>
+#include <net/ipv6.h>
+
+/* Use skb->cb to track consecutive/adjacent fragments coming at
+ * the end of the queue. Nodes in the rb-tree queue will
+ * contain "runs" of one or more adjacent fragments.
+ *
+ * Invariants:
+ * - next_frag is NULL at the tail of a "run";
+ * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
+ */
+struct ipfrag_skb_cb {
+ union {
+ struct inet_skb_parm h4;
+ struct inet6_skb_parm h6;
+ };
+ struct sk_buff *next_frag;
+ int frag_run_len;
+};
+
+#define FRAG_CB(skb) ((struct ipfrag_skb_cb *)((skb)->cb))
+
+static void fragcb_clear(struct sk_buff *skb)
+{
+ RB_CLEAR_NODE(&skb->rbnode);
+ FRAG_CB(skb)->next_frag = NULL;
+ FRAG_CB(skb)->frag_run_len = skb->len;
+}
+
+/* Append skb to the last "run". */
+static void fragrun_append_to_last(struct inet_frag_queue *q,
+ struct sk_buff *skb)
+{
+ fragcb_clear(skb);
+
+ FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
+ FRAG_CB(q->fragments_tail)->next_frag = skb;
+ q->fragments_tail = skb;
+}
+
+/* Create a new "run" with the skb. */
+static void fragrun_create(struct inet_frag_queue *q, struct sk_buff *skb)
+{
+ BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
+ fragcb_clear(skb);
+
+ if (q->last_run_head)
+ rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
+ &q->last_run_head->rbnode.rb_right);
+ else
+ rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
+ rb_insert_color(&skb->rbnode, &q->rb_fragments);
+
+ q->fragments_tail = skb;
+ q->last_run_head = skb;
+}
/* Given the OR values of all fragments, apply RFC 3168 5.3 requirements
* Value : 0xff if frame should be dropped.
@@ -130,6 +186,28 @@ static void inet_frag_destroy_rcu(struct rcu_head *head)
kmem_cache_free(f->frags_cachep, q);
}
+unsigned int inet_frag_rbtree_purge(struct rb_root *root)
+{
+ struct rb_node *p = rb_first(root);
+ unsigned int sum = 0;
+
+ while (p) {
+ struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
+
+ p = rb_next(p);
+ rb_erase(&skb->rbnode, root);
+ while (skb) {
+ struct sk_buff *next = FRAG_CB(skb)->next_frag;
+
+ sum += skb->truesize;
+ kfree_skb(skb);
+ skb = next;
+ }
+ }
+ return sum;
+}
+EXPORT_SYMBOL(inet_frag_rbtree_purge);
+
void inet_frag_destroy(struct inet_frag_queue *q)
{
struct sk_buff *fp;
@@ -231,3 +309,218 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, void *key)
return fq;
}
EXPORT_SYMBOL(inet_frag_find);
+
+int inet_frag_queue_insert(struct inet_frag_queue *q, struct sk_buff *skb,
+ int offset, int end)
+{
+ struct sk_buff *last = q->fragments_tail;
+
+ /* RFC5722, Section 4, amended by Errata ID : 3089
+ * When reassembling an IPv6 datagram, if
+ * one or more its constituent fragments is determined to be an
+ * overlapping fragment, the entire datagram (and any constituent
+ * fragments) MUST be silently discarded.
+ *
+ * Duplicates, however, should be ignored (i.e. skb dropped, but the
+ * queue/fragments kept for later reassembly).
+ */
+ if (!last)
+ fragrun_create(q, skb); /* First fragment. */
+ else if (last->ip_defrag_offset + last->len < end) {
+ /* This is the common case: skb goes to the end. */
+ /* Detect and discard overlaps. */
+ if (offset < last->ip_defrag_offset + last->len)
+ return IPFRAG_OVERLAP;
+ if (offset == last->ip_defrag_offset + last->len)
+ fragrun_append_to_last(q, skb);
+ else
+ fragrun_create(q, skb);
+ } else {
+ /* Binary search. Note that skb can become the first fragment,
+ * but not the last (covered above).
+ */
+ struct rb_node **rbn, *parent;
+
+ rbn = &q->rb_fragments.rb_node;
+ do {
+ struct sk_buff *curr;
+ int curr_run_end;
+
+ parent = *rbn;
+ curr = rb_to_skb(parent);
+ curr_run_end = curr->ip_defrag_offset +
+ FRAG_CB(curr)->frag_run_len;
+ if (end <= curr->ip_defrag_offset)
+ rbn = &parent->rb_left;
+ else if (offset >= curr_run_end)
+ rbn = &parent->rb_right;
+ else if (offset >= curr->ip_defrag_offset &&
+ end <= curr_run_end)
+ return IPFRAG_DUP;
+ else
+ return IPFRAG_OVERLAP;
+ } while (*rbn);
+ /* Here we have parent properly set, and rbn pointing to
+ * one of its NULL left/right children. Insert skb.
+ */
+ fragcb_clear(skb);
+ rb_link_node(&skb->rbnode, parent, rbn);
+ rb_insert_color(&skb->rbnode, &q->rb_fragments);
+ }
+
+ skb->ip_defrag_offset = offset;
+
+ return IPFRAG_OK;
+}
+EXPORT_SYMBOL(inet_frag_queue_insert);
+
+void *inet_frag_reasm_prepare(struct inet_frag_queue *q, struct sk_buff *skb,
+ struct sk_buff *parent)
+{
+ struct sk_buff *fp, *head = skb_rb_first(&q->rb_fragments);
+ struct sk_buff **nextp;
+ int delta;
+
+ if (head != skb) {
+ fp = skb_clone(skb, GFP_ATOMIC);
+ if (!fp)
+ return NULL;
+ FRAG_CB(fp)->next_frag = FRAG_CB(skb)->next_frag;
+ if (RB_EMPTY_NODE(&skb->rbnode))
+ FRAG_CB(parent)->next_frag = fp;
+ else
+ rb_replace_node(&skb->rbnode, &fp->rbnode,
+ &q->rb_fragments);
+ if (q->fragments_tail == skb)
+ q->fragments_tail = fp;
+ skb_morph(skb, head);
+ FRAG_CB(skb)->next_frag = FRAG_CB(head)->next_frag;
+ rb_replace_node(&head->rbnode, &skb->rbnode,
+ &q->rb_fragments);
+ consume_skb(head);
+ head = skb;
+ }
+ WARN_ON(head->ip_defrag_offset != 0);
+
+ delta = -head->truesize;
+
+ /* Head of list must not be cloned. */
+ if (skb_unclone(head, GFP_ATOMIC))
+ return NULL;
+
+ delta += head->truesize;
+ if (delta)
+ add_frag_mem_limit(q->net, delta);
+
+ /* If the first fragment is fragmented itself, we split
+ * it to two chunks: the first with data and paged part
+ * and the second, holding only fragments.
+ */
+ if (skb_has_frag_list(head)) {
+ struct sk_buff *clone;
+ int i, plen = 0;
+
+ clone = alloc_skb(0, GFP_ATOMIC);
+ if (!clone)
+ return NULL;
+ skb_shinfo(clone)->frag_list = skb_shinfo(head)->frag_list;
+ skb_frag_list_init(head);
+ for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
+ plen += skb_frag_size(&skb_shinfo(head)->frags[i]);
+ clone->data_len = head->data_len - plen;
+ clone->len = clone->data_len;
+ head->truesize += clone->truesize;
+ clone->csum = 0;
+ clone->ip_summed = head->ip_summed;
+ add_frag_mem_limit(q->net, clone->truesize);
+ skb_shinfo(head)->frag_list = clone;
+ nextp = &clone->next;
+ } else {
+ nextp = &skb_shinfo(head)->frag_list;
+ }
+
+ return nextp;
+}
+EXPORT_SYMBOL(inet_frag_reasm_prepare);
+
+void inet_frag_reasm_finish(struct inet_frag_queue *q, struct sk_buff *head,
+ void *reasm_data)
+{
+ struct sk_buff **nextp = (struct sk_buff **)reasm_data;
+ struct rb_node *rbn;
+ struct sk_buff *fp;
+
+ skb_push(head, head->data - skb_network_header(head));
+
+ /* Traverse the tree in order, to build frag_list. */
+ fp = FRAG_CB(head)->next_frag;
+ rbn = rb_next(&head->rbnode);
+ rb_erase(&head->rbnode, &q->rb_fragments);
+ while (rbn || fp) {
+ /* fp points to the next sk_buff in the current run;
+ * rbn points to the next run.
+ */
+ /* Go through the current run. */
+ while (fp) {
+ *nextp = fp;
+ nextp = &fp->next;
+ fp->prev = NULL;
+ memset(&fp->rbnode, 0, sizeof(fp->rbnode));
+ fp->sk = NULL;
+ head->data_len += fp->len;
+ head->len += fp->len;
+ if (head->ip_summed != fp->ip_summed)
+ head->ip_summed = CHECKSUM_NONE;
+ else if (head->ip_summed == CHECKSUM_COMPLETE)
+ head->csum = csum_add(head->csum, fp->csum);
+ head->truesize += fp->truesize;
+ fp = FRAG_CB(fp)->next_frag;
+ }
+ /* Move to the next run. */
+ if (rbn) {
+ struct rb_node *rbnext = rb_next(rbn);
+
+ fp = rb_to_skb(rbn);
+ rb_erase(rbn, &q->rb_fragments);
+ rbn = rbnext;
+ }
+ }
+ sub_frag_mem_limit(q->net, head->truesize);
+
+ *nextp = NULL;
+ head->next = NULL;
+ head->prev = NULL;
+ head->tstamp = q->stamp;
+}
+EXPORT_SYMBOL(inet_frag_reasm_finish);
+
+struct sk_buff *inet_frag_pull_head(struct inet_frag_queue *q)
+{
+ struct sk_buff *head;
+
+ if (q->fragments) {
+ head = q->fragments;
+ q->fragments = head->next;
+ } else {
+ struct sk_buff *skb;
+
+ head = skb_rb_first(&q->rb_fragments);
+ if (!head)
+ return NULL;
+ skb = FRAG_CB(head)->next_frag;
+ if (skb)
+ rb_replace_node(&head->rbnode, &skb->rbnode,
+ &q->rb_fragments);
+ else
+ rb_erase(&head->rbnode, &q->rb_fragments);
+ memset(&head->rbnode, 0, sizeof(head->rbnode));
+ barrier();
+ }
+ if (head == q->fragments_tail)
+ q->fragments_tail = NULL;
+
+ sub_frag_mem_limit(q->net, head->truesize);
+
+ return head;
+}
+EXPORT_SYMBOL(inet_frag_pull_head);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index a53652c8c0fd..726fdd1998b7 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -58,57 +58,6 @@
static int sysctl_ipfrag_max_dist __read_mostly = 64;
static const char ip_frag_cache_name[] = "ip4-frags";
-/* Use skb->cb to track consecutive/adjacent fragments coming at
- * the end of the queue. Nodes in the rb-tree queue will
- * contain "runs" of one or more adjacent fragments.
- *
- * Invariants:
- * - next_frag is NULL at the tail of a "run";
- * - the head of a "run" has the sum of all fragment lengths in frag_run_len.
- */
-struct ipfrag_skb_cb {
- struct inet_skb_parm h;
- struct sk_buff *next_frag;
- int frag_run_len;
-};
-
-#define FRAG_CB(skb) ((struct ipfrag_skb_cb *)((skb)->cb))
-
-static void ip4_frag_init_run(struct sk_buff *skb)
-{
- BUILD_BUG_ON(sizeof(struct ipfrag_skb_cb) > sizeof(skb->cb));
-
- FRAG_CB(skb)->next_frag = NULL;
- FRAG_CB(skb)->frag_run_len = skb->len;
-}
-
-/* Append skb to the last "run". */
-static void ip4_frag_append_to_last_run(struct inet_frag_queue *q,
- struct sk_buff *skb)
-{
- RB_CLEAR_NODE(&skb->rbnode);
- FRAG_CB(skb)->next_frag = NULL;
-
- FRAG_CB(q->last_run_head)->frag_run_len += skb->len;
- FRAG_CB(q->fragments_tail)->next_frag = skb;
- q->fragments_tail = skb;
-}
-
-/* Create a new "run" with the skb. */
-static void ip4_frag_create_run(struct inet_frag_queue *q, struct sk_buff *skb)
-{
- if (q->last_run_head)
- rb_link_node(&skb->rbnode, &q->last_run_head->rbnode,
- &q->last_run_head->rbnode.rb_right);
- else
- rb_link_node(&skb->rbnode, NULL, &q->rb_fragments.rb_node);
- rb_insert_color(&skb->rbnode, &q->rb_fragments);
-
- ip4_frag_init_run(skb);
- q->fragments_tail = skb;
- q->last_run_head = skb;
-}
-
/* Describe an entry in the "incomplete datagrams" queue. */
struct ipq {
struct inet_frag_queue q;
@@ -212,27 +161,9 @@ static void ip_expire(unsigned long arg)
* pull the head out of the tree in order to be able to
* deal with head->dev.
*/
- if (qp->q.fragments) {
- head = qp->q.fragments;
- qp->q.fragments = head->next;
- } else {
- head = skb_rb_first(&qp->q.rb_fragments);
- if (!head)
- goto out;
- if (FRAG_CB(head)->next_frag)
- rb_replace_node(&head->rbnode,
- &FRAG_CB(head)->next_frag->rbnode,
- &qp->q.rb_fragments);
- else
- rb_erase(&head->rbnode, &qp->q.rb_fragments);
- memset(&head->rbnode, 0, sizeof(head->rbnode));
- barrier();
- }
- if (head == qp->q.fragments_tail)
- qp->q.fragments_tail = NULL;
-
- sub_frag_mem_limit(qp->q.net, head->truesize);
-
+ head = inet_frag_pull_head(&qp->q);
+ if (!head)
+ goto out;
head->dev = dev_get_by_index_rcu(net, qp->iif);
if (!head->dev)
goto out;
@@ -345,12 +276,10 @@ static int ip_frag_reinit(struct ipq *qp)
static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
{
struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
- struct rb_node **rbn, *parent;
- struct sk_buff *skb1, *prev_tail;
- int ihl, end, skb1_run_end;
+ int ihl, end, flags, offset;
+ struct sk_buff *prev_tail;
struct net_device *dev;
unsigned int fragsize;
- int flags, offset;
int err = -ENOENT;
u8 ecn;
@@ -414,62 +343,13 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
/* Makes sure compiler wont do silly aliasing games */
barrier();
- /* RFC5722, Section 4, amended by Errata ID : 3089
- * When reassembling an IPv6 datagram, if
- * one or more its constituent fragments is determined to be an
- * overlapping fragment, the entire datagram (and any constituent
- * fragments) MUST be silently discarded.
- *
- * We do the same here for IPv4 (and increment an snmp counter) but
- * we do not want to drop the whole queue in response to a duplicate
- * fragment.
- */
-
- err = -EINVAL;
- /* Find out where to put this fragment. */
prev_tail = qp->q.fragments_tail;
- if (!prev_tail)
- ip4_frag_create_run(&qp->q, skb); /* First fragment. */
- else if (prev_tail->ip_defrag_offset + prev_tail->len < end) {
- /* This is the common case: skb goes to the end. */
- /* Detect and discard overlaps. */
- if (offset < prev_tail->ip_defrag_offset + prev_tail->len)
- goto overlap;
- if (offset == prev_tail->ip_defrag_offset + prev_tail->len)
- ip4_frag_append_to_last_run(&qp->q, skb);
- else
- ip4_frag_create_run(&qp->q, skb);
- } else {
- /* Binary search. Note that skb can become the first fragment,
- * but not the last (covered above).
- */
- rbn = &qp->q.rb_fragments.rb_node;
- do {
- parent = *rbn;
- skb1 = rb_to_skb(parent);
- skb1_run_end = skb1->ip_defrag_offset +
- FRAG_CB(skb1)->frag_run_len;
- if (end <= skb1->ip_defrag_offset)
- rbn = &parent->rb_left;
- else if (offset >= skb1_run_end)
- rbn = &parent->rb_right;
- else if (offset >= skb1->ip_defrag_offset &&
- end <= skb1_run_end)
- goto err; /* No new data, potential duplicate */
- else
- goto overlap; /* Found an overlap */
- } while (*rbn);
- /* Here we have parent properly set, and rbn pointing to
- * one of its NULL left/right children. Insert skb.
- */
- ip4_frag_init_run(skb);
- rb_link_node(&skb->rbnode, parent, rbn);
- rb_insert_color(&skb->rbnode, &qp->q.rb_fragments);
- }
+ err = inet_frag_queue_insert(&qp->q, skb, offset, end);
+ if (err)
+ goto insert_error;
if (dev)
qp->iif = dev->ifindex;
- skb->ip_defrag_offset = offset;
qp->q.stamp = skb->tstamp;
qp->q.meat += skb->len;
@@ -502,10 +382,15 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
skb_dst_drop(skb);
return -EINPROGRESS;
-overlap:
+insert_error:
+ if (err == IPFRAG_DUP) {
+ kfree_skb(skb);
+ return -EINVAL;
+ }
IP_INC_STATS_BH(net, IPSTATS_MIB_REASM_OVERLAPS);
discard_qp:
inet_frag_kill(&qp->q);
+ IP_INC_STATS_BH(net, IPSTATS_MIB_REASMFAILS);
err:
kfree_skb(skb);
return err;
@@ -517,13 +402,8 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
{
struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
struct iphdr *iph;
- struct sk_buff *fp, *head = skb_rb_first(&qp->q.rb_fragments);
- struct sk_buff **nextp; /* To build frag_list. */
- struct rb_node *rbn;
- int len;
- int ihlen;
- int delta;
- int err;
+ void *reasm_data;
+ int len, err;
u8 ecn;
ipq_kill(qp);
@@ -534,116 +414,20 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
goto out_fail;
}
/* Make the one we just received the head. */
- if (head != skb) {
- fp = skb_clone(skb, GFP_ATOMIC);
- if (!fp)
- goto out_nomem;
- FRAG_CB(fp)->next_frag = FRAG_CB(skb)->next_frag;
- if (RB_EMPTY_NODE(&skb->rbnode))
- FRAG_CB(prev_tail)->next_frag = fp;
- else
- rb_replace_node(&skb->rbnode, &fp->rbnode,
- &qp->q.rb_fragments);
- if (qp->q.fragments_tail == skb)
- qp->q.fragments_tail = fp;
- skb_morph(skb, head);
- FRAG_CB(skb)->next_frag = FRAG_CB(head)->next_frag;
- rb_replace_node(&head->rbnode, &skb->rbnode,
- &qp->q.rb_fragments);
- consume_skb(head);
- head = skb;
- }
-
- WARN_ON(head->ip_defrag_offset != 0);
-
- /* Allocate a new buffer for the datagram. */
- ihlen = ip_hdrlen(head);
- len = ihlen + qp->q.len;
+ reasm_data = inet_frag_reasm_prepare(&qp->q, skb, prev_tail);
+ if (!reasm_data)
+ goto out_nomem;
+ len = ip_hdrlen(skb) + qp->q.len;
err = -E2BIG;
if (len > 65535)
goto out_oversize;
- delta = - head->truesize;
+ inet_frag_reasm_finish(&qp->q, skb, reasm_data);
+ skb->dev = dev;
+ IPCB(skb)->frag_max_size = max(qp->max_df_size, qp->q.max_size);
- /* Head of list must not be cloned. */
- if (skb_unclone(head, GFP_ATOMIC))
- goto out_nomem;
-
- delta += head->truesize;
- if (delta)
- add_frag_mem_limit(qp->q.net, delta);
-
- /* If the first fragment is fragmented itself, we split
- * it to two chunks: the first with data and paged part
- * and the second, holding only fragments. */
- if (skb_has_frag_list(head)) {
- struct sk_buff *clone;
- int i, plen = 0;
-
- clone = alloc_skb(0, GFP_ATOMIC);
- if (!clone)
- goto out_nomem;
- skb_shinfo(clone)->frag_list = skb_shinfo(head)->frag_list;
- skb_frag_list_init(head);
- for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
- plen += skb_frag_size(&skb_shinfo(head)->frags[i]);
- clone->len = clone->data_len = head->data_len - plen;
- head->truesize += clone->truesize;
- clone->csum = 0;
- clone->ip_summed = head->ip_summed;
- add_frag_mem_limit(qp->q.net, clone->truesize);
- skb_shinfo(head)->frag_list = clone;
- nextp = &clone->next;
- } else {
- nextp = &skb_shinfo(head)->frag_list;
- }
-
- skb_push(head, head->data - skb_network_header(head));
-
- /* Traverse the tree in order, to build frag_list. */
- fp = FRAG_CB(head)->next_frag;
- rbn = rb_next(&head->rbnode);
- rb_erase(&head->rbnode, &qp->q.rb_fragments);
- while (rbn || fp) {
- /* fp points to the next sk_buff in the current run;
- * rbn points to the next run.
- */
- /* Go through the current run. */
- while (fp) {
- *nextp = fp;
- nextp = &fp->next;
- fp->prev = NULL;
- memset(&fp->rbnode, 0, sizeof(fp->rbnode));
- fp->sk = NULL;
- head->data_len += fp->len;
- head->len += fp->len;
- if (head->ip_summed != fp->ip_summed)
- head->ip_summed = CHECKSUM_NONE;
- else if (head->ip_summed == CHECKSUM_COMPLETE)
- head->csum = csum_add(head->csum, fp->csum);
- head->truesize += fp->truesize;
- fp = FRAG_CB(fp)->next_frag;
- }
- /* Move to the next run. */
- if (rbn) {
- struct rb_node *rbnext = rb_next(rbn);
-
- fp = rb_to_skb(rbn);
- rb_erase(rbn, &qp->q.rb_fragments);
- rbn = rbnext;
- }
- }
- sub_frag_mem_limit(qp->q.net, head->truesize);
-
- *nextp = NULL;
- head->next = NULL;
- head->prev = NULL;
- head->dev = dev;
- head->tstamp = qp->q.stamp;
- IPCB(head)->frag_max_size = max(qp->max_df_size, qp->q.max_size);
-
- iph = ip_hdr(head);
+ iph = ip_hdr(skb);
iph->tot_len = htons(len);
iph->tos |= ecn;
@@ -656,7 +440,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
* from one very small df-fragment and one large non-df frag.
*/
if (qp->max_df_size == qp->q.max_size) {
- IPCB(head)->flags |= IPSKB_FRAG_PMTU;
+ IPCB(skb)->flags |= IPSKB_FRAG_PMTU;
iph->frag_off = htons(IP_DF);
} else {
iph->frag_off = 0;
@@ -754,28 +538,6 @@ struct sk_buff *ip_check_defrag(struct net *net, struct sk_buff *skb, u32 user)
}
EXPORT_SYMBOL(ip_check_defrag);
-unsigned int inet_frag_rbtree_purge(struct rb_root *root)
-{
- struct rb_node *p = rb_first(root);
- unsigned int sum = 0;
-
- while (p) {
- struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
-
- p = rb_next(p);
- rb_erase(&skb->rbnode, root);
- while (skb) {
- struct sk_buff *next = FRAG_CB(skb)->next_frag;
-
- sum += skb->truesize;
- kfree_skb(skb);
- skb = next;
- }
- }
- return sum;
-}
-EXPORT_SYMBOL(inet_frag_rbtree_purge);
-
#ifdef CONFIG_SYSCTL
static int dist_min;
--
2.17.1
On Wed, May 29, 2019 at 12:25:39PM +0200, Stefan Bader wrote:
> From: Jiri Wiesner <[email protected]>
>
> The *_frag_reasm() functions are susceptible to miscalculating the byte
> count of packet fragments in case the truesize of a head buffer changes.
> The truesize member may be changed by the call to skb_unclone(), leaving
> the fragment memory limit counter unbalanced even if all fragments are
> processed. This miscalculation goes unnoticed as long as the network
> namespace which holds the counter is not destroyed.
>
> Should an attempt be made to destroy a network namespace that holds an
> unbalanced fragment memory limit counter the cleanup of the namespace
> never finishes. The thread handling the cleanup gets stuck in
> inet_frags_exit_net() waiting for the percpu counter to reach zero. The
> thread is usually in running state with a stacktrace similar to:
>
> PID: 1073 TASK: ffff880626711440 CPU: 1 COMMAND: "kworker/u48:4"
> #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
> #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
> #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
> #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
> #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
> #10 [ffff880621563e38] process_one_work at ffffffff81096f14
>
> It is not possible to create new network namespaces, and processes
> that call unshare() end up being stuck in uninterruptible sleep state
> waiting to acquire the net_mutex.
>
> The bug was observed in the IPv6 netfilter code by Per Sundstrom.
> I thank him for his analysis of the problem. The parts of this patch
> that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
>
> Signed-off-by: Jiri Wiesner <[email protected]>
> Reported-by: Per Sundstrom <[email protected]>
> Acked-by: Peter Oskolkov <[email protected]>
> Signed-off-by: David S. Miller <[email protected]>
>
> (backported from commit ebaf39e6032faf77218220707fc3fa22487784e0)
> [smb: context adjustments in net/ipv6/netfilter/nf_conntrack_reasm.c]
> Signed-off-by: Stefan Bader <[email protected]>
I can't take a patch for 4.4.y that is not in 4.9.y as anyone upgrading
kernel versions would have a regression :(
Can you also provide a backport of the needed patches for 4.9.y for this
issue so I can take these?
thanks,
greg k-h
On 29.05.19 12:37, Greg KH wrote:
> On Wed, May 29, 2019 at 12:25:39PM +0200, Stefan Bader wrote:
>> From: Jiri Wiesner <[email protected]>
>>
>> The *_frag_reasm() functions are susceptible to miscalculating the byte
>> count of packet fragments in case the truesize of a head buffer changes.
>> The truesize member may be changed by the call to skb_unclone(), leaving
>> the fragment memory limit counter unbalanced even if all fragments are
>> processed. This miscalculation goes unnoticed as long as the network
>> namespace which holds the counter is not destroyed.
>>
>> Should an attempt be made to destroy a network namespace that holds an
>> unbalanced fragment memory limit counter the cleanup of the namespace
>> never finishes. The thread handling the cleanup gets stuck in
>> inet_frags_exit_net() waiting for the percpu counter to reach zero. The
>> thread is usually in running state with a stacktrace similar to:
>>
>> PID: 1073 TASK: ffff880626711440 CPU: 1 COMMAND: "kworker/u48:4"
>> #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
>> #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
>> #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
>> #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
>> #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
>> #10 [ffff880621563e38] process_one_work at ffffffff81096f14
>>
>> It is not possible to create new network namespaces, and processes
>> that call unshare() end up being stuck in uninterruptible sleep state
>> waiting to acquire the net_mutex.
>>
>> The bug was observed in the IPv6 netfilter code by Per Sundstrom.
>> I thank him for his analysis of the problem. The parts of this patch
>> that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
>>
>> Signed-off-by: Jiri Wiesner <[email protected]>
>> Reported-by: Per Sundstrom <[email protected]>
>> Acked-by: Peter Oskolkov <[email protected]>
>> Signed-off-by: David S. Miller <[email protected]>
>>
>> (backported from commit ebaf39e6032faf77218220707fc3fa22487784e0)
>> [smb: context adjustments in net/ipv6/netfilter/nf_conntrack_reasm.c]
>> Signed-off-by: Stefan Bader <[email protected]>
>
> I can't take a patch for 4.4.y that is not in 4.9.y as anyone upgrading
> kernel versions would have a regression :(
>
> Can you also provide a backport of the needed patches for 4.9.y for this
> issue so I can take these?
I will, once it is clear that a) the backport looks alright and b) it is ok to
be done. Alternatively it might be decided that only the parts necessary for
pulling out a frag head should be picked.
Or the net developers might decide they want to send things out themselves.
The problem potentially exists in any stable tree up to v5.1, and I have no
complete overview of where this was backported to.
So this is more a start of a discussion than a request to apply it. stable was
just included to make the stable maintainers aware.
-Stefan
>
> thanks,
>
> greg k-h
>
On 29.05.19 12:37, Greg KH wrote:
> On Wed, May 29, 2019 at 12:25:39PM +0200, Stefan Bader wrote:
>> From: Jiri Wiesner <[email protected]>
>>
>> The *_frag_reasm() functions are susceptible to miscalculating the byte
>> count of packet fragments in case the truesize of a head buffer changes.
>> The truesize member may be changed by the call to skb_unclone(), leaving
>> the fragment memory limit counter unbalanced even if all fragments are
>> processed. This miscalculation goes unnoticed as long as the network
>> namespace which holds the counter is not destroyed.
>>
>> Should an attempt be made to destroy a network namespace that holds an
>> unbalanced fragment memory limit counter the cleanup of the namespace
>> never finishes. The thread handling the cleanup gets stuck in
>> inet_frags_exit_net() waiting for the percpu counter to reach zero. The
>> thread is usually in running state with a stacktrace similar to:
>>
>> PID: 1073 TASK: ffff880626711440 CPU: 1 COMMAND: "kworker/u48:4"
>> #5 [ffff880621563d48] _raw_spin_lock at ffffffff815f5480
>> #6 [ffff880621563d48] inet_evict_bucket at ffffffff8158020b
>> #7 [ffff880621563d80] inet_frags_exit_net at ffffffff8158051c
>> #8 [ffff880621563db0] ops_exit_list at ffffffff814f5856
>> #9 [ffff880621563dd8] cleanup_net at ffffffff814f67c0
>> #10 [ffff880621563e38] process_one_work at ffffffff81096f14
>>
>> It is not possible to create new network namespaces, and processes
>> that call unshare() end up being stuck in uninterruptible sleep state
>> waiting to acquire the net_mutex.
>>
>> The bug was observed in the IPv6 netfilter code by Per Sundstrom.
>> I thank him for his analysis of the problem. The parts of this patch
>> that apply to IPv4 and IPv6 fragment reassembly are preemptive measures.
>>
>> Signed-off-by: Jiri Wiesner <[email protected]>
>> Reported-by: Per Sundstrom <[email protected]>
>> Acked-by: Peter Oskolkov <[email protected]>
>> Signed-off-by: David S. Miller <[email protected]>
>>
>> (backported from commit ebaf39e6032faf77218220707fc3fa22487784e0)
>> [smb: context adjustments in net/ipv6/netfilter/nf_conntrack_reasm.c]
>> Signed-off-by: Stefan Bader <[email protected]>
>
> I can't take a patch for 4.4.y that is not in 4.9.y as anyone upgrading
> kernel versions would have a regression :(
>
> Can you also provide a backport of the needed patches for 4.9.y for this
> issue so I can take these?
It turns out that I cannot provide patches for 4.9.y because they are not
needed there. Patches #1 and #2 of the list I sent do not explicitly appear
there. However, patch #3 does, and it might be possible to do the changes of
the other two implicitly by adjusting the removal of the functions from their
old locations while keeping the additions unmodified.
And the final patch for pulling the skb off the list is covered in 4.9.y by
the backports for using rbtrees in ipv6, too. In 4.9.y, however, the skb_get()
still needs to be dropped. Sasha did not apply that in the end, maybe partly
because of my warning that it would not be enough for 4.4.y.
So I think there are two options for 4.4.y, which I would defer to the
net developers to decide:
- either also backport the patches for using rbtrees in ipv6 to 4.4.y
  (including the use of inet_frag_pull_head() in ip6_expire_frag_queue()
  and dropping the skb_get() there),
- or make some limited change to ip6_expire_frag_queue(), probably only
  doing the part not related to rbtrees. This would not require patch #3
  as a prerequisite.
Patches #1 and #2 might be considered separately. Those are unrelated
to the crash in ip6_expire_frag_queue() but could fix other issues (I do
not like the sound of net namespace teardown possibly getting stuck).
For 4.9.y, please re-consider picking "ip6: fix skb leak in
ip6frag_expire_frag_queue()". Depending on the checks down the path of
icmpv6_send(), it might be a crash there, too. In that case that change
should be enough, as inet_frag_pull_head() takes the skb off the queue,
so it will not get freed twice when the whole queue is released.
-Stefan
>
> thanks,
>
> greg k-h
>