2021-03-08 08:34:35

by Mark Tomlinson

Subject: [PATCH v2 0/3] Don't use RCU for x_tables synchronization

The patches that switched x_tables to RCU synchronization slow table
updates down by an order of magnitude. This has been tried before (see
https://lore.kernel.org/patchwork/patch/151796/) and was ultimately
rejected. As mentioned in the patch description, a different method can
be used to ensure ordering of reads/writes; this can be done simply by
changing smp_wmb() to smp_mb().
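
With the reverts in place, the table swap in xt_replace_table() then
boils down to the sketch below (condensed from patch 2/3, with the full
barrier from patch 3/3; error handling, the smp_wmb() before
publication and the local_bh_disable()/enable() pair are omitted):

	/* Publish the new ruleset. */
	table->private = newinfo;

	/* Full barrier: order the store above against the seqcount
	 * reads below (this is the smp_wmb() -> smp_mb() change).
	 */
	smp_mb();

	/* Wait until every CPU has left the packet-path section it was
	 * in when the swap happened, so nobody can still be using the
	 * old 'private'.
	 */
	for_each_possible_cpu(cpu) {
		seqcount_t *s = &per_cpu(xt_recseq, cpu);
		u32 seq = raw_read_seqcount(s);

		if (seq & 1) {
			do {
				cond_resched();
				cpu_relax();
			} while (seq == raw_read_seqcount(s));
		}
	}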

changes in v2:
- Update commit messages only

Mark Tomlinson (3):
Revert "netfilter: x_tables: Update remaining dereference to RCU"
Revert "netfilter: x_tables: Switch synchronization to RCU"
netfilter: x_tables: Use correct memory barriers.

include/linux/netfilter/x_tables.h | 7 ++---
net/ipv4/netfilter/arp_tables.c | 16 +++++-----
net/ipv4/netfilter/ip_tables.c | 16 +++++-----
net/ipv6/netfilter/ip6_tables.c | 16 +++++-----
net/netfilter/x_tables.c | 49 +++++++++++++++++++++---------
5 files changed, 60 insertions(+), 44 deletions(-)

--
2.30.1


2021-03-08 08:34:42

by Mark Tomlinson

Subject: [PATCH v2 2/3] Revert "netfilter: x_tables: Switch synchronization to RCU"

This reverts commit cc00bcaa589914096edef7fb87ca5cee4a166b5c.

This (and the preceding) patch basically re-implemented the RCU
mechanisms of patch 784544739a25. That patch was replaced because of the
performance problems that it created when replacing tables. Now, we have
the same issue: the call to synchronize_rcu() makes replacing tables
slower by as much as an order of magnitude.
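
For reference, the publication path being removed boils down to the
following (simplified from xt_replace_table(); the complete change is
in the diff below):

	rcu_assign_pointer(table->private, newinfo);
	synchronize_rcu();	/* a full grace period for every table
				 * replacement -- this is the slow part */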

Prior to using RCU, a script calling "iptables" approximately 200 times
took 1.16s. With RCU, this increased to 11.59s.

Revert these patches and fix the issue in a different way.

Signed-off-by: Mark Tomlinson <[email protected]>
---
include/linux/netfilter/x_tables.h | 5 +--
net/ipv4/netfilter/arp_tables.c | 14 ++++-----
net/ipv4/netfilter/ip_tables.c | 14 ++++-----
net/ipv6/netfilter/ip6_tables.c | 14 ++++-----
net/netfilter/x_tables.c | 49 +++++++++++++++++++++---------
5 files changed, 56 insertions(+), 40 deletions(-)

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 8ebb64193757..5deb099d156d 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -227,7 +227,7 @@ struct xt_table {
unsigned int valid_hooks;

/* Man behind the curtain... */
- struct xt_table_info __rcu *private;
+ struct xt_table_info *private;

/* Set this to THIS_MODULE if you are a module, otherwise NULL */
struct module *me;
@@ -448,9 +448,6 @@ xt_get_per_cpu_counter(struct xt_counters *cnt, unsigned int cpu)

struct nf_hook_ops *xt_hook_ops_alloc(const struct xt_table *, nf_hookfn *);

-struct xt_table_info
-*xt_table_get_private_protected(const struct xt_table *table);
-
#ifdef CONFIG_COMPAT
#include <net/compat.h>

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 563b62b76a5f..d1e04d2b5170 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -203,7 +203,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,

local_bh_disable();
addend = xt_write_recseq_begin();
- private = rcu_access_pointer(table->private);
+ private = READ_ONCE(table->private); /* Address dependency. */
cpu = smp_processor_id();
table_base = private->entries;
jumpstack = (struct arpt_entry **)private->jumpstack[cpu];
@@ -649,7 +649,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table)
{
unsigned int countersize;
struct xt_counters *counters;
- const struct xt_table_info *private = xt_table_get_private_protected(table);
+ const struct xt_table_info *private = table->private;

/* We need atomic snapshot of counters: rest doesn't change
* (other than comefrom, which userspace doesn't care
@@ -673,7 +673,7 @@ static int copy_entries_to_user(unsigned int total_size,
unsigned int off, num;
const struct arpt_entry *e;
struct xt_counters *counters;
- struct xt_table_info *private = xt_table_get_private_protected(table);
+ struct xt_table_info *private = table->private;
int ret = 0;
void *loc_cpu_entry;

@@ -807,7 +807,7 @@ static int get_info(struct net *net, void __user *user, const int *len)
t = xt_request_find_table_lock(net, NFPROTO_ARP, name);
if (!IS_ERR(t)) {
struct arpt_getinfo info;
- const struct xt_table_info *private = xt_table_get_private_protected(t);
+ const struct xt_table_info *private = t->private;
#ifdef CONFIG_COMPAT
struct xt_table_info tmp;

@@ -860,7 +860,7 @@ static int get_entries(struct net *net, struct arpt_get_entries __user *uptr,

t = xt_find_table_lock(net, NFPROTO_ARP, get.name);
if (!IS_ERR(t)) {
- const struct xt_table_info *private = xt_table_get_private_protected(t);
+ const struct xt_table_info *private = t->private;

if (get.size == private->size)
ret = copy_entries_to_user(private->size,
@@ -1017,7 +1017,7 @@ static int do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}

local_bh_disable();
- private = xt_table_get_private_protected(t);
+ private = t->private;
if (private->number != tmp.num_counters) {
ret = -EINVAL;
goto unlock_up_free;
@@ -1330,7 +1330,7 @@ static int compat_copy_entries_to_user(unsigned int total_size,
void __user *userptr)
{
struct xt_counters *counters;
- const struct xt_table_info *private = xt_table_get_private_protected(table);
+ const struct xt_table_info *private = table->private;
void __user *pos;
unsigned int size;
int ret = 0;
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 6e2851f8d3a3..f15bc21d7301 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -258,7 +258,7 @@ ipt_do_table(struct sk_buff *skb,
WARN_ON(!(table->valid_hooks & (1 << hook)));
local_bh_disable();
addend = xt_write_recseq_begin();
- private = rcu_access_pointer(table->private);
+ private = READ_ONCE(table->private); /* Address dependency. */
cpu = smp_processor_id();
table_base = private->entries;
jumpstack = (struct ipt_entry **)private->jumpstack[cpu];
@@ -791,7 +791,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table)
{
unsigned int countersize;
struct xt_counters *counters;
- const struct xt_table_info *private = xt_table_get_private_protected(table);
+ const struct xt_table_info *private = table->private;

/* We need atomic snapshot of counters: rest doesn't change
(other than comefrom, which userspace doesn't care
@@ -815,7 +815,7 @@ copy_entries_to_user(unsigned int total_size,
unsigned int off, num;
const struct ipt_entry *e;
struct xt_counters *counters;
- const struct xt_table_info *private = xt_table_get_private_protected(table);
+ const struct xt_table_info *private = table->private;
int ret = 0;
const void *loc_cpu_entry;

@@ -964,7 +964,7 @@ static int get_info(struct net *net, void __user *user, const int *len)
t = xt_request_find_table_lock(net, AF_INET, name);
if (!IS_ERR(t)) {
struct ipt_getinfo info;
- const struct xt_table_info *private = xt_table_get_private_protected(t);
+ const struct xt_table_info *private = t->private;
#ifdef CONFIG_COMPAT
struct xt_table_info tmp;

@@ -1018,7 +1018,7 @@ get_entries(struct net *net, struct ipt_get_entries __user *uptr,

t = xt_find_table_lock(net, AF_INET, get.name);
if (!IS_ERR(t)) {
- const struct xt_table_info *private = xt_table_get_private_protected(t);
+ const struct xt_table_info *private = t->private;
if (get.size == private->size)
ret = copy_entries_to_user(private->size,
t, uptr->entrytable);
@@ -1173,7 +1173,7 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}

local_bh_disable();
- private = xt_table_get_private_protected(t);
+ private = t->private;
if (private->number != tmp.num_counters) {
ret = -EINVAL;
goto unlock_up_free;
@@ -1543,7 +1543,7 @@ compat_copy_entries_to_user(unsigned int total_size, struct xt_table *table,
void __user *userptr)
{
struct xt_counters *counters;
- const struct xt_table_info *private = xt_table_get_private_protected(table);
+ const struct xt_table_info *private = table->private;
void __user *pos;
unsigned int size;
int ret = 0;
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index c4f532f4d311..2e2119bfcf13 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -280,7 +280,7 @@ ip6t_do_table(struct sk_buff *skb,

local_bh_disable();
addend = xt_write_recseq_begin();
- private = rcu_access_pointer(table->private);
+ private = READ_ONCE(table->private); /* Address dependency. */
cpu = smp_processor_id();
table_base = private->entries;
jumpstack = (struct ip6t_entry **)private->jumpstack[cpu];
@@ -807,7 +807,7 @@ static struct xt_counters *alloc_counters(const struct xt_table *table)
{
unsigned int countersize;
struct xt_counters *counters;
- const struct xt_table_info *private = xt_table_get_private_protected(table);
+ const struct xt_table_info *private = table->private;

/* We need atomic snapshot of counters: rest doesn't change
(other than comefrom, which userspace doesn't care
@@ -831,7 +831,7 @@ copy_entries_to_user(unsigned int total_size,
unsigned int off, num;
const struct ip6t_entry *e;
struct xt_counters *counters;
- const struct xt_table_info *private = xt_table_get_private_protected(table);
+ const struct xt_table_info *private = table->private;
int ret = 0;
const void *loc_cpu_entry;

@@ -980,7 +980,7 @@ static int get_info(struct net *net, void __user *user, const int *len)
t = xt_request_find_table_lock(net, AF_INET6, name);
if (!IS_ERR(t)) {
struct ip6t_getinfo info;
- const struct xt_table_info *private = xt_table_get_private_protected(t);
+ const struct xt_table_info *private = t->private;
#ifdef CONFIG_COMPAT
struct xt_table_info tmp;

@@ -1035,7 +1035,7 @@ get_entries(struct net *net, struct ip6t_get_entries __user *uptr,

t = xt_find_table_lock(net, AF_INET6, get.name);
if (!IS_ERR(t)) {
- struct xt_table_info *private = xt_table_get_private_protected(t);
+ struct xt_table_info *private = t->private;
if (get.size == private->size)
ret = copy_entries_to_user(private->size,
t, uptr->entrytable);
@@ -1189,7 +1189,7 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}

local_bh_disable();
- private = xt_table_get_private_protected(t);
+ private = t->private;
if (private->number != tmp.num_counters) {
ret = -EINVAL;
goto unlock_up_free;
@@ -1552,7 +1552,7 @@ compat_copy_entries_to_user(unsigned int total_size, struct xt_table *table,
void __user *userptr)
{
struct xt_counters *counters;
- const struct xt_table_info *private = xt_table_get_private_protected(table);
+ const struct xt_table_info *private = table->private;
void __user *pos;
unsigned int size;
int ret = 0;
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index acce622582e3..af22dbe85e2c 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1349,14 +1349,6 @@ struct xt_counters *xt_counters_alloc(unsigned int counters)
}
EXPORT_SYMBOL(xt_counters_alloc);

-struct xt_table_info
-*xt_table_get_private_protected(const struct xt_table *table)
-{
- return rcu_dereference_protected(table->private,
- mutex_is_locked(&xt[table->af].mutex));
-}
-EXPORT_SYMBOL(xt_table_get_private_protected);
-
struct xt_table_info *
xt_replace_table(struct xt_table *table,
unsigned int num_counters,
@@ -1364,6 +1356,7 @@ xt_replace_table(struct xt_table *table,
int *error)
{
struct xt_table_info *private;
+ unsigned int cpu;
int ret;

ret = xt_jumpstack_alloc(newinfo);
@@ -1373,20 +1366,47 @@ xt_replace_table(struct xt_table *table,
}

/* Do the substitution. */
- private = xt_table_get_private_protected(table);
+ local_bh_disable();
+ private = table->private;

/* Check inside lock: is the old number correct? */
if (num_counters != private->number) {
pr_debug("num_counters != table->private->number (%u/%u)\n",
num_counters, private->number);
+ local_bh_enable();
*error = -EAGAIN;
return NULL;
}

newinfo->initial_entries = private->initial_entries;
+ /*
+ * Ensure contents of newinfo are visible before assigning to
+ * private.
+ */
+ smp_wmb();
+ table->private = newinfo;
+
+ /* make sure all cpus see new ->private value */
+ smp_wmb();

- rcu_assign_pointer(table->private, newinfo);
- synchronize_rcu();
+ /*
+ * Even though table entries have now been swapped, other CPU's
+ * may still be using the old entries...
+ */
+ local_bh_enable();
+
+ /* ... so wait for even xt_recseq on all cpus */
+ for_each_possible_cpu(cpu) {
+ seqcount_t *s = &per_cpu(xt_recseq, cpu);
+ u32 seq = raw_read_seqcount(s);
+
+ if (seq & 1) {
+ do {
+ cond_resched();
+ cpu_relax();
+ } while (seq == raw_read_seqcount(s));
+ }
+ }

audit_log_nfcfg(table->name, table->af, private->number,
!private->number ? AUDIT_XT_OP_REGISTER :
@@ -1422,12 +1442,12 @@ struct xt_table *xt_register_table(struct net *net,
}

/* Simplifies replace_table code. */
- rcu_assign_pointer(table->private, bootstrap);
+ table->private = bootstrap;

if (!xt_replace_table(table, 0, newinfo, &ret))
goto unlock;

- private = xt_table_get_private_protected(table);
+ private = table->private;
pr_debug("table->private->number = %u\n", private->number);

/* save number of initial entries */
@@ -1450,8 +1470,7 @@ void *xt_unregister_table(struct xt_table *table)
struct xt_table_info *private;

mutex_lock(&xt[table->af].mutex);
- private = xt_table_get_private_protected(table);
- RCU_INIT_POINTER(table->private, NULL);
+ private = table->private;
list_del(&table->list);
mutex_unlock(&xt[table->af].mutex);
audit_log_nfcfg(table->name, table->af, private->number,
--
2.30.1

2021-03-08 08:34:48

by Mark Tomlinson

Subject: [PATCH v2 3/3] netfilter: x_tables: Use correct memory barriers.

When a new table value was assigned, it was followed by a write memory
barrier. This ensured that all writes before this point would complete
before any writes after this point. However, to determine whether the
rules are unused, the sequence counter is read. To ensure that all
writes have been done before these reads, a full memory barrier is
needed, not just a write memory barrier. The same argument applies when
incrementing the counter, before the rules are read.
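
In other words, the writer and the packet path each perform a store
followed by a load of the location the other side stores to, and
smp_wmb() does not order a store against a later load. A rough sketch
of the pairing (not verbatim from the code):

	/*
	 * xt_replace_table()            packet path
	 * ------------------            -----------------------------------
	 * table->private = newinfo;     xt_recseq.sequence += addend;
	 * smp_mb();                     smp_mb();
	 * seq = raw_read_seqcount(s);   private = READ_ONCE(table->private);
	 *
	 * With only smp_wmb(), each CPU may execute its load before the
	 * preceding store becomes visible: the writer can then observe an
	 * even (idle) sequence count while the packet path still loads and
	 * uses the old 'private'.
	 */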

Using smp_mb() instead of smp_wmb() fixes the kernel panic reported in
cc00bcaa5899 (which is still present), while maintaining the same speed
of replacing tables.

The smp_mb() barriers potentially slow the packet path; however, testing
has shown no measurable change in performance on a 4-core MIPS64
platform.

Fixes: 7f5c6d4f665b ("netfilter: get rid of atomic ops in fast path")
Signed-off-by: Mark Tomlinson <[email protected]>
---
include/linux/netfilter/x_tables.h | 2 +-
net/netfilter/x_tables.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index 5deb099d156d..8ec48466410a 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -376,7 +376,7 @@ static inline unsigned int xt_write_recseq_begin(void)
* since addend is most likely 1
*/
__this_cpu_add(xt_recseq.sequence, addend);
- smp_wmb();
+ smp_mb();

return addend;
}
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index af22dbe85e2c..a2b50596b87e 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1387,7 +1387,7 @@ xt_replace_table(struct xt_table *table,
table->private = newinfo;

/* make sure all cpus see new ->private value */
- smp_wmb();
+ smp_mb();

/*
* Even though table entries have now been swapped, other CPU's
--
2.30.1

2021-03-09 11:37:49

by Florian Westphal

Subject: Re: [PATCH v2 3/3] netfilter: x_tables: Use correct memory barriers.

Mark Tomlinson <[email protected]> wrote:
> When a new table value was assigned, it was followed by a write memory
> barrier. This ensured that all writes before this point would complete
> before any writes after this point. However, to determine whether the
> rules are unused, the sequence counter is read. To ensure that all
> writes have been done before these reads, a full memory barrier is
> needed, not just a write memory barrier. The same argument applies when
> incrementing the counter, before the rules are read.
>
> Using smp_mb() instead of smp_wmb() fixes the kernel panic reported in
> cc00bcaa5899 (which is still present), while maintaining the same speed
> of replacing tables.
>
> The smp_mb() barriers potentially slow the packet path; however, testing
> has shown no measurable change in performance on a 4-core MIPS64
> platform.

Okay, thanks for testing. I have no further feedback.

2021-03-11 15:54:00

by Tyler Hicks

Subject: Re: [PATCH v2 0/3] Don't use RCU for x_tables synchronization

On 2021-03-08 14:24:10, Mark Tomlinson wrote:
> The patches that switched x_tables to RCU synchronization slow table
> updates down by an order of magnitude. This has been tried before (see
> https://lore.kernel.org/patchwork/patch/151796/) and was ultimately
> rejected. As mentioned in the patch description, a different method can
> be used to ensure ordering of reads/writes; this can be done simply by
> changing smp_wmb() to smp_mb().

I came across these patches while trying to root-cause a significant boot-time
regression we were seeing when comparing 5.4 to 5.10. The RCU changes were
where I had started to investigate, but I hadn't yet made it to the point that
Mark had reached when he sent out these fixes.

Mark's description of the problem and fix looks correct to me.

To touch on why the performance regression is a problem for us, we have
three services that load various firewall rules during boot up. Here are
the total run times that `systemd-analyze blame` reported for each one
to run to completion:

5.4.88:

129ms fw-1.service
56ms fw-2.service
38ms fw-3.service

5.10.19:

586ms fw-1.service
847ms fw-2.service
193ms fw-3.service

5.10.19 + these fixes:

98ms fw-1.service
25ms fw-2.service
54ms fw-3.service

You can see that the performance regression from 5.4.88 -> 5.10.19
considerably increased our boot time.

I cannot explain why '5.10.19 + these fixes' shows improvements over
5.4.88 for two of the three services, but I suspect it is due to systemd
taking different code paths, units executing at different times,
unrelated kernel performance improvements, etc.

For all three patches,

Tested-by: Tyler Hicks <[email protected]>
Reviewed-by: Tyler Hicks <[email protected]>

Tyler

>
> changes in v2:
> - Update commit messages only
>
> Mark Tomlinson (3):
> Revert "netfilter: x_tables: Update remaining dereference to RCU"
> Revert "netfilter: x_tables: Switch synchronization to RCU"
> netfilter: x_tables: Use correct memory barriers.
>
> include/linux/netfilter/x_tables.h | 7 ++---
> net/ipv4/netfilter/arp_tables.c | 16 +++++-----
> net/ipv4/netfilter/ip_tables.c | 16 +++++-----
> net/ipv6/netfilter/ip6_tables.c | 16 +++++-----
> net/netfilter/x_tables.c | 49 +++++++++++++++++++++---------
> 5 files changed, 60 insertions(+), 44 deletions(-)
>
> --
> 2.30.1
>

2021-03-15 17:49:45

by Pablo Neira Ayuso

Subject: Re: [PATCH v2 0/3] Don't use RCU for x_tables synchronization

On Mon, Mar 08, 2021 at 02:24:10PM +1300, Mark Tomlinson wrote:
> The patches that switched x_tables to RCU synchronization slow table
> updates down by an order of magnitude. This has been tried before (see
> https://lore.kernel.org/patchwork/patch/151796/) and was ultimately
> rejected. As mentioned in the patch description, a different method can
> be used to ensure ordering of reads/writes; this can be done simply by
> changing smp_wmb() to smp_mb().

Applied.