2024-01-05 22:45:53

by Ale Crismani

Subject: Performance regression in ip_set_swap on 6.1.69

Dear all,

While upgrading some of the Debian hosts that make up a Kubernetes cluster, we found a performance regression in ip_set_swap on 6.1.69. Calls to ip_set_swap now take roughly 15 ms, while they took just tens of microseconds before.

The issue is very visible for us, since we use kube-router as our Kubernetes networking layer, and it constantly swaps ipsets to populate the sets that enforce firewall policies between containers.
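For reference, the swap-based update pattern looks roughly like this (an illustrative transcript with made-up set names; kube-router performs the equivalent operations, not necessarily through the CLI):

```shell
# Build the new policy members in a staging set, then swap it in
# atomically so packets never see an empty or half-populated set.
ipset create kube-pods-live hash:ip
ipset create kube-pods-next hash:ip
ipset add kube-pods-next 10.244.1.5
ipset add kube-pods-next 10.244.2.9

# One swap call per update; this is the path that regressed.
ipset swap kube-pods-next kube-pods-live
ipset destroy kube-pods-next
```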

We tracked the issue down with strace, and then gathered statistics with bpftrace running:
---
kfunc:ip_set:ip_set_swap {
	@start[tid] = nsecs;
}

kretfunc:ip_set:ip_set_swap {
	if (@start[tid]) {
		@srlat = hist((nsecs - @start[tid]) / 1000);
		delete(@start[tid]);
	}
}

interval:s:20 {
	printf("ip_set_swap() latency, microseconds:\n");
	print(@srlat);
	clear(@srlat);
}
---

On 6.1.69 results look like:

ip_set_swap() latency, microseconds:
[8K, 16K) 1848 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[16K, 32K) 1017 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[32K, 64K) 38 |@

while on 6.1.67:
ip_set_swap() latency, microseconds:

[0] 166 |@
[1] 378 |@@
[2, 4) 762 |@@@@@
[4, 8) 1624 |@@@@@@@@@@@
[8, 16) 3493 |@@@@@@@@@@@@@@@@@@@@@@@@
[16, 32) 7308 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[32, 64) 6412 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
[64, 128) 1 |
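As a rough back-of-the-envelope check (bucket counts taken from the two histograms above, with each bucket approximated by its midpoint), the means differ by roughly three orders of magnitude:

```python
# Rough weighted-midpoint means of the two bpftrace histograms above.
# Buckets are (low, high, count) in microseconds; "8K" in bpftrace output
# means 8192. The [0] and [1] buckets are treated as [0,1) and [1,2).
def hist_mean(buckets):
    total = sum(count for _, _, count in buckets)
    return sum((lo + hi) / 2 * count for lo, hi, count in buckets) / total

v6_1_69 = [(8192, 16384, 1848), (16384, 32768, 1017), (32768, 65536, 38)]
v6_1_67 = [(0, 1, 166), (1, 2, 378), (2, 4, 762), (4, 8, 1624),
           (8, 16, 3493), (16, 32, 7308), (32, 64, 6412), (64, 128, 1)]

print(f"6.1.69 mean: ~{hist_mean(v6_1_69) / 1000:.1f} ms")  # ~17.1 ms
print(f"6.1.67 mean: ~{hist_mean(v6_1_67):.1f} us")         # ~26.7 us
```

That is consistent with the ~15 ms per call we see on 6.1.69 versus tens of microseconds on 6.1.67.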

We tried compiling commits between 6.1.67 and 6.1.69, and the performance regression seems to have been introduced by 875ee3a: ip_set_swap is fast on 602505, which precedes it, and slow with it applied.

This is the first time I post here, so I hope the format is appropriate, and thanks for any help with this! Also, if possible, please CC me on any reply, as I am not subscribed.

Alessandro Crismani


2024-01-10 10:43:51

by Jozsef Kadlecsik

Subject: Re: Performance regression in ip_set_swap on 6.1.69

On Wed, 10 Jan 2024, David Wang wrote:

> I confirmed on 6.7 that this was introduced by commit
> 28628fa952fefc7f2072ce6e8016968cc452b1ba, with the following changes:
>
> static inline void
> @@ -1397,6 +1394,9 @@ static int ip_set_swap(struct sk_buff *skb, const struct nfnl_info *info,
> ip_set(inst, to_id) = from;
> write_unlock_bh(&ip_set_ref_lock);
>
> + /* Make sure all readers of the old set pointers are completed. */
> + synchronize_rcu();
> +
> return 0;
> }
>
> synchronize_rcu causes the delay, and its usage here is very confusing:
> there is no reclaimer code after it.

As I'm seeing just the end of the discussion, please send a full report of
the problem and how to reproduce it.

Best regards,
Jozsef
--
E-mail : [email protected], [email protected]
PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics
H-1525 Budapest 114, POB. 49, Hungary

2024-01-11 08:26:06

by Jozsef Kadlecsik

Subject: Re: Performance regression in ip_set_swap on 6.1.69

Hi,

On Wed, 10 Jan 2024, Jozsef Kadlecsik wrote:

> On Wed, 10 Jan 2024, David Wang wrote:
>
> > At 2024-01-10 18:35:02, "Jozsef Kadlecsik" <[email protected]> wrote:
> > >On Wed, 10 Jan 2024, David Wang wrote:
> > >
> > >> I confirmed on 6.7 that this was introduced by commit
> > >> 28628fa952fefc7f2072ce6e8016968cc452b1ba, with the following changes:
> > >>
> > >> static inline void
> > >> @@ -1397,6 +1394,9 @@ static int ip_set_swap(struct sk_buff *skb, const struct nfnl_info *info,
> > >> ip_set(inst, to_id) = from;
> > >> write_unlock_bh(&ip_set_ref_lock);
> > >>
> > >> + /* Make sure all readers of the old set pointers are completed. */
> > >> + synchronize_rcu();
> > >> +
> > >> return 0;
> > >> }
> > >>
> > >> synchronize_rcu causes the delay, and its usage here is very confusing:
> > >> there is no reclaimer code after it.
> > >
> > >As I'm seeing just the end of the discussion, please send a full report of
> > >the problem and how to reproduce it.
> > >
> >
> > This was reported in
> > https://lore.kernel.org/lkml/[email protected]/
> > by [email protected]. Out of interest in performance
> > issues, I tried to reproduce it with a test stressing ip_set_swap:
> >
> > My test code is as follows; it stresses swapping ipset 'foo' with
> > 'bar' (the foo/bar ipsets need to be created before the test). With the
> > latest 6.7, the stress test took about 180 seconds to finish, but with
> > `synchronize_rcu` removed, it took only 3 seconds.
> >
> >
> > ```
> > #include <stdio.h>
> > #include <string.h>
> > #include <unistd.h>
> > #include <sys/socket.h>
> > #include <arpa/inet.h>
> > #include <linux/netlink.h>
> > #include <linux/netfilter/nfnetlink.h>
> > #include <linux/netfilter/ipset/ip_set.h>
> >
> > unsigned char mbuffer[4096];
> >
> > int main() {
> >     int err;
> >     int sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_NETFILTER);
> >     if (sock < 0) {
> >         perror("Fail to create socket");
> >         return 1;
> >     }
> >     struct sockaddr_nl addr = {
> >         .nl_family = AF_NETLINK,
> >         .nl_pad = 0,
> >         .nl_pid = 0,
> >         .nl_groups = 0
> >     };
> >     struct sockaddr raddr = {0};
> >     socklen_t rsize = sizeof(raddr);
> >     int seq = 0x12345678;
> >     err = bind(sock, (struct sockaddr *)&addr, sizeof(addr));
> >     if (err) {
> >         perror("Fail to bind");
> >         return 1;
> >     }
> >     err = getsockname(sock, &raddr, &rsize);
> >     if (err) {
> >         perror("Fail to getsockname");
> >         return 1;
> >     }
> >     unsigned char buf[64];
> >     struct nlmsghdr *phdr;
> >     struct nfgenmsg *pnfg;
> >     struct nlattr *pnla;
> >     unsigned int total;
> >     ssize_t rz;
> >     struct iovec iovs;
> >     iovs.iov_base = mbuffer;
> >     iovs.iov_len = sizeof(mbuffer);
> >     struct msghdr msg = {0};
> >     msg.msg_name = &addr;
> >     msg.msg_namelen = sizeof(addr);
> >     msg.msg_iov = &iovs;
> >     msg.msg_iovlen = 1;
> >
> >     /* First message: query the ipset protocol version. */
> >     memset(buf, 0, sizeof(buf));
> >     total = 0;
> >     phdr = (struct nlmsghdr *)(buf + total);
> >     total += sizeof(struct nlmsghdr);
> >     phdr->nlmsg_type = NFNL_SUBSYS_IPSET << 8 | IPSET_CMD_PROTOCOL;
> >     phdr->nlmsg_seq = seq;
> >     phdr->nlmsg_flags = NLM_F_REQUEST;
> >     pnfg = (struct nfgenmsg *)(buf + total);
> >     total += sizeof(struct nfgenmsg);
> >     pnfg->nfgen_family = AF_INET;
> >     pnfg->version = NFNETLINK_V0;
> >     pnfg->res_id = htons(0);
> >     pnla = (struct nlattr *)(buf + total);
> >     pnla->nla_len = 5;
> >     pnla->nla_type = 1; /* IPSET_ATTR_PROTOCOL */
> >     buf[total + sizeof(struct nlattr)] = 0x06;
> >     total += 8;
> >     phdr->nlmsg_len = total;
> >     rz = sendto(sock, buf, total, 0, (struct sockaddr *)&addr, sizeof(addr));
> >     rz = recvmsg(sock, &msg, 0);
> >
> >     /* Append the two set names and turn the message into a SWAP request. */
> >     pnla = (struct nlattr *)(buf + total);
> >     pnla->nla_len = 8;
> >     pnla->nla_type = 2; /* IPSET_ATTR_SETNAME */
> >     char *p = (char *)(buf + total + sizeof(struct nlattr));
> >     p[0] = 'f'; p[1] = 'o'; p[2] = 'o'; p[3] = 0;
> >     total += 8;
> >     pnla = (struct nlattr *)(buf + total);
> >     pnla->nla_len = 8;
> >     pnla->nla_type = 3; /* IPSET_ATTR_SETNAME2 */
> >     p = (char *)(buf + total + sizeof(struct nlattr));
> >     p[0] = 'b'; p[1] = 'a'; p[2] = 'r'; p[3] = 0;
> >     total += 8;
> >     phdr->nlmsg_type = NFNL_SUBSYS_IPSET << 8 | IPSET_CMD_SWAP;
> >     phdr->nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
> >     phdr->nlmsg_len = total;
> >
> >     for (int i = 0; i < 10000; i++) {
> >         /* Stress: swap foo and bar, waiting for the ACK each time. */
> >         phdr->nlmsg_seq++;
> >         sendto(sock, buf, total, 0, (struct sockaddr *)&addr, sizeof(addr));
> >         recvmsg(sock, &msg, 0);
> >     }
> >
> >     close(sock);
> >     return 0;
> > }
> > ```
>
> Thanks, I'll look into it. The fix for the race condition between
> swap/destroy and kernel-side add/del/test went through several versions,
> penalizing either destroy or swap. In the end penalizing swap seemed the
> less intrusive option. I'm going to explore other possibilities.
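As an aside, the SWAP request the reproducer builds can be cross-checked in Python with struct.pack. The constants below are taken from the uapi headers (nfnetlink.h and ipset/ip_set.h) and are worth double-checking on your tree:

```python
import struct

# Assumed uapi constants: NFNL_SUBSYS_IPSET=6, IPSET_CMD_SWAP=6,
# IPSET_ATTR_PROTOCOL=1, IPSET_ATTR_SETNAME=2, IPSET_ATTR_SETNAME2=3.
NFNL_SUBSYS_IPSET = 6
IPSET_CMD_SWAP = 6
IPSET_ATTR_PROTOCOL, IPSET_ATTR_SETNAME, IPSET_ATTR_SETNAME2 = 1, 2, 3
NLM_F_REQUEST, NLM_F_ACK = 0x1, 0x4

def nlattr(nla_type, payload):
    # struct nlattr { u16 nla_len; u16 nla_type; } + payload, padded to 4.
    nla_len = 4 + len(payload)
    pad = (-nla_len) % 4
    return struct.pack("=HH", nla_len, nla_type) + payload + b"\x00" * pad

def swap_msg(seq, from_set, to_set):
    body = struct.pack("=BBH", 2, 0, 0)           # nfgenmsg: AF_INET, v0, res_id
    body += nlattr(IPSET_ATTR_PROTOCOL, b"\x06")  # protocol version 6
    body += nlattr(IPSET_ATTR_SETNAME, from_set + b"\x00")
    body += nlattr(IPSET_ATTR_SETNAME2, to_set + b"\x00")
    # nlmsghdr: u32 len, u16 type, u16 flags, u32 seq, u32 pid.
    hdr = struct.pack("=IHHII", 16 + len(body),
                      NFNL_SUBSYS_IPSET << 8 | IPSET_CMD_SWAP,
                      NLM_F_REQUEST | NLM_F_ACK, seq, 0)
    return hdr + body

print(len(swap_msg(0x12345678, b"foo", b"bar")))  # 44, matching 'total' in the C code
```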

Could you check whether the patch below fixes the performance regression?
Instead of waiting for the RCU grace period when swapping, call_rcu() is
used when destroying the set.

diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
index e8c350a3ade1..912f750d0bea 100644
--- a/include/linux/netfilter/ipset/ip_set.h
+++ b/include/linux/netfilter/ipset/ip_set.h
@@ -242,6 +242,8 @@ extern void ip_set_type_unregister(struct ip_set_type *set_type);

/* A generic IP set */
struct ip_set {
+ /* For call_rcu() in destroy */
+ struct rcu_head rcu;
/* The name of the set */
char name[IPSET_MAXNAMELEN];
/* Lock protecting the set data */
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 4c133e06be1d..cd95f75dd720 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -1009,6 +1009,14 @@ find_set(struct ip_set_net *inst, const char *name)
return find_set_and_id(inst, name, &id);
}

+static void
+ip_set_destroy_set_rcu(struct rcu_head *head)
+{
+ struct ip_set *set = container_of(head, struct ip_set, rcu);
+
+ ip_set_destroy_set(set);
+}
+
static int
find_free_id(struct ip_set_net *inst, const char *name, ip_set_id_t *index,
struct ip_set **set)
@@ -1241,7 +1249,7 @@ static int ip_set_destroy(struct sk_buff *skb, const struct nfnl_info *info,
ip_set(inst, i) = NULL;
read_unlock_bh(&ip_set_ref_lock);

- ip_set_destroy_set(s);
+ call_rcu(&s->rcu, ip_set_destroy_set_rcu);
}
return 0;
out:
@@ -1394,9 +1402,6 @@ static int ip_set_swap(struct sk_buff *skb, const struct nfnl_info *info,
ip_set(inst, to_id) = from;
write_unlock_bh(&ip_set_ref_lock);

- /* Make sure all readers of the old set pointers are completed. */
- synchronize_rcu();
-
return 0;
}


Best regards,
Jozsef
--
E-mail : [email protected], [email protected]
PGP key : https://wigner.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics
H-1525 Budapest 114, POB. 49, Hungary