2021-05-12 22:39:45

by Dexuan Cui

[permalink] [raw]
Subject: netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN

Hi,
I'm debugging an iptables-restore failure, which happens about 5% of the
time when I keep stopping and starting the Linux VM. The VM has only 1
CPU, and kernel version is 4.15.0-1098-azure, but I suspect the issue may
also exist in the mainline Linux kernel.

When the failure happens, it's always caused by line 27 of the rule file:

1 # Generated by iptables-save v1.6.0 on Fri Apr 23 09:22:59 2021
2 *raw
3 :PREROUTING ACCEPT [0:0]
4 :OUTPUT ACCEPT [0:0]
5 -A PREROUTING ! -s 168.63.129.16/32 -p tcp -j NOTRACK
6 -A OUTPUT ! -d 168.63.129.16/32 -p tcp -j NOTRACK
7 COMMIT
8 # Completed on Fri Apr 23 09:22:59 2021
9 # Generated by iptables-save v1.6.0 on Fri Apr 23 09:22:59 2021
10 *filter
11 :INPUT ACCEPT [2407:79190058]
12 :FORWARD ACCEPT [0:0]
13 :OUTPUT ACCEPT [1648:2190051]
14 -A OUTPUT -d 169.254.169.254/32 -m owner --uid-owner 33 -j DROP
15 COMMIT
16 # Completed on Fri Apr 23 09:22:59 2021
17 # Generated by iptables-save v1.6.0 on Fri Apr 23 09:22:59 2021
18 *security
19 :INPUT ACCEPT [2345:79155398]
20 :FORWARD ACCEPT [0:0]
21 :OUTPUT ACCEPT [1504:2129015]
22 -A OUTPUT -d 168.63.129.16/32 -p tcp -m owner --uid-owner 0 -j ACCEPT
23 -A OUTPUT -d 168.63.129.16/32 -p tcp -m conntrack --ctstate INVALID,NEW -j DROP
24 -A OUTPUT -d 168.63.129.16/32 -p tcp -m owner --uid-owner 0 -j ACCEPT
25 -A OUTPUT -d 168.63.129.16/32 -p tcp -m conntrack --ctstate INVALID,NEW -j DROP
26 -A OUTPUT -d 168.63.129.16/32 -p tcp -m conntrack --ctstate INVALID,NEW -j DROP
27 COMMIT

The related part of the strace log is:

1 socket(PF_INET, SOCK_RAW, IPPROTO_RAW) = 3
2 getsockopt(3, SOL_IP, IPT_SO_GET_INFO, "security\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., [84]) = 0
3 getsockopt(3, SOL_IP, IPT_SO_GET_ENTRIES, "security\0\357B\16Z\177\0\0Pg\355\0\0\0\0\0Pg\355\0\0\0\0\0"..., [880]) = 0
4 setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2200) = -1 EAGAIN (Resource temporarily unavailable)
5 close(3) = 0
6 write(2, "iptables-restore: line 27 failed"..., 33) = 33

The -EAGAIN error comes from line 1240 of xt_replace_table():

do_ipt_set_ctl
do_replace
__do_replace
xt_replace_table

1216 xt_replace_table(struct xt_table *table,
1217 unsigned int num_counters,
1218 struct xt_table_info *newinfo,
1219 int *error)
1220 {
1221 struct xt_table_info *private;
1222 unsigned int cpu;
1223 int ret;
1224
1225 ret = xt_jumpstack_alloc(newinfo);
1226 if (ret < 0) {
1227 *error = ret;
1228 return NULL;
1229 }
1230
1231 /* Do the substitution. */
1232 local_bh_disable();
1233 private = table->private;
1234
1235 /* Check inside lock: is the old number correct? */
1236 if (num_counters != private->number) {
1237 pr_debug("num_counters != table->private->number (%u/%u)\n",
1238 num_counters, private->number);
1239 local_bh_enable();
1240 *error = -EAGAIN;
1241 return NULL;
1242 }

When the function returns -EAGAIN, the 'num_counters' is 5 while
'private->number' is 6.

If I re-run the iptables-restore program upon the failure, the program
will succeed.

I checked the function xt_replace_table() in the recent mainline kernel and it
looks like the function is the same.

It looks like there is a race condition between iptables-restore calls
getsockopt() to get the number of table entries and iptables call
setsockopt() to replace the entries? Looks like some other program is
concurrently calling getsockopt()/setsockopt() -- but it looks like this is
not the case according to the messages I print via trace_printk() around
do_replace() in do_ipt_set_ctl(): when the -EAGAIN error happens, there is
no other program calling do_replace(); the table entry number was changed
to 5 by another program 'iptables' about 1.3 milliseconds ago, and then
this program 'iptables-restore' calls setsockopt() and the kernel sees
'num_counters' being 5 and the 'private->number' being 6 (how can this
happen??); the next setsockopt() call for the same 'security' table
happens in about 1 minute with both the numbers being 6.

Can you please shed some light on the issue? Thanks!

BTW, iptables does have a retry mechanism for getsockopt():
2f93205b375e ("Retry ruleset dump when kernel returns EAGAIN.")
(https://git.netfilter.org/iptables/commit/libiptc?id=2f93205b375e&context=10&ignorews=0&dt=0)

But it looks like this is enough? e.g. here getsockopt() returns 0, but
setsockopt() returns -EAGAIN.

Thanks,
Dexuan


2021-05-13 04:20:54

by Dexuan Cui

[permalink] [raw]
Subject: RE: netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN

> From: Dexuan Cui
> Sent: Wednesday, May 12, 2021 3:17 PM
> ...
> It looks like there is a race condition between iptables-restore calls
> getsockopt() to get the number of table entries and iptables call
> setsockopt() to replace the entries? Looks like some other program is
> concurrently calling getsockopt()/setsockopt() -- but it looks like this is
> not the case according to the messages I print via trace_printk() around
> do_replace() in do_ipt_set_ctl(): when the -EAGAIN error happens, there is
> no other program calling do_replace(); the table entry number was changed
> to 5 by another program 'iptables' about 1.3 milliseconds ago, and then
> this program 'iptables-restore' calls setsockopt() and the kernel sees
> 'num_counters' being 5 and the 'private->number' being 6 (how can this
> happen??); the next setsockopt() call for the same 'security' table
> happens in about 1 minute with both the numbers being 6.

Well, after tracing the code more closely, it turns out that the program
'iptables' indeed concurrently changes the number from 5 to 6, and this
causes the 'iptable-restores' program to get EAGAIN:

1. iptables-906 (906 is the process ID) calls IPT_SO_GET_ENTRIES and the
current num_entries is 4; the process calls IPT_SO_SET_REPLACE and the
private->number becomes 5.

2. iptables-restor-895 calls IPT_SO_GET_ENTRIES, and gets num_entries==5.

3. iptables-906 calls IPT_SO_GET_ENTRIES again, and also gets num_entries==5;
the process calls IPT_SO_SET_REPLACE and the private->number becomes 6.

4. iptables-restor-895 calls IPT_SO_SET_REPLACE and the kernel function
xt_replace_table() returns -EAGAIN due to num_counters == 5 and
private->number == 6.

I think the latest mainline kernel should also have the same race.
It looks like this by-design race exists since day one?

> BTW, iptables does have a retry mechanism for getsockopt():
> 2f93205b375e ("Retry ruleset dump when kernel returns EAGAIN.")
> (https://git.netfilter.org/iptables/commit/libiptc?id=2f93205b375e&context=10
> &ignorews=0&dt=0)
>
> But it looks like this is enough? e.g. here getsockopt() returns 0, but
> setsockopt() returns -EAGAIN.

It looks like we need to add a retry mechanism to the iptables-restore
program: iptables/iptables-restore.c: ip46tables_restore_main():

if cb->ops->commit(handle) -> ... -> setsockopt(...,IPT_SO_SET_REPLACE, ...)
fails due to EAGAIN, it should start over from the very begining, i.e. call
create_handle() -> handle = cb->ops->init(tablename) again to get the new
num_entries, and retry the commit op. But I'm not sure how to easily
re-create the context associated with the old handle (should/can it re-parse
the rules?), as I'm not really familiar with iptables.

Or, is it possible to fix the race condition from the netfilter module?

Thanks,
-- Dexuan

2021-05-13 06:03:32

by Dexuan Cui

[permalink] [raw]
Subject: RE: netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN

> From: Dexuan Cui
> Sent: Wednesday, May 12, 2021 9:19 PM
> ...
> I think the latest mainline kernel should also have the same race.
> It looks like this by-design race exists since day one?

I indeed reproduced the issue with the latest stable tree (v5.12.3) as well.

> > BTW, iptables does have a retry mechanism for getsockopt():
> > 2f93205b375e ("Retry ruleset dump when kernel returns EAGAIN.")
> >
> (https://git.netfilter.org/iptables/commit/libiptc?id=2f93205b375e&context=10
> > &ignorews=0&dt=0)
> >
> > But it looks like this is enough?
I missed a "not". IMO 2f93205b375e is not enough.

Thanks,
-- Dexuan

2021-05-13 06:21:20

by Dexuan Cui

[permalink] [raw]
Subject: RE: netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN

> From: Dexuan Cui
> Sent: Wednesday, May 12, 2021 11:02 PM

BTW, I found a similar report in 2019:

"
https://serverfault.com/questions/101022/error-applying-iptables-rules-using-iptables-restore
I stumbled upon this issue on Ubuntu 18.04. The netfilter-persistent
service failed randomly on boot while working ok when launched manually.
Turned out it was conflicting with sshguard service due to systemd trying
to load everything in parallel. What helped is to setting
ENABLE_FIREWALL=0 in /etc/default/sshguard and then adding sshguard chain
and rule manually to /etc/iptables/rules.v4 and /etc/iptables/rules.v6.
"

The above report provided a workaround.
I think we need a real fix.

Thanks,
-- Dexuan

2021-05-13 10:10:48

by Pablo Neira Ayuso

[permalink] [raw]
Subject: Re: netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN

On Thu, May 13, 2021 at 06:19:38AM +0000, Dexuan Cui wrote:
> > From: Dexuan Cui
> > Sent: Wednesday, May 12, 2021 11:02 PM
>
> BTW, I found a similar report in 2019:
>
> "
> https://serverfault.com/questions/101022/error-applying-iptables-rules-using-iptables-restore
> I stumbled upon this issue on Ubuntu 18.04. The netfilter-persistent
> service failed randomly on boot while working ok when launched manually.
> Turned out it was conflicting with sshguard service due to systemd trying
> to load everything in parallel. What helped is to setting
> ENABLE_FIREWALL=0 in /etc/default/sshguard and then adding sshguard chain
> and rule manually to /etc/iptables/rules.v4 and /etc/iptables/rules.v6.
> "
>
> The above report provided a workaround.

There's -w and -W to serialize ruleset updates. You could follow a
similar approach from userspace if you don't use iptables userspace
binary.

> I think we need a real fix.

iptables-nft already fixes this.

2021-05-13 17:10:31

by Phil Sutter

[permalink] [raw]
Subject: Re: netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN

Hi,

On Thu, May 13, 2021 at 11:40:47AM +0200, Pablo Neira Ayuso wrote:
> On Thu, May 13, 2021 at 06:19:38AM +0000, Dexuan Cui wrote:
> > > From: Dexuan Cui
> > > Sent: Wednesday, May 12, 2021 11:02 PM
> >
> > BTW, I found a similar report in 2019:
> >
> > "
> > https://serverfault.com/questions/101022/error-applying-iptables-rules-using-iptables-restore
> > I stumbled upon this issue on Ubuntu 18.04. The netfilter-persistent
> > service failed randomly on boot while working ok when launched manually.
> > Turned out it was conflicting with sshguard service due to systemd trying
> > to load everything in parallel. What helped is to setting
> > ENABLE_FIREWALL=0 in /etc/default/sshguard and then adding sshguard chain
> > and rule manually to /etc/iptables/rules.v4 and /etc/iptables/rules.v6.
> > "
> >
> > The above report provided a workaround.
>
> There's -w and -W to serialize ruleset updates. You could follow a
> similar approach from userspace if you don't use iptables userspace
> binary.

My guess is the xtables lock is not effective here, so waiting for it
probably won't help.

Dexuan, concurrent access is avoided in user space using a file-based
lock. So if multiple iptables(-restore) processes run in different
mount-namespaces, they might miss the other's /run/xtables.lock. Another
option would be if libiptc is used instead of calling iptables, but
that's more a shot in the dark - I don't know if libiptc doesn't support
obtaining the xtables lock.

> > I think we need a real fix.
>
> iptables-nft already fixes this.

nftables (and therefore iptables-nft) implement transactional logic in
kernel, user space automatically retries if a transaction's commit
fails.

Cheers, Phil

2021-05-14 00:33:31

by Dexuan Cui

[permalink] [raw]
Subject: RE: netfilter: iptables-restore: setsockopt(3, SOL_IP, IPT_SO_SET_REPLACE, "security...", ...) return -EAGAIN

> From: [email protected] <[email protected]> On Behalf Of Phil Sutter
> Sent: Thursday, May 13, 2021 10:08 AM
> >
> > There's -w and -W to serialize ruleset updates. You could follow a
> > similar approach from userspace if you don't use iptables userspace
> > binary.
>
> My guess is the xtables lock is not effective here, so waiting for it
> probably won't help.

Here iptables-restore v1.6.0 is used, and it does not support the -w and -W
options. :-)

New iptables-restore versions, e.g. 1.8.4-3ubuntu2, do support the -w/-W
options.

> Dexuan, concurrent access is avoided in user space using a file-based
> lock. So if multiple iptables(-restore) processes run in different
> mount-namespaces, they might miss the other's /run/xtables.lock. Another
> option would be if libiptc is used instead of calling iptables, but
> that's more a shot in the dark - I don't know if libiptc doesn't support
> obtaining the xtables lock.
>
> > > I think we need a real fix.
> >
> > iptables-nft already fixes this.
>
> nftables (and therefore iptables-nft) implement transactional logic in
> kernel, user space automatically retries if a transaction's commit
> fails.
>
> Cheers, Phil

Good to know. Thanks for the explanation!

It sounds like I need to either migrate to iptables-nft/nft or use a retry
workarouond (if iptables-restore-legacy fails due to EAGAIN).

Thanks,
-- Dexuan