2009-03-29 20:09:38

by Roman Mindalev

[permalink] [raw]
Subject: Regression caused by commit "netfilter: iptables: lock free counters"

Hi.
Since 2.6.29-git1 I got many debug messages in kernel log.
One from them:

Mar 29 12:06:11 desktop kernel: [ 917.414369] BUG: using
smp_processor_id() in preemptible [00000000] code: hddtemp/2820 Mar 29
12:06:11 desktop kernel: [ 917.414378] caller is
ipt_do_table+0xd0/0x3a8 [ip_tables] Mar 29 12:06:11 desktop kernel:
[ 917.414382] Pid: 2820, comm: hddtemp Not tainted
2.6.29-05093-gc31f403 #1 Mar 29 12:06:11 desktop kernel: [ 917.414386]
Call Trace: Mar 29 12:06:11 desktop kernel: [ 917.414393]
[<ffffffff8042d037>] debug_smp_processor_id+0xe7/0xf0 Mar 29 12:06:11
desktop kernel: [ 917.414400] [<ffffffffa0171f10>]
ipt_do_table+0xd0/0x3a8 [ip_tables] Mar 29 12:06:11 desktop kernel:
[ 917.414414] [<ffffffffa0180d0e>] ? nf_conntrack_in+0x23e/0x610
[nf_conntrack] Mar 29 12:06:11 desktop kernel: [ 917.414422]
[<ffffffff8026d47d>] ? trace_hardirqs_off+0xd/0x10 Mar 29 12:06:11
desktop kernel: [ 917.414429] [<ffffffffa008e0b3>]
ipt_local_out_hook+0x53/0x60 [iptable_filter] Mar 29 12:06:11 desktop
kernel: [ 917.414435] [<ffffffff80546a07>] nf_iterate+0x67/0xa0 Mar
29 12:06:11 desktop kernel: [ 917.414441] [<ffffffff80551140>] ?
dst_output+0x0/0x10 Mar 29 12:06:11 desktop kernel: [ 917.414446]
[<ffffffff80546af5>] nf_hook_slow+0xb5/0x100 Mar 29 12:06:11 desktop
kernel: [ 917.414452] [<ffffffff80551140>] ? dst_output+0x0/0x10 Mar
29 12:06:11 desktop kernel: [ 917.414458] [<ffffffff80552f7f>]
__ip_local_out+0x9f/0xb0 Mar 29 12:06:11 desktop kernel:
[ 917.414464] [<ffffffff8058c4db>] ? bictcp_cong_avoid+0x4b/0x2d0 Mar
29 12:06:11 desktop kernel: [ 917.414469] [<ffffffff80552fa1>]
ip_local_out+0x11/0x30 Mar 29 12:06:11 desktop kernel: [ 917.414474]
[<ffffffff8055389c>] ip_queue_xmit+0x21c/0x410 Mar 29 12:06:11 desktop
kernel: [ 917.414480] [<ffffffff80272825>] ?
print_lock_contention_bug+0x25/0x120 Mar 29 12:06:11 desktop kernel:
[ 917.414486] [<ffffffff8056810e>] ?
tcp_established_options+0x3e/0xe0 Mar 29 12:06:11 desktop kernel:
[ 917.414492] [<ffffffff80272825>] ?
print_lock_contention_bug+0x25/0x120 Mar 29 12:06:11 desktop kernel:
[ 917.414498] [<ffffffff80568ef5>] tcp_transmit_skb+0x535/0x820 Mar
29 12:06:11 desktop kernel: [ 917.414504] [<ffffffff8056b5fb>]
tcp_write_xmit+0x20b/0xaa0 Mar 29 12:06:11 desktop kernel:
[ 917.414510] [<ffffffff8055d451>] ? sk_stream_alloc_skb+0x41/0x110
Mar 29 12:06:11 desktop kernel: [ 917.414515] [<ffffffff8052799e>] ?
__alloc_skb+0x6e/0x150 Mar 29 12:06:11 desktop kernel: [ 917.414521]
[<ffffffff8056befb>] __tcp_push_pending_frames+0x2b/0x90 Mar 29
12:06:11 desktop kernel: [ 917.414527] [<ffffffff8055ddc3>]
tcp_sendmsg+0x8a3/0xbb0 Mar 29 12:06:11 desktop kernel: [ 917.414533]
[<ffffffff8051d2bb>] sock_aio_write+0x12b/0x140 Mar 29 12:06:11 desktop
kernel: [ 917.414539] [<ffffffff802df6f1>] do_sync_write+0xf1/0x140
Mar 29 12:06:11 desktop kernel: [ 917.414545] [<ffffffff8025e470>] ?
autoremove_wake_function+0x0/0x40 Mar 29 12:06:11 desktop kernel:
[ 917.414551] [<ffffffff802e0259>] vfs_write+0x179/0x190 Mar 29
12:06:11 desktop kernel: [ 917.414557] [<ffffffff802e0360>]
sys_write+0x50/0x90 Mar 29 12:06:11 desktop kernel: [ 917.414563]
[<ffffffff8020b51b>] system_call_fastpath+0x16/0x1b

Result of the bisection:

784544739a25c30637397ace5489eeb6e15d7d49 is first bad commit
commit 784544739a25c30637397ace5489eeb6e15d7d49
Author: Stephen Hemminger <[email protected]>
Date: Fri Feb 20 10:35:32 2009 +0100

netfilter: iptables: lock free counters

The reader/writer lock in ip_tables is acquired in the critical
path of processing packets and is one of the reasons just loading
iptables can cause a 20% performance loss. The rwlock serves two
functions:

1) it prevents changes to table state (xt_replace) while table is
in use. This is now handled by doing rcu on the xt_table. When
table is replaced, the new table(s) are put in and the old one
table(s) are freed after RCU period.

2) it provides synchronization when accesing the counter values.
This is now handled by swapping in new table_info entries for
each cpu then summing the old values, and putting the result back
onto one cpu. On a busy system it may cause sampling to occur at
different times on each cpu, but no packet/byte counts are lost in
the process.

Signed-off-by: Stephen Hemminger <[email protected]>

Sucessfully tested on my dual quad core machine too, but iptables
only (no ipv6 here) BTW, my new "tbench 8" result is 2450 MB/s, (it
was 2150 MB/s not so long ago)

Acked-by: Eric Dumazet <[email protected]>
Signed-off-by: Patrick McHardy <[email protected]>

:040000 040000 67ef5761d009330e8d538dd1a3aff77c86995172
0d8d6398ced2117bb6995b0c414fd8013685324e M include :040000 040000
0317c745a0722ade9a6c0cbe9d6cb92abbd40efa
0eb195a365164c05e7c764ab1c2a7da891276837 M net

After reverting commit no more these messages.

This issue in bugzilla: http://bugzilla.kernel.org/show_bug.cgi?id=12960


Attachments:
(No filename) (4.98 kB)
config.gz (12.46 kB)
Download all attachments

2009-03-30 07:33:18

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression caused by commit "netfilter: iptables: lock free counters"

Roman Mindalev a ?crit :
> Hi.
> Since 2.6.29-git1 I got many debug messages in kernel log.
> One from them:
>
> Mar 29 12:06:11 desktop kernel: [ 917.414369] BUG: using
> smp_processor_id() in preemptible [00000000] code: hddtemp/2820 Mar 29
> 12:06:11 desktop kernel: [ 917.414378] caller is
> ipt_do_table+0xd0/0x3a8 [ip_tables] Mar 29 12:06:11 desktop kernel:
> [ 917.414382] Pid: 2820, comm: hddtemp Not tainted
> 2.6.29-05093-gc31f403 #1 Mar 29 12:06:11 desktop kernel: [ 917.414386]
> Call Trace: Mar 29 12:06:11 desktop kernel: [ 917.414393]
> [<ffffffff8042d037>] debug_smp_processor_id+0xe7/0xf0 Mar 29 12:06:11
> desktop kernel: [ 917.414400] [<ffffffffa0171f10>]
> ipt_do_table+0xd0/0x3a8 [ip_tables] Mar 29 12:06:11 desktop kernel:
> [ 917.414414] [<ffffffffa0180d0e>] ? nf_conntrack_in+0x23e/0x610
> [nf_conntrack] Mar 29 12:06:11 desktop kernel: [ 917.414422]
> [<ffffffff8026d47d>] ? trace_hardirqs_off+0xd/0x10 Mar 29 12:06:11
> desktop kernel: [ 917.414429] [<ffffffffa008e0b3>]
> ipt_local_out_hook+0x53/0x60 [iptable_filter] Mar 29 12:06:11 desktop
> kernel: [ 917.414435] [<ffffffff80546a07>] nf_iterate+0x67/0xa0 Mar
> 29 12:06:11 desktop kernel: [ 917.414441] [<ffffffff80551140>] ?
> dst_output+0x0/0x10 Mar 29 12:06:11 desktop kernel: [ 917.414446]
> [<ffffffff80546af5>] nf_hook_slow+0xb5/0x100 Mar 29 12:06:11 desktop
> kernel: [ 917.414452] [<ffffffff80551140>] ? dst_output+0x0/0x10 Mar
> 29 12:06:11 desktop kernel: [ 917.414458] [<ffffffff80552f7f>]
> __ip_local_out+0x9f/0xb0 Mar 29 12:06:11 desktop kernel:
> [ 917.414464] [<ffffffff8058c4db>] ? bictcp_cong_avoid+0x4b/0x2d0 Mar
> 29 12:06:11 desktop kernel: [ 917.414469] [<ffffffff80552fa1>]
> ip_local_out+0x11/0x30 Mar 29 12:06:11 desktop kernel: [ 917.414474]
> [<ffffffff8055389c>] ip_queue_xmit+0x21c/0x410 Mar 29 12:06:11 desktop
> kernel: [ 917.414480] [<ffffffff80272825>] ?
> print_lock_contention_bug+0x25/0x120 Mar 29 12:06:11 desktop kernel:
> [ 917.414486] [<ffffffff8056810e>] ?
> tcp_established_options+0x3e/0xe0 Mar 29 12:06:11 desktop kernel:
> [ 917.414492] [<ffffffff80272825>] ?
> print_lock_contention_bug+0x25/0x120 Mar 29 12:06:11 desktop kernel:
> [ 917.414498] [<ffffffff80568ef5>] tcp_transmit_skb+0x535/0x820 Mar
> 29 12:06:11 desktop kernel: [ 917.414504] [<ffffffff8056b5fb>]
> tcp_write_xmit+0x20b/0xaa0 Mar 29 12:06:11 desktop kernel:
> [ 917.414510] [<ffffffff8055d451>] ? sk_stream_alloc_skb+0x41/0x110
> Mar 29 12:06:11 desktop kernel: [ 917.414515] [<ffffffff8052799e>] ?
> __alloc_skb+0x6e/0x150 Mar 29 12:06:11 desktop kernel: [ 917.414521]
> [<ffffffff8056befb>] __tcp_push_pending_frames+0x2b/0x90 Mar 29
> 12:06:11 desktop kernel: [ 917.414527] [<ffffffff8055ddc3>]
> tcp_sendmsg+0x8a3/0xbb0 Mar 29 12:06:11 desktop kernel: [ 917.414533]
> [<ffffffff8051d2bb>] sock_aio_write+0x12b/0x140 Mar 29 12:06:11 desktop
> kernel: [ 917.414539] [<ffffffff802df6f1>] do_sync_write+0xf1/0x140
> Mar 29 12:06:11 desktop kernel: [ 917.414545] [<ffffffff8025e470>] ?
> autoremove_wake_function+0x0/0x40 Mar 29 12:06:11 desktop kernel:
> [ 917.414551] [<ffffffff802e0259>] vfs_write+0x179/0x190 Mar 29
> 12:06:11 desktop kernel: [ 917.414557] [<ffffffff802e0360>]
> sys_write+0x50/0x90 Mar 29 12:06:11 desktop kernel: [ 917.414563]
> [<ffffffff8020b51b>] system_call_fastpath+0x16/0x1b
>
> Result of the bisection:
>
> 784544739a25c30637397ace5489eeb6e15d7d49 is first bad commit
> commit 784544739a25c30637397ace5489eeb6e15d7d49
> Author: Stephen Hemminger <[email protected]>
> Date: Fri Feb 20 10:35:32 2009 +0100
>
> netfilter: iptables: lock free counters
>
> The reader/writer lock in ip_tables is acquired in the critical
> path of processing packets and is one of the reasons just loading
> iptables can cause a 20% performance loss. The rwlock serves two
> functions:
>
> 1) it prevents changes to table state (xt_replace) while table is
> in use. This is now handled by doing rcu on the xt_table. When
> table is replaced, the new table(s) are put in and the old one
> table(s) are freed after RCU period.
>
> 2) it provides synchronization when accesing the counter values.
> This is now handled by swapping in new table_info entries for
> each cpu then summing the old values, and putting the result back
> onto one cpu. On a busy system it may cause sampling to occur at
> different times on each cpu, but no packet/byte counts are lost in
> the process.
>
> Signed-off-by: Stephen Hemminger <[email protected]>
>
> Sucessfully tested on my dual quad core machine too, but iptables
> only (no ipv6 here) BTW, my new "tbench 8" result is 2450 MB/s, (it
> was 2150 MB/s not so long ago)
>
> Acked-by: Eric Dumazet <[email protected]>
> Signed-off-by: Patrick McHardy <[email protected]>
>
> :040000 040000 67ef5761d009330e8d538dd1a3aff77c86995172
> 0d8d6398ced2117bb6995b0c414fd8013685324e M include :040000 040000
> 0317c745a0722ade9a6c0cbe9d6cb92abbd40efa
> 0eb195a365164c05e7c764ab1c2a7da891276837 M net
>
> After reverting commit no more these messages.
>
> This issue in bugzilla: http://bugzilla.kernel.org/show_bug.cgi?id=12960

Thanks Roman for this excellent report

Could you try following patch ?

[PATCH] netfilter: use rcu_read_bh() in ipt_do_table()

Commit 784544739a25c30637397ace5489eeb6e15d7d49
(netfilter: iptables: lock free counters) forgot to disable BH
in arpt_do_table(), ipt_do_table() and ip6t_do_table()

Use rcu_read_lock_bh() instead of rcu_read_lock() cures the problem.

Reported-and-bisected-by: Roman Mindalev <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
---

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 35c5f6a..5ba533d 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -253,7 +253,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,
indev = in ? in->name : nulldevname;
outdev = out ? out->name : nulldevname;

- rcu_read_lock();
+ rcu_read_lock_bh();
private = rcu_dereference(table->private);
table_base = rcu_dereference(private->entries[smp_processor_id()]);

@@ -329,7 +329,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,
}
} while (!hotdrop);

- rcu_read_unlock();
+ rcu_read_unlock_bh();

if (hotdrop)
return NF_DROP;
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 82ee7c9..810c0b6 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -339,7 +339,7 @@ ipt_do_table(struct sk_buff *skb,

IP_NF_ASSERT(table->valid_hooks & (1 << hook));

- rcu_read_lock();
+ rcu_read_lock_bh();
private = rcu_dereference(table->private);
table_base = rcu_dereference(private->entries[smp_processor_id()]);

@@ -437,7 +437,7 @@ ipt_do_table(struct sk_buff *skb,
}
} while (!hotdrop);

- rcu_read_unlock();
+ rcu_read_unlock_bh();

#ifdef DEBUG_ALLOW_ALL
return NF_ACCEPT;
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index e89cfa3..dfed176 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -365,7 +365,7 @@ ip6t_do_table(struct sk_buff *skb,

IP_NF_ASSERT(table->valid_hooks & (1 << hook));

- rcu_read_lock();
+ rcu_read_lock_bh();
private = rcu_dereference(table->private);
table_base = rcu_dereference(private->entries[smp_processor_id()]);

@@ -466,7 +466,7 @@ ip6t_do_table(struct sk_buff *skb,
#ifdef CONFIG_NETFILTER_DEBUG
((struct ip6t_entry *)table_base)->comefrom = NETFILTER_LINK_POISON;
#endif
- rcu_read_unlock();
+ rcu_read_unlock_bh();

#ifdef DEBUG_ALLOW_ALL
return NF_ACCEPT;

2009-03-30 10:51:22

by Roman Mindalev

[permalink] [raw]
Subject: Re: Regression caused by commit "netfilter: iptables: lock free counters"

On Mon, 30 Mar 2009 09:32:51 +0200
Eric Dumazet <[email protected]> wrote:

> Roman Mindalev a écrit :
> > Hi.
> > Since 2.6.29-git1 I got many debug messages in kernel log.
> > One from them:
> >
> > Mar 29 12:06:11 desktop kernel: [ 917.414369] BUG: using
> > smp_processor_id() in preemptible [00000000] code: hddtemp/2820 Mar
> > 29 12:06:11 desktop kernel: [ 917.414378] caller is
> > ipt_do_table+0xd0/0x3a8 [ip_tables] Mar 29 12:06:11 desktop kernel:
> > [ 917.414382] Pid: 2820, comm: hddtemp Not tainted
> > 2.6.29-05093-gc31f403 #1 Mar 29 12:06:11 desktop kernel:
> > [ 917.414386] Call Trace: Mar 29 12:06:11 desktop kernel:
> > [ 917.414393] [<ffffffff8042d037>]
> > debug_smp_processor_id+0xe7/0xf0 Mar 29 12:06:11 desktop kernel:
> > [ 917.414400] [<ffffffffa0171f10>] ipt_do_table+0xd0/0x3a8
> > [ip_tables] Mar 29 12:06:11 desktop kernel: [ 917.414414]
> > [<ffffffffa0180d0e>] ? nf_conntrack_in+0x23e/0x610 [nf_conntrack]
> > Mar 29 12:06:11 desktop kernel: [ 917.414422]
> > [<ffffffff8026d47d>] ? trace_hardirqs_off+0xd/0x10 Mar 29 12:06:11
> > desktop kernel: [ 917.414429] [<ffffffffa008e0b3>]
> > ipt_local_out_hook+0x53/0x60 [iptable_filter] Mar 29 12:06:11
> > desktop kernel: [ 917.414435] [<ffffffff80546a07>]
> > nf_iterate+0x67/0xa0 Mar 29 12:06:11 desktop kernel:
> > [ 917.414441] [<ffffffff80551140>] ? dst_output+0x0/0x10 Mar 29
> > 12:06:11 desktop kernel: [ 917.414446] [<ffffffff80546af5>]
> > nf_hook_slow+0xb5/0x100 Mar 29 12:06:11 desktop kernel:
> > [ 917.414452] [<ffffffff80551140>] ? dst_output+0x0/0x10 Mar 29
> > 12:06:11 desktop kernel: [ 917.414458] [<ffffffff80552f7f>]
> > __ip_local_out+0x9f/0xb0 Mar 29 12:06:11 desktop kernel:
> > [ 917.414464] [<ffffffff8058c4db>] ? bictcp_cong_avoid+0x4b/0x2d0
> > Mar 29 12:06:11 desktop kernel: [ 917.414469]
> > [<ffffffff80552fa1>] ip_local_out+0x11/0x30 Mar 29 12:06:11 desktop
> > kernel: [ 917.414474] [<ffffffff8055389c>]
> > ip_queue_xmit+0x21c/0x410 Mar 29 12:06:11 desktop kernel:
> > [ 917.414480] [<ffffffff80272825>] ?
> > print_lock_contention_bug+0x25/0x120 Mar 29 12:06:11 desktop
> > kernel: [ 917.414486] [<ffffffff8056810e>] ?
> > tcp_established_options+0x3e/0xe0 Mar 29 12:06:11 desktop kernel:
> > [ 917.414492] [<ffffffff80272825>] ?
> > print_lock_contention_bug+0x25/0x120 Mar 29 12:06:11 desktop
> > kernel: [ 917.414498] [<ffffffff80568ef5>]
> > tcp_transmit_skb+0x535/0x820 Mar 29 12:06:11 desktop kernel:
> > [ 917.414504] [<ffffffff8056b5fb>] tcp_write_xmit+0x20b/0xaa0 Mar
> > 29 12:06:11 desktop kernel: [ 917.414510] [<ffffffff8055d451>] ?
> > sk_stream_alloc_skb+0x41/0x110 Mar 29 12:06:11 desktop kernel:
> > [ 917.414515] [<ffffffff8052799e>] ? __alloc_skb+0x6e/0x150 Mar
> > 29 12:06:11 desktop kernel: [ 917.414521] [<ffffffff8056befb>]
> > __tcp_push_pending_frames+0x2b/0x90 Mar 29 12:06:11 desktop kernel:
> > [ 917.414527] [<ffffffff8055ddc3>] tcp_sendmsg+0x8a3/0xbb0 Mar 29
> > 12:06:11 desktop kernel: [ 917.414533] [<ffffffff8051d2bb>]
> > sock_aio_write+0x12b/0x140 Mar 29 12:06:11 desktop kernel:
> > [ 917.414539] [<ffffffff802df6f1>] do_sync_write+0xf1/0x140 Mar
> > 29 12:06:11 desktop kernel: [ 917.414545] [<ffffffff8025e470>] ?
> > autoremove_wake_function+0x0/0x40 Mar 29 12:06:11 desktop kernel:
> > [ 917.414551] [<ffffffff802e0259>] vfs_write+0x179/0x190 Mar 29
> > 12:06:11 desktop kernel: [ 917.414557] [<ffffffff802e0360>]
> > sys_write+0x50/0x90 Mar 29 12:06:11 desktop kernel: [ 917.414563]
> > [<ffffffff8020b51b>] system_call_fastpath+0x16/0x1b
> >
> > Result of the bisection:
> >
> > 784544739a25c30637397ace5489eeb6e15d7d49 is first bad commit
> > commit 784544739a25c30637397ace5489eeb6e15d7d49
> > Author: Stephen Hemminger <[email protected]>
> > Date: Fri Feb 20 10:35:32 2009 +0100
> >
> > netfilter: iptables: lock free counters
> >
> > The reader/writer lock in ip_tables is acquired in the critical
> > path of processing packets and is one of the reasons just
> > loading iptables can cause a 20% performance loss. The rwlock
> > serves two functions:
> >
> > 1) it prevents changes to table state (xt_replace) while table
> > is in use. This is now handled by doing rcu on the xt_table. When
> > table is replaced, the new table(s) are put in and the old one
> > table(s) are freed after RCU period.
> >
> > 2) it provides synchronization when accesing the counter values.
> > This is now handled by swapping in new table_info entries for
> > each cpu then summing the old values, and putting the result
> > back onto one cpu. On a busy system it may cause sampling to occur
> > at different times on each cpu, but no packet/byte counts are lost
> > in the process.
> >
> > Signed-off-by: Stephen Hemminger <[email protected]>
> >
> > Sucessfully tested on my dual quad core machine too, but
> > iptables only (no ipv6 here) BTW, my new "tbench 8" result is 2450
> > MB/s, (it was 2150 MB/s not so long ago)
> >
> > Acked-by: Eric Dumazet <[email protected]>
> > Signed-off-by: Patrick McHardy <[email protected]>
> >
> > :040000 040000 67ef5761d009330e8d538dd1a3aff77c86995172
> > 0d8d6398ced2117bb6995b0c414fd8013685324e M include :040000
> > 040000 0317c745a0722ade9a6c0cbe9d6cb92abbd40efa
> > 0eb195a365164c05e7c764ab1c2a7da891276837 M net
> >
> > After reverting commit no more these messages.
> >
> > This issue in bugzilla:
> > http://bugzilla.kernel.org/show_bug.cgi?id=12960
>
> Thanks Roman for this excellent report
>
> Could you try following patch ?
>
> [PATCH] netfilter: use rcu_read_bh() in ipt_do_table()
>
> Commit 784544739a25c30637397ace5489eeb6e15d7d49
> (netfilter: iptables: lock free counters) forgot to disable BH
> in arpt_do_table(), ipt_do_table() and ip6t_do_table()
>
> Use rcu_read_lock_bh() instead of rcu_read_lock() cures the problem.
>
> Reported-and-bisected-by: Roman Mindalev <[email protected]>
> Signed-off-by: Eric Dumazet <[email protected]>
> ---
>
> diff --git a/net/ipv4/netfilter/arp_tables.c
> b/net/ipv4/netfilter/arp_tables.c index 35c5f6a..5ba533d 100644
> --- a/net/ipv4/netfilter/arp_tables.c
> +++ b/net/ipv4/netfilter/arp_tables.c
> @@ -253,7 +253,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,
> indev = in ? in->name : nulldevname;
> outdev = out ? out->name : nulldevname;
>
> - rcu_read_lock();
> + rcu_read_lock_bh();
> private = rcu_dereference(table->private);
> table_base =
> rcu_dereference(private->entries[smp_processor_id()]);
> @@ -329,7 +329,7 @@ unsigned int arpt_do_table(struct sk_buff *skb,
> }
> } while (!hotdrop);
>
> - rcu_read_unlock();
> + rcu_read_unlock_bh();
>
> if (hotdrop)
> return NF_DROP;
> diff --git a/net/ipv4/netfilter/ip_tables.c
> b/net/ipv4/netfilter/ip_tables.c index 82ee7c9..810c0b6 100644
> --- a/net/ipv4/netfilter/ip_tables.c
> +++ b/net/ipv4/netfilter/ip_tables.c
> @@ -339,7 +339,7 @@ ipt_do_table(struct sk_buff *skb,
>
> IP_NF_ASSERT(table->valid_hooks & (1 << hook));
>
> - rcu_read_lock();
> + rcu_read_lock_bh();
> private = rcu_dereference(table->private);
> table_base =
> rcu_dereference(private->entries[smp_processor_id()]);
> @@ -437,7 +437,7 @@ ipt_do_table(struct sk_buff *skb,
> }
> } while (!hotdrop);
>
> - rcu_read_unlock();
> + rcu_read_unlock_bh();
>
> #ifdef DEBUG_ALLOW_ALL
> return NF_ACCEPT;
> diff --git a/net/ipv6/netfilter/ip6_tables.c
> b/net/ipv6/netfilter/ip6_tables.c index e89cfa3..dfed176 100644
> --- a/net/ipv6/netfilter/ip6_tables.c
> +++ b/net/ipv6/netfilter/ip6_tables.c
> @@ -365,7 +365,7 @@ ip6t_do_table(struct sk_buff *skb,
>
> IP_NF_ASSERT(table->valid_hooks & (1 << hook));
>
> - rcu_read_lock();
> + rcu_read_lock_bh();
> private = rcu_dereference(table->private);
> table_base =
> rcu_dereference(private->entries[smp_processor_id()]);
> @@ -466,7 +466,7 @@ ip6t_do_table(struct sk_buff *skb,
> #ifdef CONFIG_NETFILTER_DEBUG
> ((struct ip6t_entry *)table_base)->comefrom =
> NETFILTER_LINK_POISON; #endif
> - rcu_read_unlock();
> + rcu_read_unlock_bh();
>
> #ifdef DEBUG_ALLOW_ALL
> return NF_ACCEPT;
>

2.6.29-git4 and 2.6.29-git6 both works fine for me with this patch.
Thanks, Eric!

2009-03-30 12:07:05

by Patrick McHardy

[permalink] [raw]
Subject: Re: Regression caused by commit "netfilter: iptables: lock free counters"

Roman Mindalev wrote:
> Since 2.6.29-git1 I got many debug messages in kernel log.
> One from them:
>
> Mar 29 12:06:11 desktop kernel: [ 917.414369] BUG: using
> smp_processor_id() in preemptible [00000000] code: hddtemp/2820 Mar 29
> 12:06:11 desktop kernel: [ 917.414378] caller is
> ipt_do_table+0xd0/0x3a8 [ip_tables]

Stephen, please have a look at this. I think we need to disable
preemption in *t_do_table() for PREEMPT_RCU.

2009-03-30 12:08:24

by Patrick McHardy

[permalink] [raw]
Subject: Re: Regression caused by commit "netfilter: iptables: lock free counters"

Roman Mindalev wrote:
> On Mon, 30 Mar 2009 09:32:51 +0200
> Eric Dumazet <[email protected]> wrote:
>
>> Could you try following patch ?
>>
>> [PATCH] netfilter: use rcu_read_bh() in ipt_do_table()
>>
>> Commit 784544739a25c30637397ace5489eeb6e15d7d49
>> (netfilter: iptables: lock free counters) forgot to disable BH
>> in arpt_do_table(), ipt_do_table() and ip6t_do_table()
>>
>> Use rcu_read_lock_bh() instead of rcu_read_lock() cures the problem.
>>
>> Reported-and-bisected-by: Roman Mindalev <[email protected]>
>> Signed-off-by: Eric Dumazet <[email protected]>
>> ---
>
> 2.6.29-git4 and 2.6.29-git6 both works fine for me with this patch.
> Thanks, Eric!

Thanks indeed Eric :) Dave, please apply.

Acked-by: Patrick McHardy <[email protected]>

2009-03-30 14:56:39

by Stephen Hemminger

[permalink] [raw]
Subject: Re: Regression caused by commit "netfilter: iptables: lock free counters"

On Mon, 30 Mar 2009 14:08:08 +0200
Patrick McHardy <[email protected]> wrote:

> Roman Mindalev wrote:
> > On Mon, 30 Mar 2009 09:32:51 +0200
> > Eric Dumazet <[email protected]> wrote:
> >
> >> Could you try following patch ?
> >>
> >> [PATCH] netfilter: use rcu_read_bh() in ipt_do_table()
> >>
> >> Commit 784544739a25c30637397ace5489eeb6e15d7d49
> >> (netfilter: iptables: lock free counters) forgot to disable BH
> >> in arpt_do_table(), ipt_do_table() and ip6t_do_table()
> >>
> >> Use rcu_read_lock_bh() instead of rcu_read_lock() cures the problem.
> >>
> >> Reported-and-bisected-by: Roman Mindalev <[email protected]>
> >> Signed-off-by: Eric Dumazet <[email protected]>
> >> ---
> >
> > 2.6.29-git4 and 2.6.29-git6 both works fine for me with this patch.
> > Thanks, Eric!
>
> Thanks indeed Eric :) Dave, please apply.
>
> Acked-by: Patrick McHardy <[email protected]>
>
>
Acked-by: Stephen Hemminger <[email protected]>

The original version had rcu_read_lock_bh() but someone (not sure who)
talked me into removing it.

2009-04-02 07:54:25

by David Miller

[permalink] [raw]
Subject: Re: Regression caused by commit "netfilter: iptables: lock free counters"

From: Patrick McHardy <[email protected]>
Date: Mon, 30 Mar 2009 14:08:08 +0200

> Thanks indeed Eric :) Dave, please apply.
>
> Acked-by: Patrick McHardy <[email protected]>

Applied, thanks everyone.

2009-04-05 07:05:31

by Graham Murray

[permalink] [raw]
Subject: Re: Regression caused by commit "netfilter: iptables: lock free counters"

Roman Mindalev <[email protected]> writes:

> Result of the bisection:
>
> 784544739a25c30637397ace5489eeb6e15d7d49 is first bad commit
> commit 784544739a25c30637397ace5489eeb6e15d7d49

I am seeing a different problem which also bisects to this commit. There are
no kernel messages but ip6tables fails to run.

newton ~ # ip6tables -L -v
FATAL: Module ip6_tables not found.
ip6tables v1.4.3.1: can't initialize ip6tables table `filter': Memory allocation problem
Perhaps ip6tables or your kernel needs to be upgraded.

I get this error no matter which ip6tables sub-command I run. Ip6tables
is built into the kernel, not as modules.

An strace shows the failure to be
socket(PF_INET6, SOCK_RAW, IPPROTO_RAW) = 3
getsockopt(3, SOL_IPV6, 0x40 /* IPV6_??? */, "filter\0\305\0w~\300\0wb\305P\24\312\t\0009b\305\216\23\0\0\310\341/g\16"..., [84]) = 0
brk(0) = 0x8273000
brk(0x8294000) = 0x8294000
getsockopt(3, SOL_IPV6, 0x41 /* IPV6_??? */, 0x8273090, 0xbfd23628) = -1 ENOMEM (Cannot allocate memory)
close(3) = 0

2009-04-05 08:23:15

by David Miller

[permalink] [raw]
Subject: Re: Regression caused by commit "netfilter: iptables: lock free counters"

From: Graham Murray <[email protected]>
Date: Sun, 05 Apr 2009 08:05:17 +0100

Please CC the appropriate mailing lists (as I have now) when reporting
this incredibly useful information. The networking and netfilter
developers largely do not read linux-kernel.

> Roman Mindalev <[email protected]> writes:
>
> > Result of the bisection:
> >
> > 784544739a25c30637397ace5489eeb6e15d7d49 is first bad commit
> > commit 784544739a25c30637397ace5489eeb6e15d7d49
>
> I am seeing a different problem which also bisects to this commit. There are
> no kernel messages but ip6tables fails to run.
>
> newton ~ # ip6tables -L -v
> FATAL: Module ip6_tables not found.
> ip6tables v1.4.3.1: can't initialize ip6tables table `filter': Memory allocation problem
> Perhaps ip6tables or your kernel needs to be upgraded.
>
> I get this error no matter which ip6tables sub-command I run. Ip6tables
> is built into the kernel, not as modules.
>
> An strace shows the failure to be
> socket(PF_INET6, SOCK_RAW, IPPROTO_RAW) = 3
> getsockopt(3, SOL_IPV6, 0x40 /* IPV6_??? */, "filter\0\305\0w~\300\0wb\305P\24\312\t\0009b\305\216\23\0\0\310\341/g\16"..., [84]) = 0
> brk(0) = 0x8273000
> brk(0x8294000) = 0x8294000
> getsockopt(3, SOL_IPV6, 0x41 /* IPV6_??? */, 0x8273090, 0xbfd23628) = -1 ENOMEM (Cannot allocate memory)
> close(3) = 0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2009-04-05 10:01:34

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression caused by commit "netfilter: iptables: lock free counters"

David Miller a ?crit :
> From: Graham Murray <[email protected]>
> Date: Sun, 05 Apr 2009 08:05:17 +0100
>
> Please CC the appropriate mailing lists (as I have now) when reporting
> this incredibly useful information. The networking and netfilter
> developers largely do not read linux-kernel.
>
>> Roman Mindalev <[email protected]> writes:
>>
>>> Result of the bisection:
>>>
>>> 784544739a25c30637397ace5489eeb6e15d7d49 is first bad commit
>>> commit 784544739a25c30637397ace5489eeb6e15d7d49
>> I am seeing a different problem which also bisects to this commit. There are
>> no kernel messages but ip6tables fails to run.
>>
>> newton ~ # ip6tables -L -v
>> FATAL: Module ip6_tables not found.
>> ip6tables v1.4.3.1: can't initialize ip6tables table `filter': Memory allocation problem
>> Perhaps ip6tables or your kernel needs to be upgraded.
>>
>> I get this error no matter which ip6tables sub-command I run. Ip6tables
>> is built into the kernel, not as modules.
>>
>> An strace shows the failure to be
>> socket(PF_INET6, SOCK_RAW, IPPROTO_RAW) = 3
>> getsockopt(3, SOL_IPV6, 0x40 /* IPV6_??? */, "filter\0\305\0w~\300\0wb\305P\24\312\t\0009b\305\216\23\0\0\310\341/g\16"..., [84]) = 0
>> brk(0) = 0x8273000
>> brk(0x8294000) = 0x8294000

so ip6tables allocates about 128 Kbytes of ram in order to get rules from kernel.

>> getsockopt(3, SOL_IPV6, 0x41 /* IPV6_??? */, 0x8273090, 0xbfd23628) = -1 ENOMEM (Cannot allocate memory)
>> close(3) = 0
>>


This is a big problem yes, since "iptables|ip6tables" -L needs to allocate kernel memory
to perform the momentary swap.

On x86, this is potentially a problem if vmalloc space is exhausted or fragmented,
(or lowmem exhausted) and/or many cpus are online/possible.

Graham, could you please give us :

# cat /proc/vmallocinfo
# cat /proc/meminfo

I wonder if your machine is in a state where even an "ip6tables -A ..." would fail anyway
since it should allocate same amount of memory than "ip6tables -L "


This could probably be solved using a single "table" containing rules only, that could
be shared for every cpus. Only counters should be percpu. This should save a lot of ram,
over previous situation (2.6.29 or current one)

(current scheme is to allocate a copy of all rules logic *and* counters per cpu)

Then if we want to be sure "iptables -L" cannot fail, we should reserve this extra space
at load time (iptables -{A|I}", instead.

Other possibility is to use a percpu seqlock as Stephen did in one of his patch, and not swap tables
when doing "iptables -L".
This would slowdown fast path a litle bit (one spinlock/spinunlock) per ipt_do_table() call.

2009-04-05 10:12:51

by Graham Murray

[permalink] [raw]
Subject: Re: Regression caused by commit "netfilter: iptables: lock free counters"

Eric Dumazet <[email protected]> writes:

>>>> 784544739a25c30637397ace5489eeb6e15d7d49 is first bad commit
>>>> commit 784544739a25c30637397ace5489eeb6e15d7d49
>>> I am seeing a different problem which also bisects to this commit. There are
>>> no kernel messages but ip6tables fails to run.
>>>
>>> newton ~ # ip6tables -L -v
>>> FATAL: Module ip6_tables not found.
>>> ip6tables v1.4.3.1: can't initialize ip6tables table `filter': Memory allocation problem
>>> Perhaps ip6tables or your kernel needs to be upgraded.
>>>
>>> I get this error no matter which ip6tables sub-command I run. Ip6tables
>>> is built into the kernel, not as modules.
>>>
>>> An strace shows the failure to be
>>> socket(PF_INET6, SOCK_RAW, IPPROTO_RAW) = 3
>>> getsockopt(3, SOL_IPV6, 0x40 /* IPV6_??? */, "filter\0\305\0w~\300\0wb\305P\24\312\t\0009b\305\216\23\0\0\310\341/g\16"..., [84]) = 0
>>> brk(0) = 0x8273000
>>> brk(0x8294000) = 0x8294000
>
> so ip6tables allocates about 128 Kbytes of ram in order to get rules from kernel.
>
>>> getsockopt(3, SOL_IPV6, 0x41 /* IPV6_??? */, 0x8273090, 0xbfd23628) = -1 ENOMEM (Cannot allocate memory)
>>> close(3) = 0
>>>
>
>
> This is a big problem yes, since "iptables|ip6tables" -L needs to allocate kernel memory
> to perform the momentary swap.
>
> On x86, this is potentially a problem if vmalloc space is exhausted or fragmented,
> (or lowmem exhausted) and/or many cpus are online/possible.

iptables gives me no problems at all, it is just ip6tables that
fails. The first indication of this is during the init scripts when
ip6tables-restore fails.

> Graham, could you please give us :
>
> # cat /proc/vmallocinfo
> # cat /proc/meminfo

newton ~ # cat /proc/vmallocinfo
0xf81fe000-0xf8200000 8192 hpet_enable+0x2f/0x272 phys=fed00000 ioremap
0xf8202000-0xf8204000 8192 acpi_os_map_memory+0x24/0x3e phys=bff8e000 ioremap
0xf8206000-0xf8208000 8192 acpi_os_map_memory+0x24/0x3e phys=bff8e000 ioremap
0xf820a000-0xf820c000 8192 acpi_os_map_memory+0x24/0x3e phys=bff8e000 ioremap
0xf820e000-0xf8210000 8192 acpi_os_map_memory+0x24/0x3e phys=bff89000 ioremap
0xf8210000-0xf821b000 45056 acpi_os_map_memory+0x24/0x3e phys=bff80000 ioremap
0xf821c000-0xf821e000 8192 acpi_os_map_memory+0x24/0x3e phys=bff80000 ioremap
0xf8220000-0xf8222000 8192 acpi_os_map_memory+0x24/0x3e phys=fed1f000 ioremap
0xf8224000-0xf8227000 12288 acpi_os_map_memory+0x24/0x3e phys=bfff0000 ioremap
0xf8228000-0xf822a000 8192 acpi_os_map_memory+0x24/0x3e phys=bff8e000 ioremap
0xf822c000-0xf822e000 8192 acpi_os_map_memory+0x24/0x3e phys=bff80000 ioremap
0xf8230000-0xf8232000 8192 acpi_os_map_memory+0x24/0x3e phys=bff80000 ioremap
0xf8234000-0xf8236000 8192 acpi_os_map_memory+0x24/0x3e phys=bff8e000 ioremap
0xf8238000-0xf823a000 8192 acpi_os_map_memory+0x24/0x3e phys=bff89000 ioremap
0xf823b000-0xf8247000 49152 zisofs_init+0x1b/0x46 pages=11 vmalloc
0xf8248000-0xf824d000 20480 radeonfb_pci_register+0x193/0x174c phys=ff9f0000 ioremap
0xf824e000-0xf8250000 8192 pci_iomap+0xdd/0xf0 phys=ffafb000 ioremap
0xf8252000-0xf8254000 8192 usb_hcd_pci_probe+0x1a1/0x410 phys=ffafb000 ioremap
0xf8254000-0xf8257000 12288 pci_iomap+0xdd/0xf0 phys=ff6fe000 ioremap
0xf8258000-0xf825a000 8192 iTCO_wdt_probe+0x2f1/0x360 phys=fed1f000 ioremap
0xf825e000-0xf8260000 8192 module_alloc+0x1e/0x40 pages=1 vmalloc
0xf8260000-0xf8265000 20480 pci_ioremap_bar+0x4e/0xa0 phys=ffafc000 ioremap
0xf8266000-0xf8277000 69632 snd_malloc_sgbuf_pages+0x1f4/0x230 vmap
0xf827b000-0xf8280000 20480 module_alloc+0x1e/0x40 pages=4 vmalloc
0xf8280000-0xf9281000 16781312 radeonfb_pci_register+0x3a0/0x174c phys=d0000000 ioremap
0xf9282000-0xf9293000 69632 snd_malloc_sgbuf_pages+0x1f4/0x230 vmap
0xf9294000-0xf92a5000 69632 snd_malloc_sgbuf_pages+0x1f4/0x230 vmap
0xf92a6000-0xf92b7000 69632 snd_malloc_sgbuf_pages+0x1f4/0x230 vmap
0xf92b8000-0xf92c9000 69632 snd_malloc_sgbuf_pages+0x1f4/0x230 vmap
0xf92ca000-0xf92db000 69632 snd_malloc_sgbuf_pages+0x1f4/0x230 vmap
0xf92dc000-0xf931d000 266240 pcpu_alloc+0x16d/0x440 vpages
0xf9320000-0xf9325000 20480 sky2_probe+0x12d/0xa79 [sky2] phys=ff7fc000 ioremap
0xf9326000-0xf932f000 36864 module_alloc+0x1e/0x40 pages=8 vmalloc
0xf932f000-0xf933c000 53248 module_alloc+0x1e/0x40 pages=12 vmalloc
0xf933d000-0xf9340000 12288 drm_addmap_core+0x6f8/0x8a0 [drm] pages=2 vmalloc user
0xf9340000-0xf9345000 20480 sky2_probe+0x12d/0xa79 [sky2] phys=ff8fc000 ioremap
0xf9346000-0xf934f000 36864 drm_sg_alloc+0x13f/0x2a0 [drm] pages=8 vmalloc
0xf9352000-0xf9357000 20480 drm_ht_create+0xaf/0xe0 [drm] pages=4 vmalloc
0xf935d000-0xf9361000 16384 module_alloc+0x1e/0x40 pages=3 vmalloc
0xf9368000-0xf936c000 16384 module_alloc+0x1e/0x40 pages=3 vmalloc
0xf9373000-0xf9377000 16384 module_alloc+0x1e/0x40 pages=3 vmalloc
0xf9380000-0xf9391000 69632 drm_addmap_core+0x6cb/0x8a0 [drm] phys=ff9f0000 ioremap
0xf9394000-0xf9399000 20480 module_alloc+0x1e/0x40 pages=4 vmalloc
0xf93a8000-0xf93b3000 45056 module_alloc+0x1e/0x40 pages=10 vmalloc
0xf93c0000-0xf93c7000 28672 module_alloc+0x1e/0x40 pages=6 vmalloc
0xf93d3000-0xf93d9000 24576 module_alloc+0x1e/0x40 pages=5 vmalloc
0xf93e0000-0xf93e9000 36864 drm_core_ioremap_wc+0x43/0x80 [drm] phys=dfff8000 ioremap
0xf93ef000-0xf93ff000 65536 module_alloc+0x1e/0x40 pages=15 vmalloc
0xf9400000-0xf9411000 69632 drm_addmap_core+0x6cb/0x8a0 [drm] phys=ff9f0000 ioremap
0xf9421000-0xf9437000 90112 module_alloc+0x1e/0x40 pages=21 vmalloc
0xf9445000-0xf944d000 32768 module_alloc+0x1e/0x40 pages=7 vmalloc
0xf9457000-0xf945a000 12288 module_alloc+0x1e/0x40 pages=2 vmalloc
0xf9471000-0xf9480000 61440 module_alloc+0x1e/0x40 pages=14 vmalloc
0xf948b000-0xf9490000 20480 module_alloc+0x1e/0x40 pages=4 vmalloc
0xf94ac000-0xf94ae000 8192 dm_vcalloc+0x5a/0x90 [dm_mod] pages=1 vmalloc
0xf94af000-0xf94d6000 159744 module_alloc+0x1e/0x40 pages=38 vmalloc
0xf94d9000-0xf94db000 8192 dm_vcalloc+0x5a/0x90 [dm_mod] pages=1 vmalloc
0xf9569000-0xf95c3000 368640 module_alloc+0x1e/0x40 pages=89 vmalloc
0xf989d000-0xf9a94000 2060288 sys_swapon+0x56d/0xa60 pages=502 vmalloc
0xf9a95000-0xf9c18000 1585152 sys_swapon+0x56d/0xa60 pages=386 vmalloc
0xf9c46000-0xf9c49000 12288 xt_alloc_table_info+0xc0/0xe0 pages=2 vmalloc
0xf9c4a000-0xf9c4d000 12288 xt_alloc_table_info+0xc0/0xe0 pages=2 vmalloc
0xf9c4e000-0xf9c51000 12288 xt_alloc_table_info+0xc0/0xe0 pages=2 vmalloc
0xf9c52000-0xf9c55000 12288 xt_alloc_table_info+0xc0/0xe0 pages=2 vmalloc
0xf9c60000-0xf9c62000 8192 module_alloc+0x1e/0x40 pages=1 vmalloc
0xf9c6d000-0xf9c74000 28672 module_alloc+0x1e/0x40 pages=6 vmalloc
0xf9c7d000-0xf9c80000 12288 module_alloc+0x1e/0x40 pages=2 vmalloc
0xf9c98000-0xf9ca7000 61440 module_alloc+0x1e/0x40 pages=14 vmalloc
0xf9ccc000-0xf9cd1000 20480 module_alloc+0x1e/0x40 pages=4 vmalloc
0xf9cf6000-0xfbcf7000 33558528 drm_sg_alloc+0x13f/0x2a0 [drm] pages=8192 vmalloc vpages

newton ~ # cat /proc/meminfo
MemTotal: 6210480 kB
MemFree: 2935668 kB
Buffers: 648416 kB
Cached: 1188316 kB
SwapCached: 0 kB
Active: 1606488 kB
Inactive: 1483096 kB
Active(anon): 1151104 kB
Inactive(anon): 129184 kB
Active(file): 455384 kB
Inactive(file): 1353912 kB
Unevictable: 0 kB
Mlocked: 0 kB
HighTotal: 5379592 kB
HighFree: 2925400 kB
LowTotal: 830888 kB
LowFree: 10268 kB
SwapTotal: 7261360 kB
SwapFree: 7261360 kB
Dirty: 84 kB
Writeback: 0 kB
AnonPages: 1253024 kB
Mapped: 130664 kB
Slab: 114644 kB
SReclaimable: 102084 kB
SUnreclaim: 12560 kB
PageTables: 8132 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 10366600 kB
Committed_AS: 1969496 kB
VmallocTotal: 122880 kB
VmallocUsed: 55208 kB
VmallocChunk: 62492 kB
DirectMap4k: 8184 kB
DirectMap2M: 903168 kB

2009-04-05 10:30:42

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression caused by commit "netfilter: iptables: lock free counters"

Graham Murray a ?crit :
> Eric Dumazet <[email protected]> writes:
>
>>>>> 784544739a25c30637397ace5489eeb6e15d7d49 is first bad commit
>>>>> commit 784544739a25c30637397ace5489eeb6e15d7d49
>>>> I am seeing a different problem which also bisects to this commit. There are
>>>> no kernel messages but ip6tables fails to run.
>>>>
>>>> newton ~ # ip6tables -L -v
>>>> FATAL: Module ip6_tables not found.
>>>> ip6tables v1.4.3.1: can't initialize ip6tables table `filter': Memory allocation problem
>>>> Perhaps ip6tables or your kernel needs to be upgraded.
>>>>
>>>> I get this error no matter which ip6tables sub-command I run. Ip6tables
>>>> is built into the kernel, not as modules.
>>>>
>>>> An strace shows the failure to be
>>>> socket(PF_INET6, SOCK_RAW, IPPROTO_RAW) = 3
>>>> getsockopt(3, SOL_IPV6, 0x40 /* IPV6_??? */, "filter\0\305\0w~\300\0wb\305P\24\312\t\0009b\305\216\23\0\0\310\341/g\16"..., [84]) = 0
>>>> brk(0) = 0x8273000
>>>> brk(0x8294000) = 0x8294000
>> so ip6tables allocates about 128 Kbytes of ram in order to get rules from kernel.
>>
>>>> getsockopt(3, SOL_IPV6, 0x41 /* IPV6_??? */, 0x8273090, 0xbfd23628) = -1 ENOMEM (Cannot allocate memory)
>>>> close(3) = 0
>>>>
>>
>> This is a big problem yes, since "iptables|ip6tables" -L needs to allocate kernel memory
>> to perform the momentary swap.
>>
>> On x86, this is potentially a problem if vmalloc space is exhausted or fragmented,
>> (or lowmem exhausted) and/or many cpus are online/possible.
>
> iptables gives me no problems at all, it is just ip6tables that
> fails. The first indication of this is during the init scripts when
> ip6tables-restore fails.

I see, its a plain bug in net/ipv6/netfilter/ip6_tables.c
function alloc_counters() always returns -ENOMEM

Unfortunatly , its Sunday here and I have to run for lunch time with family :)

If nobody beats me, I will do the fix in a couple of hours...

Thank you

2009-04-05 11:37:05

by Jan Engelhardt

[permalink] [raw]
Subject: Re: Regression caused by commit "netfilter: iptables: lock free counters"


On Sunday 2009-04-05 12:01, Eric Dumazet wrote:
>
>This could probably be solved using a single "table" containing
>rules only, that could be shared for every cpus. Only counters
>should be percpu. This should save a lot of ram, over previous
>situation (2.6.29 or current one)

Why would counters stay separate?

I recognize all of this table copying is related to do NUMA
optimizations, and I think I heard cache bouncing too somewhere else.

[ http://marc.info/?l=netfilter-devel&m=119903624211253&w=2 ]

2009-04-05 12:30:28

by Eric Dumazet

[permalink] [raw]
Subject: [PATCH] netfilter: ip6tables fix

Eric Dumazet a ?crit :
> Graham Murray a ?crit :
>> iptables gives me no problems at all, it is just ip6tables that
>> fails. The first indication of this is during the init scripts when
>> ip6tables-restore fails.
>
> I see, its a plain bug in net/ipv6/netfilter/ip6_tables.c
> function alloc_counters() always returns -ENOMEM
>
> Unfortunatly , its Sunday here and I have to run for lunch time with family :)
>
> If nobody beats me, I will do the fix in a couple of hours...

Here is the fix, thanks Graham for the report !

[PATCH] netfilter: ip6tables fix

ip6_tables.c alloc_counters() misses a return statement, making
ip6tables -N always failing and leaking memory.

Reported-by: Graham Murray <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
---

diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index dfed176..800ae85 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -1033,6 +1033,8 @@ static struct xt_counters *alloc_counters(struct xt_table *table)

xt_free_table_info(info);

+ return counters;
+
free_counters:
vfree(counters);
nomem:

2009-04-05 12:35:20

by Eric Dumazet

[permalink] [raw]
Subject: Re: Regression caused by commit "netfilter: iptables: lock free counters"

Jan Engelhardt a ?crit :
> On Sunday 2009-04-05 12:01, Eric Dumazet wrote:
>> This could probably be solved using a single "table" containing
>> rules only, that could be shared for every cpus. Only counters
>> should be percpu. This should save a lot of ram, over previous
>> situation (2.6.29 or current one)
>
> Why would counters stay separate?
>
> I recognize all of this table copying is related to do NUMA
> optimizations, and I think I heard cache bouncing too somewhere else.
>
> [ http://marc.info/?l=netfilter-devel&m=119903624211253&w=2 ]
>
>

Not only NUMA, but SMP too. counters are integrated in rules themselves.
So in order to avoid ping pongs between cpus, we choose to allocate
one copy of rules/counters per cpu.

But with some changes, we could let the rules read-only and shared by
all cpus, and shadow counters only on percpu variables, thus reducing
memory costs.

2009-04-05 12:40:52

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH] netfilter: ip6tables fix

Eric Dumazet a ?crit :
> Eric Dumazet a ?crit :
>> Graham Murray a ?crit :
>>> iptables gives me no problems at all, it is just ip6tables that
>>> fails. The first indication of this is during the init scripts when
>>> ip6tables-restore fails.
>> I see, its a plain bug in net/ipv6/netfilter/ip6_tables.c
>> function alloc_counters() always returns -ENOMEM
>>
>> Unfortunatly , its Sunday here and I have to run for lunch time with family :)
>>
>> If nobody beats me, I will do the fix in a couple of hours...
>
> Here is the fix, thanks Graham for the report !
>
> [PATCH] netfilter: ip6tables fix
>
> ip6_tables.c alloc_counters() misses a return statement, making
> ip6tables -N always failing and leaking memory.

Oh well, this ChangeLog is not correct :(

Here is an updated patch with correct ChangeLog, sorry David.
(-L instead of -N, and there was no memory leak involved)

[PATCH] netfilter: ip6tables fix

ip6_tables.c alloc_counters() misses a return statement, making
ip6tables -L always failing.

Reported-by: Graham Murray <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
---

diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index dfed176..800ae85 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -1033,6 +1033,8 @@ static struct xt_counters *alloc_counters(struct xt_table *table)

xt_free_table_info(info);

+ return counters;
+
free_counters:
vfree(counters);
nomem:

2009-04-06 15:08:48

by Patrick McHardy

[permalink] [raw]
Subject: Re: [PATCH] netfilter: ip6tables fix

Eric Dumazet wrote:
> [PATCH] netfilter: ip6tables fix
>
> ip6_tables.c alloc_counters() misses a return statement, making
> ip6tables -L always failing.
>
> Reported-by: Graham Murray <[email protected]>
> Signed-off-by: Eric Dumazet <[email protected]>

Applied, thanks Eric.