2005-03-08 18:00:18

by Michal Vanco

Subject: 2.6.11 on AMD64 traps

Hello,

I see this problem running 2.6.11 on dual AMD64:

Running quagga routing daemon (ospf+bgp) and issuing "netstat -rn |wc -l" command
while quagga tries to load more than 154000 routes from its bgp neighbours causes this trap:

Unable to handle kernel paging request at 00000000007f5c60 RIP:
<ffffffff8041be35>{fib_get_next+181}
PGD 3a112067 PUD 3a115067 PMD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in:
Pid: 2537, comm: netstat Not tainted 2.6.11-mv
RIP: 0010:[<ffffffff8041be35>] <ffffffff8041be35>{fib_get_next+181}
RSP: 0018:ffff81003a13fe90 EFLAGS: 00010206
RAX: ffff81003a74c000 RBX: 0000000000000000 RCX: ffff81003a13ff50
RDX: 00000000007f5c60 RSI: 0000000000000000 RDI: ffff81003a004d00
RBP: ffff81003a13fed8 R08: ffff81003f3ff7c0 R09: 0000000000000800
R10: 00007fffffffefe0 R11: 0000000000000246 R12: ffff810002231480
R13: 00002aaaaab08000 R14: 0000000000000400 R15: ffff8100022314a8
FS: 00002aaaaae00620(0000) GS:ffffffff806195c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000007f5c60 CR3: 000000003a12e000 CR4: 00000000000006e0
Process netstat (pid: 2537, threadinfo ffff81003a13e000, task ffff81003a66a760)
Stack: ffffffff8041bf0f ffff810002231480 ffff81003a67ac80 0000000000000000
ffffffff8019576b 0000000000000000 ffff81003a13ff50 00002aaaaab08000
00000000000006f7 00000000000006f8
Call Trace:<ffffffff8041bf0f>{fib_seq_start+63} <ffffffff8019576b>{seq_read+219}
<ffffffff8017497f>{vfs_read+191} <ffffffff80174c53>{sys_read+83}
<ffffffff8010d1ba>{system_call+126}

Code: 48 8b 0a 0f 18 09 48 8b 72 10 48 8b 06 0f 18 08 48 8d 42 10
RIP <ffffffff8041be35>{fib_get_next+181} RSP <ffff81003a13fe90>
CR2: 00000000007f5c60

I saw the same issue on 2.6.10 before. I'm not a kernel hacker, but it sounds like
a locking problem. But maybe I'm totally wrong about this.

michal
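
The call trace above enters the kernel through sys_read and seq_read, so the
failing path can in principle be driven by any program that reads the route
table file in several read() calls. The following is a minimal sketch of such
a reader, assuming the file involved is /proc/net/route; the function name
count_lines_chunked and its chunk parameter are hypothetical and only serve
the illustration.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Count '\n' characters in `path`, reading at most `chunk` bytes per read().
 * Small chunks mean many read() calls on the same open file.
 * Returns -1 if the file cannot be opened or a read fails.
 * (Hypothetical helper, not part of netstat or the kernel.)
 */
static long count_lines_chunked(const char *path, size_t chunk)
{
    char buf[4096];
    long lines = 0;
    ssize_t n;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    /* Clamp the chunk size to the local buffer. */
    if (chunk == 0 || chunk > sizeof(buf))
        chunk = sizeof(buf);

    while ((n = read(fd, buf, chunk)) > 0)
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] == '\n')
                lines++;

    close(fd);
    return n < 0 ? -1 : lines;
}
```

Calling count_lines_chunked("/proc/net/route", 1024) while routes are being
added would approximate the "netstat -rn | wc -l" case, though whether it
triggers the trap depends on timing.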



2005-03-08 18:35:47

by Andre Tomt

Subject: Re: 2.6.11 on AMD64 traps

[just adding netdev to CC, from LKML]

Michal Vanco wrote:
> Hello,
>
> I see this problem running 2.6.11 on dual AMD64:
>
> Running quagga routing daemon (ospf+bgp) and issuing "netstat -rn |wc -l" command
> while quagga tries to load more than 154000 routes from its bgp neighbours causes this trap:
>
> Unable to handle kernel paging request at 00000000007f5c60 RIP:
> <ffffffff8041be35>{fib_get_next+181}
> PGD 3a112067 PUD 3a115067 PMD 0
> Oops: 0000 [1] SMP
> CPU 1
> Modules linked in:
> Pid: 2537, comm: netstat Not tainted 2.6.11-mv
> RIP: 0010:[<ffffffff8041be35>] <ffffffff8041be35>{fib_get_next+181}
> RSP: 0018:ffff81003a13fe90 EFLAGS: 00010206
> RAX: ffff81003a74c000 RBX: 0000000000000000 RCX: ffff81003a13ff50
> RDX: 00000000007f5c60 RSI: 0000000000000000 RDI: ffff81003a004d00
> RBP: ffff81003a13fed8 R08: ffff81003f3ff7c0 R09: 0000000000000800
> R10: 00007fffffffefe0 R11: 0000000000000246 R12: ffff810002231480
> R13: 00002aaaaab08000 R14: 0000000000000400 R15: ffff8100022314a8
> FS: 00002aaaaae00620(0000) GS:ffffffff806195c0(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00000000007f5c60 CR3: 000000003a12e000 CR4: 00000000000006e0
> Process netstat (pid: 2537, threadinfo ffff81003a13e000, task ffff81003a66a760)
> Stack: ffffffff8041bf0f ffff810002231480 ffff81003a67ac80 0000000000000000
> ffffffff8019576b 0000000000000000 ffff81003a13ff50 00002aaaaab08000
> 00000000000006f7 00000000000006f8
> Call Trace:<ffffffff8041bf0f>{fib_seq_start+63} <ffffffff8019576b>{seq_read+219}
> <ffffffff8017497f>{vfs_read+191} <ffffffff80174c53>{sys_read+83}
> <ffffffff8010d1ba>{system_call+126}
>
> Code: 48 8b 0a 0f 18 09 48 8b 72 10 48 8b 06 0f 18 08 48 8d 42 10
> RIP <ffffffff8041be35>{fib_get_next+181} RSP <ffff81003a13fe90>
> CR2: 00000000007f5c60
>
> I saw the same issue on 2.6.10 before. I'm not a kernel hacker, but it sounds like
> a locking problem. But maybe I'm totally wrong about this.
>
> michal

2005-03-09 19:47:18

by Patrick McHardy

Subject: Re: 2.6.11 on AMD64 traps

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
# 2005/03/09 20:41:46+01:00 [email protected]
# [IPV4]: Fix crash while reading /proc/net/route caused by stale pointers
#
# Signed-off-by: Patrick McHardy <[email protected]>
#
# net/ipv4/fib_hash.c
# 2005/03/09 20:41:37+01:00 [email protected] +11 -1
# [IPV4]: Fix crash while reading /proc/net/route caused by stale pointers
#
# Signed-off-by: Patrick McHardy <[email protected]>
#
diff -Nru a/net/ipv4/fib_hash.c b/net/ipv4/fib_hash.c
--- a/net/ipv4/fib_hash.c 2005-03-09 20:43:55 +01:00
+++ b/net/ipv4/fib_hash.c 2005-03-09 20:43:55 +01:00
@@ -919,13 +919,23 @@
 	return fa;
 }
 
+static struct fib_alias *fib_get_idx(struct seq_file *seq, loff_t pos)
+{
+	struct fib_alias *fa = fib_get_first(seq);
+
+	if (fa)
+		while (pos && (fa = fib_get_next(seq)))
+			--pos;
+	return pos ? NULL : fa;
+}
+
 static void *fib_seq_start(struct seq_file *seq, loff_t *pos)
 {
 	void *v = NULL;
 
 	read_lock(&fib_hash_lock);
 	if (ip_fib_main_table)
-		v = *pos ? fib_get_next(seq) : SEQ_START_TOKEN;
+		v = *pos ? fib_get_idx(seq, *pos - 1) : SEQ_START_TOKEN;
 	return v;
 }
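
The idea behind fib_get_idx() can be sketched outside the kernel. The example
below is a simplified userspace model, not the kernel code: struct entry and
struct iter_state are hypothetical stand-ins for struct fib_alias and the
iterator state, and a plain singly linked list replaces the real table. It
shows the pattern of rebuilding the position from the head on every start
rather than continuing from a pointer cached before the lock was dropped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a routing entry. */
struct entry {
    int id;
    struct entry *next;
};

/* Hypothetical iterator state; `cur` is only trustworthy while the
 * list cannot change, i.e. between a start and the matching stop. */
struct iter_state {
    struct entry *head;
    struct entry *cur;
};

static struct entry *get_first(struct iter_state *it)
{
    it->cur = it->head;
    return it->cur;
}

static struct entry *get_next(struct iter_state *it)
{
    assert(it->cur != NULL);   /* caller must hold a valid position */
    it->cur = it->cur->next;
    return it->cur;
}

/* Same shape as fib_get_idx() in the patch: never reuse the cached
 * pointer, walk `pos` steps from the first entry instead. */
static struct entry *get_idx(struct iter_state *it, long pos)
{
    struct entry *e = get_first(it);

    if (e)
        while (pos && (e = get_next(it)))
            --pos;
    return pos ? NULL : e;
}
```

If an entry is unlinked between two reads, get_idx() still returns a live
entry (or NULL past the end), whereas calling get_next() on the old cached
pointer could dereference freed memory.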



2005-03-09 20:41:26

by Michal Vanco

Subject: Re: 2.6.11 on AMD64 traps

On Wednesday 09 March 2005 20:45, Patrick McHardy wrote:
> > Michal Vanco wrote:
> >> I see this problem running 2.6.11 on dual AMD64:
> >>
> >> Running quagga routing daemon (ospf+bgp) and issuing "netstat -rn |wc
> >> -l" command
> >> while quagga tries to load more than 154000 routes from its bgp
> >> neighbours causes this trap:
>
> This patch should fix it. The crash is caused by stale pointers,
> the pointers in fib_iter_state are not reloaded after seq->stop()
> followed by seq->start(pos > 0).

Well. Trap vanished after applying this patch, but another weird thing occurs:

# ip route show | wc -l
156033
# date; time ip route show > /dev/null; date; time netstat -rn > /dev/null
Wed Mar 9 22:15:21 CET 2005

real 0m0.656s
user 0m0.415s
sys 0m0.242s
Wed Mar 9 22:15:22 CET 2005

real 6m41.472s
user 0m1.261s
sys 6m40.143s

regards,
--
Ing. Michal Vančo
Network Engineer
SATRO s.r.o.
e-mail: [email protected]



2005-03-09 20:46:00

by Patrick McHardy

Subject: Re: 2.6.11 on AMD64 traps

Michal Vanco wrote:
> On Wednesday 09 March 2005 20:45, Patrick McHardy wrote:
>>
>>This patch should fix it. The crash is caused by stale pointers,
>>the pointers in fib_iter_state are not reloaded after seq->stop()
>>followed by seq->start(pos > 0).
>
> Well. Trap vanished after applying this patch, but another weird thing occurs:
>
> # ip route show | wc -l
> 156033
> # date; time ip route show > /dev/null; date; time netstat -rn > /dev/null
> Wed Mar 9 22:15:21 CET 2005
>
> real 0m0.656s
> user 0m0.415s
> sys 0m0.242s
> Wed Mar 9 22:15:22 CET 2005
>
> real 6m41.472s
> user 0m1.261s
> sys 6m40.143s

Yes, I know it is totally inefficient: with this patch, every
seq->start(pos > 0) walks the table again from the first entry.
Just use ip route, which doesn't suffer from this problem.
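
A rough cost model shows why re-walking on every start gets expensive for a
large table. It assumes that each read() ends with stop() and the next one
begins with start(pos), and that start() must step over pos entries before it
can continue. The function name steps_with_rewalk and the flat-list model are
illustrative assumptions; the real fib_hash walk is more involved.

```c
#include <stdint.h>

/*
 * Count list steps needed to emit `entries` items when each read()
 * returns at most `per_read` items and every start() re-walks from
 * the head. Hypothetical model, not kernel code.
 */
static uint64_t steps_with_rewalk(uint64_t entries, uint64_t per_read)
{
    uint64_t steps = 0;

    if (per_read == 0)
        return 0;   /* no progress possible; avoid looping forever */

    for (uint64_t pos = 0; pos < entries; pos += per_read) {
        uint64_t left = entries - pos;
        uint64_t chunk = left < per_read ? left : per_read;

        steps += pos;     /* start(): walk from the head up to pos */
        steps += chunk;   /* next(): emit this read's entries */
    }
    return steps;
}
```

For example, steps_with_rewalk(4, 1) is 10, against 4 steps for a single
uninterrupted pass; the gap grows roughly with the square of the table size.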

Regards
Patrick

2005-03-11 02:26:31

by David Miller

Subject: Re: 2.6.11 on AMD64 traps

On Wed, 09 Mar 2005 20:45:35 +0100
Patrick McHardy <[email protected]> wrote:

> > Michal Vanco wrote:
> >>
> >> I see this problem running 2.6.11 on dual AMD64:
> >>
> >> Running quagga routing daemon (ospf+bgp) and issuing "netstat -rn |wc
> >> -l" command
> >> while quagga tries to load more than 154000 routes from its bgp
> >> neighbours causes this trap:
>
> This patch should fix it. The crash is caused by stale pointers,
> the pointers in fib_iter_state are not reloaded after seq->stop()
> followed by seq->start(pos > 0).

Applied, thanks Patrick.