LinuxLists.cc - kernel BUG at page_alloc.c:98 -- compiling with distcc

Subject: Re: kernel BUG at page_alloc.c:98 -- compiling with distcc

On Friday April 2nd 2004 Marco Fais wrote:

> [...]

> When compiling with distcc the local system doesn't show any kernel
> panic, while the same system used as a "remote compiler system" dies
> very quickly.

> >>EIP; c01372ae <__free_pages_ok+26e/280> <=====
> ...
> Trace; e08d7eab <[8139too]rtl8139_rx_interrupt+6b/3b0>

> <0>Kernel panic: Aiee, killing interrupt handler!

From a very superficial examination of your data, it looks like there is
something going wrong in the interrupt handling of the driver for (one
of) the network cards.

Distcc can generate a lot of network traffic. You might experiment with
switching the role of the two network cards (in case there might be
something wrong with the hardware of one of them) or use the '--listen'
directive in the distccd configuration to do so.

If the panic is indeed caused by the network driver, then it should also
be possible to trigger and debug this with a tool like netcat: listen on
the panicking box with 'nc -l someport' and send some stuff from another
box with 'cat /dev/zero | nc panicker someport' (or vice versa).

Sadly, none of this will solve your problem of course, but it might
pinpoint the cause somewhat more accurately, hopefully leading to a
solution!
--
Marco Roeland

2004-04-02 15:06:00

Subject: Re: kernel BUG at page_alloc.c:98 -- compiling with distcc

On Friday April 2nd 2004 Marco Fais wrote:

> Mmmh, all the servers use an RTL-8139 compatible card, with the same
> 8139too driver. So this can be the problem.

Hey, I'm by no means an expert. Suggesting the driver is to blame was
mostly based on the fact that compiling locally worked, and compiling from
a remote machine triggered a panic. The rest of your description below
indicates that it probably *isn't* the driver.

> But at this moment I'm doing a kernel compile while receiving and sending
> huge amounts of data using netcat, as you suggested... and it works perfectly.

> Ok, next I will test the second network card on the server, just to rule out
> the possibility of a hardware failure -- but I have 4 other servers that
> show the same behaviour, so I don't think it's caused by faulty hardware.

If 4 other servers show the same behaviour, and netcatting a lot of data
doesn't panic the machine, that strongly suggests that the network card
and driver are innocent! I thought only one machine had the problem.

> Running this test for about an hour, using all the available bandwidth on
> the NIC, while compiling the kernel in a loop... no problem. Using distcc,
> compiling the same files, causes a kernel panic in a few seconds.
> So this test doesn't show the problem, but I think that the network
> card driver (or the hardware) is involved anyway.

Why do you think so? It seems there's nothing wrong with it; you've just
tested that.

One last suggestion:

Have you tried a local distcc compile, but specifying the host name as
its IP address or its real name? Distcc treats 'localhost' differently,
but if it sees an IP address it will use the network route. As specified
in the man page this is slower, but if there's something peculiar with
the interaction of distcc with the network layer, then perhaps this
triggers it. You can also use the '--verbose' option on distccd; perhaps
it reports something useful before panicking.
--
Marco Roeland

2004-04-02 23:34:24

Subject: Re: kernel BUG at page_alloc.c:98 -- compiling with distcc

(linux-2.4.25)

Marco Fais <[email protected]> wrote:
>
> kernel BUG at page_alloc.c:98!
>

uh-oh.

>
> > >EIP; c01372ae <__free_pages_ok+26e/280> <=====
>
> > >ebx; c14b3f00 <_end+116e728/204d48a8>
> > >ecx; c14b3f00 <_end+116e728/204d48a8>
> > >edi; dec11340 <_end+1e8cbb68/204d48a8>
> > >ebp; c02f1d04 <init_task_union+1d04/2000>
> > >esp; c02f1cd4 <init_task_union+1cd4/2000>
>
> Trace; c0135a76 <kmem_cache_free_one+f6/210>
> Trace; c021667b <skb_release_data+6b/90>
> Trace; c02166b4 <kfree_skbmem+14/70>
> Trace; c0216816 <__kfree_skb+106/160>
> Trace; c023be39 <tcp_clean_rtx_queue+139/330>
> Trace; c023c385 <tcp_ack+c5/380>
> Trace; c023f51c <tcp_rcv_state_process+19c/a90>
> Trace; c02465a9 <tcp_v4_do_rcv+a9/130>
> Trace; c0246a76 <tcp_v4_rcv+446/560>
> Trace; c022dad0 <ip_local_deliver_finish+0/180>
> Trace; c022dc25 <ip_local_deliver_finish+155/180>
> Trace; c0222780 <nf_hook_slow+b0/170>
> Trace; c022dad0 <ip_local_deliver_finish+0/180>
> Trace; c022d88f <ip_local_deliver+4f/70>
> Trace; c022dad0 <ip_local_deliver_finish+0/180>
> Trace; c022de3a <ip_rcv_finish+1ea/270>
> Trace; e08d7eab <[8139too]rtl8139_rx_interrupt+6b/3b0>
> Trace; c021ad14 <netif_receive_skb+c4/180>
> Trace; c021ae3f <process_backlog+6f/120>
> Trace; c021af5a <net_rx_action+6a/100>
> Trace; c0121cd7 <do_softirq+97/a0>
> Trace; c010a66d <do_IRQ+bd/f0>

distcc uses sendfile(). The 8139too hardware and driver are
zerocopy-capable so the kernel uses zerocopy direct-from-user-pages for
sendfile().

The bug is that the networking layer is releasing the final ref to user
pages from softirq context. Those pages are still on the page LRU so
__free_pages_ok() will take them off.

Problem is, removing these pages from the LRU requires that the
pagemap_lru_lock be taken, and that lock may not be taken from interrupt
context. So we go BUG instead.

This was all discussed fairly extensively a couple of years back and I
thought it ended up being fixed.

2004-04-05 10:42:14

Subject: Re: kernel BUG at page_alloc.c:98 -- compiling with distcc

Marco Roeland wrote:

>>Mmmh, all the servers use an RTL-8139 compatible card, with the same
>>8139too driver. So this can be the problem.
> Hey, I'm by no means an expert. Suggesting the driver is to blame was
> mostly based on the fact that compiling locally worked, and compiling from
> a remote machine triggered a panic. The rest of your description below
> indicates that it probably *isn't* the driver.

I was not saying *this is the problem*, just noticing that all the
systems that show this problem have this network card, while the other
systems that are working perfectly are using other network hardware
(e100 driver) :)

>>Ok, next I will test the second network card on the server, just to rule out
>>the possibility of a hardware failure -- but I have 4 other servers that
>>show the same behaviour, so I don't think it's caused by faulty hardware.
> If 4 other servers show the same behaviour, and netcatting a lot of data
> doesn't panic the machine, that strongly suggests that the network card
> and driver are innocent! I thought only one machine had the problem.

If you read Andrew's message, it seems that distcc uses a function that
triggers the problem -- sendfile() -- so, if netcat doesn't use it, it's
clear why it doesn't panic the kernel.

> Have you tried a local distcc compile, but specifying the host name as
> its IP address or its real name? Distcc treats 'localhost' differently,
> but if it sees an IP address it will use the network route. As specified

Good test.

Yeah, kernel panic in a few seconds. Using localhost instead, the compile
runs perfectly for hours.
So it's definitely an issue related to distcc AND networking (and
probably interaction between network driver and kernel).

Thank you again for your advice!

2004-04-05 10:47:44

Subject: Re: kernel BUG at page_alloc.c:98 -- compiling with distcc

Andrew Morton wrote:

>>kernel BUG at page_alloc.c:98!
> uh-oh.

That's the same thing I said when I saw all the LEDs blinking
on *all* the keyboards ... :)

> distcc uses sendfile(). The 8139too hardware and driver are
> zerocopy-capable so the kernel uses zerocopy direct-from-user-pages for
> sendfile().

Ok. Other servers with the e100 driver don't show the problem. Does this
mean that they're not "zerocopy-capable"?

> This was all discussed fairly extensively a couple of years back and I
> thought it ended up being fixed.

Are there any workarounds for this, until the problem is corrected?

Thank you very much.

2004-04-05 10:56:38

Subject: Re: kernel BUG at page_alloc.c:98 -- compiling with distcc

Marco Fais <[email protected]> wrote:
>
> Andrew Morton wrote:
>
> >>kernel BUG at page_alloc.c:98!
> > uh-oh.
>
> That's the same thing I said when I saw all the LEDs blinking
> on *all* the keyboards ... :)
>
> > distcc uses sendfile(). The 8139too hardware and driver are
> > zerocopy-capable so the kernel uses zerocopy direct-from-user-pages for
> > sendfile().
>
> Ok. Other servers with the e100 driver don't show the problem. Does this
> mean that they're not "zerocopy-capable"?

They are. It could be a timing thing.

> > This was all discussed fairly extensively a couple of years back and I
> > thought it ended up being fixed.
>
> Are there any workarounds for this, until the problem is corrected?

This will probably make it go away.

--- linux-2.4.26-rc1/drivers/net/8139too.c 2004-03-27 22:06:18.000000000 -0800
+++ 24/drivers/net/8139too.c 2004-04-05 03:54:50.478692968 -0700
@@ -983,7 +983,7 @@ static int __devinit rtl8139_init_one (s
 	 * through the use of skb_copy_and_csum_dev we enable these
 	 * features
 	 */
-	dev->features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_HIGHDMA;
+	dev->features |= NETIF_F_SG | NETIF_F_HIGHDMA;

 	dev->irq = pdev->irq;

2004-04-05 11:47:13

Subject: Re: kernel BUG at page_alloc.c:98 -- compiling with distcc

On Monday April 5th 2004 Marco Fais wrote:

> I was not saying *this is the problem*, just noticing that all the
> systems that show this problem have this network card, while the other
> systems that are working perfectly are using other network hardware
> (e100 driver) :)

Yes, my conclusion was too hasty, it *is* driver related! ;-)

With hindsight we also should have tried, of course, a 'strace distccd
--no-detach' in a crashing and a non-crashing situation. This would
probably have shown that 'sendfile()' was the first missing system call
(and therefore likely the culprit) in the crashing situation. Oh, well...

> If you read Andrew's message, it seems that distcc uses a function that
> triggers the problem -- sendfile() -- so, if netcat doesn't use it, it's
> clear why it doesn't panic the kernel.

Yes, sendfile() in combination with the 8139too driver seems to be
causing the trouble. Until that is fixed, it doesn't seem easy to work
around. At the moment it looks like there is no easily configurable
option to *not* use zero_copy functionality, either in the kernel or in
distcc.

There is an '8139cp' driver too, it's supposed to be working better
as well, perhaps that one might not free the pages that are to be
zero_copied across the network before they are sent?! That is the real
problem if I understand Andrew's mail correctly.

You might send a 'linux 8139too sendfile() panic' kind of bugreport
to the '[email protected]' mailing list. That is the list where the
networking gurus are supposed to be hanging out. Although IMVHO this bug
is more on the kernel than on the network side. Also filing an entry to
bugzilla.kernel.org might speed up someone fixing the real problem.

Easiest workaround might be to just use a customised distcc for the
machines involved: just download the source from 'distcc.samba.org', do
a regular './configure', and then in the generated 'src/config.h' hand
edit '#undef HAVE_SENDFILE' and '#undef HAVE_SYS_SENDFILE_H'. That
should stop distcc from using sendfile().
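
To illustrate why undefining such a macro is enough, here is a sketch of the usual autoconf-style pattern. It is an assumption about how a program like distcc could be structured, not a copy of its source; pump_file() is a hypothetical name. With HAVE_SENDFILE undefined, the same function compiles down to a plain read()/write() loop, so no file pages are handed to the network stack.

```c
#include <sys/types.h>
#include <errno.h>
#include <unistd.h>
#ifdef HAVE_SENDFILE
#  include <sys/sendfile.h>
#endif

/*
 * Hypothetical helper (not distcc's real code): copy `len` bytes from
 * the current position of in_fd to out_fd. Returns 0 or -1 (errno set).
 */
static int pump_file(int out_fd, int in_fd, size_t len)
{
#ifdef HAVE_SENDFILE
    /* Zero-copy path: the kernel feeds file pages to out_fd directly. */
    while (len > 0) {
        ssize_t n = sendfile(out_fd, in_fd, NULL, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0) {
            errno = EIO;
            return -1;
        }
        len -= (size_t)n;
    }
#else
    /* Fallback path: bounce the data through a userspace buffer. */
    char buf[4096];

    while (len > 0) {
        size_t want = len < sizeof buf ? len : sizeof buf;
        ssize_t got = read(in_fd, buf, want);

        if (got < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (got == 0) {
            errno = EIO;
            return -1;
        }
        for (ssize_t off = 0; off < got; ) {   /* write() may be partial */
            ssize_t put = write(out_fd, buf + off, (size_t)(got - off));
            if (put < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += put;
        }
        len -= (size_t)got;
    }
#endif
    return 0;
}
```

The fallback costs an extra copy per block, which is the price of the workaround.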
--
Marco Roeland

2004-04-05 13:58:29

Subject: Re: kernel BUG at page_alloc.c:98 -- compiling with distcc

Hola Andrew!

Andrew Morton wrote:

>>Are there any workarounds for this, until the problem is corrected?
> This will probably make it go away.
>
> --- linux-2.4.26-rc1/drivers/net/8139too.c 2004-03-27 22:06:18.000000000 -0800
> +++ 24/drivers/net/8139too.c 2004-04-05 03:54:50.478692968 -0700
> @@ -983,7 +983,7 @@ static int __devinit rtl8139_init_one (s
>  	 * through the use of skb_copy_and_csum_dev we enable these
>  	 * features
>  	 */
> -	dev->features |= NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_HIGHDMA;
> +	dev->features |= NETIF_F_SG | NETIF_F_HIGHDMA;
>
>  	dev->irq = pdev->irq;

Unfortunately, this doesn't solve the problem. It seems that the panic is
triggered a little later (1-2 minutes instead of a few seconds), but
I still get a kernel panic every time, even with this patch.

The oops tracing looks very similar to the one I've posted on the
linux-kernel list.

Thank you Andrew, bye!

2004-04-05 14:08:35

Subject: Re: kernel BUG at page_alloc.c:98 -- compiling with distcc

Marco Roeland wrote:

> There is an '8139cp' driver too, it's supposed to be working better
> as well, perhaps that one might not free the pages that are to be
> zero_copied across the network before they are sent?! That is the real
> problem if I understand Andrew's mail correctly.

Just tried that; unfortunately this network card isn't supported by the
8139cp driver.

> You might send a 'linux 8139too sendfile() panic' kind of bugreport
> to the '[email protected]' mailing list. That is the list where the
> networking gurus are supposed to be hanging out. Although IMVHO this bug

Andrew's messages are in CC: to the [email protected] list, so I think
they're already aware of the problem.

> is more on the kernel than on the network side. Also filing an entry to
> bugzilla.kernel.org might speed up someone fixing the real problem.

Ok, let's see if we get a patch from this discussion; otherwise I'll file
a new bugzilla entry.

> Easiest workaround might be to just use a customised distcc for the
> machines involved: just download the source from 'distcc.samba.org', do
> a regular './configure', and then in the generated 'src/config.h' hand
> edit '#undef HAVE_SENDFILE' and '#undef HAVE_SYS_SENDFILE_H'. That
> should stop distcc from using sendfile().

Great! I'm going to test that right now; surely better than deploying
customized kernels on all servers until an "official" patch comes out.

Thank you very much, Marco.

2004-04-05 14:36:46

Subject: Re: kernel BUG at page_alloc.c:98 -- compiling with distcc

On Monday April 5th 2004 Marco Fais wrote:

> Ok, let's see if we get a patch from this discussion; otherwise I'll file
> a new bugzilla entry.

Perhaps the fact that you have *two* cards in each machine that crashes
with the 8139too driver could be important? I have two Athlon XP 2000+
machines with Realtek Semiconductor Co., Ltd. RTL-8139/8139C/8139C+ cards
that run distcc quite a lot, and have never had a crash. But network topology
and timings might just trigger the panic in your situation and not in others...

> [building distcc without sendfile()]
> Great! I'm going to test that right now; surely better than deploying
> customized kernels on all servers until an "official" patch comes out.

Yeah, although that viewpoint might not be very popular on this mailing
list. ;-) By the way the patch looks quite alright and applies (with
an offset) to 2.6.5 as well. If you build 8139too modular, you might
even make two modules, a modified one with the reduced advertised
capabilities (so that the kernel assumes the card isn't zero-copy
capable) under another name perhaps like 8139too-nosendfile, and the
standard one. You can then at least distribute one kernel package, and
modprobe the bugfix module only on the affected machines.

Anyway, installing a distcc without sendfile() usage first can make
you (distcc)build patched kernels much faster in the future. ;-)
--
Marco Roeland

2004-04-05 17:03:29

by Max Valdez

Subject: Re: kernel BUG at page_alloc.c:98 -- compiling with distcc

I sent an email a couple of weeks ago about the same issue.

But it wasn't as well documented and organized.

I can say that the card and hardware are innocent; maybe it's the driver. The
"remote" machines that hang are using the latest Fedora stable kernel.

I would need really good pointers on the procedure to debug the problem; I'm
no expert in anything kernel-related.

I think it's a problem in the network handling because it happens on different
kernels and on different hardware. It has been happening for a couple of months
(since we got a new, faster network "architecture"), and the problem seems to be
triggered by fast transport of files over NTF, and by distcc. I remember having a
crash using scp too, for some ISO files.

If needed I can help track this problem down, but I need some hints on the
procedure.

Max

--
Linux garaged 2.6.5-rc2-mm3 #1 Fri Mar 26 11:07:16 CST 2004 i686 Intel(R)
Pentium(R) 4 CPU 2.80GHz GenuineIntel GNU/Linux
-----BEGIN GEEK CODE BLOCK-----
Version: 3.12
GS/S d- s: a-29 C++(+++) ULAHI+++ P+ L++>+++ E--- W++ N* o-- K- w++++ O- M--
V-- PS+ PE Y-- PGP++ t- 5- X+ R tv++ b+ DI+++ D- G++ e++ h+ r+ z**
------END GEEK CODE BLOCK------
gpg-key: http://garaged.homeip.net/gpg-key.txt

2004-04-23 22:33:30

by Carson Gaspar

Subject: Re: kernel BUG at page_alloc.c:98 -- compiling with distcc

FYI, we see the exact same panic with the tg3 driver using 2.4.25 and
distcc with sendfile(). The bcm5700 driver also panics, but I haven't
captured a panic message to be certain it's the same bug.

kernel BUG at page_alloc.c:98!
invalid operand: 0000
CPU: 1
EIP: 0010:[<c0139492>] Tainted: PF
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010202
eax: 00000001 ebx: c294dcb0 ecx: 00000001 edx: 00000020
esi: edb6e2e0 edi: 00000000 ebp: 00000004 esp: c55af9b4
ds: 0018 es: 0018 ss: 0018
Process cc1plus (pid: 21186, stackpage=c55af000)
Stack: c022e9ee f6fb1000 c022aa9c 00000287 00000206 00000286 db5a9600
00000001
edb6e2e0 edb6e2e0 00000004 c022aa4e edb6e2e0 f3716100 c022aa9c
edb6e2e0
f371623c f3716100 c022ac25 edb6e2e0 00000000 c025423a edb6e2e0
c55ae000
Call Trace: [<c022e9ee>] [<c022aa9c>] [<c022aa4e>] [<c022aa9c>]
[<c022ac25>]
[<c025423a>] [<c0247d28>] [<c024be53>] [<c025675b>] [<c02547c8>]
[<c0256bdf>]
[<c0138175>] [<c022aa9c>] [<c0254307>] [<c0258a67>] [<c022aa9c>]
[<c0254307>]
[<c025ef5b>] [<c025f4ad>] [<c022ac25>] [<c0256bec>] [<c01550dc>]
[<c014ba00>]
[<c02449a3>] [<c02449a3>] [<c0244da6>] [<c025ef5b>] [<c0139c05>]
[<c025f4ad>]
[<c022a8af>] [<c022f189>] [<c022a8af>] [<f8990d48>] [<c02449a3>]
[<f8990ef9>]
[<c022f3a3>] [<c0122c5b>] [<c010a74e>] [<c0131a04>] [<c012e232>]
[<c0131487>]
[<c0119e06>] [<c0131b08>] [<c0131990>] [<c01410d6>] [<c012e72a>]
[<c0108b5f>]
Code: 0f 0b 62 00 bd 35 2a c0 89 d8 e8 5f ed ff ff 8b 6b 28 85 ed

>>EIP; c0139492 <__free_pages_ok+32/2b0> <=====
Trace; c022e9ee <dev_queue_xmit+14e/320>
Trace; c022aa9c <kfree_skbmem+c/70>
Trace; c022aa4e <skb_release_data+4e/90>
Trace; c022aa9c <kfree_skbmem+c/70>
Trace; c022ac25 <__kfree_skb+125/130>
Trace; c025423a <tcp_clean_rtx_queue+15a/310>
Trace; c0247d28 <ip_queue_xmit+3d8/550>
Trace; c024be53 <tcp_write_space+53/80>
Trace; c025675b <tcp_new_space+7b/80>
Trace; c02547c8 <tcp_ack+138/360>
Trace; c0256bdf <tcp_rcv_established+ef/8b0>
Trace; c0138175 <lru_cache_add+75/80>
Trace; c022aa9c <kfree_skbmem+c/70>
Trace; c0254307 <tcp_clean_rtx_queue+227/310>
Trace; c0258a67 <tcp_transmit_skb+567/620>
Trace; c022aa9c <kfree_skbmem+c/70>
Trace; c0254307 <tcp_clean_rtx_queue+227/310>
Trace; c025ef5b <tcp_v4_do_rcv+3b/120>
Trace; c025f4ad <tcp_v4_rcv+46d/6f0>
Trace; c022ac25 <__kfree_skb+125/130>
Trace; c0256bec <tcp_rcv_established+fc/8b0>
Trace; c01550dc <dput+1c/160>
Trace; c014ba00 <cached_lookup+10/50>
Trace; c02449a3 <ip_local_deliver+f3/190>
Trace; c02449a3 <ip_local_deliver+f3/190>
Trace; c0244da6 <ip_rcv+366/400>
Trace; c025ef5b <tcp_v4_do_rcv+3b/120>
Trace; c0139c05 <__alloc_pages+75/2f0>
Trace; c025f4ad <tcp_v4_rcv+46d/6f0>
Trace; c022a8af <alloc_skb+ef/1c0>
Trace; c022f189 <netif_receive_skb+189/1c0>
Trace; c022a8af <alloc_skb+ef/1c0>
Trace; f8990d48 <[usbcore]__kstrtab_usb_hcd_giveback_urb+52f8/6a50>
Trace; c02449a3 <ip_local_deliver+f3/190>
Trace; f8990ef9 <[usbcore]__kstrtab_usb_hcd_giveback_urb+54a9/6a50>
Trace; c022f3a3 <net_rx_action+b3/170>
Trace; c0119e06 <do_page_fault+1a6/4eb>
Trace; c0131b08 <generic_file_read+88/170>
Trace; c0131990 <file_read_actor+0/f0>
Trace; c01410d6 <sys_read+96/110>
Trace; c012e72a <sys_brk+ba/f0>
Trace; c0108b5f <system_call+33/38>
Code; c0139492 <__free_pages_ok+32/2b0>
00000000 <_EIP>:
Code; c0139492 <__free_pages_ok+32/2b0> <=====
0: 0f 0b ud2a <=====
Code; c0139494 <__free_pages_ok+34/2b0>
2: 62 00 bound %eax,(%eax)
Code; c0139496 <__free_pages_ok+36/2b0>
4: bd 35 2a c0 89 mov $0x89c02a35,%ebp
Code; c013949b <__free_pages_ok+3b/2b0>
9: d8 e8 fsubr %st(0),%st
Code; c013949d <__free_pages_ok+3d/2b0>
b: 5f pop %edi
Code; c013949e <__free_pages_ok+3e/2b0>
c: ed in (%dx),%eax
Code; c013949f <__free_pages_ok+3f/2b0>
d: ff (bad)
Code; c01394a0 <__free_pages_ok+40/2b0>
e: ff 8b 6b 28 85 ed decl 0xed85286b(%ebx)

<0>Kernel panic: Aiee, killing interrupt handler!

2004-04-28 02:03:44

by Jeff Moyer

Subject: Re: kernel BUG at page_alloc.c:98 -- compiling with distcc

>FYI, we see the exact same panic with the tg3 driver using 2.4.25 and
>distcc with sendfile(). The bcm5700 driver also panics, but I haven't
>captured a panic message to be certain it's the same bug.

>kernel BUG at page_alloc.c:98!

Andrea fixed this in his tree by deferring the page free to process context
instead of BUG()ing on PageLRU(page).

-Jeff

2004-04-29 21:14:05

by Marcelo Tosatti

Subject: Re: kernel BUG at page_alloc.c:98 -- compiling with distcc

On Tue, Apr 27, 2004 at 10:02:11PM -0400, Jeff Moyer wrote:
>
> >FYI, we see the exact same panic with the tg3 driver using 2.4.25 and
> >distcc with sendfile(). The bcm5700 driver also panics, but I haven't
> >captured a panic message to be certain it's the same bug.
>
> >kernel BUG at page_alloc.c:98!
>
> Andrea fixed this in his tree by deferring the page free to process context
> instead of BUG()ing on PageLRU(page).

Yeap, his fix looks OK.

Can the people seeing the oops please try this, from Andrea (on top of 2.4.26):

--- a/mm/page_alloc.c.orig 2004-04-29 17:38:14.184021976 -0300
+++ b/mm/page_alloc.c 2004-04-29 17:47:27.906843312 -0300
@@ -46,6 +46,34 @@

 int vm_gfp_debug = 0;

+static void FASTCALL(__free_pages_ok (struct page *page, unsigned int order));
+
+static spinlock_t free_pages_ok_no_irq_lock = SPIN_LOCK_UNLOCKED;
+struct page * free_pages_ok_no_irq_head;
+
+static void do_free_pages_ok_no_irq(void * arg)
+{
+	struct page * page, * __page;
+
+	spin_lock_irq(&free_pages_ok_no_irq_lock);
+
+	page = free_pages_ok_no_irq_head;
+	free_pages_ok_no_irq_head = NULL;
+
+	spin_unlock_irq(&free_pages_ok_no_irq_lock);
+
+	while (page) {
+		__page = page;
+		page = page->next_hash;
+		__free_pages_ok(__page, __page->index);
+	}
+}
+
+static struct tq_struct free_pages_ok_no_irq_task = {
+	.routine = do_free_pages_ok_no_irq,
+};
+
+
 /*
  * Temporary debugging check.
  */
@@ -81,7 +109,6 @@
  * -- wli
  */

-static void FASTCALL(__free_pages_ok (struct page *page, unsigned int order));
 static void __free_pages_ok (struct page *page, unsigned int order)
 {
 	unsigned long index, page_idx, mask, flags;
@@ -94,8 +121,20 @@
 	 * a reference to a page in order to pin it for io. -ben
 	 */
 	if (PageLRU(page)) {
-		if (unlikely(in_interrupt()))
-			BUG();
+		if (unlikely(in_interrupt())) {
+			unsigned long flags;
+
+			spin_lock_irqsave(&free_pages_ok_no_irq_lock, flags);
+			page->next_hash = free_pages_ok_no_irq_head;
+			free_pages_ok_no_irq_head = page;
+			page->index = order;
+
+			spin_unlock_irqrestore(&free_pages_ok_no_irq_lock, flags);
+
+			schedule_task(&free_pages_ok_no_irq_task);
+			return;
+		}
+
 		lru_cache_del(page);
 	}
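
The idea behind the patch can be tried out in userspace. What follows is only a rough sketch under stated assumptions: struct fake_page, free_page_maybe_deferred() and drain_deferred() are hypothetical names, a pthread mutex stands in for the IRQ-safe spinlock, and the caller passes an explicit flag instead of the kernel's in_interrupt(). A page that cannot be freed in the current context is pushed onto a singly linked list, and a later call from "process context" detaches the whole list and frees each entry.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_page {
    struct fake_page *next;   /* plays the role of page->next_hash */
    bool freed;               /* set once the "real" free has run */
};

static pthread_mutex_t deferred_lock = PTHREAD_MUTEX_INITIALIZER;
static struct fake_page *deferred_head;

/* Stand-in for the work that is unsafe in interrupt context. */
static void really_free(struct fake_page *p)
{
    p->freed = true;
}

/* Free now if allowed, otherwise queue the page for later. */
static void free_page_maybe_deferred(struct fake_page *p, bool in_irq)
{
    if (in_irq) {
        pthread_mutex_lock(&deferred_lock);
        p->next = deferred_head;
        deferred_head = p;
        pthread_mutex_unlock(&deferred_lock);
        return;     /* the kernel patch calls schedule_task() here */
    }
    really_free(p);
}

/* Runs in "process context": detach the list, then free each entry. */
static size_t drain_deferred(void)
{
    struct fake_page *p;
    size_t n = 0;

    pthread_mutex_lock(&deferred_lock);
    p = deferred_head;
    deferred_head = NULL;
    pthread_mutex_unlock(&deferred_lock);

    while (p) {
        struct fake_page *next = p->next;  /* read before freeing */
        really_free(p);
        p = next;
        n++;
    }
    return n;
}
```

As in the patch, the lock is held only while the list head is swapped, so the frees themselves run without it.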

2004-04-29 21:27:40

Subject: Re: kernel BUG at page_alloc.c:98 -- compiling with distcc

Marcelo Tosatti <[email protected]> wrote:
>
> > Andrea fixed this in his tree by deferring the page free to process context
> > instead of BUG()ing on PageLRU(page).
>
> Yeap, his fix looks OK.

It does.

It would be nice to change

if (in_interrupt())

to

if (in_interrupt() || ((count++ % 10000) == 0))

just to exercise that code path a bit more.
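
A small userspace sketch of that condition, for illustration only: take_deferred_path() and free_calls are hypothetical names, and the flag argument replaces the kernel's in_interrupt(). Because of short-circuit evaluation the counter only advances on calls that are not in interrupt context, so one in every 10000 of those calls is pushed down the deferred path as well.

```c
#include <stdbool.h>

static unsigned long free_calls;   /* stands in for `count` */

/* Returns true when the caller should take the deferred-free path. */
static bool take_deferred_path(bool in_irq)
{
    return in_irq || ((free_calls++ % 10000) == 0);
}
```

This kind of forced sampling makes a rarely hit branch run regularly under ordinary load, which is presumably the point of the suggestion.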

2004-04-29 22:51:46

by Andrea Arcangeli

Subject: Re: kernel BUG at page_alloc.c:98 -- compiling with distcc

On Thu, Apr 29, 2004 at 02:28:07PM -0700, Andrew Morton wrote:
> just to exercise that code path a bit more.

what's the point of exercising that code path more? are you worried that
there are bugs in it?

2004-04-29 23:24:17