2002-07-23 06:02:55

by Zwane Mwaikambo

[permalink] [raw]
Subject: odd memory corruption in 2.5.27?

Hi Al,

kernel is 2.5.27, the box was doing a 'make -j2 bzImage' at the
time on an NFS mounted filesystem, the server is 2.4.19-re5-ac3

(gdb) list *0xc014b09c
0xc014b09c is in filp_close (open.c:834).
829 */
830 int filp_close(struct file *filp, fl_owner_t id)
831 {
832 int retval;
833
834 if (!file_count(filp)) {
835 printk(KERN_ERR "VFS: Close: file count is 0\n");
836 return 0;
837 }
838 retval = 0;

Unable to handle kernel NULL pointer dereference at virtual address 0000008c
printing eip:
c014b09c
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c014b09c>] Not tainted
EFLAGS: 00010206
eax: cea36a44 ebx: cea36920 ecx: cb235b30 edx: 00000078
esi: 00000078 edi: 00000001 ebp: cea36920 esp: ca9bfddc
ds: 0018 es: 0018 ss: 0018
Process fixdep (pid: 1015, threadinfo=ca9be000 task=c8245980)
Stack: cea36920 00000001 00000001 0000000a c01202db 00000078 cea36920 ca9be000
c8245980 cea36920 00000000 c0120e40 00000000 c036bb00 ca9bff28 00000002
c0314a6f 00000001 c0340018 c0310018 ffffff00 c0108622 00000010 00000202
Call Trace: [<c01202db>] [<c0120e40>] [<c0108622>] [<c0115261>] [<c014cd1d>]
[<c01616ac>] [<c0157cda>] [<c0159020>] [<c012f75c>] [<c0114e90>] [<c01080a0>]
[<c0120018>] [<c014cd1d>] [<c0154970>] [<c0154ed1>] [<c014b066>] [<c01075db>]

Code: 8b 46 14 85 c0 75 12 68 c0 6e 31 c0 e8 13 1b fd ff 5a 31 c0

>>EIP; c014b09c <filp_close+c/130> <=====
Trace; c01202db <put_files_struct+4b/d0>
Trace; c0120e40 <do_exit+210/520>
Trace; c0108622 <die+f2/100>
Trace; c0115261 <do_page_fault+3d1/506>
Trace; c014cd1d <fget+4d/70>
Trace; c01616ac <dput+1c/240>
Trace; c0157cda <permission+1a/30>
Trace; c0159020 <link_path_walk+a30/a60>
Trace; c012f75c <zap_pmd_range+4c/60>
Trace; c0114e90 <do_page_fault+0/506>
Trace; c01080a0 <error_code+34/40>
Trace; c0120018 <will_become_orphaned_pgrp+48/d0>
Trace; c014cd1d <fget+4d/70>
Trace; c0154970 <vfs_fstat+10/40>
Trace; c0154ed1 <sys_fstat64+11/30>
Trace; c014b066 <sys_open+66/70>
Trace; c01075db <syscall_call+7/b>
Code; c014b09c <filp_close+c/130>
00000000 <_EIP>:
Code; c014b09c <filp_close+c/130> <=====
0: 8b 46 14 mov 0x14(%esi),%eax <=====
Code; c014b09f <filp_close+f/130>
3: 85 c0 test %eax,%eax
Code; c014b0a1 <filp_close+11/130>
5: 75 12 jne 19 <_EIP+0x19> c014b0b5 <filp_close+25/130>
Code; c014b0a3 <filp_close+13/130>
7: 68 c0 6e 31 c0 push $0xc0316ec0
Code; c014b0a8 <filp_close+18/130>
c: e8 13 1b fd ff call fffd1b24 <_EIP+0xfffd1b24> c011cbc0 <printk+0/210>
Code; c014b0ad <filp_close+1d/130>
11: 5a pop %edx
Code; c014b0ae <filp_close+1e/130>
12: 31 c0 xor %eax,%eax

--
function.linuxpower.ca


2002-07-23 06:03:31

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: odd memory corruption in 2.5.27?

Hi Al,

Hmm this is wild pointer #2, whats -really- freaky is that its the same
address 0x14(00000078), this was also over NFS.

Unable to handle kernel NULL pointer dereference at virtual address 0000008c
c014cd1d
*pde = 00000000
Oops: 0002
CPU: 0
EIP: 0010:[<c014cd1d>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010206
eax: ccefb4c0 ebx: 00000078 ecx: 00000003 edx: ccf30000
esi: ccefb39c edi: 00000003 ebp: bfffe998 esp: ccf31f5c
ds: 0018 es: 0018 ss: 0018
Stack: ccf31f7c fffffff7 c0154970 ccf31f7c bfffe9e0 c0154ed1 00000003 ccf31f7c
00000246 00000246 c4ca7000 00000001 00000000 00000000 00000400 00000078
00000003 c4ca7000 bfffe9c8 c014b066 c16b6a30 c4ca7000 ccf30000 bfffe9e0
Call Trace: [<c0154970>] [<c0154ed1>] [<c014b066>] [<c01075db>]
Code: f0 ff 43 14 f0 ff 46 04 ff 4a 10 8b 42 08 83 e0 08 74 05 e8

>>EIP; c014cd1d <fget+4d/70> <=====
Trace; c0154970 <vfs_fstat+10/40>
Trace; c0154ed1 <sys_fstat64+11/30>
Trace; c014b066 <sys_open+66/70>
Trace; c01075db <syscall_call+7/b>
Code; c014cd1d <fget+4d/70>
00000000 <_EIP>:
Code; c014cd1d <fget+4d/70> <=====
0: f0 ff 43 14 lock incl 0x14(%ebx) <=====
Code; c014cd21 <fget+51/70>
4: f0 ff 46 04 lock incl 0x4(%esi)
Code; c014cd25 <fget+55/70>
8: ff 4a 10 decl 0x10(%edx)
Code; c014cd28 <fget+58/70>
b: 8b 42 08 mov 0x8(%edx),%eax
Code; c014cd2b <fget+5b/70>
e: 83 e0 08 and $0x8,%eax
Code; c014cd2e <fget+5e/70>
11: 74 05 je 18 <_EIP+0x18> c014cd35 <fget+65/70>
Code; c014cd30 <fget+60/70>

--
function.linuxpower.ca

2002-07-23 06:05:12

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: odd memory corruption in 2.5.27?

Hi Trond, Arnaldo, Al
I tried reproducing using a local filesystem and couldn't
(machine survived 3 make -j10 kernel compiles). Here is another oops for
the collection. Al i'll remove you from further CCs now.

client: 2.5.27-serial, 3c905B
server: 2.4.19-pre5-ac3, 3c905B
connection: 100Mb/FD

I got this message before it oopsed;
RPC: garbage, exit EIO

Unable to handle kernel paging request at virtual address 5f707275
c013f822
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c013f822>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010202
eax: 5f707261 ebx: 00000001 ecx: 5f707261 edx: 00000000
esi: c826d2e4 edi: c826d2e4 ebp: c038c000 esp: c038deb0
ds: 0018 es: 0018 ss: 0018
Stack: c028d4fe c826d2e4 cea012cc c028d53c c826d2e4 cde2ca04 cea012cc c028d67d
c826d2e4 c02a2b00 c02a28d3 c826d2e4 c8e0ae2c c02a2b00 c038c000 c02a2c21
cea012cc c8e0ae2c c02a2c8b cea012cc 00000001 00000000 cc00fc0c 00000000
Call Trace: [<c028d4fe>] [<c028d53c>] [<c028d67d>] [<c02a2b00>] [<c02a28d3>]
[<c02a2b00>] [<c02a2c21>] [<c02a2c8b>] [<c0126614>] [<c0122950>] [<c01227e0>] [<c01224eb>] [<c0109d82>] [<c01053a0>] [<c0107f2a>] [<c01053a0>] [<c01053a0>] [<c01053ca>] [<c0105472>] [<c0105000>]
Code: 8b 41 14 a9 00 08 00 00 75 14 f0 ff 49 10 0f 94 c0 84 c0 74

>>EIP; c013f822 <__free_pages+2/30> <=====
Trace; c028d4fe <skb_release_data+1e/80>
Trace; c028d53c <skb_release_data+5c/80>
Trace; c028d67d <__kfree_skb+ad/e0>
Trace; c02a2b00 <ip_evictor+1d0/200>
Trace; c02a28d3 <ip_frag_destroy+43/a0>
Trace; c02a2b00 <ip_evictor+1d0/200>
Trace; c02a2c21 <ip_expire+f1/1a0>
Trace; c02a2c8b <ip_expire+15b/1a0>
Trace; c0126614 <timer_bh+324/400>
Trace; c0122950 <bh_action+70/140>
Trace; c01227e0 <tasklet_hi_action+70/c0>
Trace; c01224eb <do_softirq+7b/f0>
Trace; c0109d82 <do_IRQ+1d2/1e0>
Trace; c01053a0 <default_idle+0/40>
Trace; c0107f2a <common_interrupt+22/28>
Trace; c01053a0 <default_idle+0/40>
Trace; c01053a0 <default_idle+0/40>
Trace; c01053ca <default_idle+2a/40>
Trace; c0105472 <cpu_idle+52/70>
Trace; c0105000 <_stext+0/0>
Code; c013f822 <__free_pages+2/30>
00000000 <_EIP>:
Code; c013f822 <__free_pages+2/30> <=====
0: 8b 41 14 mov 0x14(%ecx),%eax <=====
Code; c013f825 <__free_pages+5/30>
3: a9 00 08 00 00 test $0x800,%eax
Code; c013f82a <__free_pages+a/30>
8: 75 14 jne 1e <_EIP+0x1e> c013f840 <__free_pages+20/30>
Code; c013f82c <__free_pages+c/30>
a: f0 ff 49 10 lock decl 0x10(%ecx)
Code; c013f830 <__free_pages+10/30>
e: 0f 94 c0 sete %al
Code; c013f833 <__free_pages+13/30>
11: 84 c0 test %al,%al
Code; c013f835 <__free_pages+15/30>
13: 74 00 je 15 <_EIP+0x15> c013f837 <__free_pages+17/30>

<0>Kernel panic: Aiee, killing interrupt handler!


--
function.linuxpower.ca

2002-07-23 07:54:16

by Trond Myklebust

[permalink] [raw]
Subject: Re: odd memory corruption in 2.5.27?

On Tuesday 23 July 2002 08:26, Zwane Mwaikambo wrote:
> Hi Trond, Arnaldo, Al
> I tried reproducing using a local filesystem and couldn't
> (machine survived 3 make -j10 kernel compiles). Here is another oops for
> the collection. Al i'll remove you from further CCs now.
>
> client: 2.5.27-serial, 3c905B
> server: 2.4.19-pre5-ac3, 3c905B
> connection: 100Mb/FD
>
> I got this message before it oopsed;
> RPC: garbage, exit EIO

Just means that some RPC message reply from the server was crap. We should
deal fine with that sort of thing...

AFAICS The Oops itself happened deep down in the socket layer in the part
which has to do with reassembling fragments into packets. The garbage
collector tried to release a fragment that had timed out and Oopsed.

Suggests either memory corruption or else that the networking driver is doing
something odd ('cos at that point in the socket layer *only* the driver + the
fragment handler should have touched the skb).

Cheers,
Trond

2002-07-23 08:33:28

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: odd memory corruption in 2.5.27?

On Tue, 23 Jul 2002, Trond Myklebust wrote:

> Just means that some RPC message reply from the server was crap. We should
> deal fine with that sort of thing...
>
> AFAICS The Oops itself happened deep down in the socket layer in the part
> which has to do with reassembling fragments into packets. The garbage
> collector tried to release a fragment that had timed out and Oopsed.
>
> Suggests either memory corruption or else that the networking driver is doing
> something odd ('cos at that point in the socket layer *only* the driver + the
> fragment handler should have touched the skb).

Thanks, that helps quite a bit, i'll see if i can pinpoint it and send it
to the relevant people.

Thanks,
Zwane

--
function.linuxpower.ca

2002-07-23 20:29:33

by George Anzinger

[permalink] [raw]
Subject: Re: odd memory corruption in 2.5.27?

Zwane Mwaikambo wrote:
>
> On Tue, 23 Jul 2002, Trond Myklebust wrote:
>
> > Just means that some RPC message reply from the server was crap. We should
> > deal fine with that sort of thing...
> >
> > AFAICS The Oops itself happened deep down in the socket layer in the part
> > which has to do with reassembling fragments into packets. The garbage
> > collector tried to release a fragment that had timed out and Oopsed.
> >
> > Suggests either memory corruption or else that the networking driver is doing
> > something odd ('cos at that point in the socket layer *only* the driver + the
> > fragment handler should have touched the skb).
>
> Thanks, that helps quite a bit, i'll see if i can pinpoint it and send it
> to the relevant people.
>
I just spent a month tracking down this issue. It comes
down to the slab allocater using per cpu data structures and
protecting them with a combination of interrupt disables and
spin_locks. Preemption is allowed (incorrectly) if
interrupts are off and preempt_count goes to zero on the
spin_unlock. I will wager that this is an SMP machine.
After the preemption interrupts will be on (schedule() does
that) AND you could be on a different cpu. Either of these
is a BAD thing.

The proposed fix is to catch the attempted preemption in
preempt_schedule() and just return if the interrupt system
is off. (Of course there is more that this to it, but I do
believe that the problem is known. You could blow this
assertion out of the water by asserting that the machine is
NOT smp.)
--
George Anzinger [email protected]
High-res-timers:
http://sourceforge.net/projects/high-res-timers/
Real time sched: http://sourceforge.net/projects/rtsched/
Preemption patch:
http://www.kernel.org/pub/linux/kernel/people/rml

2002-07-23 20:45:10

by William Lee Irwin III

[permalink] [raw]
Subject: Re: odd memory corruption in 2.5.27?

On Tue, Jul 23, 2002 at 01:32:11PM -0700, george anzinger wrote:
> I just spent a month tracking down this issue. It comes
> down to the slab allocater using per cpu data structures and
> protecting them with a combination of interrupt disables and
> spin_locks. Preemption is allowed (incorrectly) if
> interrupts are off and preempt_count goes to zero on the
> spin_unlock. I will wager that this is an SMP machine.
> After the preemption interrupts will be on (schedule() does
> that) AND you could be on a different cpu. Either of these
> is a BAD thing.
> The proposed fix is to catch the attempted preemption in
> preempt_schedule() and just return if the interrupt system
> is off. (Of course there is more that this to it, but I do
> believe that the problem is known. You could blow this
> assertion out of the water by asserting that the machine is
> NOT smp.)

I've been seeing the slab allocator race like mad already (i.e.
BUG at slab.c:1947 and slab.c:1961. I'll test this fix.


Thanks,
Bill

2002-07-23 23:26:11

by Ingo Molnar

[permalink] [raw]
Subject: [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?]


On Tue, 23 Jul 2002, george anzinger wrote:

> I just spent a month tracking down this issue. It comes
> down to the slab allocater using per cpu data structures and
> protecting them with a combination of interrupt disables and
> spin_locks. Preemption is allowed (incorrectly) if
> interrupts are off and preempt_count goes to zero on the
> spin_unlock. [...]

> The proposed fix is to catch the attempted preemption in
> preempt_schedule() and just return if the interrupt system
> is off. [...]

this is most definitely not the correct fix ...

i'm quite convinced that the fix is to avoid illegal preemption, not to
work it around.

i've written debugging code that caught and reported this slab.c bug
within minutes. The code detects the irqs-off condition in schedule(). You
can find it my latest irqlock patchset, at:

http://redhat.com/~mingo/remove-irqlock-patches/remove-irqlock-2.5.27-G3

it fixes this and other related bugs as well.

Changes in -G3:

- slab.c needs to spin_unlock_no_resched(), instead of spin_unlock(). (It
also has to check for preemption in the right spot.) This should fix
the memory corruption.

- irq_exit() needs to run softirqs if interrupts not active - in the
previous patch it ran them when preempt_count() was 0, which is
incorrect.

- spinlock macros are updated to enable preemption after enabling
interrupts. Besides avoiding false positive warnings, this also

- fork.c has to call scheduler_tick() with preemption disabled -
otherwise scheduler_tick()'s spin_unlock can preempt!

- irqs_disabled() macro introduced.

- [ all other local_irq_enable() or sti instances conditional on
CONFIG_DEBUG_IRQ_SCHEDULE are to fix false positive warnings. ]

Changes in -G0:

- fix buggy in_softirq(). Fortunately the bug made the test broader,
which didnt result in algorithmical breakage, just suboptimal
performance.

- move do_softirq() processing into irq_exit() => this also fixes the
softirq processing bugs present in apic.c IRQ handlers that did not
test for softirqs after irq_exit().

- simplify local_bh_enable().

Changes in -F9:

- replace all instances of:

local_save_flags(flags);
local_irq_disable();

with the shorter form of:

local_irq_save(flags);

about 30 files are affected by this change.

Changes in -F8:

- preempt/hardirq/softirq count separation, cleanups.

- skbuff.c fix.

- use irq_count() in scheduler_tick()

Changes in -F3:

- the entry.S cleanups/speedups by Oleg Nesterov.

- a rather critical synchronize_irq() bugfix: if a driver frees an
interrupt that is still being probed then synchronize_irq() locks up.
This bug has caused a spurious boot-lockup on one of my testsystems,
ifconfig would lock up trying to close eth0.

- remove duplicate definitions from asm-i386/system.h, this fixes
compiler warnings.

Ingo

2002-07-23 23:52:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?]



On Wed, 24 Jul 2002, Ingo Molnar wrote:
>
> - slab.c needs to spin_unlock_no_resched(), instead of spin_unlock(). (It
> also has to check for preemption in the right spot.) This should fix
> the memory corruption.

That cannot be right.

If we want to drop a spinlock but remain non-preemptible, we should
comment that a _lot_, and not just say "xxxx_no_resched()".

In fact, I personally think that every "spin_unlock_no_resched()" is an
outright BUG. Either the spin_unlock() makes us preemptible (in which case
it doesn't matter from a correctness point whether we schedule
immediately, or whether something else like a vmalloc fault might force us
to schedule soon afterwards), or the spin_unlock is doing something
magical, and we depend on the preemptability to not change.

In the latter case (which should be very very rare indeed), we should just
use

/* BIG comment about what we're doing. */
/* We're dropping the spinlock, but we remain non-preemptable */
__raw_spin_unlock(..);

and then later on, when preemptability is over, we do

local_irq_enable();
preempt_enable();

so that we _clearly_ mark out the region where we must not re-schedule.

It is simply not acceptable to just play games with disabling interrupts,
and magically "knowing" that we're not preemptable without making that
clear some way.

Please get rid of spin_unlock_no_schedule() and friends, ok?

Linus

2002-07-24 00:05:38

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?]


On Tue, 23 Jul 2002, Linus Torvalds wrote:

> /* BIG comment about what we're doing. */
> /* We're dropping the spinlock, but we remain non-preemptable */
> __raw_spin_unlock(..);
>
> and then later on, when preemptability is over, we do
>
> local_irq_enable();
> preempt_enable();
>
> so that we _clearly_ mark out the region where we must not re-schedule.

agreed, this is much nicer.

i removed the spin_unlock_no_resched() define, and modified the slab.c fix
(and the other spin_unlock_no_resched() places) to be more verbose about
what they do.

we still have preempt_enable_no_resched() - that is legitimately needed in
a number of cases - i've added comments to make its usage clear.

these cleanups are in the -G5 patch:

http://redhat.com/~mingo/remove-irqlock-patches/remove-irqlock-2.5.27-G5

i've test-booted BK-curr + G5, CONFIG_SMP, CONFIG_PREEMPT and
CONFIG_DEBUG_IRQ_SCHEDULE kernel on an SMP box, it works fine.

Ingo


Changes in -G3:

- slab.c needs to spin_unlock_no_resched(), instead of spin_unlock(). (It
also has to check for preemption in the right spot.) This should fix
the memory corruption.

- irq_exit() needs to run softirqs if interrupts not active - in the
previous patch it ran them when preempt_count() was 0, which is
incorrect.

- spinlock macros are updated to enable preemption after enabling
interrupts. Besides avoiding false positive warnings, this also

- fork.c has to call scheduler_tick() with preemption disabled -
otherwise scheduler_tick()'s spin_unlock can preempt!

- irqs_disabled() macro introduced.

- [ all other local_irq_enable() or sti instances conditional on
CONFIG_DEBUG_IRQ_SCHEDULE are to fix false positive warnings. ]

Changes in -G0:

- fix buggy in_softirq(). Fortunately the bug made the test broader,
which didnt result in algorithmical breakage, just suboptimal
performance.

- move do_softirq() processing into irq_exit() => this also fixes the
softirq processing bugs present in apic.c IRQ handlers that did not
test for softirqs after irq_exit().

- simplify local_bh_enable().

Changes in -F9:

- replace all instances of:

local_save_flags(flags);
local_irq_disable();

with the shorter form of:

local_irq_save(flags);

about 30 files are affected by this change.

Changes in -F8:

- preempt/hardirq/softirq count separation, cleanups.

- skbuff.c fix.

- use irq_count() in scheduler_tick()

Changes in -F3:

- the entry.S cleanups/speedups by Oleg Nesterov.

- a rather critical synchronize_irq() bugfix: if a driver frees an
interrupt that is still being probed then synchronize_irq() locks up.
This bug has caused a spurious boot-lockup on one of my testsystems,
ifconfig would lock up trying to close eth0.

- remove duplicate definitions from asm-i386/system.h, this fixes
compiler warnings.

2002-07-24 01:05:12

by Robert Love

[permalink] [raw]
Subject: Re: [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?]

On Tue, 2002-07-23 at 16:28, Ingo Molnar wrote:

> this is most definitely not the correct fix ...
>
> i'm quite convinced that the fix is to avoid illegal preemption, not to
> work it around.

I am not sure I am fully convinced one way or the other, but treating
every bit of code as we find it scares me. The fact is, if a
spin_unlock() can magically reenable interrupts that is a bug.

I don't like relying on chance and the possibility your debug tool found
the problem... but at the same time, Ingo's solution is a lot cleaner.

Linus, Ingo, comments?

Attached is the patch George mentioned, against 2.5.27.

Robert Love

diff -urN linux-2.5.27/include/asm-i386/system.h linux/include/asm-i386/system.h
--- linux-2.5.27/include/asm-i386/system.h Sat Jul 20 12:11:05 2002
+++ linux/include/asm-i386/system.h Tue Jul 23 18:03:47 2002
@@ -270,6 +270,13 @@
/* Compiling for a 386 proper. Is it worth implementing via cli/sti? */
#endif

+#define MASK_IF 0x200
+#define interrupts_enabled() ({ \
+ int flg; \
+ __save_flags(flg); \
+ flg & MASK_IF; \
+})
+
/*
* Force strict CPU ordering.
* And yes, this is required on UP too when we're talking
diff -urN linux-2.5.27/kernel/sched.c linux/kernel/sched.c
--- linux-2.5.27/kernel/sched.c Sat Jul 20 12:11:11 2002
+++ linux/kernel/sched.c Tue Jul 23 18:02:13 2002
@@ -899,7 +899,7 @@
{
struct thread_info *ti = current_thread_info();

- if (unlikely(ti->preempt_count))
+ if (unlikely(ti->preempt_count || !interrupts_enabled()))
return;

need_resched:

2002-07-24 02:11:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] irqlock patch -G3. [was Re: odd memory corruption in 2.5.27?]



Ingo,
you really shouldn't use a generic #define like IRQ_MASK for something as
obscure as the mask for preemption bits (that apparently gets used
exactly _once_ in the same header file that defines it).

That #define is already used in the kernel inside various files, and from
the things I looked at, other users had more reason to call their stuff
IRQ_MASK than the new code has.

Also, please don't do things like

#define NR_PREEMPT 256

when what you really are doing is doling out bits rather than "numbers".

So I think you'd be better off doing

#define PREEMPT_BITS 8
#define HARDIRQ_BITS 8
#define SOFTIRQ_BITS 8

#define PREEMPT_SHIFT 0
#define HARDIRQ_SHIFT (PREEMPT_SHIFT + PREEMPT_BITS)
#define SOFTIRQ_SHIFT (HARDIRQ_SHIFT + PREEMPT_BITS)

#define __MASK(x) ((1UL << (x))-1)

#define PREEMPT_MASK (__MASK(PREEMPT_BITS) << PREEMPT_SHIFT)
#define HARDIRQ_MASK (__MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT)
#define SOFTIRQ_MASK (__MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)

#define hardirq_count() (preempt_count() & HARDIRQ_MASK)
#define softirq_count() (preempt_count() & SOFTIRQ_MASK)
#define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK))

which creates bitmasks from bit-operations rather than doing non-bitwise
arithmetic to get them from magic values that have to be powers-of-2.

And avoids using too generic a name (ie IRQ_MASK is gone).

Ehh?

Linus

2002-07-24 06:31:46

by Zwane Mwaikambo

[permalink] [raw]
Subject: Re: odd memory corruption in 2.5.27?

On Tue, 23 Jul 2002, george anzinger wrote:

> protecting them with a combination of interrupt disables and
> spin_locks. Preemption is allowed (incorrectly) if
> interrupts are off and preempt_count goes to zero on the
> spin_unlock. I will wager that this is an SMP machine.
> After the preemption interrupts will be on (schedule() does
> that) AND you could be on a different cpu. Either of these
> is a BAD thing.
>
> The proposed fix is to catch the attempted preemption in
> preempt_schedule() and just return if the interrupt system
> is off. (Of course there is more that this to it, but I do
> believe that the problem is known. You could blow this
> assertion out of the water by asserting that the machine is
> NOT smp.)

I haven't looked at it further than gathering oopses and idly browsing
surrounding code. About your assertion, you're almost right, its UP box
running an SMP kernel w/ CONFIG_PREEMT.

--
function.linuxpower.ca

2002-07-24 08:58:03

by Ingo Molnar

[permalink] [raw]
Subject: [patch] irqlock patch 2.5.27-H3.


On Tue, 23 Jul 2002, Linus Torvalds wrote:

> you really shouldn't use a generic #define like IRQ_MASK for something as
> obscure as the mask for preemption bits (that apparently gets used
> exactly _once_ in the same header file that defines it).
>
> That #define is already used in the kernel inside various files, and from
> the things I looked at, other users had more reason to call their stuff
> IRQ_MASK than the new code has.

(doh - and i've specifically checked the new names for namespace collision
because they looked too generic - apparently not well enough.)

> So I think you'd be better off doing
>
> #define PREEMPT_BITS 8
> #define HARDIRQ_BITS 8
> #define SOFTIRQ_BITS 8
>
> #define PREEMPT_SHIFT 0
> #define HARDIRQ_SHIFT (PREEMPT_SHIFT + PREEMPT_BITS)
> #define SOFTIRQ_SHIFT (HARDIRQ_SHIFT + PREEMPT_BITS)
>
> #define __MASK(x) ((1UL << (x))-1)
>
> #define PREEMPT_MASK (__MASK(PREEMPT_BITS) << PREEMPT_SHIFT)
> #define HARDIRQ_MASK (__MASK(HARDIRQ_BITS) << HARDIRQ_SHIFT)
> #define SOFTIRQ_MASK (__MASK(SOFTIRQ_BITS) << SOFTIRQ_SHIFT)
>
> #define hardirq_count() (preempt_count() & HARDIRQ_MASK)
> #define softirq_count() (preempt_count() & SOFTIRQ_MASK)
> #define irq_count() (preempt_count() & (HARDIRQ_MASK | SOFTIRQ_MASK))
>
> which creates bitmasks from bit-operations rather than doing non-bitwise
> arithmetic to get them from magic values that have to be powers-of-2.

yeah, agreed.

i did one more modification in this area, just for educational purposes
it's a better ordering to have:

- preempt count
- softirq count
- hardirq count

the higher positioned the bitmask, the 'stronger' and more atomic the
execution concept is. Latest patch (against BK-curr) is at:

http://redhat.com/~mingo/remove-irqlock-patches/remove-irqlock-2.5.27-H3

Changes in -H3:

- init thread needs to have preempt_count of 1 until sched_init().
(William Lee Irwin III)

- clean up the irq-mask macros. (Linus)

- add barrier() to irq_enter() and irq_exit(). (based on Oleg Nesterov's
comment.)

- move the irqs-off check into preempt_schedule() and remove
CONFIG_DEBUG_IRQ_SCHEDULE.

Changes in -G5:

- remove spin_unlock_no_resched() and comment the affected places more
agressively.

Changes in -G3:

- slab.c needs to spin_unlock_no_resched(), instead of spin_unlock(). (It
also has to check for preemption in the right spot.) This should fix
the memory corruption.

- irq_exit() needs to run softirqs if interrupts not active - in the
previous patch it ran them when preempt_count() was 0, which is
incorrect.

- spinlock macros are updated to enable preemption after enabling
interrupts. Besides avoiding false positive warnings, this also

- fork.c has to call scheduler_tick() with preemption disabled -
otherwise scheduler_tick()'s spin_unlock can preempt!

- irqs_disabled() macro introduced.

- [ all other local_irq_enable() or sti instances conditional on
CONFIG_DEBUG_IRQ_SCHEDULE are to fix false positive warnings. ]

Changes in -G0:

- fix buggy in_softirq(). Fortunately the bug made the test broader,
which didnt result in algorithmical breakage, just suboptimal
performance.

- move do_softirq() processing into irq_exit() => this also fixes the
softirq processing bugs present in apic.c IRQ handlers that did not
test for softirqs after irq_exit().

- simplify local_bh_enable().

Changes in -F9:

- replace all instances of:

local_save_flags(flags);
local_irq_disable();

with the shorter form of:

local_irq_save(flags);

about 30 files are affected by this change.

Changes in -F8:

- preempt/hardirq/softirq count separation, cleanups.

- skbuff.c fix.

- use irq_count() in scheduler_tick()

Changes in -F3:

- the entry.S cleanups/speedups by Oleg Nesterov.

- a rather critical synchronize_irq() bugfix: if a driver frees an
interrupt that is still being probed then synchronize_irq() locks up.
This bug has caused a spurious boot-lockup on one of my testsystems,
ifconfig would lock up trying to close eth0.

- remove duplicate definitions from asm-i386/system.h, this fixes
compiler warnings.

Ingo