2002-09-11 01:55:40

by Thomas Molina

Subject: 2.5 Problem Status Report


The most current version of this status report can be found at:
http://members.cox.net/tmolina/kernprobs/status.html

Notes:
* Several people have requested the discussion be linked to LKML
  archives. With this version I've switched from locally edited
  discussion threads to archived links.
* Great progress has been made in forward porting IDE driver code from
  2.4 to 2.5. Several people have tried 2.5.33 without disaster. Updates
  continue to be added to the -ac kernels and the 2.5 bitkeeper kernels.
* Floppy support appears to have been fixed in 2.5.33-bitkeeper. It has
  been tested, and the corruption previously seen has not been
  duplicated.
* Support for __FUNCTION__ pasting is being phased out of gcc. This has
  broken compiling in numerous places. Defines of the form:
      #define func_enter() sx_dprintk (SX_DEBUG_FLOW, "sx: enter " __FUNCTION__ "\n")
  need to be changed to the form:
      #define func_enter() sx_dprintk (SX_DEBUG_FLOW, "sx: enter %s\n", __FUNCTION__)
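
The conversion in the last note can be tried outside the kernel. The following is only a sketch: `dprintk` and `dbg_buf` are hypothetical stand-ins (not the sx driver's `sx_dprintk`), the macro uses C99 `__VA_ARGS__` rather than the GNU named-variadic style used in kernels of that era, and `__FUNCTION__` itself is a gcc extension.

```c
#include <stdio.h>

/* Hypothetical stand-in for a driver debug helper such as sx_dprintk().
 * It formats into a buffer so the resulting line can be inspected. */
static char dbg_buf[128];
#define dprintk(fmt, ...) snprintf(dbg_buf, sizeof(dbg_buf), fmt, __VA_ARGS__)

/* Old form, which relies on pasting __FUNCTION__ as if it were a
 * string literal and is rejected by newer gcc:
 *   #define func_enter() dprintk("sx: enter " __FUNCTION__ "\n")
 * New form: pass the function name as a %s argument instead. */
#define func_enter() dprintk("sx: enter %s\n", __FUNCTION__)

/* Example caller (name chosen for illustration only); returns the
 * formatted debug line. */
static const char *sx_open(void)
{
    func_enter();
    return dbg_buf;
}
```

Calling `sx_open()` leaves the line "sx: enter sx_open" followed by a newline in the buffer, because `__FUNCTION__` expands to the name of the enclosing function at the point where the macro is used.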

2.5 Kernel Problem Reports as of 10 Sep

Problem Title                  Status                  Discussion
JFS oops                       open                    06 Sep 2002
qlogicisp oops                 no further discussion   2.5.33
2.5.32 reboot oops             no further discussion   2.5.33
ext2 umount oops               no further discussion   2.5.33
DEBUG_SLAB oops                no further discussion   2.5.33
2.5.32-mm1 problems            no further discussion   2.5.33
soft suspend problem           no further discussion   2.5.33
PCI and/or starfire.c broken   no further discussion   2.5.33
__write_lock_failed() oops     no further discussion   2.5.33
invalidate_inode_pages         open                    10 Sep 2002
Problem running on Athlons     open                    06 Sep 2002
dequeue_signal panic                                   08 Sep 2002
                               closed                  09 Sep 2002
mouse/keyboard flakiness       open                    09 Sep 2002
process hang in do_IRQ         open                    09 Sep 2002
do_syslog lockup               open                    09 Sep 2002
BUG at kernel/sched.c          open                    10 Sep 2002



2002-09-11 02:26:21

by William Lee Irwin III

Subject: Re: 2.5 Problem Status Report

On Tue, Sep 10, 2002 at 09:00:30PM -0500, Thomas Molina wrote:
> 2.5 Kernel Problem Reports as of 10 Sep
> Problem Title Status Discussion
> qlogicisp oops no further discussion 2.5.33
> PCI and/or starfire.c broken no further discussion 2.5.33
> __write_lock_failed() oops no further discussion 2.5.33

Since 3 of these are things I reported...

qlogicisp.c oops is some longstanding error recovery issue and/or ISR
bug. [email protected] has been taking my bugreports. There is a lack of
general interest in and information about this device.

PCI/starfire.c breakage has to do with PCI-PCI bridges appearing on
machines with multiple PCI buses. The bus numbering scheme used by
the bridges creates clashes with various other bus' numbers or something
like that. It's likely more visible on NUMA-Q since the bus numbers are
used for port I/O remapping. I'm unaware of the amount of reliance on
bus numbers in other circumstances. [email protected] is handling it.

__write_lock_failed() oops is tasklist_lock starvation. The starving
writer had spun with interrupts off for so long the NMI oopser went
off. [email protected] is looking into it and I have backup plans.


Cheers,
Bill

2002-09-11 03:55:27

by Robert Love

Subject: Re: 2.5 Problem Status Report

On Tue, 2002-09-10 at 22:00, Thomas Molina wrote:

> do_syslog lockup open 09 Sep 2002

This is fixed in 2.5-BK.

> BUG at kernel/sched.c open 10 Sep 2002

What exactly is this?

Robert Love

2002-09-11 07:01:35

by Ingo Molnar

Subject: Re: 2.5 Problem Status Report


On 11 Sep 2002, Robert Love wrote:

> > do_syslog lockup open 09 Sep 2002
>
> This is fixed in 2.5-BK.

and it has nothing to do with do_syslog - that was a symbol mismatch.

the lockup was in sys_exit().

Ingo

2002-09-11 07:02:31

by Oleg Drokin

Subject: Re: 2.5 Problem Status Report

Hello!

On Wed, Sep 11, 2002 at 12:00:20AM -0400, Robert Love wrote:

> > BUG at kernel/sched.c open 10 Sep 2002
> What exactly is this?

Looks like this is my bugreport for BUG in kernel/sched.c:944 in the middle
of partition parsing output on boot.

Subject of email was
'2.5.34 BUG at kernel/sched.c:944 (partitions code related?)'
msgid: [email protected]

Bye,
Oleg

2002-09-11 07:16:18

by Ingo Molnar

Subject: Re: 2.5 Problem Status Report


On Wed, 11 Sep 2002, Oleg Drokin wrote:

> > > BUG at kernel/sched.c open 10 Sep 2002
> > What exactly is this?
>
> Looks like this is my bugreport for BUG in kernel/sched.c:944 in the middle
> of partition parsing output on boot.
>
> Subject of email was
> '2.5.34 BUG at kernel/sched.c:944 (partitions code related?)'
> msgid: [email protected]

very strange backtrace:

>>EIP; c0115818 <schedule+18/4a0> <=====
Trace; c01053a0 <default_idle+0/40>
Trace; c0105000 <_stext+0/0>
Trace; c01053c8 <default_idle+28/40>
Trace; c010547e <cpu_idle+4e/50>
Trace; c010506c <rest_init+6c/70>

i've once seen the 2.5 IDE code doing a schedule_timeout() from an IRQ
handler, but the above has to be something else. Could you hack sched.c to
print out the exact preemption count? It could be a preempt-count
underflow due to an unbalanced spin_unlock, or an imbalanced
preempt_enable. [or the IRQ code - but i doubt that, we'd have seen
problems much earlier if this was the case.]

Oleg, do you have CONFIG_DEBUG_SPINLOCK enabled? That should catch an
unbalanced spin_unlock().

Robert, i'd suggest adding some sort of debugging code for this anyway, to
catch preempt_count going below 0. right now we do this:

#define dec_preempt_count() \
do { \
	preempt_count()--; \
} while (0)

this should be something like:

#if CONFIG_DEBUG_SPINLOCK

#define dec_preempt_count() \
do { \
	if (!--preempt_count()) \
		BUG(); \
} while (0)

#else

#define dec_preempt_count() \
do { \
	preempt_count()--; \
} while (0)

#endif

or introduce a separate debug config option, CONFIG_DEBUG_PREEMPT - until
all code is converted. This should catch a bad preempt_enable where it
happens.

Ingo

2002-09-11 07:23:22

by Oleg Drokin

Subject: Re: 2.5 Problem Status Report

Hello!

On Wed, Sep 11, 2002 at 09:26:40AM +0200, Ingo Molnar wrote:
> > > > BUG at kernel/sched.c open 10 Sep 2002
> > > What exactly is this?
> > Looks like this is my bugreport for BUG in kernel/sched.c:944 in the middle
> > of partition parsing output on boot.
> > Subject of email was
> > '2.5.34 BUG at kernel/sched.c:944 (partitions code related?)'
> > msgid: [email protected]
> very strange backtrace:
> >>EIP; c0115818 <schedule+18/4a0> <=====
> Trace; c01053a0 <default_idle+0/40>

I noticed it too.

> i've once seen the 2.5 IDE code doing a schedule_timeout() from an IRQ
> handler, but the above has to be something else. Could you hack sched.c to
> print out the exact preemption count? It could be a preempt-count
> underflow due to an unbalanced spin_unlock, or an imbalanced
> preempt_enable. [or the IRQ code - but i doubt that, we'd have seen

I have preemption disabled.

> problems much earlier if this was the case.]

> Oleg, do you have CONFIG_DEBUG_SPINLOCK enabled? That should catch an
> unbalanced spin_unlock().

Yes, I do.
CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SLAB=y
CONFIG_MAGIC_SYSRQ=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_HIGHMEM=y

Bye,
Oleg

2002-09-11 07:28:08

by Ingo Molnar

Subject: Re: 2.5 Problem Status Report


On Wed, 11 Sep 2002, Oleg Drokin wrote:

> I have preemption disabled.

nevertheless please print out preempt_count() in sched.c - since the big
IRQ cleanups we use the preemption count even if preemption is disabled.

this way we'll know what kind of problem happened - a stuck softirq count,
a stuck hardirq count or an underflow?

Ingo
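
The three cases Ingo lists (stuck softirq count, stuck hardirq count, underflow) can be told apart from a single printed value. The sketch below assumes an 8-bit-per-field layout (preemption depth in bits 0-7, softirq count in bits 8-15, hardirq count in bits 16-23); that layout, the enum and the function name are assumptions made for illustration and should be checked against the hardirq.h of the tree being debugged.

```c
/* Assumed field layout of the combined count (8 bits per field). */
#define PREEMPT_SHIFT  0
#define SOFTIRQ_SHIFT  8
#define HARDIRQ_SHIFT  16
#define FIELD_MASK     0xffu

/* Hypothetical classification of a count read at a point where it is
 * expected to be zero, such as on entry to schedule() from idle. */
enum count_problem {
    COUNT_OK,
    COUNT_UNDERFLOW,      /* more decrements than increments */
    COUNT_STUCK_HARDIRQ,  /* hardirq field left non-zero */
    COUNT_STUCK_SOFTIRQ,  /* softirq field left non-zero */
    COUNT_PREEMPT_HELD    /* preemption depth left non-zero */
};

static enum count_problem classify_count(int count)
{
    unsigned int c = (unsigned int)count;

    if (count < 0)
        return COUNT_UNDERFLOW;
    if ((c >> HARDIRQ_SHIFT) & FIELD_MASK)
        return COUNT_STUCK_HARDIRQ;
    if ((c >> SOFTIRQ_SHIFT) & FIELD_MASK)
        return COUNT_STUCK_SOFTIRQ;
    if ((c >> PREEMPT_SHIFT) & FIELD_MASK)
        return COUNT_PREEMPT_HELD;
    return COUNT_OK;
}
```

Under these assumptions a printed value of -1 classifies as an underflow, while 0x100 or 0x10000 would point at a leaked softirq or hardirq count respectively.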

2002-09-11 07:59:23

by Axel H. Siebenwirth

Subject: Re: 2.5 Problem Status Report

Hi Thomas!

On Tue, 10 Sep 2002, Thomas Molina wrote:

> The most current version of this status report can be found at:
> http://members.cox.net/tmolina/kernprobs/status.html

> 2.5 Kernel Problem Reports as of 10 Sep
> Problem Title Status Discussion
> JFS oops open 06 Sep 2002

We have now figured out what causes this. It only happens when the kernel is
compiled with CONFIG_PREEMPT enabled. Without CONFIG_PREEMPT, JFS runs
perfectly.
This preemption stuff seems to cause a lot of trouble.

http://marc.theaimsgroup.com/?l=linux-kernel&m=103127160424684&w=2

Best regards,
Axel Siebenwirth

2002-09-11 08:01:11

by Oleg Drokin

Subject: Re: 2.5 Problem Status Report

Hello!

On Wed, Sep 11, 2002 at 09:38:25AM +0200, Ingo Molnar wrote:

> > I have preemption disabled.
> nevertheless please print out preempt_count() in sched.c - since the big
> IRQ cleanups we use the preemption count even if preemption is disabled.
> this way we'll know what kind of problem happened - a stuck softirq count,
> a stuck hardirq count or an underflow?

You were exactly right.
The preemption count is -1.
I inserted a check in dec_preempt_count(), and here is the updated, correct
stack trace. It seems that ide_unmap_buffer is called with some bogus data or
something like that. I also guess the bug is only visible with debug
highmem = ON and highmem enabled.

ksymoops 2.4.2 on i686 2.4.19. Options used
-V (default)
-k /proc/ksyms (default)
-l /proc/modules (default)
-o /lib/modules/2.4.19/ (default)
-m System.map (specified)

hdb:kernel BUG at /home/green/bk_work/reiser3-linux-2.5-work-t/include/asm/highmem.h:107!
invalid operand: 0000
CPU: 1
EIP: 0060:[<c01bd571>] Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010016
eax: 00010000 ebx: f7dda000 ecx: c1c5c000 edx: 0000ffff
esi: 00000022 edi: f7de6b44 ebp: 00000008 esp: c1c5def0
ds: 0068 es: 0068 ss: 0068
Stack: c033ff6c f7dfed04 c02ea080 00000296 c1c5df0c c1c5df0c 00000008 00000082
c01b5fd7 c033ff6c c1c3ccdc 04000001 0000000e c1c5df80 00000000 c01bd390
c033fda0 c010957d 0000000e f7dfed04 c1c5df80 c02cdb90 c02cdb80 c02cdb90
Call Trace: [<c01b5fd7>] [<c01bd390>] [<c010957d>] [<c0109819>] [<c01053a0>]
[<c01080a8>] [<c01053a0>] [<c01053c9>] [<c0105472>] [<c011ad5b>]
Code: 0f 0b 6b 00 a0 ee 24 c0 eb 55 90 8d 74 26 00 83 c2 14 c1 e2

>>EIP; c01bd570 <read_intr+1e0/2d0> <=====
Trace; c01b5fd6 <ide_intr+1b6/280>
Trace; c01bd390 <read_intr+0/2d0>
Trace; c010957c <handle_IRQ_event+2c/50>
Trace; c0109818 <do_IRQ+d8/190>
Trace; c01053a0 <default_idle+0/40>
Trace; c01080a8 <common_interrupt+18/20>
Trace; c01053a0 <default_idle+0/40>
Trace; c01053c8 <default_idle+28/40>
Trace; c0105472 <cpu_idle+42/50>
Trace; c011ad5a <release_console_sem+10a/120>
Code; c01bd570 <read_intr+1e0/2d0>
00000000 <_EIP>:
Code; c01bd570 <read_intr+1e0/2d0> <=====
0: 0f 0b ud2a <=====
Code; c01bd572 <read_intr+1e2/2d0>
2: 6b 00 a0 imul $0xffffffa0,(%eax),%eax
Code; c01bd574 <read_intr+1e4/2d0>
5: ee out %al,(%dx)
Code; c01bd576 <read_intr+1e6/2d0>
6: 24 c0 and $0xc0,%al
Code; c01bd578 <read_intr+1e8/2d0>
8: eb 55 jmp 5f <_EIP+0x5f> c01bd5ce <read_intr+23e/2d0>
Code; c01bd57a <read_intr+1ea/2d0>
a: 90 nop
Code; c01bd57a <read_intr+1ea/2d0>
b: 8d 74 26 00 lea 0x0(%esi,1),%esi
Code; c01bd57e <read_intr+1ee/2d0>
f: 83 c2 14 add $0x14,%edx
Code; c01bd582 <read_intr+1f2/2d0>
12: c1 e2 00 shl $0x0,%edx

<0>Kernel panic: Aiee, killing interrupt handler!

1143 warnings issued. Results may not be reliable.

2002-09-11 08:02:35

by Thomas Molina

Subject: Re: 2.5 Problem Status Report

On Wed, 11 Sep 2002, Oleg Drokin wrote:

> Hello!
>
> On Wed, Sep 11, 2002 at 12:00:20AM -0400, Robert Love wrote:
>
> > > BUG at kernel/sched.c open 10 Sep 2002
> > What exactly is this?
>
> Looks like this is my bugreport for BUG in kernel/sched.c:944 in the middle
> of partition parsing output on boot.
>
> Subject of email was
> '2.5.34 BUG at kernel/sched.c:944 (partitions code related?)'
> msgid: [email protected]

What I post to LKML is a cut and paste from lynx. I put hyperlinks to the
LKML archive discussions on my web page and a hyperlink to the web page
when I post the report.

2002-09-11 08:07:43

by Oleg Drokin

Subject: Re: 2.5 Problem Status Report

Hello!

On Wed, Sep 11, 2002 at 03:07:12AM -0500, Thomas Molina wrote:

> > > > BUG at kernel/sched.c open 10 Sep 2002
> > > What exactly is this?
> > Looks like this is my bugreport for BUG in kernel/sched.c:944 in the middle
> > of partition parsing output on boot.
> > Subject of email was
> > '2.5.34 BUG at kernel/sched.c:944 (partitions code related?)'
> > msgid: [email protected]
> What I post to LKML is a cut and paste from lynx. I put hyperlinks to the
> LKML archive discussions on my web page and a hyperlink to the web page
> when I post the report.

Hm. Posting a link to the original email on some archive, even in the
text-only version that you mail to lkml, would not hurt either, I think.
It's up to you anyway, of course.

Bye,
Oleg

2002-09-11 09:16:29

by Clemens Schwaighofer

Subject: Re: 2.5 Problem Status Report

Hello Thomas Molina

--On Tuesday, September 10, 2002 09:00:30 PM -0500 you wrote:

> 2.5 Kernel Problem Reports as of 10 Sep
[snip]

and since we are here reporting Problems in 2.5

fb is still broken, since 2.5.32 at least. Patches exist, but they haven't
found their way into the kernel tree.

--
"War is a massacre of people who do not know each other, for the
benefit of people who know each other but do not massacre each other"
- Paul Valéry (1871-1945)
mfg, Clemens Schwaighofer PIXELWINGS Medien GMBH
Kandlgasse 15/5, A-1070 Wien T: [+43 1] 524 58 50
JETZT NEU! MIT FEWA GEWASCHEN --> http://www.pixelwings.com

2002-09-11 09:14:12

by Adrian Bunk

Subject: Re: 2.5 Problem Status Report

On Wed, 11 Sep 2002, Thomas Molina wrote:

> What I post to LKML is a cut and paste from lynx. I put hyperlinks to the
> LKML archive discussions on my web page and a hyperlink to the web page
> when I post the report.

If you use

lynx -dump http://members.cox.net/tmolina/kernprobs/status.html

instead you get a text output that includes all links.

cu
Adrian

--

You only think this is a free country. Like the US the UK spends a lot of
time explaining its a free country because its a police state.
Alan Cox


2002-09-11 10:20:38

by Jens Axboe

Subject: Re: 2.5 Problem Status Report

On Wed, Sep 11 2002, Oleg Drokin wrote:
> Hello!
>
> On Wed, Sep 11, 2002 at 09:38:25AM +0200, Ingo Molnar wrote:
>
> > > I have preemption disabled.
> > nevertheless please print out preempt_count() in sched.c - since the big
> > IRQ cleanups we use the preemption count even if preemption is disabled.
> > this way we'll know what kind of problem happened - a stuck softirq count,
> > a stuck hardirq count or an underflow?
>
> You were exactly right. preemption count is -1. I inserted a check in
> dec_preempt_count() and here is updated correct stacktrace. Seems
> like ide_unmap_buffer is called with some bogus data or something like
> that. Also I guess the bug is only visible with debug highmem = ON and
> highmem enabled.

ok I see the bug. it's due to the imbalanced nature of ide_map_buffer()
vs ide_unmap_buffer(). i'll cook up a fix right away.

--
Jens Axboe

2002-09-11 10:24:49

by Jens Axboe

Subject: Re: 2.5 Problem Status Report

On Wed, Sep 11 2002, Jens Axboe wrote:
> On Wed, Sep 11 2002, Oleg Drokin wrote:
> > Hello!
> >
> > On Wed, Sep 11, 2002 at 09:38:25AM +0200, Ingo Molnar wrote:
> >
> > > > I have preemption disabled.
> > > nevertheless please print out preempt_count() in sched.c - since the big
> > > IRQ cleanups we use the preemption count even if preemption is disabled.
> > > this way we'll know what kind of problem happened - a stuck softirq count,
> > > a stuck hardirq count or an underflow?
> >
> > You were exactly right. preemption count is -1. I inserted a check in
> > dec_preempt_count() and here is updated correct stacktrace. Seems
> > like ide_unmap_buffer is called with some bogus data or something like
> > that. Also I guess the bug is only visible with debug highmem = ON and
> > highmem enabled.
>
> ok I see the bug. it's due to the imbalanced nature of ide_map_buffer()
> vs ide_unmap_buffer(). i'll cook up a fix right away.

Oleg,

Does this make it work?

--- drivers/ide/ide-disk.c~ 2002-09-11 12:27:47.000000000 +0200
+++ drivers/ide/ide-disk.c 2002-09-11 12:28:17.000000000 +0200
@@ -175,7 +175,7 @@
drive->name, rq->sector, rq->sector+nsect-1,
(unsigned long) rq->buffer+(nsect<<9), rq->nr_sectors-nsect);
#endif
- ide_unmap_buffer(to, &flags);
+ ide_unmap_buffer(rq, to, &flags);
rq->sector += nsect;
rq->errors = 0;
i = (rq->nr_sectors -= nsect);
@@ -226,7 +226,7 @@
unsigned long flags;
char *to = ide_map_buffer(rq, &flags);
taskfile_output_data(drive, to, SECTOR_WORDS);
- ide_unmap_buffer(to, &flags);
+ ide_unmap_buffer(rq, to, &flags);
if (HWGROUP(drive)->handler != NULL)
BUG();
ide_set_handler(drive, &write_intr, WAIT_CMD, NULL);
@@ -302,7 +302,7 @@
* re-entering us on the last transfer.
*/
taskfile_output_data(drive, buffer, nsect<<7);
- ide_unmap_buffer(buffer, &flags);
+ ide_unmap_buffer(rq, buffer, &flags);
} while (mcount);

return 0;
@@ -688,7 +688,7 @@
BUG();
ide_set_handler(drive, &write_intr, WAIT_CMD, NULL);
taskfile_output_data(drive, buffer, SECTOR_WORDS);
- ide_unmap_buffer(buffer, &flags);
+ ide_unmap_buffer(rq, buffer, &flags);
}
return ide_started;
}
--- drivers/ide/ide-taskfile.c~ 2002-09-11 12:27:51.000000000 +0200
+++ drivers/ide/ide-taskfile.c 2002-09-11 12:28:25.000000000 +0200
@@ -60,7 +60,7 @@
#endif

#define task_map_rq(rq, flags) ide_map_buffer((rq), (flags))
-#define task_unmap_rq(rq, buf, flags) ide_unmap_buffer((buf), (flags))
+#define task_unmap_rq(rq, buf, flags) ide_unmap_buffer((rq), (buf), (flags))

inline u32 task_read_24 (ide_drive_t *drive)
{
--- include/linux/ide.h~ 2002-09-11 12:27:14.000000000 +0200
+++ include/linux/ide.h 2002-09-11 12:27:29.000000000 +0200
@@ -597,9 +597,10 @@
return rq->buffer + task_rq_offset(rq);
}

-extern inline void ide_unmap_buffer(char *buffer, unsigned long *flags)
+extern inline void ide_unmap_buffer(struct request *rq, char *buffer, unsigned long *flags)
{
- bio_kunmap_irq(buffer, flags);
+ if (rq->bio)
+ bio_kunmap_irq(buffer, flags);
}

/*

--
Jens Axboe

2002-09-11 10:42:54

by Oleg Drokin

Subject: Re: 2.5 Problem Status Report

Hello!

On Wed, Sep 11, 2002 at 12:29:26PM +0200, Jens Axboe wrote:

> > ok I see the bug. it's due to the imbalanced nature of ide_map_buffer()
> > vs ide_unmap_buffer(). i'll cook up a fix right away.
> Does this make it work?

No. It fails exactly like without the patch.

> --- include/linux/ide.h~ 2002-09-11 12:27:14.000000000 +0200
> +++ include/linux/ide.h 2002-09-11 12:27:29.000000000 +0200
> @@ -597,9 +597,10 @@
> return rq->buffer + task_rq_offset(rq);
> }
>
> -extern inline void ide_unmap_buffer(char *buffer, unsigned long *flags)
> +extern inline void ide_unmap_buffer(struct request *rq, char *buffer, unsigned long *flags)
> {
> - bio_kunmap_irq(buffer, flags);
> + if (rq->bio)
> + bio_kunmap_irq(buffer, flags);
> }
>
> /*

Perhaps you forgot to make sure rq->bio is zeroed on unmapping/freeing?

Bye,
Oleg

2002-09-11 10:53:27

by Jens Axboe

Subject: Re: 2.5 Problem Status Report

On Wed, Sep 11 2002, Oleg Drokin wrote:
> Hello!
>
> On Wed, Sep 11, 2002 at 12:29:26PM +0200, Jens Axboe wrote:
>
> > > ok I see the bug. it's due to the imbalanced nature of ide_map_buffer()
> > > vs ide_unmap_buffer(). i'll cook up a fix right away.
> > Does this make it work?
>
> No. It fails exactly like without the patch.

Hmm, ok I'll try and reproduce it here then.

> > --- include/linux/ide.h~ 2002-09-11 12:27:14.000000000 +0200
> > +++ include/linux/ide.h 2002-09-11 12:27:29.000000000 +0200
> > @@ -597,9 +597,10 @@
> > return rq->buffer + task_rq_offset(rq);
> > }
> >
> > -extern inline void ide_unmap_buffer(char *buffer, unsigned long *flags)
> > +extern inline void ide_unmap_buffer(struct request *rq, char *buffer, unsigned long *flags)
> > {
> > - bio_kunmap_irq(buffer, flags);
> > + if (rq->bio)
> > + bio_kunmap_irq(buffer, flags);
> > }
> >
> > /*
>
> Perhaps you forgot to make sure rq->bio is zeroed on unmapping/freeing?

rq->bio must not be zeroed or free'd or anything like that. ok I see
what happens now. does this patch work for you? just back out the other
patch first (well you don't have to, but might as well).

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.639 -> 1.640
# include/linux/bio.h 1.17 -> 1.18
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/09/11 [email protected] 1.640
# clean up with bio_kmap_irq() thing properly. remove the micro optimization
# of _not_ calling kmap_atomic() if this isn't a highmem page. we could
# keep that and do the inc_preempt_count() ourselves, but I'm not sure
# it's worth it and this is cleaner.
# --------------------------------------------
#
diff -Nru a/include/linux/bio.h b/include/linux/bio.h
--- a/include/linux/bio.h Wed Sep 11 12:57:45 2002
+++ b/include/linux/bio.h Wed Sep 11 12:57:45 2002
@@ -215,17 +215,11 @@
{
unsigned long addr;

- local_save_flags(*flags);
-
- /*
- * could be low
- */
- if (!PageHighMem(bio_page(bio)))
- return bio_data(bio);
-
/*
- * it's a highmem page
+ * might not be a highmem page, but the preempt/irq count
+ * balancing is a lot nicer this way
*/
+ local_save_flags(*flags);
local_irq_disable();
addr = (unsigned long) kmap_atomic(bio_page(bio), KM_BIO_SRC_IRQ);


--
Jens Axboe
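
The imbalance that the changelog above describes can be modelled in a few lines of userspace C. This is a sketch of one reading of that changelog, not the kernel code: `preempt_count_model`, `map_old`, `map_new` and `unmap` are hypothetical names, and the model assumes that the atomic map path increments the count, that the old map path skipped it for non-highmem pages, and that the unmap path always decrements.

```c
#include <stdbool.h>

/* Stand-in for the preemption count touched by atomic kmaps. */
static int preempt_count_model;

/* Old behaviour (assumed): lowmem pages return early, so the count is
 * only raised for highmem pages. */
static void map_old(bool highmem)
{
    if (highmem)
        preempt_count_model++;
}

/* New behaviour (assumed): always take the atomic mapping path. */
static void map_new(bool highmem)
{
    (void)highmem;
    preempt_count_model++;
}

/* Unmapping always drops the count in this model. */
static void unmap(void)
{
    preempt_count_model--;
}

/* Run one map/unmap pair from a clean state and report the count. */
static int count_after_old(bool highmem)
{
    preempt_count_model = 0;
    map_old(highmem);
    unmap();
    return preempt_count_model;
}

static int count_after_new(bool highmem)
{
    preempt_count_model = 0;
    map_new(highmem);
    unmap();
    return preempt_count_model;
}
```

In this model a lowmem page under the old scheme ends at -1, matching the reported preemption count, while the new scheme ends at 0 for both kinds of page.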

2002-09-11 11:03:12

by Jens Axboe

Subject: Re: 2.5 Problem Status Report

On Wed, Sep 11 2002, Jens Axboe wrote:
> what happens now. does this patch work for you? just back out the other
> patch first (well you don't have to, but might as well).

nonsense, you need both patches of course... sorry for the confusion.

--
Jens Axboe

2002-09-11 11:11:21

by Oleg Drokin

Subject: Re: 2.5 Problem Status Report

Hello!

On Wed, Sep 11, 2002 at 12:58:07PM +0200, Jens Axboe wrote:

> > > > ok I see the bug. it's due to the imbalanced nature of ide_map_buffer()
> > > > vs ide_unmap_buffer(). i'll cook up a fix right away.
> > > Does this make it work?
> > No. It fails exactly like without the patch.
> Hmm, ok I'll try and reproduce it here then.

> > > - bio_kunmap_irq(buffer, flags);
> > > + if (rq->bio)
> > > + bio_kunmap_irq(buffer, flags);
> > > }
> > >
> > Perhaps you forgot to make sure rq->bio is zeroed on unmapping/freeing?
> rq->bio must not be zeroed or free'd or anything like that. ok I see

Hm? So this branch is always executed? Why check for it then?
(I mean content of rq->bio, not the place where it points to).

> what happens now. does this patch work for you? just back out the other
> patch first (well you don't have to, but might as well).

Ok, with the other patch it still fails in the same way.
I have not backed out the other patch, so I tested with both patches present.

Bye,
Oleg

2002-09-11 11:12:45

by Jens Axboe

Subject: Re: 2.5 Problem Status Report

On Wed, Sep 11 2002, Oleg Drokin wrote:
> Hello!
>
> On Wed, Sep 11, 2002 at 12:58:07PM +0200, Jens Axboe wrote:
>
> > > > > ok I see the bug. it's due to the imbalanced nature of ide_map_buffer()
> > > > > vs ide_unmap_buffer(). i'll cook up a fix right away.
> > > > Does this make it work?
> > > No. It fails exactly like without the patch.
> > Hmm, ok I'll try and reproduce it here then.
>
> > > > - bio_kunmap_irq(buffer, flags);
> > > > + if (rq->bio)
> > > > + bio_kunmap_irq(buffer, flags);
> > > > }
> > > >
> > > Perhaps you forgot to make sure rq->bio is zeroed on unmapping/freeing?
> > rq->bio must not be zeroed or free'd or anything like that. ok I see
>
> Hm? So this branch is always executed? Why check for it then?
> (I mean content of rq->bio, not the place where it points to).

ehm no it isn't always executed?! there might not be a ->bio attached to
the request. that goes for both ide_map_buffer() and ide_unmap_buffer()

> > what happens now. does this patch work for you? just back out the other
> > patch first (well you don't have to, but might as well).
>
> Ok, with the other patch it still fails in the same way.
> I have not backed out the other patch, so I tested with both patches present.

alright, seems I do have to try it myself... ok will do that.

--
Jens Axboe

2002-09-11 11:44:25

by Jens Axboe

Subject: Re: 2.5 Problem Status Report

On Wed, Sep 11 2002, Jens Axboe wrote:
> > Ok, with the other patch it still fails in the same way.
> > I have not backed out the other patch, so I tested with both patches present.
>
> alright, seems I do have to try it myself... ok will do that.

with both patches I sent applied, the bug does _not_ exist here as
expected. could you please double check that they are applied, and that
you have booted the right kernel? a make clean just to be on the safe
side might be a good idea :-)

--
Jens Axboe

2002-09-11 12:05:37

by Oleg Drokin

Subject: Re: 2.5 Problem Status Report

Hello!

On Wed, Sep 11, 2002 at 01:49:03PM +0200, Jens Axboe wrote:
> > > Ok, with the other patch it still fails in the same way.
> > > I have not backed out the other patch, so I tested with both patches present.
> > alright, seems I do have to try it myself... ok will do that.
> with both patches I sent applied, the bug does _not_ exist here as
> expected. could you please double check that they are applied, and that
> you have booted the right kernel? a make clean just to be on the safe
> side might be a good idea :-)

Well, now it works for me too. Not sure why it was not working the previous
time, because all the patches were in place. I will play more with it later
today.

Thanks.

Bye,
Oleg

2002-09-11 15:33:46

by Oleg Drokin

Subject: Re: 2.5 Problem Status Report

Hello!

On Wed, Sep 11, 2002 at 04:10:21PM +0400, Oleg Drokin wrote:
> > > > Ok, with the other patch it still fails in the same way.
> > > > I have not backed out the other patch, so I tested with both patches present.
> > > alright, seems I do have to try it myself... ok will do that.
> > with both patches I sent applied, the bug does _not_ exist here as
> > expected. could you please double check that they are applied, and that
> > you have booted the right kernel? a make clean just to be on the safe
> > side might be a good idea :-)
> Well, now it works for me too. Not sure why it was not working the previous time,
> because all the patches were in place. I will play more with it later today.

It's me again. And I have bad news.
I have figured out why it works for you and did not work for me.

Try the following patch (inspired by Ingo) to get a nice BUG on bootup again.

Without the patch ext2 works with your fixes, but reiserfs is not working,
so it seems there are constant preempt counter underflows that later get
corrected.

Bye,
Oleg

===== include/linux/preempt.h 1.6 vs edited =====
--- 1.6/include/linux/preempt.h Fri Sep 6 04:18:30 2002
+++ edited/include/linux/preempt.h Wed Sep 11 19:28:42 2002
@@ -17,7 +17,8 @@

#define dec_preempt_count() \
do { \
- preempt_count()--; \
+ if ( --preempt_count()) \
+ BUG(); \
} while (0)

#ifdef CONFIG_PREEMPT

2002-09-11 15:35:55

by Ingo Molnar

Subject: Re: 2.5 Problem Status Report


On Wed, 11 Sep 2002, Oleg Drokin wrote:

> - preempt_count()--; \
> + if ( --preempt_count()) \
> + BUG(); \

actually, the correct patch is to:

- preempt_count()--; \
+ if (!--preempt_count()) \
+ BUG(); \

(note the '!').

Ingo

2002-09-11 15:42:33

by Ingo Molnar

Subject: Re: 2.5 Problem Status Report


> - preempt_count()--; \
> + if (!--preempt_count()) \
> + BUG(); \

and that should be:

- preempt_count()--; \
+ if (!preempt_count()--) \
+ BUG(); \

ie. we try to check whether a 0 preempt count is decreased to -1.

Ingo
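
The difference between the three checks proposed in this exchange is easy to see on plain integers. The helpers below are a userspace sketch with hypothetical names; each applies one candidate condition to a local copy of the count and reports whether BUG() would have fired.

```c
#include <stdbool.h>

/* if (--preempt_count()) BUG();
 * Fires whenever the decremented value is non-zero. */
static bool fires_plain_prefix(int count)
{
    return (--count) != 0;
}

/* if (!--preempt_count()) BUG();
 * Fires whenever the decremented value is zero. */
static bool fires_negated_prefix(int count)
{
    return !(--count);
}

/* if (!preempt_count()--) BUG();
 * Fires only when the value before the decrement was zero. */
static bool fires_negated_postfix(int count)
{
    return !(count--);
}
```

For a balanced decrement from 1 to 0 only the negated prefix form fires, which is a false positive. For the underflow from 0 to -1 the negated prefix form stays silent, while the negated postfix form fires. The plain prefix form also fires on the underflow, but it additionally fires on any nested decrement such as 2 to 1. Only the last variant, testing the value before the decrement, isolates the 0 to -1 transition.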

2002-09-11 17:44:47

by Oleg Drokin

Subject: Re: 2.5 Problem Status Report

Hello!

On Wed, Sep 11, 2002 at 05:46:21PM +0200, Ingo Molnar wrote:
> > - preempt_count()--; \
> > + if ( --preempt_count()) \
> > + BUG(); \
> actually, the correct patch is to:
> - preempt_count()--; \
> + if (!--preempt_count()) \
> + BUG(); \
> (note the '!').

Ah, yes. My bad.

Bye,
Oleg

2002-09-11 18:30:02

by Thomas Molina

Subject: Re: 2.5 Problem Status Report

On Tue, 10 Sep 2002, William Lee Irwin III wrote:

> On Tue, Sep 10, 2002 at 09:00:30PM -0500, Thomas Molina wrote:
> > 2.5 Kernel Problem Reports as of 10 Sep
> > Problem Title Status Discussion
> > qlogicisp oops no further discussion 2.5.33
> > PCI and/or starfire.c broken no further discussion 2.5.33
> > __write_lock_failed() oops no further discussion 2.5.33
>
> Since 3 of these are things I reported...

[snip explanation of why these are "long-term problems"]

I've seen these problem status reports as a fairly dynamic item,
especially at this stage of 2.5 development. Is my current way of doing
things useful to the list? Certainly this posting seems to have generated
a number of useful discussions. I wish they had migrated to the threads
where the original problem had been brought up.

I've had several messages like this one, where people have asked what
about problem x, y, or z that have disappeared from the radar scope. I'm
thinking I could keep the character of this report like it is now, and
have another one with the "long-term problems". I've been sending this
report to the list about once a week, usually just after Linus brings out
a new point release. The other list could be a less-frequent "what about
these?" kind of list.

2002-09-11 20:21:23

by William Lee Irwin III

Subject: Re: 2.5 Problem Status Report

On Tue, 10 Sep 2002, William Lee Irwin III wrote:
>> Since 3 of these are things I reported...

On Wed, Sep 11, 2002 at 01:33:36PM -0500, Thomas Molina wrote:
> [snip explanation of why these are "long-term problems"]

The PCI-PCI bridge stuff shouldn't be terribly difficult to deal with.
qlogicisp.c is just waiting for someone, anyone who has any idea how
the hardware works, and the tasklist_lock setting off the NMI oopser
when you breathe on it NFI. So maybe 1/3 is long-term.


Cheers,
Bill