2008-07-13 23:57:46

by Ryan Hope

[permalink] [raw]
Subject: Performance Question: BUG_ON vs. WARN_ON_ONCE

I was porting the -rt branch to the latest -mm kernel and encountered
a bug. The
replace-bugon-by-warn-on.patch patch make the following change:

diff --git a/18f7d025bb2e5762fd4063cce0b6e2342475c55c:arch/x86/mm/highmem_32.c
b/db090b52f9d2f9088a4ff9bce530e3c234c8e3af:arch/x86/mm/highmem_32.c
index 165c871ba9af0211e0c939e0bc2212750d4bf39f..402ecdd04d7818fd24e921c94b698faa19383b71
100644
--- a/18f7d025bb2e5762fd4063cce0b6e2342475c55c:arch/x86/mm/highmem_32.c
+++ b/db090b52f9d2f9088a4ff9bce530e3c234c8e3af:arch/x86/mm/highmem_32.c
@@ -84,7 +84,7 @@ void *kmap_atomic_prot(struct page *page, enum
km_type type, pgprot_t prot)

idx = type + KM_TYPE_NR*smp_processor_id();
vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
- BUG_ON(!pte_none(*(kmap_pte-idx)));
+ WARN_ON_ONCE(!pte_none(*(kmap_pte-idx)));
set_pte(kmap_pte-idx, mk_pte(page, prot));
arch_flush_lazy_mmu_mode();

However, this causes the kernel to crash or oops under certain loads.
Reverting this change makes the error go away. Is there any sort of
performance difference between BUG_ON and WARN_ON_ONCE, I figure the
change was for a reason so I am wondering what will result from this
change. Any info would be appreciated.

-Ryan


Subject: Re: Performance Question: BUG_ON vs. WARN_ON_ONCE

On Sun, 13 Jul 2008 19:57:37 -0400
"Ryan Hope" <[email protected]> wrote:

> However, this causes the kernel to crash or oops under certain loads.
> Reverting this change makes the error go away. Is there any sort of
> performance difference between BUG_ON and WARN_ON_ONCE, I figure the
> change was for a reason so I am wondering what will result from this
> change. Any info would be appreciated.
>
> -Ryan

Looks like WARN_ON_ONCE declares and uses a static int variable, so
it's not reentrant. It should be an atomic static. Still, I don't see
how this could crash the kernel or even oops, or have any other
side-effects.

Could you post the oops? Are you sure the oops you're seeing isn't just
what WARN_ON et al. regularly produce?


Eduard

2008-07-14 04:11:06

by Ryan Hope

[permalink] [raw]
Subject: Re: Performance Question: BUG_ON vs. WARN_ON_ONCE

well the bug I recieved looked like it had to do with highmem and this
was the only code relating to mem that got touched, as for the other
person, their crash was reproducible and it definitely was an oops,
numlock led started to blink and system was unresponsive, for both of
us
reverting this change seems fix the issue, my dmesg log is attached to
this message

-Ryan

On Sun, Jul 13, 2008 at 11:36 PM, Eduard - Gabriel Munteanu
<[email protected]> wrote:
> On Sun, 13 Jul 2008 19:57:37 -0400
> "Ryan Hope" <[email protected]> wrote:
>
>> However, this causes the kernel to crash or oops under certain loads.
>> Reverting this change makes the error go away. Is there any sort of
>> performance difference between BUG_ON and WARN_ON_ONCE, I figure the
>> change was for a reason so I am wondering what will result from this
>> change. Any info would be appreciated.
>>
>> -Ryan
>
> Looks like WARN_ON_ONCE declares and uses a static int variable, so
> it's not reentrant. It should be an atomic static. Still, I don't see
> how this could crash the kernel or even oops, or have any other
> side-effects.
>
> Could you post the oops? Are you sure the oops you're seeing isn't just
> what WARN_ON et al. regularly produce?
>
>
> Eduard
>


Attachments:
(No filename) (1.23 kB)
errors.txt (56.45 kB)
Download all attachments

2008-07-14 04:12:18

by Ryan Hope

[permalink] [raw]
Subject: Re: Performance Question: BUG_ON vs. WARN_ON_ONCE

i forgot to mention that my system did not die, i was compiling a
kernel at the time and the compile froze, but i was able to cancel it
and restart the compile, my system stayed alive

On Mon, Jul 14, 2008 at 12:10 AM, Ryan Hope <[email protected]> wrote:
> well the bug I recieved looked like it had to do with highmem and this
> was the only code relating to mem that got touched, as for the other
> person, their crash was reproducible and it definitely was an oops,
> numlock led started to blink and system was unresponsive, for both of
> us
> reverting this change seems fix the issue, my dmesg log is attached to
> this message
>
> -Ryan
>
> On Sun, Jul 13, 2008 at 11:36 PM, Eduard - Gabriel Munteanu
> <[email protected]> wrote:
>> On Sun, 13 Jul 2008 19:57:37 -0400
>> "Ryan Hope" <[email protected]> wrote:
>>
>>> However, this causes the kernel to crash or oops under certain loads.
>>> Reverting this change makes the error go away. Is there any sort of
>>> performance difference between BUG_ON and WARN_ON_ONCE, I figure the
>>> change was for a reason so I am wondering what will result from this
>>> change. Any info would be appreciated.
>>>
>>> -Ryan
>>
>> Looks like WARN_ON_ONCE declares and uses a static int variable, so
>> it's not reentrant. It should be an atomic static. Still, I don't see
>> how this could crash the kernel or even oops, or have any other
>> side-effects.
>>
>> Could you post the oops? Are you sure the oops you're seeing isn't just
>> what WARN_ON et al. regularly produce?
>>
>>
>> Eduard
>>
>

Subject: Re: Performance Question: BUG_ON vs. WARN_ON_ONCE

On Mon, 14 Jul 2008 00:10:50 -0400
"Ryan Hope" <[email protected]> wrote:

> well the bug I recieved looked like it had to do with highmem and this
> was the only code relating to mem that got touched, as for the other
> person, their crash was reproducible and it definitely was an oops,
> numlock led started to blink and system was unresponsive, for both of
> us
> reverting this change seems fix the issue, my dmesg log is attached to
> this message

There are a few things you should take into account before anything else:
1. The bug does not occur there, but in other code.
2. The kernel is tainted.
3. The oopses start occuring just after you load that tainted module,
or at least something related (that drm stuff is linked to the radeon
module, I presume, which is proprietary AFAIK)

So you should retest after eliminating all these possible noise and
error sources. Test both if your fix (revert) is correct and if it
crashes without your fix.

My guess is that would've happened sooner or later and your fix just
moved stuff around enough to mask it. That static int from WARN_ON_ONCE
means another 4 or 8 bytes in the kernel image, which might set things
off in an already unstable environment.


Eduard

2008-07-14 04:51:18

by Ryan Hope

[permalink] [raw]
Subject: Re: Performance Question: BUG_ON vs. WARN_ON_ONCE

radeon drm/dri is opensource, i use the radeonhd driver for xorg

On Mon, Jul 14, 2008 at 12:36 AM, Eduard - Gabriel Munteanu
<[email protected]> wrote:
> On Mon, 14 Jul 2008 00:10:50 -0400
> "Ryan Hope" <[email protected]> wrote:
>
>> well the bug I recieved looked like it had to do with highmem and this
>> was the only code relating to mem that got touched, as for the other
>> person, their crash was reproducible and it definitely was an oops,
>> numlock led started to blink and system was unresponsive, for both of
>> us
>> reverting this change seems fix the issue, my dmesg log is attached to
>> this message
>
> There are a few things you should take into account before anything else:
> 1. The bug does not occur there, but in other code.
> 2. The kernel is tainted.
> 3. The oopses start occuring just after you load that tainted module,
> or at least something related (that drm stuff is linked to the radeon
> module, I presume, which is proprietary AFAIK)
>
> So you should retest after eliminating all these possible noise and
> error sources. Test both if your fix (revert) is correct and if it
> crashes without your fix.
>
> My guess is that would've happened sooner or later and your fix just
> moved stuff around enough to mask it. That static int from WARN_ON_ONCE
> means another 4 or 8 bytes in the kernel image, which might set things
> off in an already unstable environment.
>
>
> Eduard
>

2008-07-14 05:05:27

by Nick Piggin

[permalink] [raw]
Subject: Re: Performance Question: BUG_ON vs. WARN_ON_ONCE

On Monday 14 July 2008 14:36, Eduard - Gabriel Munteanu wrote:
> On Mon, 14 Jul 2008 00:10:50 -0400
>
> "Ryan Hope" <[email protected]> wrote:
> > well the bug I recieved looked like it had to do with highmem and this
> > was the only code relating to mem that got touched, as for the other
> > person, their crash was reproducible and it definitely was an oops,
> > numlock led started to blink and system was unresponsive, for both of
> > us
> > reverting this change seems fix the issue, my dmesg log is attached to
> > this message
>
> There are a few things you should take into account before anything else:
> 1. The bug does not occur there, but in other code.
> 2. The kernel is tainted.
> 3. The oopses start occuring just after you load that tainted module,
> or at least something related (that drm stuff is linked to the radeon
> module, I presume, which is proprietary AFAIK)
>
> So you should retest after eliminating all these possible noise and
> error sources. Test both if your fix (revert) is correct and if it
> crashes without your fix.
>
> My guess is that would've happened sooner or later and your fix just
> moved stuff around enough to mask it. That static int from WARN_ON_ONCE
> means another 4 or 8 bytes in the kernel image, which might set things
> off in an already unstable environment.

Yeah, it looks like most or all of the errors cascade down from
the first one where a page gets freed while it is still in use and
mapped into page tables.

I'd say it is the driver mishandling the pages perhaps in its fault
function or the returned pages from a get_user_pages.

Probably best off reporting it to the drm driver authors first.