LinuxLists.cc - 2.6.39-rc5-git2 boot crashs

[permalink] [raw]

Subject: Re: 2.6.39-rc5-git2 boot crashs

On Mon, May 2, 2011 at 3:28 PM, werner <[email protected]> wrote:
> Also, with this configuration, sync dont work, the terminal waits endless.
> ?And if I type in 'reboot', then after the message it would go to reboot,
> nothing happens (it's possible that for reboot, sync is called).

Yes, sync is called for reboot.

However, that should be fairly easy to figure out. Well, perhaps not
"figure out", but at least get more details of where it is hanging. A
simple

echo w > /proc/sysrq-trigger

should give you a nice trace in the dmesg showing where 'sync' is
hanging, and that would hopefully give us a lot more clues.

Linus

2011-05-03 16:29:13

[permalink] [raw]

Subject: Re: 2.6.39-rc5-git2 boot crashs

2011/5/3 werner <[email protected]>:
> Pls watch the config enclosed.
>
> IDE on , X86_EXTENDED_PLATFORM off (also X_86 elan)
>
> From the previous two suggestions, MTD on (appearently don't makes
> problems), but of MISC-FILESYSTEMS what appearently causes the error message
> during boot and perhaps also that sync don't work, I switched on the half
> and off the other half, to circle the problem.

Ok, can you try the attached patch, to see if the logfs oops goes
away. Perhaps more importantly, does the sync problem also go away?

> No problem with unzip / zip / moving big files etc, , so that this problem
> cames from X86_EXTENDED_PLATTFORM.

Ok, that is very interesting.

> Tell me what to try out now

So at this point you have two problems, and I really would like to
just doubly verify both of them. First off, the attached patch for the
logfs oops and (hopefully) the sync hanging issue.

But secondly, I want you to double--check that whole CONFIG_X86_ELAN
thing - I'd like you to test two kernels that are otherwise totally
identical in their configurations, except one has
CONFIG_X86_EXTENDED_PLATTFORM on and CONFIG_X86_ELAN, and the other
does not. Just to make sure that with all the changes to the config
file, that is really the _only_ difference, and that yes, that's the
one that brings up the "crash at unzip" problem.

I'm adding Ingo Molnar, Thomas Gleixner Peter Anvin to the cc, because
if this whole problem really is because of the x86 CPU configuration,
they may have better ideas than I do.

Ingo/Thomas/Peter: see the whole long and confused thread on lkml. But
it all boils down to Werner using a very full kernel config where not
only is almost everything compiled in (which showed the logfs problem
even though Werner didn't even have a logfs filesystem), but he also
had a very generic x86 kernel. Too generic.

He had CONFIG_X86_EXTENDED_PLATTFORM and CONFIG_X86_ELAN on, and that
has apparently worked for him (and a lot of other people - he does a
distribution) up until 2.6.38. But as of 2.6.39-rc1 it causes some
really odd problems under IO (his test-case is "unzip", but that's
probably fairly random). The problem seems to show up as a bogus IO
list for SATA, causing a big WARN_ON() or oops and then a dead machine
due to IO problems.

I wonder what CONFIG_X86_ELAN has to do with anything, but from all
the config testing werner has done, it really looks like that's the
smoking gun here.

Why does M686 work, but X86_ELAN causes odd problems in 2.6.39-rc?
Allocator issues? Maybe related to the lockless slub paths?

So I obviously agree that X86_ELAN is a crazy choice for a generic
kernel, but it _used_ to work, and this is a regression.

Linus

Attachments:

patch.diff (888.00 B)

2011-05-03 19:08:42

[permalink] [raw]

Subject: Re: 2.6.39-rc5-git2 boot crashs

* Linus Torvalds <[email protected]> wrote:

> > No problem with unzip / zip / moving big files etc, , so that this problem
> > cames from X86_EXTENDED_PLATTFORM.
>
> Ok, that is very interesting.

I have no smart ideas straight away - just an observation: i keep testing
CONFIG_X86_ELAN=y on real hardware, and it's enabled in about 4% of my configs:

sirius:~/linux/linux2/configs> grep -h MELAN * | sort | uniq -c
358 # CONFIG_MELAN is not set
15 CONFIG_MELAN=y

and i booted a few recently:

sirius:~/linux/linux2/configs> grep -l MELAN=y *
config-Mon_May__2_13_30_09_CEST_2011.good
config-Mon_May__2_15_02_43_CEST_2011.good
config-Mon_May__2_15_59_10_CEST_2011.good
config-Mon_May__2_16_19_49_CEST_2011.good
config-Mon_May__2_16_41_15_CEST_2011.good
config-Mon_May__2_18_10_56_CEST_2011.good
config-Mon_May__2_19_07_32_CEST_2011.good
config-Mon_May__2_19_28_56_CEST_2011.good
config-Mon_May__2_19_52_30_CEST_2011.good
config-Mon_May__2_21_40_44_CEST_2011.good
config-Mon_May__2_21_52_04_CEST_2011.good
config-Mon_May__2_23_43_29_CEST_2011.good
config-Sun_May__1_21_21_06_CEST_2011.good
config-Sun_May__1_22_15_02_CEST_2011.good
config-Tue_May__3_14_45_15_CEST_2011.good

(Note: CONFIG_X86_ELAN got renamed to CONFIG_MELAN recently.)

So whatever the breakage is, it's not just some 'nobody tests CONFIG_MELAN=y'
easy breakage, but possibly something more configuration (or test) specific.

Thanks,

Ingo

2011-05-03 20:18:29

[permalink] [raw]

Subject: Re: 2.6.39-rc5-git2 boot crashs

On Tue, May 3, 2011 at 12:08 PM, Ingo Molnar <[email protected]> wrote:
>
> I have no smart ideas straight away - just an observation: i keep testing
> CONFIG_X86_ELAN=y on real hardware, and it's enabled in about 4% of my configs:

So how often do you do more than just boot?

Because this problem doesn't happen at boot-time, the "bad IO" issue
seems to happen only after some real filesystem work (ie the "unzip a
140MB file" etc). In one of the logs, the SATA driver issue ended up
happening four hours after boot (but I think Werner could trigger it
in minutes by doing heavy FS work).

That's why I'd suspect some subtle race or other and/or memory
allocation re-use.

The subject line "boot crashs" is misleading, because that's the
unrelated logfs thing that I have a tentative patch for.

That said, I don't really see why ELAN would be so special.

Werner: can you test without X86_EXTENDED_PLATFORM, but then picking
the i486 config instead? That should approximate the ELAN
configuration pretty closely, at least in things like compiler flags..

Linus

2011-05-03 20:21:15

by H. Peter Anvin

[permalink] [raw]

Subject: Re: 2.6.39-rc5-git2 boot crashs

On 05/03/2011 01:17 PM, Linus Torvalds wrote:
>
> That said, I don't really see why ELAN would be so special.
>

ELAN used to have a nonstandard A20 enabling sequence, until someone
found out that changes to the mainline sequence had made ELAN work as a
side effect.

At this point, it's just a CPU selection thing, so it makes very little
sense.

-hpa

2011-05-03 20:51:10

[permalink] [raw]

Subject: Re: 2.6.39-rc5-git2 boot crashs

* H. Peter Anvin <[email protected]> wrote:

> On 05/03/2011 01:17 PM, Linus Torvalds wrote:
> >
> > That said, I don't really see why ELAN would be so special.
> >
>
> ELAN used to have a nonstandard A20 enabling sequence, until someone
> found out that changes to the mainline sequence had made ELAN work as a
> side effect.
>
> At this point, it's just a CPU selection thing, so it makes very little
> sense.

the ELAN .config option influences the following details:

- sets X86_L1_CACHE_SHIFT to 4 (16 bytes) instead of the typical 6 (64 bytes)
- sets X86_ALIGNMENT_16
- sets the -march=i486 compiler flag

So in terms of the kernel image it seems to be mostly equivalent to selecting
i486 from the CPU menu - except the X86_ALIGNMENT_16 detail (which does not
seem to do anything substantive, AFAICS).

Thanks,

Ingo

2011-05-03 21:46:07

[permalink] [raw]

Subject: Re: 2.6.39-rc5-git2 boot crashs

On Tue, May 3, 2011 at 1:50 PM, Ingo Molnar <[email protected]> wrote:
>
> the ELAN .config option influences the following details:
>
> ?- sets X86_L1_CACHE_SHIFT to 4 (16 bytes) instead of the typical 6 (64 bytes)
> ?- sets X86_ALIGNMENT_16
> ?- sets the -march=i486 compiler flag

It also does this to the config diff:

306,307c328,332
< CONFIG_X86_ALIGNMENT_16=y
< CONFIG_X86_MINIMUM_CPU_FAMILY=4
---
> CONFIG_X86_USE_PPRO_CHECKSUM=y
> CONFIG_X86_TSC=y
> CONFIG_X86_CMPXCHG64=y
> CONFIG_X86_CMOV=y
> CONFIG_X86_MINIMUM_CPU_FAMILY=5

because of all the indirect changes it causes.

Now, Werner is actually _running_ on an AMD Opteron (or whatever
family 15 is), I think. And his kernel is SMP-enabled. And that whole
thin looks really really iffy.

How/why do we even allow that combination of SMP and (for example)
arch/x86/lib/atomic64_386_32.S to be picked?

I don't think he actually runs SMP, but the fact that we even allow
that combination looks really odd/iffy. Am I missing something?

Linus

2011-05-03 22:03:01

by H. Peter Anvin

[permalink] [raw]

Subject: Re: 2.6.39-rc5-git2 boot crashs

On 05/03/2011 02:45 PM, Linus Torvalds wrote:
> On Tue, May 3, 2011 at 1:50 PM, Ingo Molnar <[email protected]> wrote:
>>
>> the ELAN .config option influences the following details:
>>
>> - sets X86_L1_CACHE_SHIFT to 4 (16 bytes) instead of the typical 6 (64 bytes)
>> - sets X86_ALIGNMENT_16
>> - sets the -march=i486 compiler flag
>
> It also does this to the config diff:
>
> 306,307c328,332
> < CONFIG_X86_ALIGNMENT_16=y
> < CONFIG_X86_MINIMUM_CPU_FAMILY=4
> ---
> > CONFIG_X86_USE_PPRO_CHECKSUM=y
> > CONFIG_X86_TSC=y
> > CONFIG_X86_CMPXCHG64=y
> > CONFIG_X86_CMOV=y
> > CONFIG_X86_MINIMUM_CPU_FAMILY=5
>
> because of all the indirect changes it causes.
>
> Now, Werner is actually _running_ on an AMD Opteron (or whatever
> family 15 is), I think. And his kernel is SMP-enabled. And that whole
> thin looks really really iffy.
>
> How/why do we even allow that combination of SMP and (for example)
> arch/x86/lib/atomic64_386_32.S to be picked?
>
> I don't think he actually runs SMP, but the fact that we even allow
> that combination looks really odd/iffy. Am I missing something?
>

We would end up in those paths before alternatives are run, but
alternatives should be run before we start the second processor.

-hpa

2011-05-04 04:13:32

by werner

[permalink] [raw]

Subject: Re: 2.6.39-rc5-git2 boot crashs

The result of the test below is:

The big-file-zip / unzip / very-big-file-moving crashs
returned

The sync error didn't return

after switching on CONFIG-X86_EXTENDED_PLATFORMS inclusive
CONFIG_ELAN, and also let ON the RDC321X_WDT,
ELAN-CPUFREQ, CONFIG-SC500-CPUFREQ (BTREE switched off
anyway).

Thus, the first problem (zip..) cames from switching on
EXTENDED PLATFORMS etc , but the second (sync) comes from
MISC-FILESYSTEMS of which a part incl logfs continued
unchanged switched OFF, like also the (not-crashing) error
message came from this.

As said, before was used a config (with EXTENDED_PLATFORMS
etc OFF) which yesterday runned stable almost the whole
day and didnt show any of the reported problems. But
after this mail, for better security, I'll return to it
and try to provoke a crash (unzipping bigger files etc) to
reaffirm that its really stable.

Annexed are:

/ config used
/ The difference of that config to the previous stable
one. The zip error should be caused by these differences
/ The difference of the previous stable config, to the
initial everythingyes config. If this stable config was
really stable, then all problems reported by me since
2.6.35-rc3 should be caused by these differences. (this
file I sent already, it's for comparison, and as a
summary)

W.Landgraf

=====================
I started already the compilation suggested before, ie let
CONFIG_MISC_FILESYSTEMS unchanged like it was at the last
stable config, with logfs OFF,
but switching on CONFIG-X86_EXTENDED_PLATFORMS, and also
let ON the RDC321X_WDT,
ELAN-CPUFREQ, CONFIG-SC500-CPUFREQ (BTREE switched off
anyway).
Thus, with exception of the items switched off in
MISC-FILESYSTEMS (incl. logfs),
it's the same like on my first everyyesconfig.

If the problems don't return, then the reason is anywhere
in MISC-FILESYSTEMS, one
of the few features which I switched off now, inclusive
logfs.

If the problems return, then they are because of the
EXTENDED-PLATFORMS (incl. ELAN) or in one of the 3 items
quoted above. However, then still the reported boot error
message can be because logfs.

After this is ready, I report about the results, and I do
what you write below.

wl
---
Professional hosting for everyone - http://www.host.ru

Attachments:

diff.zipproblem (1.49 kB)
config.zipproblem (134.26 kB)
diff.config.txt (2.61 kB)
Download all attachments

2011-05-04 06:15:18

[permalink] [raw]

Subject: Re: 2.6.39-rc5-git2 boot crashs

* werner <[email protected]> wrote:

> Enclosed also the screen fotos of the zip crash from now, and of the
> secondary reboot-resistent ata error which came back too.

All the oopses have scrolled off and the pictures are blurry, so it's not
possible to see the backtraces clearly.

To prevent the scroll-off please add these boot options:

pause_on_oops=600 initcall_debug debug ignore_loglevel

This should rate-limit the crash to one oops per 600 seconds. The first one is
probably the most interesting one.

Also, does the vga=ask boot option added to the ones above work for you?

If yes then on bootup select a VGA mode that has as many lines as possible (to
capture a big oops screen) - but in this case it's doubly important that the
picture you take is *sharp*.

Thanks,

Ingo

2011-05-04 07:19:37

by Borislav Petkov

[permalink] [raw]

Subject: Re: 2.6.39-rc5-git2 boot crashs

On Tue, May 03, 2011 at 02:45:10PM -0700, Linus Torvalds wrote:
> Now, Werner is actually _running_ on an AMD Opteron (or whatever
> family 15 is), I think.

http://marc.info/?l=linux-kernel&m=130438580705332 says it's a AMD
Athlon(tm) 64 X2 Dual Core Processor 4800+, i.e. a good old K8 desktop.
The X86_ELAN Kconfig option, however, says:

config X86_ELAN
bool "AMD Elan"
depends on X86_32
depends on X86_EXTENDED_PLATFORM
---help---
Select this for an AMD Elan processor.

Do not use this option for K6/Athlon/Opteron processors!

If unsure, choose "PC-compatible" instead.

so if the sentence before last used to mean anything, this could be a
problem. Quick search about it gives http://lkml.org/lkml/2004/1/12/239
which introduces that different compiler arch for ELAN: -march=i486,
which could conflict with the generic selection?

Hmm.

--
Regards/Gruss,
Boris.

2011-05-04 07:39:17

[permalink] [raw]

Subject: Re: 2.6.39-rc5-git2 boot crashs

* Borislav Petkov <[email protected]> wrote:

> On Tue, May 03, 2011 at 02:45:10PM -0700, Linus Torvalds wrote:
> > Now, Werner is actually _running_ on an AMD Opteron (or whatever
> > family 15 is), I think.
>
> http://marc.info/?l=linux-kernel&m=130438580705332 says it's a AMD
> Athlon(tm) 64 X2 Dual Core Processor 4800+, i.e. a good old K8 desktop.
> The X86_ELAN Kconfig option, however, says:
>
> config X86_ELAN
> bool "AMD Elan"
> depends on X86_32
> depends on X86_EXTENDED_PLATFORM
> ---help---
> Select this for an AMD Elan processor.
>
> Do not use this option for K6/Athlon/Opteron processors!
>
> If unsure, choose "PC-compatible" instead.
>
> so if the sentence before last used to mean anything, this could be a
> problem. Quick search about it gives http://lkml.org/lkml/2004/1/12/239
> which introduces that different compiler arch for ELAN: -march=i486,
> which could conflict with the generic selection?

Well, but CONFIG_X86_ELAN=y always worked/booted fine on generic hardware -
including later AMD CPUs. I have booted it on an AMD Athlon64 CPU today:

config-Wed_May__4_09_26_50_CEST_2011.good:CONFIG_X86_32_SMP=y
config-Wed_May__4_09_26_50_CEST_2011.good:CONFIG_MELAN=y
config-Wed_May__4_09_26_50_CEST_2011.good:CONFIG_SMP=y

When it comes to regressions it's past behavior that counts, not the Kconfig
help text.

Thanks,

Ingo

2011-05-04 07:55:10

by Borislav Petkov

[permalink] [raw]

Subject: Re: 2.6.39-rc5-git2 boot crashs

On Wed, May 04, 2011 at 09:38:54AM +0200, Ingo Molnar wrote:
>
> * Borislav Petkov <[email protected]> wrote:
>
> > On Tue, May 03, 2011 at 02:45:10PM -0700, Linus Torvalds wrote:
> > > Now, Werner is actually _running_ on an AMD Opteron (or whatever
> > > family 15 is), I think.
> >
> > http://marc.info/?l=linux-kernel&m=130438580705332 says it's a AMD
> > Athlon(tm) 64 X2 Dual Core Processor 4800+, i.e. a good old K8 desktop.
> > The X86_ELAN Kconfig option, however, says:
> >
> > config X86_ELAN
> > bool "AMD Elan"
> > depends on X86_32
> > depends on X86_EXTENDED_PLATFORM
> > ---help---
> > Select this for an AMD Elan processor.
> >
> > Do not use this option for K6/Athlon/Opteron processors!
> >
> > If unsure, choose "PC-compatible" instead.
> >
> > so if the sentence before last used to mean anything, this could be a
> > problem. Quick search about it gives http://lkml.org/lkml/2004/1/12/239
> > which introduces that different compiler arch for ELAN: -march=i486,
> > which could conflict with the generic selection?
>
> Well, but CONFIG_X86_ELAN=y always worked/booted fine on generic hardware -
> including later AMD CPUs. I have booted it on an AMD Athlon64 CPU today:
>
> config-Wed_May__4_09_26_50_CEST_2011.good:CONFIG_X86_32_SMP=y
> config-Wed_May__4_09_26_50_CEST_2011.good:CONFIG_MELAN=y
> config-Wed_May__4_09_26_50_CEST_2011.good:CONFIG_SMP=y
>
> When it comes to regressions it's past behavior that counts, not the Kconfig
> help text.

Right, I was referring to the odd problem Linus was talking about:

"He had CONFIG_X86_EXTENDED_PLATTFORM and CONFIG_X86_ELAN on, and that
has apparently worked for him (and a lot of other people - he does a
distribution) up until 2.6.38. But as of 2.6.39-rc1 it causes some
really odd problems under IO (his test-case is "unzip", but that's
probably fairly random). The problem seems to show up as a bogus IO list
for SATA, causing a big WARN_ON() or oops and then a dead machine due to
IO problems."

So booting might've not triggered it. But reportedly .38 was fine so
yeah, the Kconfig help text might not even mean anything anymore.

Maybe I should run the same .config and test case on a K8 box here to
see what happens.

@Werner: can you send me the exact .config and the testcase that
triggers the issue?

Thanks.

--
Regards/Gruss,
Boris.

2011-05-04 08:37:03

[permalink] [raw]

Subject: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Linus Torvalds <[email protected]> wrote:

> On Tue, May 3, 2011 at 12:08 PM, Ingo Molnar <[email protected]> wrote:
> >
> > I have no smart ideas straight away - just an observation: i keep testing
> > CONFIG_X86_ELAN=y on real hardware, and it's enabled in about 4% of my
> > configs:
>
> So how often do you do more than just boot?

Not very often - but 'to boot' means a certain amount of filesystem work - and
so does the 'prepare to boot the next kernel' step.

So i took Werner's .config.zipproblem and modified it to make it bootable:
removed CONFIG_ROOT_NFS=y and disabled CONFIG_IDE - both of which keep my box
from booting. I've attached an updated .config.zipproblem2 file: Werner, can
you confirm that this still fails for you?

So i booted v2.6.39-rc5-254-g5933f2a on an AMD box (which is SMP in fact, so
should trigger races even faster):

Kernel 2.6.39-rc5-i486-1sys+ on an i686

and started a couple of such IO-intense loops:

FILE=bigfile.$RANDOM;
while sync; do rm -f $FILE; dd if=/dev/urandom of=$FILE bs=1000 count=10000; done &

this creates patterns of high IO combined with periods waiting for IO to flush:

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 0 0 621140 8968 346500 0 0 29 874 1044 135 0 27 71 2 0
8 0 0 614692 8984 352904 0 0 0 0 2011 658 0 100 0 0 0
8 0 0 608492 8992 359104 0 0 0 0 2096 709 1 99 0 0 0
8 0 0 602016 8992 365332 0 0 0 0 2006 651 0 100 0 0 0
8 0 0 595692 8992 371760 0 0 0 0 2005 647 0 100 0 0 0
4 5 0 588128 9000 378080 0 0 0 31508 2031 610 0 100 0 0 0
6 2 0 583168 9068 383664 0 0 4 7532 2256 668 0 91 0 9 0
8 0 0 586888 9116 380212 0 0 0 456 2073 804 1 100 0 0 0
5 3 0 620484 9408 346904 0 0 0 14932 2768 1901 0 94 1 6 0
8 0 0 633656 9576 333392 0 0 0 6180 2369 1232 0 98 0 2 0
8 0 0 627208 9576 339720 0 0 0 0 2004 650 0 100 0 0 0
8 0 0 621008 9580 346032 0 0 0 0 2048 680 1 100 0 0 0
8 0 0 614436 9588 352296 0 0 0 0 2004 657 0 100 0 0 0
8 0 0 618652 9708 348296 0 0 0 25984 2283 816 0 92 0 7 0
8 0 0 612424 9708 354576 0 0 0 0 2005 651 0 100 0 0 0
8 0 0 605976 9716 360892 0 0 0 0 2004 652 0 100 0 0 0
8 0 0 599652 9716 367208 0 0 0 0 2006 654 0 100 0 0 0
8 0 0 593204 9720 373528 0 0 0 0 2004 652 0 100 0 0 0
7 2 0 585764 9728 379852 0 0 0 31612 2038 649 0 100 0 0 0
8 0 0 590104 9832 376180 0 0 4 7400 2274 920 0 98 0 2 0
2 6 0 597288 10008 368892 0 0 0 8636 2432 1373 0 95 0 5 0
8 0 0 627180 10208 339096 0 0 0 4324 2426 1429 0 97 0 2 0
8 0 0 630404 10284 335480 0 0 0 6408 2202 970 1 97 0 2 0

And indeed, after a couple of minutes testing i triggered this beauty:

BUG: unable to handle kernel NULL pointer dereference at 00000008
IP: [<c14e85c0>] generic_make_request+0x86/0x3f4
*pde = 00000000
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:0a.0/net/eth0/address
Modules linked in:

Pid: 2969, comm: flush-8:0 Not tainted 2.6.39-rc5-i486-1sys+ #122580 System manufacturer System Product Name/A8N-E
EIP: 0060:[<c14e85c0>] EFLAGS: 00010202 CPU: 1
EIP is at generic_make_request+0x86/0x3f4
EAX: 00000000 EBX: f569e280 ECX: f6a00000 EDX: f5528000
ESI: 00000008 EDI: 00000001 EBP: f5fd7c8c ESP: f5fd7c14
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process flush-8:0 (pid: 2969, ti=f5fd6000 task=f64d0ab0 task.ti=f5fd6000)
Stack:
f569e280 c10988af f5fd7c84 c1d74030 f6402400 f569e300 00081ddb f569e280
c10988af f5fd7c84 f64d0ab0 0000007b 0000007b 000000d8 00000008 ffffffc1
00000000 00000000 00000246 f6a06c80 055c08fe 00000000 c10988af 00000080
Call Trace:
[<c10988af>] ? mempool_alloc_slab+0x13/0x15
[<c1d74030>] ? common_interrupt+0x30/0x40
[<c10988af>] ? mempool_alloc_slab+0x13/0x15
[<c10988af>] ? mempool_alloc_slab+0x13/0x15
[<c14e89f9>] submit_bio+0xcb/0xe4
[<c10a8048>] ? inc_zone_page_state+0xe/0x88
[<c10ed8b5>] ? bio_init+0x9/0x2e
[<c10ee05b>] ? bio_alloc_bioset+0x3c/0x9c
[<c10ea27a>] submit_bh+0xc6/0xe0
[<c10eb8f3>] __block_write_full_page+0x20a/0x2df
[<c10ed9d9>] ? bio_put+0x8/0x2c
[<c10ea9e5>] ? end_buffer_async_read+0xd5/0xd5
[<c10eba88>] block_write_full_page_endio+0xc0/0xc8
[<c10ea9e5>] ? end_buffer_async_read+0xd5/0xd5
[<c10ebaa7>] block_write_full_page+0x17/0x19
[<c10ea9e5>] ? end_buffer_async_read+0xd5/0xd5
[<c1156fd0>] ext3_ordered_writepage+0xc8/0x19c
[<c11566ee>] ? bput_one+0x10/0x10
[<c109c9ce>] __writepage+0x10/0x28
[<c109cd96>] write_cache_pages+0x1c9/0x283
[<c109c9be>] ? bdi_set_max_ratio+0x52/0x52
[<c109ce86>] generic_writepages+0x36/0x49
[<c109d994>] do_writepages+0x28/0x2b
[<c10e5642>] writeback_single_inode+0xa6/0x18d
[<c10e58ef>] writeback_sb_inodes+0xa6/0x10b
[<c10e6267>] writeback_inodes_wb+0xd9/0xee
[<c10e642b>] wb_writeback+0x1af/0x26d
[<c1045c12>] ? try_to_del_timer_sync+0x81/0x89
[<c103faca>] ? local_bh_disable+0x8/0x18
[<c10e655a>] wb_do_writeback+0x71/0x181
[<c1045dc0>] ? add_timer_on+0x95/0x95
[<c1045cb1>] ? del_timer+0xc/0x86
[<c10e66d8>] bdi_writeback_thread+0x6e/0x186
[<c10e666a>] ? wb_do_writeback+0x181/0x181
[<c1051f73>] kthread+0x67/0x6c
[<c1051f0c>] ? kthread_worker_fn+0x114/0x114
[<c1d74046>] kernel_thread_helper+0x6/0x10
Code: 00 c7 45 c8 00 00 00 00 8d 55 c8 89 90 f4 02 00 00 89 45 b0 8b 53 20 c1 ea 09 89 55 c0 e8 e3 8c 88 00 83 7d c0 00 74 69 8b 43 0c <8b> 40 08 8b 90 84 00 00 00 f6 c2 01 74 04 f3 90 eb f1 8b 70 7c
EIP: [<c14e85c0>] generic_make_request+0x86/0x3f4 SS:ESP 0068:f5fd7c14
CR2: 0000000000000008
---[ end trace c45e837de578cd2f ]---
------------[ cut here ]------------

( the full crashlog is attached as well - hardware details can be found there
although i doubt it matters. )

Seems to be a generic IO/BDI badness at first sight. It gives me the appearance
of a race or boundary condition bug, not that of memory corruption.

Here's the crashing generic_make_request() function:

c14e853a <generic_make_request>:
c14e853a: 55 push %ebp
c14e853b: 89 e5 mov %esp,%ebp
c14e853d: 57 push %edi
c14e853e: 56 push %esi
c14e853f: 53 push %ebx
c14e8540: 83 ec 6c sub $0x6c,%esp
c14e8543: e8 08 bb 88 00 call c1d74050 <mcount>
c14e8548: 89 c3 mov %eax,%ebx
c14e854a: 65 a1 14 00 00 00 mov %gs:0x14,%eax
c14e8550: 89 45 f0 mov %eax,-0x10(%ebp)
c14e8553: 31 c0 xor %eax,%eax
c14e8555: 64 a1 c4 c4 38 c2 mov %fs:0xc238c4c4,%eax
c14e855b: 83 b8 f4 02 00 00 00 cmpl $0x0,0x2f4(%eax)
c14e8562: 74 23 je c14e8587 <generic_make_request+0x4d>
c14e8564: 8b 80 f4 02 00 00 mov 0x2f4(%eax),%eax
c14e856a: c7 43 08 00 00 00 00 movl $0x0,0x8(%ebx)
c14e8571: 8b 50 04 mov 0x4(%eax),%edx
c14e8574: 85 d2 test %edx,%edx
c14e8576: 74 05 je c14e857d <generic_make_request+0x43>
c14e8578: 89 5a 08 mov %ebx,0x8(%edx)
c14e857b: eb 02 jmp c14e857f <generic_make_request+0x45>
c14e857d: 89 18 mov %ebx,(%eax)
c14e857f: 89 58 04 mov %ebx,0x4(%eax)
c14e8582: e9 8e 03 00 00 jmp c14e8915 <generic_make_request+0x3db>
c14e8587: 83 7b 08 00 cmpl $0x0,0x8(%ebx)
c14e858b: 74 02 je c14e858f <generic_make_request+0x55>
c14e858d: 0f 0b ud2
c14e858f: c7 45 cc 00 00 00 00 movl $0x0,-0x34(%ebp)
c14e8596: c7 45 c8 00 00 00 00 movl $0x0,-0x38(%ebp)
c14e859d: 8d 55 c8 lea -0x38(%ebp),%edx
c14e85a0: 89 90 f4 02 00 00 mov %edx,0x2f4(%eax)
c14e85a6: 89 45 b0 mov %eax,-0x50(%ebp)
c14e85a9: 8b 53 20 mov 0x20(%ebx),%edx
c14e85ac: c1 ea 09 shr $0x9,%edx
c14e85af: 89 55 c0 mov %edx,-0x40(%ebp)
c14e85b2: e8 e3 8c 88 00 call c1d7129a <_cond_resched>
c14e85b7: 83 7d c0 00 cmpl $0x0,-0x40(%ebp)
c14e85bb: 74 69 je c14e8626 <generic_make_request+0xec>
c14e85bd: 8b 43 0c mov 0xc(%ebx),%eax
c14e85c0: 8b 40 08 mov 0x8(%eax),%eax
c14e85c3: 8b 90 84 00 00 00 mov 0x84(%eax),%edx
c14e85c9: f6 c2 01 test $0x1,%dl
c14e85cc: 74 04 je c14e85d2 <generic_make_request+0x98>
c14e85ce: f3 90 pause
c14e85d0: eb f1 jmp c14e85c3 <generic_make_request+0x89>
c14e85d2: 8b 70 7c mov 0x7c(%eax),%esi
c14e85d5: 8b b8 80 00 00 00 mov 0x80(%eax),%edi
c14e85db: 39 90 84 00 00 00 cmp %edx,0x84(%eax)
c14e85e1: 75 e0 jne c14e85c3 <generic_make_request+0x89>
c14e85e3: 89 f0 mov %esi,%eax
c14e85e5: 89 fa mov %edi,%edx
c14e85e7: 0f ac d0 09 shrd $0x9,%edx,%eax
c14e85eb: c1 fa 09 sar $0x9,%edx
c14e85ee: 89 d1 mov %edx,%ecx
c14e85f0: 09 c1 or %eax,%ecx
c14e85f2: 74 32 je c14e8626 <generic_make_request+0xec>
c14e85f4: 8b 0b mov (%ebx),%ecx
c14e85f6: 89 4d c4 mov %ecx,-0x3c(%ebp)
c14e85f9: 8b 4b 04 mov 0x4(%ebx),%ecx
c14e85fc: 8b 75 c0 mov -0x40(%ebp),%esi
c14e85ff: 31 ff xor %edi,%edi
c14e8601: 83 fa 00 cmp $0x0,%edx
c14e8604: 77 09 ja c14e860f <generic_make_request+0xd5>
c14e8606: 3b 45 c0 cmp -0x40(%ebp),%eax
c14e8609: 0f 82 21 02 00 00 jb c14e8830 <generic_make_request+0x2f6>
c14e860f: 29 f0 sub %esi,%eax
c14e8611: 19 fa sbb %edi,%edx
c14e8613: 39 ca cmp %ecx,%edx
c14e8615: 77 0f ja c14e8626 <generic_make_request+0xec>
c14e8617: 0f 82 13 02 00 00 jb c14e8830 <generic_make_request+0x2f6>
c14e861d: 3b 45 c4 cmp -0x3c(%ebp),%eax
c14e8620: 0f 82 0a 02 00 00 jb c14e8830 <generic_make_request+0x2f6>
c14e8626: c7 45 b4 00 00 00 00 movl $0x0,-0x4c(%ebp)
c14e862d: c7 45 b8 ff ff ff ff movl $0xffffffff,-0x48(%ebp)
c14e8634: c7 45 bc ff ff ff ff movl $0xffffffff,-0x44(%ebp)
c14e863b: 8b 45 c0 mov -0x40(%ebp),%eax
c14e863e: 89 45 9c mov %eax,-0x64(%ebp)
c14e8641: c7 45 a0 00 00 00 00 movl $0x0,-0x60(%ebp)
c14e8648: 8b 43 0c mov 0xc(%ebx),%eax
c14e864b: 89 45 98 mov %eax,-0x68(%ebp)
c14e864e: 89 c2 mov %eax,%edx
c14e8650: 8b 40 58 mov 0x58(%eax),%eax
c14e8653: 8b 80 c8 01 00 00 mov 0x1c8(%eax),%eax
c14e8659: 89 45 c4 mov %eax,-0x3c(%ebp)
c14e865c: 85 c0 test %eax,%eax
c14e865e: 75 33 jne c14e8693 <generic_make_request+0x159>
c14e8660: 89 d1 mov %edx,%ecx
c14e8662: 8b 33 mov (%ebx),%esi
c14e8664: 8b 7b 04 mov 0x4(%ebx),%edi
c14e8667: 8d 55 d0 lea -0x30(%ebp),%edx
c14e866a: 89 c8 mov %ecx,%eax
c14e866c: e8 27 8c c2 ff call c1111298 <bdevname>
c14e8671: 89 74 24 08 mov %esi,0x8(%esp)
c14e8675: 89 7c 24 0c mov %edi,0xc(%esp)
c14e8679: 89 44 24 04 mov %eax,0x4(%esp)
c14e867d: c7 04 24 52 b5 01 c2 movl $0xc201b552,(%esp)
c14e8684: e8 79 17 87 00 call c1d59e02 <printk>
c14e8689: ba fb ff ff ff mov $0xfffffffb,%edx
c14e868e: e9 43 02 00 00 jmp c14e88d6 <generic_make_request+0x39c>
c14e8693: f6 43 14 40 testb $0x40,0x14(%ebx)
c14e8697: 75 36 jne c14e86cf <generic_make_request+0x195>
c14e8699: 8b 4d c4 mov -0x3c(%ebp),%ecx
c14e869c: 8b b9 3c 02 00 00 mov 0x23c(%ecx),%edi
c14e86a2: 39 7d c0 cmp %edi,-0x40(%ebp)
c14e86a5: 76 28 jbe c14e86cf <generic_make_request+0x195>
c14e86a7: 8b 4d 98 mov -0x68(%ebp),%ecx
c14e86aa: 8b 73 20 mov 0x20(%ebx),%esi
c14e86ad: c1 ee 09 shr $0x9,%esi
c14e86b0: 8d 55 d0 lea -0x30(%ebp),%edx
c14e86b3: 89 c8 mov %ecx,%eax
c14e86b5: e8 de 8b c2 ff call c1111298 <bdevname>
c14e86ba: 89 7c 24 0c mov %edi,0xc(%esp)
c14e86be: 89 74 24 08 mov %esi,0x8(%esp)
c14e86c2: 89 44 24 04 mov %eax,0x4(%esp)
c14e86c6: c7 04 24 9f b5 01 c2 movl $0xc201b59f,(%esp)
c14e86cd: eb b5 jmp c14e8684 <generic_make_request+0x14a>
c14e86cf: 8b 55 c4 mov -0x3c(%ebp),%edx
c14e86d2: 8b 82 a4 01 00 00 mov 0x1a4(%edx),%eax
c14e86d8: a8 20 test $0x20,%al
c14e86da: 75 ad jne c14e8689 <generic_make_request+0x14f>
c14e86dc: 8b 43 20 mov 0x20(%ebx),%eax
c14e86df: c1 e8 09 shr $0x9,%eax
c14e86e2: 0f 84 80 00 00 00 je c14e8768 <generic_make_request+0x22e>
c14e86e8: 8b 45 98 mov -0x68(%ebp),%eax
c14e86eb: 3b 40 44 cmp 0x44(%eax),%eax
c14e86ee: 74 78 je c14e8768 <generic_make_request+0x22e>
c14e86f0: 8b 50 4c mov 0x4c(%eax),%edx
c14e86f3: 89 55 a8 mov %edx,-0x58(%ebp)
c14e86f6: 8b 32 mov (%edx),%esi
c14e86f8: 8b 7a 04 mov 0x4(%edx),%edi
c14e86fb: 03 33 add (%ebx),%esi
c14e86fd: 13 7b 04 adc 0x4(%ebx),%edi
c14e8700: 89 33 mov %esi,(%ebx)
c14e8702: 89 7b 04 mov %edi,0x4(%ebx)
c14e8705: 8b 40 44 mov 0x44(%eax),%eax
c14e8708: 89 45 a4 mov %eax,-0x5c(%ebp)
c14e870b: 89 43 0c mov %eax,0xc(%ebx)
c14e870e: 89 f2 mov %esi,%edx
c14e8710: 89 f9 mov %edi,%ecx
c14e8712: 8b 45 a8 mov -0x58(%ebp),%eax
c14e8715: 2b 10 sub (%eax),%edx
c14e8717: 1b 48 04 sbb 0x4(%eax),%ecx
c14e871a: 89 55 a8 mov %edx,-0x58(%ebp)
c14e871d: 89 4d ac mov %ecx,-0x54(%ebp)
c14e8720: 8b 45 98 mov -0x68(%ebp),%eax
c14e8723: 8b 38 mov (%eax),%edi
c14e8725: 8b 55 a4 mov -0x5c(%ebp),%edx
c14e8728: 8b 42 58 mov 0x58(%edx),%eax
c14e872b: 8b 80 c8 01 00 00 mov 0x1c8(%eax),%eax
c14e8731: 89 45 a4 mov %eax,-0x5c(%ebp)
c14e8734: e9 00 00 00 00 jmp c14e8739 <generic_make_request+0x1ff>
c14e8739: eb 2d jmp c14e8768 <generic_make_request+0x22e>
c14e873b: 8b 35 84 4e 2a c2 mov 0xc22a4e84,%esi
c14e8741: 85 f6 test %esi,%esi
c14e8743: 74 23 je c14e8768 <generic_make_request+0x22e>
c14e8745: 8b 46 04 mov 0x4(%esi),%eax
c14e8748: 8b 55 a8 mov -0x58(%ebp),%edx
c14e874b: 8b 4d ac mov -0x54(%ebp),%ecx
c14e874e: 89 54 24 04 mov %edx,0x4(%esp)
c14e8752: 89 4c 24 08 mov %ecx,0x8(%esp)
c14e8756: 89 3c 24 mov %edi,(%esp)
c14e8759: 89 d9 mov %ebx,%ecx
c14e875b: 8b 55 a4 mov -0x5c(%ebp),%edx
c14e875e: ff 16 call *(%esi)
c14e8760: 83 c6 08 add $0x8,%esi
c14e8763: 83 3e 00 cmpl $0x0,(%esi)
c14e8766: eb db jmp c14e8743 <generic_make_request+0x209>
c14e8768: 89 d8 mov %ebx,%eax
c14e876a: e8 a1 a2 c0 ff call c10f2a10 <bio_integrity_enabled>
c14e876f: 85 c0 test %eax,%eax
c14e8771: 74 0f je c14e8782 <generic_make_request+0x248>
c14e8773: 89 d8 mov %ebx,%eax
c14e8775: e8 06 a4 c0 ff call c10f2b80 <bio_integrity_prep>
c14e877a: 85 c0 test %eax,%eax
c14e877c: 0f 85 07 ff ff ff jne c14e8689 <generic_make_request+0x14f>
c14e8782: 83 7d bc ff cmpl $0xffffffff,-0x44(%ebp)
c14e8786: 75 06 jne c14e878e <generic_make_request+0x254>
c14e8788: 83 7d b8 ff cmpl $0xffffffff,-0x48(%ebp)
c14e878c: 74 37 je c14e87c5 <generic_make_request+0x28b>
c14e878e: e9 00 00 00 00 jmp c14e8793 <generic_make_request+0x259>
c14e8793: eb 30 jmp c14e87c5 <generic_make_request+0x28b>
c14e8795: 8b 35 84 4e 2a c2 mov 0xc22a4e84,%esi
c14e879b: 85 f6 test %esi,%esi
c14e879d: 74 26 je c14e87c5 <generic_make_request+0x28b>
c14e879f: 8b 46 04 mov 0x4(%esi),%eax
c14e87a2: 8b 55 b8 mov -0x48(%ebp),%edx
c14e87a5: 8b 4d bc mov -0x44(%ebp),%ecx
c14e87a8: 89 54 24 04 mov %edx,0x4(%esp)
c14e87ac: 89 4c 24 08 mov %ecx,0x8(%esp)
c14e87b0: 8b 4d b4 mov -0x4c(%ebp),%ecx
c14e87b3: 89 0c 24 mov %ecx,(%esp)
c14e87b6: 89 d9 mov %ebx,%ecx
c14e87b8: 8b 55 c4 mov -0x3c(%ebp),%edx
c14e87bb: ff 16 call *(%esi)
c14e87bd: 83 c6 08 add $0x8,%esi
c14e87c0: 83 3e 00 cmpl $0x0,(%esi)
c14e87c3: eb d8 jmp c14e879d <generic_make_request+0x263>
c14e87c5: 8b 03 mov (%ebx),%eax
c14e87c7: 8b 53 04 mov 0x4(%ebx),%edx
c14e87ca: 89 45 b8 mov %eax,-0x48(%ebp)
c14e87cd: 89 55 bc mov %edx,-0x44(%ebp)
c14e87d0: 8b 43 0c mov 0xc(%ebx),%eax
c14e87d3: 8b 10 mov (%eax),%edx
c14e87d5: 89 55 b4 mov %edx,-0x4c(%ebp)
c14e87d8: 83 7d c0 00 cmpl $0x0,-0x40(%ebp)
c14e87dc: 74 5e je c14e883c <generic_make_request+0x302>
c14e87de: 8b 50 08 mov 0x8(%eax),%edx
c14e87e1: 8b 82 84 00 00 00 mov 0x84(%edx),%eax
c14e87e7: a8 01 test $0x1,%al
c14e87e9: 74 04 je c14e87ef <generic_make_request+0x2b5>
c14e87eb: f3 90 pause
c14e87ed: eb f2 jmp c14e87e1 <generic_make_request+0x2a7>
c14e87ef: 8b 72 7c mov 0x7c(%edx),%esi
c14e87f2: 8b ba 80 00 00 00 mov 0x80(%edx),%edi
c14e87f8: 39 82 84 00 00 00 cmp %eax,0x84(%edx)
c14e87fe: 75 e1 jne c14e87e1 <generic_make_request+0x2a7>
c14e8800: 89 f0 mov %esi,%eax
c14e8802: 89 fa mov %edi,%edx
c14e8804: 0f ac d0 09 shrd $0x9,%edx,%eax
c14e8808: c1 fa 09 sar $0x9,%edx
c14e880b: 89 d1 mov %edx,%ecx
c14e880d: 09 c1 or %eax,%ecx
c14e880f: 74 2b je c14e883c <generic_make_request+0x302>
c14e8811: 8b 33 mov (%ebx),%esi
c14e8813: 8b 4b 04 mov 0x4(%ebx),%ecx
c14e8816: 83 fa 00 cmp $0x0,%edx
c14e8819: 77 05 ja c14e8820 <generic_make_request+0x2e6>
c14e881b: 3b 45 c0 cmp -0x40(%ebp),%eax
c14e881e: 72 10 jb c14e8830 <generic_make_request+0x2f6>
c14e8820: 2b 45 9c sub -0x64(%ebp),%eax
c14e8823: 1b 55 a0 sbb -0x60(%ebp),%edx
c14e8826: 39 ca cmp %ecx,%edx
c14e8828: 77 12 ja c14e883c <generic_make_request+0x302>
c14e882a: 72 04 jb c14e8830 <generic_make_request+0x2f6>
c14e882c: 39 f0 cmp %esi,%eax
c14e882e: 73 0c jae c14e883c <generic_make_request+0x302>
c14e8830: 89 d8 mov %ebx,%eax
c14e8832: e8 87 e7 ff ff call c14e6fbe <handle_bad_sector>
c14e8837: e9 4d fe ff ff jmp c14e8689 <generic_make_request+0x14f>
c14e883c: 8b 43 14 mov 0x14(%ebx),%eax
c14e883f: a9 00 10 80 00 test $0x801000,%eax
c14e8844: 74 1a je c14e8860 <generic_make_request+0x326>
c14e8846: 8b 55 c4 mov -0x3c(%ebp),%edx
c14e8849: 83 ba 7c 02 00 00 00 cmpl $0x0,0x27c(%edx)
c14e8850: 75 0e jne c14e8860 <generic_make_request+0x326>
c14e8852: 25 ff ef 7f ff and $0xff7fefff,%eax
c14e8857: 89 43 14 mov %eax,0x14(%ebx)
c14e885a: 83 7d c0 00 cmpl $0x0,-0x40(%ebp)
c14e885e: 74 6d je c14e88cd <generic_make_request+0x393>
c14e8860: 8b 43 14 mov 0x14(%ebx),%eax
c14e8863: a8 40 test $0x40,%al
c14e8865: 74 2d je c14e8894 <generic_make_request+0x35a>
c14e8867: 8b 4d c4 mov -0x3c(%ebp),%ecx
c14e886a: 8b 91 a4 01 00 00 mov 0x1a4(%ecx),%edx
c14e8870: 80 e6 40 and $0x40,%dh
c14e8873: 74 5c je c14e88d1 <generic_make_request+0x397>
c14e8875: a9 00 00 00 08 test $0x8000000,%eax
c14e887a: 74 18 je c14e8894 <generic_make_request+0x35a>
c14e887c: 8b 81 a4 01 00 00 mov 0x1a4(%ecx),%eax
c14e8882: f6 c4 40 test $0x40,%ah
c14e8885: 74 4a je c14e88d1 <generic_make_request+0x397>
c14e8887: 8b 81 a4 01 00 00 mov 0x1a4(%ecx),%eax
c14e888d: a9 00 00 02 00 test $0x20000,%eax
c14e8892: 74 3d je c14e88d1 <generic_make_request+0x397>
c14e8894: 85 db test %ebx,%ebx
c14e8896: 74 45 je c14e88dd <generic_make_request+0x3a3>
c14e8898: e9 00 00 00 00 jmp c14e889d <generic_make_request+0x363>
c14e889d: eb 1c jmp c14e88bb <generic_make_request+0x381>
c14e889f: 8b 35 fc 4e 2a c2 mov 0xc22a4efc,%esi
c14e88a5: 85 f6 test %esi,%esi
c14e88a7: 74 12 je c14e88bb <generic_make_request+0x381>
c14e88a9: 8b 46 04 mov 0x4(%esi),%eax
c14e88ac: 89 d9 mov %ebx,%ecx
c14e88ae: 8b 55 c4 mov -0x3c(%ebp),%edx
c14e88b1: ff 16 call *(%esi)
c14e88b3: 83 c6 08 add $0x8,%esi
c14e88b6: 83 3e 00 cmpl $0x0,(%esi)
c14e88b9: eb ec jmp c14e88a7 <generic_make_request+0x36d>
c14e88bb: 89 da mov %ebx,%edx
c14e88bd: 8b 45 c4 mov -0x3c(%ebp),%eax
c14e88c0: ff 50 44 call *0x44(%eax)
c14e88c3: 85 c0 test %eax,%eax
c14e88c5: 0f 85 7d fd ff ff jne c14e8648 <generic_make_request+0x10e>
c14e88cb: eb 10 jmp c14e88dd <generic_make_request+0x3a3>
c14e88cd: 31 d2 xor %edx,%edx
c14e88cf: eb 05 jmp c14e88d6 <generic_make_request+0x39c>
c14e88d1: ba a1 ff ff ff mov $0xffffffa1,%edx
c14e88d6: 89 d8 mov %ebx,%eax
c14e88d8: e8 b8 49 c0 ff call c10ed295 <bio_endio>
c14e88dd: 8b 55 b0 mov -0x50(%ebp),%edx
c14e88e0: 8b 82 f4 02 00 00 mov 0x2f4(%edx),%eax
c14e88e6: 8b 18 mov (%eax),%ebx
c14e88e8: 85 db test %ebx,%ebx
c14e88ea: 74 1c je c14e8908 <generic_make_request+0x3ce>
c14e88ec: 8b 53 08 mov 0x8(%ebx),%edx
c14e88ef: 89 10 mov %edx,(%eax)
c14e88f1: 85 d2 test %edx,%edx
c14e88f3: 75 07 jne c14e88fc <generic_make_request+0x3c2>
c14e88f5: c7 40 04 00 00 00 00 movl $0x0,0x4(%eax)
c14e88fc: c7 43 08 00 00 00 00 movl $0x0,0x8(%ebx)
c14e8903: e9 a1 fc ff ff jmp c14e85a9 <generic_make_request+0x6f>
c14e8908: 8b 4d b0 mov -0x50(%ebp),%ecx
c14e890b: c7 81 f4 02 00 00 00 movl $0x0,0x2f4(%ecx)
c14e8912: 00 00 00
c14e8915: 8b 45 f0 mov -0x10(%ebp),%eax
c14e8918: 65 33 05 14 00 00 00 xor %gs:0x14,%eax
c14e891f: 74 05 je c14e8926 <generic_make_request+0x3ec>
c14e8921: e8 56 27 b5 ff call c103b07c <__stack_chk_fail>
c14e8926: 83 c4 6c add $0x6c,%esp
c14e8929: 5b pop %ebx
c14e892a: 5e pop %esi
c14e892b: 5f pop %edi
c14e892c: 5d pop %ebp
c14e892d: c3 ret

The crash is at:

c14e85ac: c1 ea 09 shr $0x9,%edx
c14e85af: 89 55 c0 mov %edx,-0x40(%ebp)
c14e85b2: e8 e3 8c 88 00 call c1d7129a <_cond_resched>
c14e85b7: 83 7d c0 00 cmpl $0x0,-0x40(%ebp)
c14e85bb: 74 69 je c14e8626 <generic_make_request+0xec>
c14e85bd: 8b 43 0c mov 0xc(%ebx),%eax
* c14e85c0: 8b 40 08 mov 0x8(%eax),%eax <==== [ **CRASH** ]
c14e85c3: 8b 90 84 00 00 00 mov 0x84(%eax),%edx
c14e85c9: f6 c2 01 test $0x1,%dl

which corresponds to:

1414
1415 if (!nr_sectors)
1416 return 0;
1417
1418 /* Test device or partition size, when known. */
1419 maxsector = i_size_read(bio->bi_bdev->bd_inode) >> 9; <==== [ **CRASH** ]
1420 if (maxsector) {
1421 sector_t sector = bio->bi_sector;
1422
1423 if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {

bio->bi_bdev has become NULL?

I do not think the _cond_resched() was called, judging from stack contents. But
we just had an IRQ:

[<c1d74030>] ? common_interrupt+0x30/0x40

So we might have raced with block IO IRQ queue-completion/submission activites.

But maybe it was a reschedule after all, just the stack does not carry any
traces of it anymore. IRQs do not clear ->bi_bdev, right? Unless the bio
refcounts are wrong and an IRQ's completion actually frees the bio, right?

I've built a CONFIG_DEBUG_PAGEALLOC=y and CONFIG_SLUB_DEBUG=y kernel, maybe the
crash triggers in a more revealing way.

Thanks,

Ingo

Attachments:

(No filename) (25.70 kB)
config.zipproblem2 (127.49 kB)
crash.log (200.28 kB)
Download all attachments

2011-05-04 09:53:09

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Ingo Molnar wrote:

> 1415 if (!nr_sectors)
> 1416 return 0;
> 1417
> 1418 /* Test device or partition size, when known. */
> 1419 maxsector = i_size_read(bio->bi_bdev->bd_inode) >> 9; <==== [ **CRASH** ]
> 1420 if (maxsector) {
> 1421 sector_t sector = bio->bi_sector;
> 1422
> 1423 if (maxsector < nr_sectors || maxsector - nr_sectors < sector) {
>
> bio->bi_bdev has become NULL?
>
> I do not think the _cond_resched() was called, judging from stack contents. But
> we just had an IRQ:
>
> [<c1d74030>] ? common_interrupt+0x30/0x40
>
> So we might have raced with block IO IRQ queue-completion/submission activites.
>
> But maybe it was a reschedule after all, just the stack does not carry any
> traces of it anymore. IRQs do not clear ->bi_bdev, right? Unless the bio
> refcounts are wrong and an IRQ's completion actually frees the bio, right?

Looking at the call chain that's impossible:

generic_make_request
submit_bio
submit_bh

submit_bh does:

bio = bio_alloc()
bio_get(bio)
submit_bio(bio)
bio_put(bio)

So that bio is not yet known to anything else than the calling
code.

One possibility is that bh->bdev is NULL when submit_bh() is called,
which I think is rather unlikely, but can be easily verified with

--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2887,6 +2887,7 @@ int submit_bh(int rw, struct buffer_head * bh)
BUG_ON(!bh->b_end_io);
BUG_ON(buffer_delay(bh));
BUG_ON(buffer_unwritten(bh));
+ BUG_ON(!bh->b_bdev);

/*
* Only clear out a write error when rewriting

But I rather suspect, that CONFIG_SLUB=y is the thing we need to look
at. The lockless fastpath cmpxchg comes to my mind.

Either we generate broken code with that ELAN caused options or
that combo triggers some hidden problem in SLUB.

Thanks,

tglx

2011-05-04 10:13:39

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Ingo Molnar <[email protected]> wrote:

> I've built a CONFIG_DEBUG_PAGEALLOC=y and CONFIG_SLUB_DEBUG=y kernel, maybe the
> crash triggers in a more revealing way.

After an hour of testing it still has not triggered. This could have multiple
reasons:

- Maybe i reproduced the original crash by pure luck.

To exclude this i'll re-test once again with the non-debug .config.

- There's some kernel image layout dependency that twiddling these options has
hidden.

This is the lowest probability explanation, because the problem
triggered for me after tweaking the original .config and using a different
toolchain.

- These debug options could have narrowed the race window and could make the
race much less likely to occur.

This is the highest probability explanation.

Thanks,

Ingo

2011-05-04 10:19:53

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Thomas Gleixner <[email protected]> wrote:

> But I rather suspect, that CONFIG_SLUB=y is the thing we need to look at. The
> lockless fastpath cmpxchg comes to my mind.

Hm, and CONFIG_X86_ELAN, as Linus noted, has an impact on the cmpxchg
implementation.

> Either we generate broken code with that ELAN caused options or that combo
> triggers some hidden problem in SLUB.

Note that the crash went away with SLUB_DEBUG=y and PAGEALLOC=y and
SLUB_DEBUG=y would certainly narrow any lockless-SLUB race windows.

Thanks,

Ingo

2011-05-04 10:25:58

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Ingo Molnar <[email protected]> wrote:

> * Thomas Gleixner <[email protected]> wrote:
>
> > But I rather suspect, that CONFIG_SLUB=y is the thing we need to look at. The
> > lockless fastpath cmpxchg comes to my mind.
>
> Hm, and CONFIG_X86_ELAN, as Linus noted, has an impact on the cmpxchg
> implementation.

Walter, could you please test the patch below on a failing kernel?

Note, the patch is actually incorrect for a real Elan box, but should work on
Walter's box - and should avoid the cmpxchg8b_emul implementation on that box.

Thanks,

Ingo

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 6a7cfdf..0783906 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -402,7 +402,7 @@ config X86_TSC

config X86_CMPXCHG64
def_bool y
- depends on X86_PAE || X86_64 || MCORE2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MATOM
+ depends on X86_PAE || X86_64 || MCORE2 || MPENTIUM4 || MPENTIUMM || MPENTIUMIII || MPENTIUMII || M686 || MATOM || X86_ELAN

# this should be set for all -march=.. options where the compiler
# generates cmov.

2011-05-04 10:33:11

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Ingo Molnar <[email protected]> wrote:

> Walter, could you please test the patch below on a failing kernel?

the other patch to test would be the one below, on a failing kernel. This would
check whether compiler flags have an impact on your crash.

Ingo

diff --git a/arch/x86/Makefile_32.cpu b/arch/x86/Makefile_32.cpu
index f2ee1ab..58dae4f 100644
--- a/arch/x86/Makefile_32.cpu
+++ b/arch/x86/Makefile_32.cpu
@@ -36,9 +36,6 @@ cflags-$(CONFIG_MCORE2) += -march=i686 $(call tune,core2)
cflags-$(CONFIG_MATOM) += $(call cc-option,-march=atom,$(call cc-option,-march=core2,-march=i686)) \
$(call cc-option,-mtune=atom,$(call cc-option,-mtune=generic))

-# AMD Elan support
-cflags-$(CONFIG_X86_ELAN) += -march=i486
-
# Geode GX1 support
cflags-$(CONFIG_MGEODEGX1) += -march=pentium-mmx
cflags-$(CONFIG_MGEODE_LX) += $(call cc-option,-march=geode,-march=pentium-mmx)

2011-05-04 10:41:57

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Ingo Molnar <[email protected]> wrote:

> - Maybe i reproduced the original crash by pure luck.
>
> To exclude this i'll re-test once again with the non-debug .config.

I got a different problem this time around:

Fedora Core release 6 (Zod)
Kernel 2.6.39-rc5-i486-1sys+ on an i686

mercury login: ata1: lost interrupt (Status 0x58)
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: failed command: WRITE DMA EXT
ata1.00: cmd 35/00:38:56:cf:a5/00:01:05:00:00/e0 tag 0 dma 159744 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: soft resetting link
ata1: nv_mode_filter: 0x3f39f&0x3f39f->0x3f39f, BIOS=0x3f000 (0xc60000c0) ACPI=0x3f01f (20:600:0x13)
ata1.00: configured for UDMA/100
ata1.00: device reported invalid CHS sector 0
ata1: EH complete
ata1: lost interrupt (Status 0x58)
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: failed command: WRITE DMA EXT
ata1.00: cmd 35/00:38:56:cf:a5/00:01:05:00:00/e0 tag 0 dma 159744 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: soft resetting link
ata1: nv_mode_filter: 0x3f39f&0x3f39f->0x3f39f, BIOS=0x3f000 (0xc60000c0) ACPI=0x3f01f (20:600:0x13)
ata1.00: configured for UDMA/100
ata1.00: device reported invalid CHS sector 0
ata1: EH complete

Fedora Core release 6 (Zod)
Kernel 2.6.39-rc5-i486-1sys+ on an i686

mercury login:
Fedora Core release 6 (Zod)
Kernel 2.6.39-rc5-i486-1sys+ on an i686

mercury login: ata1: lost interrupt (Status 0x58)
ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata1.00: failed command: WRITE DMA EXT
ata1.00: cmd 35/00:38:56:cf:a5/00:01:05:00:00/e0 tag 0 dma 159744 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }
ata1: soft resetting link
ata1: nv_mode_filter: 0x3f39f&0x3f39f->0x3f39f, BIOS=0x3f000 (0xc60000c0) ACPI=0x3f01f (20:600:0x13)
ata1.00: configured for UDMA/100
ata1.00: device reported invalid CHS sector 0
ata1: EH complete

The system is dead (is not performing any IO) after that.

Note that i reported similar problems early during this merge window, in this mail:

[sporadic crash] blk: request botched

That sha1 was 89078d572eb9ce8d4c04264b8b0ba86de0d74c8f - that already had
lockless SLUB:

e8c500c2b64b: Merge branch 'slub/lockless' into for-linus

I never managed to pin that down, these IO problems were very sporadic in my
test setup and never reproducible, until today.

Ingo

2011-05-04 10:45:36

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Ingo Molnar <[email protected]> wrote:

> Note that i reported similar problems early during this merge window, in this mail:
>
> [sporadic crash] blk: request botched
>
> That sha1 was 89078d572eb9ce8d4c04264b8b0ba86de0d74c8f - that already had
> lockless SLUB:
>
> e8c500c2b64b: Merge branch 'slub/lockless' into for-linus
>
> I never managed to pin that down, these IO problems were very sporadic in my
> test setup and never reproducible, until today.

So i'm doing a testrun now with a failing kernel with the patch below applied,
to disable the SLUB lockless code.

Ingo

diff --git a/mm/slub.c b/mm/slub.c
index 94d2a33..27bc3be 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -30,6 +30,8 @@

#include <trace/events/kmem.h>

+#undef CONFIG_CMPXCHG_LOCAL
+
/*
* Lock order:
* 1. slab_lock(page)

2011-05-04 11:07:15

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Ingo Molnar <[email protected]> wrote:

> diff --git a/mm/slub.c b/mm/slub.c
> index 94d2a33..27bc3be 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -30,6 +30,8 @@
>
> #include <trace/events/kmem.h>
>
> +#undef CONFIG_CMPXCHG_LOCAL
> +
> /*
> * Lock order:
> * 1. slab_lock(page)

This seems rock solid after half an hour of testing. I'll keep it running
longer, i still have no good data for how frequently the crashes are occuring.

Thanks,

Ingo

2011-05-04 11:11:48

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Ingo Molnar wrote:

>
> * Thomas Gleixner <[email protected]> wrote:
>
> > But I rather suspect, that CONFIG_SLUB=y is the thing we need to look at. The
> > lockless fastpath cmpxchg comes to my mind.
>
> Hm, and CONFIG_X86_ELAN, as Linus noted, has an impact on the cmpxchg
> implementation.

Exactly. With ELAN CONFIG_X86_CMPXCHG64 is not set. When I disable
ELAN it's set.

> > Either we generate broken code with that ELAN caused options or that combo
> > triggers some hidden problem in SLUB.
>
> Note that the crash went away with SLUB_DEBUG=y and PAGEALLOC=y and
> SLUB_DEBUG=y would certainly narrow any lockless-SLUB race windows.

Well, it's pretty simple:

CONFIG_X86_CMPXCHG64=y compiles this_cpu_generic_cmpxchg_double() into:

28f7: 89 f9 mov %edi,%ecx
28f9: 8b 3e mov (%esi),%edi
28fb: 89 45 e4 mov %eax,-0x1c(%ebp)
28fe: 89 c3 mov %eax,%ebx
2900: 8b 45 f0 mov -0x10(%ebp),%eax
2903: 64 0f c7 0f cmpxchg8b %fs:(%edi)
2907: 0f 94 c0 sete %al
290a: 84 c0 test %al,%al
290c: 88 45 e4 mov %al,-0x1c(%ebp)
290f: 74 a4 je 28b5 <kmem_cache_alloc+0x29>

while CONFIG_X86_CMPXCHG64=n results in:

28b0: 8b 03 mov (%ebx),%eax
28b2: 64 8b 30 mov %fs:(%eax),%esi
28b5: 39 d6 cmp %edx,%esi
28b7: 75 d2 jne 288b <kmem_cache_alloc+0x26>
28b9: 64 8b 50 04 mov %fs:0x4(%eax),%edx
28bd: 39 ca cmp %ecx,%edx
28bf: 75 ca jne 288b <kmem_cache_alloc+0x26>
28c1: 8b 4b 14 mov 0x14(%ebx),%ecx
28c4: 8b 0c 0e mov (%esi,%ecx,1),%ecx
28c7: 64 89 08 mov %ecx,%fs:(%eax)
28ca: 8b 03 mov (%ebx),%eax
28cc: 42 inc %edx
28cd: 64 89 50 04 mov %edx,%fs:0x4(%eax)

And that code runs with preemption enabled. So when the task gets
preempted _BEFORE_ it has actuallty written back the data, then the
race window is wide open.

I'm still trying to understand that macro hell which actually
generates that code. I always thought that George Anzingers macro maze
was horrible, but that's even worse.

Thanks,

tglx

2011-05-04 11:17:03

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, May 4, 2011 at 2:11 PM, Thomas Gleixner <[email protected]> wrote:
> On Wed, 4 May 2011, Ingo Molnar wrote:
>
>>
>> * Thomas Gleixner <[email protected]> wrote:
>>
>> > But I rather suspect, that CONFIG_SLUB=y is the thing we need to look at. The
>> > lockless fastpath cmpxchg comes to my mind.
>>
>> Hm, and CONFIG_X86_ELAN, as Linus noted, has an impact on the cmpxchg
>> implementation.
>
> Exactly. With ELAN CONFIG_X86_CMPXCHG64 is not set. When I disable
> ELAN it's set.
>
>> > Either we generate broken code with that ELAN caused options or that combo
>> > triggers some hidden problem in SLUB.
>>
>> Note that the crash went away with SLUB_DEBUG=y and PAGEALLOC=y and
>> SLUB_DEBUG=y would certainly narrow any lockless-SLUB race windows.
>
> Well, it's pretty simple:
>
> CONFIG_X86_CMPXCHG64=y compiles this_cpu_generic_cmpxchg_double() into:
>
> ? ?28f7: ? ? ? 89 f9 ? ? ? ? ? ? ? ? ? mov ? ?%edi,%ecx
> ? ?28f9: ? ? ? 8b 3e ? ? ? ? ? ? ? ? ? mov ? ?(%esi),%edi
> ? ?28fb: ? ? ? 89 45 e4 ? ? ? ? ? ? ? ?mov ? ?%eax,-0x1c(%ebp)
> ? ?28fe: ? ? ? 89 c3 ? ? ? ? ? ? ? ? ? mov ? ?%eax,%ebx
> ? ?2900: ? ? ? 8b 45 f0 ? ? ? ? ? ? ? ?mov ? ?-0x10(%ebp),%eax
> ? ?2903: ? ? ? 64 0f c7 0f ? ? ? ? ? ? cmpxchg8b %fs:(%edi)
> ? ?2907: ? ? ? 0f 94 c0 ? ? ? ? ? ? ? ?sete ? %al
> ? ?290a: ? ? ? 84 c0 ? ? ? ? ? ? ? ? ? test ? %al,%al
> ? ?290c: ? ? ? 88 45 e4 ? ? ? ? ? ? ? ?mov ? ?%al,-0x1c(%ebp)
> ? ?290f: ? ? ? 74 a4 ? ? ? ? ? ? ? ? ? je ? ? 28b5 <kmem_cache_alloc+0x29>
>
> while CONFIG_X86_CMPXCHG64=n results in:
>
> ? ?28b0: ? ? ? 8b 03 ? ? ? ? ? ? ? ? ? mov ? ?(%ebx),%eax
> ? ?28b2: ? ? ? 64 8b 30 ? ? ? ? ? ? ? ?mov ? ?%fs:(%eax),%esi
> ? ?28b5: ? ? ? 39 d6 ? ? ? ? ? ? ? ? ? cmp ? ?%edx,%esi
> ? ?28b7: ? ? ? 75 d2 ? ? ? ? ? ? ? ? ? jne ? ?288b <kmem_cache_alloc+0x26>
> ? ?28b9: ? ? ? 64 8b 50 04 ? ? ? ? ? ? mov ? ?%fs:0x4(%eax),%edx
> ? ?28bd: ? ? ? 39 ca ? ? ? ? ? ? ? ? ? cmp ? ?%ecx,%edx
> ? ?28bf: ? ? ? 75 ca ? ? ? ? ? ? ? ? ? jne ? ?288b <kmem_cache_alloc+0x26>
> ? ?28c1: ? ? ? 8b 4b 14 ? ? ? ? ? ? ? ?mov ? ?0x14(%ebx),%ecx
> ? ?28c4: ? ? ? 8b 0c 0e ? ? ? ? ? ? ? ?mov ? ?(%esi,%ecx,1),%ecx
> ? ?28c7: ? ? ? 64 89 08 ? ? ? ? ? ? ? ?mov ? ?%ecx,%fs:(%eax)
> ? ?28ca: ? ? ? 8b 03 ? ? ? ? ? ? ? ? ? mov ? ?(%ebx),%eax
> ? ?28cc: ? ? ? 42 ? ? ? ? ? ? ? ? ? ? ?inc ? ?%edx
> ? ?28cd: ? ? ? 64 89 50 04 ? ? ? ? ? ? mov ? ?%edx,%fs:0x4(%eax)
>
> And that code runs with preemption enabled. So when the task gets
> preempted _BEFORE_ it has actuallty written back the data, then the
> race window is wide open.
>
> I'm still trying to understand that macro hell which actually
> generates that code. I always thought that George Anzingers macro maze
> was horrible, but that's even worse.

I'm adding Christoph and Tejun to CC.

2011-05-04 11:27:53

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

Hello,

On Wed, May 04, 2011 at 02:16:57PM +0300, Pekka Enberg wrote:
> On Wed, May 4, 2011 at 2:11 PM, Thomas Gleixner <[email protected]> wrote:
> > On Wed, 4 May 2011, Ingo Molnar wrote:
> > Well, it's pretty simple:
> >
> > CONFIG_X86_CMPXCHG64=y compiles this_cpu_generic_cmpxchg_double() into:
> >
> > ? ?28f7: ? ? ? 89 f9 ? ? ? ? ? ? ? ? ? mov ? ?%edi,%ecx
> > ? ?28f9: ? ? ? 8b 3e ? ? ? ? ? ? ? ? ? mov ? ?(%esi),%edi
> > ? ?28fb: ? ? ? 89 45 e4 ? ? ? ? ? ? ? ?mov ? ?%eax,-0x1c(%ebp)
> > ? ?28fe: ? ? ? 89 c3 ? ? ? ? ? ? ? ? ? mov ? ?%eax,%ebx
> > ? ?2900: ? ? ? 8b 45 f0 ? ? ? ? ? ? ? ?mov ? ?-0x10(%ebp),%eax
> > ? ?2903: ? ? ? 64 0f c7 0f ? ? ? ? ? ? cmpxchg8b %fs:(%edi)
> > ? ?2907: ? ? ? 0f 94 c0 ? ? ? ? ? ? ? ?sete ? %al
> > ? ?290a: ? ? ? 84 c0 ? ? ? ? ? ? ? ? ? test ? %al,%al
> > ? ?290c: ? ? ? 88 45 e4 ? ? ? ? ? ? ? ?mov ? ?%al,-0x1c(%ebp)
> > ? ?290f: ? ? ? 74 a4 ? ? ? ? ? ? ? ? ? je ? ? 28b5 <kmem_cache_alloc+0x29>
> >
> > while CONFIG_X86_CMPXCHG64=n results in:
> >
> > ? ?28b0: ? ? ? 8b 03 ? ? ? ? ? ? ? ? ? mov ? ?(%ebx),%eax
> > ? ?28b2: ? ? ? 64 8b 30 ? ? ? ? ? ? ? ?mov ? ?%fs:(%eax),%esi
> > ? ?28b5: ? ? ? 39 d6 ? ? ? ? ? ? ? ? ? cmp ? ?%edx,%esi
> > ? ?28b7: ? ? ? 75 d2 ? ? ? ? ? ? ? ? ? jne ? ?288b <kmem_cache_alloc+0x26>
> > ? ?28b9: ? ? ? 64 8b 50 04 ? ? ? ? ? ? mov ? ?%fs:0x4(%eax),%edx
> > ? ?28bd: ? ? ? 39 ca ? ? ? ? ? ? ? ? ? cmp ? ?%ecx,%edx
> > ? ?28bf: ? ? ? 75 ca ? ? ? ? ? ? ? ? ? jne ? ?288b <kmem_cache_alloc+0x26>
> > ? ?28c1: ? ? ? 8b 4b 14 ? ? ? ? ? ? ? ?mov ? ?0x14(%ebx),%ecx
> > ? ?28c4: ? ? ? 8b 0c 0e ? ? ? ? ? ? ? ?mov ? ?(%esi,%ecx,1),%ecx
> > ? ?28c7: ? ? ? 64 89 08 ? ? ? ? ? ? ? ?mov ? ?%ecx,%fs:(%eax)
> > ? ?28ca: ? ? ? 8b 03 ? ? ? ? ? ? ? ? ? mov ? ?(%ebx),%eax
> > ? ?28cc: ? ? ? 42 ? ? ? ? ? ? ? ? ? ? ?inc ? ?%edx
> > ? ?28cd: ? ? ? 64 89 50 04 ? ? ? ? ? ? mov ? ?%edx,%fs:0x4(%eax)
> >
> > And that code runs with preemption enabled. So when the task gets
> > preempted _BEFORE_ it has actuallty written back the data, then the
> > race window is wide open.

Hmmm... if it's a race caused by preemtion enabled where it shouldn't
be, it's most likely the wrong type of this_cpu_cmpxchg_double() being
used in SLUB? ie. __this_cpu_cmpxchg_double() where it should have
been this_cpu_cmpxchg_double()? Christoph?

Thanks.

--
tejun

2011-05-04 12:36:54

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Ingo Molnar <[email protected]> wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> > * Thomas Gleixner <[email protected]> wrote:
> >
> > > But I rather suspect, that CONFIG_SLUB=y is the thing we need to look at. The
> > > lockless fastpath cmpxchg comes to my mind.
> >
> > Hm, and CONFIG_X86_ELAN, as Linus noted, has an impact on the cmpxchg
> > implementation.
>
> Walter, could you please test the patch below on a failing kernel?

This patch will likely make the bug go away.

Ingo

2011-05-04 12:37:19

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Ingo Molnar <[email protected]> wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> > Walter, could you please test the patch below on a failing kernel?
>
> the other patch to test would be the one below, on a failing kernel. This would
> check whether compiler flags have an impact on your crash.

this patch will probably have no effect on the crash.

Thanks,

Ingo

2011-05-04 12:38:12

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Ingo Molnar <[email protected]> wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 94d2a33..27bc3be 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -30,6 +30,8 @@
> >
> > #include <trace/events/kmem.h>
> >
> > +#undef CONFIG_CMPXCHG_LOCAL
> > +
> > /*
> > * Lock order:
> > * 1. slab_lock(page)
>
> This seems rock solid after half an hour of testing. I'll keep it running
> longer, i still have no good data for how frequently the crashes are occuring.

It's still rock solid after 2 hours: neither crashes nor IO/IRQ timeouts are
occuring.

Thanks,

Ingo

2011-05-04 12:48:22

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Ingo Molnar <[email protected]> wrote:

> > > index 94d2a33..27bc3be 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -30,6 +30,8 @@
> > >
> > > #include <trace/events/kmem.h>
> > >
> > > +#undef CONFIG_CMPXCHG_LOCAL
> > > +
> > > /*
> > > * Lock order:
> > > * 1. slab_lock(page)
> >
> > This seems rock solid after half an hour of testing. I'll keep it running
> > longer, i still have no good data for how frequently the crashes are occuring.
>
> It's still rock solid after 2 hours: neither crashes nor IO/IRQ timeouts are
> occuring.

So i removed the above patch and rebooted, and within minutes of starting the
FS test i got:

skb_over_panic: text:c19fe045 len:98 put:98 head: (null) data: (null) tail:0x62 end:0x0 dev:<NULL>
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:127!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:0a.0/net/eth0/address
Modules linked in:

Pid: 3535, comm: dd Not tainted 2.6.39-rc5-i486-1sys+ #122586 System manufacturer System Product Name/A8N-E
EIP: 0060:[<c1bda60d>] EFLAGS: 00010292 CPU: 1
EIP is at skb_put+0x89/0x92
EAX: 0000006b EBX: 00000000 ECX: 00000046 EDX: 00000000
ESI: c19fe045 EDI: 00000062 EBP: f64cdf20 ESP: f64cdef4
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Process dd (pid: 3535, ti=f64cc000 task=f5f4b570 task.ti=f53f4000)
Stack:
c2143545 c19fe045 00000062 00000062 00000000 00000000 00000062 00000000
c207d136 f6506000 f408d600 f64cdf4c c19fe045 c19fd92b f64cdf4c 00000040
f6506428 00000000 34020062 f6506000 00000246 c21b799c f64cdf90 c1a004c1
Call Trace:
[<c19fe045>] ? nv_rx_process_optimized+0x101/0x1de
[<c19fe045>] nv_rx_process_optimized+0x101/0x1de
[<c19fd92b>] ? nv_alloc_rx_optimized+0xe/0x18f
[<c1a004c1>] nv_napi_poll+0x496/0x4a5
[<c105838c>] ? hrtimer_run_pending+0xe/0xd1
[<c1d734b4>] ? _raw_spin_lock+0x8/0x1e
[<c1be1d59>] net_rx_action+0x94/0x1ab
[<c1042fcd>] __do_softirq+0x9f/0x14f
[<c1042f2e>] ? remote_softirq_receive+0x33/0x33
<IRQ>
[<c10431e7>] ? irq_exit+0x3a/0x43
[<c10047ce>] ? do_IRQ+0x8c/0xa0
[<c116366d>] ? __ext3_journal_dirty_metadata+0x1e/0x45
[<c1054f23>] ? wake_up_bit+0x1c/0x20
[<c10ec726>] ? __brelse+0xb/0x36
[<c102ea1c>] ? __wake_up_common+0xe/0x62
[<c1d74eb0>] ? common_interrupt+0x30/0x40
[<c14fb1ea>] ? sha_transform+0x9a/0x1be
[<c15ff44e>] ? extract_buf+0x50/0xe3
[<c14fe7ab>] ? __copy_to_user_ll+0xb/0x37
[<c14fe9b5>] ? copy_to_user+0x3e/0x49
[<c15ffd83>] ? extract_entropy_user+0x80/0xe5
[<c15ffdfa>] ? urandom_read+0x12/0x14
[<c10cc888>] ? vfs_read+0x93/0x115
[<c15ffde8>] ? extract_entropy_user+0xe5/0xe5
[<c10cc94c>] ? sys_read+0x42/0x66
[<c1d74903>] ? sysenter_do_call+0x12/0x28
Code: 00 00 89 44 24 14 8b 81 a8 00 00 00 89 44 24 10 89 54 24 0c 8b 41 50 89 44 24 08 89 74 24 04 c7 04 24 45 35 14 c2 e8 fa 09 18 00 <0f> 0b 83 c4 24 5b 5e 5d c3 55 89 e5 57 56 53 83 ec 30 e8 ac a8
EIP: [<c1bda60d>] skb_put+0x89/0x92 SS:ESP 0068:f64cdef4
---[ end trace 1d38b9741c67ed6b ]---

And in hindsight i have to admit that i saw this in randconfig testing in the
past few weeks, i just never managed to reproduce it ...

So yes, the fact that this time it crashed in networking (not in block IO)
clearly implicates SLUB as well.

And the trigger condition is the lockless SLUB code on 32-bit,
non-64-bit-cmpxchg platforms. I'd not be surprised if some embedded platforms
triggered this too.

Ingo

2011-05-04 12:51:16

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

Hi Tejun,

On Wed, May 4, 2011 at 2:27 PM, Tejun Heo <[email protected]> wrote:
> Hmmm... if it's a race caused by preemtion enabled where it shouldn't
> be, it's most likely the wrong type of this_cpu_cmpxchg_double() being
> used in SLUB? ?ie. __this_cpu_cmpxchg_double() where it should have
> been this_cpu_cmpxchg_double()? ?Christoph?

There's no __this_cpu_cmpxchg_double() usage in mm/slub.c so I don't
think it's that simple.

2011-05-04 12:57:28

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Pekka Enberg <[email protected]> wrote:

> Hi Tejun,
>
> On Wed, May 4, 2011 at 2:27 PM, Tejun Heo <[email protected]> wrote:
> > Hmmm... if it's a race caused by preemtion enabled where it shouldn't
> > be, it's most likely the wrong type of this_cpu_cmpxchg_double() being
> > used in SLUB? ?ie. __this_cpu_cmpxchg_double() where it should have
> > been this_cpu_cmpxchg_double()? ?Christoph?
>
> There's no __this_cpu_cmpxchg_double() usage in mm/slub.c so I don't
> think it's that simple.

Well, AFAICS the problem is:

earth4:~/tip> grep cmpxchg mm/slub.c

if (unlikely(!this_cpu_cmpxchg_double(
if (unlikely(!this_cpu_cmpxchg_double(

Where this macro resolves to:

# define this_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2) \
_this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)

where:

#define _this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
({ \
int ret__; \
preempt_disable(); \
ret__ = __this_cpu_generic_cmpxchg_double(pcp1, pcp2, \
oval1, oval2, nval1, nval2); \
preempt_enable();
where:

#define __this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
({ \
int __ret = 0; \
if (__this_cpu_read(pcp1) == (oval1) && \
__this_cpu_read(pcp2) == (oval2)) { \
__this_cpu_write(pcp1, (nval1)); \
__this_cpu_write(pcp2, (nval2)); \
__ret = 1; \
} \
(__ret); \
})

With is both IRQ and SMP unsafe.

Thanks,

Ingo

2011-05-04 13:01:06

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Tejun Heo wrote:
> > > And that code runs with preemption enabled. So when the task gets
> > > preempted _BEFORE_ it has actuallty written back the data, then the
> > > race window is wide open.
>
> Hmmm... if it's a race caused by preemtion enabled where it shouldn't
> be, it's most likely the wrong type of this_cpu_cmpxchg_double() being
> used in SLUB? ie. __this_cpu_cmpxchg_double() where it should have
> been this_cpu_cmpxchg_double()? Christoph?

No, the problem is that ELAN prevents the cmpxchg8b, but keeps
CONFIG_CMPXCHG_LOCAL=y which then results in the unprotected code for
the following reason:

this_cpu_cmpxchg_double()

-> __pcpu_double_call_return_bool

-> this_cpu_cmpxchg_double_4

Which on x86 expands to

-> percpu_cmpxchg8b_double() when CONFIG_X86_CMPXCHG64=y

With CONFIG_X86_CMPXCHG64=n it expands to the default:

_this_cpu_generic_cmpxchg_double() in linux/percpu.h

#define _this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
({ \
int ret__; \
preempt_disable(); \
ret__ = __this_cpu_generic_cmpxchg_double(pcp1, pcp2, \
oval1, oval2, nval1, nval2); \
preempt_enable(); \
ret__; \
})

And:

#define __this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
({ \
int __ret = 0; \
if (__this_cpu_read(pcp1) == (oval1) && \
__this_cpu_read(pcp2) == (oval2)) { \
__this_cpu_write(pcp1, (nval1)); \
__this_cpu_write(pcp2, (nval2)); \
__ret = 1; \
} \
(__ret); \
})

So now that failing config has CONFIG_PREEMPT=n which makes
preempt_disable / enable a nop.

So preemption is not the problem, but what about interrupts and
softirqs ?

So the question is whether CMPXCHG_LOCAL for x86 wants to depend on
X86_CMPXCHG64.

The other solution is to use irqsafe_cpu_cmpxchg_double() instead of
this_cpu_cmpxchg_double() in slub.c.

This will not hurt the X86_CMPXCHG64=y case, but keep the expansion to
the above __this_cpu_generic_cmpxchg_double working.

Which makes me even wonder some more whether we need that whole
CMPXCHG_LOCAL #ifdeffery in slub.c at all.

Thanks,

tglx

2011-05-04 13:03:04

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Ingo Molnar wrote:
> * Pekka Enberg <[email protected]> wrote:
>
> > Hi Tejun,
> >
> > On Wed, May 4, 2011 at 2:27 PM, Tejun Heo <[email protected]> wrote:
> > > Hmmm... if it's a race caused by preemtion enabled where it shouldn't
> > > be, it's most likely the wrong type of this_cpu_cmpxchg_double() being
> > > used in SLUB? ?ie. __this_cpu_cmpxchg_double() where it should have
> > > been this_cpu_cmpxchg_double()? ?Christoph?
> >
> > There's no __this_cpu_cmpxchg_double() usage in mm/slub.c so I don't
> > think it's that simple.
>
> Well, AFAICS the problem is:
>
> earth4:~/tip> grep cmpxchg mm/slub.c
>
> if (unlikely(!this_cpu_cmpxchg_double(
> if (unlikely(!this_cpu_cmpxchg_double(
>
> Where this macro resolves to:
>
> # define this_cpu_cmpxchg_double_8(pcp1, pcp2, oval1, oval2, nval1, nval2) \
> _this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
>
> where:
>
> #define _this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
> ({ \
> int ret__; \
> preempt_disable(); \
> ret__ = __this_cpu_generic_cmpxchg_double(pcp1, pcp2, \
> oval1, oval2, nval1, nval2); \
> preempt_enable();
> where:
>
> #define __this_cpu_generic_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
> ({ \
> int __ret = 0; \
> if (__this_cpu_read(pcp1) == (oval1) && \
> __this_cpu_read(pcp2) == (oval2)) { \
> __this_cpu_write(pcp1, (nval1)); \
> __this_cpu_write(pcp2, (nval2)); \
> __ret = 1; \
> } \
> (__ret); \
> })
>
> With is both IRQ and SMP unsafe.

SMP is not an issue because that's cpu local access, but it's
irq/softirq unsafe. See my other mail.

Thanks,

tglx

2011-05-04 13:20:29

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

Hello,

On Wed, May 04, 2011 at 03:00:37PM +0200, Thomas Gleixner wrote:
> On Wed, 4 May 2011, Tejun Heo wrote:
> > > > And that code runs with preemption enabled. So when the task gets
> > > > preempted _BEFORE_ it has actuallty written back the data, then the
> > > > race window is wide open.
> >
> > Hmmm... if it's a race caused by preemtion enabled where it shouldn't
> > be, it's most likely the wrong type of this_cpu_cmpxchg_double() being
> > used in SLUB? ie. __this_cpu_cmpxchg_double() where it should have
> > been this_cpu_cmpxchg_double()? Christoph?
>
> No, the problem is that ELAN prevents the cmpxchg8b, but keeps
> CONFIG_CMPXCHG_LOCAL=y which then results in the unprotected code for
> the following reason:
...
> So the question is whether CMPXCHG_LOCAL for x86 wants to depend on
> X86_CMPXCHG64.
>
> The other solution is to use irqsafe_cpu_cmpxchg_double() instead of
> this_cpu_cmpxchg_double() in slub.c.

I think this is the root cause. CMPXCHG_LOCAL is an optimization
flag, indicating that the processor provides fast local cmpxchg, it
doesn't say anything about local synchronization properties and if the
code required irq exclusion, it should have used
irqsafe_cpu_cmpxchg_double() whether the processor supports it
natively or not, so there's the bug. Pekka, can you please change the
offending cmpxchg_double() to irqsafe variant?

As for CMPXCHG_LOCAL being set spuriously, maybe introduce
CMPXCHG_DOUBLE_LOCAL? I don't know. It's pretty nasty to implement
different high-level code paths depending on CPU features. We can't
even determine whether the feature will be actually available at
compile time. But, then again, it might incur noticeable slowdown for
cases where the generic implementation is used. Has anyone measured
the difference against before the whole this_cpu conversion?

Thanks.

--
tejun

2011-05-04 13:23:06

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Thomas Gleixner <[email protected]> wrote:

> Which makes me even wonder some more whether we need that whole
> CMPXCHG_LOCAL #ifdeffery in slub.c at all.

I always found that #ifdeffery rather repulsive.

Thanks,

Ingo

2011-05-04 14:04:26

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Pekka Enberg wrote:

> > Well, it's pretty simple:
> >
> > CONFIG_X86_CMPXCHG64=y compiles this_cpu_generic_cmpxchg_double() into:
> >
> > ? ?28f7: ? ? ? 89 f9 ? ? ? ? ? ? ? ? ? mov ? ?%edi,%ecx
> > ? ?28f9: ? ? ? 8b 3e ? ? ? ? ? ? ? ? ? mov ? ?(%esi),%edi
> > ? ?28fb: ? ? ? 89 45 e4 ? ? ? ? ? ? ? ?mov ? ?%eax,-0x1c(%ebp)
> > ? ?28fe: ? ? ? 89 c3 ? ? ? ? ? ? ? ? ? mov ? ?%eax,%ebx
> > ? ?2900: ? ? ? 8b 45 f0 ? ? ? ? ? ? ? ?mov ? ?-0x10(%ebp),%eax
> > ? ?2903: ? ? ? 64 0f c7 0f ? ? ? ? ? ? cmpxchg8b %fs:(%edi)
> > ? ?2907: ? ? ? 0f 94 c0 ? ? ? ? ? ? ? ?sete ? %al
> > ? ?290a: ? ? ? 84 c0 ? ? ? ? ? ? ? ? ? test ? %al,%al
> > ? ?290c: ? ? ? 88 45 e4 ? ? ? ? ? ? ? ?mov ? ?%al,-0x1c(%ebp)
> > ? ?290f: ? ? ? 74 a4 ? ? ? ? ? ? ? ? ? je ? ? 28b5 <kmem_cache_alloc+0x29>
> >
> > while CONFIG_X86_CMPXCHG64=n results in:
> >
> > ? ?28b0: ? ? ? 8b 03 ? ? ? ? ? ? ? ? ? mov ? ?(%ebx),%eax
> > ? ?28b2: ? ? ? 64 8b 30 ? ? ? ? ? ? ? ?mov ? ?%fs:(%eax),%esi
> > ? ?28b5: ? ? ? 39 d6 ? ? ? ? ? ? ? ? ? cmp ? ?%edx,%esi
> > ? ?28b7: ? ? ? 75 d2 ? ? ? ? ? ? ? ? ? jne ? ?288b <kmem_cache_alloc+0x26>
> > ? ?28b9: ? ? ? 64 8b 50 04 ? ? ? ? ? ? mov ? ?%fs:0x4(%eax),%edx
> > ? ?28bd: ? ? ? 39 ca ? ? ? ? ? ? ? ? ? cmp ? ?%ecx,%edx
> > ? ?28bf: ? ? ? 75 ca ? ? ? ? ? ? ? ? ? jne ? ?288b <kmem_cache_alloc+0x26>
> > ? ?28c1: ? ? ? 8b 4b 14 ? ? ? ? ? ? ? ?mov ? ?0x14(%ebx),%ecx
> > ? ?28c4: ? ? ? 8b 0c 0e ? ? ? ? ? ? ? ?mov ? ?(%esi,%ecx,1),%ecx
> > ? ?28c7: ? ? ? 64 89 08 ? ? ? ? ? ? ? ?mov ? ?%ecx,%fs:(%eax)
> > ? ?28ca: ? ? ? 8b 03 ? ? ? ? ? ? ? ? ? mov ? ?(%ebx),%eax
> > ? ?28cc: ? ? ? 42 ? ? ? ? ? ? ? ? ? ? ?inc ? ?%edx
> > ? ?28cd: ? ? ? 64 89 50 04 ? ? ? ? ? ? mov ? ?%edx,%fs:0x4(%eax)
> >
> > And that code runs with preemption enabled. So when the task gets
> > preempted _BEFORE_ it has actuallty written back the data, then the
> > race window is wide open.
> >
> > I'm still trying to understand that macro hell which actually
> > generates that code. I always thought that George Anzingers macro maze
> > was horrible, but that's even worse.
>
> I'm adding Christoph and Tejun to CC.

The above should not happen. If a kernel config indicates that there is no
cmpxchg16b/cmpxchg8b available then we need to not compile the path that
uses cmpxchg8b/cmpxchg16b. Guess we need CMPXCHG_DOUBLE_LOCAL or so.

2011-05-04 14:08:01

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

Hello, Christoph.

On Wed, May 04, 2011 at 09:04:19AM -0500, Christoph Lameter wrote:
> The above should not happen. If a kernel config indicates that there is no
> cmpxchg16b/cmpxchg8b available then we need to not compile the path that
> uses cmpxchg8b/cmpxchg16b. Guess we need CMPXCHG_DOUBLE_LOCAL or so.

But regardless of that, if the code requires irq safety it should be
using irqsafe_* operations. CMPXCHG_LOCAL is an optimization hint and
doesn't necessarily imply stronger synchronization guarantees. In
addition, there's no downside to using irqsafe_* variant when CPU
actually supports cmpxchg[_double]. So, let's first fix that and
think about CMPXCHG[_DOUBLE]_LOCAL or whatnot later.

Thank you.

--
tejun

2011-05-04 14:10:51

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Tejun Heo wrote:
> Hello,
>
> On Wed, May 04, 2011 at 03:00:37PM +0200, Thomas Gleixner wrote:
> > On Wed, 4 May 2011, Tejun Heo wrote:
> > > > > And that code runs with preemption enabled. So when the task gets
> > > > > preempted _BEFORE_ it has actuallty written back the data, then the
> > > > > race window is wide open.
> > >
> > > Hmmm... if it's a race caused by preemtion enabled where it shouldn't
> > > be, it's most likely the wrong type of this_cpu_cmpxchg_double() being
> > > used in SLUB? ie. __this_cpu_cmpxchg_double() where it should have
> > > been this_cpu_cmpxchg_double()? Christoph?
> >
> > No, the problem is that ELAN prevents the cmpxchg8b, but keeps
> > CONFIG_CMPXCHG_LOCAL=y which then results in the unprotected code for
> > the following reason:
> ...
> > So the question is whether CMPXCHG_LOCAL for x86 wants to depend on
> > X86_CMPXCHG64.
> >
> > The other solution is to use irqsafe_cpu_cmpxchg_double() instead of
> > this_cpu_cmpxchg_double() in slub.c.
>
> I think this is the root cause. CMPXCHG_LOCAL is an optimization
> flag, indicating that the processor provides fast local cmpxchg, it
> doesn't say anything about local synchronization properties and if the
> code required irq exclusion, it should have used
> irqsafe_cpu_cmpxchg_double() whether the processor supports it
> natively or not, so there's the bug. Pekka, can you please change the
> offending cmpxchg_double() to irqsafe variant?

Patch below. Ingo, can you test that please ?

> As for CMPXCHG_LOCAL being set spuriously, maybe introduce
> CMPXCHG_DOUBLE_LOCAL? I don't know. It's pretty nasty to implement

Oh no, please not another CONFIG_WTF and more #ifdeffery.

> different high-level code paths depending on CPU features. We can't
> even determine whether the feature will be actually available at
> compile time. But, then again, it might incur noticeable slowdown for

Right, and the x86 implementation should not do

#ifdef CONFIG_X86_CMPXCHG64
#define percpu_cmpxchg8b_double(pcp1, o1, o2, n1, n2)
#endif

And let the code fallback to the generic variant. It should have an
#else path using the the cmpxchg64_local() implementation which has
alternatives for runtime selection of cmpxchg8b or the cli protected
emulation.

> cases where the generic implementation is used. Has anyone measured
> the difference against before the whole this_cpu conversion?

Yes, that really wants to be done. The whole CMPXCHG_LOCAL ifdeffery
should have been avoided in the first place. this_cpu_cmpxchg can
really be implemented with preempt_enable/disable and the irqsafe
variant in any case.

Thanks,

tglx

--------->
Subject: slub-hmm.patch
From: Thomas Gleixner <[email protected]>
Date: Wed, 04 May 2011 15:38:19 +0200

Signed-off-by: Thomas Gleixner <[email protected]>
---
include/linux/percpu.h | 2 +-
mm/slub.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h
+++ linux-2.6/include/linux/percpu.h
@@ -948,7 +948,7 @@ do { \
irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
# endif
# define irqsafe_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
- __pcpu_double_call_return_int(irqsafe_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2))
+ __pcpu_double_call_return_bool(irqsafe_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2))
#endif

#endif /* __LINUX_PERCPU_H */
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c
+++ linux-2.6/mm/slub.c
@@ -1940,7 +1940,7 @@ redo:
* Since this is without lock semantics the protection is only against
* code executing on this cpu *not* from access by other cpus.
*/
- if (unlikely(!this_cpu_cmpxchg_double(
+ if (unlikely(!irqsafe_cpu_cmpxchg_double(
s->cpu_slab->freelist, s->cpu_slab->tid,
object, tid,
get_freepointer(s, object), next_tid(tid)))) {
@@ -2145,7 +2145,7 @@ redo:
set_freepointer(s, object, c->freelist);

#ifdef CONFIG_CMPXCHG_LOCAL
- if (unlikely(!this_cpu_cmpxchg_double(
+ if (unlikely(!irqsafe_cpu_cmpxchg_double(
s->cpu_slab->freelist, s->cpu_slab->tid,
c->freelist, tid,
object, next_tid(tid)))) {

2011-05-04 14:14:31

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Thomas Gleixner <[email protected]> wrote:

> Patch below. Ingo, can you test that please ?

Sure - test underway.

Thanks,

Ingo

2011-05-04 14:19:47

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Thomas Gleixner wrote:

> Index: linux-2.6/include/linux/percpu.h
> ===================================================================
> --- linux-2.6.orig/include/linux/percpu.h
> +++ linux-2.6/include/linux/percpu.h
> @@ -948,7 +948,7 @@ do { \
> irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
> # endif
> # define irqsafe_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
> - __pcpu_double_call_return_int(irqsafe_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2))
> + __pcpu_double_call_return_bool(irqsafe_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2))
> #endif
>
> #endif /* __LINUX_PERCPU_H */

Looking at the same thing here testing a similar patch. This should go
separately as a bug fix.

> Index: linux-2.6/mm/slub.c
> ===================================================================
> --- linux-2.6.orig/mm/slub.c
> +++ linux-2.6/mm/slub.c
> @@ -1940,7 +1940,7 @@ redo:
> * Since this is without lock semantics the protection is only against
> * code executing on this cpu *not* from access by other cpus.
> */
> - if (unlikely(!this_cpu_cmpxchg_double(
> + if (unlikely(!irqsafe_cpu_cmpxchg_double(
> s->cpu_slab->freelist, s->cpu_slab->tid,
> object, tid,
> get_freepointer(s, object), next_tid(tid)))) {
> @@ -2145,7 +2145,7 @@ redo:
> set_freepointer(s, object, c->freelist);
>
> #ifdef CONFIG_CMPXCHG_LOCAL
> - if (unlikely(!this_cpu_cmpxchg_double(
> + if (unlikely(!irqsafe_cpu_cmpxchg_double(
> s->cpu_slab->freelist, s->cpu_slab->tid,
> c->freelist, tid,
> object, next_tid(tid)))) {
>

Ok as a basic fix since it does not change the code generated for the
common x86 and other.

My version also takes out the CONFIG_CMPXCHG_LOCAL.

We could make the above two patches and then make the CONFIG_CMPXCHG_LOCAL
removal a separate one (since it requires some benchmarking).

2011-05-04 14:21:11

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Thomas Gleixner wrote:

> So the question is whether CMPXCHG_LOCAL for x86 wants to depend on
> X86_CMPXCHG64.

Right.

> The other solution is to use irqsafe_cpu_cmpxchg_double() instead of
> this_cpu_cmpxchg_double() in slub.c.
>
> This will not hurt the X86_CMPXCHG64=y case, but keep the expansion to
> the above __this_cpu_generic_cmpxchg_double working.
>
> Which makes me even wonder some more whether we need that whole
> CMPXCHG_LOCAL #ifdeffery in slub.c at all.

Hmmm... I have not measured it but it would certainly be nice to get rid
of it. The period of irq disablement is then reduced in the fastpath too.
However, we need to do additional tid checking (which should be fast).

The following patch does that and runs here in kvm

---
include/linux/percpu.h | 2 -
include/linux/slub_def.h | 2 -
mm/slub.c | 60 +----------------------------------------------
3 files changed, 3 insertions(+), 61 deletions(-)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h 2011-05-04 09:13:10.000000000 -0500
+++ linux-2.6/include/linux/percpu.h 2011-05-04 09:13:28.000000000 -0500
@@ -948,7 +948,7 @@ do { \
irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
# endif
# define irqsafe_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
- __pcpu_double_call_return_int(irqsafe_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2))
+ __pcpu_double_call_return_bool(irqsafe_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2))
#endif

#endif /* __LINUX_PERCPU_H */
Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2011-05-04 09:11:12.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2011-05-04 09:11:17.000000000 -0500
@@ -37,9 +37,7 @@ enum stat_item {

struct kmem_cache_cpu {
void **freelist; /* Pointer to next available object */
-#ifdef CONFIG_CMPXCHG_LOCAL
unsigned long tid; /* Globally unique transaction id */
-#endif
struct page *page; /* The slab from which we are allocating */
int node; /* The node of the page (or -1 for debug) */
#ifdef CONFIG_SLUB_STATS
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2011-05-04 09:07:10.000000000 -0500
+++ linux-2.6/mm/slub.c 2011-05-04 09:13:58.000000000 -0500
@@ -1540,7 +1540,6 @@ static void unfreeze_slab(struct kmem_ca
}
}

-#ifdef CONFIG_CMPXCHG_LOCAL
#ifdef CONFIG_PREEMPT
/*
* Calculate the next globally unique transaction for disambiguiation
@@ -1600,17 +1599,12 @@ static inline void note_cmpxchg_failure(
stat(s, CMPXCHG_DOUBLE_CPU_FAIL);
}

-#endif
-
void init_kmem_cache_cpus(struct kmem_cache *s)
{
-#ifdef CONFIG_CMPXCHG_LOCAL
int cpu;

for_each_possible_cpu(cpu)
per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu);
-#endif
-
}
/*
* Remove the cpu slab
@@ -1643,9 +1637,7 @@ static void deactivate_slab(struct kmem_
page->inuse--;
}
c->page = NULL;
-#ifdef CONFIG_CMPXCHG_LOCAL
c->tid = next_tid(c->tid);
-#endif
unfreeze_slab(s, page, tail);
}

@@ -1780,7 +1772,6 @@ static void *__slab_alloc(struct kmem_ca
{
void **object;
struct page *new;
-#ifdef CONFIG_CMPXCHG_LOCAL
unsigned long flags;

local_irq_save(flags);
@@ -1792,7 +1783,6 @@ static void *__slab_alloc(struct kmem_ca
*/
c = this_cpu_ptr(s->cpu_slab);
#endif
-#endif

/* We handle __GFP_ZERO in the caller */
gfpflags &= ~__GFP_ZERO;
@@ -1819,10 +1809,8 @@ load_freelist:
c->node = page_to_nid(c->page);
unlock_out:
slab_unlock(c->page);
-#ifdef CONFIG_CMPXCHG_LOCAL
c->tid = next_tid(c->tid);
local_irq_restore(flags);
-#endif
stat(s, ALLOC_SLOWPATH);
return object;

@@ -1858,9 +1846,7 @@ new_slab:
}
if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
slab_out_of_memory(s, gfpflags, node);
-#ifdef CONFIG_CMPXCHG_LOCAL
local_irq_restore(flags);
-#endif
return NULL;
debug:
if (!alloc_debug_processing(s, c->page, object, addr))
@@ -1887,20 +1873,12 @@ static __always_inline void *slab_alloc(
{
void **object;
struct kmem_cache_cpu *c;
-#ifdef CONFIG_CMPXCHG_LOCAL
unsigned long tid;
-#else
- unsigned long flags;
-#endif

if (slab_pre_alloc_hook(s, gfpflags))
return NULL;

-#ifndef CONFIG_CMPXCHG_LOCAL
- local_irq_save(flags);
-#else
redo:
-#endif

/*
* Must read kmem_cache cpu data via this cpu ptr. Preemption is
@@ -1910,7 +1888,6 @@ redo:
*/
c = __this_cpu_ptr(s->cpu_slab);

-#ifdef CONFIG_CMPXCHG_LOCAL
/*
* The transaction ids are globally unique per cpu and per operation on
* a per cpu queue. Thus they can be guarantee that the cmpxchg_double
@@ -1919,7 +1896,6 @@ redo:
*/
tid = c->tid;
barrier();
-#endif

object = c->freelist;
if (unlikely(!object || !node_match(c, node)))
@@ -1927,7 +1903,6 @@ redo:
object = __slab_alloc(s, gfpflags, node, addr, c);

else {
-#ifdef CONFIG_CMPXCHG_LOCAL
/*
* The cmpxchg will only match if there was no additional
* operation and if we are on the right processor.
@@ -1940,7 +1915,7 @@ redo:
* Since this is without lock semantics the protection is only against
* code executing on this cpu *not* from access by other cpus.
*/
- if (unlikely(!this_cpu_cmpxchg_double(
+ if (unlikely(!irqsafe_cpu_cmpxchg_double(
s->cpu_slab->freelist, s->cpu_slab->tid,
object, tid,
get_freepointer(s, object), next_tid(tid)))) {
@@ -1948,16 +1923,9 @@ redo:
note_cmpxchg_failure("slab_alloc", s, tid);
goto redo;
}
-#else
- c->freelist = get_freepointer(s, object);
-#endif
stat(s, ALLOC_FASTPATH);
}

-#ifndef CONFIG_CMPXCHG_LOCAL
- local_irq_restore(flags);
-#endif
-
if (unlikely(gfpflags & __GFP_ZERO) && object)
memset(object, 0, s->objsize);

@@ -2034,11 +2002,9 @@ static void __slab_free(struct kmem_cach
{
void *prior;
void **object = (void *)x;
-#ifdef CONFIG_CMPXCHG_LOCAL
unsigned long flags;

local_irq_save(flags);
-#endif
slab_lock(page);
stat(s, FREE_SLOWPATH);

@@ -2070,9 +2036,7 @@ checks_ok:

out_unlock:
slab_unlock(page);
-#ifdef CONFIG_CMPXCHG_LOCAL
local_irq_restore(flags);
-#endif
return;

slab_empty:
@@ -2084,9 +2048,7 @@ slab_empty:
stat(s, FREE_REMOVE_PARTIAL);
}
slab_unlock(page);
-#ifdef CONFIG_CMPXCHG_LOCAL
local_irq_restore(flags);
-#endif
stat(s, FREE_SLAB);
discard_slab(s, page);
return;
@@ -2113,20 +2075,11 @@ static __always_inline void slab_free(st
{
void **object = (void *)x;
struct kmem_cache_cpu *c;
-#ifdef CONFIG_CMPXCHG_LOCAL
unsigned long tid;
-#else
- unsigned long flags;
-#endif

slab_free_hook(s, x);

-#ifndef CONFIG_CMPXCHG_LOCAL
- local_irq_save(flags);
-
-#else
redo:
-#endif

/*
* Determine the currently cpus per cpu slab.
@@ -2136,16 +2089,13 @@ redo:
*/
c = __this_cpu_ptr(s->cpu_slab);

-#ifdef CONFIG_CMPXCHG_LOCAL
tid = c->tid;
barrier();
-#endif

if (likely(page == c->page && c->node != NUMA_NO_NODE)) {
set_freepointer(s, object, c->freelist);

-#ifdef CONFIG_CMPXCHG_LOCAL
- if (unlikely(!this_cpu_cmpxchg_double(
+ if (unlikely(!irqsafe_cpu_cmpxchg_double(
s->cpu_slab->freelist, s->cpu_slab->tid,
c->freelist, tid,
object, next_tid(tid)))) {
@@ -2153,16 +2103,10 @@ redo:
note_cmpxchg_failure("slab_free", s, tid);
goto redo;
}
-#else
- c->freelist = object;
-#endif
stat(s, FREE_FASTPATH);
} else
__slab_free(s, page, x, addr);

-#ifndef CONFIG_CMPXCHG_LOCAL
- local_irq_restore(flags);
-#endif
}

void kmem_cache_free(struct kmem_cache *s, void *x)

2011-05-04 14:21:37

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Christoph Lameter wrote:
>
> The above should not happen. If a kernel config indicates that there is no
> cmpxchg16b/cmpxchg8b available then we need to not compile the path that
> uses cmpxchg8b/cmpxchg16b. Guess we need CMPXCHG_DOUBLE_LOCAL or so.

Oh no, you have perfectly fine replacements for that in percpu.h with
irqsafe and preempt safe versions. So why adding another CONFIG from
hell which creates even more ifdef mess ?

The whole CONFIG_CMPXCHG_LOCAL ifdeffery in slub.c is horrible and I
think it's not necessary at all for two reasons:

1) irqsafe_cpu_cmpxchg() can be emulated cheap and we still have the
advantatage of shorter irq disabled sections. And the difference
between the current CONFIG_CMPXCHG_LOCAL=n and using the emulation
macros with the short irqsave/restore around the access is probably
not measurable at all.

2) that would actually test the code under all possible combinations
and not splitting the tester base into x86 and !x86. So that bug
would have been avoided in the first place.

Thanks,

tglx

2011-05-04 14:25:39

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

Hello,

On Wed, May 04, 2011 at 04:10:29PM +0200, Thomas Gleixner wrote:
> > cases where the generic implementation is used. Has anyone measured
> > the difference against before the whole this_cpu conversion?
>
> Yes, that really wants to be done. The whole CMPXCHG_LOCAL ifdeffery
> should have been avoided in the first place. this_cpu_cmpxchg can
> really be implemented with preempt_enable/disable and the irqsafe
> variant in any case.

Yeah, slub code looks pretty scary with the #ifdefs. IIUC, the
problem was that cmpxchg_double is an optimization for fast path which
was already very light weight and an extra locked op or irq on/off
would have made considerable difference.

The cmpxchg_double optimization made the fast path go quite faster
when CPU supports it but it may as well slow things down considerably
if CPU doesn't, due to extra irq on/off's. Anyways, here's hoping
that the slow down is acceptable compared to the base code without
cmpxchg_double and the ugliness can be removed.

Thanks.

--
tejun

2011-05-04 14:35:35

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Tejun Heo wrote:

> The cmpxchg_double optimization made the fast path go quite faster
> when CPU supports it but it may as well slow things down considerably
> if CPU doesn't, due to extra irq on/off's. Anyways, here's hoping
> that the slow down is acceptable compared to the base code without
> cmpxchg_double and the ugliness can be removed.

I think we are all in agreement. First path to fix the percpu macros:

Subject: percpu: Fix irqsafe_cpu_cmpxchg_double

The function in the macro was not updated when the function was given a bool return value.

Signed-off-by: Christoph Lameter <[email protected]>

---
include/linux/percpu.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h 2011-05-04 09:33:08.000000000 -0500
+++ linux-2.6/include/linux/percpu.h 2011-05-04 09:33:39.000000000 -0500
@@ -948,7 +948,7 @@ do { \
irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
# endif
# define irqsafe_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
- __pcpu_double_call_return_int(irqsafe_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2))
+ __pcpu_double_call_return_bool(irqsafe_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2))
#endif

#endif /* __LINUX_PERCPU_H */

2011-05-04 14:36:35

[permalink] [raw]

Subject: [PATCH] slub: Fix the lockless code on 32-bit platforms with no 64-bit cmpxchg

* Ingo Molnar <[email protected]> wrote:

>
> * Thomas Gleixner <[email protected]> wrote:
>
> > Patch below. Ingo, can you test that please ?
>
> Sure - test underway.

Ok, the patch below is looking really good - the test would have crashed on the
bad kernel. I think we can consider the regression fixed:

Acked-and-tested-by: Ingo Molnar <[email protected]>

I'll keep it running some more to make really sure it's fixed, plus it would be
nice if Walter tested your fix as well.

Thanks,

Ingo

----------------->
>From c22bd309573330e33a77c329405d9473fc14f1cb Mon Sep 17 00:00:00 2001
From: Thomas Gleixner <[email protected]>
Date: Wed, 4 May 2011 15:38:19 +0200
Subject: [PATCH] slub: Fix the lockless code on 32-bit platforms with no 64-bit cmpxchg

Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: werner <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Tejun Heo <[email protected]>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1105041539050.3005@ionos
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/percpu.h | 2 +-
mm/slub.c | 4 ++--
2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 3a5c444..8b97308 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -948,7 +948,7 @@ do { \
irqsafe_generic_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
# endif
# define irqsafe_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2) \
- __pcpu_double_call_return_int(irqsafe_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2))
+ __pcpu_double_call_return_bool(irqsafe_cpu_cmpxchg_double_, (pcp1), (pcp2), (oval1), (oval2), (nval1), (nval2))
#endif

#endif /* __LINUX_PERCPU_H */
diff --git a/mm/slub.c b/mm/slub.c
index 94d2a33..9d2e5e4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1940,7 +1940,7 @@ redo:
* Since this is without lock semantics the protection is only against
* code executing on this cpu *not* from access by other cpus.
*/
- if (unlikely(!this_cpu_cmpxchg_double(
+ if (unlikely(!irqsafe_cpu_cmpxchg_double(
s->cpu_slab->freelist, s->cpu_slab->tid,
object, tid,
get_freepointer(s, object), next_tid(tid)))) {
@@ -2145,7 +2145,7 @@ redo:
set_freepointer(s, object, c->freelist);

#ifdef CONFIG_CMPXCHG_LOCAL
- if (unlikely(!this_cpu_cmpxchg_double(
+ if (unlikely(!irqsafe_cpu_cmpxchg_double(
s->cpu_slab->freelist, s->cpu_slab->tid,
c->freelist, tid,
object, next_tid(tid)))) {

2011-05-04 14:42:59

[permalink] [raw]

Subject: Re: [PATCH] slub: Fix the lockless code on 32-bit platforms with no 64-bit cmpxchg

On Wed, 4 May 2011, Ingo Molnar wrote:

> Ok, the patch below is looking really good - the test would have crashed on the
> bad kernel. I think we can consider the regression fixed:
>
> Acked-and-tested-by: Ingo Molnar <[email protected]>
>
> I'll keep it running some more to make really sure it's fixed, plus it would be
> nice if Walter tested your fix as well.

Looks good and is exactly what I have here.

Acked-by: Christoph Lameter <[email protected]>

2011-05-04 14:46:42

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Tejun Heo wrote:

> Hello,
>
> On Wed, May 04, 2011 at 04:10:29PM +0200, Thomas Gleixner wrote:
> > > cases where the generic implementation is used. Has anyone measured
> > > the difference against before the whole this_cpu conversion?
> >
> > Yes, that really wants to be done. The whole CMPXCHG_LOCAL ifdeffery
> > should have been avoided in the first place. this_cpu_cmpxchg can
> > really be implemented with preempt_enable/disable and the irqsafe
> > variant in any case.
>
> Yeah, slub code looks pretty scary with the #ifdefs. IIUC, the
> problem was that cmpxchg_double is an optimization for fast path which
> was already very light weight and an extra locked op or irq on/off
> would have made considerable difference.
>
> The cmpxchg_double optimization made the fast path go quite faster
> when CPU supports it but it may as well slow things down considerably
> if CPU doesn't, due to extra irq on/off's. Anyways, here's hoping

Not really.

CMPXCHG_LOCAL=n

local_irq_save();
handle_everything();
local_irq_restore();

vs.

CMPXCHG_LOCAL=y

do_prep();
local_irq_save();
emulate_cmpxchg();
local_irq_restore();
do_rest();

So you have local irq disable/enable in both cases. So for the case
where you don't have a local cmpxchg8b/16b available it's not worse
versus irq disable/enable than now. It just has the possible repeat
case when stuff changed between the prep and the actual cmpxchg, which
is the same problem when cmpxchg8b/16 is available.

Thanks,

tglx

2011-05-04 15:00:36

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

> So you have local irq disable/enable in both cases. So for the case
> where you don't have a local cmpxchg8b/16b available it's not worse
> versus irq disable/enable than now. It just has the possible repeat
> case when stuff changed between the prep and the actual cmpxchg, which
> is the same problem when cmpxchg8b/16 is available.

Right there is only the tid management that is added. Hope I am fast
enough to at least get one patch in (not very well tested):

Subject: slub: Remove CONFIG_CMPXCHG_LOCAL ifdeffery

Remove the #ifdefs. This means that the irqsafe_cpu_cmpxchg_double() is used
everywhere.

There may be performance implications since:

A. We now have to manage a transaction ID for all arches

B. The interrupt holdoff for arches not supporting CONFIG_CMPXCHG_LOCAL is reduced
to a very short irqoff section.

There are no multiple irqoff/irqon sequences as a result of this change. Even in the fallback
case we only have to do one disable and enable like before.

Signed-off-by: Christoph Lameter <[email protected]>

---
include/linux/slub_def.h | 2 -
mm/slub.c | 56 -----------------------------------------------
2 files changed, 58 deletions(-)

Index: linux-2.6/include/linux/slub_def.h
===================================================================
--- linux-2.6.orig/include/linux/slub_def.h 2011-05-04 09:33:08.000000000 -0500
+++ linux-2.6/include/linux/slub_def.h 2011-05-04 09:42:05.000000000 -0500
@@ -37,9 +37,7 @@ enum stat_item {

struct kmem_cache_cpu {
void **freelist; /* Pointer to next available object */
-#ifdef CONFIG_CMPXCHG_LOCAL
unsigned long tid; /* Globally unique transaction id */
-#endif
struct page *page; /* The slab from which we are allocating */
int node; /* The node of the page (or -1 for debug) */
#ifdef CONFIG_SLUB_STATS
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2011-05-04 09:41:59.000000000 -0500
+++ linux-2.6/mm/slub.c 2011-05-04 09:48:11.000000000 -0500
@@ -1540,7 +1540,6 @@ static void unfreeze_slab(struct kmem_ca
}
}

-#ifdef CONFIG_CMPXCHG_LOCAL
#ifdef CONFIG_PREEMPT
/*
* Calculate the next globally unique transaction for disambiguiation
@@ -1600,17 +1599,12 @@ static inline void note_cmpxchg_failure(
stat(s, CMPXCHG_DOUBLE_CPU_FAIL);
}

-#endif
-
void init_kmem_cache_cpus(struct kmem_cache *s)
{
-#ifdef CONFIG_CMPXCHG_LOCAL
int cpu;

for_each_possible_cpu(cpu)
per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu);
-#endif
-
}
/*
* Remove the cpu slab
@@ -1643,9 +1637,7 @@ static void deactivate_slab(struct kmem_
page->inuse--;
}
c->page = NULL;
-#ifdef CONFIG_CMPXCHG_LOCAL
c->tid = next_tid(c->tid);
-#endif
unfreeze_slab(s, page, tail);
}

@@ -1780,7 +1772,6 @@ static void *__slab_alloc(struct kmem_ca
{
void **object;
struct page *new;
-#ifdef CONFIG_CMPXCHG_LOCAL
unsigned long flags;

local_irq_save(flags);
@@ -1792,7 +1783,6 @@ static void *__slab_alloc(struct kmem_ca
*/
c = this_cpu_ptr(s->cpu_slab);
#endif
-#endif

/* We handle __GFP_ZERO in the caller */
gfpflags &= ~__GFP_ZERO;
@@ -1819,10 +1809,8 @@ load_freelist:
c->node = page_to_nid(c->page);
unlock_out:
slab_unlock(c->page);
-#ifdef CONFIG_CMPXCHG_LOCAL
c->tid = next_tid(c->tid);
local_irq_restore(flags);
-#endif
stat(s, ALLOC_SLOWPATH);
return object;

@@ -1858,9 +1846,7 @@ new_slab:
}
if (!(gfpflags & __GFP_NOWARN) && printk_ratelimit())
slab_out_of_memory(s, gfpflags, node);
-#ifdef CONFIG_CMPXCHG_LOCAL
local_irq_restore(flags);
-#endif
return NULL;
debug:
if (!alloc_debug_processing(s, c->page, object, addr))
@@ -1887,20 +1873,12 @@ static __always_inline void *slab_alloc(
{
void **object;
struct kmem_cache_cpu *c;
-#ifdef CONFIG_CMPXCHG_LOCAL
unsigned long tid;
-#else
- unsigned long flags;
-#endif

if (slab_pre_alloc_hook(s, gfpflags))
return NULL;

-#ifndef CONFIG_CMPXCHG_LOCAL
- local_irq_save(flags);
-#else
redo:
-#endif

/*
* Must read kmem_cache cpu data via this cpu ptr. Preemption is
@@ -1910,7 +1888,6 @@ redo:
*/
c = __this_cpu_ptr(s->cpu_slab);

-#ifdef CONFIG_CMPXCHG_LOCAL
/*
* The transaction ids are globally unique per cpu and per operation on
* a per cpu queue. Thus they can be guarantee that the cmpxchg_double
@@ -1919,7 +1896,6 @@ redo:
*/
tid = c->tid;
barrier();
-#endif

object = c->freelist;
if (unlikely(!object || !node_match(c, node)))
@@ -1927,7 +1903,6 @@ redo:
object = __slab_alloc(s, gfpflags, node, addr, c);

else {
-#ifdef CONFIG_CMPXCHG_LOCAL
/*
* The cmpxchg will only match if there was no additional
* operation and if we are on the right processor.
@@ -1948,16 +1923,9 @@ redo:
note_cmpxchg_failure("slab_alloc", s, tid);
goto redo;
}
-#else
- c->freelist = get_freepointer(s, object);
-#endif
stat(s, ALLOC_FASTPATH);
}

-#ifndef CONFIG_CMPXCHG_LOCAL
- local_irq_restore(flags);
-#endif
-
if (unlikely(gfpflags & __GFP_ZERO) && object)
memset(object, 0, s->objsize);

@@ -2034,11 +2002,9 @@ static void __slab_free(struct kmem_cach
{
void *prior;
void **object = (void *)x;
-#ifdef CONFIG_CMPXCHG_LOCAL
unsigned long flags;

local_irq_save(flags);
-#endif
slab_lock(page);
stat(s, FREE_SLOWPATH);

@@ -2070,9 +2036,7 @@ checks_ok:

out_unlock:
slab_unlock(page);
-#ifdef CONFIG_CMPXCHG_LOCAL
local_irq_restore(flags);
-#endif
return;

slab_empty:
@@ -2084,9 +2048,7 @@ slab_empty:
stat(s, FREE_REMOVE_PARTIAL);
}
slab_unlock(page);
-#ifdef CONFIG_CMPXCHG_LOCAL
local_irq_restore(flags);
-#endif
stat(s, FREE_SLAB);
discard_slab(s, page);
return;
@@ -2113,20 +2075,11 @@ static __always_inline void slab_free(st
{
void **object = (void *)x;
struct kmem_cache_cpu *c;
-#ifdef CONFIG_CMPXCHG_LOCAL
unsigned long tid;
-#else
- unsigned long flags;
-#endif

slab_free_hook(s, x);

-#ifndef CONFIG_CMPXCHG_LOCAL
- local_irq_save(flags);
-
-#else
redo:
-#endif

/*
* Determine the currently cpus per cpu slab.
@@ -2136,15 +2089,12 @@ redo:
*/
c = __this_cpu_ptr(s->cpu_slab);

-#ifdef CONFIG_CMPXCHG_LOCAL
tid = c->tid;
barrier();
-#endif

if (likely(page == c->page && c->node != NUMA_NO_NODE)) {
set_freepointer(s, object, c->freelist);

-#ifdef CONFIG_CMPXCHG_LOCAL
if (unlikely(!irqsafe_cpu_cmpxchg_double(
s->cpu_slab->freelist, s->cpu_slab->tid,
c->freelist, tid,
@@ -2153,16 +2103,10 @@ redo:
note_cmpxchg_failure("slab_free", s, tid);
goto redo;
}
-#else
- c->freelist = object;
-#endif
stat(s, FREE_FASTPATH);
} else
__slab_free(s, page, x, addr);

-#ifndef CONFIG_CMPXCHG_LOCAL
- local_irq_restore(flags);
-#endif
}

void kmem_cache_free(struct kmem_cache *s, void *x)

2011-05-04 15:14:25

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, May 4, 2011 at 8:00 AM, Christoph Lameter <[email protected]> wrote:
>
> Right there is only the tid management that is added. Hope I am fast
> enough to at least get one patch in (not very well tested):

So the thing that worries me about this is non-x86 architectures.

Have we verified that the generic routines are ok for all
architectures? Has somebody checked the memory barriers in particular?
Things that work on x86 may not work on non-x86. Everything should be
per-cpu _except_ for the initialization, I think, but that should be
double-checked.

I guess the initialization happens so early that we don't really need
to worry about it, but I'd still like somebody to really double- and
triple-check it for me.

My gut reaction would be: let's do the minimal patch that just fixes
things to do irqsafe_cpu_cmpxchg_double() for 2.6.39, and then let's
remove the #ifdef'fery in -rc1. Or make _really_ sure that things are
ok for platforms that never even triggered the CMPXCHG_LOCAL case
before.

Hmm?

Linus

2011-05-04 15:21:16

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Christoph Lameter <[email protected]> wrote:

> On Wed, 4 May 2011, Tejun Heo wrote:
>
> > The cmpxchg_double optimization made the fast path go quite faster
> > when CPU supports it but it may as well slow things down considerably
> > if CPU doesn't, due to extra irq on/off's. Anyways, here's hoping
> > that the slow down is acceptable compared to the base code without
> > cmpxchg_double and the ugliness can be removed.
>
> I think we are all in agreement. First path to fix the percpu macros:
>
>
> Subject: percpu: Fix irqsafe_cpu_cmpxchg_double
>
> The function in the macro was not updated when the function was given a bool return value.
>
> Signed-off-by: Christoph Lameter <[email protected]>

Note, the final commit needs to have a proper Reported-by and explanation about
how this was triggered, to fairly credit all the hard work done by Werner ...

Thanks,

Ingo

2011-05-04 15:28:09

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Linus Torvalds wrote:

> On Wed, May 4, 2011 at 8:00 AM, Christoph Lameter <[email protected]> wrote:
> >
> > Right there is only the tid management that is added. Hope I am fast
> > enough to at least get one patch in (not very well tested):
>
> So the thing that worries me about this is non-x86 architectures.
>
> Have we verified that the generic routines are ok for all
> architectures? Has somebody checked the memory barriers in particular?

There are no memory barriers used. The barrier() here is a compiler hint
to not keep data across it. This is dealing with concurrency issues on
the *same* cpu due to preemption and irqs. It avoids the interrupt
disable. There no memory barrier issues for code running on a single cpu.
The cpu is always guaranteed to have a consistent view of the data.

There is an additional patchset that also uses a cmpxchg_double for the
slowpath. There barrier issues may arise because concurrent access is
possible but that is likely to be merged only in 2.6.41. The
cmpxchg_double in that case is a locked operation.

> Things that work on x86 may not work on non-x86. Everything should be
> per-cpu _except_ for the initialization, I think, but that should be
> double-checked.

The fallback path is using interrupt disable / enable. This must be
working on all arches AFAICT.

> My gut reaction would be: let's do the minimal patch that just fixes
> things to do irqsafe_cpu_cmpxchg_double() for 2.6.39, and then let's
> remove the #ifdef'fery in -rc1. Or make _really_ sure that things are
> ok for platforms that never even triggered the CMPXCHG_LOCAL case
> before.
>
> Hmm?

Ok. Pekka can put that patch in for the next round of merges?

2011-05-04 15:45:58

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, May 4, 2011 at 6:28 PM, Christoph Lameter <[email protected]> wrote:
>> My gut reaction would be: let's do the minimal patch that just fixes
>> things to do irqsafe_cpu_cmpxchg_double() for 2.6.39, and then let's
>> remove the #ifdef'fery in -rc1. Or make _really_ sure that things are
>> ok for platforms that never even triggered the CMPXCHG_LOCAL case
>> before.
>>
>> Hmm?
>
> Ok. Pekka can put that patch in for the next round of merges?

Sure. Who's taking care of the minimal fix for .39?

2011-05-04 15:38:29

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, May 4, 2011 at 8:28 AM, Christoph Lameter <[email protected]> wrote:
>
> There are no memory barriers used. The barrier() here is a compiler hint
> to not keep data across it.

Go back and read my email, please.

The lack of memory barriers is exactly what I worry about.

Look at the initialization path. The "tid" thing is _not_ purely cpu-local.

Linus

2011-05-04 15:41:58

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Christoph Lameter wrote:
> Subject: slub: Remove CONFIG_CMPXCHG_LOCAL ifdeffery
>
> Remove the #ifdefs. This means that the irqsafe_cpu_cmpxchg_double() is used
> everywhere.
>
> There may be performance implications since:
>
> A. We now have to manage a transaction ID for all arches
>
> B. The interrupt holdoff for arches not supporting CONFIG_CMPXCHG_LOCAL is reduced
> to a very short irqoff section.
>
> There are no multiple irqoff/irqon sequences as a result of this change. Even in the fallback
> case we only have to do one disable and enable like before.
>
> Signed-off-by: Christoph Lameter <[email protected]>

This doesn't apply cleanly on top of slab/next which has some of your
cleanup patches applied. There's some CONFIG_PREEMPT conflicts so I'd
rather you rediffed it yourself.

Pekka

2011-05-04 15:59:57

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, May 4, 2011 at 8:37 AM, Pekka Enberg <[email protected]> wrote:
>
> Sure. Who's taking care of the minimal fix for .39?

I'll take it. I'm just hoping to also get Werner's tested-by for it.

I'm pretty confident this is it, though, so if I don't get it by the
end of the day I'll just apply it regardless with just a reported-by
from him.

Linus

2011-05-04 16:08:40

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Linus Torvalds wrote:

> On Wed, May 4, 2011 at 8:28 AM, Christoph Lameter <[email protected]> wrote:
> >
> > There are no memory barriers used. The barrier() here is a compiler hint
> > to not keep data across it.
>
> Go back and read my email, please.
>
> The lack of memory barriers is exactly what I worry about.
>
> Look at the initialization path. The "tid" thing is _not_ purely cpu-local.

kmem_cache_create() only hands the pointer to the slab cache back
after the percpu values have been initialized. In order to do an allocation one
needs to have the pointer to the slab cache.

You think that accesses to struct kmem_cache_cpu by other cpus could
require memory barriers? But this is what both SLAB and SLUB already do
today even before these modification. Nor do we have memory barriers for
the same situation in the page allocator.

2011-05-04 16:31:08

[permalink] [raw]

Subject: Re: [PATCH] slub: Fix the lockless code on 32-bit platforms with no 64-bit cmpxchg

* Ingo Molnar <[email protected]> wrote:

> * Ingo Molnar <[email protected]> wrote:
>
> >
> > * Thomas Gleixner <[email protected]> wrote:
> >
> > > Patch below. Ingo, can you test that please ?
> >
> > Sure - test underway.
>
> Ok, the patch below is looking really good - the test would have crashed on the
> bad kernel. I think we can consider the regression fixed:
>
> Acked-and-tested-by: Ingo Molnar <[email protected]>
>
> I'll keep it running some more to make really sure it's fixed, plus it would
> be nice if Walter tested your fix as well.

Stopped the test after 2 hours uptime - i think we can consider the crash
fixed.

Thanks,

Ingo

2011-05-04 16:50:40

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Linus Torvalds <[email protected]> wrote:

> My gut reaction would be: let's do the minimal patch that just fixes things
> to do irqsafe_cpu_cmpxchg_double() for 2.6.39, and then let's remove the
> #ifdef'fery in -rc1. [...]

Looks like the sanest option IMHO, -rc7 is pretty late for anything than a
few-liner patch.

> [...] Or make _really_ sure that things are ok for platforms that never even
> triggered the CMPXCHG_LOCAL case before.

Considering that the status quo was !CMPXCHG_LOCAL in v2.6.38 and that lockless
SLUB is an x86-only affair right now:

$ git grep CMPXCHG_LOCAL arch/
arch/um/Kconfig.x86:config CMPXCHG_LOCAL
arch/x86/Kconfig.cpu:config CMPXCHG_LOCAL

There should be no problem with other architectures, right?

Thanks,

Ingo

2011-05-04 17:12:45

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Ingo Molnar wrote:

>
> * Linus Torvalds <[email protected]> wrote:
>
> > My gut reaction would be: let's do the minimal patch that just fixes things
> > to do irqsafe_cpu_cmpxchg_double() for 2.6.39, and then let's remove the
> > #ifdef'fery in -rc1. [...]
>
> Looks like the sanest option IMHO, -rc7 is pretty late for anything than a
> few-liner patch.

Agreed.

> > [...] Or make _really_ sure that things are ok for platforms that never even
> > triggered the CMPXCHG_LOCAL case before.
>
> Considering that the status quo was !CMPXCHG_LOCAL in v2.6.38 and that lockless
> SLUB is an x86-only affair right now:
>
> $ git grep CMPXCHG_LOCAL arch/
> arch/um/Kconfig.x86:config CMPXCHG_LOCAL
> arch/x86/Kconfig.cpu:config CMPXCHG_LOCAL
>
> There should be no problem with other architectures, right?

No, they use the local_irq disabled path which is pretty much the same
as in 38.

Thanks,

tglx

2011-05-04 18:20:50

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, May 4, 2011 at 8:53 AM, Linus Torvalds
<[email protected]> wrote:
>
> I'll take it. I'm just hoping to also get Werner's tested-by for it.
>
> I'm pretty confident this is it, though, so if I don't get it by the
> end of the day I'll just apply it regardless with just a reported-by
> from him.

So I'm still waiting for the tested-by, but in the meantime I wrote a
changelog. And part of that changelog reads:

[ Btw, that whole "generic code defaults to no protection" design just
sounds stupid - if the code needs no protection, there is no reason to
use "cmpxchg_double" to begin with. So we should probably just remove
the unprotected version entirely as pointless. - Linus ]

which really sums up the whole thing.

The current "this_cpu_cmpxchg_double()" implementation is just
incredibly idiotic. There is absolutely _no_ point to having that
function at all. Why does it exist?

I can kind of see the point of the "preempt" version, although I'm not
entirely convinced of that either. But the notion of having a
"cmpxchg" function that isn't atomic even on a single CPU just makes
me go "f*ck that, whoever wrote that is just a moron".

If the function doesn't need atomicity, you're really better off just
writing it out. It's going to be faster on pretty much all
architectures using just regular load/store instructions.

Linus

2011-05-04 18:49:15

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Linus Torvalds wrote:

>
> So I'm still waiting for the tested-by, but in the meantime I wrote a
> changelog. And part of that changelog reads:
>
> [ Btw, that whole "generic code defaults to no protection" design just
> sounds stupid - if the code needs no protection, there is no reason to
> use "cmpxchg_double" to begin with. So we should probably just remove
> the unprotected version entirely as pointless. - Linus ]

The unprotected version is the __this_cpu_cmpxchg_double? That currently
has no user and could be removed. But all other functions also have __
variants so it was put there for symmetries sake.

> which really sums up the whole thing.
>
> The current "this_cpu_cmpxchg_double()" implementation is just
> incredibly idiotic. There is absolutely _no_ point to having that
> function at all. Why does it exist?

this_cpu_cmpxchg_double() affords protection against preemption but not
irqs. In the allocator case we have to deal with interrupts so its not
useful there. Other code that can guarantee that nothing runs from irq
context can use this function and then the fallback handling on other
arches can avoid disabling and enabling interrupts which is required by
irqsafe_cpu_cmpxchg_double().

> I can kind of see the point of the "preempt" version, although I'm not
> entirely convinced of that either. But the notion of having a
> "cmpxchg" function that isn't atomic even on a single CPU just makes
> me go "f*ck that, whoever wrote that is just a moron".
>
> If the function doesn't need atomicity, you're really better off just
> writing it out. It's going to be faster on pretty much all
> architectures using just regular load/store instructions.

The unlocked cmpxchg8b/16b is quite fast on x86. The __ function allows
the use of cmpxchg8b on x86 and the fallback (and therefore longer) code
on other archs.

But I see the point that __this_cpu_cmpxchg double is pretty useless.
There is no user and so it could be removed.

2011-05-04 19:08:36

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, May 4, 2011 at 11:49 AM, Christoph Lameter <[email protected]> wrote:
>
> The unprotected version is the __this_cpu_cmpxchg_double? That currently
> has no user and could be removed. But all other functions also have __
> variants so it was put there for symmetries sake.

No, I'm talking about the regular "this_cpu_cmpxchg_double" with no underscores.

Why the f*!@ does that exist?

Why the f*%^ do you have to write "irqsafe", when the whole concept
DOES NOT MAKE SENSE without the "irqsafe"?

Why did we have this bug in the first place, in other words? The whole
interface was mis-designed, and pretty much designed to cause bugs.

I think we should remove every single version of
"this_cpu_cmpxchg_double" except for two: the per-cpu one and the
SMP-safe one.

And the per-cpu one doesn't mention "irqsafe" or "preempt" or anything
like that, because the whole function makes no sense except when it is
irq-safe and preempt-safe.

So I think the fact that we need to say "irqsafe" is a bug. Plain and simple.

The whole (and ONLY) point of "cmpxchg" is atomicity.

This is not like "add one to something". That's an operation that
makes sense outside of atomicity issues. But "cmpxchg" is all about
being atomic.

Linus

2011-05-04 19:30:31

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Linus Torvalds wrote:

> On Wed, May 4, 2011 at 11:49 AM, Christoph Lameter <[email protected]> wrote:
> >
> > The unprotected version is the __this_cpu_cmpxchg_double? That currently
> > has no user and could be removed. But all other functions also have __
> > variants so it was put there for symmetries sake.
>
> No, I'm talking about the regular "this_cpu_cmpxchg_double" with no underscores.
>
> Why the f*!@ does that exist?

Because it is useful if the per cpu serialization only requires
safety from preemption during the cmpxchg.

> Why the f*%^ do you have to write "irqsafe", when the whole concept
> DOES NOT MAKE SENSE without the "irqsafe"?

Because that is how the other irq safe this_cpu_xxx functions are named
and I tried to keep it consistent.

> I think we should remove every single version of
> "this_cpu_cmpxchg_double" except for two: the per-cpu one and the
> SMP-safe one.

The smp safe one would be cmpxchg_double() then. This is part of a future
patchset and would be named like other fully atomic ops. Would not be part
of include/linux/percpu.h because it is not a per cpu related function.

> And the per-cpu one doesn't mention "irqsafe" or "preempt" or anything
> like that, because the whole function makes no sense except when it is
> irq-safe and preempt-safe.

It does make sense because arches that do not have the hardware
capabilities must use fallback implementations that can be faster if f.e.
irqs must not be disabled. And this_cpu_xxx ops are usually used in
performance critical paths.

> So I think the fact that we need to say "irqsafe" is a bug. Plain and simple.
>
> The whole (and ONLY) point of "cmpxchg" is atomicity.
>
> This is not like "add one to something". That's an operation that
> makes sense outside of atomicity issues. But "cmpxchg" is all about
> being atomic.

The naming convention came about from the existing this_cpu_xxx
operations (and yes those starting with the problem of the increment). To
do it differently now just for this function would make it more difficult
to comprehend for someone already familiar with this_cpu
operations on per cpu data.

2011-05-04 19:45:48

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, May 4, 2011 at 12:30 PM, Christoph Lameter <[email protected]> wrote:
>
> The naming convention came about from the existing this_cpu_xxx
> operations

You're missing my point.

An "add" operation makes sense even if it isn't atomic, because
atomicity isn't a part of the definition of "add".

But cmpxchg DOES NOT MAKE SENSE without atomicity guarantees.

The whole operation is about atomicity.

Having a version that isn't atomic is STUPID. It's misleading. It's _wrong_.

In contrast, having a non-atomic "add" version is understandable.

So when you say "naming convention", you're missing the much bigger
naming convention. Namely the "cmpxchg" part!

Linus

2011-05-04 20:04:45

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Linus Torvalds wrote:

> On Wed, May 4, 2011 at 12:30 PM, Christoph Lameter <[email protected]> wrote:
> >
> > The naming convention came about from the existing this_cpu_xxx
> > operations
>
> You're missing my point.
>
> An "add" operation makes sense even if it isn't atomic, because
> atomicity isn't a part of the definition of "add".
>
> But cmpxchg DOES NOT MAKE SENSE without atomicity guarantees.

This is not a real cmpxchg after all. Its not atomic in the sense of
other functions. Its only "percpu atomic" if you want it that way. This is
*not* a full cmpxchg_double().

> The whole operation is about atomicity.
>
> Having a version that isn't atomic is STUPID. It's misleading. It's _wrong_.

Its "atomic" in the sense that it is an instruction that is either
executed or not in total and that fact alone allow the avoiding of
synchronization for preemption and interrupts. We just push as much
processing as possible into this single instruction and then we dont have
to worry about preemption or interrupts while this function is
executed by the processor.

> In contrast, having a non-atomic "add" version is understandable.
>
> So when you say "naming convention", you're missing the much bigger
> naming convention. Namely the "cmpxchg" part!

Well this is not really a true cmpxchg. There is no lock prefix.

The semantics of the this_cpu_xxx functions are not atomic but only per
cpu atomic. That per cpu atomicity can require only the exclusion of
preemption or the exclusion of interrupts.

In extreme cases we dont care about preemption or interrupts interfering
with the operation. We just want to opportunistically take advantage of
sophisticated instructions if they are available (f.e. for accurate vm
counters). Or we may have some other external means of serialization (like
lock or we already disabled preemption). Thats what the __ operations are
for.

Maybe I should have pushed the cmpxchg_double() before the
this_cpu_cmpxchg to avoid these misunderstandings

Here is the patch for the fullly atomic cmpxchg_double() which will be
needed for making the non per cpu specific processing lockless later:

Subject: x86: Add support for cmpxchg_double

A simple implementation that only supports the word size and does not
have a fallback mode (would require a spinlock).

And 32 and 64 bit support for cmpxchg_double. cmpxchg double uses
the cmpxchg8b or cmpxchg16b instruction on x86 processors to compare
and swap 2 machine words. This allows lockless algorithms to move more
context information through critical sections.

Set a flag CONFIG_CMPXCHG_DOUBLE to signal the support of that feature
during kernel builds.

Signed-off-by: Christoph Lameter <[email protected]>

---
arch/x86/Kconfig.cpu | 3 ++
arch/x86/include/asm/cmpxchg_32.h | 46 ++++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/cmpxchg_64.h | 45 +++++++++++++++++++++++++++++++++++++
arch/x86/include/asm/cpufeature.h | 1
4 files changed, 95 insertions(+)

Index: linux-2.6/arch/x86/include/asm/cmpxchg_64.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/cmpxchg_64.h 2011-04-13 15:19:53.000000000 -0500
+++ linux-2.6/arch/x86/include/asm/cmpxchg_64.h 2011-04-15 13:14:45.000000000 -0500
@@ -151,4 +151,49 @@ extern void __cmpxchg_wrong_size(void);
cmpxchg_local((ptr), (o), (n)); \
})

+#define cmpxchg16b(ptr, o1, o2, n1, n2) \
+({ \
+ char __ret; \
+ __typeof__(o2) __junk; \
+ __typeof__(*(ptr)) __old1 = (o1); \
+ __typeof__(o2) __old2 = (o2); \
+ __typeof__(*(ptr)) __new1 = (n1); \
+ __typeof__(o2) __new2 = (n2); \
+ asm volatile(LOCK_PREFIX_HERE "lock; cmpxchg16b (%%rsi);setz %1" \
+ : "=d"(__junk), "=a"(__ret) \
+ : "S"(ptr), "b"(__new1), "c"(__new2), \
+ "a"(__old1), "d"(__old2)); \
+ __ret; })
+
+
+#define cmpxchg16b_local(ptr, o1, o2, n1, n2) \
+({ \
+ char __ret; \
+ __typeof__(o2) __junk; \
+ __typeof__(*(ptr)) __old1 = (o1); \
+ __typeof__(o2) __old2 = (o2); \
+ __typeof__(*(ptr)) __new1 = (n1); \
+ __typeof__(o2) __new2 = (n2); \
+ asm volatile("cmpxchg16b (%%rsi)\n\t\tsetz %1\n\t" \
+ : "=d"(__junk)_, "=a"(__ret) \
+ : "S"((ptr)), "b"(__new1), "c"(__new2), \
+ "a"(__old1), "d"(__old2)); \
+ __ret; })
+
+#define cmpxchg_double(ptr, o1, o2, n1, n2) \
+({ \
+ BUILD_BUG_ON(sizeof(*(ptr)) != 8); \
+ VM_BUG_ON((unsigned long)(ptr) % 16); \
+ cmpxchg16b((ptr), (o1), (o2), (n1), (n2)); \
+})
+
+#define cmpxchg_double_local(ptr, o1, o2, n1, n2) \
+({ \
+ BUILD_BUG_ON(sizeof(*(ptr)) != 8); \
+ VM_BUG_ON((unsigned long)(ptr) % 16); \
+ cmpxchg16b_local((ptr), (o1), (o2), (n1), (n2)); \
+})
+
+#define system_has_cmpxchg_double() cpu_has_cx16
+
#endif /* _ASM_X86_CMPXCHG_64_H */
Index: linux-2.6/arch/x86/include/asm/cmpxchg_32.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/cmpxchg_32.h 2011-04-13 15:19:53.000000000 -0500
+++ linux-2.6/arch/x86/include/asm/cmpxchg_32.h 2011-04-15 13:14:45.000000000 -0500
@@ -280,4 +280,50 @@ static inline unsigned long cmpxchg_386(

#endif

+#define cmpxchg8b(ptr, o1, o2, n1, n2) \
+({ \
+ char __ret; \
+ __typeof__(o2) __dummy; \
+ __typeof__(*(ptr)) __old1 = (o1); \
+ __typeof__(o2) __old2 = (o2); \
+ __typeof__(*(ptr)) __new1 = (n1); \
+ __typeof__(o2) __new2 = (n2); \
+ asm volatile(LOCK_PREFIX_HERE "lock; cmpxchg8b (%%esi); setz %1"\
+ : "d="(__dummy), "=a" (__ret) \
+ : "S" ((ptr)), "a" (__old1), "d"(__old2), \
+ "b" (__new1), "c" (__new2) \
+ : "memory"); \
+ __ret; })
+
+
+#define cmpxchg8b_local(ptr, o1, o2, n1, n2) \
+({ \
+ char __ret; \
+ __typeof__(o2) __dummy; \
+ __typeof__(*(ptr)) __old1 = (o1); \
+ __typeof__(o2) __old2 = (o2); \
+ __typeof__(*(ptr)) __new1 = (n1); \
+ __typeof__(o2) __new2 = (n2); \
+ asm volatile("cmpxchg8b (%%esi); tsetz %1" \
+ : "d="(__dummy), "=a"(__ret) \
+ : "S" ((ptr)), "a" (__old), "d"(__old2), \
+ "b" (__new1), "c" (__new2), \
+ : "memory"); \
+ __ret; })
+
+
+#define cmpxchg_double(ptr, o1, o2, n1, n2) \
+({ \
+ BUILD_BUG_ON(sizeof(*(ptr)) != 4); \
+ VM_BUG_ON((unsigned long)(ptr) % 8); \
+ cmpxchg8b((ptr), (o1), (o2), (n1), (n2)); \
+})
+
+#define cmpxchg_double_local(ptr, o1, o2, n1, n2) \
+({ \
+ BUILD_BUG_ON(sizeof(*(ptr)) != 4); \
+ VM_BUG_ON((unsigned long)(ptr) % 8); \
+ cmpxchg16b_local((ptr), (o1), (o2), (n1), (n2)); \
+})
+
#endif /* _ASM_X86_CMPXCHG_32_H */
Index: linux-2.6/arch/x86/Kconfig.cpu
===================================================================
--- linux-2.6.orig/arch/x86/Kconfig.cpu 2011-04-13 15:19:53.000000000 -0500
+++ linux-2.6/arch/x86/Kconfig.cpu 2011-04-15 13:14:45.000000000 -0500
@@ -308,6 +308,9 @@ config X86_CMPXCHG
config CMPXCHG_LOCAL
def_bool X86_64 || (X86_32 && !M386)

+config CMPXCHG_DOUBLE
+ def_bool X86_64 || (X86_32 && !M386)
+
config X86_L1_CACHE_SHIFT
int
default "7" if MPENTIUM4 || MPSC
Index: linux-2.6/arch/x86/include/asm/cpufeature.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/cpufeature.h 2011-04-15 12:51:51.000000000 -0500
+++ linux-2.6/arch/x86/include/asm/cpufeature.h 2011-04-15 13:14:45.000000000 -0500
@@ -286,6 +286,7 @@ extern const char * const x86_power_flag
#define cpu_has_hypervisor boot_cpu_has(X86_FEATURE_HYPERVISOR)
#define cpu_has_pclmulqdq boot_cpu_has(X86_FEATURE_PCLMULQDQ)
#define cpu_has_perfctr_core boot_cpu_has(X86_FEATURE_PERFCTR_CORE)
+#define cpu_has_cx16 boot_cpu_has(X86_FEATURE_CX16)

#if defined(CONFIG_X86_INVLPG) || defined(CONFIG_X86_64)
# define cpu_has_invlpg 1

2011-05-04 20:24:34

by Valdis Klētnieks

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 04 May 2011 15:04:39 CDT, Christoph Lameter said:
> On Wed, 4 May 2011, Linus Torvalds wrote:

> > But cmpxchg DOES NOT MAKE SENSE without atomicity guarantees.
>
> This is not a real cmpxchg after all. Its not atomic in the sense of
> other functions. Its only "percpu atomic" if you want it that way. This is
> *not* a full cmpxchg_double().

Calling it a cmpxchg when it doesn't have the primary distinguishing property
of a hardware cmpxchg is just loading a bullet in the chamber and inviting
kernel hackers to point it at their feet...

Attachments:

(No filename) (227.00 B)

2011-05-04 20:33:03

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, [email protected] wrote:

> On Wed, 04 May 2011 15:04:39 CDT, Christoph Lameter said:
> > On Wed, 4 May 2011, Linus Torvalds wrote:
>
> > > But cmpxchg DOES NOT MAKE SENSE without atomicity guarantees.
> >
> > This is not a real cmpxchg after all. Its not atomic in the sense of
> > other functions. Its only "percpu atomic" if you want it that way. This is
> > *not* a full cmpxchg_double().
>
> Calling it a cmpxchg when it doesn't have the primary distinguishing property
> of a hardware cmpxchg is just loading a bullet in the chamber and inviting
> kernel hackers to point it at their feet...

It does have most of the distinguishing characterstics but the
lock-prefixless cmpxchg8b/16b (which is quite fast) is used in a
unique way here in a percpu fastdpath. Thats why we have the strange
naming this_cpu_cmpxchg_double etc.

2011-05-04 20:49:56

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Christoph Lameter <[email protected]> wrote:

> > Calling it a cmpxchg when it doesn't have the primary distinguishing
> > property of a hardware cmpxchg is just loading a bullet in the chamber and
> > inviting kernel hackers to point it at their feet...
>
> It does have most of the distinguishing characterstics but the
> lock-prefixless cmpxchg8b/16b (which is quite fast) [...]

6 cycles for CMPXCHG8B versus 19 cycles for LOCK CMPXCHG8B, on Nehalem.

And note that it's not really true that the LOCK prefix-less CMPXCHG8B is not
atomic: the write is atomic if the target word is naturally aligned, on i586
and upwards ...

So in practice, unless the SLUB variables are misaligned, the lock prefix-less
CMPXCHG8B *IS* atomic.

Non-atomicity is a special case of a weird special case.

> [...] is used in a unique way here in a percpu fastdpath. Thats why we have
> the strange naming this_cpu_cmpxchg_double etc.

Furthermore, we never used cmpxchg in the kernel without expecting atomicity.

Uniquely strange, unintuitive naming == invitation for strange bugs.

And guess what, we had a strange bug here. Can you possibly see any connection?

Thanks,

Ingo

2011-05-04 21:07:18

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, May 4, 2011 at 1:04 PM, Christoph Lameter <[email protected]> wrote:
>
> Maybe I should have pushed the cmpxchg_double() before the
> this_cpu_cmpxchg to avoid these misunderstandings

Christoph, the only mis-understanding is yours.

I understand perfectly that the percpu cmpxchg isn't SMP-atomic. We
all know that.

The problem is that cmpxchg *IS*NOT*AN*CMPXCHG*AT*ALL*. Not even on UP.

It's something totally different.

Linus

2011-05-04 21:20:25

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, May 4, 2011 at 2:06 PM, Linus Torvalds
<[email protected]> wrote:
>
> The problem is that cmpxchg *IS*NOT*AN*CMPXCHG*AT*ALL*. Not even on UP.

Btw, the really sad thing is that as far as I can tell, the
cmpxchg64_local() function that we define actually gets this *right*,
with a helper function that actually works (and disables interrupts)
and all the alternative_io() stuff to then use the right instruction
automatically if the CPU supports it, rather than making it a
hardcoded decision.

But we don't use it, because the "percpu_cmpxchg8b_double()" code
rolls its own inferior version, and has slightly different semantics
(ie the whole "return value success" thing).

So close, but yet so far.

Linus

2011-05-04 21:41:04

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Wed, 4 May 2011, Linus Torvalds wrote:

> On Wed, May 4, 2011 at 2:06 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > The problem is that cmpxchg *IS*NOT*AN*CMPXCHG*AT*ALL*. Not even on UP.
>
> Btw, the really sad thing is that as far as I can tell, the
> cmpxchg64_local() function that we define actually gets this *right*,
> with a helper function that actually works (and disables interrupts)
> and all the alternative_io() stuff to then use the right instruction
> automatically if the CPU supports it, rather than making it a
> hardcoded decision.
>
> But we don't use it, because the "percpu_cmpxchg8b_double()" code
> rolls its own inferior version, and has slightly different semantics
> (ie the whole "return value success" thing).

Yeah, I tripped over that as well. And the new shiny extra
CONFIG_CMPXCHG_DOUBLE misfeature Christoph nailed together is just
ignoring it as well. It's just more useless #ifdef and CONFIG bloat.

That whole this_cpu_* stuff seems to be designed by committee. A quick
grep shows that max. 10% of the implemented macro hell is actually
used. Can we get rid of all unused ones and bring them back when they
are actually required?

Thanks,

tglx

2011-05-04 21:56:22

by Ben Greear

[permalink] [raw]

Subject: Re: [PATCH] slub: Fix the lockless code on 32-bit platforms with no 64-bit cmpxchg

On 05/04/2011 07:36 AM, Ingo Molnar wrote:

>> From c22bd309573330e33a77c329405d9473fc14f1cb Mon Sep 17 00:00:00 2001
> From: Thomas Gleixner<[email protected]>
> Date: Wed, 4 May 2011 15:38:19 +0200
> Subject: [PATCH] slub: Fix the lockless code on 32-bit platforms with no 64-bit cmpxchg

This patch appears to fix the crashes I was seeing on my 32-bit
Atom system.

It would crash within 30 seconds of booting every time before this patch...

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2011-05-04 22:01:25

[permalink] [raw]

Subject: Re: [PATCH] slub: Fix the lockless code on 32-bit platforms with no 64-bit cmpxchg

On Wed, May 4, 2011 at 2:52 PM, Ben Greear <[email protected]> wrote:
>
> This patch appears to fix the crashes I was seeing on my 32-bit
> Atom system.
>
> It would crash within 30 seconds of booting every time before this patch...

Oh, well - I already committed it and pushed out, otherwise I'd have
added that piece of information to the commit log.

But it's good to know that others had seen this too, just never
realized what was going on.

Btw, that does seem to imply that your kernel config is somewhat odd.
We _should_ be using cmpxchg8b natively if you compile for anything
newer than PPro, and that includes atom.

Did you perhaps say "compile for Pentium" (which is pretty close to
Atom in some respects - but we don't trust that all Pentium-class
CPU's have cmpxchg8b, even if the Intel ones all should).

Regardless, it's pushed out and should be in current -git, apart from
any possible mirroring delays there may be that may b eholdin git up
from actually being visible on the public kernel.org machines yet.

Linus

2011-05-04 22:25:20

by Ben Greear

[permalink] [raw]

Subject: Re: [PATCH] slub: Fix the lockless code on 32-bit platforms with no 64-bit cmpxchg

On 05/04/2011 03:00 PM, Linus Torvalds wrote:
> On Wed, May 4, 2011 at 2:52 PM, Ben Greear<[email protected]> wrote:
>>
>> This patch appears to fix the crashes I was seeing on my 32-bit
>> Atom system.
>>
>> It would crash within 30 seconds of booting every time before this patch...
>
> Oh, well - I already committed it and pushed out, otherwise I'd have
> added that piece of information to the commit log.

No worries...I'm just happy to finally be able to boot .39 :)

> But it's good to know that others had seen this too, just never
> realized what was going on.
>
> Btw, that does seem to imply that your kernel config is somewhat odd.
> We _should_ be using cmpxchg8b natively if you compile for anything
> newer than PPro, and that includes atom.
>
> Did you perhaps say "compile for Pentium" (which is pretty close to
> Atom in some respects - but we don't trust that all Pentium-class
> CPU's have cmpxchg8b, even if the Intel ones all should).

Well, yes. I'm compiling for 'M586' it seems, plus SMP, pre-empt, etc.

Maybe it's time to move to a newer processor family!

Thanks,
Ben

--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com

2011-05-05 06:46:47

by werner

[permalink] [raw]

Subject: Re: 2.6.39-rc5-git2 boot crashs

Just now I compiled 2.6.39-rc6-git1 (including still
the patchs by hand because not yet included), its also OK,
good running, without all these problems. :)

Yes, I continue with my everythingyesconfig, which all the
time and also now
after this patchs gives good generic kernels for the
distro, working good on my
server and on the laptops of the neighbours except that
sometimes
(instead of ELAN) I'll make 486 binaries. And now put the
package on the mirrors because everything is OK, and I
dont need to switch on and off the server so that they can
sync, and the users can use the now working packages of
2.6.39-rcX ,
which have many new drivers and other good improvements.

Like since years, I continue to compile new versions since
-rc1 and report problems.
But it was very important that Linus take care of these
problems now, because
nobody else searched for the reasons. These bugs in logfs
and in slub.c and percpu.h
probably were processor-type / arch independent, and would
have given problems
for the most users, so that they could not have been
ignored.
wl

=======================================================
On Thu, 5 May 2011 08:02:04 +0200
Ingo Molnar <[email protected]> wrote:
>
> * werner <[email protected]> wrote:
>
>> I'll test the patchs with a 486 config, switching off
>>ELAN.
>
> I think you can skip all those other test suggestions in
>this thread - as you
> confirmed in this email the root cause of your crashes
>has been found.
>
> So the best testing you could do would be for you to
>resume testing whatever
> original config you were using on your system, and make
>sure there's really no
> crashes left - and report any other regressions you
>might see.
>
> Thanks,
>
> Ingo
>
>

"werner" <[email protected]>
---
Professional hosting for everyone - http://www.host.ru

2011-05-05 09:54:28

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

Hey,

On Wed, May 04, 2011 at 11:40:33PM +0200, Thomas Gleixner wrote:
> Yeah, I tripped over that as well. And the new shiny extra
> CONFIG_CMPXCHG_DOUBLE misfeature Christoph nailed together is just
> ignoring it as well. It's just more useless #ifdef and CONFIG bloat.
>
> That whole this_cpu_* stuff seems to be designed by committee. A quick
> grep shows that max. 10% of the implemented macro hell is actually
> used. Can we get rid of all unused ones and bring them back when they
> are actually required?

What happened there was more of lack of design rather than too much
design. At first it were a few ops with only two variants -
preemption protected default one and __ prefixed unprotected one.
Then came the irqsafe ops and then the whole kinda exploded on itself
with all the different ops.

Cleaning up percpu accessors has been on my todo list for some time
now. The this_cpu_*() thing is still in its infancy and there aren't
many users and the existing ones are mostly in the core, so cleaning
things up should be relatively easy. Things which I've been
considering are...

1. Static and dynamic accessors.

We have two sets of percpu accessors. Originally, one set was for
static percpu variables and the other for dynamic ones. Since the
percpu allocator reimplementation, there's no distinction between
static and dynamic and the new this_cpu_*() functions don't
distinguish them. We need to unify the duplicate ones. There's
nothing much to decide about this. It just needs to be done.

2. this_cpu_*()

This one is more tricky. General cleanup aside...

this_cpu_*() ops currently come in three flavors and the naming
convention is rather confusing. __this_cpu for no protection,
this_cpu for preemtion disabled and irqsafe_cpu for irq protected.
IIUC it was intended to loosely match the naming convention of
spinlocks but I'm not sure whether that's beneficial in this case.

The biggest problem, as shown by this bug, is that it's error-prone.
Percpu ops as they currently stand don't have lockdep protection for
accessing contexts. There's nothing protecting percpu vars being
accessed from wrong contexts. I think adding lockdep protection
should be doable and should be done, but it would be much better if
it's safer by default.

The problem is aggravated by the fact that, on x86 where most of
developement and testing happens, there's no difference between
__this_cpu_*(), this_cpu_*() and irqsafe_cpu_*(). They're one and the
same, so testing coverage suffers. We can avoid this by adding a
debug option which forces the generic versions whether the arch ones
are there or not, but it does raise the question - do we really need
all these different confusing variants?

For x86, it doesn't matter. There are some corner cases with
cmpxchg_[double] but I don't think we would be in any trouble just
using the generic ones for them. The problem is with other archs
where local atomic ops haven't been or, for most load-store
architectures, can't be implemented. We might say "eh... screw them"
and let them use the generic ones but the problem is that this_cpu_*()
tend to be used in extremely hot paths where extra irq on/off's can
show up as significant overhead.

The thing that I want gone the most is the irqsafe_ variant. It would
be much nicer/safer if this_cpu_*() is atomic against everything. If
the caller is gonna take care of exclusion from the enclosing area
(this is a valid use case when there's a series of operations to be
done including but not limited to multiple this_cpu ops), it can use
__this_cpu_*() ones.

The reason to disable preemtion instead of IRQ is two-fold - First,
preemption can be disabled for longer period than IRQ. Second,
depending on CPU, toggling preemption is significantly cheaper than
toggling preemption. For this_cpu ops, the first one doesn't apply.
It's always very short, so the only thing that needs to be considered
is the overhead of toggling protection.

So, to avoid suffering performance penalty from this_cpu_*()
conversion, architectures can choose one of the followings.

1. Implement arch specific this_cpu_*() ops. This would be the better
way but unfortunately many architectures simply can't. :-(

2. Make irq toggling as cheap as preemption toggling. This can be
achieved by implementing IRQ masking in software. I played with it
a bit on x86 last year. I didn't get to finish it and it would
have taken me some time to iron out weird failures but it didn't
seem too difficult and as irq on/off is quite expensive on a lot of
CPUs, this might bring overall performance benefit.

For many archs, #2 would be the only choice and if we're gonna do that
I think it would be best to do it on x86 too. It involves changes to
common code too and x86 has the highest test/development coverage.

I don't feel brave enough to remove preemption protected ones without
first somehow taking care of non-x86 archs. The preemption disabled
ones are used for statistics in net, block, irq and fs and simply
switching to irq protected ones is likely to noticeably hurt many
non-x86 archs. Maybe the ones which really matter can implement at
least _add() and we can get away without doing soft irq masking.

Anyways, that's what I've been thinking. I'll get to it in the next
devel cycle or the one after that. What do you guys think about soft
irq masking idea?

Thanks.

--
tejun

2011-05-05 10:18:57

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Tejun Heo <[email protected]> wrote:

> 2. Make irq toggling as cheap as preemption toggling. This can be
> achieved by implementing IRQ masking in software. I played with it
> a bit on x86 last year. I didn't get to finish it and it would
> have taken me some time to iron out weird failures but it didn't
> seem too difficult and as irq on/off is quite expensive on a lot of
> CPUs, this might bring overall performance benefit.
>
> For many archs, #2 would be the only choice and if we're gonna do that I
> think it would be best to do it on x86 too. It involves changes to common
> code too and x86 has the highest test/development coverage.

We played with this in -rt on and off but note that -rt doesnt do this right
now. Interestingly, most of the irq-disable wrappery and state tracking code
for that is upstream already, via the lockdep irq state tracking patches.
(Surprise! :-)

The disadvantages:

- register pressure increases, the pushf+cli+popf sequence has no register
side-effects, while soft flags inevitably disturb register allocations.
*possibly* quite low with the modern percpu implementation, but this has
to be measured very carefuly, with disassembly.

- icache size increases - the percpu ops are larger than the minimal
pushf+cli+popfl sequence. Again, this too has to be measured both via
vmlinux size analysis and via perf stat --repeat icache pressure runs.

[ This is also an assymetric cost: it increases the cost of the cache-cold
case, while most of the benefits are in the cache-hot case. ]

- irq replay becomes common and there's extra cost due to that. Also we are
not ready to replay some types of irqs (lapic timer), at least with current
code. So there's some ongoing maintenance cost there.

The benefits are:

- lockdep is already tracking irqs on/off sections rather carefully, so we know
all the places that play with irqs and the ongoing maintenance cost is
shared with what we'd have to do with lockdep anyway.

- on Nehalem a "PUSHF; CLI; POPF" sequence is 18 cycles, a soft sequence would
be more like 2 cycles. So we win around 15 cycles per sequence in the fast
path - minus collateral slowpath cost above ... which are not directly
comparable.

- Stock mainline would become a truly hard RT irq handlers kernel which never
ever disables hardirqs. Big wow factor and precision guided, laser mounted
sharks!

I probably missed a few factors, but these are the main concerns.

My firm judgement: "Dunno".

Thanks,

Ingo

2011-05-05 10:45:42

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Thu, 5 May 2011, Tejun Heo wrote:
> 2. Make irq toggling as cheap as preemption toggling. This can be
> achieved by implementing IRQ masking in software. I played with it
> a bit on x86 last year. I didn't get to finish it and it would
> have taken me some time to iron out weird failures but it didn't
> seem too difficult and as irq on/off is quite expensive on a lot of
> CPUs, this might bring overall performance benefit.
>
> For many archs, #2 would be the only choice and if we're gonna do that
> I think it would be best to do it on x86 too. It involves changes to
> common code too and x86 has the highest test/development coverage.
>
> I don't feel brave enough to remove preemption protected ones without
> first somehow taking care of non-x86 archs. The preemption disabled
> ones are used for statistics in net, block, irq and fs and simply
> switching to irq protected ones is likely to noticeably hurt many
> non-x86 archs. Maybe the ones which really matter can implement at
> least _add() and we can get away without doing soft irq masking.

There it's fine when we have certain use cases which just need
preemption protection, but the function name should be clear about it
and we should only have the functions which are actually required
instead of every possible flavour of _OP.

> Anyways, that's what I've been thinking. I'll get to it in the next
> devel cycle or the one after that. What do you guys think about soft
> irq masking idea?

It's doable, we played around with that in rt already and put it aside
again :)

You need to do the check at the vector entry and then reinject the
vector on local_irq_enable/restore(), so that's mostly hackery in the
ASM entry code which includes to fixup the pushed flags so the reti
wont reeneable interrupts.

That's halfways easy to do on x86, not sure about other architectures,
and you have to carefully weigh the actual benefit against the
overhead in the slow path case plus the inevitable hackery you need in
each architecture implementation.

No strong opinion, but definitely worthwhile to toy with it :)

Thanks,

tglx

2011-05-05 18:20:23

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Thu, 5 May 2011, Tejun Heo wrote:

> Anyways, that's what I've been thinking. I'll get to it in the next
> devel cycle or the one after that. What do you guys think about soft
> irq masking idea?

Great idea. Would make the whole irq on/off business much cheaper.

2011-05-05 19:14:19

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

* Christoph Lameter <[email protected]> wrote:

> On Thu, 5 May 2011, Tejun Heo wrote:
>
> > Anyways, that's what I've been thinking. I'll get to it in the next
> > devel cycle or the one after that. What do you guys think about soft
> > irq masking idea?
>
> Great idea. Would make the whole irq on/off business much cheaper.

The tradeoffs are *not at all* clear and the result (on x86) is not
'much cheaper', at all ...

In particular the irq-enable path gets complicated by the need to check the
flag and call a hardirq handling function in that case - a far cry from the
single-byte POPF instruction. It will be somewhat cheaper cycle-wise - but the
code gets bloated, so the instruction cache impact has to be measured
carefully. (See my other mail for details.)

There's also the fact that PUSHF+CLI+POPF sequence has been getting cheaper all
the time with newer hardware generations. CLI+STI is even cheaper, 10 cycles
both on Intel and AMD CPUs. So it's an optimization that might get narrower and
narrower with every CPU generation.

Thanks,

Ingo

2011-05-05 19:53:52

by werner

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Thu, 5 May 2011 11:54:21 +0200
Tejun Heo <[email protected]> wrote:
> Hey,
>
> On Wed, May 04, 2011 at 11:40:33PM +0200, Thomas
>Gleixner wrote:
>> Yeah, I tripped over that as well. And the new shiny
>>extra
>> CONFIG_CMPXCHG_DOUBLE misfeature Christoph nailed
>>together is just
>> ignoring it as well. It's just more useless #ifdef and
>>CONFIG bloat.
>>
>> That whole this_cpu_* stuff seems to be designed by
>>committee. A quick
>> grep shows that max. 10% of the implemented macro hell
>>is actually
>> used. Can we get rid of all unused ones and bring them
>>back when they
>> are actually required?
>
> What happened there was more of lack of design rather
>than too much
> design.

........

From the sight of an user, not of an programmer, my
opinion about more and more config options and the design
is:

As the 1st step, for can be compiled generic kernels
at all, the kernel should have (and has) the ability, to
discover at run-time , what hardware the individual user
has, and to use only the corresponding kernel subroutines.
F.ex. if subroutines for ELAN or MOORESTON were compiled
also, then the kernel ignore them simply, if on run-time
it discovers that the user has a 486 computer.

Then, in a 2nd step considering this, any
configuration of the arch is becoming obsolete and should
be reduced so much as possible. It was useful in the
time of papertapes and punched cards to can pick out what
one need, but nowaday we have enough memory that the
kernel can be compiled ALWAYS including everything, and on
run-time the kernel discovers and use only what's present.
For the case if users in poorer countries have a 486
with 8 MB storage (long time ago I used Slackware on such
computers), one should CAN alternatively choice to compile
and deliver also small kernels. But as default - or
spoken more generally if is compiled a huge kernel - then
at run-time it simply should ignore what's absent and used
what's present. This means, in this general case, ANY
user configuration of hardware is obsolete, and as default
should be compiled everything (and into the kernel, not as
module, in-/output devices which nowadays can happen as
boot devices inclusive for an installation). One should
reduce the configuration to a very few number of basical
items - such as, with/without PEA; low-memory system
(that parameter is the only what matters for make
impossible an huge everythingyes kernel); slow system.

There is a very good analogon to this. Some distros
have a very complicated configurator, difficult for
advanced user and impossible to install Linux for
beginners. F.ex., SUSE has an elefantous poissen-green
installer, with thousands of options and very
mis-understandable, in what I (trying to help someone two
weeks ago) was unable to create an install/root partition
without to be sure not to loose the existing partition.
This is a complete misconcept, hinders people to use
Linux, put users in risk by a mistake loose his existing
partition/files, and should be dumped completely !!!
Principally, the installation can be automatical and
thus have to be automatical; little adaptions the user
can make later in the KDE control center. My distro,
intended to be maximal easy, has no interactive installer
at all, and produces a very good working system in 99% of
the cases, full-automatically. For example, for what
any partition configuration ?? The 1:4 LZMA-packed iso,
unfold, gives a 20 GB system. Thus, the install program
can search alone where to find 20 GB, without any question
to the user. It searches on the primary, then on the
secondary hard disk for 20GB unpartioned space; if not
found, then it resizes and/or moves existing partitions to
get 20 GB free, and then install the sistem. If really all
disks are full with files, it goes in the rescue system
with mc , and the user should delete some videos etc and
restart the instellation. Even the keyboard language,
system language, user name, internet id/passwords are
found and installed automatically - the install program
reads them out from any (in 99% existing) pre-existing
Windows, Linux or Minix installation on the computer.

Thus, when I read many comments here in the kernel
threads, I feel that the most kernel configs should be
removed and the kernel should find out itself at run time
what hardware is existing. Also, the functions of things
like udev and hal should be automatized. It is a sick
concept, that the kernel itself find all hardware, but one
need an externel program like hal to interprete the kernel
messages and install the hardware, what the kernel can and
should do itself. Each kernel version simply should
include an updated table, of existing/known hardware and
their most adequade Linux drivers, and then, during
identifying hardware at boot or hotplug, set up everything
to use it.

The kernel is becoming more and more complicated, and
before people loosing themselves in details, one should
make it most easy possible, for the users and for the
maintainers, and throw out unnecessary and too complicated
or misconcepted things. Because at the same time more and
more new hardware surges, and also the kernel config
becomes more and more big, there is no alternative and
should be started most early possible, to automatize and
improve the kernel's runtime autoconfig and
auto-hardware-managing habilities. And REDUCE THE
HUMAN-DEPENDING PRE-CONFIG TO THE MINIMUM POSSIBLE. This
satisfies also that Linux is tecnically good, but have to
become more easy for the use-by-anybody. This should be
considered for the whole politics/direction in what is
going the kernel development.

---
Professional hosting for everyone - http://www.host.ru

2011-05-05 20:09:09

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Thu, 5 May 2011, werner wrote:

> As the 1st step, for can be compiled generic kernels at all, the kernel
> should have (and has) the ability, to discover at run-time , what hardware the
> individual user has, and to use only the corresponding kernel subroutines.
> F.ex. if subroutines for ELAN or MOORESTON were compiled also, then the kernel
> ignore them simply, if on run-time it discovers that the user has a 486
> computer.

Yes indeed the kernel can detect that. And the code has fallback for
the case that the processor flags indicate that cmpxchg16b is not supported.

However, in this case the kernel configuration at build time was set in
such a way (!CMPXCHG64 support but CMPXCHG_LOCAL) that generic fallback
functions were used at compile time instead of the x86 assembly that can
do the fallback with run time detection. Thus the code to do the fallback
was not compiled in. Frankly, I was not aware that such a case existed
where one could disable cmpxchg64 in this way and was expecting that the
runtime detection would always be compiled in for the CMPXCHG_LOCAL case.

2011-05-05 21:12:16

by werner

[permalink] [raw]

Subject: Re: [block IO crash] Re: 2.6.39-rc5-git2 boot crashs

On Thu, 5 May 2011 15:09:04 -0500 (CDT)
Christoph Lameter <[email protected]> wrote:
> On Thu, 5 May 2011, werner wrote:
>
>> As the 1st step, for can be compiled generic kernels
>>at all, the kernel
>> should have (and has) the ability, to discover at
>>run-time , what hardware the
>> individual user has, and to use only the corresponding
>>kernel subroutines.
>> F.ex. if subroutines for ELAN or MOORESTON were compiled
>>also, then the kernel
>> ignore them simply, if on run-time it discovers that the
>>user has a 486
>> computer.
>
> Yes indeed the kernel can detect that. And the code has
>fallback for
> the case that the processor flags indicate that
>cmpxchg16b is not supported.
>
> However, in this case the kernel configuration at build
>time was set in
> such a way (!CMPXCHG64 support but CMPXCHG_LOCAL) that
>generic fallback
> functions were used at compile time instead of the x86
>assembly that can
> do the fallback with run time detection. Thus the code
>to do the fallback
> was not compiled in. Frankly, I was not aware that such
>a case existed
> where one could disable cmpxchg64 in this way and was
>expecting that the
> runtime detection would always be compiled in for the
>CMPXCHG_LOCAL case.
>
>
Thus, things gone wrong because it's only half-consequent.
One should compile everything, and at runtime the kernel
itself detect the existing hardware and decides/uses just
what it need.

My (and many other users) mind on configuration is very
simple: I switch everything on, so that the kernel has
everything available, and in run-time it uses just what's
existing on the individual computer of the user.

Just if there occured an human configuration error, so
that something at run-time wasnt available, then one
should improve the above explained manner of use, reducing
human need/influence, compiling everything, and improving
the kernel's runtime detection and use/selection of its
adequade modules.

I saw plenty people (specially, beginners) loosing all
their files by overformating or repartitioning the
harddisk within a too-complicated installer. Many
interactive installers are not good understandable. And
many simple users don't know nothing about harddisks,
partitions, etc. You know all details inside your
washmachine, and you need to know it for use it ?? Linux
has to work also for stupid people. Thus, the
partitioning / formatting process of the installation is a
good example what CAN and HAS TO BE full-automatized, in
order to reduce human errors. If a typical distro is 5 GB
or 20 GB big, then the installer can and has to manage it,
silently and alone, to find or desocupy this space (and
more if possible) anywhere on a harddisk, silently and by
itself, so that it installs problemless for beginners
(while 'specialists' before the installation can desocupy
a certain space of the harddisk which then the installer
will find and use).

90% of the config options, users don't know for what they
are. Instead of more and more new config features in,
EVERY CONFIG PARAMETER WHAT THE KERNEL CAN DETECT AT RUN
TIME (at the beginning of booting) SHOULD BE THROWED OUT.
And the kernel should be compiled 'everythingyes' as
default. Even then, with all exotic (but nowadays
happening) boot-necessary input/output hardware included,
my vmlinuz is only 10 MB big. Sufficient memory for run
this (together with an initrd for an installer) nowadays
should exist on almost all computers. All other, not-boot
necessary modules are perhaps 50 MB, and since the Atari
time with 20 MB harddisks passed, there is NO REASON FOR
COMPILE SMALL THE KERNEL. Even on Android phones,
compiling the kernel / modules 50 MB big but using often
only a part, dont hurt. But for emergency, instead of
completely configless, one could let 2 options for the
kernel config: normal, small (below 16 M memory).
Provisorically, one could but one should not include
such a raw-selection menu in the kernel config, because to
check and maintain the dependences of their various single
config parameters would be big extra work. Instead of
this, one should rigorously make more and more config
parameters definitively obsolete and reduce them by
improving the corresponding runtime ability of the kernel.
---
Professional hosting for everyone - http://www.host.ru

2011-05-05 22:27:44