2011-04-17 12:52:20

by Rafael J. Wysocki

Subject: 2.6.39-rc3-git7: Reported regressions from 2.6.38

This message contains a list of some regressions from 2.6.38,
for which there are no fixes in the mainline known to the tracking team.
If any of them have been fixed already, please let us know.

If you know of any other unresolved regressions from 2.6.38, please let us
know as well and we'll add them to the list. Also, please let us know
if any of the entries below are invalid.

Each entry from the list will be sent additionally in an automatic reply
to this message with CCs to the people involved in reporting and handling
the issue.


Listed regressions statistics:

  Date          Total  Pending  Unresolved
  ----------------------------------------
  2011-04-17       17       11          10


Unresolved regressions
----------------------

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33342
Subject : [2.6.39-rc2][bisected] Constant DISK_MEDIA_CHANGE_EVENTS from CDROM drive.
Submitter : Shaun Ruffell <[email protected]>
Date : 2011-04-08 20:15 (10 days old)
Message-ID : <[email protected]>
References : http://marc.info/?l=linux-kernel&m=130229371907209&w=2


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33272
Subject : drm related hard-hang
Submitter : Peter Teoh <[email protected]>
Date : 2011-04-14 01:29 (4 days old)


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33242
Subject : Lockdep splat in autofs with 2.6.39-rc2
Submitter : Nick Bowler <[email protected]>
Date : 2011-04-07 19:44 (11 days old)
Message-ID : <[email protected]>
References : http://marc.info/?l=linux-kernel&m=130220545614682&w=2


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33142
Subject : 2.6.39-rc2 regression: X201s fails to resume b77dcf8460ae57d4eb9fd3633eb4f97b8fb20716
Submitter : Keith Packard <[email protected]>
Date : 2011-04-06 7:44 (12 days old)
Message-ID : <[email protected]>
References : http://marc.info/?l=linux-kernel&m=130207593728273&w=2


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33102
Subject : File's copied from client->linux server only copy 1st 64K data;rest is lost
Submitter : Linda Walsh <[email protected]>
Date : 2011-04-11 22:12 (7 days old)


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33092
Subject : [regression] 2.6.39-rc1 - Beagleboard usbnet broken
Submitter : Mark Jackson <[email protected]>
Date : 2011-04-04 9:22 (14 days old)
First-Bad-Commit: http://git.kernel.org/linus/087809fce28f50098d9c3ef1a6865c722f23afd2
Message-ID : <[email protected]>
References : http://marc.info/?l=linux-kernel&m=130191386508831&w=2


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=32982
Subject : Kernel locks up a few minutes after boot
Submitter : Bart Van Assche <[email protected]>
Date : 2011-04-10 19:55 (8 days old)


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=32902
Subject : 2.6.39-rc1 doesn't boot on thinkpad t61p x86_64
Submitter : Alex Romosan <[email protected]>
Date : 2011-04-03 19:41 (15 days old)
Message-ID : <[email protected]>
References : http://marc.info/?l=linux-kernel&m=130186054431678&w=2


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=32892
Subject : 2.6.39-rc1 data corruption with rtorrent
Submitter : Jindrich Makovicka <[email protected]>
Date : 2011-04-02 20:21 (16 days old)
Message-ID : <20110402222118.3b5c2fa8@holly>
References : http://marc.info/?l=linux-kernel&m=130177570309226&w=2


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=32262
Subject : 2.6.38-git15+ IDE hangs boot
Submitter : Pete Clements <[email protected]>
Date : 2011-03-25 15:38 (24 days old)
Message-ID : <[email protected]>
References : http://marc.info/?l=linux-kernel&m=130106749313695&w=2


Regressions with patches
------------------------

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33252
Subject : [regression 2.6.39-rc2][bisected] "perf, x86: P4 PMU - Read proper MSR register to catch" and NMIs
Submitter : Shaun Ruffell <[email protected]>
Date : 2011-04-06 22:30 (12 days old)
First-Bad-Commit: http://git.kernel.org/linus/242214f9c1eeaae40eca11e3b4d37bfce960a7cd
Message-ID : <[email protected]>
References : http://marc.info/?l=linux-kernel&m=130212907032580&w=2
Handled-By : Don Zickus <[email protected]>
Patch : http://cache.gmane.org//gmane/linux/kernel/1125621-001.bin


For details, please visit the bug entries and follow the links given in
references.

As you can see, there is a Bugzilla entry for each of the listed regressions.
There also is a Bugzilla entry used for tracking the regressions from 2.6.38,
unresolved as well as resolved, at:

http://bugzilla.kernel.org/show_bug.cgi?id=32012

Please let the tracking team know if there are any Bugzilla entries that
should be added to the list in there.

Thanks!
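
The entries in this report follow a simple "Field : value" layout with a blank line between entries, so they are easy to process mechanically. The following is a minimal sketch in Python, not the tracking team's own tooling: `parse_entries` and `SAMPLE` are hypothetical names, and the sample is abridged from two of the entries above.

```python
def parse_entries(text):
    """Split a report into entries; each entry maps field name -> value."""
    entries, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current entry, if one is open.
            if current:
                entries.append(current)
                current = {}
            continue
        # Split on the first colon only, so URLs and subjects that
        # contain colons stay intact in the value.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip()
    if current:
        entries.append(current)
    return entries


# Abridged copy of two entries from the report (Submitter lines omitted).
SAMPLE = """\
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33272
Subject : drm related hard-hang
Date : 2011-04-14 01:29 (4 days old)

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33092
Subject : [regression] 2.6.39-rc1 - Beagleboard usbnet broken
Date : 2011-04-04 9:22 (14 days old)
First-Bad-Commit: http://git.kernel.org/linus/087809fce28f50098d9c3ef1a6865c722f23afd2
"""

entries = parse_entries(SAMPLE)
bisected = [e["Bug-Entry"] for e in entries if "First-Bad-Commit" in e]
print(len(entries), "entries;", len(bisected), "with a first bad commit")
```

Splitting on the first colon also copes with "First-Bad-Commit:", which has no space before the separator.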


2011-04-17 12:52:30

by Rafael J. Wysocki

Subject: [Bug #32262] 2.6.38-git15+ IDE hangs boot

This message has been generated automatically as a part of a summary report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.38. Please verify if it still should be listed and let the tracking team
know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=32262
Subject : 2.6.38-git15+ IDE hangs boot
Submitter : Pete Clements <[email protected]>
Date : 2011-03-25 15:38 (24 days old)
Message-ID : <[email protected]>
References : http://marc.info/?l=linux-kernel&m=130106749313695&w=2

2011-04-17 12:57:08

by Rafael J. Wysocki

Subject: [Bug #32982] Kernel locks up a few minutes after boot

This message has been generated automatically as a part of a summary report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.38. Please verify if it still should be listed and let the tracking team
know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=32982
Subject : Kernel locks up a few minutes after boot
Submitter : Bart Van Assche <[email protected]>
Date : 2011-04-10 19:55 (8 days old)

2011-04-17 12:57:12

by Rafael J. Wysocki

Subject: [Bug #33102] File's copied from client->linux server only copy 1st 64K data;rest is lost

This message has been generated automatically as a part of a summary report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.38. Please verify if it still should be listed and let the tracking team
know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33102
Subject : File's copied from client->linux server only copy 1st 64K data;rest is lost
Submitter : Linda Walsh <[email protected]>
Date : 2011-04-11 22:12 (7 days old)

2011-04-17 12:57:16

by Rafael J. Wysocki

Subject: [Bug #33272] drm related hard-hang

This message has been generated automatically as a part of a summary report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.38. Please verify if it still should be listed and let the tracking team
know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33272
Subject : drm related hard-hang
Submitter : Peter Teoh <[email protected]>
Date : 2011-04-14 01:29 (4 days old)

2011-04-17 12:57:25

by Rafael J. Wysocki

Subject: [Bug #33142] 2.6.39-rc2 regression: X201s fails to resume b77dcf8460ae57d4eb9fd3633eb4f97b8fb20716

This message has been generated automatically as a part of a summary report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.38. Please verify if it still should be listed and let the tracking team
know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33142
Subject : 2.6.39-rc2 regression: X201s fails to resume b77dcf8460ae57d4eb9fd3633eb4f97b8fb20716
Submitter : Keith Packard <[email protected]>
Date : 2011-04-06 7:44 (12 days old)
Message-ID : <[email protected]>
References : http://marc.info/?l=linux-kernel&m=130207593728273&w=2

2011-04-17 12:57:32

by Rafael J. Wysocki

Subject: [Bug #33342] [2.6.39-rc2][bisected] Constant DISK_MEDIA_CHANGE_EVENTS from CDROM drive.

This message has been generated automatically as a part of a summary report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.38. Please verify if it still should be listed and let the tracking team
know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33342
Subject : [2.6.39-rc2][bisected] Constant DISK_MEDIA_CHANGE_EVENTS from CDROM drive.
Submitter : Shaun Ruffell <[email protected]>
Date : 2011-04-08 20:15 (10 days old)
Message-ID : <[email protected]>
References : http://marc.info/?l=linux-kernel&m=130229371907209&w=2

2011-04-17 12:58:04

by Rafael J. Wysocki

Subject: [Bug #32902] 2.6.39-rc1 doesn't boot on thinkpad t61p x86_64

This message has been generated automatically as a part of a summary report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.38. Please verify if it still should be listed and let the tracking team
know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=32902
Subject : 2.6.39-rc1 doesn't boot on thinkpad t61p x86_64
Submitter : Alex Romosan <[email protected]>
Date : 2011-04-03 19:41 (15 days old)
Message-ID : <[email protected]>
References : http://marc.info/?l=linux-kernel&m=130186054431678&w=2

2011-04-17 12:58:10

by Rafael J. Wysocki

Subject: [Bug #33092] [regression] 2.6.39-rc1 - Beagleboard usbnet broken

This message has been generated automatically as a part of a summary report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.38. Please verify if it still should be listed and let the tracking team
know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33092
Subject : [regression] 2.6.39-rc1 - Beagleboard usbnet broken
Submitter : Mark Jackson <[email protected]>
Date : 2011-04-04 9:22 (14 days old)
First-Bad-Commit: http://git.kernel.org/linus/087809fce28f50098d9c3ef1a6865c722f23afd2
Message-ID : <[email protected]>
References : http://marc.info/?l=linux-kernel&m=130191386508831&w=2

2011-04-17 12:58:15

by Rafael J. Wysocki

Subject: [Bug #33252] [regression 2.6.39-rc2][bisected] "perf, x86: P4 PMU - Read proper MSR register to catch" and NMIs

This message has been generated automatically as a part of a summary report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.38. Please verify if it still should be listed and let the tracking team
know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33252
Subject : [regression 2.6.39-rc2][bisected] "perf, x86: P4 PMU - Read proper MSR register to catch" and NMIs
Submitter : Shaun Ruffell <[email protected]>
Date : 2011-04-06 22:30 (12 days old)
First-Bad-Commit: http://git.kernel.org/linus/242214f9c1eeaae40eca11e3b4d37bfce960a7cd
Message-ID : <[email protected]>
References : http://marc.info/?l=linux-kernel&m=130212907032580&w=2
Handled-By : Don Zickus <[email protected]>
Patch : http://cache.gmane.org//gmane/linux/kernel/1125621-001.bin

2011-04-17 12:58:28

by Rafael J. Wysocki

Subject: [Bug #32892] 2.6.39-rc1 data corruption with rtorrent

This message has been generated automatically as a part of a summary report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.38. Please verify if it still should be listed and let the tracking team
know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=32892
Subject : 2.6.39-rc1 data corruption with rtorrent
Submitter : Jindrich Makovicka <[email protected]>
Date : 2011-04-02 20:21 (16 days old)
Message-ID : <20110402222118.3b5c2fa8@holly>
References : http://marc.info/?l=linux-kernel&m=130177570309226&w=2

2011-04-17 12:58:39

by Rafael J. Wysocki

Subject: [Bug #33242] Lockdep splat in autofs with 2.6.39-rc2

This message has been generated automatically as a part of a summary report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.38. Please verify if it still should be listed and let the tracking team
know (either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33242
Subject : Lockdep splat in autofs with 2.6.39-rc2
Submitter : Nick Bowler <[email protected]>
Date : 2011-04-07 19:44 (11 days old)
Message-ID : <[email protected]>
References : http://marc.info/?l=linux-kernel&m=130220545614682&w=2

2011-04-17 13:05:58

by Cyrill Gorcunov

Subject: Re: [Bug #33252] [regression 2.6.39-rc2][bisected] "perf, x86: P4 PMU - Read proper MSR register to catch" and NMIs

On 04/17/2011 04:57 PM, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a summary report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.38. Please verify if it still should be listed and let the tracking team
> know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33252
> Subject : [regression 2.6.39-rc2][bisected] "perf, x86: P4 PMU - Read proper MSR register to catch" and NMIs
> Submitter : Shaun Ruffell <[email protected]>
> Date : 2011-04-06 22:30 (12 days old)
> First-Bad-Commit: http://git.kernel.org/linus/242214f9c1eeaae40eca11e3b4d37bfce960a7cd
> Message-ID : <[email protected]>
> References : http://marc.info/?l=linux-kernel&m=130212907032580&w=2
> Handled-By : Don Zickus <[email protected]>
> Patch : http://cache.gmane.org//gmane/linux/kernel/1125621-001.bin
>
>

We're working on it, patch is almost done. I guess it'll be published next week.

--
Cyrill

2011-04-17 13:16:55

by Pete Clements

Subject: Re: [Bug #32262] 2.6.38-git15+ IDE hangs boot

>
> The following bug entry is on the current list of known regressions
> from 2.6.38. Please verify if it still should be listed and let the tracking team
> know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=32262
> Subject : 2.6.38-git15+ IDE hangs boot
> Submitter : Pete Clements <[email protected]>
> Date : 2011-03-25 15:38 (24 days old)
> Message-ID : <[email protected]>
> References : http://marc.info/?l=linux-kernel&m=130106749313695&w=2
>
>

I no longer experience the problem. Don't recall when the fix was integrated
(post git19?). (Currently at 39-rc3-git6.)
--
Pete Clements

2011-04-17 13:28:29

by Rafael J. Wysocki

Subject: Re: [Bug #32262] 2.6.38-git15+ IDE hangs boot

On Sunday, April 17, 2011, Pete Clements wrote:
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.38. Please verify if it still should be listed and let the tracking team
> > know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=32262
> > Subject : 2.6.38-git15+ IDE hangs boot
> > Submitter : Pete Clements <[email protected]>
> > Date : 2011-03-25 15:38 (24 days old)
> > Message-ID : <[email protected]>
> > References : http://marc.info/?l=linux-kernel&m=130106749313695&w=2
> >
> >
>
> I no longer experience the problem. Don't recall when the fix was integrated
> (post git19?). (Currently at 39-rc3-git6.)

Thanks, closing.

Rafael

2011-04-17 13:31:10

by Rafael J. Wysocki

Subject: Re: [Bug #33252] [regression 2.6.39-rc2][bisected] "perf, x86: P4 PMU - Read proper MSR register to catch" and NMIs

On Sunday, April 17, 2011, Cyrill Gorcunov wrote:
> On 04/17/2011 04:57 PM, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.38. Please verify if it still should be listed and let the tracking team
> > know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=33252
> > Subject : [regression 2.6.39-rc2][bisected] "perf, x86: P4 PMU - Read proper MSR register to catch" and NMIs
> > Submitter : Shaun Ruffell <[email protected]>
> > Date : 2011-04-06 22:30 (12 days old)
> >
> >
>
> We're working on it, patch is almost done. I guess it'll be published next week.

Great, thanks! Please let me know when the patch makes it into Linus' tree.

Rafael

2011-04-17 17:04:26

by Linus Torvalds

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

Is this machine running a RAID5 setup or something like that?

There is a known interaction with the new block layer plugging code
and MD. The "hung task" report in that bugzilla looks very much like
that issue. And you do have "root=/dev/md0", so clearly there's some
md thing going on.

And bisecting might not work all that well for it, because I suspect
it ends up being very much a matter of IO patterns how it triggers.

Neil supposedly has a patch for it, but I haven't seen it yet. Neil, Jens?

Linus

On Sun, Apr 17, 2011 at 5:57 AM, Rafael J. Wysocki <[email protected]> wrote:
> This message has been generated automatically as a part of a summary report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.38. Please verify if it still should be listed and let the tracking team
> know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=32982
> Subject : Kernel locks up a few minutes after boot
> Submitter : Bart Van Assche <[email protected]>
> Date : 2011-04-10 19:55 (8 days old)
>
>
>

2011-04-17 18:38:07

by Bart Van Assche

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On Sun, Apr 17, 2011 at 7:03 PM, Linus Torvalds
<[email protected]> wrote:
> On Sun, Apr 17, 2011 at 5:57 AM, Rafael J. Wysocki <[email protected]> wrote:
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.38. Please verify if it still should be listed and let the tracking team
> > know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=32982
> > Subject : Kernel locks up a few minutes after boot
> > Submitter : Bart Van Assche <[email protected]>
> > Date : 2011-04-10 19:55 (8 days old)
>
> Is this machine running a RAID5 setup or something like that?
>
> There is a known interaction with the new block layer plugging code
> and MD. The "hung task" report in that bugzilla looks very much like
> that issue. And you do have "root=/dev/md0", so clearly there's some
> md thing going on.
>
> And bisecting might not work all that well for it, because I suspect
> it ends up being very much a matter of IO patterns how it triggers.
>
> Neil supposedly has a patch for it, but I haven't seen it yet. Neil, Jens?

(converted top-posting into bottom-posting)

Hello Linus,

On the system on which bug #32982 has been triggered md0, md1 and md2
have been configured as two-disk RAID1 (mirroring).

I've done my best to trigger enough I/O in order to obtain reliable
bisect results. A difficulty I encountered during bisecting though was
that I encountered unbootable kernels (all skipped revisions).

Bart.

2011-04-17 21:07:31

by NeilBrown

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On Sun, 17 Apr 2011 20:37:39 +0200 Bart Van Assche <[email protected]> wrote:

> On Sun, Apr 17, 2011 at 7:03 PM, Linus Torvalds
> <[email protected]> wrote:
> > On Sun, Apr 17, 2011 at 5:57 AM, Rafael J. Wysocki <[email protected]> wrote:
> > > This message has been generated automatically as a part of a summary report
> > > of recent regressions.
> > >
> > > The following bug entry is on the current list of known regressions
> > > > from 2.6.38. Please verify if it still should be listed and let the tracking team
> > > > know (either way).
> > > >
> > > >
> > > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=32982
> > > > Subject : Kernel locks up a few minutes after boot
> > > > Submitter : Bart Van Assche <[email protected]>
> > > > Date : 2011-04-10 19:55 (8 days old)
> >
> > Is this machine running a RAID5 setup or something like that?
> >
> > There is a known interaction with the new block layer plugging code
> > and MD. The "hung task" report in that bugzilla looks very much like
> > that issue. And you do have "root=/dev/md0", so clearly there's some
> > md thing going on.
> >
> > And bisecting might not work all that well for it, because I suspect
> > it ends up being very much a matter of IO patterns how it triggers.
> >
> > Neil supposedly has a patch for it, but I haven't seen it yet. Neil, Jens?
>
> (converted top-posting into bottom-posting)
>
> Hello Linus,
>
> On the system on which bug #32982 has been triggered md0, md1 and md2
> have been configured as two-disk RAID1 (mirroring).

If any of those have write-intent bitmaps then I definitely know what the
problem is and I'll be posting patches later today (probably not much later).

If not .. then I'm less sure but it would certainly be worth testing after
applying the promised fixes.

NeilBrown


>
> I've done my best to trigger enough I/O in order to obtain reliable
> bisect results. A difficulty I encountered during bisecting though was
> that I encountered unbootable kernels (all skipped revisions).
>
> Bart.

2011-04-17 22:21:12

by NeilBrown

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On Mon, 18 Apr 2011 07:07:11 +1000 NeilBrown <[email protected]> wrote:

> On Sun, 17 Apr 2011 20:37:39 +0200 Bart Van Assche <[email protected]> wrote:
>
> > On Sun, Apr 17, 2011 at 7:03 PM, Linus Torvalds
> > <[email protected]> wrote:
> > > On Sun, Apr 17, 2011 at 5:57 AM, Rafael J. Wysocki <[email protected]> wrote:
> > > > This message has been generated automatically as a part of a summary report
> > > > of recent regressions.
> > > >
> > > > The following bug entry is on the current list of known regressions
> > > > > from 2.6.38. Please verify if it still should be listed and let the tracking team
> > > > > know (either way).
> > > > >
> > > > >
> > > > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=32982
> > > > > Subject : Kernel locks up a few minutes after boot
> > > > > Submitter : Bart Van Assche <[email protected]>
> > > > > Date : 2011-04-10 19:55 (8 days old)
> > >
> > > Is this machine running a RAID5 setup or something like that?
> > >
> > > There is a known interaction with the new block layer plugging code
> > > and MD. The "hung task" report in that bugzilla looks very much like
> > > that issue. And you do have "root=/dev/md0", so clearly there's some
> > > md thing going on.
> > >
> > > And bisecting might not work all that well for it, because I suspect
> > > it ends up being very much a matter of IO patterns how it triggers.
> > >
> > > Neil supposedly has a patch for it, but I haven't seen it yet. Neil, Jens?
> >
> > (converted top-posting into bottom-posting)
> >
> > Hello Linus,
> >
> > On the system on which bug #32982 has been triggered md0, md1 and md2
> > have been configured as two-disk RAID1 (mirroring).
>
> If any of those have write-intent bitmaps then I definitely know what the
> problem is and I'll be posting patches later today (probably not much later).
>

Actually it won't be today. The new block device plugging is still unusable
for MD - so I won't be able to fix this until that gets sorted out.

NeilBrown

2011-04-18 11:44:51

by Jens Axboe

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On 2011-04-17 20:37, Bart Van Assche wrote:
> On Sun, Apr 17, 2011 at 7:03 PM, Linus Torvalds
> <[email protected]> wrote:
>> On Sun, Apr 17, 2011 at 5:57 AM, Rafael J. Wysocki <[email protected]> wrote:
>>> This message has been generated automatically as a part of a summary report
>>> of recent regressions.
>>>
>>> The following bug entry is on the current list of known regressions
>>> from 2.6.38. Please verify if it still should be listed and let the tracking team
>>> know (either way).
>>>
>>>
>>> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=32982
>>> Subject : Kernel locks up a few minutes after boot
>>> Submitter : Bart Van Assche <[email protected]>
>>> Date : 2011-04-10 19:55 (8 days old)
>>
>> Is this machine running a RAID5 setup or something like that?
>>
>> There is a known interaction with the new block layer plugging code
>> and MD. The "hung task" report in that bugzilla looks very much like
>> that issue. And you do have "root=/dev/md0", so clearly there's some
>> md thing going on.
>>
>> And bisecting might not work all that well for it, because I suspect
>> it ends up being very much a matter of IO patterns how it triggers.
>>
>> Neil supposedly has a patch for it, but I haven't seen it yet. Neil, Jens?
>
> (converted top-posting into bottom-posting)
>
> Hello Linus,
>
> On the system on which bug #32982 has been triggered md0, md1 and md2
> have been configured as two-disk RAID1 (mirroring).
>
> I've done my best to trigger enough I/O in order to obtain reliable
> bisect results. A difficulty I encountered during bisecting though was
> that I encountered unbootable kernels (all skipped revisions).

Bart, can you try and pull:

git://git.kernel.dk/linux-2.6-block.git for-linus

into Linus' tree and see if that works? This has, among other things,
Neil's fixes for MD.

--
Jens Axboe

2011-04-18 15:34:49

by Alex Romosan

Subject: Re: [Bug #32902] 2.6.39-rc1 doesn't boot on thinkpad t61p x86_64

"Rafael J. Wysocki" <[email protected]> writes:

> This message has been generated automatically as a part of a summary report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.38. Please verify if it still should be listed and let the
> tracking team
> know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=32902
> Subject : 2.6.39-rc1 doesn't boot on thinkpad t61p x86_64
> Submitter : Alex Romosan <[email protected]>
> Date : 2011-04-03 19:41 (15 days old)
> Message-ID : <[email protected]>
> References : http://marc.info/?l=linux-kernel&m=130186054431678&w=2
>

my laptop works again with 2.6.39-rc3 (except the boot process hangs at
waiting for /dev to be populated.... if i hit Ctrl-C then the booting
proceeds normally. this doesn't happen with 2.6.38) so probably this bug
can be closed.

--alex--

--
| I believe the moment is at hand when, by a paranoiac and active |
| advance of the mind, it will be possible (simultaneously with |
| automatism and other passive states) to systematize confusion |
| and thus to help to discredit completely the world of reality. |

2011-04-18 18:21:23

by Bart Van Assche

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On Mon, Apr 18, 2011 at 1:44 PM, Jens Axboe <[email protected]> wrote:
> Bart, can you try and pull:
>
> git://git.kernel.dk/linux-2.6-block.git for-linus
>
> into Linus' tree and see if that works? This has, among other things,
> Neil's fixes for MD.

md seems to be stable with the resulting tree, but it looks like there is
a performance regression in the block layer not related to the md
issue. If I run a small block IOPS test on a block device created by
ib_srp (NOOP scheduler) I see about 11% less IOPS than with 2.6.38.3
(155.000 IOPS with 2.6.38.3 and 140.000 IOPS with 2.6.39-rc3+).

Bart.
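
For reference, the size of the drop implied by these two figures can be checked with a short calculation. This is a minimal sketch in Python, assuming the "." in "155.000" and "140.000" is a thousands separator; the variable names are made up for illustration.

```python
# IOPS figures quoted in the message above; the "." in "155.000" is
# assumed to be a thousands separator.
iops_2_6_38_3 = 155_000
iops_2_6_39_rc3 = 140_000

lost = iops_2_6_38_3 - iops_2_6_39_rc3

# Drop relative to the old kernel's result.
drop_vs_old = 100.0 * lost / iops_2_6_38_3
# The same difference expressed relative to the new kernel's result.
gain_vs_new = 100.0 * lost / iops_2_6_39_rc3

print(f"2.6.39-rc3+ delivers {drop_vs_old:.1f}% fewer IOPS than 2.6.38.3")
print(f"2.6.38.3 delivers {gain_vs_new:.1f}% more IOPS than 2.6.39-rc3+")
```

Measured against the 2.6.38.3 result the drop is a little under 10%; the "about 11%" figure matches the difference taken relative to the 2.6.39-rc3+ result.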

2011-04-18 18:28:59

by Jens Axboe

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On 2011-04-18 20:21, Bart Van Assche wrote:
> On Mon, Apr 18, 2011 at 1:44 PM, Jens Axboe <[email protected]> wrote:
>> Bart, can you try and pull:
>>
>> git://git.kernel.dk/linux-2.6-block.git for-linus
>>
>> into Linus' tree and see if that works? This has, among other things,
>> Neil's fixes for MD.
>
> md seems to be stable with the resulting tree, but it looks like there is

OK, that's the most important bit.

> a performance regression in the block layer not related to the md
> issue. If I run a small block IOPS test on a block device created by
> ib_srp (NOOP scheduler) I see about 11% less IOPS than with 2.6.38.3
> (155.000 IOPS with 2.6.38.3 and 140.000 IOPS with 2.6.39-rc3+).

That's not good. What's the test case?

--
Jens Axboe

2011-04-18 18:33:06

by Bart Van Assche

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On Mon, Apr 18, 2011 at 8:28 PM, Jens Axboe <[email protected]> wrote:
> On 2011-04-18 20:21, Bart Van Assche wrote:
>> a performance regression in the block layer not related to the md
>> issue. If I run a small block IOPS test on a block device created by
>> ib_srp (NOOP scheduler) I see about 11% less IOPS than with 2.6.38.3
>> (155.000 IOPS with 2.6.38.3 and 140.000 IOPS with 2.6.39-rc3+).
>
> That's not good. What's the test case?

Nothing more than a fio IOPS test:

fio --bs=512 --ioengine=libaio --buffered=0 --rw=read --thread
--iodepth=64 --numjobs=2 --loops=10000 --group_reporting --size=1G
--gtod_reduce=1 --name=iops-test --filename=/dev/${dev} --invalidate=1

Bart.
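
To repeat this test on both kernels with identical options, the command can be assembled once and reused. The following is a rough sketch in Python, not part of the original report: `fio_iops_command` is a hypothetical helper and "sdX" is a placeholder device name; only the fio options themselves come from the message above.

```python
import shlex


def fio_iops_command(dev):
    """Return the fio argument list from the message for /dev/<dev>."""
    return [
        "fio",
        "--bs=512",            # 512-byte requests
        "--ioengine=libaio",   # Linux native asynchronous I/O
        "--buffered=0",        # bypass the page cache
        "--rw=read",           # sequential reads
        "--thread",            # run jobs as threads
        "--iodepth=64",        # up to 64 requests in flight per job
        "--numjobs=2",
        "--loops=10000",
        "--group_reporting",   # one combined result for both jobs
        "--size=1G",
        "--gtod_reduce=1",     # cut down on timekeeping overhead
        "--name=iops-test",
        f"--filename=/dev/{dev}",
        "--invalidate=1",
    ]


# "sdX" is a placeholder; substitute the ib_srp block device under test.
print(shlex.join(fio_iops_command("sdX")))
```

The sketch only prints the command line. Running it against a real device is left to the tester, since it requires fio and access to the raw block device.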

2011-04-18 18:39:03

by Jens Axboe

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On 2011-04-18 20:32, Bart Van Assche wrote:
> On Mon, Apr 18, 2011 at 8:28 PM, Jens Axboe <[email protected]> wrote:
>> On 2011-04-18 20:21, Bart Van Assche wrote:
>>> a performance regression in the block layer not related to the md
>>> issue. If I run a small block IOPS test on a block device created by
>>> ib_srp (NOOP scheduler) I see about 11% less IOPS than with 2.6.38.3
>>> (155.000 IOPS with 2.6.38.3 and 140.000 IOPS with 2.6.39-rc3+).
>>
>> That's not good. What's the test case?
>
> Nothing more than a fio IOPS test:
>
> fio --bs=512 --ioengine=libaio --buffered=0 --rw=read --thread
> --iodepth=64 --numjobs=2 --loops=10000 --group_reporting --size=1G
> --gtod_reduce=1 --name=iops-test --filename=/dev/${dev} --invalidate=1

Interesting, I'll have to check if we regressed with all these recent
changes. Comparing your .38 to .39-rc3+, are you using more/less CPU,
more/less sys%, etc?

A quick perf record -fg / perf report -g for both kernels would be nice
to see.

--
Jens Axboe

2011-04-18 21:22:36

by Rafael J. Wysocki

Subject: Re: [Bug #32902] 2.6.39-rc1 doesn't boot on thinkpad t61p x86_64

On Monday, April 18, 2011, Alex Romosan wrote:
> "Rafael J. Wysocki" <[email protected]> writes:
>
> > This message has been generated automatically as a part of a summary report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.38. Please verify if it still should be listed and let the
> > tracking team
> > know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=32902
> > Subject : 2.6.39-rc1 doesn't boot on thinkpad t61p x86_64
> > Submitter : Alex Romosan <[email protected]>
> > Date : 2011-04-03 19:41 (15 days old)
> > Message-ID : <[email protected]>
> > References : http://marc.info/?l=linux-kernel&m=130186054431678&w=2
> >
>
> my laptop works again with 2.6.39-rc3 (except the boot process hangs at
> waiting for /dev to be populated.... if i hit Ctrl-C then the booting
> proceeds normally. this doesn't happen with 2.6.38) so probably this bug
> can be closed.

Thanks, closing.

Rafael

2011-04-19 03:52:31

by David Dillow

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On Mon, 2011-04-18 at 20:21 +0200, Bart Van Assche wrote:
> On Mon, Apr 18, 2011 at 1:44 PM, Jens Axboe <[email protected]> wrote:
> > Bart, can you try and pull:
> >
> > git://git.kernel.dk/linux-2.6-block.git for-linus
> >
> > into Linus' tree and see if that works? This has, among other things,
> > Neil's fixes for MD.
>
> md seems to be stable with the resulting tree, but it looks like there is
> a performance regression in the block layer not related to the md
> issue. If I run a small block IOPS test on a block device created by
> ib_srp (NOOP scheduler) I see about 11% less IOPS than with 2.6.38.3
> (155.000 IOPS with 2.6.38.3 and 140.000 IOPS with 2.6.39-rc3+).

The mapping code for ib_srp changed in 2.6.39-rc1, but it showed
improved IOPS for a similar setup in my testing so I'd be surprised if
it is the culprit. Still, it wouldn't hurt to check. Do you have time to
try the new ib_srp code with 2.6.38.3 to eliminate it from the equation?

Thanks,
Dave

2011-04-19 09:09:46

by Jens Axboe

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On 2011-04-18 20:32, Bart Van Assche wrote:
> On Mon, Apr 18, 2011 at 8:28 PM, Jens Axboe <[email protected]> wrote:
>> On 2011-04-18 20:21, Bart Van Assche wrote:
>>> a performance regression in the block layer not related to the md
>>> issue. If I run a small block IOPS test on a block device created by
>>> ib_srp (NOOP scheduler) I see about 11% less IOPS than with 2.6.38.3
>>> (155.000 IOPS with 2.6.38.3 and 140.000 IOPS with 2.6.39-rc3+).
>>
>> That's not good. What's the test case?
>
> Nothing more than a fio IOPS test:
>
> fio --bs=512 --ioengine=libaio --buffered=0 --rw=read --thread
> --iodepth=64 --numjobs=2 --loops=10000 --group_reporting --size=1G
> --gtod_reduce=1 --name=iops-test --filename=/dev/${dev} --invalidate=1

Bart, can you try the below:

diff --git a/block/blk-core.c b/block/blk-core.c
index 5fa3dd2..9b41da1 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -307,11 +307,7 @@ void __blk_run_queue(struct request_queue *q)
* Only recurse once to avoid overrunning the stack, let the unplug
* handling reinvoke the handler shortly if we already got there.
*/
- if (!queue_flag_test_and_set(QUEUE_FLAG_REENTER, q)) {
- q->request_fn(q);
- queue_flag_clear(QUEUE_FLAG_REENTER, q);
- } else
- queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
+ q->request_fn(q);
}
EXPORT_SYMBOL(__blk_run_queue);


--
Jens Axboe

2011-04-19 11:16:07

by Jens Axboe

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On 2011-04-19 11:09, Jens Axboe wrote:
> On 2011-04-18 20:32, Bart Van Assche wrote:
>> On Mon, Apr 18, 2011 at 8:28 PM, Jens Axboe <[email protected]> wrote:
>>> On 2011-04-18 20:21, Bart Van Assche wrote:
>>>> a performance regression in the block layer not related to the md
>>>> issue. If I run a small block IOPS test on a block device created by
>>>> ib_srp (NOOP scheduler) I see about 11% less IOPS than with 2.6.38.3
>>>> (155.000 IOPS with 2.6.38.3 and 140.000 IOPS with 2.6.39-rc3+).
>>>
>>> That's not good. What's the test case?
>>
>> Nothing more than a fio IOPS test:
>>
>> fio --bs=512 --ioengine=libaio --buffered=0 --rw=read --thread
>> --iodepth=64 --numjobs=2 --loops=10000 --group_reporting --size=1G
>> --gtod_reduce=1 --name=iops-test --filename=/dev/${dev} --invalidate=1
>
> Bart, can you try the below:

Here's a more complete variant. James, let's get rid of this REENTER
crap. It's completely bogus and triggers falsely for a variety of
reasons. The below will work, but there may be room for improvement on
the SCSI side.

diff --git a/block/blk-core.c b/block/blk-core.c
index 5fa3dd2..4e49665 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -303,15 +303,7 @@ void __blk_run_queue(struct request_queue *q)
if (unlikely(blk_queue_stopped(q)))
return;

- /*
- * Only recurse once to avoid overrunning the stack, let the unplug
- * handling reinvoke the handler shortly if we already got there.
- */
- if (!queue_flag_test_and_set(QUEUE_FLAG_REENTER, q)) {
- q->request_fn(q);
- queue_flag_clear(QUEUE_FLAG_REENTER, q);
- } else
- queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
+ q->request_fn(q);
}
EXPORT_SYMBOL(__blk_run_queue);

@@ -328,6 +320,7 @@ void blk_run_queue_async(struct request_queue *q)
if (likely(!blk_queue_stopped(q)))
queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
}
+EXPORT_SYMBOL(blk_run_queue_async);

/**
* blk_run_queue - run a single device queue
diff --git a/block/blk.h b/block/blk.h
index c9df8fc..6126346 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -22,7 +22,6 @@ void blk_rq_timed_out_timer(unsigned long data);
void blk_delete_timer(struct request *);
void blk_add_timer(struct request *);
void __generic_unplug_device(struct request_queue *);
-void blk_run_queue_async(struct request_queue *q);

/*
* Internal atomic flags for request handling
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index ab55c2f..e9901b8 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -411,8 +411,6 @@ static void scsi_run_queue(struct request_queue *q)
list_splice_init(&shost->starved_list, &starved_list);

while (!list_empty(&starved_list)) {
- int flagset;
-
/*
* As long as shost is accepting commands and we have
* starved queues, call blk_run_queue. scsi_request_fn
@@ -435,20 +433,7 @@ static void scsi_run_queue(struct request_queue *q)
continue;
}

- spin_unlock(shost->host_lock);
-
- spin_lock(sdev->request_queue->queue_lock);
- flagset = test_bit(QUEUE_FLAG_REENTER, &q->queue_flags) &&
- !test_bit(QUEUE_FLAG_REENTER,
- &sdev->request_queue->queue_flags);
- if (flagset)
- queue_flag_set(QUEUE_FLAG_REENTER, sdev->request_queue);
- __blk_run_queue(sdev->request_queue);
- if (flagset)
- queue_flag_clear(QUEUE_FLAG_REENTER, sdev->request_queue);
- spin_unlock(sdev->request_queue->queue_lock);
-
- spin_lock(shost->host_lock);
+ blk_run_queue_async(sdev->request_queue);
}
/* put any unprocessed entries back */
list_splice(&starved_list, &shost->starved_list);
diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 28c3350..815069d 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -3816,28 +3816,17 @@ fail_host_msg:
static void
fc_bsg_goose_queue(struct fc_rport *rport)
{
- int flagset;
- unsigned long flags;
-
if (!rport->rqst_q)
return;

+ /*
+ * This get/put dance makes no sense
+ */
get_device(&rport->dev);
-
- spin_lock_irqsave(rport->rqst_q->queue_lock, flags);
- flagset = test_bit(QUEUE_FLAG_REENTER, &rport->rqst_q->queue_flags) &&
- !test_bit(QUEUE_FLAG_REENTER, &rport->rqst_q->queue_flags);
- if (flagset)
- queue_flag_set(QUEUE_FLAG_REENTER, rport->rqst_q);
- __blk_run_queue(rport->rqst_q);
- if (flagset)
- queue_flag_clear(QUEUE_FLAG_REENTER, rport->rqst_q);
- spin_unlock_irqrestore(rport->rqst_q->queue_lock, flags);
-
+ blk_run_queue_async(rport->rqst_q);
put_device(&rport->dev);
}

-
/**
* fc_bsg_rport_dispatch - process rport bsg requests and dispatch to LLDD
* @q: rport request queue
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index cbbfd98..2ad95fa 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -388,20 +388,19 @@ struct request_queue
#define QUEUE_FLAG_SYNCFULL 3 /* read queue has been filled */
#define QUEUE_FLAG_ASYNCFULL 4 /* write queue has been filled */
#define QUEUE_FLAG_DEAD 5 /* queue being torn down */
-#define QUEUE_FLAG_REENTER 6 /* Re-entrancy avoidance */
-#define QUEUE_FLAG_ELVSWITCH 7 /* don't use elevator, just do FIFO */
-#define QUEUE_FLAG_BIDI 8 /* queue supports bidi requests */
-#define QUEUE_FLAG_NOMERGES 9 /* disable merge attempts */
-#define QUEUE_FLAG_SAME_COMP 10 /* force complete on same CPU */
-#define QUEUE_FLAG_FAIL_IO 11 /* fake timeout */
-#define QUEUE_FLAG_STACKABLE 12 /* supports request stacking */
-#define QUEUE_FLAG_NONROT 13 /* non-rotational device (SSD) */
+#define QUEUE_FLAG_ELVSWITCH 6 /* don't use elevator, just do FIFO */
+#define QUEUE_FLAG_BIDI 7 /* queue supports bidi requests */
+#define QUEUE_FLAG_NOMERGES 8 /* disable merge attempts */
+#define QUEUE_FLAG_SAME_COMP 9 /* force complete on same CPU */
+#define QUEUE_FLAG_FAIL_IO 10 /* fake timeout */
+#define QUEUE_FLAG_STACKABLE 11 /* supports request stacking */
+#define QUEUE_FLAG_NONROT 12 /* non-rotational device (SSD) */
#define QUEUE_FLAG_VIRT QUEUE_FLAG_NONROT /* paravirt device */
-#define QUEUE_FLAG_IO_STAT 15 /* do IO stats */
-#define QUEUE_FLAG_DISCARD 16 /* supports DISCARD */
-#define QUEUE_FLAG_NOXMERGES 17 /* No extended merges */
-#define QUEUE_FLAG_ADD_RANDOM 18 /* Contributes to random pool */
-#define QUEUE_FLAG_SECDISCARD 19 /* supports SECDISCARD */
+#define QUEUE_FLAG_IO_STAT 13 /* do IO stats */
+#define QUEUE_FLAG_DISCARD 14 /* supports DISCARD */
+#define QUEUE_FLAG_NOXMERGES 15 /* No extended merges */
+#define QUEUE_FLAG_ADD_RANDOM 16 /* Contributes to random pool */
+#define QUEUE_FLAG_SECDISCARD 17 /* supports SECDISCARD */

#define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) | \
(1 << QUEUE_FLAG_STACKABLE) | \
@@ -699,6 +698,7 @@ extern void blk_sync_queue(struct request_queue *q);
extern void __blk_stop_queue(struct request_queue *q);
extern void __blk_run_queue(struct request_queue *q);
extern void blk_run_queue(struct request_queue *);
+extern void blk_run_queue_async(struct request_queue *q);
extern int blk_rq_map_user(struct request_queue *, struct request *,
struct rq_map_data *, void __user *, unsigned long,
gfp_t);

--
Jens Axboe
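
For readers following the patch, the guard removed from __blk_run_queue() can be modelled in userspace roughly as below. This is only a sketch under stated assumptions: struct model_queue, model_request_fn() and guarded_run_queue() are hypothetical names, not kernel API, and the deferred_runs counter merely stands in for the queue_delayed_work() call in the old code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of a request queue; not kernel API. */
struct model_queue {
	bool reenter;       /* stands in for QUEUE_FLAG_REENTER */
	int pending;        /* requests still waiting */
	int dispatched;     /* requests handled directly by the request_fn */
	int deferred_runs;  /* runs punted to a worker in the old code */
};

void model_request_fn(struct model_queue *q);
void guarded_run_queue(struct model_queue *q);

/*
 * Toy request_fn: handle one request, then try to run the queue again,
 * as a completion path might do while the flag is still set.
 */
void model_request_fn(struct model_queue *q)
{
	if (q->pending == 0)
		return;
	q->pending--;
	q->dispatched++;
	guarded_run_queue(q);
}

/* Shape of the removed guard: run directly once, defer any nested run. */
void guarded_run_queue(struct model_queue *q)
{
	assert(q);
	if (!q->reenter) {
		q->reenter = true;
		model_request_fn(q);
		q->reenter = false;
	} else {
		q->deferred_runs++;
	}
}
```

In this model a queue with three pending requests dispatches one request directly and defers the nested run. A caller that enters with the flag already set, which the removed scsi_run_queue() code could arrange when flagset was true, gets no direct dispatch at all and only a deferred run.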

2011-04-19 16:14:02

by Bart Van Assche

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On Tue, Apr 19, 2011 at 1:16 PM, Jens Axboe <[email protected]> wrote:
> On 2011-04-19 11:09, Jens Axboe wrote:
> > On 2011-04-18 20:32, Bart Van Assche wrote:
> >> On Mon, Apr 18, 2011 at 8:28 PM, Jens Axboe <[email protected]> wrote:
> >>> On 2011-04-18 20:21, Bart Van Assche wrote:
> >>>> a performance regression in the block layer not related to the md
> >>>> issue. If I run a small block IOPS test on a block device created by
> >>>> ib_srp (NOOP scheduler) I see about 11% less IOPS than with 2.6.38.3
> >>>> (155.000 IOPS with 2.6.38.3 and 140.000 IOPS with 2.6.39-rc3+).
> >>>
> >>> That's not good. What's the test case?
> >>
> >> Nothing more than a fio IOPS test:
> >>
> >> fio --bs=512 --ioengine=libaio --buffered=0 --rw=read --thread
> >> --iodepth=64 --numjobs=2 --loops=10000 --group_reporting --size=1G
> >> --gtod_reduce=1 --name=iops-test --filename=/dev/${dev} --invalidate=1
> >
> > Bart, can you try the below:
>
> Here's a more complete variant. James, let's get rid of this REENTER
> crap. It's completely bogus and triggers falsely for a variety of
> reasons. The below will work, but there may be room for improvement on
> the SCSI side.
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 5fa3dd2..4e49665 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -303,15 +303,7 @@ void __blk_run_queue(struct request_queue *q)
>         if (unlikely(blk_queue_stopped(q)))
>                 return;
>
> -       /*
> -        * Only recurse once to avoid overrunning the stack, let the unplug
> -        * handling reinvoke the handler shortly if we already got there.
> -        */
> -       if (!queue_flag_test_and_set(QUEUE_FLAG_REENTER, q)) {
> -               q->request_fn(q);
> -               queue_flag_clear(QUEUE_FLAG_REENTER, q);
> -       } else
> -               queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
> +       q->request_fn(q);
>  }
>  EXPORT_SYMBOL(__blk_run_queue);
>
> @@ -328,6 +320,7 @@ void blk_run_queue_async(struct request_queue *q)
>         if (likely(!blk_queue_stopped(q)))
>                 queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
>  }
> +EXPORT_SYMBOL(blk_run_queue_async);
>
>  /**
>   * blk_run_queue - run a single device queue
> diff --git a/block/blk.h b/block/blk.h
> index c9df8fc..6126346 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -22,7 +22,6 @@ void blk_rq_timed_out_timer(unsigned long data);
>  void blk_delete_timer(struct request *);
>  void blk_add_timer(struct request *);
>  void __generic_unplug_device(struct request_queue *);
> -void blk_run_queue_async(struct request_queue *q);
>
>  /*
>   * Internal atomic flags for request handling
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index ab55c2f..e9901b8 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -411,8 +411,6 @@ static void scsi_run_queue(struct request_queue *q)
>         list_splice_init(&shost->starved_list, &starved_list);
>
>         while (!list_empty(&starved_list)) {
> -               int flagset;
> -
>                 /*
>                  * As long as shost is accepting commands and we have
>                  * starved queues, call blk_run_queue. scsi_request_fn
> @@ -435,20 +433,7 @@ static void scsi_run_queue(struct request_queue *q)
>                         continue;
>                 }
>
> -               spin_unlock(shost->host_lock);
> -
> -               spin_lock(sdev->request_queue->queue_lock);
> -               flagset = test_bit(QUEUE_FLAG_REENTER, &q->queue_flags) &&
> -                               !test_bit(QUEUE_FLAG_REENTER,
> -                                       &sdev->request_queue->queue_flags);
> -               if (flagset)
> -                       queue_flag_set(QUEUE_FLAG_REENTER, sdev->request_queue);
> -               __blk_run_queue(sdev->request_queue);
> -               if (flagset)
> -                       queue_flag_clear(QUEUE_FLAG_REENTER, sdev->request_queue);
> -               spin_unlock(sdev->request_queue->queue_lock);
> -
> -               spin_lock(shost->host_lock);
> +               blk_run_queue_async(sdev->request_queue);
>         }
>         /* put any unprocessed entries back */
>         list_splice(&starved_list, &shost->starved_list);
> diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
> index 28c3350..815069d 100644
> --- a/drivers/scsi/scsi_transport_fc.c
> +++ b/drivers/scsi/scsi_transport_fc.c
> @@ -3816,28 +3816,17 @@ fail_host_msg:
>  static void
>  fc_bsg_goose_queue(struct fc_rport *rport)
>  {
> -       int flagset;
> -       unsigned long flags;
> -
>         if (!rport->rqst_q)
>                 return;
>
> +       /*
> +        * This get/put dance makes no sense
> +        */
>         get_device(&rport->dev);
> -
> -       spin_lock_irqsave(rport->rqst_q->queue_lock, flags);
> -       flagset = test_bit(QUEUE_FLAG_REENTER, &rport->rqst_q->queue_flags) &&
> -                 !test_bit(QUEUE_FLAG_REENTER, &rport->rqst_q->queue_flags);
> -       if (flagset)
> -               queue_flag_set(QUEUE_FLAG_REENTER, rport->rqst_q);
> -       __blk_run_queue(rport->rqst_q);
> -       if (flagset)
> -               queue_flag_clear(QUEUE_FLAG_REENTER, rport->rqst_q);
> -       spin_unlock_irqrestore(rport->rqst_q->queue_lock, flags);
> -
> +       blk_run_queue_async(rport->rqst_q);
>         put_device(&rport->dev);
>  }
>
> -
>  /**
>   * fc_bsg_rport_dispatch - process rport bsg requests and dispatch to LLDD
>   * @q: rport request queue
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index cbbfd98..2ad95fa 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -388,20 +388,19 @@ struct request_queue
>  #define QUEUE_FLAG_SYNCFULL     3       /* read queue has been filled */
>  #define QUEUE_FLAG_ASYNCFULL    4       /* write queue has been filled */
>  #define QUEUE_FLAG_DEAD         5       /* queue being torn down */
> -#define QUEUE_FLAG_REENTER      6       /* Re-entrancy avoidance */
> -#define QUEUE_FLAG_ELVSWITCH    7       /* don't use elevator, just do FIFO */
> -#define QUEUE_FLAG_BIDI         8       /* queue supports bidi requests */
> -#define QUEUE_FLAG_NOMERGES     9       /* disable merge attempts */
> -#define QUEUE_FLAG_SAME_COMP    10      /* force complete on same CPU */
> -#define QUEUE_FLAG_FAIL_IO      11      /* fake timeout */
> -#define QUEUE_FLAG_STACKABLE    12      /* supports request stacking */
> -#define QUEUE_FLAG_NONROT       13      /* non-rotational device (SSD) */
> +#define QUEUE_FLAG_ELVSWITCH    6       /* don't use elevator, just do FIFO */
> +#define QUEUE_FLAG_BIDI         7       /* queue supports bidi requests */
> +#define QUEUE_FLAG_NOMERGES     8       /* disable merge attempts */
> +#define QUEUE_FLAG_SAME_COMP    9       /* force complete on same CPU */
> +#define QUEUE_FLAG_FAIL_IO      10      /* fake timeout */
> +#define QUEUE_FLAG_STACKABLE    11      /* supports request stacking */
> +#define QUEUE_FLAG_NONROT       12      /* non-rotational device (SSD) */
>  #define QUEUE_FLAG_VIRT         QUEUE_FLAG_NONROT /* paravirt device */
> -#define QUEUE_FLAG_IO_STAT      15      /* do IO stats */
> -#define QUEUE_FLAG_DISCARD      16      /* supports DISCARD */
> -#define QUEUE_FLAG_NOXMERGES    17      /* No extended merges */
> -#define QUEUE_FLAG_ADD_RANDOM   18      /* Contributes to random pool */
> -#define QUEUE_FLAG_SECDISCARD   19      /* supports SECDISCARD */
> +#define QUEUE_FLAG_IO_STAT      13      /* do IO stats */
> +#define QUEUE_FLAG_DISCARD      14      /* supports DISCARD */
> +#define QUEUE_FLAG_NOXMERGES    15      /* No extended merges */
> +#define QUEUE_FLAG_ADD_RANDOM   16      /* Contributes to random pool */
> +#define QUEUE_FLAG_SECDISCARD   17      /* supports SECDISCARD */
>
>  #define QUEUE_FLAG_DEFAULT      ((1 << QUEUE_FLAG_IO_STAT) |            \
>                                   (1 << QUEUE_FLAG_STACKABLE)    |       \
> @@ -699,6 +698,7 @@ extern void blk_sync_queue(struct request_queue *q);
>  extern void __blk_stop_queue(struct request_queue *q);
>  extern void __blk_run_queue(struct request_queue *q);
>  extern void blk_run_queue(struct request_queue *);
> +extern void blk_run_queue_async(struct request_queue *q);
>  extern int blk_rq_map_user(struct request_queue *, struct request *,
>                            struct rq_map_data *, void __user *, unsigned long,
>                            gfp_t);

Hello Jens,

The same test with an initiator running 2.6.39-rc4 +
git://git.kernel.dk/linux-2.6-block.git for-linus + the above patch
yields about 155.000 IOPS on my test setup, or the same performance as
with 2.6.38.3. I'm running the above patch through an I/O stress test
now.

Bart.

2011-04-19 16:33:31

by Linus Torvalds

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On Tue, Apr 19, 2011 at 9:13 AM, Bart Van Assche <[email protected]> wrote:
>
> The same test with an initiator running 2.6.39-rc4 +
> git://git.kernel.dk/linux-2.6-block.git for-linus + the above patch
> yields about 155.000 IOPS on my test setup, or the same performance as
> with 2.6.38.3. I'm running the above patch through an I/O stress test
> now.

Goodie. So not only does that patch get back the 11%, it removes the
crazy QUEUE_FLAG_REENTER flag that was broken to begin with. AND it
removes a number of complicated lines.

Hallelujah.

Linus

2011-04-19 16:39:55

by Bart Van Assche

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On Tue, Apr 19, 2011 at 5:32 AM, David Dillow <[email protected]> wrote:
>
> On Mon, 2011-04-18 at 20:21 +0200, Bart Van Assche wrote:
> > On Mon, Apr 18, 2011 at 1:44 PM, Jens Axboe <[email protected]> wrote:
> > > Bart, can you try and pull:
> > >
> > > git://git.kernel.dk/linux-2.6-block.git for-linus
> > >
> > > into Linus' tree and see if that works? This has, among other things,
> > > > Neil's fixes for MD.
> >
> > > md seems to be stable with the resulting tree, but it looks like there is
> > a performance regression in the block layer not related to the md
> > issue. If I run a small block IOPS test on a block device created by
> > ib_srp (NOOP scheduler) I see about 11% less IOPS than with 2.6.38.3
> > (155.000 IOPS with 2.6.38.3 and 140.000 IOPS with 2.6.39-rc3+).
>
> The mapping code for ib_srp changed in 2.6.39-rc1, but it showed
> improved IOPS for a similar setup in my testing so I'd be surprised if
> it is the culprit. Still, it wouldn't hurt to check. Do you have time to
> try the new ib_srp code with 2.6.38.3 to eliminate it from the equation?

Hello Dave,

I just ran a test with the most important 2.6.39-specific ib_srp
commits reverted but that didn't yield a measurable performance
difference for this specific test:

$ git show --format=format:%s 7f9e5c48c1078507747434d4c182ab10925bf98a
be8b981453a4904399cb090c1660618e250092d8
c07d424d6118d528ef71b22b7424bfc359c307a5
8f26c9ff9cd0317ad867bce972f69e0c6c2cbe3c
961e0be89a5120a1409ebc525cca6f603615a8a8
8c4037b501acd2ec3abc7925e66af8af40a2da9d | grep '^IB'
IB: Increase DMA max_segment_size on Mellanox hardware
IB/srp: try to use larger FMR sizes to cover our mappings
IB/srp: add support for indirect tables that don't fit in SRP_CMD
IB/srp: rework mapping engine to use multiple FMR entries
IB/srp: move IB CM setup completion into its own function
IB/srp: always avoid non-zero offsets into an FMR

Bart.

2011-04-19 16:48:24

by Christoph Hellwig

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

> + blk_run_queue_async(sdev->request_queue);

This doesn't even have to be async except when scsi drivers call
cmd->scsi_done directly. It seems like if this always went through the
softirq (or kblockd) we could still run it in context for the others.

> + /*
> + * This get/put dance makes no sense
> + */
> get_device(&rport->dev);
> -
> - spin_lock_irqsave(rport->rqst_q->queue_lock, flags);
> - flagset = test_bit(QUEUE_FLAG_REENTER, &rport->rqst_q->queue_flags) &&
> - !test_bit(QUEUE_FLAG_REENTER, &rport->rqst_q->queue_flags);
> - if (flagset)
> - queue_flag_set(QUEUE_FLAG_REENTER, rport->rqst_q);
> - __blk_run_queue(rport->rqst_q);
> - if (flagset)
> - queue_flag_clear(QUEUE_FLAG_REENTER, rport->rqst_q);
> - spin_unlock_irqrestore(rport->rqst_q->queue_lock, flags);
> -
> + blk_run_queue_async(rport->rqst_q);

And the QUEUE_FLAG_REENTER mess here never made sense either as it
tested for a bit being set and not set at the same time. So this one
actually should be able to be replaced by a plain blk_run_queue.
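
The point about the contradictory test can be checked with a small userspace sketch. The helper names below (model_test_bit(), fc_style_flagset(), scsi_style_flagset()) are hypothetical and only mirror the shape of the removed conditions; the bit number 6 is taken from the removed QUEUE_FLAG_REENTER define.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_FLAG_REENTER 6	/* bit number of the removed QUEUE_FLAG_REENTER */

bool model_test_bit(int nr, unsigned long flags);
bool fc_style_flagset(unsigned long queue_flags);
bool scsi_style_flagset(unsigned long q_flags, unsigned long sdev_q_flags);

/* Hypothetical stand-in for the kernel's test_bit() on a plain word. */
bool model_test_bit(int nr, unsigned long flags)
{
	assert(nr >= 0 && nr < 32);
	return (flags >> nr) & 1UL;
}

/*
 * Shape of the removed fc_bsg_goose_queue() condition: the same bit of
 * the same flags word is required to be both set and clear.
 */
bool fc_style_flagset(unsigned long queue_flags)
{
	return model_test_bit(MODEL_FLAG_REENTER, queue_flags) &&
	       !model_test_bit(MODEL_FLAG_REENTER, queue_flags);
}

/*
 * Shape of the removed scsi_run_queue() condition: the bit is tested on
 * two different queues, so this one can actually be true.
 */
bool scsi_style_flagset(unsigned long q_flags, unsigned long sdev_q_flags)
{
	return model_test_bit(MODEL_FLAG_REENTER, q_flags) &&
	       !model_test_bit(MODEL_FLAG_REENTER, sdev_q_flags);
}
```

Whatever flags word is passed in, fc_style_flagset() returns false, so in the FC transport code neither queue_flag_set() nor queue_flag_clear() could ever have run; only the two-queue form from scsi_run_queue() can evaluate to true.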

2011-04-19 17:06:23

by Jens Axboe

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On 2011-04-19 18:48, Christoph Hellwig wrote:
>> + blk_run_queue_async(sdev->request_queue);
>
> This doesn't even have to be async except when scsi drivers call
> cmd->scsi_done directly. It seems like if this always went through the
> softirq (or kblockd) we could still run it in context for the others.

Exactly. I'll pass an 'optimize' patch past James.

>> + /*
>> + * This get/put dance makes no sense
>> + */
>> get_device(&rport->dev);
>> -
>> - spin_lock_irqsave(rport->rqst_q->queue_lock, flags);
>> - flagset = test_bit(QUEUE_FLAG_REENTER, &rport->rqst_q->queue_flags) &&
>> - !test_bit(QUEUE_FLAG_REENTER, &rport->rqst_q->queue_flags);
>> - if (flagset)
>> - queue_flag_set(QUEUE_FLAG_REENTER, rport->rqst_q);
>> - __blk_run_queue(rport->rqst_q);
>> - if (flagset)
>> - queue_flag_clear(QUEUE_FLAG_REENTER, rport->rqst_q);
>> - spin_unlock_irqrestore(rport->rqst_q->queue_lock, flags);
>> -
>> + blk_run_queue_async(rport->rqst_q);
>
> And the QUEUE_FLAG_REENTER mess here never made sense either as it
> tested for a bit being set and not set at the same time. So this one
> actually should be able to be replaced by a plain blk_run_queue.

Yep, it's completely broken as-is.

--
Jens Axboe

2011-04-19 17:43:19

by Jens Axboe

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On 2011-04-19 18:13, Bart Van Assche wrote:
> The same test with an initiator running 2.6.39-rc4 +
> git://git.kernel.dk/linux-2.6-block.git for-linus + the above patch
> yields about 155.000 IOPS on my test setup, or the same performance as
> with 2.6.38.3. I'm running the above patch through an I/O stress test
> now.

OK, so parity, that's good. With the above patch, I can take a single
device from ~400K IOPS on 2.6.38 to ~440K IOPS on 2.6.39-rc4+patches.

--
Jens Axboe
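
The percentages quoted in this thread depend on which kernel is taken as the baseline. A small sketch of the arithmetic, using only the IOPS figures reported in the thread (percent_change() is a hypothetical helper written for illustration):

```c
#include <assert.h>

double percent_change(double from, double to);

/* Relative change from one measurement to another, in percent. */
double percent_change(double from, double to)
{
	assert(from != 0.0);
	return (to - from) / from * 100.0;
}
```

With Bart's numbers, percent_change(155000, 140000) is roughly -9.7, while percent_change(140000, 155000) is roughly +10.7, which is closest to the "about 11%" quoted in the thread. Jens's ~400K to ~440K IOPS works out to about +10 on the same scale.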

2011-04-19 17:43:39

by Jens Axboe

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On 2011-04-19 18:32, Linus Torvalds wrote:
> On Tue, Apr 19, 2011 at 9:13 AM, Bart Van Assche <[email protected]> wrote:
>>
>> The same test with an initiator running 2.6.39-rc4 +
>> git://git.kernel.dk/linux-2.6-block.git for-linus + the above patch
>> yields about 155.000 IOPS on my test setup, or the same performance as
>> with 2.6.38.3. I'm running the above patch through an I/O stress test
>> now.
>
> Goodie. So not only does that patch get back the 11%, it removes the
> crazy QUEUE_FLAG_REENTER flag that was broken to begin with. AND it
> removes a number of complicated lines.
>
> Hallelujah.

Indeed, coming your way soonish.

--
Jens Axboe

2011-04-21 00:38:23

by David Dillow

Subject: Re: [Bug #32982] Kernel locks up a few minutes after boot

On 4/19/2011 12:39 PM, Bart Van Assche wrote:
> On Tue, Apr 19, 2011 at 5:32 AM, David Dillow <[email protected]> wrote:
>> The mapping code for ib_srp changed in 2.6.39-rc1, but it showed
>> improved IOPS for a similar setup in my testing so I'd be surprised if
>> it is the culprit. Still, it wouldn't hurt to check. Do you have time to
>> try the new ib_srp code with 2.6.38.3 to eliminate it from the equation?
> Hello Dave,
>
> I just ran a test with the most important 2.6.39-specific ib_srp
> commits reverted but that didn't yield a measurable performance
> difference for this specific test:

Thanks for giving it a whirl,
Dave