2008-11-09 19:49:48

by Rafael J. Wysocki

Subject: 2.6.28-rc3-git6: Reported regressions from 2.6.27

This message lists regressions from 2.6.27 for which, as far as I know, no
fixes are in the mainline yet. If any of them have been fixed already, please
let me know.

If you know of any other unresolved regressions from 2.6.27, please let me know
as well and I'll add them to the list. Also, please let me know if any of the
entries below are invalid.

Each entry on the list will also be sent as an automatic reply to this
message, with CCs to the people involved in reporting and handling the
issue.


Listed regressions statistics:

  Date          Total  Pending  Unresolved
  ----------------------------------------
  2008-11-09       73       40          27
  2008-11-02       55       41          29
  2008-10-25       26       25          20


Unresolved regressions
----------------------

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11996
Subject : Tracing framework regression in 2.6.28-rc3
Submitter : Pekka Paalanen <[email protected]>
Date : 2008-11-09 10:13 (1 days old)
References : http://marc.info/?l=linux-kernel&m=122624392229317&w=4
Handled-By : Steven Rostedt <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11994
Subject : Computer doesn't power down after commit ACPI: EC: do transaction from interrupt context
Submitter : François Valenduc <[email protected]>
Date : 2008-11-09 02:02 (1 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5ceb40417bca2045350e77f740e0c4c94875fff2
Handled-By : ykzhao <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11989
Subject : Suspend failure on NForce4-based boards due to changes in stop_machine
Submitter : Rafael J. Wysocki <[email protected]>
Date : 2008-11-03 0:28 (7 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c9583e55fa2b08a230c549bd1e3c0bde6c50d9cc
References : http://marc.info/?l=linux-kernel&m=122567187604356&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11987
Subject : Bootup time regression from 2.6.27 to 2.6.28-rc3+
Submitter : Lukas Hejtmanek <[email protected]>
Date : 2008-11-04 17:33 (6 days old)
References : http://marc.info/?l=linux-kernel&m=122582006601658&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11986
Subject : 2.6.28-rc2-git1: spitz still won't boot
Submitter : Pavel Machek <[email protected]>
Date : 2008-11-05 14:23 (5 days old)
References : http://marc.info/?l=linux-kernel&m=122589528016337&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11984
Subject : regression when switching TTY->X, input related?
Submitter : Bernhard Schmidt <[email protected]>
Date : 2008-11-05 22:04 (5 days old)
References : http://marc.info/?l=linux-kernel&m=122592278403853&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11970
Subject : gettimeofday return a old time in mmbench
Submitter : alexs <[email protected]>
Date : 2008-11-06 23:57 (4 days old)
Handled-By : Ingo Molnar <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11965
Subject : regression introduced by - timers: fix itimer/many thread hang
Submitter : Doug Chapman <[email protected]>
Date : 2008-11-06 11:03 (4 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f06febc96ba8e0af80bcc3eaec0a109e88275fac
References : http://marc.info/?l=linux-kernel&m=122596943416648&w=4
Handled-By : Frank Mayhar <[email protected]>
             Peter Zijlstra <[email protected]>
             Ingo Molnar <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11958
Subject : [2.6.27.x => 2.6.28-rc3] Xorg crash with xf86MapVidMem error
Submitter : Tomasz Chmielewski <[email protected]>
Date : 2008-11-05 05:37 (5 days old)


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11947
Subject : 2.6.28-rc VC switching with Intel graphics broken
Submitter : Romano Giannetti <[email protected]>
Date : 2008-11-03 12:10 (7 days old)
Handled-By : Jesse Barnes <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11928
Subject : ath5k gets lost with eeepc-laptop removal
Submitter : Luiz Fernando N. Capitulino <[email protected]>
Date : 2008-10-31 13:05 (10 days old)
References : http://marc.info/?l=linux-kernel&m=122545827204957&w=4
Handled-By : Nick Kossifidis <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11913
Subject : USB/INPUT: slab error in cache_alloc_debugcheck_after(): double free?
Submitter : Helge Deller <[email protected]>
Date : 2008-10-30 23:11 (11 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=cb8f488c33539f096580e202f5438a809195008f
References : http://marc.info/?l=linux-kernel&m=122540833301394&w=4
Handled-By : Jiri Kosina <[email protected]>
             Jiri Slaby <[email protected]>
             Jiri Kosina <[email protected]>
             Jiri Slaby <[email protected]>
             Denys Vlasenko <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11908
Subject : linux-2.6.28-rc2 regression : oprofile doesnt work anymore
Submitter : Eric Dumazet <[email protected]>
Date : 2008-10-30 18:01 (11 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c493756e2a8a78bcaae30668317890dcfe86e7c3
References : http://marc.info/?l=linux-kernel&m=122539004100532&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11906
Subject : 2.6.28-rc2 seems to fail at powering down the monitor when it should
Submitter : Gene Heskett <[email protected]>
Date : 2008-10-30 6:39 (11 days old)
References : http://marc.info/?l=linux-kernel&m=122534879721424&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11905
Subject : lots of extra timer interrupts costing 2W
Submitter : Theodore Ts'o <[email protected]>
Date : 2008-10-30 2:18 (11 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fb02fbc14d17837b4b7b02dbb36142c16a7bf208
References : http://marc.info/?l=linux-kernel&m=122533314305315&w=4
             http://marc.info/?l=linux-kernel&m=122541849114444&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11899
Subject : sometime boot failed on T61 laptop
Submitter : alexs <[email protected]>
Date : 2008-10-30 02:04 (11 days old)
Handled-By : Tejun Heo <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11896
Subject : [2.6.28-rc2] EeePC ACPI errors & exceptions
Submitter : Darren Salt <[email protected]>
Date : 2008-10-27 22:52 (14 days old)
References : http://marc.info/?l=linux-kernel&m=122514911328761&w=4
Handled-By : Alexey Starikovskiy <[email protected]>
             Zhao Yakui <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11891
Subject : resume from disk broken on hp/compaq nx7000 (DRM problem)
Submitter : Markus Meier <[email protected]>
Date : 2008-10-29 14:42 (12 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0a3e67a4caac273a3bfc4ced3da364830b1ab241
Handled-By : Jesse Barnes <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11875
Subject : radeonfb lockup in .28-rc (bisected)
Submitter : James Cloos <[email protected]>
Date : 2008-10-28 0:00 (13 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b1ee26bab14886350ba12a5c10cbc0696ac679bf
References : http://marc.info/?l=linux-kernel&m=122515210200530&w=4
Handled-By : Benjamin Herrenschmidt <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11873
Subject : unable to mount ext3 root filesystem due to htree_dirblock_to_tree
Submitter : [email protected]
Date : 2008-10-28 05:09 (13 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4c46501d1659475dc6c89554af6ce7fe6ecf615c
Handled-By : Tejun Heo <[email protected]>
             Neil Brown <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11858
Subject : Timeout regression introduced by 242f9dcb8ba6f68fcd217a119a7648a4f69290e9
Submitter : Tejun Heo <[email protected]>
Date : 2008-10-26 9:46 (15 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=242f9dcb8ba6f68fcd217a119a7648a4f69290e9
References : http://marc.info/?l=linux-kernel&m=122501447326698&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11849
Subject : default IRQ affinity change in v2.6.27 (breaking several SMP PPC based systems)
Submitter : Kumar Gala <[email protected]>
Date : 2008-10-24 12:45 (17 days old)
References : http://marc.info/?l=linux-kernel&m=122485245924125&w=4
Handled-By : Chris Snook <[email protected]>
             Scott Wood <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11834
Subject : iwl3945: if I leave my machine running overnight, wifi will not work in the morning
Submitter : Pavel Machek <[email protected]>
Date : 2008-10-19 21:40 (22 days old)
References : http://marc.info/?l=linux-kernel&m=122445440206101&w=4
Handled-By : reinette chatre <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11828
Subject : Linux 2.6.27-git3: no SD card reader
Submitter : J.A. Magallón <[email protected]>
Date : 2008-10-14 0:54 (27 days old)
References : http://marc.info/?l=linux-kernel&m=122394573904699&w=4
Handled-By : Pierre Ossman <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11826
Subject : extreme slowness of IO stuff using 2.6.28-rc1
Submitter : Yves-Alexis Perez <[email protected]>
Date : 2008-10-25 04:25 (16 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=dc4304f7deee29fcdf6a2b62f7146ea7f505fd42
References : http://marc.info/?l=linux-kernel&m=122521238402963&w=4
Handled-By : Arjan van de Ven <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11822
Subject : ACPI Warning (nspredef-0858): _SB_.PCI0.LPC_.EC__.BAT0._BIF: Return Package type mismatch at index 9 - found Buffer, expected String [20080926]
Submitter : Len Brown <[email protected]>
Date : 2008-10-25 01:26 (16 days old)
Handled-By : Robert Moore <[email protected]>


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11799
Subject : xorg can not start up with stolen memory
Submitter : arrow zhang <[email protected]>
Date : 2008-10-21 06:08 (20 days old)


Regressions with patches
------------------------

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11988
Subject : Eliminate recursive mutex in compat fb ioctl path
Submitter : Keith Packard <[email protected]>
Date : 2008-11-03 7:06 (7 days old)
References : http://marc.info/?l=linux-kernel&m=122569604828448&w=4
Handled-By : Keith Packard <[email protected]>
             Geert Uytterhoeven <[email protected]>
Patch : http://marc.info/?l=linux-kernel&m=122569604828448&w=4
        http://lkml.org/lkml/2008/10/31/162


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11985
Subject : 2.6.28-rc3 truncates nfsd results
Submitter : Doug Nazar <[email protected]>
Date : 2008-11-04 18:27 (6 days old)
References : http://marc.info/?l=linux-kernel&m=122582366509153&w=4
Handled-By : Doug Nazar <[email protected]>
             J. Bruce Fields <[email protected]>
Patch : http://marc.info/?l=linux-kernel&m=122592648119790&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11982
Subject : Fan level 7 after resume with 2.6.28-rc3
Submitter : Tino Keitel <[email protected]>
Date : 2008-11-05 7:33 (5 days old)
References : http://marc.info/?l=linux-kernel&m=122587043409186&w=4
Handled-By : Henrique de Moraes Holschuh <[email protected]>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=18744&action=view


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11942
Subject : AMD64 reboot regression
Submitter : Michael B. Trausch <[email protected]>
Date : 2008-11-02 20:30 (8 days old)
References : http://marc.info/?l=linux-kernel&m=122565790519736&w=4
Handled-By : Len Brown <[email protected]>
Patch : http://bugzilla.kernel.org/show_bug.cgi?id=11942#c11


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11937
Subject : ext3 __log_wait_for_space: no transactions
Submitter : Meelis Roos <[email protected]>
Date : 2008-10-30 9:49 (11 days old)
References : http://marc.info/?l=linux-kernel&m=122536026105643&w=4
Handled-By : Theodore Tso <[email protected]>
Patch : http://lkml.org/lkml/2008/11/1/61


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11925
Subject : cdrom: missing compat ioctls
Submitter : Andreas Schwab <[email protected]>
Date : 2008-10-31 14:02 (10 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=33c2dca4957bd0da3e1af7b96d0758d97e708ef6
Handled-By : Andreas Schwab <[email protected]>
Patch : http://marc.info/?l=linux-kernel&m=122548923531545&w=2


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11917
Subject : Asus Eee PC hotkeys stop working after prolonged usage
Submitter : Alan Jenkins <[email protected]>
Date : 2008-10-31 03:21 (10 days old)
Handled-By : Alexey Starikovskiy <[email protected]>
Patch : http://marc.info/?l=linux-acpi&m=122603281422097&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11911
Subject : new PCMCIA device instance after resume - orinoco can't download firmware
Submitter : Andrey Borzenkov <[email protected]>
Date : 2008-10-28 19:19 (13 days old)
References : http://marc.info/?l=linux-wireless&m=122522165719760&w=4
Handled-By : Dave <[email protected]>
Patch : http://marc.info/?l=linux-wireless&m=122539058601588&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11903
Subject : regression: vmalloc easily fail
Submitter : Glauber Costa <[email protected]>
Date : 2008-10-28 20:59 (13 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=db64fe02258f1507e13fe5212a989922323685ce
References : http://marc.info/?l=linux-kernel&m=122522755530998&w=4
Handled-By : Glauber Costa <[email protected]>
             Nick Piggin <[email protected]>
             Glauber Costa <[email protected]>
Patch : http://marc.info/?l=linux-kernel&m=122609055221549&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11898
Subject : mke2fs hang on AIC79 device.
Submitter : alexs <[email protected]>
Date : 2008-10-30 01:17 (11 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f0c0a376d0fcd4c5579ecf5e95f88387cba85211
Handled-By : James Bottomley <[email protected]>
Patch : http://bugzilla.kernel.org/show_bug.cgi?id=11898#c28


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11895
Subject : 2.6.28-rc2 regression: keyboard dead after reboot on Toshiba Portege 4000
Submitter : Andrey Borzenkov <[email protected]>
Date : 2008-10-28 19:05 (13 days old)
References : http://marc.info/?l=linux-acpi&m=122522085418555&w=4
Handled-By : Andrey Borzenkov <[email protected]>
Patch : http://marc.info/?l=linux-kernel&m=122547719810921&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11841
Subject : plenty of line "ACPI: EC: non-query interrupt received, switching to interrupt mode" in dmesg and system not powering down
Submitter : François Valenduc <[email protected]>
Date : 2008-10-25 10:29 (16 days old)
Handled-By : Alan Jenkins <[email protected]>
Patch : http://marc.info/?l=linux-acpi&m=122603281922125&w=4


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11806
Subject : iwl3945 fails with microcode error
Submitter : Johannes Berg <[email protected]>
Date : 2008-10-22 02:36 (19 days old)
References : http://marc.info/?l=linux-kernel&m=122450235730661&w=4
Handled-By : Reinette Chatre <[email protected]>
Patch : http://marc.info/?l=linux-wireless&m=122583010822172&w=2


For details, please visit the bug entries and follow the links given in
references.

As you can see, there is a Bugzilla entry for each of the listed regressions.
There also is a Bugzilla entry used for tracking the regressions from 2.6.27,
unresolved as well as resolved, at:

http://bugzilla.kernel.org/show_bug.cgi?id=11808

Please let me know if there are any Bugzilla entries that should be added to
the list in there.

Thanks,
Rafael
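
The entries in the report above follow a regular "Key : value" layout, with
continuation lines for additional Handled-By names or links. As a rough
illustration only (this is not the tool used to produce the report), they
could be parsed along the following lines. The names parse_entries, bug_id,
KNOWN_KEYS and SAMPLE are assumptions made for this sketch, and the sample
text is copied from two entries in the list.

```python
import re

# Field names as they appear in the entries of this report.
KNOWN_KEYS = {
    "Bug-Entry", "Subject", "Submitter", "Date",
    "First-Bad-Commit", "References", "Handled-By", "Patch",
}

# Matches "Key : value" or "Key: value". Requiring whitespace after the
# colon keeps bare URLs ("http://...") from being mistaken for fields.
FIELD_RE = re.compile(r"^([A-Za-z-]+)\s*:\s+(.*)$")


def parse_entries(text):
    """Return one dict per Bug-Entry record; each value is a list of strings."""
    entries = []
    current = None    # record being filled in, if any
    last_key = None   # field that continuation lines belong to
    for raw in text.splitlines():
        line = raw.strip()
        match = FIELD_RE.match(line)
        if match and match.group(1) in KNOWN_KEYS:
            key, value = match.group(1), match.group(2).strip()
            if key == "Bug-Entry":
                # A Bug-Entry line always starts a new record.
                current = {}
                entries.append(current)
            if current is not None:
                current.setdefault(key, []).append(value)
                last_key = key
        elif line and current is not None and last_key is not None:
            # Non-field line inside a record: extra name or link.
            current[last_key].append(line)
        else:
            # Blank line (or text outside any record) ends the record.
            current, last_key = None, None
    return entries


def bug_id(entry):
    """Extract the numeric Bugzilla id from the Bug-Entry URL, if present."""
    found = re.search(r"[?&]id=(\d+)", entry["Bug-Entry"][0])
    return int(found.group(1)) if found else None


SAMPLE = """\
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11849
Subject : default IRQ affinity change in v2.6.27 (breaking several SMP PPC based systems)
Handled-By : Chris Snook <[email protected]>
Scott Wood <[email protected]>

Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11799
Subject : xorg can not start up with stolen memory
"""

entries = parse_entries(SAMPLE)
for entry in entries:
    print(bug_id(entry), entry["Subject"][0])
```

Keeping every field as a list means that entries with several handlers,
references or patch links need no special casing. Counting the parsed
records per section is one way to cross-check the statistics table at the
top of the report.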


2008-11-09 19:50:05

by Rafael J. Wysocki

Subject: [Bug #11799] xorg can not start up with stolen memory

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11799
Subject : xorg can not start up with stolen memory
Submitter : arrow zhang <[email protected]>
Date : 2008-10-21 06:08 (20 days old)

2008-11-09 19:51:17

by Rafael J. Wysocki

Subject: [Bug #11806] iwl3945 fails with microcode error

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11806
Subject : iwl3945 fails with microcode error
Submitter : Johannes Berg <[email protected]>
Date : 2008-10-22 02:36 (19 days old)
References : http://marc.info/?l=linux-kernel&m=122450235730661&w=4
Handled-By : Reinette Chatre <[email protected]>
Patch : http://marc.info/?l=linux-wireless&m=122583010822172&w=2

2008-11-09 19:55:32

by Rafael J. Wysocki

Subject: [Bug #11822] ACPI Warning (nspredef-0858): _SB_.PCI0.LPC_.EC__.BAT0._BIF: Return Package type mismatch at index 9 - found Buffer, expected String [20080926]

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11822
Subject : ACPI Warning (nspredef-0858): _SB_.PCI0.LPC_.EC__.BAT0._BIF: Return Package type mismatch at index 9 - found Buffer, expected String [20080926]
Submitter : Len Brown <[email protected]>
Date : 2008-10-25 01:26 (16 days old)
Handled-By : Robert Moore <[email protected]>

2008-11-09 19:55:49

by Rafael J. Wysocki

Subject: [Bug #11826] extreme slowness of IO stuff using 2.6.28-rc1

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11826
Subject : extreme slowness of IO stuff using 2.6.28-rc1
Submitter : Yves-Alexis Perez <[email protected]>
Date : 2008-10-25 04:25 (16 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=dc4304f7deee29fcdf6a2b62f7146ea7f505fd42
References : http://marc.info/?l=linux-kernel&m=122521238402963&w=4
Handled-By : Arjan van de Ven <[email protected]>

2008-11-09 19:56:16

by Rafael J. Wysocki

Subject: [Bug #11834] iwl3945: if I leave my machine running overnight, wifi will not work in the morning

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11834
Subject : iwl3945: if I leave my machine running overnight, wifi will not work in the morning
Submitter : Pavel Machek <[email protected]>
Date : 2008-10-19 21:40 (22 days old)
References : http://marc.info/?l=linux-kernel&m=122445440206101&w=4
Handled-By : reinette chatre <[email protected]>

2008-11-09 19:56:37

by Rafael J. Wysocki

Subject: [Bug #11841] plenty of line "ACPI: EC: non-query interrupt received, switching to interrupt mode" in dmesg and system not powering down

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11841
Subject : plenty of line "ACPI: EC: non-query interrupt received, switching to interrupt mode" in dmesg and system not powering down
Submitter : François Valenduc <[email protected]>
Date : 2008-10-25 10:29 (16 days old)
Handled-By : Alan Jenkins <[email protected]>
Patch : http://marc.info/?l=linux-acpi&m=122603281922125&w=4

2008-11-09 19:56:59

by Rafael J. Wysocki

Subject: [Bug #11849] default IRQ affinity change in v2.6.27 (breaking several SMP PPC based systems)

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11849
Subject : default IRQ affinity change in v2.6.27 (breaking several SMP PPC based systems)
Submitter : Kumar Gala <[email protected]>
Date : 2008-10-24 12:45 (17 days old)
References : http://marc.info/?l=linux-kernel&m=122485245924125&w=4
Handled-By : Chris Snook <[email protected]>
             Scott Wood <[email protected]>

2008-11-09 19:57:30

by Rafael J. Wysocki

Subject: [Bug #11858] Timeout regression introduced by 242f9dcb8ba6f68fcd217a119a7648a4f69290e9

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11858
Subject : Timeout regression introduced by 242f9dcb8ba6f68fcd217a119a7648a4f69290e9
Submitter : Tejun Heo <[email protected]>
Date : 2008-10-26 9:46 (15 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=242f9dcb8ba6f68fcd217a119a7648a4f69290e9
References : http://marc.info/?l=linux-kernel&m=122501447326698&w=4

2008-11-09 19:57:49

by Rafael J. Wysocki

Subject: [Bug #11891] resume from disk broken on hp/compaq nx7000 (DRM problem)

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11891
Subject : resume from disk broken on hp/compaq nx7000 (DRM problem)
Submitter : Markus Meier <[email protected]>
Date : 2008-10-29 14:42 (12 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0a3e67a4caac273a3bfc4ced3da364830b1ab241
Handled-By : Jesse Barnes <[email protected]>

2008-11-09 19:58:08

by Rafael J. Wysocki

Subject: [Bug #11873] unable to mount ext3 root filesystem due to htree_dirblock_to_tree

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11873
Subject : unable to mount ext3 root filesystem due to htree_dirblock_to_tree
Submitter : [email protected]
Date : 2008-10-28 05:09 (13 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4c46501d1659475dc6c89554af6ce7fe6ecf615c
Handled-By : Tejun Heo <[email protected]>
             Neil Brown <[email protected]>

2008-11-09 19:58:32

by Rafael J. Wysocki

Subject: [Bug #11895] 2.6.28-rc2 regression: keyboard dead after reboot on Toshiba Portege 4000

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11895
Subject : 2.6.28-rc2 regression: keyboard dead after reboot on Toshiba Portege 4000
Submitter : Andrey Borzenkov <[email protected]>
Date : 2008-10-28 19:05 (13 days old)
References : http://marc.info/?l=linux-acpi&m=122522085418555&w=4
Handled-By : Andrey Borzenkov <[email protected]>
Patch : http://marc.info/?l=linux-kernel&m=122547719810921&w=4

2008-11-09 19:58:46

by Rafael J. Wysocki

Subject: [Bug #11896] [2.6.28-rc2] EeePC ACPI errors & exceptions

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11896
Subject : [2.6.28-rc2] EeePC ACPI errors & exceptions
Submitter : Darren Salt <[email protected]>
Date : 2008-10-27 22:52 (14 days old)
References : http://marc.info/?l=linux-kernel&m=122514911328761&w=4
Handled-By : Alexey Starikovskiy <[email protected]>
             Zhao Yakui <[email protected]>

2008-11-09 19:59:09

by Rafael J. Wysocki

Subject: [Bug #11899] sometime boot failed on T61 laptop

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11899
Subject : sometime boot failed on T61 laptop
Submitter : alexs <[email protected]>
Date : 2008-10-30 02:04 (11 days old)
Handled-By : Tejun Heo <[email protected]>

2008-11-09 19:59:30

by Rafael J. Wysocki

Subject: [Bug #11875] radeonfb lockup in .28-rc (bisected)

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11875
Subject : radeonfb lockup in .28-rc (bisected)
Submitter : James Cloos <[email protected]>
Date : 2008-10-28 0:00 (13 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b1ee26bab14886350ba12a5c10cbc0696ac679bf
References : http://marc.info/?l=linux-kernel&m=122515210200530&w=4
Handled-By : Benjamin Herrenschmidt <[email protected]>

2008-11-09 19:59:47

by Rafael J. Wysocki

Subject: [Bug #11903] regression: vmalloc easily fail

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11903
Subject : regression: vmalloc easily fail
Submitter : Glauber Costa <[email protected]>
Date : 2008-10-28 20:59 (13 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=db64fe02258f1507e13fe5212a989922323685ce
References : http://marc.info/?l=linux-kernel&m=122522755530998&w=4
Handled-By : Glauber Costa <[email protected]>
             Nick Piggin <[email protected]>
             Glauber Costa <[email protected]>
Patch : http://marc.info/?l=linux-kernel&m=122609055221549&w=4

2008-11-09 20:00:10

by Rafael J. Wysocki

Subject: [Bug #11898] mke2fs hang on AIC79 device.

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11898
Subject : mke2fs hang on AIC79 device.
Submitter : alexs <[email protected]>
Date : 2008-10-30 01:17 (11 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f0c0a376d0fcd4c5579ecf5e95f88387cba85211
Handled-By : James Bottomley <[email protected]>
Patch : http://bugzilla.kernel.org/show_bug.cgi?id=11898#c28

2008-11-09 20:00:40

by Rafael J. Wysocki

Subject: [Bug #11905] lots of extra timer interrupts costing 2W

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11905
Subject : lots of extra timer interrupts costing 2W
Submitter : Theodore Ts'o <[email protected]>
Date : 2008-10-30 2:18 (11 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fb02fbc14d17837b4b7b02dbb36142c16a7bf208
References : http://marc.info/?l=linux-kernel&m=122533314305315&w=4
             http://marc.info/?l=linux-kernel&m=122541849114444&w=4

2008-11-09 20:00:56

by Rafael J. Wysocki

Subject: [Bug #11906] 2.6.28-rc2 seems to fail at powering down the monitor when it should

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11906
Subject : 2.6.28-rc2 seems to fail at powering down the monitor when it should
Submitter : Gene Heskett <[email protected]>
Date : 2008-10-30 6:39 (11 days old)
References : http://marc.info/?l=linux-kernel&m=122534879721424&w=4

2008-11-09 20:01:25

by Rafael J. Wysocki

Subject: [Bug #11942] AMD64 reboot regression

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11942
Subject : AMD64 reboot regression
Submitter : Michael B. Trausch <[email protected]>
Date : 2008-11-02 20:30 (8 days old)
References : http://marc.info/?l=linux-kernel&m=122565790519736&w=4
Handled-By : Len Brown <[email protected]>
Patch : http://bugzilla.kernel.org/show_bug.cgi?id=11942#c11

2008-11-09 20:01:41

by Rafael J. Wysocki

Subject: [Bug #11911] new PCMCIA device instance after resume - orinoco can't download firmware

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11911
Subject : new PCMCIA device instance after resume - orinoco can't download firmware
Submitter : Andrey Borzenkov <[email protected]>
Date : 2008-10-28 19:19 (13 days old)
References : http://marc.info/?l=linux-wireless&m=122522165719760&w=4
Handled-By : Dave <[email protected]>
Patch : http://marc.info/?l=linux-wireless&m=122539058601588&w=4

2008-11-09 20:01:57

by Rafael J. Wysocki

Subject: [Bug #11913] USB/INPUT: slab error in cache_alloc_debugcheck_after(): double free?

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11913
Subject : USB/INPUT: slab error in cache_alloc_debugcheck_after(): double free?
Submitter : Helge Deller <[email protected]>
Date : 2008-10-30 23:11 (11 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=cb8f488c33539f096580e202f5438a809195008f
References : http://marc.info/?l=linux-kernel&m=122540833301394&w=4
Handled-By : Jiri Kosina <[email protected]>
Jiri Slaby <[email protected]>
Denys Vlasenko <[email protected]>

2008-11-09 20:02:22

by Rafael J. Wysocki

Subject: [Bug #11937] ext3 __log_wait_for_space: no transactions

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11937
Subject : ext3 __log_wait_for_space: no transactions
Submitter : Meelis Roos <[email protected]>
Date : 2008-10-30 9:49 (11 days old)
References : http://marc.info/?l=linux-kernel&m=122536026105643&w=4
Handled-By : Theodore Tso <[email protected]>
Patch : http://lkml.org/lkml/2008/11/1/61

2008-11-09 20:02:41

by Rafael J. Wysocki

Subject: [Bug #11917] Asus Eee PC hotkeys stop working after prolonged usage

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11917
Subject : Asus Eee PC hotkeys stop working after prolonged usage
Submitter : Alan Jenkins <[email protected]>
Date : 2008-10-31 03:21 (10 days old)
Handled-By : Alexey Starikovskiy <[email protected]>
Patch : http://marc.info/?l=linux-acpi&m=122603281422097&w=4

2008-11-09 20:02:57

by Rafael J. Wysocki

Subject: [Bug #11928] ath5k gets lost with eeepc-laptop removal

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11928
Subject : ath5k gets lost with eeepc-laptop removal
Submitter : Luiz Fernando N. Capitulino <[email protected]>
Date : 2008-10-31 13:05 (10 days old)
References : http://marc.info/?l=linux-kernel&m=122545827204957&w=4
Handled-By : Nick Kossifidis <[email protected]>

2008-11-09 20:03:27

by Rafael J. Wysocki

Subject: [Bug #11908] linux-2.6.28-rc2 regression : oprofile doesn't work anymore

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11908
Subject : linux-2.6.28-rc2 regression : oprofile doesn't work anymore
Submitter : Eric Dumazet <[email protected]>
Date : 2008-10-30 18:01 (11 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c493756e2a8a78bcaae30668317890dcfe86e7c3
References : http://marc.info/?l=linux-kernel&m=122539004100532&w=4

2008-11-09 20:03:43

by Rafael J. Wysocki

Subject: [Bug #11925] cdrom: missing compat ioctls

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11925
Subject : cdrom: missing compat ioctls
Submitter : Andreas Schwab <[email protected]>
Date : 2008-10-31 14:02 (10 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=33c2dca4957bd0da3e1af7b96d0758d97e708ef6
Handled-By : Andreas Schwab <[email protected]>
Patch : http://marc.info/?l=linux-kernel&m=122548923531545&w=2

2008-11-09 20:04:00

by Rafael J. Wysocki

Subject: [Bug #11947] 2.6.28-rc VC switching with Intel graphics broken

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11947
Subject : 2.6.28-rc VC switching with Intel graphics broken
Submitter : Romano Giannetti <[email protected]>
Date : 2008-11-03 12:10 (7 days old)
Handled-By : Jesse Barnes <[email protected]>

2008-11-09 20:04:23

by Rafael J. Wysocki

Subject: [Bug #11958] [2.6.27.x => 2.6.28-rc3] Xorg crash with xf86MapVidMem error

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11958
Subject : [2.6.27.x => 2.6.28-rc3] Xorg crash with xf86MapVidMem error
Submitter : Tomasz Chmielewski <[email protected]>
Date : 2008-11-05 05:37 (5 days old)

2008-11-09 20:04:42

by Rafael J. Wysocki

Subject: [Bug #11965] regression introduced by - timers: fix itimer/many thread hang

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11965
Subject : regression introduced by - timers: fix itimer/many thread hang
Submitter : Doug Chapman <[email protected]>
Date : 2008-11-06 11:03 (4 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=f06febc96ba8e0af80bcc3eaec0a109e88275fac
References : http://marc.info/?l=linux-kernel&m=122596943416648&w=4
Handled-By : Frank Mayhar <[email protected]>
Peter Zijlstra <[email protected]>
Ingo Molnar <[email protected]>

2008-11-09 20:04:56

by Rafael J. Wysocki

Subject: [Bug #11982] Fan level 7 after resume with 2.6.28-rc3

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11982
Subject : Fan level 7 after resume with 2.6.28-rc3
Submitter : Tino Keitel <[email protected]>
Date : 2008-11-05 7:33 (5 days old)
References : http://marc.info/?l=linux-kernel&m=122587043409186&w=4
Handled-By : Henrique de Moraes Holschuh <[email protected]>
Patch : http://bugzilla.kernel.org/attachment.cgi?id=18744&action=view

2008-11-09 20:05:51

by Rafael J. Wysocki

Subject: [Bug #11987] Bootup time regression from 2.6.27 to 2.6.28-rc3+

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11987
Subject : Bootup time regression from 2.6.27 to 2.6.28-rc3+
Submitter : Lukas Hejtmanek <[email protected]>
Date : 2008-11-04 17:33 (6 days old)
References : http://marc.info/?l=linux-kernel&m=122582006601658&w=4

2008-11-09 20:05:32

by Rafael J. Wysocki

Subject: [Bug #11970] gettimeofday returns an old time in mmbench

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11970
Subject : gettimeofday returns an old time in mmbench
Submitter : alexs <[email protected]>
Date : 2008-11-06 23:57 (4 days old)
Handled-By : Ingo Molnar <[email protected]>

2008-11-09 20:06:16

by Rafael J. Wysocki

Subject: [Bug #11984] regression when switching TTY->X, input related?

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11984
Subject : regression when switching TTY->X, input related?
Submitter : Bernhard Schmidt <[email protected]>
Date : 2008-11-05 22:04 (5 days old)
References : http://marc.info/?l=linux-kernel&m=122592278403853&w=4

2008-11-09 20:06:33

by Rafael J. Wysocki

Subject: [Bug #11985] 2.6.28-rc3 truncates nfsd results

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11985
Subject : 2.6.28-rc3 truncates nfsd results
Submitter : Doug Nazar <[email protected]>
Date : 2008-11-04 18:27 (6 days old)
References : http://marc.info/?l=linux-kernel&m=122582366509153&w=4
Handled-By : Doug Nazar <[email protected]>
J. Bruce Fields <[email protected]>
Patch : http://marc.info/?l=linux-kernel&m=122592648119790&w=4

2008-11-09 20:06:49

by Rafael J. Wysocki

Subject: [Bug #11986] 2.6.28-rc2-git1: spitz still won't boot

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11986
Subject : 2.6.28-rc2-git1: spitz still won't boot
Submitter : Pavel Machek <[email protected]>
Date : 2008-11-05 14:23 (5 days old)
References : http://marc.info/?l=linux-kernel&m=122589528016337&w=4

2008-11-09 20:07:17

by Rafael J. Wysocki

Subject: [Bug #11988] Eliminate recursive mutex in compat fb ioctl path

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11988
Subject : Eliminate recursive mutex in compat fb ioctl path
Submitter : Keith Packard <[email protected]>
Date : 2008-11-03 7:06 (7 days old)
References : http://marc.info/?l=linux-kernel&m=122569604828448&w=4
Handled-By : Keith Packard <[email protected]>
Geert Uytterhoeven <[email protected]>
Patch : http://marc.info/?l=linux-kernel&m=122569604828448&w=4
http://lkml.org/lkml/2008/10/31/162

2008-11-09 20:07:41

by Rafael J. Wysocki

Subject: [Bug #11989] Suspend failure on NForce4-based boards due to changes in stop_machine

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11989
Subject : Suspend failure on NForce4-based boards due to changes in stop_machine
Submitter : Rafael J. Wysocki <[email protected]>
Date : 2008-11-03 0:28 (7 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c9583e55fa2b08a230c549bd1e3c0bde6c50d9cc
References : http://marc.info/?l=linux-kernel&m=122567187604356&w=4

2008-11-09 20:07:57

by Rafael J. Wysocki

Subject: [Bug #11994] Computer doesn't power down after commit ACPI: EC: do transaction from interrupt context

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11994
Subject : Computer doesn't power down after commit ACPI: EC: do transaction from interrupt context
Submitter : François Valenduc <[email protected]>
Date : 2008-11-09 02:02 (1 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=5ceb40417bca2045350e77f740e0c4c94875fff2
Handled-By : ykzhao <[email protected]>

2008-11-09 20:08:20

by Rafael J. Wysocki

Subject: [Bug #11996] Tracing framework regression in 2.6.28-rc3

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.27. Please verify if it still should be listed and let me know
(either way).


Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11996
Subject : Tracing framework regression in 2.6.28-rc3
Submitter : Pekka Paalanen <[email protected]>
Date : 2008-11-09 10:13 (1 days old)
References : http://marc.info/?l=linux-kernel&m=122624392229317&w=4
Handled-By : Steven Rostedt <[email protected]>

2008-11-09 21:06:41

by J. Bruce Fields

Subject: Re: [Bug #11985] 2.6.28-rc3 truncates nfsd results

On Sun, Nov 09, 2008 at 06:59:15PM +0100, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.27. Please verify if it still should be listed and let me know
> (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11985
> Subject : 2.6.28-rc3 truncates nfsd results
> Submitter : Doug Nazar <[email protected]>
> Date : 2008-11-04 18:27 (6 days old)
> References : http://marc.info/?l=linux-kernel&m=122582366509153&w=4
> Handled-By : Doug Nazar <[email protected]>
> J. Bruce Fields <[email protected]>
> Patch : http://marc.info/?l=linux-kernel&m=122592648119790&w=4

The above patch has just been submitted to Linus.

--b.

2008-11-09 21:16:20

by Benjamin Herrenschmidt

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

On Sun, 2008-11-09 at 18:59 +0100, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.27. Please verify if it still should be listed and let me know
> (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11875
> Subject : radeonfb lockup in .28-rc (bisected)
> Submitter : James Cloos <[email protected]>
> Date : 2008-10-28 0:00 (13 days old)
> First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b1ee26bab14886350ba12a5c10cbc0696ac679bf
> References : http://marc.info/?l=linux-kernel&m=122515210200530&w=4
> Handled-By : Benjamin Herrenschmidt <[email protected]>
>

FYI, I'm back at work today, at which point I'll have a machine similar
to one of the victims', which should allow me to either reproduce and fix
the problem or, if I can't, send a workaround that disables that
specific acceleration unless explicitly enabled from the command line.

So expect a patch later today.

Cheers,
Ben.

2008-11-09 23:00:28

by Andreas Schwab

Subject: Re: [Bug #11925] cdrom: missing compat ioctls

"Rafael J. Wysocki" <[email protected]> writes:

> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11925
> Subject : cdrom: missing compat ioctls
> Submitter : Andreas Schwab <[email protected]>
> Date : 2008-10-31 14:02 (10 days old)
> First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=33c2dca4957bd0da3e1af7b96d0758d97e708ef6
> Handled-By : Andreas Schwab <[email protected]>
> Patch : http://marc.info/?l=linux-kernel&m=122548923531545&w=2

The patch has been picked up by akpm.

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2008-11-09 23:25:24

by Rafael J. Wysocki

Subject: Re: [Bug #11925] cdrom: missing compat ioctls

On Monday, 10 of November 2008, Andreas Schwab wrote:
> "Rafael J. Wysocki" <[email protected]> writes:
>
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11925
> > Subject : cdrom: missing compat ioctls
> > Submitter : Andreas Schwab <[email protected]>
> > Date : 2008-10-31 14:02 (10 days old)
> > First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=33c2dca4957bd0da3e1af7b96d0758d97e708ef6
> > Handled-By : Andreas Schwab <[email protected]>
> > Patch : http://marc.info/?l=linux-kernel&m=122548923531545&w=2
>
> The patch has been picked up by akpm.

OK, but has it been merged into mainline already?

Rafael

2008-11-09 23:39:55

by Andreas Schwab

Subject: Re: [Bug #11925] cdrom: missing compat ioctls

"Rafael J. Wysocki" <[email protected]> writes:

> OK, but has it been merged into mainline already?

No.

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2008-11-10 03:56:22

by Andrey Borzenkov

Subject: Re: [Bug #11911] new PCMCIA device instance after resume - orinoco can't download firmware

On Sunday 09 November 2008, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.27. Please verify if it still should be listed and let me know
> (either way).
>

Still present in rc4.

>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11911
> Subject : new PCMCIA device instance after resume - orinoco can't download firmware
> Submitter : Andrey Borzenkov <[email protected]>
> Date : 2008-10-28 19:19 (13 days old)
> References : http://marc.info/?l=linux-wireless&m=122522165719760&w=4
> Handled-By : Dave <[email protected]>
> Patch : http://marc.info/?l=linux-wireless&m=122539058601588&w=4
>
>
>



Attachments:
(No filename) (780.00 B)
signature.asc (197.00 B)
This is a digitally signed message part.

2008-11-10 05:26:55

by Michael B. Trausch

Subject: Re: [Bug #11942] AMD64 reboot regression

On Sun, 09 Nov 2008 20:01:31 UTC
"Rafael J. Wysocki" <[email protected]> wrote:

> The following bug entry is on the current list of known regressions
> from 2.6.27. Please verify if it still should be listed and let me
> know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11942
> Subject : AMD64 reboot regression

This one is still present in -rc4.

Any idea if the fix will make it in before release?

Thanks,
Mike

--
My sigfile ran away and is on hiatus.
http://www.trausch.us/


Attachments:
signature.asc (197.00 B)

2008-11-10 05:47:12

by Benjamin Herrenschmidt

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

On Sun, 2008-11-09 at 18:59 +0100, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.27. Please verify if it still should be listed and let me know
> (either way).

All right, so I finally managed to find a machine that reproduces it, and
I have a patch here that fixes it. I'm basically implementing the same
thing as X, which is to ensure the bitmap is padded to 32 pixels. The
core fbcon has support for that to a certain extent, so it's a fairly
small change.

Note that there was another bug: I think I was missing one
wait_for_fifo(), though fixing that didn't make a difference here.

However, it's possible that this significantly impacts performance,
maybe to the point where we may want to back out the imageblit
acceleration.

David, would you mind testing on your machine? It's the one that shows
the biggest performance improvement, and I would like to know how much
it is affected by that patch. As long as the "worst case" performance
is still reasonable, I'm ok to take the hit if the improvement for you
is still significant.

Cheers,
Ben.

radeonfb: Fix accel problems with new imageblit hook

Some radeon chips have issues with color expansion of pixmaps that
aren't a multiple of 32 pixels wide. This works around it the same
way X does by requesting the right pitch alignment from fbcon and
then using the chip scissors to do clipping to the requested size.

Signed-off-by: Benjamin Herrenschmidt <[email protected]>
---

If confirmed by the reporters (in CC), please apply for .28 as it
fixes a regression.

Index: linux-work/drivers/video/aty/radeon_accel.c
===================================================================
--- linux-work.orig/drivers/video/aty/radeon_accel.c 2008-11-10 14:05:06.000000000 +1100
+++ linux-work/drivers/video/aty/radeon_accel.c 2008-11-10 14:34:45.000000000 +1100
@@ -179,7 +179,7 @@ static void radeonfb_prim_imageblit(stru

radeonfb_set_creg(rinfo, DP_GUI_MASTER_CNTL, &rinfo->dp_gui_mc_cache,
rinfo->dp_gui_mc_base |
- GMC_BRUSH_NONE |
+ GMC_BRUSH_NONE | GMC_DST_CLIP_LEAVE |
GMC_SRC_DATATYPE_MONO_FG_BG |
ROP3_S |
GMC_BYTE_ORDER_MSB_TO_LSB |
@@ -189,9 +189,6 @@ static void radeonfb_prim_imageblit(stru
radeonfb_set_creg(rinfo, DP_SRC_FRGD_CLR, &rinfo->dp_src_fg_cache, fg);
radeonfb_set_creg(rinfo, DP_SRC_BKGD_CLR, &rinfo->dp_src_bg_cache, bg);

- radeon_fifo_wait(rinfo, 1);
- OUTREG(DST_Y_X, (image->dy << 16) | image->dx);
-
/* Ensure the dst cache is flushed and the engine idle before
* issuing the operation.
*
@@ -205,13 +202,19 @@ static void radeonfb_prim_imageblit(stru

/* X here pads width to a multiple of 32 and uses the clipper to
* adjust the result. Is that really necessary ? Things seem to
- * work ok for me without that and the doco doesn't seem to imply
+ * work ok for me without that and the doco doesn't seem to imply]
* there is such a restriction.
*/
- OUTREG(DST_WIDTH_HEIGHT, (image->width << 16) | image->height);
+ radeon_fifo_wait(rinfo, 4);
+ OUTREG(SC_TOP_LEFT, (image->dy << 16) | image->dx);
+ OUTREG(SC_BOTTOM_RIGHT, ((image->dy + image->height) << 16) |
+ (image->dx + image->width));
+ OUTREG(DST_Y_X, (image->dy << 16) | image->dx);
+
+ OUTREG(DST_HEIGHT_WIDTH, (image->height << 16) | ((image->width + 31) & ~31));

- src_bytes = (((image->width * image->depth) + 7) / 8) * image->height;
- dwords = (src_bytes + 3) / 4;
+ dwords = (image->width + 31) >> 5;
+ dwords *= image->height;
bits = (u32*)(image->data);

while(dwords >= 8) {
Index: linux-work/drivers/video/aty/radeon_base.c
===================================================================
--- linux-work.orig/drivers/video/aty/radeon_base.c 2008-11-10 14:01:50.000000000 +1100
+++ linux-work/drivers/video/aty/radeon_base.c 2008-11-10 14:36:26.000000000 +1100
@@ -1875,6 +1875,7 @@ static int __devinit radeon_set_fbinfo (
info->fbops = &radeonfb_ops;
info->screen_base = rinfo->fb_base;
info->screen_size = rinfo->mapped_vram;
+
/* Fill fix common fields */
strlcpy(info->fix.id, rinfo->name, sizeof(info->fix.id));
info->fix.smem_start = rinfo->fb_base_phys;
@@ -1889,8 +1890,25 @@ static int __devinit radeon_set_fbinfo (
info->fix.mmio_len = RADEON_REGSIZE;
info->fix.accel = FB_ACCEL_ATI_RADEON;

+ /* Allocate colormap */
fb_alloc_cmap(&info->cmap, 256, 0);

+ /* Setup pixmap used for acceleration */
+#define PIXMAP_SIZE (2048 * 4)
+
+ info->pixmap.addr = kmalloc(PIXMAP_SIZE, GFP_KERNEL);
+ if (!info->pixmap.addr) {
+ printk(KERN_ERR "radeonfb: Failed to allocate pixmap !\n");
+ noaccel = 1;
+ goto bail;
+ }
+ info->pixmap.size = PIXMAP_SIZE;
+ info->pixmap.flags = FB_PIXMAP_SYSTEM;
+ info->pixmap.scan_align = 4;
+ info->pixmap.buf_align = 4;
+ info->pixmap.access_align = 32;
+
+bail:
if (noaccel)
info->flags |= FBINFO_HWACCEL_DISABLED;
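
As a rough illustration of the arithmetic in the patch above (this is a userspace sketch, not part of the patch): the helpers below pad a mono bitmap's width to a multiple of 32 pixels, count the 32-bit words that would be pushed to the engine, and pack a coordinate pair the way the (dy << 16) | dx expressions do. The names pad_to_32, padded_dwords, packed_dwords and pack_yx are made up for this example; packed_dwords mirrors the src_bytes computation that the patch removes.

```c
#include <assert.h>
#include <stdint.h>

/* Round a width in pixels up to the next multiple of 32,
 * as in ((image->width + 31) & ~31). */
static uint32_t pad_to_32(uint32_t width)
{
	return (width + 31u) & ~31u;
}

/* 32-bit words needed when every 1bpp scanline is padded to 32 pixels,
 * as in the patched code: ((width + 31) >> 5) * height. */
static uint32_t padded_dwords(uint32_t width, uint32_t height)
{
	return ((width + 31u) >> 5) * height;
}

/* 32-bit words needed by the older, byte-packed computation that the
 * patch removes: scanlines padded to whole bytes only. */
static uint32_t packed_dwords(uint32_t width, uint32_t depth, uint32_t height)
{
	uint32_t src_bytes = (((width * depth) + 7u) / 8u) * height;

	return (src_bytes + 3u) / 4u;
}

/* Pack a (y, x) pair into one 32-bit value, y in the high half.
 * Each value must fit in the 16 bits this packing gives it. */
static uint32_t pack_yx(uint32_t y, uint32_t x)
{
	assert(x <= 0xFFFFu && y <= 0xFFFFu);
	return (y << 16) | x;
}
```

For an 8x16 glyph at depth 1, padded_dwords() gives 16 words where packed_dwords() gives 4, which is the kind of overhead the performance concern in the message above refers to.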


2008-11-10 07:13:28

by Paul Collins

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

Benjamin Herrenschmidt <[email protected]> writes:

> On Sun, 2008-11-09 at 18:59 +0100, Rafael J. Wysocki wrote:
>> This message has been generated automatically as a part of a report
>> of recent regressions.
>>
>> The following bug entry is on the current list of known regressions
>> from 2.6.27. Please verify if it still should be listed and let me know
>> (either way).
>
> All right, so I finally managed to find a machine to reproduce it and
> I have a patch that fixes it here. I'm basically implementing the same
> thing as X which is to ensure the bitmap is padded to 32 pixels.

Works great here (as you might expect).

--
Paul Collins
Wellington, New Zealand

Dag vijandelijk luchtschip de huismeester is dood

2008-11-10 09:06:09

by Benjamin Herrenschmidt

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

On Mon, 2008-11-10 at 20:13 +1300, Paul Collins wrote:
> Benjamin Herrenschmidt <[email protected]> writes:
>
> > On Sun, 2008-11-09 at 18:59 +0100, Rafael J. Wysocki wrote:
> >> This message has been generated automatically as a part of a report
> >> of recent regressions.
> >>
> >> The following bug entry is on the current list of known regressions
> >> from 2.6.27. Please verify if it still should be listed and let me know
> >> (either way).
> >
> > All right, so I finally managed to find a machine to reproduce it and
> > I have a patch that fixes it here. I'm basically implementing the same
> > thing as X which is to ensure the bitmap is padded to 32 pixels.
>
> Works great here (as you might expect).

Yeah, well, Albook G4 with rv350, I think we have the same machine :-)

Ben.

2008-11-10 09:06:48

by David Miller

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

From: Benjamin Herrenschmidt <[email protected]>
Date: Mon, 10 Nov 2008 16:46:25 +1100

> David, would you mind testing on your machine? It's the one that shows
> the biggest performance improvement, and I would like to know how much
> it is affected by that patch. As long as the "worst case" performance
> is still reasonable, I'm ok to take the hit if the improvement for you
> is still significant.

I will test this out at the very next opportunity.

2008-11-10 12:04:32

by Heiko Carstens

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to changes in stop_machine

On Sun, Nov 09, 2008 at 06:59:16PM +0100, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.27. Please verify if it still should be listed and let me know
> (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11989
> Subject : Suspend failure on NForce4-based boards due to changes in stop_machine
> Submitter : Rafael J. Wysocki <[email protected]>
> Date : 2008-11-03 0:28 (7 days old)
> First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c9583e55fa2b08a230c549bd1e3c0bde6c50d9cc
> References : http://marc.info/?l=linux-kernel&m=122567187604356&w=4

Hi Rafael,

Could you provide more information on this, please?

What is your kernel configuration?
Do you have any binary only modules (nvidia?) loaded?

Is it possible to recreate the bug by e.g. just doing something like

echo 0 > /sys/devices/system/cpu/cpu1/online

(or any other online cpu)? Or does it trigger any lockdep warnings?
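
The CPU offlining test suggested above can also be driven from a small C program. This is a minimal sketch, assuming the same sysfs CPU hotplug file that the shell command writes to; cpu_online_path and set_cpu_online are hypothetical helper names, and the write needs root privileges.

```c
#include <stdio.h>

/* Build the sysfs path of a CPU's "online" control file.
 * Returns 0 on success, -1 if the buffer is too small. */
static int cpu_online_path(unsigned int cpu, char *buf, size_t len)
{
	int n = snprintf(buf, len, "/sys/devices/system/cpu/cpu%u/online", cpu);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write "0" (offline) or "1" (online) to the control file.
 * Returns 0 on success, -1 if the file cannot be opened or written. */
static int set_cpu_online(unsigned int cpu, int online)
{
	char path[64];
	FILE *f;
	int ok;

	if (cpu_online_path(cpu, path, sizeof(path)) != 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs(online ? "1" : "0", f) >= 0;
	if (fclose(f) != 0)	/* sysfs reports write errors at flush time */
		ok = 0;

	return ok ? 0 : -1;
}
```

Calling set_cpu_online(1, 0) corresponds to the echo command quoted above; calling set_cpu_online(1, 1) afterwards would bring the CPU back.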

2008-11-10 14:42:54

by Rafael J. Wysocki

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to changes in stop_machine

On Monday, 10 of November 2008, Heiko Carstens wrote:
> On Sun, Nov 09, 2008 at 06:59:16PM +0100, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.27. Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11989
> > Subject : Suspend failure on NForce4-based boards due to changes in stop_machine
> > Submitter : Rafael J. Wysocki <[email protected]>
> > Date : 2008-11-03 0:28 (7 days old)
> > First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c9583e55fa2b08a230c549bd1e3c0bde6c50d9cc
> > References : http://marc.info/?l=linux-kernel&m=122567187604356&w=4
>
> Hi Rafael,

Hi,

> Could you provide more information on this, please?
>
> What is your kernel configuration?

Available at: http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc3/kitty-config

> Do you have any binary only modules (nvidia?) loaded?

No, I don't.

> Is it possible to recreate the bug by e.g. just doing something like
>
> echo 0 > /sys/devices/system/cpu/cpu1/online

I haven't checked yet; I'll do that later today and let you know.

> (or any other online cpu)? Or does it trigger any lockdep warnings?

Thanks,
Rafael

2008-11-10 16:53:56

by Andrey Borzenkov

Subject: Re: [Bug #11895] 2.6.28-rc2 regression: keyboard dead after reboot on Toshiba Portege 4000

On Sunday 09 November 2008, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.27. Please verify if it still should be listed and let me know
> (either way).
>

it is fixed in rc4

>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11895

Could you reassign this to the ACPI product so that this bug can be
investigated further, or should I open a separate one?


Attachments:
(No filename) (516.00 B)
signature.asc (197.00 B)
This is a digitally signed message part.

2008-11-10 18:01:52

by Rafael J. Wysocki

Subject: Re: [Bug #11895] 2.6.28-rc2 regression: keyboard dead after reboot on Toshiba Portege 4000

On Monday, 10 of November 2008, Andrey Borzenkov wrote:
> On Sunday 09 November 2008, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.27. Please verify if it still should be listed and let me know
> > (either way).
> >
>
> it is fixed in rc4
>
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11895
>
> Could you reassign this to the ACPI product so that this bug can be
> investigated further, or should I open a separate one?

Since you're saying it's fixed in -rc4, I'll close it; please open a
separate one for the issue that hasn't been fixed yet.

Thanks,
Rafael

2008-11-10 20:39:50

by Andreas Schwab

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

Benjamin Herrenschmidt <[email protected]> writes:

> radeonfb: Fix accel problems with new imageblit hook
>
> Some radeon chips have issues with color expansion of pixmaps that
> aren't a multiple of 32 pixels wide. This works around it the same
> way X does by requesting the right pitch alignment from fbcon and
> then using the chip scissors to do clipping to the requested size.

Unfortunately this does not fix the suspend regression on PowerBook6,7.
Instead I have to use the workaround in
<http://marc.info/?l=linux-kernel&m=122515268301239&w=2>.

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2008-11-10 21:53:55

by Benjamin Herrenschmidt

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

On Mon, 2008-11-10 at 21:39 +0100, Andreas Schwab wrote:
> Benjamin Herrenschmidt <[email protected]> writes:
>
> > radeonfb: Fix accel problems with new imageblit hook
> >
> > Some radeon chips have issues with color expansion of pixmaps that
> > aren't a multiple of 32 pixels wide. This works around it the same
> > way X does by requesting the right pitch alignment from fbcon and
> > then using the chip scissors to do clipping to the requested size.
>
> Unfortunately this does not fix the suspend regression on PowerBook6,7.
> Instead I have to use the workaround in
> <http://marc.info/?l=linux-kernel&m=122515268301239&w=2>.

Strange. Does the suspend problem also happen when X hasn't been launched at all?

Ben.

2008-11-10 22:50:57

by Rafael J. Wysocki

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Monday, 10 of November 2008, Rafael J. Wysocki wrote:
> On Monday, 10 of November 2008, Heiko Carstens wrote:
> > On Sun, Nov 09, 2008 at 06:59:16PM +0100, Rafael J. Wysocki wrote:
> > > This message has been generated automatically as a part of a report
> > > of recent regressions.
> > >
> > > The following bug entry is on the current list of known regressions
> > > from 2.6.27. Please verify if it still should be listed and let me know
> > > (either way).
> > >
> > >
> > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11989
> > > Subject : Suspend failure on NForce4-based boards due to chanes in stop_machine
> > > Submitter : Rafael J. Wysocki <[email protected]>
> > > Date : 2008-11-03 0:28 (7 days old)
> > > First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c9583e55fa2b08a230c549bd1e3c0bde6c50d9cc
> > > References : http://marc.info/?l=linux-kernel&m=122567187604356&w=4
> >
> > Hi Rafael,
>
> Hi,
>
> > could you provide more informations for this, please?
> >
> > What is your kernel configuration?
>
> Available at: http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc3/kitty-config
>
> > Do you have any binary only modules (nvidia?) loaded?
>
> No, I don't.
>
> > Is it possible to recreate the bug by e.g. just doing something like
> >
> > echo 0 > /sys/devices/system/cpu/cpu1/online
>
> I haven't checked (yet), I'll do that later today and let you know.
>
> > (or any other online cpu)? Or does it trigger any lockdep warnings?

It cannot be reproduced with offlining CPU1 and it doesn't trigger any
warnings from lockdep.

However, it is reproducible by doing

# echo core > /sys/power/pm_test

and repeating

# echo disk > /sys/power/state

for a couple of times, in which case the last two lines printed to the console
before a (solid) hang are:

SMP alternatives: switching to SMP code
Booting processor 1 APIC 0x1 ip 0x6000

So, it evidently fails while re-enabling the non-boot CPU and not during
disabling it as I thought before.

With commit c9583e55fa2b08a230c549bd1e3c0bde6c50d9cc reverted the issue is
not reproducible any more.

Thanks,
Rafael
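
A minimal sketch of the reproduction loop described above, written in C so
that it can be driven from a test harness. The helper names write_attr() and
repro_cycles() are hypothetical; only the two sysfs paths and the values
"core" and "disk" come from the message. It assumes the caller passes
/sys/power/pm_test and /sys/power/state and runs as root.

```c
#include <assert.h>
#include <stdio.h>

/* Write 'value' to the attribute file at 'path', like "echo value > path".
 * Returns 0 on success, -1 if the file cannot be opened or written. */
int write_attr(const char *path, const char *value)
{
	FILE *f;
	int ok;

	assert(path != NULL && value != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(value, f) >= 0;
	/* Buffered write errors only show up when the stream is closed. */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Select the "core" test level once, then request "disk" up to 'cycles'
 * times.  Returns the number of cycles that completed, or -1 if the test
 * level could not be set. */
int repro_cycles(const char *pm_test, const char *pm_state, int cycles)
{
	int i;

	if (write_attr(pm_test, "core\n") != 0)
		return -1;
	for (i = 0; i < cycles; i++)
		if (write_attr(pm_state, "disk\n") != 0)
			break;
	return i;
}
```

Called as repro_cycles("/sys/power/pm_test", "/sys/power/state", 10), this
would issue the same writes as the shell commands above; on an affected
machine the call would presumably never return once the hang is hit.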

2008-11-10 23:20:42

by Andreas Schwab

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

Benjamin Herrenschmidt <[email protected]> writes:

> On Mon, 2008-11-10 at 21:39 +0100, Andreas Schwab wrote:
>> Benjamin Herrenschmidt <[email protected]> writes:
>>
>> > radeonfb: Fix accel problems with new imageblit hook
>> >
>> > Some radeon chips have issues with color expansion of pixmaps that
>> > aren't a multiple of 32 pixels wide. This works around it the same
>> > way X does by requesting the right pitch alignment from fbcon and
>> > then using the chip scissors to do clipping to the requested size.
>>
>> Unfortunately this does not fix the suspend regression on PowerBook6,7.
>> Instead I have to use the workaround in
>> <http://marc.info/?l=linux-kernel&m=122515268301239&w=2>.
>
> Strange. Does the suspend problem also happen when X hasn't been launched at all?

There seems to be some race involved here. I cannot reproduce the
problem ATM.

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux Products GmbH, Maxfeldstra?e 5, 90409 N?rnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2008-11-10 23:35:53

by Benjamin Herrenschmidt

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)


> There seems to be some race involved here. I cannot reproduce the
> problem ATM.

I wonder if it's related to the new acceleration at all then.

I've tried various suspend/resume cycles in straight console mode using
directly snooze -f (kernel ioctl) and from X using ubuntu intrepid and
gnome power manager and it worked fine on a 5,6 which should be fairly
similar to your 6,7 I think.

It's possible that there's yet another X related race though. I've seen
cases of X whacking the chip -after- it has relinquished the console back
to the kernel (back to KD_TEXT) in the past which is very wrong, though
I didn't spot that during my testing, there could be some race lurking
there.

Can you describe your problem more precisely ? I didn't see (or forgot)
your initial report. Did it crash on suspend or wakeup ? what symptoms ?

Note also that on PowerBooks, there's a platform hook that allows
radeonfb to wake up the video chip _very_ early, thus allowing easier
debugging of the boot process, so even races like that on wakeup would
surprise me since we do wake up the chip before we even get a chance to
schedule userspace again (in fact before we even bring back the L2
cache !)

Cheers,
Ben.

2008-11-10 23:55:04

by Andreas Schwab

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

Benjamin Herrenschmidt <[email protected]> writes:

> Can you describe your problem more precisely ?

It crashes during suspend (after the console was switched away from X),
but I can only see a frame buffer with apparently random contents when
it happens. When suspend works then those random frame buffer contents
are only briefly visible before the screen is cleared.

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux Products GmbH, Maxfeldstra?e 5, 90409 N?rnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2008-11-11 01:50:44

by Benjamin Herrenschmidt

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

On Tue, 2008-11-11 at 00:54 +0100, Andreas Schwab wrote:
>
> > Can you describe your problem more precisely ?
>
> It crashes during suspend (after the console was switched away from X),
> but I can only see a frame buffer with apparently random contents when
> it happens. When suspend works then those random frame buffer contents
> are only briefly visible before the screen is cleared.

Does it actually switch away from X?

I.e., do you see the console before the crap on the console, or not?

I've seen what you describe happening when doing snooze -f (direct
kernel ioctl) straight from within X. It seems to me that the problem
was that for some reason it didn't switch the console, which would
definitely make it crash. I need to double check what's up, it's
possible that the kernel fails to switch it properly or fails to wait
for X to ack the switch.

In any case, it doesn't seem to be directly related to those radeonfb
changes, though a clash with X like that is indeed more likely to
actually happen if radeonfb relies more heavily on acceleration.

I'll have a look later today at the console switch from X in the kernel
to see if it's been broken in one way or another.

Note: I just did some tests using both echo "mem" >/sys/power/state and
snooze -f and it worked fine. IE, the console switch away from X worked.
So while I think I observed your problem once, I also cannot reproduce
it now.

I wonder if there's a race condition in the VT switch. It's possible
that it could be yet another case of X whacking the chip after it has
effectively relinquished control of the VT to the kernel, or it could be
a kernel race.

Cheers,
Ben.

2008-11-11 02:48:21

by Linus Torvalds

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)



On Tue, 11 Nov 2008, Benjamin Herrenschmidt wrote:
>
> In any case, it doesn't seem to be directly related to those radeonfb
> changes, though a clash with X like that is indeed more likely to
> actually happen if radeonfb relies more heavily on acceleration.

Just a silly question, without actually looking at the code - since you
now do acceleration in radeonfb, do you wait for everything to drain
before you switch consoles?

There could be races that depend on timing, where perhaps X is unhappy
about being entered with the acceleration engine busy, or conversely the
radeonfb code is unhappy about perhaps some still-in-progress X thing that
hasn't been synchronously waited for..

Before, radeonfb_imageblit() would always end up doing a
"radeon_engine_idle()", so in practice, I think just about any fbcon
access ended up idling the engine. Now, we can probably do a lot more
without synchronizing - maybe there's insufficient synchronization at the
switch-over from X to text-mode or vice versa?

Linus

2008-11-11 03:21:34

by Benjamin Herrenschmidt

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

On Mon, 2008-11-10 at 18:47 -0800, Linus Torvalds wrote:
>
> On Tue, 11 Nov 2008, Benjamin Herrenschmidt wrote:
> >
> > In any case, it doesn't seem to be directly related to those radeonfb
> > changes, though a clash with X like that is indeed more likely to
> > actually happen if radeonfb relies more heavily on acceleration.
>
> Just a silly question, without actually looking at the code - since you
> now do acceleration in radeonfb, do you wait for everything to drain
> before you switch consoles?

radeonfb has been doing acceleration for some time :-) Just not color
expansion, only blits and solid fills (so basically scrolling). That is
a lot less common though and thus it's possible that existing races
didn't show up until now.

It does drain the engine in various cases, typically mode change,
blanking, sync callback. fbcon core should at least sync if not blank
when switching to KD_GRAPHICS (or at least used to, I need to double
check). I have additional guards also that disable use of the engine
when sleeping.

> There could be races that depend on timing, where perhaps X is unhappy
> about being entered with the acceleration engine busy, or conversely the
> radeonfb code is unhappy about perhaps some still-in-progress X thing that
> hasn't been synchronously waited for..

Yes. From what's been reported, the more likely thing would be a race
when switching away from X.

> Before, radeonfb_imageblit() would always end up doing a
> "radeon_engine_idle()", so in practice, I think just about any fbcon
> access ended up idling the engine. Now, we can probably do a lot more
> without synchronizing - maybe there's insufficient synchronization at the
> switch-over from X to text-mode or vice versa?

Switch over from X should restore KD_TEXT which should turn to a call to
set_par() that idles the engine before anything gets written to the
screen, but those code paths are intertwined between the VT code and fbcon,
and things may well be subtly broken. I'll dig later today after I'm
done with some other emergency.

At one point, I fixed a crapload of VT bugs where things were done
without any locking, nowadays, everything should pretty much be covered
by the console semaphore, but maybe there's still a problem there.
Another area to look at is X itself. I've had problems with X (or the
DRM) still whacking the card after handing back the console to the
kernel in the past, so it wouldn't surprise me if there was something
bogus there too.

I also had problems with fbcon trying to draw before it re-initialized
the card (ie, it -should- call set_par before any new draw operation
when switching back from KD_GRAPHICS, if not, we don't properly get to
reconfigure the engine before we try to use it, which can be fatal), but
those were fixed last time I looked.

Anyway, I'll dig and let you know what I find.

Cheers,
Ben.

2008-11-11 08:43:54

by Romano Giannetti

Subject: Re: [Bug #11947] 2.6.28-rc VC switching with Intel graphics broken

Rafael J. Wysocki wrote:
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11947
> Subject : 2.6.28-rc VC switching with Intel graphics broken
> Submitter : Romano Giannetti <[email protected]>
> Date : 2008-11-03 12:10 (7 days old)
> Handled-By : Jesse Barnes <[email protected]>

Still here in 2.6.28-rc4. Complete lockup when switching back from a VC to X.

Romano

2008-11-11 09:31:42

by Andreas Schwab

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

It looks like you are observing the same failure mode that I do.

Andreas.

--
Andreas Schwab, SuSE Labs, [email protected]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

2008-11-11 10:53:24

by Ingo Molnar

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine


* Rafael J. Wysocki <[email protected]> wrote:

> However, it is reproducible by doing
>
> # echo core > /sys/power/pm_test
>
> and repeating
>
> # echo disk > /sys/power/state
>
> for a couple of times, in which case the last two lines printed to the console
> before a (solid) hang are:
>
> SMP alternatives: switching to SMP code
> Booting processor 1 APIC 0x1 ip 0x6000
>
> So, it evidently fails while re-enabling the non-boot CPU and not
> during disabling it as I thought before.
>
> With commit c9583e55fa2b08a230c549bd1e3c0bde6c50d9cc reverted the
> issue is not reproducible any more.

[ Cc:-ed workqueue/locking/suspend-race-condition experts. ]

Seems like the new kernel/stop_machine.c logic has a race for the test
sequence above. (Below is the bisected commit again, maybe the race is
visible via email review as well.)

Ingo

-------------->
From c9583e55fa2b08a230c549bd1e3c0bde6c50d9cc Mon Sep 17 00:00:00 2001
From: Heiko Carstens <[email protected]>
Date: Mon, 13 Oct 2008 23:50:10 +0200
Subject: [PATCH] stop_machine: use workqueues instead of kernel threads

Convert stop_machine to a workqueue based approach. Instead of using kernel
threads for stop_machine we now use a an rt workqueue to synchronize all
cpus.
This has the advantage that all needed per cpu threads are already created
when stop_machine gets called. And therefore a call to stop_machine won't
fail anymore. This is needed for s390 which needs a mechanism to synchronize
all cpus without allocating any memory.
As Rusty pointed out free_module() needs a non-failing stop_machine interface
as well.

As a side effect the stop_machine code gets simplified.

Signed-off-by: Heiko Carstens <[email protected]>
Signed-off-by: Rusty Russell <[email protected]>
---
kernel/stop_machine.c | 111 ++++++++++++++++++-------------------------------
1 files changed, 41 insertions(+), 70 deletions(-)

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index af3c7ce..0e688c6 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -37,9 +37,13 @@ struct stop_machine_data {
 /* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
 static unsigned int num_threads;
 static atomic_t thread_ack;
-static struct completion finished;
 static DEFINE_MUTEX(lock);

+static struct workqueue_struct *stop_machine_wq;
+static struct stop_machine_data active, idle;
+static const cpumask_t *active_cpus;
+static void *stop_machine_work;
+
 static void set_state(enum stopmachine_state newstate)
 {
 	/* Reset ack counter. */
@@ -51,21 +55,25 @@ static void set_state(enum stopmachine_state newstate)
 /* Last one to ack a state moves to the next state. */
 static void ack_state(void)
 {
-	if (atomic_dec_and_test(&thread_ack)) {
-		/* If we're the last one to ack the EXIT, we're finished. */
-		if (state == STOPMACHINE_EXIT)
-			complete(&finished);
-		else
-			set_state(state + 1);
-	}
+	if (atomic_dec_and_test(&thread_ack))
+		set_state(state + 1);
 }

-/* This is the actual thread which stops the CPU. It exits by itself rather
- * than waiting for kthread_stop(), because it's easier for hotplug CPU. */
-static int stop_cpu(struct stop_machine_data *smdata)
+/* This is the actual function which stops the CPU. It runs
+ * in the context of a dedicated stopmachine workqueue. */
+static void stop_cpu(struct work_struct *unused)
 {
 	enum stopmachine_state curstate = STOPMACHINE_NONE;
-
+	struct stop_machine_data *smdata = &idle;
+	int cpu = smp_processor_id();
+
+	if (!active_cpus) {
+		if (cpu == first_cpu(cpu_online_map))
+			smdata = &active;
+	} else {
+		if (cpu_isset(cpu, *active_cpus))
+			smdata = &active;
+	}
 	/* Simple state machine */
 	do {
 		/* Chill out and ensure we re-read stopmachine_state. */
@@ -90,7 +98,6 @@ static int stop_cpu(struct stop_machine_data *smdata)
 	} while (curstate != STOPMACHINE_EXIT);

 	local_irq_enable();
-	do_exit(0);
 }

 /* Callback for CPUs which aren't supposed to do anything. */
@@ -101,78 +108,34 @@ static int chill(void *unused)

 int __stop_machine(int (*fn)(void *), void *data, const cpumask_t *cpus)
 {
-	int i, err;
-	struct stop_machine_data active, idle;
-	struct task_struct **threads;
+	struct work_struct *sm_work;
+	int i;

+	/* Set up initial state. */
+	mutex_lock(&lock);
+	num_threads = num_online_cpus();
+	active_cpus = cpus;
 	active.fn = fn;
 	active.data = data;
 	active.fnret = 0;
 	idle.fn = chill;
 	idle.data = NULL;

-	/* This could be too big for stack on large machines. */
-	threads = kcalloc(NR_CPUS, sizeof(threads[0]), GFP_KERNEL);
-	if (!threads)
-		return -ENOMEM;
-
-	/* Set up initial state. */
-	mutex_lock(&lock);
-	init_completion(&finished);
-	num_threads = num_online_cpus();
 	set_state(STOPMACHINE_PREPARE);

-	for_each_online_cpu(i) {
-		struct stop_machine_data *smdata = &idle;
-		struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
-
-		if (!cpus) {
-			if (i == first_cpu(cpu_online_map))
-				smdata = &active;
-		} else {
-			if (cpu_isset(i, *cpus))
-				smdata = &active;
-		}
-
-		threads[i] = kthread_create((void *)stop_cpu, smdata, "kstop%u",
-					    i);
-		if (IS_ERR(threads[i])) {
-			err = PTR_ERR(threads[i]);
-			threads[i] = NULL;
-			goto kill_threads;
-		}
-
-		/* Place it onto correct cpu. */
-		kthread_bind(threads[i], i);
-
-		/* Make it highest prio. */
-		if (sched_setscheduler_nocheck(threads[i], SCHED_FIFO, &param))
-			BUG();
-	}
-
-	/* We've created all the threads. Wake them all: hold this CPU so one
+	/* Schedule the stop_cpu work on all cpus: hold this CPU so one
 	 * doesn't hit this CPU until we're ready. */
 	get_cpu();
-	for_each_online_cpu(i)
-		wake_up_process(threads[i]);
-
+	for_each_online_cpu(i) {
+		sm_work = percpu_ptr(stop_machine_work, i);
+		INIT_WORK(sm_work, stop_cpu);
+		queue_work_on(i, stop_machine_wq, sm_work);
+	}
 	/* This will release the thread on our CPU. */
 	put_cpu();
-	wait_for_completion(&finished);
+	flush_workqueue(stop_machine_wq);
 	mutex_unlock(&lock);
-
-	kfree(threads);
-
 	return active.fnret;
-
-kill_threads:
-	for_each_online_cpu(i)
-		if (threads[i])
-			kthread_stop(threads[i]);
-	mutex_unlock(&lock);
-
-	kfree(threads);
-	return err;
 }

 int stop_machine(int (*fn)(void *), void *data, const cpumask_t *cpus)
@@ -187,3 +150,11 @@ int stop_machine(int (*fn)(void *), void *data, const cpumask_t *cpus)
 	return ret;
 }
 EXPORT_SYMBOL_GPL(stop_machine);
+
+static int __init stop_machine_init(void)
+{
+	stop_machine_wq = create_rt_workqueue("kstop");
+	stop_machine_work = alloc_percpu(struct work_struct);
+	return 0;
+}
+early_initcall(stop_machine_init);
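
To make the ack-counter handshake in the patch above easier to follow, here
is a single-threaded userspace model of it. This is only a sketch under
stated assumptions: struct sm_model, model_set_state(), model_ack_state() and
model_run() are hypothetical names, the "CPUs" are stepped round-robin
instead of running concurrently, and the two intermediate state names
(DISABLE_IRQ, RUN) are given as I recall them from kernel/stop_machine.c
rather than from the hunks shown. It cannot exhibit the race being hunted; it
only shows how the last acker advances the shared state.

```c
#include <assert.h>

#define MODEL_MAX_CPUS 64

enum stopmachine_state {
	STOPMACHINE_NONE,
	STOPMACHINE_PREPARE,
	STOPMACHINE_DISABLE_IRQ,	/* assumed name, not shown in the hunks */
	STOPMACHINE_RUN,		/* assumed name, not shown in the hunks */
	STOPMACHINE_EXIT,
};

struct sm_model {
	int state;			/* shared state every CPU polls */
	unsigned int num_threads;	/* CPUs taking part */
	unsigned int thread_ack;	/* acks still missing for 'state' */
	int curstate[MODEL_MAX_CPUS];	/* last state each CPU has seen */
};

/* Reset the ack counter, then publish the new state (same order as the
 * kernel's set_state(); the smp_wmb() has no single-threaded equivalent). */
static void model_set_state(struct sm_model *m, int newstate)
{
	m->thread_ack = m->num_threads;
	m->state = newstate;
}

/* The last CPU to ack moves everybody to the next state.  As in the
 * patched ack_state(), this also steps past EXIT once all have left. */
static void model_ack_state(struct sm_model *m)
{
	assert(m->thread_ack > 0);
	if (--m->thread_ack == 0)
		model_set_state(m, m->state + 1);
}

/* Step 'ncpus' simulated CPUs until all have seen EXIT.  Returns how many
 * state changes CPU 0 observed, or -1 for an unsupported CPU count. */
int model_run(unsigned int ncpus)
{
	struct sm_model m = { .state = STOPMACHINE_NONE, .num_threads = ncpus };
	unsigned int cpu, done = 0;
	int transitions = 0;

	if (ncpus == 0 || ncpus > MODEL_MAX_CPUS)
		return -1;

	model_set_state(&m, STOPMACHINE_PREPARE);
	while (done < ncpus) {
		for (cpu = 0; cpu < ncpus; cpu++) {
			if (m.curstate[cpu] == STOPMACHINE_EXIT)
				continue;	/* this CPU has left its loop */
			if (m.state != m.curstate[cpu]) {
				m.curstate[cpu] = m.state;
				if (cpu == 0)
					transitions++;
				model_ack_state(&m);
				if (m.curstate[cpu] == STOPMACHINE_EXIT)
					done++;
			}
		}
	}
	return transitions;
}
```

In this model every CPU walks through PREPARE, DISABLE_IRQ, RUN and EXIT in
lockstep, so model_run() reports four transitions regardless of the CPU
count. The real code differs in that the workers run concurrently on an rt
workqueue and __stop_machine() waits for them with flush_workqueue().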

2008-11-11 11:30:55

by Benjamin Herrenschmidt

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

On Tue, 2008-11-11 at 10:31 +0100, Andreas Schwab wrote:
> It looks like you are observing the same failure mode that I do.

Yup, once, haven't reproduced it ever since though :-(

Ben.

2008-11-11 11:31:49

by Heiko Carstens

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Tue, Nov 11, 2008 at 11:52:14AM +0100, Ingo Molnar wrote:
>
> * Rafael J. Wysocki <[email protected]> wrote:
>
> > However, it is reproducible by doing
> >
> > # echo core > /sys/power/pm_test
> >
> > and repeating
> >
> > # echo disk > /sys/power/state
> >
> > for a couple of times, in which case the last two lines printed to the console
> > before a (solid) hang are:
> >
> > SMP alternatives: switching to SMP code
> > Booting processor 1 APIC 0x1 ip 0x6000
> >
> > So, it evidently fails while re-enabling the non-boot CPU and not
> > during disabling it as I thought before.
> >
> > With commit c9583e55fa2b08a230c549bd1e3c0bde6c50d9cc reverted the
> > issue is not reproducible any more.
>
> [ Cc:-ed workqueue/locking/suspend-race-condition experts. ]
>
> Seems like the new kernel/stop_machine.c logic has a race for the test
> sequence above. (Below is the bisected commit again, maybe the race is
> visible via email review as well.)

FWIW, I tried to reproduce this on s390 and got the following:

A process that would do nothing but onlining/offlining cpus would get
stuck after a while:

0 schedule+842 [0x342522]
1 schedule_timeout+200 [0x342ec4]
2 wait_for_common+362 [0x341fd6]
3 wait_for_completion+54 [0x342146]
4 __synchronize_sched+80 [0x81670]
5 cpu_down+172 [0x33c030]
6 store_online+96 [0x33c488]
7 sysdev_store+52 [0x1bda84]
8 sysfs_write_file+242 [0x1350ba]
9 vfs_write+176 [0xd2028]
10 sys_write+82 [0xd21ea]
11 sysc_noemu+16 [0x269d8]

All cpus are in cpu_idle and no other task is in state TASK_INTERRUPTIBLE
or TASK_UNINTERRUPTIBLE. However, it would continue to work as soon as
I logged in to the system or generated a console interrupt.
I'm going to look into the dump and see if I can figure out what is
broken here.
Dunno if it is the same bug or something else.

2008-11-11 12:42:20

by Heiko Carstens

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Tue, Nov 11, 2008 at 12:31:34PM +0100, Heiko Carstens wrote:
> On Tue, Nov 11, 2008 at 11:52:14AM +0100, Ingo Molnar wrote:
> >
> > * Rafael J. Wysocki <[email protected]> wrote:
> >
> > > However, it is reproducible by doing
> > >
> > > # echo core > /sys/power/pm_test
> > >
> > > and repeating
> > >
> > > # echo disk > /sys/power/state
> > >
> > > for a couple of times, in which case the last two lines printed to the console
> > > before a (solid) hang are:
> > >
> > > SMP alternatives: switching to SMP code
> > > Booting processor 1 APIC 0x1 ip 0x6000
> > >
> > > So, it evidently fails while re-enabling the non-boot CPU and not
> > > during disabling it as I thought before.
> > >
> > > With commit c9583e55fa2b08a230c549bd1e3c0bde6c50d9cc reverted the
> > > issue is not reproducible any more.
> >
> > [ Cc:-ed workqueue/locking/suspend-race-condition experts. ]
> >
> > Seems like the new kernel/stop_machine.c logic has a race for the test
> > sequence above. (Below is the bisected commit again, maybe the race is
> > visible via email review as well.)
>
> FWIW, I tried to reproduce this on s390 and got the following:
>
> A process that would do nothing but onlining/offlining cpus would get
> stuck after a while:
>
> 0 schedule+842 [0x342522]
> 1 schedule_timeout+200 [0x342ec4]
> 2 wait_for_common+362 [0x341fd6]
> 3 wait_for_completion+54 [0x342146]
> 4 __synchronize_sched+80 [0x81670]
> 5 cpu_down+172 [0x33c030]
> 6 store_online+96 [0x33c488]
> 7 sysdev_store+52 [0x1bda84]
> 8 sysfs_write_file+242 [0x1350ba]
> 9 vfs_write+176 [0xd2028]
> 10 sys_write+82 [0xd21ea]
> 11 sysc_noemu+16 [0x269d8]
>
> All cpus are in cpu_idle and no other task in state TASK_INTERRUPTIBLE
> or TASK_UNINTERRUPTIBLE. However it would continue to work as soon as
> I login into the system or generate a console interrupt.
> I'm going to look into the dump and see if I can figure out what is
> broken here.
> Dunno if it is the same bug or something else.

[Cc:-ed Steven and Paul, since this backtrace seems to be RCU specific]

Steven, Paul, any idea what could cause the hang? I think I would
get lost in the RCU code...

2008-11-11 13:17:19

by Ingo Molnar

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine


* Heiko Carstens <[email protected]> wrote:

> On Tue, Nov 11, 2008 at 12:31:34PM +0100, Heiko Carstens wrote:
> > On Tue, Nov 11, 2008 at 11:52:14AM +0100, Ingo Molnar wrote:
> > >
> > > * Rafael J. Wysocki <[email protected]> wrote:
> > >
> > > > However, it is reproducible by doing
> > > >
> > > > # echo core > /sys/power/pm_test
> > > >
> > > > and repeating
> > > >
> > > > # echo disk > /sys/power/state
> > > >
> > > > for a couple of times, in which case the last two lines printed to the console
> > > > before a (solid) hang are:
> > > >
> > > > SMP alternatives: switching to SMP code
> > > > Booting processor 1 APIC 0x1 ip 0x6000
> > > >
> > > > So, it evidently fails while re-enabling the non-boot CPU and not
> > > > during disabling it as I thought before.
> > > >
> > > > With commit c9583e55fa2b08a230c549bd1e3c0bde6c50d9cc reverted the
> > > > issue is not reproducible any more.
> > >
> > > [ Cc:-ed workqueue/locking/suspend-race-condition experts. ]
> > >
> > > Seems like the new kernel/stop_machine.c logic has a race for the test
> > > sequence above. (Below is the bisected commit again, maybe the race is
> > > visible via email review as well.)
> >
> > FWIW, I tried to reproduce this on s390 and got the following:
> >
> > A process that would do nothing but onlining/offlining cpus would get
> > stuck after a while:
> >
> > 0 schedule+842 [0x342522]
> > 1 schedule_timeout+200 [0x342ec4]
> > 2 wait_for_common+362 [0x341fd6]
> > 3 wait_for_completion+54 [0x342146]
> > 4 __synchronize_sched+80 [0x81670]
> > 5 cpu_down+172 [0x33c030]
> > 6 store_online+96 [0x33c488]
> > 7 sysdev_store+52 [0x1bda84]
> > 8 sysfs_write_file+242 [0x1350ba]
> > 9 vfs_write+176 [0xd2028]
> > 10 sys_write+82 [0xd21ea]
> > 11 sysc_noemu+16 [0x269d8]
> >
> > All cpus are in cpu_idle and no other task in state TASK_INTERRUPTIBLE
> > or TASK_UNINTERRUPTIBLE. However it would continue to work as soon as
> > I login into the system or generate a console interrupt.
> > I'm going to look into the dump and see if I can figure out what is
> > broken here.
> > Dunno if it is the same bug or something else.
>
> [Cc:-ed Steven and Paul, since this backtrace seems to be RCU specific]
>
> Steven, Paul, any idea what could cause the hang? I think I would
> get lost in the RCU code...

Cc:-ed Thomas - sometimes "RCU hangs" happen due to nohz confusion:
because no timer IRQ happens so there's nothing to drive the RCU
machinery.

Ingo

2008-11-11 13:36:30

by Vegard Nossum

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Tue, Nov 11, 2008 at 11:52 AM, Ingo Molnar <[email protected]> wrote:
> [ Cc:-ed workqueue/locking/suspend-race-condition experts. ]

Heh. I am not an expert, but I looked at the code. The obvious suspicious
thing to see is the use of unpaired barriers? Maybe like this:

static void set_state(enum stopmachine_state newstate)
{
	/* Reset ack counter. */
	atomic_set(&thread_ack, num_threads);
	smp_wmb();

+	/* force ordering between thread_ack/state */

	state = newstate;
}

/* Last one to ack a state moves to the next state. */
static void ack_state(void)
{
	if (atomic_dec_and_test(&thread_ack))

Maybe
+		/* force ordering between thread_ack/state */
+		smp_rmb();
here?

		set_state(state + 1);
}


Or maybe I am wrong. But Documentation/memory-barriers.txt is rather
explicit on this point.


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-11-11 13:46:49

by Vegard Nossum

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Tue, Nov 11, 2008 at 2:36 PM, Vegard Nossum <[email protected]> wrote:
> On Tue, Nov 11, 2008 at 11:52 AM, Ingo Molnar <[email protected]> wrote:
>> [ Cc:-ed workqueue/locking/suspend-race-condition experts. ]
>
> Heh. I am not an expert, but I looked at the code. The obvious suspicious
> thing to see is the use of unpaired barriers? Maybe like this:

...

> 55 /* Last one to ack a state moves to the next state. */
> 56 static void ack_state(void)
> 57 {
> 58 if (atomic_dec_and_test(&thread_ack))
>
> Maybe
> + /* force ordering between thread_ack/state */
> + smp_rmb();
> here?

Oops, I am wrong (after a small investigation).

"Any atomic operation that modifies some state in memory and returns information
about the state (old or new) implies an SMP-conditional general memory barrier
(smp_mb()) on each side of the actual operation (with the exception of
explicit lock operations, described later). These include:

...
	atomic_dec_and_test();"

Won't fix the problem at hand, but maybe something like this would be
nice for future generations :-)

diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index 0e688c6..6796bb1 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -55,6 +55,7 @@ static void set_state(enum stopmachine_state newstate)
 /* Last one to ack a state moves to the next state. */
 static void ack_state(void)
 {
+	/* Implicit memory barrier; no smp_rmb() needed */
 	if (atomic_dec_and_test(&thread_ack))
 		set_state(state + 1);
 }


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
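
For readers more familiar with userspace code, the point about value-returning
atomics can be illustrated with C11 <stdatomic.h>. This is a rough analogue
only, not the kernel API: dec_and_test() and last_acker() are hypothetical
names, and the sketch assumes that a sequentially consistent read-modify-write
is an acceptable stand-in for the smp_mb() on each side that
memory-barriers.txt describes for atomic_dec_and_test().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for atomic_dec_and_test(): decrement the counter and
 * report whether it reached zero.  atomic_fetch_sub() returns the old value
 * and defaults to memory_order_seq_cst, so no extra fence is placed around
 * it here. */
static bool dec_and_test(atomic_int *v)
{
	return atomic_fetch_sub(v, 1) == 1;
}

/* Let 'n' participants ack one after another and return the 1-based
 * position of the one that saw the counter hit zero (0 if n is 0). */
int last_acker(int n)
{
	atomic_int thread_ack;
	int i, last = 0;

	assert(n >= 0);
	atomic_init(&thread_ack, n);
	for (i = 1; i <= n; i++)
		if (dec_and_test(&thread_ack))
			last = i;
	return last;
}
```

Exactly one caller sees the decrement reach zero, which is the property
ack_state() relies on to pick the CPU that calls set_state().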

2008-11-11 13:49:21

by Peter Zijlstra

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Tue, 2008-11-11 at 14:36 +0100, Vegard Nossum wrote:
> On Tue, Nov 11, 2008 at 11:52 AM, Ingo Molnar <[email protected]> wrote:
> > [ Cc:-ed workqueue/locking/suspend-race-condition experts. ]
>
> Heh. I am not an expert, but I looked at the code. The obvious suspicious
> thing to see is the use of unpaired barriers? Maybe like this:
>
> 47 static void set_state(enum stopmachine_state newstate)
> 48 {
> 49 /* Reset ack counter. */
> 50 atomic_set(&thread_ack, num_threads);
> 51 smp_wmb();
>
> + /* force ordering between thread_ack/state */
>
> 52 state = newstate;
> 53 }
> 54
> 55 /* Last one to ack a state moves to the next state. */
> 56 static void ack_state(void)
> 57 {
> 58 if (atomic_dec_and_test(&thread_ack))
>
> Maybe
> + /* force ordering between thread_ack/state */
> + smp_rmb();
> here?

all atomic ops that have return values imply a full barrier, iirc

> 59 set_state(state + 1);
> 60 }
> 61
>
> Or maybe I am wrong. But Documentation/memory-barriers.txt is rather
> explicit on this point.
>
>
> Vegard
>
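
The point about value-returning atomic operations can be illustrated with a small userspace analogue. This is only a sketch using C11 <stdatomic.h>, not the kernel code: the sketch_* names are hypothetical, and C11 sequentially consistent operations are merely similar in spirit to the kernel's fully ordered atomic_dec_and_test(). The release store stands in for the smp_wmb() in set_state().

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for the kernel's stopmachine_state values. */
enum sketch_state {
	SKETCH_NONE,
	SKETCH_PREPARE,
	SKETCH_DISABLE_IRQ,
	SKETCH_RUN,
	SKETCH_EXIT,
};

static atomic_int sketch_thread_ack;	/* acks still outstanding */
static atomic_int sketch_state;		/* current state */
static int sketch_num_threads;		/* assumed to be set once before use */

static void sketch_set_state(int newstate)
{
	assert(sketch_num_threads > 0);
	/* Reset the ack counter first ... */
	atomic_store_explicit(&sketch_thread_ack, sketch_num_threads,
			      memory_order_relaxed);
	/* ... then publish the new state; the release ordering plays
	 * the role of the smp_wmb() in the kernel's set_state(). */
	atomic_store_explicit(&sketch_state, newstate, memory_order_release);
}

/* Last one to ack a state moves to the next state. */
static void sketch_ack_state(void)
{
	/* atomic_fetch_sub() is sequentially consistent by default and
	 * returns the old value, so "old == 1" means we were the last
	 * thread to ack. No separate read barrier is added here. */
	if (atomic_fetch_sub(&sketch_thread_ack, 1) == 1)
		sketch_set_state(atomic_load(&sketch_state) + 1);
}
```

With sketch_num_threads set to 3, the state only advances on the third call to sketch_ack_state(), and the counter is rearmed to 3 at that point.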

2008-11-11 14:35:32

by Paul E. McKenney

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Tue, Nov 11, 2008 at 01:42:01PM +0100, Heiko Carstens wrote:
> On Tue, Nov 11, 2008 at 12:31:34PM +0100, Heiko Carstens wrote:
> > On Tue, Nov 11, 2008 at 11:52:14AM +0100, Ingo Molnar wrote:
> > >
> > > * Rafael J. Wysocki <[email protected]> wrote:
> > >
> > > > However, it is reproducible by doing
> > > >
> > > > # echo core > /sys/power/pm_test
> > > >
> > > > and repeating
> > > >
> > > > # echo disk > /sys/power/state
> > > >
> > > > for a couple of times, in which case the last two lines printed to the console
> > > > before a (solid) hang are:
> > > >
> > > > SMP alternatives: switching to SMP code
> > > > Booting processor 1 APIC 0x1 ip 0x6000
> > > >
> > > > So, it evidently fails while re-enabling the non-boot CPU and not
> > > > during disabling it as I thought before.
> > > >
> > > > With commit c9583e55fa2b08a230c549bd1e3c0bde6c50d9cc reverted the
> > > > issue is not reproducible any more.
> > >
> > > [ Cc:-ed workqueue/locking/suspend-race-condition experts. ]
> > >
> > > Seems like the new kernel/stop_machine.c logic has a race for the test
> > > sequence above. (Below is the bisected commit again, maybe the race is
> > > visible via email review as well.)
> >
> > FWIW, I tried to reproduce this on s390 and got the following:
> >
> > A process that would do nothing but onlining/offlining cpus would get
> > stuck after a while:
> >
> > 0 schedule+842 [0x342522]
> > 1 schedule_timeout+200 [0x342ec4]
> > 2 wait_for_common+362 [0x341fd6]
> > 3 wait_for_completion+54 [0x342146]
> > 4 __synchronize_sched+80 [0x81670]
> > 5 cpu_down+172 [0x33c030]
> > 6 store_online+96 [0x33c488]
> > 7 sysdev_store+52 [0x1bda84]
> > 8 sysfs_write_file+242 [0x1350ba]
> > 9 vfs_write+176 [0xd2028]
> > 10 sys_write+82 [0xd21ea]
> > 11 sysc_noemu+16 [0x269d8]
> >
> > All cpus are in cpu_idle and no other task in state TASK_INTERRUPTIBLE
> > or TASK_UNINTERRUPTIBLE. However it would continue to work as soon as
> > I login into the system or generate a console interrupt.
> > I'm going to look into the dump and see if I can figure out what is
> > broken here.
> > Dunno if it is the same bug or something else.
>
> [Cc:-ed Steven and Paul, since this backtrace seems to be RCU specific]
>
> Steven, Paul, any idea what could cause the hang? I think I would
> get lost in the RCU code...

Hello, Heiko,

Could you please apply the following debug patch (due to Jiangshan and
myself)? Then you should be able to build with CONFIG_RCU_TRACE,
then mount debugfs after boot, for example, on /debug. This will
create a /debug/rcu directory with three files, "rcucb", "rcu_data",
and "rcu_bh_data". Since you are still able to log in, could you
please send the contents of these three files?

Thanx, Paul

2008-11-11 14:48:16

by Vegard Nossum

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Tue, Nov 11, 2008 at 11:52 AM, Ingo Molnar <[email protected]> wrote:
> [ Cc:-ed workqueue/locking/suspend-race-condition experts. ]
>
> Seems like the new kernel/stop_machine.c logic has a race for the test
> sequence above. (Below is the bisected commit again, maybe the race is
> visible via email review as well.)

Let me try again.

I think that the test for stop_machine_data in stop_cpu() should not
have been moved from __stop_machine(), because now cpu_online_map may
change in-between calls to stop_cpu() (if the callback tries to
online/offline CPUs), and the end result may be different.

Maybe?


Vegard

2008-11-11 15:02:20

by Heiko Carstens

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Tue, Nov 11, 2008 at 06:35:05AM -0800, Paul E. McKenney wrote:
> > > A process that would do nothing but onlining/offlining cpus would get
> > > stuck after a while:
> > >
> > > 0 schedule+842 [0x342522]
> > > 1 schedule_timeout+200 [0x342ec4]
> > > 2 wait_for_common+362 [0x341fd6]
> > > 3 wait_for_completion+54 [0x342146]
> > > 4 __synchronize_sched+80 [0x81670]
> > > 5 cpu_down+172 [0x33c030]
> > > 6 store_online+96 [0x33c488]
> > > 7 sysdev_store+52 [0x1bda84]
> > > 8 sysfs_write_file+242 [0x1350ba]
> > > 9 vfs_write+176 [0xd2028]
> > > 10 sys_write+82 [0xd21ea]
> > > 11 sysc_noemu+16 [0x269d8]
> > >
> > > All cpus are in cpu_idle and no other task in state TASK_INTERRUPTIBLE
> > > or TASK_UNINTERRUPTIBLE. However it would continue to work as soon as
> > > I login into the system or generate a console interrupt.
> > > I'm going to look into the dump and see if I can figure out what is
> > > broken here.
> > > Dunno if it is the same bug or something else.
> >
> > [Cc:-ed Steven and Paul, since this backtrace seems to be RCU specific]
> >
> > Steven, Paul, any idea what could cause the hang? I think I would
> > get lost in the RCU code...
>
> Hello, Heiko,
>
> Could you please apply the following debug patch (due to Jiangshan and
> myself)? Then you should be able to build with CONFIG_RCU_TRACE,
> then mount debugfs after boot, for example, on /debug. This will
> create a /debug/rcu directory with three files, "rcucb", "rcu_data",
> and "rcu_bh_data". Since you are still able to log in, could you
> please send the contents of these three files?

Hi Paul,

could you attach the patch please? :)

Does the patch also make sense if the system continues to work? That
is, the machine isn't stalled anymore as soon as I log in.
On the other hand, I do have a dump of the system and can look in
whatever data structures you want, if that helps.

Thanks,
Heiko

2008-11-11 15:02:42

by Paul E. McKenney

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Tue, Nov 11, 2008 at 06:35:05AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 11, 2008 at 01:42:01PM +0100, Heiko Carstens wrote:
> > On Tue, Nov 11, 2008 at 12:31:34PM +0100, Heiko Carstens wrote:
> > > On Tue, Nov 11, 2008 at 11:52:14AM +0100, Ingo Molnar wrote:
> > > >
> > > > * Rafael J. Wysocki <[email protected]> wrote:
> > > >
> > > > > However, it is reproducible by doing
> > > > >
> > > > > # echo core > /sys/power/pm_test
> > > > >
> > > > > and repeating
> > > > >
> > > > > # echo disk > /sys/power/state
> > > > >
> > > > > for a couple of times, in which case the last two lines printed to the console
> > > > > before a (solid) hang are:
> > > > >
> > > > > SMP alternatives: switching to SMP code
> > > > > Booting processor 1 APIC 0x1 ip 0x6000
> > > > >
> > > > > So, it evidently fails while re-enabling the non-boot CPU and not
> > > > > during disabling it as I thought before.
> > > > >
> > > > > With commit c9583e55fa2b08a230c549bd1e3c0bde6c50d9cc reverted the
> > > > > issue is not reproducible any more.
> > > >
> > > > [ Cc:-ed workqueue/locking/suspend-race-condition experts. ]
> > > >
> > > > Seems like the new kernel/stop_machine.c logic has a race for the test
> > > > sequence above. (Below is the bisected commit again, maybe the race is
> > > > visible via email review as well.)
> > >
> > > FWIW, I tried to reproduce this on s390 and got the following:
> > >
> > > A process that would do nothing but onlining/offlining cpus would get
> > > stuck after a while:
> > >
> > > 0 schedule+842 [0x342522]
> > > 1 schedule_timeout+200 [0x342ec4]
> > > 2 wait_for_common+362 [0x341fd6]
> > > 3 wait_for_completion+54 [0x342146]
> > > 4 __synchronize_sched+80 [0x81670]
> > > 5 cpu_down+172 [0x33c030]
> > > 6 store_online+96 [0x33c488]
> > > 7 sysdev_store+52 [0x1bda84]
> > > 8 sysfs_write_file+242 [0x1350ba]
> > > 9 vfs_write+176 [0xd2028]
> > > 10 sys_write+82 [0xd21ea]
> > > 11 sysc_noemu+16 [0x269d8]
> > >
> > > All cpus are in cpu_idle and no other task in state TASK_INTERRUPTIBLE
> > > or TASK_UNINTERRUPTIBLE. However it would continue to work as soon as
> > > I login into the system or generate a console interrupt.
> > > I'm going to look into the dump and see if I can figure out what is
> > > broken here.
> > > Dunno if it is the same bug or something else.
> >
> > [Cc:-ed Steven and Paul, since this backtrace seems to be RCU specific]
> >
> > Steven, Paul, any idea what could cause the hang? I think I would
> > get lost in the RCU code...
>
> Hello, Heiko,
>
> Could you please apply the following debug patch (due to Jiangshan and
> myself)? Then you should be able to build with CONFIG_RCU_TRACE,
> then mount debugfs after boot, for example, on /debug. This will
> create a /debug/rcu directory with three files, "rcucb", "rcu_data",
> and "rcu_bh_data". Since you are still able to log in, could you
> please send the contents of these three files?
>
> Thanx, Paul

This time with the patch actually attached... Thanks to Peter Z.
for alerting me to my omission.

Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---

diff --git a/include/linux/rcuclassic.h b/include/linux/rcuclassic.h
index 4ab8436..735f35a 100644
--- a/include/linux/rcuclassic.h
+++ b/include/linux/rcuclassic.h
@@ -54,6 +54,9 @@ struct rcu_ctrlblk {
/* for current batch to proceed. */
} ____cacheline_internodealigned_in_smp;

+extern struct rcu_ctrlblk rcu_ctrlblk;
+extern struct rcu_ctrlblk rcu_bh_ctrlblk;
+
/* Is batch a before batch b ? */
static inline int rcu_batch_before(long a, long b)
{
@@ -76,6 +79,7 @@ struct rcu_data {
long quiescbatch; /* Batch # for grace period */
int passed_quiesc; /* User-mode/idle loop etc. */
int qs_pending; /* core waits for quiesc state */
+ bool beenonline; /* CPU online at least once */

/* 2) batch handling */
long batch; /* Batch # for current RCU batch */
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 9fdba03..ba32338 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -68,7 +68,6 @@ config PREEMPT_RCU

config RCU_TRACE
bool "Enable tracing for RCU - currently stats in debugfs"
- depends on PREEMPT_RCU
select DEBUG_FS
default y
help
diff --git a/kernel/Makefile b/kernel/Makefile
index 4e1d7df..e0bfce7 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -77,6 +77,8 @@ obj-$(CONFIG_CLASSIC_RCU) += rcuclassic.o
obj-$(CONFIG_PREEMPT_RCU) += rcupreempt.o
ifeq ($(CONFIG_PREEMPT_RCU),y)
obj-$(CONFIG_RCU_TRACE) += rcupreempt_trace.o
+else
+obj-$(CONFIG_RCU_TRACE) += rcuclassic_trace.o
endif
obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
diff --git a/kernel/rcuclassic.c b/kernel/rcuclassic.c
index aad93cd..06472fc 100644
--- a/kernel/rcuclassic.c
+++ b/kernel/rcuclassic.c
@@ -57,13 +57,13 @@ EXPORT_SYMBOL_GPL(rcu_lock_map);


/* Definition for rcupdate control block. */
-static struct rcu_ctrlblk rcu_ctrlblk = {
+struct rcu_ctrlblk rcu_ctrlblk = {
.cur = -300,
.completed = -300,
.lock = __SPIN_LOCK_UNLOCKED(&rcu_ctrlblk.lock),
.cpumask = CPU_MASK_NONE,
};
-static struct rcu_ctrlblk rcu_bh_ctrlblk = {
+struct rcu_ctrlblk rcu_bh_ctrlblk = {
.cur = -300,
.completed = -300,
.lock = __SPIN_LOCK_UNLOCKED(&rcu_bh_ctrlblk.lock),
@@ -564,6 +564,7 @@ static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
rdp->donetail = &rdp->donelist;
rdp->quiescbatch = rcp->completed;
rdp->qs_pending = 0;
+ rdp->beenonline = 1;
rdp->cpu = cpu;
rdp->blimit = blimit;
}
diff --git a/kernel/rcuclassic_trace.c b/kernel/rcuclassic_trace.c
new file mode 100644
index 0000000..b719048
--- /dev/null
+++ b/kernel/rcuclassic_trace.c
@@ -0,0 +1,198 @@
+/*
+ * Read-Copy Update tracing for classic implementation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright IBM Corporation, 2008
+ *
+ * Updated to use seqfile by Lai Jiangshan.
+ *
+ * Papers: http://www.rdrop.com/users/paulmck/RCU
+ *
+ * For detailed explanation of Read-Copy Update mechanism see -
+ * Documentation/RCU
+ *
+ */
+#include <linux/rcupdate.h>
+#include <linux/module.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+
+/* Print out rcu_data structures using seqfile facility. */
+
+static struct rcu_data *get_rcu_data_bh(int cpu)
+{
+ return &per_cpu(rcu_bh_data, cpu);
+}
+
+static struct rcu_data *get_rcu_data(int cpu)
+{
+ return &per_cpu(rcu_data, cpu);
+}
+
+static int show_rcu_data(struct seq_file *m, void *v)
+{
+ struct rcu_data *rdp = v;
+
+ if (!rdp->beenonline)
+ return 0;
+
+ seq_printf(m, "processor\t: %d", rdp->cpu);
+ if (cpu_is_offline(rdp->cpu))
+ seq_puts(m, "!\n");
+ else
+ seq_puts(m, "\n");
+ seq_printf(m, "quiescbatch\t: %ld\n", rdp->quiescbatch);
+ seq_printf(m, "batch\t\t: %ld\n", rdp->batch);
+ seq_printf(m, "passed_quiesc\t: %d\n", rdp->passed_quiesc);
+ seq_printf(m, "qs_pending\t: %d\n", rdp->qs_pending);
+ seq_printf(m, "qlen\t\t: %ld\n", rdp->qlen);
+ seq_printf(m, "blimit\t\t: %ld\n", rdp->blimit);
+ seq_puts(m, "\n");
+ return 0;
+}
+
+static void *c_start(struct seq_file *m, loff_t *pos)
+{
+ typedef struct rcu_data *(*get_data_func)(int);
+
+ if (*pos == 0) /* just in case, cpu 0 is not the first */
+ *pos = first_cpu(cpu_possible_map);
+ else
+ *pos = next_cpu_nr(*pos - 1, cpu_possible_map);
+ if ((*pos) < nr_cpu_ids)
+ return ((get_data_func)m->private)(*pos);
+ return NULL;
+}
+
+static void *c_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ (*pos)++;
+ return c_start(m, pos);
+}
+
+static void c_stop(struct seq_file *m, void *v)
+{
+}
+
+const struct seq_operations rcu_data_seq_op = {
+ .start = c_start,
+ .next = c_next,
+ .stop = c_stop,
+ .show = show_rcu_data,
+};
+
+static int rcu_data_open(struct inode *inode, struct file *file)
+{
+ int ret = seq_open(file, &rcu_data_seq_op);
+
+ if (ret)
+ return ret;
+ ((struct seq_file *)file->private_data)->private = inode->i_private;
+ return 0;
+}
+
+static const struct file_operations rcu_data_fops = {
+ .owner = THIS_MODULE,
+ .open = rcu_data_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+/* Print out rcu_ctrlblk structures using seqfile facility. */
+
+static void print_one_rcu_ctrlblk(struct seq_file *m, struct rcu_ctrlblk *rcp)
+{
+ seq_printf(m, "cur=%ld completed=%ld next_pending=%d s=%d\n\t",
+ rcp->cur, rcp->completed, rcp->next_pending, rcp->signaled);
+ seq_cpumask(m, &rcp->cpumask);
+ seq_puts(m, "\n");
+}
+
+static int show_rcucb(struct seq_file *m, void *unused)
+{
+ seq_puts(m, "rcu: ");
+ print_one_rcu_ctrlblk(m, &rcu_ctrlblk);
+ seq_puts(m, "rcu_bh: ");
+ print_one_rcu_ctrlblk(m, &rcu_bh_ctrlblk);
+ seq_puts(m, "online: ");
+ seq_cpumask(m, &cpu_online_map);
+ seq_puts(m, "\n");
+ return 0;
+}
+
+static int rcucb_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, show_rcucb, NULL);
+}
+
+static struct file_operations rcucb_fops = {
+ .owner = THIS_MODULE,
+ .open = rcucb_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static struct dentry *rcudir, *rcu_bh_data_file, *rcu_data_file, *rcucb_file;
+
+static int __init rcuclassic_trace_init(void)
+{
+ rcudir = debugfs_create_dir("rcu", NULL);
+ if (!rcudir)
+ goto out;
+
+ rcu_bh_data_file = debugfs_create_file("rcu_bh_data", 0444, rcudir,
+ get_rcu_data_bh, &rcu_data_fops);
+ if (!rcu_bh_data_file)
+ goto out_rcudir;
+
+ rcu_data_file = debugfs_create_file("rcu_data", 0444, rcudir,
+ get_rcu_data, &rcu_data_fops);
+ if (!rcu_data_file)
+ goto out_rcudata_bh_file;
+
+ rcucb_file = debugfs_create_file("rcucb", 0444, rcudir,
+ NULL, &rcucb_fops);
+ if (!rcucb_file)
+ goto out_rcudata_file;
+ return 0;
+
+out_rcudata_file:
+ debugfs_remove(rcu_data_file);
+out_rcudata_bh_file:
+ debugfs_remove(rcu_bh_data_file);
+out_rcudir:
+ debugfs_remove(rcudir);
+out:
+ return 1;
+}
+
+static void __exit rcuclassic_trace_cleanup(void)
+{
+ debugfs_remove(rcucb_file);
+ debugfs_remove(rcu_data_file);
+ debugfs_remove(rcu_bh_data_file);
+ debugfs_remove(rcudir);
+}
+
+module_init(rcuclassic_trace_init);
+module_exit(rcuclassic_trace_cleanup);
+
+MODULE_AUTHOR("Paul E. McKenney");
+MODULE_DESCRIPTION("Read-Copy Update tracing for classic implementation");
+MODULE_LICENSE("GPL");
+

2008-11-11 15:12:04

by Dmitry Adamushko

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

2008/11/11 Vegard Nossum <[email protected]>:
> On Tue, Nov 11, 2008 at 11:52 AM, Ingo Molnar <[email protected]> wrote:
>> [ Cc:-ed workqueue/locking/suspend-race-condition experts. ]
>>
>> Seems like the new kernel/stop_machine.c logic has a race for the test
>> sequence above. (Below is the bisected commit again, maybe the race is
>> visible via email review as well.)
>
> I try again.
>
> I think that the test for stop_machine_data in stop_cpu() should not
> have been moved from __stop_machine().

Do you mean the following test?

	if (!active_cpus) {
		if (cpu == first_cpu(cpu_online_map))
			smdata = &active;
	} else {
		if (cpu_isset(cpu, *active_cpus))
			smdata = &active;
	}

> Because now cpu_online_map may
> change in-between calls to stop_cpu() (if the callback tries to
> online/offline CPUs), and the end result may be different.

take_cpu_down() cannot run before stop_cpu() on all the cpus has
completed the STOPMACHINE_DISABLE_IRQ step, in other words before
"state == STOPMACHINE_RUN". By that moment, 'smdata' has been set up
on all cpus... if this is the case you had in mind.
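
To make the test under discussion concrete, here is a minimal userspace sketch of the same selection logic, assuming CPU masks fit into a 64-bit word. The sketch_* names are hypothetical and this is not the kernel's cpumask API. It shows why the answer depends on the online map at the moment the test is evaluated, which is the concern raised above, and the replies argue that the ordering of states rules that case out in practice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Index of the lowest set bit, or -1 if the mask is empty. */
static int sketch_first_cpu(uint64_t mask)
{
	for (int cpu = 0; cpu < 64; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return -1;
}

/* Mirrors the quoted test: with no explicit mask, only the first
 * online CPU is "active"; otherwise membership in the mask decides. */
static bool sketch_cpu_is_active(int cpu, uint64_t online_map,
				 const uint64_t *active_cpus)
{
	assert(cpu >= 0 && cpu < 64);
	if (!active_cpus)
		return cpu == sketch_first_cpu(online_map);
	return (*active_cpus & (UINT64_C(1) << cpu)) != 0;
}
```

For example, with CPUs 1 and 3 online and no explicit mask, CPU 1 is the active one; if CPU 1 were removed from the online map before CPU 3 evaluated the test, CPU 3 would consider itself active instead.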


>
> Maybe?
>
>
> Vegard
>


--
Best regards,
Dmitry Adamushko

2008-11-11 15:33:38

by Oleg Nesterov

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On 11/11, Vegard Nossum wrote:
>
> I think that the test for stop_machine_data in stop_cpu() should not
> have been moved from __stop_machine(). Because now cpu_online_map may
> change in-between calls to stop_cpu() (if the callback tries to
> online/offline CPUs), and the end result may be different.

I don't think this is possible, the callback must not be called unless
all threads ack (at least) the STOPMACHINE_PREPARE state.


Off-topic question, __stop_machine() does:

	/* Schedule the stop_cpu work on all cpus: hold this CPU so one
	 * doesn't hit this CPU until we're ready. */
	get_cpu();
	for_each_online_cpu(i) {
		sm_work = percpu_ptr(stop_machine_work, i);
		INIT_WORK(sm_work, stop_cpu);
		queue_work_on(i, stop_machine_wq, sm_work);
	}
	/* This will release the thread on our CPU. */
	put_cpu();

Don't we actually need preempt_disable/preempt_enable instead of
get/put cpu? (Yes, they are the same currently.) We don't care about
the CPU we are running on, and it can't go away until we queue all
works. But we must ensure that stop_cpu() on the same CPU can't
preempt us, right?

Oleg.

2008-11-11 16:08:22

by Oleg Nesterov

Subject: Q: force_quiescent_state && cpu_online_map

I don't think this matters, but still...

force_quiescent_state:

* cpu_online_map is updated by the _cpu_down()
* using __stop_machine(). Since we're in irqs disabled
* section, __stop_machine() is not exectuting, hence
* the cpu_online_map is stable.
*
* However, a cpu might have been offlined _just_ before
* we disabled irqs while entering here.
* And rcu subsystem might not yet have handled the CPU_DEAD
* notification, leading to the offlined cpu's bit
* being set in the rcp->cpumask.
*
* Hence cpumask = (rcp->cpumask & cpu_online_map) to prevent
* sending smp_reschedule() to an offlined CPU.
*/
cpus_and(cpumask, rcp->cpumask, cpu_online_map);
cpu_clear(rdp->cpu, cpumask);
for_each_cpu_mask_nr(cpu, cpumask)
smp_send_reschedule(cpu);

However,

// called by __stop_machine take_cpu_down()
arch/x86/kernel/smpboot.c:cpu_disable_common()

/*
* HACK:
* Allow any queued timer interrupts to get serviced
* This is only a temporary solution until we cleanup
* fixup_irqs as we do for IA64.
*/
local_irq_enable();
mdelay(1);
local_irq_disable();
...
remove_cpu_from_maps(cpu);

So it is possible to send the ipi to the dying CPU. I know nothing
about this low-level irq code, most probably this is harmless. We
already did clear_local_APIC(), but I don't understand what it does.
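
The masking step in force_quiescent_state() and the window described above can be sketched in userspace with plain 64-bit masks. This is an illustration under that assumption only; sketch_resched_targets() is a hypothetical name, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* CPUs that would receive a reschedule IPI: those RCU is still
 * waiting on, restricted to the online map, minus the caller. */
static uint64_t sketch_resched_targets(uint64_t rcu_cpumask,
				       uint64_t online_map, int self)
{
	assert(self >= 0 && self < 64);
	uint64_t targets = rcu_cpumask & online_map;

	targets &= ~(UINT64_C(1) << self);
	return targets;
}
```

If RCU waits on CPUs 1-3 (0xE), CPUs 0-2 are online (0x7) and the caller is CPU 1, only CPU 2 is targeted (0x4). If the dying CPU 3 has re-enabled interrupts but has not yet been removed from the online map (0xF), it is still targeted (0xC), which is the case asked about here.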

Oleg.

2008-11-11 16:14:19

by Heiko Carstens

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

> > Could you please apply the following debug patch (due to Jiangshan and
> > myself)? Then you should be able to build with CONFIG_RCU_TRACE,
> > then mount debugfs after boot, for example, on /debug. This will
> > create a /debug/rcu directory with three files, "rcucb", "rcu_data",
> > and "rcu_bh_data". Since you are still able to log in, could you
> > please send the contents of these three files?
> >
> > Thanx, Paul
>
> This time with the patch actually attached... Thanks to Peter Z.
> for alerting me to my omission.

Well, your patch doesn't apply on git head. However I used preemptible
RCU instead and had tracing enabled.

This is the output of the three files after it stalled (and continued,
because I caused an interrupt by sending a network packet) twice:

[root@h0545001 rcu]# cat rcuctrs
CPU last cur F M
1 0 0 1 1
3 0 0 1 1
4 0 0 0 0
5 0 0 0 1
6 0 0 0 0
ggp = 1640, state = waitack

[root@h0545001 rcu]# cat rcugp
oldggp=1652 newggp=1655

[root@h0545001 rcu]# cat rcustats
na=33948 nl=3 wa=33945 wl=0 da=33945 dl=0 dr=33945 di=0
1=0 e1=0 i1=1674 ie1=4 g1=1670 a1=1920 ae1=251 a2=1669
z1=1669 ze1=0 z2=1669 m1=4411 me1=2742 m2=1669

2008-11-11 16:17:53

by Paul E. McKenney

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Tue, Nov 11, 2008 at 04:01:32PM +0100, Heiko Carstens wrote:
> On Tue, Nov 11, 2008 at 06:35:05AM -0800, Paul E. McKenney wrote:
> > > > A process that would do nothing but onlining/offlining cpus would get
> > > > stuck after a while:
> > > >
> > > > 0 schedule+842 [0x342522]
> > > > 1 schedule_timeout+200 [0x342ec4]
> > > > 2 wait_for_common+362 [0x341fd6]
> > > > 3 wait_for_completion+54 [0x342146]
> > > > 4 __synchronize_sched+80 [0x81670]
> > > > 5 cpu_down+172 [0x33c030]
> > > > 6 store_online+96 [0x33c488]
> > > > 7 sysdev_store+52 [0x1bda84]
> > > > 8 sysfs_write_file+242 [0x1350ba]
> > > > 9 vfs_write+176 [0xd2028]
> > > > 10 sys_write+82 [0xd21ea]
> > > > 11 sysc_noemu+16 [0x269d8]
> > > >
> > > > All cpus are in cpu_idle and no other task in state TASK_INTERRUPTIBLE
> > > > or TASK_UNINTERRUPTIBLE. However it would continue to work as soon as
> > > > I login into the system or generate a console interrupt.
> > > > I'm going to look into the dump and see if I can figure out what is
> > > > broken here.
> > > > Dunno if it is the same bug or something else.
> > >
> > > [Cc:-ed Steven and Paul, since this backtrace seems to be RCU specific]
> > >
> > > Steven, Paul, any idea what could cause the hang? I think I would
> > > get lost in the RCU code...
> >
> > Hello, Heiko,
> >
> > Could you please apply the following debug patch (due to Jiangshan and
> > myself)? Then you should be able to build with CONFIG_RCU_TRACE,
> > then mount debugfs after boot, for example, on /debug. This will
> > create a /debug/rcu directory with three files, "rcucb", "rcu_data",
> > and "rcu_bh_data". Since you are still able to log in, could you
> > please send the contents of these three files?
>
> Hi Paul,
>
> could you attach the patch please? :)

Peter Z. beat you to it. ;-)

See previous email.

> Does the patch also make sense if the system continues to work? That
> is the machine isn't stalled anymore as soon as I log in.
> On the other hand I do have a dump of the system and can look in
> whatever data structures you want. If that helps.

Ah!

I would like to see the value of rcu_ctrlblk.cpumask and also the value
of cpu_online_map. One guess would be that rcu_ctrlblk.cpumask has a
bit set that is -not- set in cpu_online_map, which would indicate that
RCU was incorrectly waiting on an offline CPU.

On the other hand, if all the bits set in rcu_ctrlblk.cpumask are also
set in cpu_online_map, then could you please dump out the instances of
the rcu_data per-CPU variable that correspond to the bits set in
rcu_ctrlblk.cpumask?

Finally, if no bits are set in rcu_ctrlblk.cpumask, the question would
be "why isn't the synchronize_sched() waking up?"
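
The three cases above amount to a small decision procedure over the two masks. A rough sketch, assuming both masks have been read out of the dump as 64-bit values; the enum and function names are hypothetical and only label the cases described in the text:

```c
#include <assert.h>
#include <stdint.h>

enum sketch_diag {
	SKETCH_NO_CPUS_PENDING,		/* why isn't synchronize_sched() waking up? */
	SKETCH_WAITING_ON_OFFLINE_CPU,	/* RCU incorrectly waits on an offline CPU */
	SKETCH_WAITING_ON_ONLINE_CPUS,	/* next step: dump rcu_data for those CPUs */
};

static enum sketch_diag sketch_classify(uint64_t rcu_cpumask,
					uint64_t online_map)
{
	if (rcu_cpumask == 0)
		return SKETCH_NO_CPUS_PENDING;
	if (rcu_cpumask & ~online_map)
		return SKETCH_WAITING_ON_OFFLINE_CPU;
	/* Every CPU RCU is waiting on is online at this point. */
	assert((rcu_cpumask & online_map) == rcu_cpumask);
	return SKETCH_WAITING_ON_ONLINE_CPUS;
}
```

For instance, a cpumask of 0x8 with an online map of 0x7 falls into the "waiting on an offline CPU" case.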

BTW, I am assuming that you have the same config as Rafael, in other
words, that you are running Classic RCU rather than preemptable RCU.

The point of the patch is that it allows you to see this info by catting
out the /debug/rcu files, at least assuming that the system is healthy
enough to allow you to cat files. But if you already have a crash dump...

Thanx, Paul

2008-11-11 16:45:37

by Paul E. McKenney

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Tue, Nov 11, 2008 at 05:14:01PM +0100, Heiko Carstens wrote:
> > > Could you please apply the following debug patch (due to Jiangshan and
> > > myself)? Then you should be able to build with CONFIG_RCU_TRACE,
> > > then mount debugfs after boot, for example, on /debug. This will
> > > create a /debug/rcu directory with three files, "rcucb", "rcu_data",
> > > and "rcu_bh_data". Since you are still able to log in, could you
> > > please send the contents of these three files?
> > >
> > > Thanx, Paul
> >
> > This time with the patch actually attached... Thanks to Peter Z.
> > for alerting me to my omission.
>
> Well, your patch doesn't apply on git head. However I used preemptible
> RCU instead and had tracing enabled.

Were you using preemptible RCU earlier as well? Rafael was using
classic RCU. Don't get me wrong, all problems need fixing, just trying
to make sure I understand where the problems are occurring.

> This is the output of the three files after it stalled (and continued,
> because I caused an interrupt by sending a network packet) twice:
>
> [root@h0545001 rcu]# cat rcuctrs
> CPU last cur F M
> 1 0 0 1 1
> 3 0 0 1 1
> 4 0 0 0 0
> 5 0 0 0 1
> 6 0 0 0 0
> ggp = 1640, state = waitack
>
> [root@h0545001 rcu]# cat rcugp
> oldggp=1652 newggp=1655
>
> [root@h0545001 rcu]# cat rcustats
> na=33948 nl=3 wa=33945 wl=0 da=33945 dl=0 dr=33945 di=0
> 1=0 e1=0 i1=1674 ie1=4 g1=1670 a1=1920 ae1=251 a2=1669
> z1=1669 ze1=0 z2=1669 m1=4411 me1=2742 m2=1669

This hang also involved synchronize_sched()? Or synchronize_rcu()?

The reason I ask is that the above stats are for synchronize_rcu()
rather than synchronize_sched().

Thanx, Paul

2008-11-11 17:25:42

by Paul E. McKenney

Subject: Re: Q: force_quiescent_state && cpu_online_map

On Tue, Nov 11, 2008 at 06:03:27PM +0100, Oleg Nesterov wrote:
> I don't think this matters, but still...
>
> force_quiescent_state:
>
> * cpu_online_map is updated by the _cpu_down()
> * using __stop_machine(). Since we're in irqs disabled
> * section, __stop_machine() is not exectuting, hence
> * the cpu_online_map is stable.
> *
> * However, a cpu might have been offlined _just_ before
> * we disabled irqs while entering here.
> * And rcu subsystem might not yet have handled the CPU_DEAD
> * notification, leading to the offlined cpu's bit
> * being set in the rcp->cpumask.
> *
> * Hence cpumask = (rcp->cpumask & cpu_online_map) to prevent
> * sending smp_reschedule() to an offlined CPU.
> */
> cpus_and(cpumask, rcp->cpumask, cpu_online_map);
> cpu_clear(rdp->cpu, cpumask);
> for_each_cpu_mask_nr(cpu, cpumask)
> smp_send_reschedule(cpu);
>
> However,
>
> // called by __stop_machine take_cpu_down()
> arch/x86/kernel/smpboot.c:cpu_disable_common()
>
> /*
> * HACK:
> * Allow any queued timer interrupts to get serviced
> * This is only a temporary solution until we cleanup
> * fixup_irqs as we do for IA64.
> */
> local_irq_enable();
> mdelay(1);
> local_irq_disable();
> ...
> remove_cpu_from_maps(cpu);
>
> So it is possible to send the ipi to the dying CPU. I know nothing
> about this low-level irq code, most probably this is harmless. We
> already did clear_local_APIC(), but I don't understand what it does.

Indeed, some of the things I am doing as part of the hierarchical RCU
implementation need to be applied to preemptable RCU. :-/

Thanx, Paul

2008-11-11 17:35:26

by Paul E. McKenney

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Tue, Nov 11, 2008 at 08:45:23AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 11, 2008 at 05:14:01PM +0100, Heiko Carstens wrote:
> > > > Could you please apply the following debug patch (due to Jiangshan and
> > > > myself)? Then you should be able to build with CONFIG_RCU_TRACE,
> > > > then mount debugfs after boot, for example, on /debug. This will
> > > > create a /debug/rcu directory with three files, "rcucb", "rcu_data",
> > > > and "rcu_bh_data". Since you are still able to log in, could you
> > > > please send the contents of these three files?
> > > >
> > > > Thanx, Paul
> > >
> > > This time with the patch actually attached... Thanks to Peter Z.
> > > for alerting me to my omission.
> >
> > Well, your patch doesn't apply on git head. However I used preemptible
> > RCU instead and had tracing enabled.
>
> Were you using preemptible RCU earlier as well? Rafael was using
> classic RCU. Don't get me wrong, all problems need fixing, just trying
> to make sure I understand where the problems are occurring.

And here is a version of the patch rebased to linux-2.6 git head.

This adds tracing to classic RCU.

Signed-off-by: Paul E. McKenney <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
---

include/linux/rcuclassic.h | 4
kernel/Kconfig.preempt | 1
kernel/Makefile | 2
kernel/rcuclassic.c | 5 -
kernel/rcuclassic_trace.c | 198 +++++++++++++++++++++++++++++++++++++++++++++
5 files changed, 207 insertions(+), 3 deletions(-)

diff --git a/include/linux/rcuclassic.h b/include/linux/rcuclassic.h
index 5f89b62..ce183a8 100644
--- a/include/linux/rcuclassic.h
+++ b/include/linux/rcuclassic.h
@@ -63,6 +63,9 @@ struct rcu_ctrlblk {
/* for current batch to proceed. */
} ____cacheline_internodealigned_in_smp;

+extern struct rcu_ctrlblk rcu_ctrlblk;
+extern struct rcu_ctrlblk rcu_bh_ctrlblk;
+
/* Is batch a before batch b ? */
static inline int rcu_batch_before(long a, long b)
{
@@ -81,6 +84,7 @@ struct rcu_data {
long quiescbatch; /* Batch # for grace period */
int passed_quiesc; /* User-mode/idle loop etc. */
int qs_pending; /* core waits for quiesc state */
+ bool beenonline; /* CPU online at least once */

/* 2) batch handling */
/*
diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 9fdba03..ba32338 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -68,7 +68,6 @@ config PREEMPT_RCU

config RCU_TRACE
bool "Enable tracing for RCU - currently stats in debugfs"
- depends on PREEMPT_RCU
select DEBUG_FS
default y
help
diff --git a/kernel/Makefile b/kernel/Makefile
index 9a3ec66..9771050 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -79,6 +79,8 @@ obj-$(CONFIG_CLASSIC_RCU) += rcuclassic.o
obj-$(CONFIG_PREEMPT_RCU) += rcupreempt.o
ifeq ($(CONFIG_PREEMPT_RCU),y)
obj-$(CONFIG_RCU_TRACE) += rcupreempt_trace.o
+else
+obj-$(CONFIG_RCU_TRACE) += rcuclassic_trace.o
endif
obj-$(CONFIG_RELAY) += relay.o
obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
diff --git a/kernel/rcuclassic.c b/kernel/rcuclassic.c
index 37f72e5..54bd23b 100644
--- a/kernel/rcuclassic.c
+++ b/kernel/rcuclassic.c
@@ -58,14 +58,14 @@ EXPORT_SYMBOL_GPL(rcu_lock_map);


/* Definition for rcupdate control block. */
-static struct rcu_ctrlblk rcu_ctrlblk = {
+struct rcu_ctrlblk rcu_ctrlblk = {
.cur = -300,
.completed = -300,
.pending = -300,
.lock = __SPIN_LOCK_UNLOCKED(&rcu_ctrlblk.lock),
.cpumask = CPU_MASK_NONE,
};
-static struct rcu_ctrlblk rcu_bh_ctrlblk = {
+struct rcu_ctrlblk rcu_bh_ctrlblk = {
.cur = -300,
.completed = -300,
.pending = -300,
@@ -725,6 +725,7 @@ static void rcu_init_percpu_data(int cpu, struct rcu_ctrlblk *rcp,
rdp->donetail = &rdp->donelist;
rdp->quiescbatch = rcp->completed;
rdp->qs_pending = 0;
+ rdp->beenonline = 1;
rdp->cpu = cpu;
rdp->blimit = blimit;
spin_unlock_irqrestore(&rcp->lock, flags);
diff --git a/kernel/rcuclassic_trace.c b/kernel/rcuclassic_trace.c
new file mode 100644
index 0000000..612170c
--- /dev/null
+++ b/kernel/rcuclassic_trace.c
@@ -0,0 +1,198 @@
+/*
+ * Read-Copy Update tracing for classic implementation
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright IBM Corporation, 2008
+ *
+ * Updated to use seqfile by Lai Jiangshan.
+ *
+ * Papers: http://www.rdrop.com/users/paulmck/RCU
+ *
+ * For detailed explanation of Read-Copy Update mechanism see -
+ * Documentation/RCU
+ *
+ */
+#include <linux/rcupdate.h>
+#include <linux/module.h>
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+
+/* Print out rcu_data structures using seqfile facility. */
+
+static struct rcu_data *get_rcu_data_bh(int cpu)
+{
+ return &per_cpu(rcu_bh_data, cpu);
+}
+
+static struct rcu_data *get_rcu_data(int cpu)
+{
+ return &per_cpu(rcu_data, cpu);
+}
+
+static int show_rcu_data(struct seq_file *m, void *v)
+{
+ struct rcu_data *rdp = v;
+
+ if (!rdp->beenonline)
+ return 0;
+
+ seq_printf(m, "processor\t: %d", rdp->cpu);
+ if (cpu_is_offline(rdp->cpu))
+ seq_puts(m, "!\n");
+ else
+ seq_puts(m, "\n");
+ seq_printf(m, "quiescbatch\t: %ld\n", rdp->quiescbatch);
+ seq_printf(m, "batch\t\t: %ld\n", rdp->batch);
+ seq_printf(m, "passed_quiesc\t: %d\n", rdp->passed_quiesc);
+ seq_printf(m, "qs_pending\t: %d\n", rdp->qs_pending);
+ seq_printf(m, "qlen\t\t: %ld\n", rdp->qlen);
+ seq_printf(m, "blimit\t\t: %ld\n", rdp->blimit);
+ seq_puts(m, "\n");
+ return 0;
+}
+
+static void *c_start(struct seq_file *m, loff_t *pos)
+{
+ typedef struct rcu_data *(*get_data_func)(int);
+
+ if (*pos == 0) /* just in case, cpu 0 is not the first */
+ *pos = first_cpu(cpu_possible_map);
+ else
+ *pos = next_cpu_nr(*pos - 1, cpu_possible_map);
+ if ((*pos) < nr_cpu_ids)
+ return ((get_data_func)m->private)(*pos);
+ return NULL;
+}
+
+static void *c_next(struct seq_file *m, void *v, loff_t *pos)
+{
+ (*pos)++;
+ return c_start(m, pos);
+}
+
+static void c_stop(struct seq_file *m, void *v)
+{
+}
+
+const struct seq_operations rcu_data_seq_op = {
+ .start = c_start,
+ .next = c_next,
+ .stop = c_stop,
+ .show = show_rcu_data,
+};
+
+static int rcu_data_open(struct inode *inode, struct file *file)
+{
+ int ret = seq_open(file, &rcu_data_seq_op);
+
+ if (ret)
+ return ret;
+ ((struct seq_file *)file->private_data)->private = inode->i_private;
+ return 0;
+}
+
+static const struct file_operations rcu_data_fops = {
+ .owner = THIS_MODULE,
+ .open = rcu_data_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+/* Print out rcu_ctrlblk structures using seqfile facility. */
+
+static void print_one_rcu_ctrlblk(struct seq_file *m, struct rcu_ctrlblk *rcp)
+{
+ seq_printf(m, "cur=%ld completed=%ld pending=%ld s=%d\n\t",
+ rcp->cur, rcp->completed, rcp->pending, rcp->signaled);
+ seq_cpumask(m, &rcp->cpumask);
+ seq_puts(m, "\n");
+}
+
+static int show_rcucb(struct seq_file *m, void *unused)
+{
+ seq_puts(m, "rcu: ");
+ print_one_rcu_ctrlblk(m, &rcu_ctrlblk);
+ seq_puts(m, "rcu_bh: ");
+ print_one_rcu_ctrlblk(m, &rcu_bh_ctrlblk);
+ seq_puts(m, "online: ");
+ seq_cpumask(m, &cpu_online_map);
+ seq_puts(m, "\n");
+ return 0;
+}
+
+static int rcucb_open(struct inode *inode, struct file *file)
+{
+ return single_open(file, show_rcucb, NULL);
+}
+
+static struct file_operations rcucb_fops = {
+ .owner = THIS_MODULE,
+ .open = rcucb_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = single_release,
+};
+
+static struct dentry *rcudir, *rcu_bh_data_file, *rcu_data_file, *rcucb_file;
+
+static int __init rcuclassic_trace_init(void)
+{
+ rcudir = debugfs_create_dir("rcu", NULL);
+ if (!rcudir)
+ goto out;
+
+ rcu_bh_data_file = debugfs_create_file("rcu_bh_data", 0444, rcudir,
+ get_rcu_data_bh, &rcu_data_fops);
+ if (!rcu_bh_data_file)
+ goto out_rcudir;
+
+ rcu_data_file = debugfs_create_file("rcu_data", 0444, rcudir,
+ get_rcu_data, &rcu_data_fops);
+ if (!rcu_data_file)
+ goto out_rcudata_bh_file;
+
+ rcucb_file = debugfs_create_file("rcucb", 0444, rcudir,
+ NULL, &rcucb_fops);
+ if (!rcucb_file)
+ goto out_rcudata_file;
+ return 0;
+
+out_rcudata_file:
+ debugfs_remove(rcu_data_file);
+out_rcudata_bh_file:
+ debugfs_remove(rcu_bh_data_file);
+out_rcudir:
+ debugfs_remove(rcudir);
+out:
+ return 1;
+}
+
+static void __exit rcuclassic_trace_cleanup(void)
+{
+ debugfs_remove(rcucb_file);
+ debugfs_remove(rcu_data_file);
+ debugfs_remove(rcu_bh_data_file);
+ debugfs_remove(rcudir);
+}
+
+module_init(rcuclassic_trace_init);
+module_exit(rcuclassic_trace_cleanup);
+
+MODULE_AUTHOR("Paul E. McKenney");
+MODULE_DESCRIPTION("Read-Copy Update tracing for classic implementation");
+MODULE_LICENSE("GPL");
+

2008-11-11 21:28:20

by Dmitry Adamushko

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

2008/11/10 Rafael J. Wysocki <[email protected]>:
> On Monday, 10 of November 2008, Rafael J. Wysocki wrote:
>> On Monday, 10 of November 2008, Heiko Carstens wrote:
>> > On Sun, Nov 09, 2008 at 06:59:16PM +0100, Rafael J. Wysocki wrote:
>> > > This message has been generated automatically as a part of a report
>> > > of recent regressions.
>> > >
>> > > The following bug entry is on the current list of known regressions
>> > > from 2.6.27. Please verify if it still should be listed and let me know
>> > > (either way).
>> > >
>> > >
>> > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11989
>> > > Subject : Suspend failure on NForce4-based boards due to chanes in stop_machine
>> > > Submitter : Rafael J. Wysocki <[email protected]>
>> > > Date : 2008-11-03 0:28 (7 days old)
>> > > First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c9583e55fa2b08a230c549bd1e3c0bde6c50d9cc
>> > > References : http://marc.info/?l=linux-kernel&m=122567187604356&w=4
>> >
>> > Hi Rafael,
>>
>> Hi,
>>
>> > could you provide more informations for this, please?
>> >
>> > What is your kernel configuration?
>>
>> Available at: http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc3/kitty-config
>>
>> > Do you have any binary only modules (nvidia?) loaded?
>>
>> No, I don't.
>>
>> > Is it possible to recreate the bug by e.g. just doing something like
>> >
>> > echo 0 > /sys/devices/system/cpu/cpu1/online
>>
>> I haven't checked (yet), I'll do that later today and let you know.
>>
>> > (or any other online cpu)? Or does it trigger any lockdep warnings?
>
> It cannot be reproduced with offlining CPU1 and it doesn't trigger any
> warnings from lockdep.
>
> However, it is reproducible by doing
>
> # echo core > /sys/power/pm_test
>
> and repeating
>
> # echo disk > /sys/power/state
>
> for a couple of times, in which case the last two lines printed to the console
> before a (solid) hang are:
>
> SMP alternatives: switching to SMP code
> Booting processor 1 APIC 0x1 ip 0x6000
>
> So, it evidently fails while re-enabling the non-boot CPU and not during
> disabling it as I thought before.
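
The reproduction steps quoted above can also be scripted. What follows is only a minimal sketch, assuming the two sysfs files behave as described in the quoted message; the helper names write_control() and repeat_pm_test() are invented for this example. The directory is a parameter so the logic can be tried against a scratch directory before pointing it at /sys/power (which needs root).

```c
#include <assert.h>
#include <stdio.h>

/* Write a short string to a sysfs-style control file; returns 0 on success. */
int write_control(const char *path, const char *value)
{
	FILE *f;

	assert(path && value);
	f = fopen(path, "w");
	if (!f)
		return -1;
	if (fputs(value, f) == EOF) {
		fclose(f);
		return -1;
	}
	/* Buffered data is flushed here, so write errors can surface on close. */
	return fclose(f) == 0 ? 0 : -1;
}

/*
 * Select the "core" test level once, then request hibernation `rounds`
 * times. `power_dir` would be "/sys/power" on a real system.
 * Returns the number of rounds that completed without a write error.
 */
int repeat_pm_test(const char *power_dir, int rounds)
{
	char path[256];
	int done = 0;

	snprintf(path, sizeof(path), "%s/pm_test", power_dir);
	if (write_control(path, "core\n") != 0)
		return 0;

	snprintf(path, sizeof(path), "%s/state", power_dir);
	while (done < rounds && write_control(path, "disk\n") == 0)
		done++;
	return done;
}
```

Calling repeat_pm_test("/sys/power", 5) as root would mirror the manual steps. On the affected machine the call would presumably never return once the hang hits, so the console output remains the real diagnostic.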

Can you also provide the full log, including the messages printed when the
system goes down, please?

At first glance, "Booting processor..." as the last message looks
strange in this context.
So either wakeup_secondary_cpu()'s completion failed for some reason
(say, due to some kind of a problem that took place while disabling
non-boot cpus... I'm purely speculating here so far) or the printk's
output was not complete.

Perhaps, redoing the test with pr_debug() in arch/x86/kernel/smpboot.c
enabled would shed more light...
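
For readers unfamiliar with the mechanism: pr_debug() calls normally compile to nothing and produce output only when DEBUG is defined before the kernel headers are included in that source file. The following is a userspace imitation of that idea, not kernel code; the names pr_debug_sketch and trace_cpu_boot are made up for illustration.

```c
#include <stdio.h>

/*
 * Comment this out to compile the debug messages away, which is the
 * default situation for pr_debug() in a kernel source file.
 */
#define DEBUG

#ifdef DEBUG
/* Debug build: forward the message to stderr. */
#define pr_debug_sketch(...) fprintf(stderr, __VA_ARGS__)
#else
/* Normal build: the call disappears and evaluates to 0. */
#define pr_debug_sketch(...) 0
#endif

/* Returns the number of characters logged (0 when DEBUG is not defined). */
int trace_cpu_boot(int cpu, int apicid)
{
	return pr_debug_sketch("booting cpu %d, apicid 0x%x\n", cpu, apicid);
}
```

In the kernel case the equivalent step would be a "#define DEBUG" at the top of arch/x86/kernel/smpboot.c, above its includes, followed by a rebuild.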


--
Best regards,
Dmitry Adamushko

2008-11-11 23:38:44

by Rafael J. Wysocki

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Tuesday, 11 of November 2008, Dmitry Adamushko wrote:
> 2008/11/10 Rafael J. Wysocki <[email protected]>:
> > On Monday, 10 of November 2008, Rafael J. Wysocki wrote:
> >> On Monday, 10 of November 2008, Heiko Carstens wrote:
> >> > On Sun, Nov 09, 2008 at 06:59:16PM +0100, Rafael J. Wysocki wrote:
> >> > > This message has been generated automatically as a part of a report
> >> > > of recent regressions.
> >> > >
> >> > > The following bug entry is on the current list of known regressions
> >> > > from 2.6.27. Please verify if it still should be listed and let me know
> >> > > (either way).
> >> > >
> >> > >
> >> > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11989
> >> > > Subject : Suspend failure on NForce4-based boards due to chanes in stop_machine
> >> > > Submitter : Rafael J. Wysocki <[email protected]>
> >> > > Date : 2008-11-03 0:28 (7 days old)
> >> > > First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c9583e55fa2b08a230c549bd1e3c0bde6c50d9cc
> >> > > References : http://marc.info/?l=linux-kernel&m=122567187604356&w=4
> >> >
> >> > Hi Rafael,
> >>
> >> Hi,
> >>
> >> > could you provide more informations for this, please?
> >> >
> >> > What is your kernel configuration?
> >>
> >> Available at: http://www.sisk.pl/kernel/debug/mainline/2.6.28-rc3/kitty-config
> >>
> >> > Do you have any binary only modules (nvidia?) loaded?
> >>
> >> No, I don't.
> >>
> >> > Is it possible to recreate the bug by e.g. just doing something like
> >> >
> >> > echo 0 > /sys/devices/system/cpu/cpu1/online
> >>
> >> I haven't checked (yet), I'll do that later today and let you know.
> >>
> >> > (or any other online cpu)? Or does it trigger any lockdep warnings?
> >
> > It cannot be reproduced with offlining CPU1 and it doesn't trigger any
> > warnings from lockdep.
> >
> > However, it is reproducible by doing
> >
> > # echo core > /sys/power/pm_test
> >
> > and repeating
> >
> > # echo disk > /sys/power/state
> >
> > for a couple of times, in which case the last two lines printed to the console
> > before a (solid) hang are:
> >
> > SMP alternatives: switching to SMP code
> > Booting processor 1 APIC 0x1 ip 0x6000
> >
> > So, it evidently fails while re-enabling the non-boot CPU and not during
> > disabling it as I thought before.
>
> Can you also provide the full log including the messages when a system
> goes down please?
>
> At first glance, "Booting processor..." as the last message looks
> strange in this context.
> So either wakeup_secondary_cpu()'s completion failed for some reason
> (say, due to some kind of a problem that took place while disabling
> non-boot cpus... I'm purely speculating here so far) or the printk's
> output was not complete.
>
> Perhaps, redoing the test with pr_debug() in arch/x86/kernel/smpboot.c
> enabled would shed more light...

Will do tomorrow.

Thanks,
Rafael

2008-11-12 03:30:26

by Rusty Russell

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Wednesday 12 November 2008 03:01:18 Oleg Nesterov wrote:
> On 11/11, Vegard Nossum wrote:
> > I think that the test for stop_machine_data in stop_cpu() should not
> > have been moved from __stop_machine(). Because now cpu_online_map may
> > change in-between calls to stop_cpu() (if the callback tries to
> > online/offline CPUs), and the end result may be different.
>
> I don't think this is possible, the callback must not be called unless
> all threads ack (at least) the STOPMACHINE_PREPARE state.
>
>
> Off-topic question, __stop_machine() does:
>
> /* Schedule the stop_cpu work on all cpus: hold this CPU so one
> * doesn't hit this CPU until we're ready. */
> get_cpu();
> for_each_online_cpu(i) {
> sm_work = percpu_ptr(stop_machine_work, i);
> INIT_WORK(sm_work, stop_cpu);
> queue_work_on(i, stop_machine_wq, sm_work);
> }
> /* This will release the thread on our CPU. */
> put_cpu();
>
> Don't we actually need preempt_disable/preempt_enable instead of
> get/put cpu? (yes, they're the same currently). We don't care about
> the CPU we are running on, and it can't go away until we queue all
> works. But we must ensure that stop_cpu() on the same CPU can't
> preempt us, right?

A subtle distinction, but yes. It used to be true before the recent changes,
where we manually did "this" cpu.

Cheers,
Rusty.

2008-11-12 03:40:15

by Rusty Russell

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Tuesday 11 November 2008 21:22:14 Ingo Molnar wrote:
> * Rafael J. Wysocki <[email protected]> wrote:
> > So, it evidently fails while re-enabling the non-boot CPU and not
> > during disabling it as I thought before.

(Resend, due to HTML version previously)

But what is calling stop_machine in that path?

There *is* a race, but I don't think it could cause this (we should make a
copy of active.fnret inside the lock before returning it).

Two patches: one fixes that race, the next adds debugging spew.

stop_machine: fix race with return value

We should not access active.fnret outside the lock; in theory the next
stop_machine could overwrite it.

Signed-off-by: Rusty Russell <[email protected]>
---
 kernel/stop_machine.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff -r d7c9a15da615 kernel/stop_machine.c
--- a/kernel/stop_machine.c Mon Nov 10 09:47:45 2008 +1100
+++ b/kernel/stop_machine.c Tue Nov 11 23:19:47 2008 +1030
@@ -112,7 +112,7 @@
 int __stop_machine(int (*fn)(void *), void *data, const cpumask_t *cpus)
 {
 	struct work_struct *sm_work;
-	int i;
+	int i, ret;

 	/* Set up initial state. */
 	mutex_lock(&lock);
@@ -137,8 +137,9 @@
 	/* This will release the thread on our CPU. */
 	put_cpu();
 	flush_workqueue(stop_machine_wq);
+	ret = active.fnret;
 	mutex_unlock(&lock);
-	return active.fnret;
+	return ret;
 }

 int stop_machine(int (*fn)(void *), void *data, const cpumask_t *cpus)
===
diff -r fe7dd39b1cff kernel/stop_machine.c
--- a/kernel/stop_machine.c Wed Nov 12 14:07:18 2008 +1030
+++ b/kernel/stop_machine.c Wed Nov 12 14:09:08 2008 +1030
@@ -89,6 +89,8 @@
 			case STOPMACHINE_RUN:
 				/* On multiple CPUs only a single error code
 				 * is needed to tell that something failed. */
+				printk("stop_machine: %i running %p\n",
+				       smp_processor_id(), smdata->fn);
 				err = smdata->fn(smdata->data);
 				if (err)
 					smdata->fnret = err;
@@ -106,6 +108,7 @@
 /* Callback for CPUs which aren't supposed to do anything. */
 static int chill(void *unused)
 {
+	printk("stop_machine: %i chilling\n", smp_processor_id());
 	return 0;
 }

@@ -126,17 +129,23 @@
 	set_state(STOPMACHINE_PREPARE);

+	printk("stop_machine: running on %i cpus:\n", num_threads);
+	dump_stack();
+
 	/* Schedule the stop_cpu work on all cpus: hold this CPU so one
 	 * doesn't hit this CPU until we're ready. */
 	get_cpu();
 	for_each_online_cpu(i) {
+		printk("stop_machine: setting up cpu %i\n", i);
 		sm_work = percpu_ptr(stop_machine_work, i);
 		INIT_WORK(sm_work, stop_cpu);
 		queue_work_on(i, stop_machine_wq, sm_work);
 	}
 	/* This will release the thread on our CPU. */
+	printk("stop_machine: releasing CPU %i\n", smp_processor_id());
 	put_cpu();
 	flush_workqueue(stop_machine_wq);
+	printk("stop_machine: done\n");
 	ret = active.fnret;
 	mutex_unlock(&lock);
 	return ret;
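
The first patch above comes down to one rule: read the shared result while the lock is still held, and return the private copy. Below is a minimal userspace sketch of that pattern, with a pthread mutex standing in for the kernel mutex. The names shared_state, active_sketch, run_locked and return_value are invented for this illustration and are not the kernel's.

```c
#include <assert.h>
#include <pthread.h>

/* Shared state, loosely analogous to the `active` structure in the patch. */
struct shared_state {
	pthread_mutex_t lock;
	int fnret;
};

static struct shared_state active_sketch = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.fnret = 0,
};

/* Run fn under the lock and return its result. */
int run_locked(int (*fn)(void *), void *data)
{
	int ret;

	assert(fn);
	pthread_mutex_lock(&active_sketch.lock);
	active_sketch.fnret = fn(data);
	/* Take the copy while the lock still excludes other callers. */
	ret = active_sketch.fnret;
	pthread_mutex_unlock(&active_sketch.lock);

	/*
	 * Returning active_sketch.fnret here instead would read it after
	 * the unlock, when another caller may already have overwritten it.
	 */
	return ret;
}

/* Example callback: returns the integer that data points to. */
int return_value(void *data)
{
	return *(int *)data;
}
```

With a single caller the two variants behave identically, which is presumably why the race was described as theoretical; the local copy only matters once a second caller can take the lock between the unlock and the return.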

2008-11-12 09:05:31

by Heiko Carstens

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Tue, Nov 11, 2008 at 09:34:51AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 11, 2008 at 08:45:23AM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 11, 2008 at 05:14:01PM +0100, Heiko Carstens wrote:
> > > > > Could you please apply the following debug patch (due to Jiangshan and
> > > > > myself)? Then you should be able to build with CONFIG_RCU_TRACE,
> > > > > then mount debugfs after boot, for example, on /debug. This will
> > > > > create a /debug/rcu directory with three files, "rcucb", "rcu_data",
> > > > > and "rcu_bh_data". Since you are still able to log in, could you
> > > > > please send the contents of these three files?
> > > > >
> > > > > Thanx, Paul
> > > >
> > > > This time with the patch actually attached... Thanks to Peter Z.
> > > > for alerting me to my omission.
> > >
> > > Well, your patch doesn't apply on git head. However I used preemptible
> > > RCU instead and had tracing enabled.
> >
> > Were you using preemptible RCU earlier as well? Raphael was using
> > classic RCU. Don't get me wrong, all problems need fixing, just trying
> > to make sure I understand where the problems are occurring.

Indeed, my fault. I have just been trying to reproduce a cpu hotplug bug with
classic RCU and a cpu hotplug stress test, but no luck so far.

2008-11-12 16:04:18

by Paul E. McKenney

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Wed, Nov 12, 2008 at 10:05:08AM +0100, Heiko Carstens wrote:
> On Tue, Nov 11, 2008 at 09:34:51AM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 11, 2008 at 08:45:23AM -0800, Paul E. McKenney wrote:
> > > On Tue, Nov 11, 2008 at 05:14:01PM +0100, Heiko Carstens wrote:
> > > > > > Could you please apply the following debug patch (due to Jiangshan and
> > > > > > myself)? Then you should be able to build with CONFIG_RCU_TRACE,
> > > > > > then mount debugfs after boot, for example, on /debug. This will
> > > > > > create a /debug/rcu directory with three files, "rcucb", "rcu_data",
> > > > > > and "rcu_bh_data". Since you are still able to log in, could you
> > > > > > please send the contents of these three files?
> > > > > >
> > > > > > Thanx, Paul
> > > > >
> > > > > This time with the patch actually attached... Thanks to Peter Z.
> > > > > for alerting me to my omission.
> > > >
> > > > Well, your patch doesn't apply on git head. However I used preemptible
> > > > RCU instead and had tracing enabled.
> > >
> > > Were you using preemptible RCU earlier as well? Raphael was using
> > > classic RCU. Don't get me wrong, all problems need fixing, just trying
> > > to make sure I understand where the problems are occurring.
>
> Indeed, my fault. I just try to reproduce a cpu hotplug bug with classic RCU
> and cpu hotplug stress test, but no luck so far.

OK, then my next step will be to send Rafael an updated version of
my hierarchical RCU, which is more robust than classic RCU against
online/offline stress tests. On the machines I have access to, anyway. ;-)

Then I will look at preemptible RCU, which undoubtedly needs some of the
same help that I have been giving to hierarchical RCU. Manfred thus
wins the clairvoyance award!

Thanx, Paul

2008-11-12 16:51:45

by Heiko Carstens

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Wed, Nov 12, 2008 at 08:03:49AM -0800, Paul E. McKenney wrote:
> On Wed, Nov 12, 2008 at 10:05:08AM +0100, Heiko Carstens wrote:
> > On Tue, Nov 11, 2008 at 09:34:51AM -0800, Paul E. McKenney wrote:
> > > > Were you using preemptible RCU earlier as well? Raphael was using
> > > > classic RCU. Don't get me wrong, all problems need fixing, just trying
> > > > to make sure I understand where the problems are occurring.
> >
> > Indeed, my fault. I just try to reproduce a cpu hotplug bug with classic RCU
> > and cpu hotplug stress test, but no luck so far.
>
> OK, then my next step will be to send Rafael an updated version of
> my hierarchical RCU, which is more robust than classic RCU against
> online/offline stress tests. On the machines I have access to, anyway. ;-)
>
> Then I will look at preemptible RCU, which undoubtedly needs some of the
> same help that I have been giving to hierarchical RCU. Manfred thus
> wins the clairvoyance award!

Well, I tried all day long to reproduce a cpu hotplug/stop_machine hang
with classic RCU and a kernel configuration that is as close as possible
to Rafael's configuration, but it just continues to work without a bug.

One of the machines is a virtual machine with 8 virtual cpus mapped on
two real cpus. The real cpus are again shared with other guests. So I end
up with cpu steal times of 50-90%. That should have revealed races in the
stop_machine code, considering that thousands of cpu hotplug operations
happened.

I'll leave these test machines running overnight. Maybe something happens...
but at first glance it looks more like the reworked stop_machine code
triggers a different bug that is already present. Hmmm...

2008-11-12 20:21:20

by Paul E. McKenney

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to chanes in stop_machine

On Wed, Nov 12, 2008 at 05:51:18PM +0100, Heiko Carstens wrote:
> On Wed, Nov 12, 2008 at 08:03:49AM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 12, 2008 at 10:05:08AM +0100, Heiko Carstens wrote:
> > > On Tue, Nov 11, 2008 at 09:34:51AM -0800, Paul E. McKenney wrote:
> > > > > Were you using preemptible RCU earlier as well? Raphael was using
> > > > > classic RCU. Don't get me wrong, all problems need fixing, just trying
> > > > > to make sure I understand where the problems are occurring.
> > >
> > > Indeed, my fault. I just try to reproduce a cpu hotplug bug with classic RCU
> > > and cpu hotplug stress test, but no luck so far.
> >
> > OK, then my next step will be to send Rafael an updated version of
> > my hierarchical RCU, which is more robust than classic RCU against
> > online/offline stress tests. On the machines I have access to, anyway. ;-)
> >
> > Then I will look at preemptible RCU, which undoubtedly needs some of the
> > same help that I have been giving to hierarchical RCU. Manfred thus
> > wins the clairvoyance award!
>
> Well, I tried all day long to reproduce a cpu hotplug/stop_machine hang
> with classic RCU and a kernel configuration that is as close as possible
> to Raphael's configuration, but it just continues to work without a bug.
>
> One of the machines is a virtual machine with 8 virtual cpus mapped on
> two real cpus. The real cpus are again shared with other guests. So I end
> up with cpu steal times of 50-90%. That should have revealed races in the
> stop_machine code, considering that thousands of cpu hotplug operations
> happened.
>
> I let these test machines running over night. Maybe something happens...
> but at a first glance it looks more like the reworked stop_machine code
> triggers a different bug that already is present. Hmmm...

I can make Classic RCU break in 2.6.28-rc3, but I need a 128-CPU machine to
break it. ;-)

Thanx, Paul

2008-11-13 23:11:28

by David Miller

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

From: Benjamin Herrenschmidt <[email protected]>
Date: Mon, 10 Nov 2008 16:46:25 +1100

> David, would you mind testing on your machine ? It's the one that shows
> the biggest performance improvement, and I would like to know how much
> it is affected by that patch. As long as the "worst case" performance
> is still reasonable, I'm ok to take the hit if the improvement for you
> is still significant.

Finally got around to this, we lose about a full second in the
"cat rfc3261.txt" benchmark:

2.6.28-rc4 vanilla:

7.634
7.704
7.688

2.6.28rc4+patch:

8.712
8.685
8.702
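
To put a number on "about a full second", the six timings above can be averaged. This is a small sketch of the arithmetic only; the function and array names are invented here, and the timings are assumed to be in seconds.

```c
#include <assert.h>
#include <stddef.h>

/* Timings copied from the message above, assumed to be in seconds. */
const double vanilla_runs[] = { 7.634, 7.704, 7.688 };
const double patched_runs[] = { 8.712, 8.685, 8.702 };

/* Arithmetic mean of n timing samples. */
double mean_seconds(const double *samples, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += samples[i];
	return sum / (double)n;
}

/* Slowdown of the patched kernel relative to vanilla, in percent. */
double slowdown_percent(void)
{
	double base = mean_seconds(vanilla_runs, 3);
	double with_patch = mean_seconds(patched_runs, 3);

	return (with_patch - base) / base * 100.0;
}
```

The means work out to roughly 7.675 s and 8.700 s, a difference of about 1.02 s, or a slowdown of a little over 13%.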

2008-11-14 00:55:37

by Benjamin Herrenschmidt

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

On Thu, 2008-11-13 at 15:11 -0800, David Miller wrote:
> From: Benjamin Herrenschmidt <[email protected]>
> Date: Mon, 10 Nov 2008 16:46:25 +1100
>
> > David, would you mind testing on your machine ? It's the one that shows
> > the biggest performance improvement, and I would like to know how much
> > it is affected by that patch. As long as the "worst case" performance
> > is still reasonable, I'm ok to take the hit if the improvement for you
> > is still significant.
>
> Finally got around to this, we lose about a full second in the
> "cat rfc3261.txt" benchmark:
>
> 2.6.28-rc4 vanilla:
>
> 7.634
> 7.704
> 7.688
>
> 2.6.28rc4+patch:
>
> 8.712
> 8.685
> 8.702

How does it compare with not having the acceleration? I don't think
I can do anything about it, except maybe optimize for the case where the
pixmap is already aligned (and thus doesn't need scissors). The main
question is whether the acceleration is still worth it at all, since it's
generally not worth it on other architectures.

Cheers,
Ben.

2008-11-14 02:51:28

by David Miller

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

From: Benjamin Herrenschmidt <[email protected]>
Date: Fri, 14 Nov 2008 11:54:20 +1100

> How does it compare with not having the acceleration ?

I'll find out for you.

2008-11-14 03:05:16

by David Miller

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

From: David Miller <[email protected]>
Date: Thu, 13 Nov 2008 18:50:59 -0800 (PST)

> From: Benjamin Herrenschmidt <[email protected]>
> Date: Fri, 14 Nov 2008 11:54:20 +1100
>
> > How does it compare with not having the acceleration ?
>
> I'll find out for you.

It makes a huge difference, with the acceleration patch:

commit b1ee26bab14886350ba12a5c10cbc0696ac679bf
Author: Benjamin Herrenschmidt <[email protected]>
Date: Wed Oct 15 22:03:46 2008 -0700

radeonfb: accelerate imageblit and other improvements

reverted, the test case takes 25 seconds or more instead of
the 7 or 8 seconds we're seeing now.

2008-11-14 03:30:24

by Benjamin Herrenschmidt

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)


> It makes a huge difference, with the acceleration patch:
>
> commit b1ee26bab14886350ba12a5c10cbc0696ac679bf
> Author: Benjamin Herrenschmidt <[email protected]>
> Date: Wed Oct 15 22:03:46 2008 -0700
>
> radeonfb: accelerate imageblit and other improvements
>
> reverted, the test case takes 25 seconds or more instead of
> the 7 or 8 seconds we're seeing now.

Ok, thanks a lot for those tests !

So I consider the performance loss due to the workaround to be minor
enough here. I'll submit the patch for inclusion.

I might look later at skipping the clipping in cases where things are
already aligned, but I doubt it's going to be worth the pain.

Cheers,
Ben.

2008-11-14 04:28:29

by David Miller

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

From: Benjamin Herrenschmidt <[email protected]>
Date: Fri, 14 Nov 2008 14:29:11 +1100

>
> So I consider the loss of perfs due to the workaround to be minor enough
> here. I'll submit the patch for inclusion.

BTW, there is a warning generated by this fix, the src_bytes
variable becomes unused or something like that.

2008-11-14 08:52:41

by Benjamin Herrenschmidt

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

On Thu, 2008-11-13 at 20:28 -0800, David Miller wrote:
> From: Benjamin Herrenschmidt <[email protected]>
> Date: Fri, 14 Nov 2008 14:29:11 +1100
>
> >
> > So I consider the loss of perfs due to the workaround to be minor enough
> > here. I'll submit the patch for inclusion.
>
> BTW, there is a warning generated by this fix, the src_bytes
> variable becomes unused or something like that.

Ok thanks. I'll check that asap. I think I did indeed remove the use of
some intermediate variable and probably forgot to remove its declaration
too.

Cheers,
Ben.

2008-11-14 14:51:57

by Geert Uytterhoeven

Subject: Re: [Bug #11988] Eliminate recursive mutex in compat fb ioctl path

On Sun, 9 Nov 2008, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of recent regressions.
>
> The following bug entry is on the current list of known regressions
> from 2.6.27. Please verify if it still should be listed and let me know
> (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11988
> Subject : Eliminate recursive mutex in compat fb ioctl path
> Submitter : Keith Packard <[email protected]>
> Date : 2008-11-03 7:06 (7 days old)
> References : http://marc.info/?l=linux-kernel&m=122569604828448&w=4
> Handled-By : Keith Packard <[email protected]>
> Geert Uytterhoeven <[email protected]>
> Patch : http://marc.info/?l=linux-kernel&m=122569604828448&w=4
> http://lkml.org/lkml/2008/10/31/162

Fixed in mainline.

commit a684e7d33096892093456dd56a582cfc3bfad648
Author: Geert Uytterhoeven <[email protected]>
Date: Thu Nov 6 12:53:37 2008 -0800

fbdev: fix fb_compat_ioctl() deadlocks

commit 3e680aae4e53ab54cdbb0c29257dae0cbb158e1c ("fb: convert
lock/unlock_kernel() into local fb mutex") introduced several deadlocks
in the fb_compat_ioctl() path, as mutex_lock() doesn't allow recursion,
unlike lock_kernel(). This broke frame buffer applications on 64-bit
systems with a 32-bit userland.

commit 120a37470c2831fea49fdebaceb5a7039f700ce6 ("framebuffer compat_ioctl
deadlock") fixed one of the deadlocks.

This patch fixes the remaining deadlocks:
- Revert commit 120a37470c2831fea49fdebaceb5a7039f700ce6,
- Extract the core logic of fb_ioctl() into a new function do_fb_ioctl(),
- Change all callsites of fb_ioctl() where info->lock is already held to
call do_fb_ioctl() instead,
- Add sparse annotations to all routines that take info->lock.

Signed-off-by: Geert Uytterhoeven <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: Krzysztof Helt <[email protected]>
Cc: Alan Cox <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

With kind regards,

Geert Uytterhoeven
Software Architect

Sony Techsoft Centre Europe
The Corporate Village · Da Vincilaan 7-D1 · B-1935 Zaventem · Belgium

Phone: +32 (0)2 700 8453
Fax: +32 (0)2 700 8622
E-mail: [email protected]
Internet: http://www.sony-europe.com/

A division of Sony Europe (Belgium) N.V.
VAT BE 0413.825.160 · RPR Brussels
Fortis · BIC GEBABEBB · IBAN BE41293037680010

2008-11-15 11:47:20

by Rafael J. Wysocki

Subject: Re: [Bug #11988] Eliminate recursive mutex in compat fb ioctl path

On Friday, 14 of November 2008, Geert Uytterhoeven wrote:
> On Sun, 9 Nov 2008, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of recent regressions.
> >
> > The following bug entry is on the current list of known regressions
> > from 2.6.27. Please verify if it still should be listed and let me know
> > (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11988
> > Subject : Eliminate recursive mutex in compat fb ioctl path
> > Submitter : Keith Packard <[email protected]>
> > Date : 2008-11-03 7:06 (7 days old)
> > References : http://marc.info/?l=linux-kernel&m=122569604828448&w=4
> > Handled-By : Keith Packard <[email protected]>
> > Geert Uytterhoeven <[email protected]>
> > Patch : http://marc.info/?l=linux-kernel&m=122569604828448&w=4
> > http://lkml.org/lkml/2008/10/31/162
>
> Fixed in mainline.
>
> commit a684e7d33096892093456dd56a582cfc3bfad648
> Author: Geert Uytterhoeven <[email protected]>
> Date: Thu Nov 6 12:53:37 2008 -0800
>
> fbdev: fix fb_compat_ioctl() deadlocks
>
> commit 3e680aae4e53ab54cdbb0c29257dae0cbb158e1c ("fb: convert
> lock/unlock_kernel() into local fb mutex") introduced several deadlocks
> in the fb_compat_ioctl() path, as mutex_lock() doesn't allow recursion,
> unlike lock_kernel(). This broke frame buffer applications on 64-bit
> systems with a 32-bit userland.
>
> commit 120a37470c2831fea49fdebaceb5a7039f700ce6 ("framebuffer compat_ioctl
> deadlock") fixed one of the deadlocks.
>
> This patch fixes the remaining deadlocks:
> - Revert commit 120a37470c2831fea49fdebaceb5a7039f700ce6,
> - Extract the core logic of fb_ioctl() into a new function do_fb_ioctl(),
> - Change all callsites of fb_ioctl() where info->lock is already held to
> call do_fb_ioctl() instead,
> - Add sparse annotations to all routines that take info->lock.
>
> Signed-off-by: Geert Uytterhoeven <[email protected]>
> Cc: Mikulas Patocka <[email protected]>
> Cc: Krzysztof Helt <[email protected]>
> Cc: Alan Cox <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> Signed-off-by: Linus Torvalds <[email protected]>

Thanks, closed.

Rafael

2008-11-15 13:33:26

by Rafael J. Wysocki

Subject: Re: [Bug #11989] Suspend failure on NForce4-based boards due to changes in stop_machine

On Wednesday, 12 of November 2008, Rusty Russell wrote:
> On Tuesday 11 November 2008 21:22:14 Ingo Molnar wrote:
> > * Rafael J. Wysocki <[email protected]> wrote:
> > > So, it evidently fails while re-enabling the non-boot CPU and not
> > > during disabling it as I thought before.
>
> (Resend, due to HTML version previously)
>
> But what is calling stop_machine in that path?
>
> There *is* a race, but I don't think it could cause this (we should make a
> copy of active.fnret inside the lock before returning it).

Still, that seems to be the case.

> Two patches: one fixes that race, the next adds debugging spew.
>
> stop_machine: fix race with return value

With this patch applied (reproduced below for clarity) the problem is not
reproducible any more.

Care to push it upstream ASAP?

Thanks,
Rafael

---
stop_machine: fix race with return value

We should not access active.fnret outside the lock; in theory the next
stop_machine could overwrite it.

Signed-off-by: Rusty Russell <[email protected]>
---
kernel/stop_machine.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff -r d7c9a15da615 kernel/stop_machine.c
--- a/kernel/stop_machine.c Mon Nov 10 09:47:45 2008 +1100
+++ b/kernel/stop_machine.c Tue Nov 11 23:19:47 2008 +1030
@@ -112,7 +112,7 @@
 int __stop_machine(int (*fn)(void *), void *data, const cpumask_t *cpus)
 {
 	struct work_struct *sm_work;
-	int i;
+	int i, ret;
 
 	/* Set up initial state. */
 	mutex_lock(&lock);
@@ -137,8 +137,9 @@
 	/* This will release the thread on our CPU. */
 	put_cpu();
 	flush_workqueue(stop_machine_wq);
+	ret = active.fnret;
 	mutex_unlock(&lock);
-	return active.fnret;
+	return ret;
 }
 
 int stop_machine(int (*fn)(void *), void *data, const cpumask_t *cpus)
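
The pattern the patch applies, copying a shared result into a local variable
before dropping the lock, can be shown in a small userspace sketch. It is not
the kernel code: demo_lock, demo_active, demo_run() and demo_square() are
hypothetical names, and a pthread mutex stands in for the kernel mutex.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical shared state, loosely modelled on "active" in the patch. */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static struct {
	int fnret;
} demo_active;

/* Run fn(data) under the lock and return its result. */
static int demo_run(int (*fn)(void *), void *data)
{
	int ret;

	assert(fn);

	pthread_mutex_lock(&demo_lock);
	demo_active.fnret = fn(data);
	/* Copy the result while the lock still protects demo_active. */
	ret = demo_active.fnret;
	pthread_mutex_unlock(&demo_lock);

	/*
	 * Returning demo_active.fnret here instead would race: another
	 * caller could take the lock and overwrite it before we read it.
	 */
	return ret;
}

/* Example callback: squares the int that data points to. */
static int demo_square(void *data)
{
	int v = *(int *)data;

	return v * v;
}
```

Calling demo_run(demo_square, &x) returns the value computed for that
call, even if another thread enters demo_run() right after the unlock.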

2008-11-21 02:56:13

by Benjamin Herrenschmidt

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

On Tue, 2008-11-11 at 10:31 +0100, Andreas Schwab wrote:
> It looks like you are observing the same failure mode that I do.

BTW, the lockup when shutting down isn't happening for me anymore with a
recent X (Ubuntu Intrepid).

I haven't quite figured out what's up yet.

Cheers,
Ben.

2008-11-21 03:03:37

by Benjamin Herrenschmidt

Subject: Re: [Bug #11875] radeonfb lockup in .28-rc (bisected)

On Tue, 2008-11-11 at 10:31 +0100, Andreas Schwab wrote:
> It looks like you are observing the same failure mode that I do.

BTW, I've been running a torture script that does an ls -lR / in a
console and constantly does chvt between that console and X, and so far
I haven't got it to crash...

Cheers,
Ben.