[Notes:
* Quite a number of new regressions from 2.6.30 have been reported during
the last three weeks.
* The number of unresolved regressions 2.6.30 -> 2.6.31 is now the second
highest ever.]
This message contains a list of some regressions introduced between 2.6.30 and
2.6.31, for which there are no fixes in the mainline I know of. If any of them
have been fixed already, please let me know.
If you know of any other unresolved regressions introduced between 2.6.30
and 2.6.31, please let me know as well and I'll add them to the list.
Also, please let me know if any of the entries below are invalid.
Additionally, each entry from the list will be sent in an automatic reply to
this message, with CCs to the people involved in reporting and handling the
issue.
Listed regressions statistics:
Date        Total  Pending  Unresolved
----------------------------------------
2009-10-02    151       49          42
2009-09-06    123       34          27
2009-08-26    108       33          26
2009-08-20    102       32          29
2009-08-10     89       27          24
2009-08-02     76       36          28
2009-07-27     70       51          43
2009-07-07     35       25          21
2009-06-29     22       22          15
Unresolved regressions
----------------------
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14301
Subject : WARNING: at net/ipv4/af_inet.c:154
Submitter : Ralf Hildebrandt <[email protected]>
Date : 2009-09-30 12:24 (2 days old)
References : http://marc.info/?l=linux-kernel&m=125431350218137&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14294
Subject : kernel BUG at drivers/ide/ide-disk.c:187
Submitter : Santiago Garcia Mantinan <[email protected]>
Date : 2009-09-30 11:05 (2 days old)
References : http://marc.info/?l=linux-kernel&m=125430926311466&w=4
Handled-By : David Miller <[email protected]>
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14270
Subject : Cannot boot on a PIII Celeron
Submitter : Michael Tokarev <[email protected]>
Date : 2009-09-28 15:26 (4 days old)
References : http://marc.info/?l=linux-kernel&m=125415160524110&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14267
Subject : Disassociating atheros wlan
Submitter : Kristoffer Ericson <[email protected]>
Date : 2009-09-24 10:16 (8 days old)
References : http://marc.info/?l=linux-kernel&m=125378723723384&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14266
Subject : regression in page writeback
Submitter : Shaohua Li <[email protected]>
Date : 2009-09-22 5:49 (10 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d7831a0bdf06b9f722b947bb0c205ff7d77cebd8
References : http://marc.info/?l=linux-kernel&m=125359858117176&w=4
Handled-By : Wu Fengguang <[email protected]>
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14265
Subject : ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
Submitter : Karol Lewandowski <[email protected]>
Date : 2009-09-15 12:05 (17 days old)
References : http://marc.info/?l=linux-kernel&m=125301636509517&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14264
Subject : ehci problem - mouse dead on scroll
Submitter : Volker Armin Hemmann <[email protected]>
Date : 2009-09-12 7:46 (20 days old)
References : http://marc.info/?l=linux-kernel&m=125274202707893&w=4
Handled-By : Alan Stern <[email protected]>
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14257
Subject : Not able to boot on 32 bit System
Submitter : Rishikesh <[email protected]>
Date : 2009-09-21 15:25 (11 days old)
References : http://marc.info/?l=linux-kernel&m=125354604314412&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14256
Subject : kernel BUG at fs/ext3/super.c:435
Submitter : Mikael Pettersson <[email protected]>
Date : 2009-09-21 7:29 (11 days old)
References : http://marc.info/?l=linux-kernel&m=125351816109264&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14255
Subject : WARNING: at drivers/char/tty_io.c:1267
Submitter : Heinz Diehl <[email protected]>
Date : 2009-09-20 11:37 (12 days old)
References : http://marc.info/?l=linux-kernel&m=125344629506309&w=4
http://lkml.org/lkml/2009/9/8/393
Handled-By : Linus Torvalds <[email protected]>
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14254
Subject : Hibernation broken by clocksource: Save mult_orig in clocksource_disable()
Submitter : Ondrej Zary <[email protected]>
Date : 2009-09-19 19:55 (13 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c7121843685de2bf7f3afd3ae1d6a146010bf1fc
References : http://marc.info/?l=linux-kernel&m=125339012527719&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14252
Subject : WARNING: at include/linux/skbuff.h:1382 w/ e1000
Submitter : Stephan von Krawczynski <[email protected]>
Date : 2009-09-20 11:26 (12 days old)
References : http://marc.info/?l=linux-kernel&m=125344599006033&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14251
Subject : 2.6.31: no login prompt
Submitter : Frédéric L. W. Meunier <[email protected]>
Date : 2009-09-19 22:43 (13 days old)
References : http://marc.info/?l=linux-kernel&m=125340020804711&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14249
Subject : BUG: oops in gss_validate on 2.6.31
Submitter : Bastian Blank <[email protected]>
Date : 2009-09-16 10:29 (16 days old)
References : http://marc.info/?l=linux-kernel&m=125309700417283&w=4
Handled-By : Trond Myklebust <[email protected]>
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14248
Subject : 2.6.31 wireless: WARNING: at net/wireless/ibss.c:34
Submitter : Jurriaan <[email protected]>
Date : 2009-09-13 7:32 (19 days old)
References : http://marc.info/?l=linux-kernel&m=125282721113553&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14222
Subject : Hibernation oopses for the 2nd time with 2.6.31 (won't fit the screen)
Submitter : Ondrej Zary <[email protected]>
Date : 2009-09-24 14:07 (8 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c7121843685de2bf7f3afd3ae1d6a146010bf1fc
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14205
Subject : Intel DX58SO mainboard - powering off takes really long
Submitter : Tomasz Chmielewski <[email protected]>
Date : 2009-09-22 10:14 (10 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14204
Subject : MCE prevent booting on my computer(pentium iii @500Mhz)
Submitter : GNUtoo <[email protected]>
Date : 2009-09-21 20:36 (11 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14185
Subject : Oops in drivers/base/firmware_class
Submitter : <[email protected]>
Date : 2009-09-17 05:09 (15 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6e03a201bbe8137487f340d26aa662110e324b20
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14181
Subject : b43 causes panic at system shutdown
Submitter : Jeremy Huddleston <[email protected]>
Date : 2009-09-15 18:34 (17 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14157
Subject : end_request: I/O error, dev cciss/cXdX, sector 0
Submitter : <[email protected]>
Date : 2009-09-11 07:42 (21 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14143
Subject : OOPS when setting nr_requests for md devices
Submitter : aCaB <[email protected]>
Date : 2009-09-08 08:48 (24 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14141
Subject : order 2 page allocation failures in iwlagn
Submitter : Frans Pop <[email protected]>
Date : 2009-09-06 7:40 (26 days old)
References : http://marc.info/?l=linux-kernel&m=125222287419691&w=4
Handled-By : Pekka Enberg <[email protected]>
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14133
Subject : WARNING: at arch/x86/kernel/smp.c:117 native_smp_send_reschedule
Submitter : Jens Axboe <[email protected]>
Date : 2009-08-31 20:43 (32 days old)
References : http://marc.info/?l=linux-kernel&m=125175143918050&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14114
Subject : Tuning a saa7134 based card is broken in kernel 2.6.31-rc7
Submitter : Tsvety Petrov <[email protected]>
Date : 2009-09-03 21:06 (29 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14090
Subject : WARNING: at fs/notify/inotify/inotify_user.c:394
Submitter : Joerg Platte <[email protected]>
Date : 2009-08-30 15:21 (33 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14070
Subject : lockdep warning triggered by dup_fd
Submitter : Bart Van Assche <[email protected]>
Date : 2009-08-23 09:36 (40 days old)
References : http://lkml.org/lkml/2009/8/23/8
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14058
Subject : Oops in fsnotify
Submitter : Grant Wilson <[email protected]>
Date : 2009-08-20 15:48 (43 days old)
References : http://marc.info/?l=linux-kernel&m=125078450923133&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14013
Subject : hd don't show up
Submitter : Tim Blechmann <[email protected]>
Date : 2009-08-14 8:26 (49 days old)
References : http://marc.info/?l=linux-kernel&m=125023842514480&w=4
Handled-By : Tejun Heo <[email protected]>
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13987
Subject : Received NMI interrupt at resume
Submitter : Christian Casteyde <[email protected]>
Date : 2009-08-15 07:55 (48 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13950
Subject : Oops when USB Serial disconnected while in use
Submitter : Bruno Prémont <[email protected]>
Date : 2009-08-08 17:47 (55 days old)
References : http://marc.info/?l=linux-kernel&m=124975432900466&w=4
Handled-By : Alan Stern <[email protected]>
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13943
Subject : WARNING: at net/mac80211/mlme.c:2292 with ath5k
Submitter : Fabio Comolli <[email protected]>
Date : 2009-08-06 20:15 (57 days old)
References : http://marc.info/?l=linux-kernel&m=124958978600600&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13942
Subject : Troubles with AoE and uninitialized object
Submitter : Bruno Prémont <[email protected]>
Date : 2009-08-04 10:12 (59 days old)
References : http://marc.info/?l=linux-kernel&m=124938117104811&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13941
Subject : x86 Geode issue
Submitter : Martin-Éric Racine <[email protected]>
Date : 2009-08-03 12:58 (60 days old)
References : http://marc.info/?l=linux-kernel&m=124930434732481&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13940
Subject : iwlagn and sky2 stopped working, ACPI-related
Submitter : Ricardo Jorge da Fonseca Marques Ferreira <[email protected]>
Date : 2009-08-07 22:33 (56 days old)
References : http://marc.info/?l=linux-kernel&m=124968457731107&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13935
Subject : 2.6.31-rcX breaks Apple MightyMouse (Bluetooth version)
Submitter : Adrian Ulrich <[email protected]>
Date : 2009-08-08 22:08 (55 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fa047e4f6fa63a6e9d0ae4d7749538830d14a343
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13906
Subject : Huawei E169 GPRS connection causes Ooops
Submitter : Clemens Eisserer <[email protected]>
Date : 2009-08-04 09:02 (59 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13869
Subject : Radeon framebuffer (w/o KMS) corruption at boot.
Submitter : Duncan <[email protected]>
Date : 2009-07-29 16:44 (65 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13836
Subject : suspend script fails, related to stdout?
Submitter : Tomas M. <[email protected]>
Date : 2009-07-17 21:24 (77 days old)
References : http://marc.info/?l=linux-kernel&m=124785853811667&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13809
Subject : oprofile: possible circular locking dependency detected
Submitter : Jerome Marchand <[email protected]>
Date : 2009-07-22 13:35 (72 days old)
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13733
Subject : 2.6.31-rc2: irq 16: nobody cared
Submitter : Niel Lambrechts <[email protected]>
Date : 2009-07-06 18:32 (88 days old)
References : http://marc.info/?l=linux-kernel&m=124690524027166&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13645
Subject : NULL pointer dereference at (null) (level2_spare_pgt)
Submitter : poornima nayak <[email protected]>
Date : 2009-06-17 17:56 (107 days old)
References : http://lkml.org/lkml/2009/6/17/194
Regressions with patches
------------------------
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14275
Subject : kernel>=2.6.31: ahci.c: do not force unconditionally sb600 to 32bit dma any more?
Submitter : gabriele balducci <[email protected]>
Date : 2009-09-30 15:02 (2 days old)
Patch : http://bugzilla.kernel.org/show_bug.cgi?id=14275#c0
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14261
Subject : e1000e jumbo frames no longer work: 'Unsupported MTU setting'
Submitter : Nix <[email protected]>
Date : 2009-09-26 11:16 (6 days old)
References : http://marc.info/?l=linux-kernel&m=125396433321342&w=4
Handled-By : Alexander Duyck <[email protected]>
Patch : http://patchwork.kernel.org/patch/50277/
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14258
Subject : Memory leak in SCSI initialization
Submitter : Tetsuo Handa <[email protected]>
Date : 2009-09-22 4:18 (10 days old)
References : http://marc.info/?l=linux-kernel&m=125359311312243&w=4
Handled-By : Michael Ellerman <[email protected]>
Patch : http://patchwork.kernel.org/patch/49258/
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14253
Subject : Oops in drivers/base/firmware_class
Submitter : Lars Ericsson <[email protected]>
Date : 2009-09-16 20:44 (16 days old)
References : http://lkml.org/lkml/2009/9/16/461
Handled-By : Frederik Deweerdt <[email protected]>
Patch : http://patchwork.kernel.org/patch/49914/
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14137
Subject : usb console regressions
Submitter : Jason Wessel <[email protected]>
Date : 2009-09-05 21:08 (27 days old)
References : http://marc.info/?l=linux-kernel&m=125218501310512&w=4
Handled-By : Jason Wessel <[email protected]>
Patch : http://patchwork.kernel.org/patch/45953/
http://patchwork.kernel.org/patch/45952/
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14017
Subject : _end symbol missing from Symbol.map
Submitter : Hannes Reinecke <[email protected]>
Date : 2009-08-13 6:45 (50 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=091e52c3551d3031343df24b573b770b4c6c72b6
References : http://marc.info/?l=linux-kernel&m=125014649102253&w=4
Handled-By : Hannes Reinecke <[email protected]>
Patch : http://marc.info/?l=linux-kernel&m=125014649102253&w=4
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13948
Subject : ath5k broken after suspend-to-ram
Submitter : Johannes Stezenbach <[email protected]>
Date : 2009-08-07 21:51 (56 days old)
References : http://marc.info/?l=linux-kernel&m=124968192727854&w=4
Handled-By : Nick Kossifidis <[email protected]>
Patch : http://patchwork.kernel.org/patch/38550/
For details, please visit the bug entries and follow the links given in
references.
As you can see, there is a Bugzilla entry for each of the listed regressions.
There is also a Bugzilla entry used for tracking the regressions introduced
between 2.6.30 and 2.6.31, unresolved as well as resolved, at:
http://bugzilla.kernel.org/show_bug.cgi?id=13615
Please let me know if there are any Bugzilla entries that should be added to
the list in there.
Thanks,
Rafael
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify whether it should still
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13645
Subject : NULL pointer dereference at (null) (level2_spare_pgt)
Submitter : poornima nayak <[email protected]>
Date : 2009-06-17 17:56 (107 days old)
References : http://lkml.org/lkml/2009/6/17/194
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14253
Subject : Oops in drivers/base/firmware_class
Submitter : Lars Ericsson <[email protected]>
Date : 2009-09-16 20:44 (16 days old)
References : http://lkml.org/lkml/2009/9/16/461
Handled-By : Frederik Deweerdt <[email protected]>
Patch : http://patchwork.kernel.org/patch/49914/
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14252
Subject : WARNING: at include/linux/skbuff.h:1382 w/ e1000
Submitter : Stephan von Krawczynski <[email protected]>
Date : 2009-09-20 11:26 (12 days old)
References : http://marc.info/?l=linux-kernel&m=125344599006033&w=4
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14251
Subject : 2.6.31: no login prompt
Submitter : Frédéric L. W. Meunier <[email protected]>
Date : 2009-09-19 22:43 (13 days old)
References : http://marc.info/?l=linux-kernel&m=125340020804711&w=4
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14254
Subject : Hibernation broken by clocksource: Save mult_orig in clocksource_disable()
Submitter : Ondrej Zary <[email protected]>
Date : 2009-09-19 19:55 (13 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c7121843685de2bf7f3afd3ae1d6a146010bf1fc
References : http://marc.info/?l=linux-kernel&m=125339012527719&w=4
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14256
Subject : kernel BUG at fs/ext3/super.c:435
Submitter : Mikael Pettersson <[email protected]>
Date : 2009-09-21 7:29 (11 days old)
References : http://marc.info/?l=linux-kernel&m=125351816109264&w=4
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14258
Subject : Memory leak in SCSI initialization
Submitter : Tetsuo Handa <[email protected]>
Date : 2009-09-22 4:18 (10 days old)
References : http://marc.info/?l=linux-kernel&m=125359311312243&w=4
Handled-By : Michael Ellerman <[email protected]>
Patch : http://patchwork.kernel.org/patch/49258/
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14255
Subject : WARNING: at drivers/char/tty_io.c:1267
Submitter : Heinz Diehl <[email protected]>
Date : 2009-09-20 11:37 (12 days old)
References : http://marc.info/?l=linux-kernel&m=125344629506309&w=4
http://lkml.org/lkml/2009/9/8/393
Handled-By : Linus Torvalds <[email protected]>
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14257
Subject : Not able to boot on 32 bit System
Submitter : Rishikesh <[email protected]>
Date : 2009-09-21 15:25 (11 days old)
References : http://marc.info/?l=linux-kernel&m=125354604314412&w=4
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14261
Subject : e1000e jumbo frames no longer work: 'Unsupported MTU setting'
Submitter : Nix <[email protected]>
Date : 2009-09-26 11:16 (6 days old)
References : http://marc.info/?l=linux-kernel&m=125396433321342&w=4
Handled-By : Alexander Duyck <[email protected]>
Patch : http://patchwork.kernel.org/patch/50277/
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14264
Subject : ehci problem - mouse dead on scroll
Submitter : Volker Armin Hemmann <[email protected]>
Date : 2009-09-12 7:46 (20 days old)
References : http://marc.info/?l=linux-kernel&m=125274202707893&w=4
Handled-By : Alan Stern <[email protected]>
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14266
Subject : regression in page writeback
Submitter : Shaohua Li <[email protected]>
Date : 2009-09-22 5:49 (10 days old)
First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=d7831a0bdf06b9f722b947bb0c205ff7d77cebd8
References : http://marc.info/?l=linux-kernel&m=125359858117176&w=4
Handled-By : Wu Fengguang <[email protected]>
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14265
Subject : ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
Submitter : Karol Lewandowski <[email protected]>
Date : 2009-09-15 12:05 (17 days old)
References : http://marc.info/?l=linux-kernel&m=125301636509517&w=4
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14270
Subject : Cannot boot on a PIII Celeron
Submitter : Michael Tokarev <[email protected]>
Date : 2009-09-28 15:26 (4 days old)
References : http://marc.info/?l=linux-kernel&m=125415160524110&w=4
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14267
Subject : Disassociating atheros wlan
Submitter : Kristoffer Ericson <[email protected]>
Date : 2009-09-24 10:16 (8 days old)
References : http://marc.info/?l=linux-kernel&m=125378723723384&w=4
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14275
Subject : kernel>=2.6.31: ahci.c: do not force unconditionally sb600 to 32bit dma any more?
Submitter : gabriele balducci <[email protected]>
Date : 2009-09-30 15:02 (2 days old)
Patch : http://bugzilla.kernel.org/show_bug.cgi?id=14275#c0
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14294
Subject : kernel BUG at drivers/ide/ide-disk.c:187
Submitter : Santiago Garcia Mantinan <[email protected]>
Date : 2009-09-30 11:05 (2 days old)
References : http://marc.info/?l=linux-kernel&m=125430926311466&w=4
Handled-By : David Miller <[email protected]>
This message has been generated automatically as a part of a report
of regressions introduced between 2.6.30 and 2.6.31.
The following bug entry is on the current list of known regressions
introduced between 2.6.30 and 2.6.31. Please verify if it still should
be listed and let me know (either way).
Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14301
Subject : WARNING: at net/ipv4/af_inet.c:154
Submitter : Ralf Hildebrandt <[email protected]>
Date : 2009-09-30 12:24 (2 days old)
References : http://marc.info/?l=linux-kernel&m=125431350218137&w=4
On Thu, 1 Oct 2009, Rafael J. Wysocki wrote:
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14255
> Subject : WARNING: at drivers/char/tty_io.c:1267
> Submitter : Heinz Diehl <[email protected]>
> Date : 2009-09-20 11:37 (12 days old)
> References : http://marc.info/?l=linux-kernel&m=125344629506309&w=4
> http://lkml.org/lkml/2009/9/8/393
> Handled-By : Linus Torvalds <[email protected]>
So the real problem here is that horrible workqueue deadlock, but I think
it turns out that we should be able to safely do a
cancel_delayed_work_sync(&tty->buf.work);
in tty_ldisc_halt(), because cancel_delayed_work_sync() should never wait
for any other work than the exact work in question. And the buf.work thing
is flush_to_ldisc(), so waiting for _that_ is safe - the problematic thing
was always waiting for the (unrelated) tty->hangup_work, which can (and
does) take the semaphore for do_tty_hangup.
So doing that synchronous version of the delayed work cancel means that we
can now rest easy after tty_ldisc_halt(), and we don't need to worry about
buf.work still being pending. We _do_ in general need to worry about
hangup_work, which will call do_tty_hangup, which will call
tty_ldisc_hangup, but that's actually the routine we are in right now, so
for the _very_ special case of tty_ldisc_hangup that is a non-issue too.
Did that sound subtle to you?
It should.
It's subtle as hell, and I don't like it, but I think that the two subtle
rules above mean that the following two-liner patch is safe - it can't
cause any new deadlocks, and getting rid of the flush_scheduled_work is
safe in this one case.
So please give it a whirl. I'm not happy about the subtlety, but I also
hope that we'll get rid of that in the long run, so as a short-term hack
this looks acceptable.
To recap:
- tty_ldisc_halt() _can_ be called under the ldisc_mutex, because while
it waits for the work, it never waits for _other_ work, and buf.work
itself doesn't need the ldisc_mutex. So no deadlock.
- The flush_scheduled_work() after tty_ldisc_halt() is normally needed to
not just flush the buf.work (which is now done by tty_ldisc_halt()
itself), but to also make sure that there isn't any hangup work
pending.
So we can't remove that in general, and the other cases will still need
to flush all scheduled work (and worry about deadlocks with
ldisc_mutex). HOWEVER, in the special case of tty_ldisc_hangup() we
know that we are inside the hangup work, and thus don't need to wait
for ourselves, so we can just get rid of it there - just nowhere else.
- The other cases of dropping the ldisc_mutex in the middle are
long-standing, and have that TTY_LDISC_CHANGING vs TTY_HUPPED hackery
to take care of the races that it opens. I'd love to get rid of that
too, but they all seem to work. And they have apparently never
triggered the WARN_ON in this bugzilla.
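The locking argument in the recap can be made concrete with a tiny userspace model. This is a hypothetical sketch, not kernel code: `struct work_model`, `LDISC_MUTEX`, and the `*_deadlocks()` helpers are invented for illustration, and locks are modeled as bits in a mask. It also assumes that hangup_work needs ldisc_mutex, because it ends up in tty_ldisc_hangup(), which takes that mutex in the patch. The model encodes only the two rules stated in the recap: waiting on a work item deadlocks if that item needs a lock the waiter holds, or if the waiter is that very item.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: each lock is one bit in a mask. */
#define LDISC_MUTEX 0x1u

/* Hypothetical descriptor: the locks a work item takes while it runs. */
struct work_model {
    const char *name;
    unsigned locks_needed;
};

/* buf.work runs flush_to_ldisc(), which does not need ldisc_mutex. */
const struct work_model buf_work = { "buf.work", 0 };
/* Assumption: hangup_work reaches tty_ldisc_hangup(), which takes ldisc_mutex. */
const struct work_model hangup_work = { "hangup_work", LDISC_MUTEX };

/* Waiting for 'w' deadlocks if 'w' is the waiter itself ('self' is NULL when
 * the waiter is not a work item) or if 'w' needs a lock the waiter holds. */
static bool wait_deadlocks(const struct work_model *w,
                           const struct work_model *self, unsigned held)
{
    return w == self || (w->locks_needed & held) != 0;
}

/* Models cancel_delayed_work_sync(): waits for one specific work item only. */
static bool cancel_sync_deadlocks(const struct work_model *target,
                                  const struct work_model *self, unsigned held)
{
    return wait_deadlocks(target, self, held);
}

/* Models flush_scheduled_work(): waits for every pending work item. */
static bool flush_all_deadlocks(const struct work_model *pending[], size_t n,
                                const struct work_model *self, unsigned held)
{
    assert(pending != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (wait_deadlocks(pending[i], self, held))
            return true;
    return false;
}
```

In this model, cancel_sync_deadlocks(&buf_work, &hangup_work, LDISC_MUTEX) is false, matching the first recap point. flush_all_deadlocks() over { &buf_work, &hangup_work } is true both for a plain caller holding LDISC_MUTEX and for hangup_work itself with no locks held, since the flush would then include waiting on itself.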
I'm not proud of this patch, and I'm not signing off on it until the
people who have seen this warning have tried it and report that it seems
to work.
Linus
---
drivers/char/tty_ldisc.c | 7 ++-----
1 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/drivers/char/tty_ldisc.c b/drivers/char/tty_ldisc.c
index aafdbae..feb5507 100644
--- a/drivers/char/tty_ldisc.c
+++ b/drivers/char/tty_ldisc.c
@@ -518,7 +518,7 @@ static void tty_ldisc_restore(struct tty_struct *tty, struct tty_ldisc *old)
static int tty_ldisc_halt(struct tty_struct *tty)
{
clear_bit(TTY_LDISC, &tty->flags);
- return cancel_delayed_work(&tty->buf.work);
+ return cancel_delayed_work_sync(&tty->buf.work);
}
/**
@@ -756,12 +756,9 @@ void tty_ldisc_hangup(struct tty_struct *tty)
* N_TTY.
*/
if (tty->driver->flags & TTY_DRIVER_RESET_TERMIOS) {
- /* Make sure the old ldisc is quiescent */
- tty_ldisc_halt(tty);
- flush_scheduled_work();
-
/* Avoid racing set_ldisc or tty_ldisc_release */
mutex_lock(&tty->ldisc_mutex);
+ tty_ldisc_halt(tty);
if (tty->ldisc) { /* Not yet closed */
/* Switch back to N_TTY */
tty_ldisc_reinit(tty);
Hello Jens,
On Thu, 2009-10-01 at 21:55 +0200, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.30 and 2.6.31.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.30 and 2.6.31. Please verify if it still should
> be listed and let me know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14133
> Subject : WARNING: at arch/x86/kernel/smp.c:117 native_smp_send_reschedule
> Submitter : Jens Axboe <[email protected]>
> Date : 2009-08-31 20:43 (32 days old)
> References : http://marc.info/?l=linux-kernel&m=125175143918050&w=4
>
>
Are you still getting this warning with the latest -tip? If so, can you do
a git bisect and identify the commit?
Thanks,
--
JSR
Hi.
I suppose this is still valid. I had to work around it by rfkill-ing
the device during the suspend process and re-enabling it at resume time.
I can try to reproduce it with 2.6.31.1 if you want.
On Thu, Oct 1, 2009 at 9:55 PM, Rafael J. Wysocki <[email protected]> wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.30 and 2.6.31.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.30 and 2.6.31. Please verify if it still should
> be listed and let me know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13943
> Subject : WARNING: at net/mac80211/mlme.c:2292 with ath5k
> Submitter : Fabio Comolli <[email protected]>
> Date : 2009-08-06 20:15 (57 days old)
> References : http://marc.info/?l=linux-kernel&m=124958978600600&w=4
>
>
>
Hello Grant,
On Thu, 2009-10-01 at 21:55 +0200, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.30 and 2.6.31.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.30 and 2.6.31. Please verify if it still should
> be listed and let me know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14058
> Subject : Oops in fsnotify
> Submitter : Grant Wilson <[email protected]>
> Date : 2009-08-20 15:48 (43 days old)
> References : http://marc.info/?l=linux-kernel&m=125078450923133&w=4
>
>
Are you still facing this oops with the latest kernel? If so, can you do
a git bisect and identify the commit?
Thanks,
--
JSR
On Fri, Oct 02 2009, Jaswinder Singh Rajput wrote:
> Hello Jens,
>
> On Thu, 2009-10-01 at 21:55 +0200, Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.30 and 2.6.31.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.30 and 2.6.31. Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14133
> > Subject : WARNING: at arch/x86/kernel/smp.c:117 native_smp_send_reschedule
> > Submitter : Jens Axboe <[email protected]>
> > Date : 2009-08-31 20:43 (32 days old)
> > References : http://marc.info/?l=linux-kernel&m=125175143918050&w=4
> >
> >
>
> Are you still getting this warning with the latest -tip? If so, can you do
> a git bisect and identify the commit?
Nope, it seems to have disappeared.
--
Jens Axboe
On 10/1/09, Rafael J. Wysocki <[email protected]> wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.30 and 2.6.31.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.30 and 2.6.31. Please verify if it still should
> be listed and let me know (either way).
>
Michael has been asked to bisect it (if possible). I can't reproduce it
in KVM, unfortunately.
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14270
> Subject : Cannot boot on a PIII Celeron
> Submitter : Michael Tokarev <[email protected]>
> Date : 2009-09-28 15:26 (4 days old)
> References : http://marc.info/?l=linux-kernel&m=125415160524110&w=4
>
>
>
On Thursday 01 October 2009, Rafael J. Wysocki wrote:
> The following bug entry is on the current list of known regressions
> introduced between 2.6.30 and 2.6.31. Please verify if it still should
> be listed and let me know (either way).
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14141
> Subject : order 2 page allocation failures in iwlagn
> Submitter : Frans Pop <[email protected]>
> Date : 2009-09-06 7:40 (26 days old)
> References : http://marc.info/?l=linux-kernel&m=125222287419691&w=4
> Handled-By : Pekka Enberg <[email protected]>
I'm not sure about this.
The error messages from failed allocations should now be much less frequent
as a result of this commit:
commit f82a924cc88a5541df1d4b9d38a0968cd077a051
Author: Reinette Chatre <[email protected]>
Date: Thu Sep 17 10:43:56 2009 -0700
iwlwifi: reduce noise when skb allocation fails
That commit is in mainline, and I'm not sure if it is important enough for
a stable update (AFAICT it's not listed for 2.6.31.2).
That commit is mostly cosmetic, but possibly the real regression is not in
iwlagn but in the way memory is freed/defragmented. That aspect was also
reported by Bartlomiej (#14016) and was extensively discussed (without a
clear conclusion) here: http://lkml.org/lkml/2009/8/26/140.
My own feeling is that Bartlomiej is correct and that something has changed
since .29 and that on average we do have fewer higher-order areas available
after the system has been in use for some time, but I can't substantiate
that. I do know that before .30 I had never seen the SKB allocation
errors.
The main problem is that it's hard to deliberately and reproducibly get the
system into a state where the errors occur.
I certainly do feel that the kernel should try to make sure higher order
allocations remain possible during system use. They are not only needed
shortly after boot: drivers can be loaded/unloaded at any time. OTOH Mel
probably does have a point that really high order GFP_ATOMIC allocations
by drivers make no sense [1].
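For readers unfamiliar with the terminology: an order-n page allocation asks the buddy allocator for 2^n physically contiguous pages, so the higher the order, the harder the request is to satisfy on a fragmented system. The sketch below only does that arithmetic. It assumes 4 KiB pages (as on x86) and valid orders 0..10 (a default MAX_ORDER of 11); both depend on architecture and configuration, and MODEL_PAGE_SIZE and MODEL_MAX_ORDER are hypothetical names, not kernel macros.

```c
#include <assert.h>

/* Assumption: 4 KiB pages, as on x86; other architectures differ. */
#define MODEL_PAGE_SIZE 4096UL
/* Assumption: orders 0..10 are valid, mirroring a default MAX_ORDER of 11. */
#define MODEL_MAX_ORDER 11u

/* Number of physically contiguous pages in an order-n allocation: 2^n. */
static unsigned long order_to_pages(unsigned order)
{
    assert(order < MODEL_MAX_ORDER);
    return 1UL << order;
}

/* Size in bytes of an order-n allocation under the page-size assumption. */
static unsigned long order_to_bytes(unsigned order)
{
    return order_to_pages(order) * MODEL_PAGE_SIZE;
}
```

Under these assumptions, the order-2 failures in iwlagn (#14141) are 4-page (16 KiB) requests, and the order:5 failure with e100 (#14265) is a 32-page (128 KiB) request.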
Anyway, I have no problems with this BR being closed.
Cheers,
FJP
[1] <[email protected]>
Cyrill Gorcunov wrote:
> On 10/1/09, Rafael J. Wysocki <[email protected]> wrote:
>> This message has been generated automatically as a part of a report
>> of regressions introduced between 2.6.30 and 2.6.31.
>>
>> The following bug entry is on the current list of known regressions
>> introduced between 2.6.30 and 2.6.31. Please verify if it still should
>> be listed and let me know (either way).
>
> Michael has been asked to bisect it (if possible). I can't reproduce it
> in KVM, unfortunately.
Yes, and that's what I'll be trying to do shortly.
I had other issues to sort out and wasn't able to
get to it in the last few days.
Also, I have a few other suspects. For example, in this .31
config I changed from bzip to lzma compression - and that's
where (or near where) the kernel reboots.
/mjt
>> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14270
>> Subject : Cannot boot on a PIII Celeron
>> Submitter : Michael Tokarev <[email protected]>
>> Date : 2009-09-28 15:26 (4 days old)
>> References : http://marc.info/?l=linux-kernel&m=125415160524110&w=4
>>
>>
>>
On Fri, Oct 02, 2009 at 11:11:52AM +0200, Frans Pop wrote:
> On Thursday 01 October 2009, Rafael J. Wysocki wrote:
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.30 and 2.6.31. Please verify if it still should
> > be listed and let me know (either way).
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14141
> > Subject : order 2 page allocation failures in iwlagn
> > Submitter : Frans Pop <[email protected]>
> > Date : 2009-09-06 7:40 (26 days old)
> > References : http://marc.info/?l=linux-kernel&m=125222287419691&w=4
> > Handled-By : Pekka Enberg <[email protected]>
>
> I'm not sure about this.
>
> The error messages from failed allocations should now be much less frequent
> as a result of this commit:
> commit f82a924cc88a5541df1d4b9d38a0968cd077a051
> Author: Reinette Chatre <[email protected]>
> Date: Thu Sep 17 10:43:56 2009 -0700
> iwlwifi: reduce noise when skb allocation fails
>
> That commit is in mainline, and I'm not sure if it is important enough for
> a stable update (AFAICT it's not listed for 2.6.31.2).
>
> That commit is mostly cosmetic, but possibly the real regression is not in
> iwlagn but in the way memory is freed/defragmented. That aspect was also
> reported by Bartlomiej (#14016) and was extensively discussed (without a
> clear conclusion) here: http://lkml.org/lkml/2009/8/26/140.
>
> My own feeling is that Bartlomiej is correct and that something has changed
> since .29 and that on average we do have fewer higher-order areas available
> after the system has been in use for some time, but I can't substantiate
> that. I do know that before .30 I had never seen the SKB allocation
> errors.
>
> The main problem is that it's hard to deliberately and reproducibly get the
> system into a state where the errors occur.
>
Apparently, Karol Lewandowski (cc added) has a reliable
reproduction case for when the firmware loading problem occurs
(http://lkml.org/lkml/2009/9/30/242). While it's not the same problem exactly,
it's probable they're related. I'm hoping the problem commit can be identified
by his bisection whenever he gets around to it.
> I certainly do feel that the kernel should try to make sure higher order
> allocations remain possible during system use. They are not only needed
> shortly after boot: drivers can be loaded/unloaded at any time. OTOH Mel
> probably does have a point that really high order GFP_ATOMIC allocations
> by drivers make no sense [1].
>
While they don't make sense, I accept that the problem is apparently
occurring more now than it did, so something has changed that is not
obvious to normal testing. Hopefully Karol will be able to help us out.
> Anyway, I have no problems with this BR being closed.
>
> Cheers,
> FJP
>
> [1] <[email protected]>
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Friday 02 October 2009, Mel Gorman wrote:
> On Fri, Oct 02, 2009 at 11:11:52AM +0200, Frans Pop wrote:
> > My own feeling is that Bartlomiej is correct and that something has
> > changed since .29 and that on average we do have fewer higher-order
> > areas available after the system has been in use for some time, but I
> > can't substantiate that. I do know that before .30 I had never seen
> > the SKB allocation errors.
> >
> > The main problem is that it's hard to deliberately and reproducibly get
> > the system into a state where the errors occur.
>
> Apparently, Karol Lewandowski (cc added) has a reliable
> reproduction case for when the firmware loading problem occurs
> (http://lkml.org/lkml/2009/9/30/242). While it's not the same problem
> exactly, it's probable they're related. I'm hoping the problem commit
> can be identified by his bisection whenever he gets around to it.
That does indeed look like a third independent report for basically the
same issue.
> [...], I accept that the problem is apparently occurring more now than it
> did, so something has changed that is not obvious to normal testing.
Cool, that's progress ;-)
Michael Tokarev wrote:
> Cyrill Gorcunov wrote:
>> On 10/1/09, Rafael J. Wysocki <[email protected]> wrote:
>>> This message has been generated automatically as a part of a report
>>> of regressions introduced between 2.6.30 and 2.6.31.
>>>
>>> The following bug entry is on the current list of known regressions
>>> introduced between 2.6.30 and 2.6.31. Please verify if it still should
>>> be listed and let me know (either way).
>>
>> Michael has been asked to bisect it (if possible). I can't reproduce it
>> in KVM, unfortunately.
>
> Yes, and that's what I'll be trying to do shortly.
> I had other issues to sort out and wasn't able to
> get to it in the last few days.
>
> Also, I have a few other suspects. For example, in this .31
> config I changed from bzip to lzma compression - and that's
> where (or near where) the kernel reboots.
And that was the problem. After switching from LZMA
to BZIP2, the kernel boots again.
Dunno if it can be treated as a regression, but it's
definitely a bug.
/mjt
>>> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14270
>>> Subject : Cannot boot on a PIII Celeron
>>> Submitter : Michael Tokarev <[email protected]>
>>> Date : 2009-09-28 15:26 (4 days old)
>>> References : http://marc.info/?l=linux-kernel&m=125415160524110&w=4
>>>
>>>
>>>
>
On 10/2/09, Michael Tokarev <[email protected]> wrote:
> Michael Tokarev wrote:
>> Cyrill Gorcunov wrote:
>>> On 10/1/09, Rafael J. Wysocki <[email protected]> wrote:
>>>> This message has been generated automatically as a part of a report
>>>> of regressions introduced between 2.6.30 and 2.6.31.
>>>>
>>>> The following bug entry is on the current list of known regressions
>>>> introduced between 2.6.30 and 2.6.31. Please verify if it still should
>>>> be listed and let me know (either way).
>>>
>>> Michael has been asked to bisect it (if possible). I can't reproduce it
>>> in KVM, unfortunately.
>>
>> Yes, and that's what I'll be trying to do shortly.
>> I had other issues to sort out and wasn't able to
>> get to it in the last few days.
>>
>> Also, I have a few other suspects. For example, in this .31
>> config I changed from bzip to lzma compression - and that's
>> where (or near where) the kernel reboots.
>
> And that was the problem. After switching from LZMA
> to BZIP2, the kernel boots again.
>
> Dunno if it can be treated as a regression, but it's
> definitely a bug.
>
> /mjt
Thanks for tracking it down, Michael!
Rafael, who is responsible for LZMA now?
Cc him please.
>
>>>> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14270
>>>> Subject : Cannot boot on a PIII Celeron
>>>> Submitter : Michael Tokarev <[email protected]>
>>>> Date : 2009-09-28 15:26 (4 days old)
>>>> References : http://marc.info/?l=linux-kernel&m=125415160524110&w=4
>>>>
>>>>
>>>>
>>
>
>
Cyrill Gorcunov wrote:
> On 10/2/09, Michael Tokarev <[email protected]> wrote:
>> Michael Tokarev wrote:
>>> Cyrill Gorcunov wrote:
>>>> On 10/1/09, Rafael J. Wysocki <[email protected]> wrote:
>>>>> This message has been generated automatically as a part of a report
>>>>> of regressions introduced between 2.6.30 and 2.6.31.
>>>>>
>>>>> The following bug entry is on the current list of known regressions
>>>>> introduced between 2.6.30 and 2.6.31. Please verify if it still should
>>>>> be listed and let me know (either way).
>>>> Michael has been asked to bisect it (if possible). I can't reproduce it
>>>> in KVM, unfortunately.
>>> Yes, and that's what I'll be trying to do shortly.
>>> I had other issues to sort out and wasn't able to
>>> get to it in the last few days.
>>>
>>> Also, I have a few other suspects. For example, in this .31
>>> config I changed from bzip to lzma compression - and that's
>>> where (or near where) the kernel reboots.
>> And that was the problem. After switching from LZMA
>> to BZIP2, the kernel boots again.
>>
>> Dunno if it can be treated as a regression, but it's
>> definitely a bug.
>
> Thanks for tracking it down, Michael!
> Rafael, who is responsible for LZMA now?
> Cc him please.
Please hold on for a while.
I switched to BZIP2 and it booted fine. I switched back to LZMA -
and that one now boots too. The original bzImage, which was built
by the same compiler from the same source using the same
options, still reboots.
So um... I'm now trying to reproduce it ;)
/mjt
Hi,
for me this bug is fixed by:
commit 42960a13001aa6df52ca9952ce996f94a744ea65
HID: completely remove apple mightymouse from blacklist
Cheers,
Jan
"Rafael J. Wysocki" <[email protected]> writes:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.30 and 2.6.31.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.30 and 2.6.31. Please verify if it still should
> be listed and let me know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13935
> Subject : 2.6.31-rcX breaks Apple MightyMouse (Bluetooth version)
> Submitter : Adrian Ulrich <[email protected]>
> Date : 2009-08-08 22:08 (55 days old)
> First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fa047e4f6fa63a6e9d0ae4d7749538830d14a343
>
Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.30 and 2.6.31.
This memory leak might exist in all releases since 23 Sep 2005.
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6f3a20242db2597312c50abc11f1e747c5d2326a
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.30 and 2.6.31. Please verify if it still should
> be listed and let me know (either way).
As of now, the patch is not yet merged into Linus's tree.
It still should be listed.
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14258
> Subject : Memory leak in SCSI initialization
> Submitter : Tetsuo Handa <[email protected]>
> Date : 2009-09-22 4:18 (10 days old)
> References : http://marc.info/?l=linux-kernel&m=125359311312243&w=4
> Handled-By : Michael Ellerman <[email protected]>
> Patch : http://patchwork.kernel.org/patch/49258/
>
Regards.
[Michael Tokarev - Fri, Oct 02, 2009 at 02:59:09PM +0400]
...
>
> Please hold on for a while.
>
> I switched to BZIP2, and it booted fine. I switched back to LZMA,
> and that one now boots too. The original bzImage, which was built
> by the same compiler from the same source using the same
> options, reboots.
>
> So um... I'm now trying to reproduce it ;)
>
> /mjt
>
OK, perhaps it was an indirect error or cosmic rays.
-- Cyrill
On Thu, 1 Oct 2009, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.30 and 2.6.31.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.30 and 2.6.31. Please verify if it still should
> be listed and let me know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13935
> Subject : 2.6.31-rcX breaks Apple MightyMouse (Bluetooth version)
> Submitter : Adrian Ulrich <[email protected]>
> Date : 2009-08-08 22:08 (55 days old)
> First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fa047e4f6fa63a6e9d0ae4d7749538830d14a343
Fixed now in Linus' tree (42960a13) and submitted for stable. Please
close.
--
Jiri Kosina
SUSE Labs, Novell Inc.
On Friday 02 October 2009, Jiri Kosina wrote:
> On Thu, 1 Oct 2009, Rafael J. Wysocki wrote:
>
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.30 and 2.6.31.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.30 and 2.6.31. Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13935
> > Subject : 2.6.31-rcX breaks Apple MightyMouse (Bluetooth version)
> > Submitter : Adrian Ulrich <[email protected]>
> > Date : 2009-08-08 22:08 (55 days old)
> > First-Bad-Commit: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fa047e4f6fa63a6e9d0ae4d7749538830d14a343
>
> Fixed now in Linus' tree (42960a13) and submitted for stable. Please
> close.
Done.
Thanks,
Rafael
On Friday 02 October 2009, Fabio Comolli wrote:
> Hi.
> I suppose this is still valid. I had to work around it by rfkill-ing
> the device during the suspend process and reenabling at resume time.
Thanks for the update.
> I can try to reproduce it with 2.6.31.1 if you want it.
In fact I'm more interested in whether or not it's still present in Linus'
tree.
Thanks,
Rafael
On Friday 02 October 2009, Jens Axboe wrote:
> On Fri, Oct 02 2009, Jaswinder Singh Rajput wrote:
> > Hello Jens,
> >
> > On Thu, 2009-10-01 at 21:55 +0200, Rafael J. Wysocki wrote:
> > > This message has been generated automatically as a part of a report
> > > of regressions introduced between 2.6.30 and 2.6.31.
> > >
> > > The following bug entry is on the current list of known regressions
> > > introduced between 2.6.30 and 2.6.31. Please verify if it still should
> > > be listed and let me know (either way).
> > >
> > >
> > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14133
> > > Subject : WARNING: at arch/x86/kernel/smp.c:117 native_smp_send_reschedule
> > > Submitter : Jens Axboe <[email protected]>
> > > Date : 2009-08-31 20:43 (32 days old)
> > > References : http://marc.info/?l=linux-kernel&m=125175143918050&w=4
> > >
> > >
> >
> > Are you still getting this warning in the latest -tip? If yes, can you do
> > a git bisect and identify the commit?
>
> Nope, it seems to have disappeared.
OK, closed.
Thanks,
Rafael
On Friday 02 October 2009, Tetsuo Handa wrote:
> Rafael J. Wysocki wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.30 and 2.6.31.
> This memory leak might exist in all releases since 23 Sep 2005.
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=6f3a20242db2597312c50abc11f1e747c5d2326a
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.30 and 2.6.31. Please verify if it still should
> > be listed and let me know (either way).
> As of now, the patch is not yet merged into Linus's tree.
> It still should be listed.
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14258
> > Subject : Memory leak in SCSI initialization
> > Submitter : Tetsuo Handa <[email protected]>
> > Date : 2009-09-22 4:18 (10 days old)
> > References : http://marc.info/?l=linux-kernel&m=125359311312243&w=4
> > Handled-By : Michael Ellerman <[email protected]>
> > Patch : http://patchwork.kernel.org/patch/49258/
Thanks for the update.
Rafael
On Thu, 01 October 2009 "Rafael J. Wysocki" <[email protected]> wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.30 and 2.6.31.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.30 and 2.6.31. Please verify if it still
> should be listed and let me know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13942
> Subject : Troubles with AoE and uninitialized object
> Submitter : Bruno Prémont <[email protected]>
> Date : 2009-08-04 10:12 (59 days old)
> References : http://marc.info/?l=linux-kernel&m=124938117104811&w=4
This should have been fixed with commits:
18d8217bc441630c3c5ec7416c5a65c69e8a0979
aoe: end barrier bios with EOPNOTSUPP
This addresses the trace on unmounting XFS
7135a71b19be1faf48b7148d77844d03bc0717d6
aoe: allocate unused request_queue for sysfs
This addresses the NULL kobject part
I think the second one made it into 2.6.31 but the first one didn't,
please double-check! I haven't seen them in stable though (which might
be worthwhile, especially for the first one).
Bruno
On Thu, 01 October 2009 "Rafael J. Wysocki" <[email protected]> wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.30 and 2.6.31.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.30 and 2.6.31. Please verify if it still
> should be listed and let me know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13950
> Subject : Oops when USB Serial disconnected while in use
> Submitter : Bruno Prémont <[email protected]>
> Date : 2009-08-08 17:47 (55 days old)
> References : http://marc.info/?l=linux-kernel&m=124975432900466&w=4
> Handled-By : Alan Stern <[email protected]>
This has been addressed with the following commits:
41bd34ddd7aa46dbc03b5bb33896e0fa8100fe7b
usb-serial: change referencing of port and serial structures
f5b0953a89fa3407fb293cc54ead7d8feec489e4
usb-serial: put subroutines in logical order
8bc2c1b2daf95029658868cb1427baea2da87139
usb-serial: change logic of serial lookups
cc56cd0157753c04a987888a2f793803df661a40
usb-serial: acquire references when a new tty is installed
7e29bb4b779f4f35385e6f21994758845bf14d23
usb-serial: fix termios initialization logic
74556123e034c8337b69a3ebac2f3a5fc0a97032
usb-serial: rename subroutines
ff8324df1187b7280e507c976777df76c73a1ef1
usb-serial: add missing tests and debug lines
320348c8d5c9b591282633ddb8959b42f7fc7a1c
usb-serial: straighten out serial_open
They went into 2.6.32-rc1 and are probably queued for 2.6.31.2 stable.
Bruno
On Thu, 1 Oct 2009, Rafael J. Wysocki wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.30 and 2.6.31.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.30 and 2.6.31. Please verify if it still should
> be listed and let me know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13942
> Subject : Troubles with AoE and uninitialized object
> Submitter : Bruno Prémont <[email protected]>
> Date : 2009-08-04 10:12 (59 days old)
> References : http://marc.info/?l=linux-kernel&m=124938117104811&w=4
>
This should be fixed with 18d8217 in Linus' tree.
On Fri, Oct 02, 2009 at 10:32:26AM +0100, Mel Gorman wrote:
> On Fri, Oct 02, 2009 at 11:11:52AM +0200, Frans Pop wrote:
> > My own feeling is that Bartlomiej is correct and that something has changed
> > since .29, and that on average we have fewer higher-order areas available
> > after the system has been in use for some time, but I can't substantiate
> > that. I do know that before .30 I had never seen the SKB allocation
> > errors.
> >
> > Main problem is that it's hard to deliberately and reproducibly get the
> > system in a state where the errors occur.
> >
>
> Apparently, Karol Lewandowski (cc added) has a reliable
> reproduction case for when the firmware loading problem occurs
> (http://lkml.org/lkml/2009/9/30/242). While it's not the same problem exactly,
> it's probable they're related. I'm hoping the problem commit can be identified
> by his bisection whenever he gets around to it.
Unfortunately, I've had little success with bisecting this problem.
I've spent a fair amount of time today trying to reproduce this problem,
but I'm unable to do so even on kernels that failed "easily" before.
Nothing has changed in hardware/software. I've changed methodology
somewhat from suspend-and-look-for-failure-on-resume to rmmod, fill
memory, modprobe-and-see-it-fail... but, well, a few days ago it failed
either way.
Damn.
On 1 Oct 2009, Rafael J. Wysocki stated:
> The following bug entry is on the current list of known regressions
> introduced between 2.6.30 and 2.6.31. Please verify if it still should
> be listed and let me know (either way).
The patch fixes it.
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14261
> Subject : e1000e jumbo frames no longer work: 'Unsupported MTU setting'
> Submitter : Nix <[email protected]>
> Date : 2009-09-26 11:16 (6 days old)
> References : http://marc.info/?l=linux-kernel&m=125396433321342&w=4
> Handled-By : Alexander Duyck <[email protected]>
> Patch : http://patchwork.kernel.org/patch/50277/
(Possibly a stable candidate? It's not in 2.6.31.2-to-be, perhaps the
only patch that isn't. ;) )
On Friday 02 October 2009, Bruno Prémont wrote:
> On Thu, 01 October 2009 "Rafael J. Wysocki" <[email protected]> wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.30 and 2.6.31.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.30 and 2.6.31. Please verify if it still
> > should be listed and let me know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13942
> > Subject : Troubles with AoE and uninitialized object
> > Submitter : Bruno Prémont <[email protected]>
> > Date : 2009-08-04 10:12 (59 days old)
> > References : http://marc.info/?l=linux-kernel&m=124938117104811&w=4
>
> This should have been fixed with commits:
>
> 18d8217bc441630c3c5ec7416c5a65c69e8a0979
> aoe: end barrier bios with EOPNOTSUPP
>
> This addresses the trace on unmounting XFS
>
>
> 7135a71b19be1faf48b7148d77844d03bc0717d6
> aoe: allocate unused request_queue for sysfs
>
> This addresses the NULL kobject part
>
>
> I think the second one made it into 2.6.31 but the first one didn't,
Yes, it did.
> please double-check! I haven't seen them in stable though (which might
> be worthwhile, especially for the first one).
Thanks, closed.
Rafael
On Friday 02 October 2009, Bruno Prémont wrote:
> On Thu, 01 October 2009 "Rafael J. Wysocki" <[email protected]> wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.30 and 2.6.31.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.30 and 2.6.31. Please verify if it still
> > should be listed and let me know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=13950
> > Subject : Oops when USB Serial disconnected while in use
> > Submitter : Bruno Prémont <[email protected]>
> > Date : 2009-08-08 17:47 (55 days old)
> > References : http://marc.info/?l=linux-kernel&m=124975432900466&w=4
> > Handled-By : Alan Stern <[email protected]>
>
> This has been addressed with the following commits:
> 41bd34ddd7aa46dbc03b5bb33896e0fa8100fe7b
> usb-serial: change referencing of port and serial structures
>
> f5b0953a89fa3407fb293cc54ead7d8feec489e4
> usb-serial: put subroutines in logical order
>
> 8bc2c1b2daf95029658868cb1427baea2da87139
> usb-serial: change logic of serial lookups
>
> cc56cd0157753c04a987888a2f793803df661a40
> usb-serial: acquire references when a new tty is installed
>
> 7e29bb4b779f4f35385e6f21994758845bf14d23
> usb-serial: fix termios initialization logic
>
> 74556123e034c8337b69a3ebac2f3a5fc0a97032
> usb-serial: rename subroutines
>
> ff8324df1187b7280e507c976777df76c73a1ef1
> usb-serial: add missing tests and debug lines
>
> 320348c8d5c9b591282633ddb8959b42f7fc7a1c
> usb-serial: straighten out serial_open
>
> They went into 2.6.32-rc1 and are probably queued for 2.6.31.2 stable.
Thanks a lot for the detailed info, bug closed.
Rafael
On Friday 02 October 2009, Nix wrote:
> On 1 Oct 2009, Rafael J. Wysocki stated:
>
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.30 and 2.6.31. Please verify if it still should
> > be listed and let me know (either way).
>
> The patch fixes it.
>
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14261
> > Subject : e1000e jumbo frames no longer work: 'Unsupported MTU setting'
> > Submitter : Nix <[email protected]>
> > Date : 2009-09-26 11:16 (6 days old)
> > References : http://marc.info/?l=linux-kernel&m=125396433321342&w=4
> > Handled-By : Alexander Duyck <[email protected]>
> > Patch : http://patchwork.kernel.org/patch/50277/
>
> (Possibly a stable candidate? It's not in 2.6.31.2-to-be, perhaps the
> only patch that isn't. ;) )
Most likely because it's not in Linus' tree yet.
[e1000e maintainers, we have a regression fix to merge, please.]
Thanks,
Rafael
Hi.
On Fri, Oct 2, 2009 at 7:17 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Friday 02 October 2009, Fabio Comolli wrote:
>> Hi.
>> I suppose this is still valid. I had to work around it by rfkill-ing
>> the device during the suspend process and reenabling at resume time.
>
> Thanks for the update.
>
>> I can try to reproduce it with 2.6.31.1 if you want it.
>
> In fact I'm more interested in whether or not it's still present in Linus'
> tree.
>
Well, I just tried 2.6.32-rc1-git3 and I have to say that it's mostly
useless on my eeepc. The warning didn't show up after resume, but it
was impossible to reassociate with my AP, and after a few attempts the
screen went blank.
I was able to poweroff the netbook using the power button but I'm a
little scared to try again.
Maybe I'll try with -rc3 or something.
> Thanks,
> Rafael
>
Regards,
Fabio
On Friday 02 October 2009, Fabio Comolli wrote:
> Hi.
>
> On Fri, Oct 2, 2009 at 7:17 PM, Rafael J. Wysocki <[email protected]> wrote:
> > On Friday 02 October 2009, Fabio Comolli wrote:
> >> Hi.
> >> I suppose this is still valid. I had to work around it by rfkill-ing
> >> the device during the suspend process and reenabling at resume time.
> >
> > Thanks for the update.
> >
> >> I can try to reproduce it with 2.6.31.1 if you want it.
> >
> > In fact I'm more interested in whether or not it's still present in Linus'
> > tree.
> >
>
> Well, I just tried 2.6.32-rc1-git3 and I have to say that it's mostly
> useless on my eeepc. The warning didn't show up after resume, but it
> was impossible to reassociate with my AP, and after a few attempts the
> screen went blank.
>
> I was able to poweroff the netbook using the power button but I'm a
> little scared to try again.
It shouldn't kill the system cold dead, so as long as you have your data
backed up, you can do some debugging IMHO.
> Maybe I'll try with -rc3 or something.
I guess you should report the issues you have at the moment. Then, it's
actually more likely that someone will take care of fixing them.
Thanks,
Rafael
On Fri, Oct 2, 2009 at 14:31, Rafael J. Wysocki <[email protected]> wrote:
> On Friday 02 October 2009, Nix wrote:
>> On 1 Oct 2009, Rafael J. Wysocki stated:
>>
>> > The following bug entry is on the current list of known regressions
>> > introduced between 2.6.30 and 2.6.31. Please verify if it still should
>> > be listed and let me know (either way).
>>
>> The patch fixes it.
>>
>> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14261
>> > Subject : e1000e jumbo frames no longer work: 'Unsupported MTU setting'
>> > Submitter : Nix <[email protected]>
>> > Date : 2009-09-26 11:16 (6 days old)
>> > References : http://marc.info/?l=linux-kernel&m=125396433321342&w=4
>> > Handled-By : Alexander Duyck <[email protected]>
>> > Patch : http://patchwork.kernel.org/patch/50277/
>>
>> (Possibly a stable candidate? It's not in 2.6.31.2-to-be, perhaps the
>> only patch that isn't. ;) )
>
> Most likely because it's not in Linus' tree yet.
>
> [e1000e maintainers, we have a regression fix to merge, please.]
>
> Thanks,
> Rafael
Sorry, I forgot to send this patch out last night. I will send it now.
--
Cheers,
Jeff
Rafael J. Wysocki a écrit :
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.30 and 2.6.31.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.30 and 2.6.31. Please verify if it still should
> be listed and let me know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14301
> Subject : WARNING: at net/ipv4/af_inet.c:154
> Submitter : Ralf Hildebrandt <[email protected]>
> Date : 2009-09-30 12:24 (2 days old)
> References : http://marc.info/?l=linux-kernel&m=125431350218137&w=4
>
>
If commit d99927f4d93f36553699573b279e0ff98ad7dea6
(net: Fix sock_wfree() race) doesn't fix this problem, then
maybe we should take a look at an old patch.
< data mining... running... output results to lkml/netdev >
Random guesses
1) : commit d55d87fdff8252d0e2f7c28c2d443aee17e9d70f
(net: Move rx skb_orphan call to where needed)
A similar problem on SCTP was fixed by commit
1bc4ee4088c9a502db0e9c87f675e61e57fa1734
(sctp: fix warning at inet_sock_destruct() while release sctp socket)
2) CORK and UDP sockets
It seems we can leave a UDP socket with a frame in sk_write_queue.
Purging this queue is done by udp_flush_pending_frames(),
which calls ip_flush_pending_frames().
But that function only calls kfree_skb(), not sk_wmem_free_skb()...
Could you try the following patch?
Thanks
[PATCH] net: UDP should not use ip_flush_pending_frames()
Now that xmit UDP messages are charged, we must take care to call the right
skb freeing function.
In case a close() is performed on a socket where a corked frame
is still queued in sk_write_queue, calling ip_flush_pending_frames()
leads to an sk_forward_alloc leak.
Reported-by: Ralf Hildebrandt <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
---
include/net/sock.h | 10 ++++++++++
include/net/tcp.h | 10 ----------
net/ipv4/tcp.c | 2 +-
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv4/udp.c | 2 +-
5 files changed, 13 insertions(+), 13 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 1621935..7c80fec 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -882,6 +882,16 @@ static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
__kfree_skb(skb);
}
+/* write queue abstraction */
+static inline void sk_write_queue_purge(struct sock *sk)
+{
+ struct sk_buff *skb;
+
+ while ((skb = __skb_dequeue(&sk->sk_write_queue)) != NULL)
+ sk_wmem_free_skb(sk, skb);
+ sk_mem_reclaim(sk);
+}
+
/* Used by processes to "lock" a socket state, so that
* interrupts and bottom half handlers won't change it
* from under us. It essentially blocks any incoming
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..4c7036a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1220,16 +1220,6 @@ static inline void tcp_put_md5sig_pool(void)
put_cpu();
}
-/* write queue abstraction */
-static inline void tcp_write_queue_purge(struct sock *sk)
-{
- struct sk_buff *skb;
-
- while ((skb = __skb_dequeue(&sk->sk_write_queue)) != NULL)
- sk_wmem_free_skb(sk, skb);
- sk_mem_reclaim(sk);
-}
-
static inline struct sk_buff *tcp_write_queue_head(struct sock *sk)
{
return skb_peek(&sk->sk_write_queue);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 64d0af6..0124f5b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1992,7 +1992,7 @@ int tcp_disconnect(struct sock *sk, int flags)
tcp_clear_xmit_timers(sk);
__skb_queue_purge(&sk->sk_receive_queue);
- tcp_write_queue_purge(sk);
+ sk_write_queue_purge(sk);
__skb_queue_purge(&tp->out_of_order_queue);
#ifdef CONFIG_NET_DMA
__skb_queue_purge(&sk->sk_async_wait_queue);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7cda24b..76e59df 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1845,7 +1845,7 @@ void tcp_v4_destroy_sock(struct sock *sk)
tcp_cleanup_congestion_control(sk);
/* Cleanup up the write buffer. */
- tcp_write_queue_purge(sk);
+ sk_write_queue_purge(sk);
/* Cleans up our, hopefully empty, out_of_order_queue. */
__skb_queue_purge(&tp->out_of_order_queue);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 6ec6a8a..58007d1 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -464,7 +464,7 @@ void udp_flush_pending_frames(struct sock *sk)
if (up->pending) {
up->len = 0;
up->pending = 0;
- ip_flush_pending_frames(sk);
+ sk_write_queue_purge(sk);
}
}
EXPORT_SYMBOL(udp_flush_pending_frames);
Eric Dumazet a écrit :
> Rafael J. Wysocki a écrit :
>> This message has been generated automatically as a part of a report
>> of regressions introduced between 2.6.30 and 2.6.31.
>>
>> The following bug entry is on the current list of known regressions
>> introduced between 2.6.30 and 2.6.31. Please verify if it still should
>> be listed and let me know (either way).
>>
>>
>> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14301
>> Subject : WARNING: at net/ipv4/af_inet.c:154
>> Submitter : Ralf Hildebrandt <[email protected]>
>> Date : 2009-09-30 12:24 (2 days old)
>> References : http://marc.info/?l=linux-kernel&m=125431350218137&w=4
>>
>>
>
> If commit d99927f4d93f36553699573b279e0ff98ad7dea6
> (net: Fix sock_wfree() race) doesn't fix this problem, then
> maybe we should take a look at an old patch.
>
> < data mining... running... output results to lkml/netdev >
>
> Random guesses
>
> 1) : commit d55d87fdff8252d0e2f7c28c2d443aee17e9d70f
> (net: Move rx skb_orphan call to where needed)
>
> A similar problem on SCTP was fixed by commit
> 1bc4ee4088c9a502db0e9c87f675e61e57fa1734
> (sctp: fix warning at inet_sock_destruct() while release sctp socket)
>
> 2) CORK and UDP sockets
> It seems we can leave a UDP socket with a frame in sk_write_queue.
> Purging this queue is done by udp_flush_pending_frames(),
> which calls ip_flush_pending_frames().
> But that function only calls kfree_skb(), not sk_wmem_free_skb()...
>
>
> Could you try the following patch?
>
Hmm, I missed the ip_cork_release(), here is an updated version.
[PATCH] net: UDP should not use ip_flush_pending_frames()
Now that xmit UDP messages are charged, we must take care to call the right
skb freeing function.
In case a close() is performed on a socket where a corked frame
is still queued in sk_write_queue, calling ip_flush_pending_frames()
leads to an sk_forward_alloc leak.
Fix this by calling sk_write_queue_purge() and ip_cork_release()
instead.
Reported-by: Ralf Hildebrandt <[email protected]>
Signed-off-by: Eric Dumazet <[email protected]>
---
include/net/ip.h | 1 +
include/net/sock.h | 10 ++++++++++
include/net/tcp.h | 10 ----------
net/ipv4/tcp.c | 2 +-
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv4/udp.c | 3 ++-
6 files changed, 15 insertions(+), 13 deletions(-)
diff --git a/include/net/ip.h b/include/net/ip.h
index 2f47e54..c8d8828 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -117,6 +117,7 @@ extern int ip_generic_getfrag(void *from, char *to, int offset, int len, int od
extern ssize_t ip_append_page(struct sock *sk, struct page *page,
int offset, size_t size, int flags);
extern int ip_push_pending_frames(struct sock *sk);
+extern void ip_cork_release(struct inet_sock *);
extern void ip_flush_pending_frames(struct sock *sk);
/* datagram.c */
diff --git a/include/net/sock.h b/include/net/sock.h
index 1621935..7c80fec 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -882,6 +882,16 @@ static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
__kfree_skb(skb);
}
+/* write queue abstraction */
+static inline void sk_write_queue_purge(struct sock *sk)
+{
+ struct sk_buff *skb;
+
+ while ((skb = __skb_dequeue(&sk->sk_write_queue)) != NULL)
+ sk_wmem_free_skb(sk, skb);
+ sk_mem_reclaim(sk);
+}
+
/* Used by processes to "lock" a socket state, so that
* interrupts and bottom half handlers won't change it
* from under us. It essentially blocks any incoming
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 03a49c7..4c7036a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1220,16 +1220,6 @@ static inline void tcp_put_md5sig_pool(void)
put_cpu();
}
-/* write queue abstraction */
-static inline void tcp_write_queue_purge(struct sock *sk)
-{
- struct sk_buff *skb;
-
- while ((skb = __skb_dequeue(&sk->sk_write_queue)) != NULL)
- sk_wmem_free_skb(sk, skb);
- sk_mem_reclaim(sk);
-}
-
static inline struct sk_buff *tcp_write_queue_head(struct sock *sk)
{
return skb_peek(&sk->sk_write_queue);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 64d0af6..0124f5b 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1992,7 +1992,7 @@ int tcp_disconnect(struct sock *sk, int flags)
tcp_clear_xmit_timers(sk);
__skb_queue_purge(&sk->sk_receive_queue);
- tcp_write_queue_purge(sk);
+ sk_write_queue_purge(sk);
__skb_queue_purge(&tp->out_of_order_queue);
#ifdef CONFIG_NET_DMA
__skb_queue_purge(&sk->sk_async_wait_queue);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7cda24b..76e59df 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1845,7 +1845,7 @@ void tcp_v4_destroy_sock(struct sock *sk)
tcp_cleanup_congestion_control(sk);
/* Cleanup up the write buffer. */
- tcp_write_queue_purge(sk);
+ sk_write_queue_purge(sk);
/* Cleans up our, hopefully empty, out_of_order_queue. */
__skb_queue_purge(&tp->out_of_order_queue);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 6ec6a8a..b6370d0 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -464,7 +464,8 @@ void udp_flush_pending_frames(struct sock *sk)
if (up->pending) {
up->len = 0;
up->pending = 0;
- ip_flush_pending_frames(sk);
+ sk_write_queue_purge(sk);
+ ip_cork_release(inet_sk(sk));
}
}
EXPORT_SYMBOL(udp_flush_pending_frames);
Hi Rafael.
On Fri, Oct 2, 2009 at 11:42 PM, Rafael J. Wysocki <[email protected]> wrote:
> On Friday 02 October 2009, Fabio Comolli wrote:
>> Hi.
>>
>> On Fri, Oct 2, 2009 at 7:17 PM, Rafael J. Wysocki <[email protected]> wrote:
>> > On Friday 02 October 2009, Fabio Comolli wrote:
>> >> Hi.
>> >> I suppose this is still valid. I had to work around it by rfkill-ing
>> >> the device during the suspend process and reenabling at resume time.
>> >
>> > Thanks for the update.
>> >
>> >> I can try to reproduce it with 2.6.31.1 if you want it.
>> >
>> > In fact I'm more interested in whether or not it's still present in Linus'
>> > tree.
>> >
>>
>> Well, I just tried 2.6.32-rc1-git3 and I have to say that it's mostly
>> useless on my eeepc. The warning didn't show up after resume, but it
>> was impossible to reassociate with my AP, and after a few attempts the
>> screen went blank.
>>
>> I was able to poweroff the netbook using the power button but I'm a
>> little scared to try again.
>
> It shouldn't kill the system cold dead, so as long as you have your data
> backed up, you can do some debugging IMHO.
>
>> Maybe I'll try with -rc3 or something.
>
> I guess you should report the issues you have at the moment. Then, it's
> actually more likely that someone will take care of fixing them.
>
OK. This is what I've been able to come up with so far:
* with 2.6.31.x the warning shows up on more or less every suspend-to-ram cycle;
* with 2.6.32-rc the warning never shows up;
* with 2.6.31.x, when the warning shows up, wifi is unusable until an rfkill cycle;
* with 2.6.32-rc, wifi is unusable after a suspend-to-ram cycle, and
an rfkill cycle does not cure it unless I do rfkill off, suspend-to-ram,
resume, rfkill on. This seems to work.
When wifi does not work in 2.6.32-rc the messages show:
---------------------------------------------
[ 49.647233] wlan0: direct probe to AP xx:xx:xx:xx:xx:xx (try 1)
[ 49.649234] wlan0: direct probe responded
[ 49.649244] wlan0: authenticate with AP xx:xx:xx:xx:xx:xx (try 1)
[ 49.650546] wlan0: authenticated
[ 49.650581] wlan0: associate with AP xx:xx:xx:xx:xx:xx (try 1)
[ 49.652183] wlan0: RX AssocResp from xx:xx:xx:xx:xx:xx (capab=0x451 status=12 aid=1)
[ 49.652190] wlan0: AP denied association (code=12)
---------------------------------------------
The script I feed to pm-utils to have wifi work across the
suspend-to-ram cycle is just:
---------------------------------------------
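#!/bin/sh
# pm-utils sleep hook: switch the eeepc wifi rfkill off before
# suspend/hibernate and restore its previous state on resume/thaw.
RFKILL_DIR=/sys/devices/platform/eeepc/rfkill
RFKILL=$(basename "$RFKILL_DIR"/*)
STATE="$RFKILL_DIR/$RFKILL/state"
case "$1" in
	hibernate|suspend)
		# remember the current state, then switch the radio off
		cat "$STATE" > /tmp/suspend
		echo 0 > "$STATE"
		;;
	thaw|resume)
		# restore the state saved before suspend
		cat /tmp/suspend > "$STATE"
		;;
	*)
		# NA ("not applicable", 254) normally comes from pm-utils'
		# pm-functions; fall back to that value if it is unset
		exit "${NA:-254}"
		;;
esac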
---------------------------------------------
I can confirm that with 2.6.32-rc the screen sometimes flickers badly
after resume, for example when running a simple dmesg command. However,
nothing is written to the logs, neither messages nor Xorg.0.log.
The chipset is i915.
Hope this helps. Please note that git is not an option for me on this machine.
> Thanks,
> Rafael
Regards,
Fabio
Eric Dumazet a écrit :
> Eric Dumazet a écrit :
>> Rafael J. Wysocki a écrit :
>>> This message has been generated automatically as a part of a report
>>> of regressions introduced between 2.6.30 and 2.6.31.
>>>
>>> The following bug entry is on the current list of known regressions
>>> introduced between 2.6.30 and 2.6.31. Please verify if it still should
>>> be listed and let me know (either way).
>>>
>>>
>>> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14301
>>> Subject : WARNING: at net/ipv4/af_inet.c:154
>>> Submitter : Ralf Hildebrandt <[email protected]>
>>> Date : 2009-09-30 12:24 (2 days old)
>>> References : http://marc.info/?l=linux-kernel&m=125431350218137&w=4
>>>
>>>
>> If commit d99927f4d93f36553699573b279e0ff98ad7dea6
>> (net: Fix sock_wfree() race) doesn't fix this problem, then
>> maybe we should take a look at an old patch.
>>
>> < data mining... running... output results to lkml/netdev >
>>
>> Random guesses
>>
>> 1) : commit d55d87fdff8252d0e2f7c28c2d443aee17e9d70f
>> (net: Move rx skb_orphan call to where needed)
>>
>> A similar problem on SCTP was fixed by commit
>> 1bc4ee4088c9a502db0e9c87f675e61e57fa1734
>> (sctp: fix warning at inet_sock_destruct() while release sctp socket)
>>
>> 2) CORK and UDP sockets
>> It seems we can leave a UDP socket with a frame in sk_write_queue.
>> Purge of this queue is done by udp_flush_pending_frames(),
>> which calls ip_flush_pending_frames().
>> But this function only calls kfree_skb(), not sk_wmem_free_skb()...
>>
>>
>> Could you try following patch ?
>>
>
> Hmm, I missed the ip_cork_release(), here is an updated version.
>
Please ignore this patch; I was wrong. sk_forward_alloc is not used
on the xmit side for UDP, only on the receive side, so CORK/UDP should be fine.
Investigation still needed...
Michael Tokarev wrote:
> Cyrill Gorcunov wrote:
>> On 10/2/09, Michael Tokarev <[email protected]> wrote:
>>> Michael Tokarev wrote:
>>>> Cyrill Gorcunov wrote:
>>>>> On 10/1/09, Rafael J. Wysocki <[email protected]> wrote:
>>>>>> This message has been generated automatically as a part of a report
>>>>>> of regressions introduced between 2.6.30 and 2.6.31.
>>>>>>
>>>>>> The following bug entry is on the current list of known regressions
>>>>>> introduced between 2.6.30 and 2.6.31. Please verify if it still
>>>>>> should
>>>>>> be listed and let me know (either way).
>>>>> Michael has been asked to bisect it (if possible). I can't reproduce it
>>>>> in kvm, unfortunately.
>>>> Yes, and that's what I'll be trying to do shortly.
>>>> I had other issues to sort out and wasn't able to
>>>> get to it in the last few days.
>>>>
>>>> Also, I have a few other suspects. For example, in this .31
>>>> config I changed from bzip2 to LZMA compression, and that's
>>>> where (or near where) the kernel is rebooting.
>>> And that was the problem. After switching from LZMA
>>> to BZIP2, the kernel boots again.
>>>
>>> Dunno if it can be treated as a regression, but it's
>>> definitely a bug.
>>
>> thanks for tracking it down Michael!
>> Rafael, who is responsible for LZMA now?
>> Cc him please.
>
> Please hold on for a while.
>
> I switched to BZIP2, it booted fine. I switched back to LZMA -
> and that one now boots too. The original bzImage, which was built
> by the same compiler from the same source using the same
> options, still reboots.
>
> So um... I'm now trying to reproduce it ;)
I performed about 20 kernel recompiles, and finally have some "statistics".
The problem is almost reproducible, in the sense that I was able to get 6
more cases behaving the same way (rebooting right at early boot on the Celeron).
And all the "non-working" cases were built with ccache. I.e., about half of the ~25
compiles were done with ccache, and 7 of the resulting kernels are buggy. Not a
single failure without ccache so far.
Maybe it's some stale .o file cached by ccache (and it indeed looks like
that). I haven't tried to clear the cache yet (but my guess is that I
won't be able to reproduce the issue with a clean cache anymore).
What puzzles me most is the "failure mode". The difference between the
two processors is minimal. Having a corrupt .o file and an almost-working
kernel is almost impossible on its own. And hitting exactly this difference
with a corrupt .o file is... unbelievable.
So I'm declaring it's a false alarm for now, and closing the bug.
/mjt
[Michael Tokarev - Sun, Oct 04, 2009 at 04:14:44PM +0400]
...
>>
>> I switched to BZIP2, it booted fine. I switched back to LZMA -
>> and that one now boots too. The original bzImage, which was built
>> by the same compiler from the same source using the same
>> options, still reboots.
>>
>> So um... I'm now trying to reproduce it ;)
>
> I performed about 20 kernel recompiles, and finally have some "statistics".
> The problem is almost reproducible, in the sense that I was able to get 6
> more cases behaving the same way (rebooting right at early boot on the Celeron).
> And all the "non-working" cases were built with ccache. I.e., about half of the ~25
> compiles were done with ccache, and 7 of the resulting kernels are buggy. Not a
> single failure without ccache so far.
>
> Maybe it's some stale .o file cached by ccache (and it indeed looks like
> that). I haven't tried to clear the cache yet (but my guess is that I
> won't be able to reproduce the issue with a clean cache anymore).
>
> What puzzles me most is the "failure mode". The difference between the
> two processors is minimal. Having a corrupt .o file and an almost-working
> kernel is almost impossible on its own. And hitting exactly this difference
> with a corrupt .o file is... unbelievable.
>
> So I'm declaring it's a false alarm for now, and closing the bug.
OK, thanks for the hard work on this, Michael!
>
> /mjt
>
-- Cyrill
Rafael J. Wysocki wrote:
> The following bug entry is on the current list of known regressions
> introduced between 2.6.30 and 2.6.31. Please verify if it still should
> be listed and let me know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14256
> Subject : kernel BUG at fs/ext3/super.c:435
> Submitter : Mikael Pettersson <[email protected]>
> Date : 2009-09-21 7:29 (11 days old)
> References : http://marc.info/?l=linux-kernel&m=125351816109264&w=4
The exact same bug (same cause, same symptom) just hit me again in 2.6.32-rc1.
On Fri, Oct 02, 2009 at 10:01:43PM +0200, Karol Lewandowski wrote:
> On Fri, Oct 02, 2009 at 10:32:26AM +0100, Mel Gorman wrote:
> > Apparently, Karol Lewandowski (cc added) has a reliable
> > reproduction case for when the firmware loading problem occurs
> > (http://lkml.org/lkml/2009/9/30/242). While it's not the same problem exactly,
> > it's probable they're related. I'm hoping the problem commit can be identified
> > by his bisection whenever he gets around to it.
>
> Unfortunately, I've had little success with bisecting this problem.
> I've spent a fair amount of time today trying to reproduce this problem,
> but I'm unable to do so even on kernels that failed "easily" before.
I've been able to reproduce this problem on 2.6.32-rc1. No "luck"
with bisecting, though.
On Sunday 04 October 2009, Mikael Pettersson wrote:
> Rafael J. Wysocki wrote:
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.30 and 2.6.31. Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14256
> > Subject : kernel BUG at fs/ext3/super.c:435
> > Submitter : Mikael Pettersson <[email protected]>
> > Date : 2009-09-21 7:29 (11 days old)
> > References : http://marc.info/?l=linux-kernel&m=125351816109264&w=4
>
> The exact same bug (same cause, same symptom) just hit me again in 2.6.32-rc1.
Thanks for the update.
Could you check the current Linus' tree, please? There are some known
regression fixes in there.
Best,
Rafael
Rafael J. Wysocki writes:
> On Sunday 04 October 2009, Mikael Pettersson wrote:
> > Rafael J. Wysocki wrote:
> > > The following bug entry is on the current list of known regressions
> > > introduced between 2.6.30 and 2.6.31. Please verify if it still should
> > > be listed and let me know (either way).
> > >
> > >
> > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14256
> > > Subject : kernel BUG at fs/ext3/super.c:435
> > > Submitter : Mikael Pettersson <[email protected]>
> > > Date : 2009-09-21 7:29 (11 days old)
> > > References : http://marc.info/?l=linux-kernel&m=125351816109264&w=4
> >
> > The exact same bug (same cause, same symptom) just hit me again in 2.6.32-rc1.
>
> Thanks for the update.
>
> Could you check the current Linus' tree, please? There are some known
> regression fixes in there.
I tried simplified versions of the bug trigger on two machines
running 2.6.32-rc1-git6, and neither triggered the kernel bug.
The original recipe involved rebuilding glibc, running its test
suite, installing it, and rebooting. Today, however, machine 1 was already
doing a rebuild, so after the rebuild it rebooted into the new
kernel before the install. The second machine booted the new kernel
directly to install the binary packages from the first machine.
I'll re-run the full bug trigger recipe on a third machine later next
week (it must rebuild glibc itself anyway due to arch differences).
On Thu, Oct 1, 2009 at 12:56 PM, Rafael J. Wysocki <[email protected]> wrote:
> This message has been generated automatically as a part of a report
> of regressions introduced between 2.6.30 and 2.6.31.
>
> The following bug entry is on the current list of known regressions
> introduced between 2.6.30 and 2.6.31. Please verify if it still should
> be listed and let me know (either way).
>
>
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14267
> Subject : Disassociating atheros wlan
> Submitter : Kristoffer Ericson <[email protected]>
> Date : 2009-09-24 10:16 (8 days old)
> References : http://marc.info/?l=linux-kernel&m=125378723723384&w=4
>
>
>
Sorry for the delay
(spent some time in bodie).
Yes, it should still be open.
--
Justin P. Mattock
On Friday 02 October 2009, Frans Pop wrote:
> On Thursday 01 October 2009, Rafael J. Wysocki wrote:
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14141
> > Subject : order 2 page allocation failures in iwlagn
> > Submitter : Frans Pop <[email protected]>
> > Date : 2009-09-06 7:40 (26 days old)
> > References : http://marc.info/?l=linux-kernel&m=125222287419691&w=4
> > Handled-By : Pekka Enberg <[email protected]>
>
> I'm not sure about this.
>
> There should now be far fewer error messages from failed allocations as a
> result of this commit:
> commit f82a924cc88a5541df1d4b9d38a0968cd077a051
> Author: Reinette Chatre <[email protected]>
> Date: Thu Sep 17 10:43:56 2009 -0700
> iwlwifi: reduce noise when skb allocation fails
>
> That commit is in mainline, and I'm not sure if it is important enough
> for a stable update (AFAICT it's not listed for 2.6.31.2).
>
> That commit is mostly cosmetic, but possibly the real regression is not
> in iwlagn but in the way memory is freed/defragmented. That aspect was
> also reported by Bartlomiej (#14016) and was extensively discussed
> (without a clear conclusion) here: http://lkml.org/lkml/2009/8/26/140.
I may be getting somewhere with this. I just got the allocation failures
included below on .32-rc3. Note that these are not the "fixable" failures
that got suppressed with the commit referenced above, but the "this could
affect networking" failures that are still reported.
What I was doing when I got them is also interesting:
- a kernel build
- a gitk for the kernel tree (with full history this uses ~40% of memory)
- by mistake I then started a _second_ gitk
The second gitk (which shows as 'wish8.5' in top) caused a massive swap
out which brought the system to a standstill for a while (with huge
latencies as well, including a completely stuck mouse cursor, which
happens rarely).
The system has 2GB RAM + 2GB swap, so IIUC there is no danger of getting
into an OOM as the first gitk can be swapped out completely.
I'll dig into this a bit more as it looks like this should be reproducible,
probably even without the kernel build. Next step is to see how .30 behaves
in the same situation.
Even if it is reproducible with .30, I wonder if the kernel shouldn't be
more robust in this situation. Currently it seems to allow one single
process to claim so much memory before swapping out that "normal" operation
of other processes is affected. I can understand that such a situation may
be hard to avoid on a very busy system where multiple processes start
claiming (a lot of) memory at roughly the same time, but I'd say it should
be avoidable if a single process is the culprit.
BTW, the system recovered completely, although that took some time (the
first gitk remained visible in top long after I closed its window; I think
because the system was busy swapping it back in before terminating it).
Cheers,
FJP
kcryptd: page allocation failure. order:2, mode:0x4020
Pid: 1483, comm: kcryptd Not tainted 2.6.32-rc3 #22
Call Trace:
<IRQ> [<ffffffff8107c3d5>] __alloc_pages_nodemask+0x5a2/0x5ec
[<ffffffff81264892>] ? _spin_unlock+0x9/0xb
[<ffffffff811e73cd>] ? __alloc_skb+0x3c/0x15b
[<ffffffffa03202cb>] ? iwl_rx_allocate+0x8f/0x305 [iwlcore]
[<ffffffff8107c431>] __get_free_pages+0x12/0x41
[<ffffffff8109cb1a>] __kmalloc_track_caller+0x3b/0xed
[<ffffffff811e73f7>] __alloc_skb+0x66/0x15b
[<ffffffffa03202cb>] iwl_rx_allocate+0x8f/0x305 [iwlcore]
[<ffffffffa0320557>] iwl_rx_replenish_now+0x16/0x23 [iwlcore]
[<ffffffffa035c0c8>] iwl_rx_handle+0x3a8/0x3c1 [iwlagn]
[<ffffffff81051add>] ? sched_clock_local+0x1c/0x80
[<ffffffffa035c60d>] iwl_irq_tasklet_legacy+0x52c/0x7a4 [iwlagn]
[<ffffffffa0317aaf>] ? __iwl_read32+0xa5/0xb4 [iwlcore]
[<ffffffff8103efb8>] tasklet_action+0x71/0xbc
[<ffffffff8103f837>] __do_softirq+0x96/0x11b
[<ffffffff8100cabc>] call_softirq+0x1c/0x28
[<ffffffff8100e5ef>] do_softirq+0x33/0x6b
[<ffffffff8103f5c5>] irq_exit+0x36/0x75
[<ffffffff8100dcf1>] do_IRQ+0xa3/0xba
[<ffffffff8100c353>] ret_from_intr+0x0/0xa
<EOI> [<ffffffff811199dd>] ? scatterwalk_start+0x11/0x19
[<ffffffff8111bbca>] ? blkcipher_walk_first+0x173/0x196
[<ffffffff8111b67b>] ? blkcipher_walk_done+0xe6/0x1b8
[<ffffffff8111bc35>] ? blkcipher_walk_virt+0x1a/0x1d
[<ffffffffa02001cf>] ? crypto_cbc_encrypt+0x43/0x18e [cbc]
[<ffffffff81127efd>] ? blk_recount_segments+0x1b/0x2c
[<ffffffffa021371e>] ? aes_encrypt+0x0/0xf [aes_x86_64]
[<ffffffff8111af64>] ? async_encrypt+0x38/0x3a
[<ffffffffa01f7b54>] ? crypt_convert+0x1f9/0x28b [dm_crypt]
[<ffffffffa01f8009>] ? kcryptd_crypt+0x423/0x449 [dm_crypt]
[<ffffffffa01f7be6>] ? kcryptd_crypt+0x0/0x449 [dm_crypt]
[<ffffffff81049bfd>] ? worker_thread+0x146/0x1d8
[<ffffffff8104d706>] ? autoremove_wake_function+0x0/0x38
[<ffffffff81049ab7>] ? worker_thread+0x0/0x1d8
[<ffffffff8104d3f4>] ? kthread+0x7d/0x85
[<ffffffff8100c9ba>] ? child_rip+0xa/0x20
[<ffffffff8104d377>] ? kthread+0x0/0x85
[<ffffffff8100c9b0>] ? child_rip+0x0/0x20
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 171
CPU 1: hi: 186, btch: 31 usd: 177
active_anon:298532 inactive_anon:100163 isolated_anon:52
active_file:3993 inactive_file:4001 isolated_file:12
unevictable:399 dirty:0 writeback:76102 unstable:0 buffer:125
free:14107 slab_reclaimable:4510 slab_unreclaimable:20421
mapped:7949 shmem:0 pagetables:4437 bounce:0
DMA free:7928kB min:40kB low:48kB high:60kB active_anon:3340kB inactive_anon:3608kB active_file:384kB
inactive_file:472kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15336kB mlocked:0kB
dirty:0kB writeback:80kB mapped:256kB shmem:0kB slab_reclaimable:12kB slab_unreclaimable:104kB kernel_stack:0kB
pagetables:16kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 1976 1976 1976
DMA32 free:48500kB min:5664kB low:7080kB high:8496kB active_anon:1190788kB inactive_anon:397044kB active_file:15588kB
inactive_file:15532kB unevictable:1596kB isolated(anon):208kB isolated(file):48kB present:2023748kB mlocked:1596kB
dirty:0kB writeback:304328kB mapped:31540kB shmem:0kB slab_reclaimable:18028kB slab_unreclaimable:81496kB kernel_stack:1672kB
pagetables:17732kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 19*4kB 13*8kB 3*16kB 7*32kB 11*64kB 11*128kB 5*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 7940kB
DMA32: 9299*4kB 1341*8kB 4*16kB 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 48500kB
98572 total pagecache pages
90213 pages in swap cache
Swap cache stats: add 175874, delete 85661, find 7850/8731
Free swap = 1425944kB
Total swap = 2097144kB
518064 pages RAM
10350 pages reserved
82388 pages shared
437481 pages non-shared
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
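The failure header above packs two numbers worth decoding. `order:2` is a request for 2^2 = 4 physically contiguous pages (16 KiB with 4 KiB pages), and `mode` is the allocation's gfp mask. The sketch below decodes both; the flag values in its table are an assumption taken from 2.6.3x-era `include/linux/gfp.h` and should be checked against the exact tree. Under that assumption, 0x4020 decodes to `__GFP_HIGH | __GFP_COMP`, i.e. GFP_ATOMIC plus compound-page metadata and no `__GFP_WAIT`, so the allocator may not sleep to reclaim memory; that matches the iwlagn "GFP_ATOMIC" message. `decode_gfp()`, `gfp_can_sleep()` and `order_to_bytes()` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* gfp bit values ASSUMED from 2.6.3x include/linux/gfp.h;
 * verify against the exact kernel tree before trusting the names. */
#define GFP_WAIT_BIT 0x10u

struct gfp_flag {
	unsigned int bit;
	const char *name;
};

static const struct gfp_flag gfp_flags[] = {
	{ 0x01u,   "__GFP_DMA" },
	{ 0x02u,   "__GFP_HIGHMEM" },
	{ 0x04u,   "__GFP_DMA32" },
	{ 0x08u,   "__GFP_MOVABLE" },
	{ 0x10u,   "__GFP_WAIT" },
	{ 0x20u,   "__GFP_HIGH" },
	{ 0x40u,   "__GFP_IO" },
	{ 0x80u,   "__GFP_FS" },
	{ 0x100u,  "__GFP_COLD" },
	{ 0x200u,  "__GFP_NOWARN" },
	{ 0x400u,  "__GFP_REPEAT" },
	{ 0x800u,  "__GFP_NOFAIL" },
	{ 0x1000u, "__GFP_NORETRY" },
	{ 0x4000u, "__GFP_COMP" },
	{ 0x8000u, "__GFP_ZERO" },
};

/* Bytes in an order-N block: 2^order pages of page_size bytes each. */
static unsigned long order_to_bytes(unsigned int order, unsigned long page_size)
{
	assert(order < 32);
	return page_size << order;
}

/* Print the name of every known flag set in mask and return the bits
 * the table does not explain (0 means the mask was fully decoded). */
static unsigned int decode_gfp(unsigned int mask)
{
	unsigned int rest = mask;

	for (size_t i = 0; i < sizeof gfp_flags / sizeof gfp_flags[0]; i++) {
		if (mask & gfp_flags[i].bit) {
			printf("%s ", gfp_flags[i].name);
			rest &= ~gfp_flags[i].bit;
		}
	}
	printf("(unknown bits: 0x%x)\n", rest);
	return rest;
}

/* Without __GFP_WAIT the caller cannot sleep, so no direct reclaim. */
static int gfp_can_sleep(unsigned int mask)
{
	return (mask & GFP_WAIT_BIT) != 0;
}
```

With that table, `decode_gfp(0x4020)` prints `__GFP_HIGH __GFP_COMP (unknown bits: 0x0)`.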
swapper: page allocation failure. order:2, mode:0x4020
Pid: 0, comm: swapper Not tainted 2.6.32-rc3 #22
Call Trace:
<IRQ> [<ffffffff8107c3d5>] __alloc_pages_nodemask+0x5a2/0x5ec
[<ffffffff81264892>] ? _spin_unlock+0x9/0xb
[<ffffffff811e73cd>] ? __alloc_skb+0x3c/0x15b
[<ffffffffa03202cb>] ? iwl_rx_allocate+0x8f/0x305 [iwlcore]
[<ffffffff8107c431>] __get_free_pages+0x12/0x41
[<ffffffff8109cb1a>] __kmalloc_track_caller+0x3b/0xed
[<ffffffff811e73f7>] __alloc_skb+0x66/0x15b
[<ffffffffa03202cb>] iwl_rx_allocate+0x8f/0x305 [iwlcore]
[<ffffffffa0320557>] iwl_rx_replenish_now+0x16/0x23 [iwlcore]
[<ffffffffa035c0c8>] iwl_rx_handle+0x3a8/0x3c1 [iwlagn]
[<ffffffffa035c60d>] iwl_irq_tasklet_legacy+0x52c/0x7a4 [iwlagn]
[<ffffffffa0317aaf>] ? __iwl_read32+0xa5/0xb4 [iwlcore]
[<ffffffff8103efb8>] tasklet_action+0x71/0xbc
[<ffffffff8103f837>] __do_softirq+0x96/0x11b
[<ffffffff8100cabc>] call_softirq+0x1c/0x28
[<ffffffff8100e5ef>] do_softirq+0x33/0x6b
[<ffffffff8103f5c5>] irq_exit+0x36/0x75
[<ffffffff8100dcf1>] do_IRQ+0xa3/0xba
[<ffffffff8100c353>] ret_from_intr+0x0/0xa
<EOI> [<ffffffffa0278ec7>] ? acpi_idle_enter_simple+0xf9/0x127 [processor]
[<ffffffffa0278ebd>] ? acpi_idle_enter_simple+0xef/0x127 [processor]
[<ffffffff811da545>] ? cpuidle_idle_call+0x8c/0xc7
[<ffffffff8100ae2e>] ? cpu_idle+0x55/0x8d
[<ffffffff8125432d>] ? rest_init+0x61/0x63
[<ffffffff81436c3e>] ? start_kernel+0x348/0x353
[<ffffffff8143629a>] ? x86_64_start_reservations+0xaa/0xae
[<ffffffff8143637f>] ? x86_64_start_kernel+0xe1/0xe8
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 171
CPU 1: hi: 186, btch: 31 usd: 155
active_anon:297901 inactive_anon:99948 isolated_anon:52
active_file:3920 inactive_file:3948 isolated_file:12
unevictable:399 dirty:0 writeback:34634 unstable:0 buffer:125
free:23390 slab_reclaimable:4510 slab_unreclaimable:11714
mapped:7819 shmem:0 pagetables:4437 bounce:0
DMA free:7908kB min:40kB low:48kB high:60kB active_anon:3340kB inactive_anon:3608kB active_file:384kB
inactive_file:472kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15336kB mlocked:0kB
dirty:0kB writeback:36kB mapped:256kB shmem:0kB slab_reclaimable:12kB slab_unreclaimable:12kB kernel_stack:0kB
pagetables:16kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 1976 1976 1976
DMA32 free:85652kB min:5664kB low:7080kB high:8496kB active_anon:1188264kB inactive_anon:396184kB active_file:15296kB
inactive_file:15320kB unevictable:1596kB isolated(anon):208kB isolated(file):48kB present:2023748kB mlocked:1596kB
dirty:0kB writeback:138500kB mapped:31020kB shmem:0kB slab_reclaimable:18028kB slab_unreclaimable:46844kB
kernel_stack:1672kB pagetables:17732kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 17*4kB 12*8kB 4*16kB 6*32kB 11*64kB 11*128kB 5*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 7908kB
DMA32: 12419*4kB 4439*8kB 1*16kB 0*32kB 1*64kB 1*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 85652kB
97616 total pagecache pages
89394 pages in swap cache
Swap cache stats: add 175906, delete 86512, find 7850/8733
Free swap = 1425864kB
Total swap = 2097144kB
518064 pages RAM
10350 pages reserved
82282 pages shared
428383 pages non-shared
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
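The per-zone free-area lines in these two reports are consistent with fragmentation rather than a plain lack of memory. Each `N*SkB` term counts free blocks of size S, and only blocks of 16kB (order 2) or larger could satisfy an order-2 request. In the first report DMA32 has 48500kB free but only 4*16kB + 1*512kB = 576kB in such blocks; in the second, 85652kB free but only 16 + 64 + 128 + 256 = 464kB. A minimal sketch of that arithmetic, assuming 4 kB pages and orders 0 to 10 as printed above; `free_kb_from_order()` and the two arrays are hypothetical names, with counts copied from the reports.

```c
#include <assert.h>

#define NR_ORDERS 11 /* orders 0..10, i.e. 4kB .. 4096kB blocks with 4 kB pages */

/* DMA32 per-order free block counts copied from the two reports above. */
static const unsigned long dma32_first[NR_ORDERS]  = { 9299, 1341, 4, 0, 0, 0, 0, 1, 0, 0, 0 };
static const unsigned long dma32_second[NR_ORDERS] = { 12419, 4439, 1, 0, 1, 1, 1, 0, 0, 0, 0 };

/* Sum the free memory (in kB) held in blocks of order >= min_order.
 * counts[o] is the number of free blocks of order o; each such block
 * is (page_kb << o) kB. */
static unsigned long free_kb_from_order(const unsigned long counts[NR_ORDERS],
					unsigned int min_order,
					unsigned long page_kb)
{
	unsigned long kb = 0;

	assert(min_order < NR_ORDERS);
	for (unsigned int o = min_order; o < NR_ORDERS; o++)
		kb += counts[o] * (page_kb << o);
	return kb;
}
```

With `min_order` 0 the helper reproduces each line's total (48500kB and 85652kB); with `min_order` 2 it gives the 576kB and 464kB figures.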
On Monday 05 October 2009, Frans Pop wrote:
> I'll dig into this a bit more as it looks like this should be
> reproducible, probably even without the kernel build. Next step is to
> see how .30 behaves in the same situation.
This looks conclusive. I tested .30 and .32-rc3 from clean reboots, only
starting gitk. The only other thing running was music playing in the
background (amarok) from an NFS share, to ensure network activity.
With .32-rc3 I got 4 SKB allocation errors while starting the *second* gitk
instance. And the system was completely frozen with music stopped until gitk
finished loading.
With .30 I was able to start *three* gitks (which meant 2 of them got
(partially) swapped out) without any allocation errors, and with the system
remaining relatively responsive. There was a short break in the music while
I started the 2nd instance, but it just continued playing afterwards. There
was also some mild latency in the mouse cursor, but nothing like the full
desktop freeze I get with .32-rc3.
With .30 I looked at /proc/buddyinfo while the 3rd gitk was being started,
and that looked fairly healthy all the time:
Node 0, zone      DMA      5      9     22     20     21     11      0      0      0      0      1
Node 0, zone    DMA32    579     67     25      8      5      1      1      0      1      1      0
Node 0, zone      DMA      5      9     22     20     21     11      0      0      0      0      1
Node 0, zone    DMA32    276     54     13     15      8     10      3      1      1      1      0
Node 0, zone      DMA      4      9     22     20     21     11      0      0      0      0      1
Node 0, zone    DMA32    119     45     24     18     12      4      5      2      1      1      0
Node 0, zone      DMA      4      9     22     20     21     11      0      0      0      0      1
Node 0, zone    DMA32    527     13      9      5      5      3      2      1      1      1      0
Node 0, zone      DMA      5      9     22     20     21     11      0      0      0      0      1
Node 0, zone    DMA32   1375     24      7      7      8      5      1      1      0      1      0
Node 0, zone      DMA      5      9     22     20     21     11      0      0      0      0      1
Node 0, zone    DMA32    329     21      3      3     17      8      5      1      0      1      0
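In /proc/buddyinfo each number after the zone name is the count of free blocks of order 0, 1, 2, ... (here orders 0 to 10). An order-2 request can be served by any free block of order 2 or higher, since larger blocks can be split, so summing those columns is a quick measure of headroom. A minimal sketch, assuming the line format shown above; `parse_buddyinfo_line()` and `blocks_from_order()` are hypothetical helpers, not kernel or procps functions. For the first DMA32 line it counts 25+8+5+1+1+0+1+1+0 = 42 such blocks, versus 5 (four 16kB plus one 512kB) in the first .32-rc3 DMA32 report. The snapshots were taken at different moments, so the comparison is only indicative.

```c
#include <assert.h>
#include <stdio.h>

#define NR_ORDERS 11 /* orders 0..10 */

/* Parse a /proc/buddyinfo line such as
 *   "Node 0, zone    DMA32    579     67     25 ..."
 * Stores the zone name (up to 15 chars) and the per-order free block
 * counts. Returns the number of counts parsed, or -1 if the
 * "Node N, zone NAME" prefix does not match. */
static int parse_buddyinfo_line(const char *line, char zone[16],
				unsigned long counts[NR_ORDERS])
{
	int node, pos = 0, n = 0;

	if (sscanf(line, " Node %d, zone %15s%n", &node, zone, &pos) != 2)
		return -1;
	for (const char *p = line + pos; n < NR_ORDERS; n++) {
		int used = 0;

		if (sscanf(p, "%lu%n", &counts[n], &used) != 1)
			break;
		p += used;
	}
	return n;
}

/* Count free blocks able to satisfy an order-min_order request: any
 * free block of order >= min_order can be split down to that size. */
static unsigned long blocks_from_order(const unsigned long counts[], int n,
				       int min_order)
{
	unsigned long blocks = 0;

	for (int o = min_order; o < n; o++)
		blocks += counts[o];
	return blocks;
}
```

Reading the file line by line with `fgets()` and feeding each line to the parser would give the same numbers for a live system.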
With .32 it was obviously impossible to get that info due to the total
freeze of the desktop. Not sure if the scheduler changes in .32 contribute
to this. Guess I could find out by doing the same test with .31.
One thing I should mention: my swap is an LVM volume that's in a VG that's
on a LUKS encrypted partition.
Does this give you enough info to go on, or should I try a bisection?
Cheers,
FJP
On Monday 05 October 2009, Frans Pop wrote:
> With .32 it was obviously impossible to get that info due to the total
> freeze of the desktop. Not sure if the scheduler changes in .32
> contribute to this. Guess I could find out by doing the same test with
> .31.
I've now tried with .31.1 as well, and there does seem to be a scheduler
component too. With .31.1 I also get the SKB allocation errors, but the
desktop freeze seems to be less severe than with .32-rc3. I would suggest
looking into that _after_ the allocation issue has been traced/solved.
I did manage to really (partially) hang up the desktop with .31.1: music
did not come back and the task manager of the KDE desktop remained frozen,
but I could still use konsole [1].
I suspect this is because I also got an OOPS in between the SKB failures:
IP: [<ffffffffa0444ea2>] rpcauth_checkverf+0x4e/0x5a [sunrpc]
PGD 77b83067 PUD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/class/power_supply/C23D/charge_full
CPU 0
Modules linked in: i915 drm i2c_algo_bit i2c_core ppdev parport_pc lp parport cpufreq_conservative
cpufreq_userspace cpufreq_stats cpufreq_powersave ipv6 nfsd exportfs nfs lockd nfs_acl auth_rpcgss
sunrpc ext2 coretemp hp_wmi acpi_cpufreq loop snd_hda_codec_analog snd_hda_intel snd_hda_codec
arc4 snd_pcm_oss snd_mixer_oss ecb snd_pcm snd_seq_dummy snd_seq_oss iwlagn iwlcore snd_seq_midi
pcmcia mac80211 snd_rawmidi usblp snd_seq_midi_event snd_seq pcspkr cfg80211 yenta_socket
rsrc_nonstatic pcmcia_core psmouse snd_timer snd_seq_device rfkill serio_raw snd soundcore
snd_page_alloc hp_accel lis3lv02d video container output wmi intel_agp input_polldev battery ac
processor button joydev evdev ext3 jbd mbcache sha256_generic aes_x86_64 aes_generic cbc usbhid hid
dm_crypt dm_mirror dm_region_hash dm_log dm_snapshot dm_mod sg sr_mod sd_mod cdrom ide_pci_generic piix
ide_core pata_acpi uhci_hcd ata_piix ohci1394 sdhci_pci sdhci mmc_core led_class ieee1394 ricoh_mmc
ata_generic ehci_hcd libata e1000e scsi_mod thermal fan thermal_sys [last unloaded: scsi_wait_scan]
Pid: 3226, comm: rpciod/0 Not tainted 2.6.31.1 #20 HP Compaq 2510p Notebook PC
RIP: 0010:[<ffffffffa0444ea2>] [<ffffffffa0444ea2>] rpcauth_checkverf+0x4e/0x5a [sunrpc]
RSP: 0018:ffff88007aafbda0 EFLAGS: 00010246
RAX: 0000000400001000 RBX: ffff88003a718e40 RCX: 0000000000000001
RDX: ffff880038b821bc RSI: ffff880038b821c8 RDI: ffff8800618358c8
RBP: ffff88007aafbdc0 R08: 0000000000000000 R09: 0000000000000000
R10: ffff880001514d80 R11: ffff8800536401f0 R12: ffff8800618358c8
R13: ffff880038b821c8 R14: ffff880037bb4bd0 R15: ffffffffa04bf52b
FS: 0000000000000000(0000) GS:ffff880001504000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000400001038 CR3: 0000000067ee5000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process rpciod/0 (pid: 3226, threadinfo ffff88007aafa000, task ffff88007c431670)
Stack:
ffff88007aafbde0 ffff880037bb4bd0 ffff8800618358c8 ffff880061835958
<0> ffff88007aafbe00 ffffffffa043e24a ffff88007c4319e0 ffff8800618358c8
<0> ffff880061835970 ffff880061835958 0000000000000000 0000000000000001
Call Trace:
[<ffffffffa043e24a>] call_decode+0x374/0x68e [sunrpc]
[<ffffffffa044430e>] __rpc_execute+0x86/0x244 [sunrpc]
[<ffffffffa04444f8>] ? rpc_async_schedule+0x0/0x12 [sunrpc]
[<ffffffffa0444508>] rpc_async_schedule+0x10/0x12 [sunrpc]
[<ffffffff81048bd5>] worker_thread+0x132/0x1ca
[<ffffffff8104c657>] ? autoremove_wake_function+0x0/0x38
[<ffffffff81048aa3>] ? worker_thread+0x0/0x1ca
[<ffffffff8104c335>] kthread+0x8f/0x97
[<ffffffff8100ca7a>] child_rip+0xa/0x20
[<ffffffff8104c2a6>] ? kthread+0x0/0x97
[<ffffffff8100ca70>] ? child_rip+0x0/0x20
Code: 30 0f b7 b7 06 01 00 00 48 89 d9 48 c7 c7 30 42 45 a0 48 8b 40 10 48 8b 50 10 31 c0 e8 73 f8 e0 e0 48 8b 43 38 4c 89 ee 4c 89 e7 <ff> 50 38 41 59 5b 41 5c 41 5d c9 c3 55 48 89 e5 41 55 49 89 f5
RIP [<ffffffffa0444ea2>] rpcauth_checkverf+0x4e/0x5a [sunrpc]
RSP <ffff88007aafbda0>
CR2: 0000000400001038
Not sure whether it's worth following up on that as a separate issue.
Cheers,
FJP
[1] KDE's task manager freezing for short periods is normal for me while
amarok is blocked by NFS. This normally only happens when I start amarok
for the first time, but it does explain how the NFS oops can have the
same effect.
On Mon, Oct 05, 2009 at 08:50:58AM +0200, Frans Pop wrote:
> On Monday 05 October 2009, Frans Pop wrote:
> > I'll dig into this a bit more as it looks like this should be
> > reproducible, probably even without the kernel build. Next step is to
> > see how .30 behaves in the same situation.
>
> This looks conclusive. I tested .30 and .32-rc3 from clean reboots, only
> starting gitk. The only other thing running was music playing in the
> background (amarok) from an NFS share, to ensure network activity.
>
> With .32-rc3 I got 4 SKB allocation errors while starting the *second* gitk
> instance. And the system was completely frozen with music stopped until gitk
> finished loading.
>
> With .30 I was able to start *three* gitks (which meant 2 of them got
> (partially) swapped out) without any allocation errors, and with the system
> remaining relatively responsive. There was a short break in the music while
> I started the 2nd instance, but it just continued playing afterwards. There
> was also some mild latency in the mouse cursor, but nothing like the full
> desktop freeze I get with .32-rc3.
>
> With .30 I looked at /proc/buddyinfo while the 3rd gitk was being started,
> and that looked fairly healthy all the time:
> Node 0, zone DMA 5 9 22 20 21 11 0 0 0 0 1
> Node 0, zone DMA32 579 67 25 8 5 1 1 0 1 1 0
> Node 0, zone DMA 5 9 22 20 21 11 0 0 0 0 1
> Node 0, zone DMA32 276 54 13 15 8 10 3 1 1 1 0
> Node 0, zone DMA 4 9 22 20 21 11 0 0 0 0 1
> Node 0, zone DMA32 119 45 24 18 12 4 5 2 1 1 0
> Node 0, zone DMA 4 9 22 20 21 11 0 0 0 0 1
> Node 0, zone DMA32 527 13 9 5 5 3 2 1 1 1 0
> Node 0, zone DMA 5 9 22 20 21 11 0 0 0 0 1
> Node 0, zone DMA32 1375 24 7 7 8 5 1 1 0 1 0
> Node 0, zone DMA 5 9 22 20 21 11 0 0 0 0 1
> Node 0, zone DMA32 329 21 3 3 17 8 5 1 0 1 0
>
> With .32 it was obviously impossible to get that info due to the total
> freeze of the desktop. Not sure if the scheduler changes in .32 contribute
> to this. Guess I could find out by doing the same test with .31.
>
> One thing I should mention: my swap is an LVM volume that's in a VG that's
> on a LUKS encrypted partition.
>
> Does this give you enough info to go on, or should I try a bisection?
>
I'll be trying to reproduce it, but it's unlikely I'll manage to
reproduce it reliably, as a specific combination of hardware may be
necessary as well. What I'm going to try is writing a module that
makes an order-5 GFP_ATOMIC allocation every second, and see whether I can
reproduce it using scenarios similar to yours, but it'll take some time with
no guarantee of success. If you could bisect it, that would be fantastic.
Thanks
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Monday 05 October 2009, Justin Mattock wrote:
> On Thu, Oct 1, 2009 at 12:56 PM, Rafael J. Wysocki <[email protected]> wrote:
> > This message has been generated automatically as a part of a report
> > of regressions introduced between 2.6.30 and 2.6.31.
> >
> > The following bug entry is on the current list of known regressions
> > introduced between 2.6.30 and 2.6.31. Please verify if it still should
> > be listed and let me know (either way).
> >
> >
> > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14267
> > Subject : Disassociating atheros wlan
> > Submitter : Kristoffer Ericson <[email protected]>
> > Date : 2009-09-24 10:16 (8 days old)
> > References : http://marc.info/?l=linux-kernel&m=125378723723384&w=4
> >
> >
> >
>
> Sorry for the delay
> (spent some time in bodie).
> Yes, it should still be open.
Thanks for the update.
Rafael
On Monday 05 October 2009, Mel Gorman wrote:
> On Mon, Oct 05, 2009 at 08:50:58AM +0200, Frans Pop wrote:
> > On Monday 05 October 2009, Frans Pop wrote:
> > > I'll dig into this a bit more as it looks like this should be
> > > reproducible, probably even without the kernel build. Next step is
> > > to see how .30 behaves in the same situation.
> >
> > This looks conclusive. I tested .30 and .32-rc3 from clean reboots, only
> > starting gitk. The only other thing running was music playing in the
> > background (amarok) from an NFS share, to ensure network activity.
> >
> > With .32-rc3 I got 4 SKB allocation errors while starting the *second*
> > gitk instance. And the system was completely frozen with music stopped
> > until gitk finished loading.
> >
> > With .30 I was able to start *three* gitks (which meant 2 of them got
> > (partially) swapped out) without any allocation errors, and with the
> > system remaining relatively responsive. There was a short break in the
> > music while I started the 2nd instance, but it just continued playing
> > afterwards. There was also some mild latency in the mouse cursor, but
> > nothing like the full desktop freeze I get with .32-rc3.
> >
> > One thing I should mention: my swap is an LVM volume that's in a VG
> > that's on a LUKS encrypted partition.
> >
> > Does this give you enough info to go on, or should I try a bisection?
>
> I'll be trying to reproduce it, but it's unlikely I'll manage to
> reproduce it reliably, as a specific combination of hardware may be
> necessary as well. What I'm going to try is writing a module that
> makes an order-5 GFP_ATOMIC allocation every second, and see whether I can
> reproduce it using scenarios similar to yours, but it'll take some time with
> no guarantee of success. If you could bisect it, that would be fantastic.
And the winner is:
2ff05b2b4eac2e63d345fc731ea151a060247f53 is first bad commit
commit 2ff05b2b4eac2e63d345fc731ea151a060247f53
Author: David Rientjes <[email protected]>
Date: Tue Jun 16 15:32:56 2009 -0700
oom: move oom_adj value from task_struct to mm_struct
I'm confident that the bisection is good. The test case was very reliable
while zooming in on the merge from akpm.
Cheers,
FJP
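For reference, the "order" in Mel's plan above is the buddy-allocator order: an order-n request asks for 2^n physically contiguous pages, so an order-5 GFP_ATOMIC allocation needs 32 contiguous pages (128 KiB, assuming 4 KiB pages). A minimal userspace sketch of that arithmetic, similar in spirit to the kernel's get_order(); the 4 KiB page size and the function names are assumptions for illustration only:

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB pages, as on x86. */
#define SKETCH_PAGE_SIZE ((size_t)4096)

/* Bytes covered by one buddy block of the given order: 2^order pages. */
static size_t order_to_bytes(unsigned int order)
{
	return SKETCH_PAGE_SIZE << order;
}

/*
 * Smallest order whose block holds at least 'size' bytes, similar in
 * spirit to the kernel's get_order(). 'size' must be non-zero.
 */
static unsigned int size_to_order(size_t size)
{
	unsigned int order = 0;

	assert(size > 0);
	while (order_to_bytes(order) < size)
		order++;
	return order;
}
```

For example, size_to_order(131072) is 5, while a request just one byte larger already needs an order-6 block.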
On Mon, 5 Oct 2009, Frans Pop wrote:
> And the winner is:
> 2ff05b2b4eac2e63d345fc731ea151a060247f53 is first bad commit
> commit 2ff05b2b4eac2e63d345fc731ea151a060247f53
> Author: David Rientjes <[email protected]>
> Date: Tue Jun 16 15:32:56 2009 -0700
>
> oom: move oom_adj value from task_struct to mm_struct
>
> I'm confident that the bisection is good. The test case was very reliable
> while zooming in on the merge from akpm.
>
I doubt it for two reasons: (i) this commit was reverted in 0753ba0 since
2.6.31-rc7 and is no longer in the kernel, and (ii) these are GFP_ATOMIC
allocations which would be unaffected by oom killer scores.
> On Mon, 5 Oct 2009, Frans Pop wrote:
>
> > And the winner is:
> > 2ff05b2b4eac2e63d345fc731ea151a060247f53 is first bad commit
> > commit 2ff05b2b4eac2e63d345fc731ea151a060247f53
> > Author: David Rientjes <[email protected]>
> > Date: Tue Jun 16 15:32:56 2009 -0700
> >
> > oom: move oom_adj value from task_struct to mm_struct
> >
> > I'm confident that the bisection is good. The test case was very reliable
> > while zooming in on the merge from akpm.
> >
>
> I doubt it for two reasons: (i) this commit was reverted in 0753ba0 since
> 2.6.31-rc7 and is no longer in the kernel, and (ii) these are GFP_ATOMIC
> allocations which would be unaffected by oom killer scores.
I agree. This patch is pretty obviously correct; it was reverted only
because of an unfortunate regression.
On Mon, Oct 05, 2009 at 05:04:55PM -0700, David Rientjes wrote:
> On Mon, 5 Oct 2009, Frans Pop wrote:
>
> > And the winner is:
> > 2ff05b2b4eac2e63d345fc731ea151a060247f53 is first bad commit
> > commit 2ff05b2b4eac2e63d345fc731ea151a060247f53
> > Author: David Rientjes <[email protected]>
> > Date: Tue Jun 16 15:32:56 2009 -0700
> >
> > oom: move oom_adj value from task_struct to mm_struct
> >
> > I'm confident that the bisection is good. The test case was very reliable
> > while zooming in on the merge from akpm.
> >
>
> I doubt it for two reasons: (i) this commit was reverted in 0753ba0 since
> 2.6.31-rc7 and is no longer in the kernel, and (ii) these are GFP_ATOMIC
> allocations which would be unaffected by oom killer scores.
>
However, the problem was reported to start showing up in 2.6.31-rc1 so
while it might not be *the* patch, it might be making the type of change
that caused more fragmentation. This patch adjusted the size of
mm_struct and maybe it was enough to change the "order" required for the
slab. Maybe there are other slabs that have changed size as well in that
timeframe.
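The threshold effect Mel describes can be illustrated with a toy model: a slab cache's page order depends on its object size, so a few extra bytes can push a cache into a higher order. This sketch is NOT SLUB's actual calculate_order() (which also weighs wasted space and other tunables); the 4 KiB page size, the min_objs parameter and the sizes in the example are arbitrary assumptions, not measured mm_struct sizes.

```c
#include <assert.h>

/* Assumption: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096u

/* Number of obj_size-byte objects that fit in a slab of 2^order pages. */
static unsigned int objs_per_slab(unsigned int obj_size, unsigned int order)
{
	assert(obj_size > 0);
	return (SKETCH_PAGE_SIZE << order) / obj_size;
}

/*
 * Toy order selection (not SLUB's calculate_order()): the smallest
 * order up to max_order whose slab holds at least min_objs objects.
 * Returns max_order + 1 when no allowed order is large enough.
 */
static unsigned int pick_slab_order(unsigned int obj_size,
				    unsigned int min_objs,
				    unsigned int max_order)
{
	unsigned int order;

	for (order = 0; order <= max_order; order++)
		if (objs_per_slab(obj_size, order) >= min_objs)
			return order;
	return max_order + 1;
}
```

With a minimum of 8 objects per slab, a 512-byte object fits in an order-0 slab, but a 520-byte object needs order 1, so each new slab for that cache becomes a two-page allocation.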
Frans, what is the size of mm_struct before and after this patch was
applied? Find it with either
grep mm_struct /proc/slabinfo
and if the information is not available there, try
cat /sys/kernel/slab/mm_struct/slab_size /sys/kernel/slab/mm_struct/order
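For the /proc/slabinfo route, the page order of a cache can be read off the <pagesperslab> column, which is a power of two; with SLUB, /sys/kernel/slab/<cache>/order gives it directly. A small parser sketch, assuming the version 2.x column layout (name, active_objs, num_objs, objsize, objperslab, pagesperslab, ...); the struct and helper names are hypothetical:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Leading columns of a /proc/slabinfo version 2.x data line. */
struct slab_line {
	char name[64];
	unsigned long active_objs;
	unsigned long num_objs;
	unsigned int objsize;
	unsigned int objperslab;
	unsigned int pagesperslab;
};

/* Returns 1 if 'line' is a data line and fills *out; 0 for headers/comments. */
static int parse_slabinfo_line(const char *line, struct slab_line *out)
{
	memset(out, 0, sizeof(*out));
	return sscanf(line, "%63s %lu %lu %u %u %u", out->name,
		      &out->active_objs, &out->num_objs, &out->objsize,
		      &out->objperslab, &out->pagesperslab) == 6;
}

/* Page order of a slab: log2(pagesperslab), which is a power of two. */
static unsigned int slab_order(unsigned int pagesperslab)
{
	unsigned int order = 0;

	assert(pagesperslab > 0);
	while ((1u << order) < pagesperslab)
		order++;
	return order;
}
```

A line whose pagesperslab column is 4 belongs to an order-2 cache; header lines such as "# name ..." are rejected because their second field is not a number.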
Thanks
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Tue, 6 Oct 2009, Mel Gorman wrote:
> > > And the winner is:
> > > 2ff05b2b4eac2e63d345fc731ea151a060247f53 is first bad commit
> > > commit 2ff05b2b4eac2e63d345fc731ea151a060247f53
> > > Author: David Rientjes <[email protected]>
> > > Date: Tue Jun 16 15:32:56 2009 -0700
> > >
> > > oom: move oom_adj value from task_struct to mm_struct
> > >
> > > I'm confident that the bisection is good. The test case was very reliable
> > > while zooming in on the merge from akpm.
> > >
> >
> > I doubt it for two reasons: (i) this commit was reverted in 0753ba0 since
> > 2.6.31-rc7 and is no longer in the kernel, and (ii) these are GFP_ATOMIC
> > allocations which would be unaffected by oom killer scores.
> >
>
> However, the problem was reported to start showing up in 2.6.31-rc1 so
> while it might not be *the* patch, it might be making the type of change
> that caused more fragmentation. This patch adjusted the size of
> mm_struct and maybe it was enough to change the "order" required for the
> slab. Maybe there are other slabs that have changed size as well in that
> timeframe.
>
> Frans, what is the size of mm_struct before and after this patch was
> applied? Find it with either
>
> grep mm_struct /proc/slabinfo
>
> and if the information is not available there, try
>
> cat /sys/kernel/slab/mm_struct/slab_size and
> /sys/kernel/slab/mm_struct/order
>
If that's the case and the problem still persists in 2.6.31-rc7 as
reported, then you'd need to compare the current slab order for both
mm_struct and signal_struct to the previously known working kernel
since the latter is where oom_adj was moved. (You'd still have to check
the former, though, to see if anything was added to mm_struct between rc1
and rc7, i.e. between the commit and its revert.)
On Tue, Oct 06, 2009 at 02:14:26AM -0700, David Rientjes wrote:
> On Tue, 6 Oct 2009, Mel Gorman wrote:
>
> > > > And the winner is:
> > > > 2ff05b2b4eac2e63d345fc731ea151a060247f53 is first bad commit
> > > > commit 2ff05b2b4eac2e63d345fc731ea151a060247f53
> > > > Author: David Rientjes <[email protected]>
> > > > Date: Tue Jun 16 15:32:56 2009 -0700
> > > >
> > > > oom: move oom_adj value from task_struct to mm_struct
> > > >
> > > > I'm confident that the bisection is good. The test case was very reliable
> > > > while zooming in on the merge from akpm.
> > > >
> > >
> > > I doubt it for two reasons: (i) this commit was reverted in 0753ba0 since
> > > 2.6.31-rc7 and is no longer in the kernel, and (ii) these are GFP_ATOMIC
> > > allocations which would be unaffected by oom killer scores.
> > >
> >
> > However, the problem was reported to start showing up in 2.6.31-rc1 so
> > while it might not be *the* patch, it might be making the type of change
> > that caused more fragmentation. This patch adjusted the size of
> > mm_struct and maybe it was enough to change the "order" required for the
> > slab. Maybe there are other slabs that have changed size as well in that
> > timeframe.
> >
> > Frans, what is the size of mm_struct before and after this patch was
> > applied? Find it with either
> >
> > grep mm_struct /proc/slabinfo
> >
> > and if the information is not available there, try
> >
> > cat /sys/kernel/slab/mm_struct/slab_size and
> > /sys/kernel/slab/mm_struct/order
> >
>
> If that's the case and the problem still persists in 2.6.31-rc7 as
> reported, then you'd need to compare the current slab order for both
> mm_struct and signal_struct to the previously known working kernel
> since the latter is where oom_adj was moved. (You'd still have to check
> the former to see if there were any mm_struct additions between rc1 and
> rc7 between the commit and revert, though.)
>
Best to just grab all of slabinfo for a poke around. I know task_struct
has increased in size since 2.6.29, but not enough on the machines I've
checked to make a difference to the order of pages requested. It might
be different on the problem machines.
--
Mel Gorman
On Tuesday 06 October 2009, David Rientjes wrote:
> On Mon, 5 Oct 2009, Frans Pop wrote:
> > And the winner is:
> > 2ff05b2b4eac2e63d345fc731ea151a060247f53 is first bad commit
> > commit 2ff05b2b4eac2e63d345fc731ea151a060247f53
> > Author: David Rientjes <[email protected]>
> > Date: Tue Jun 16 15:32:56 2009 -0700
> >
> > oom: move oom_adj value from task_struct to mm_struct
> >
> > I'm confident that the bisection is good. The test case was very
> > reliable while zooming in on the merge from akpm.
>
> I doubt it for two reasons: (i) this commit was reverted in 0753ba0
> since 2.6.31-rc7 and is no longer in the kernel, and (ii) these are
> GFP_ATOMIC allocations which would be unaffected by oom killer scores.
OK. Looks like I have been getting some false "good" results. I've been
redoing part of the bisect and am getting close to a new candidate. Will
explain further when I have that.
Cheers,
FJP
Rafael J. Wysocki wrote:
> > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14258
> > > Subject : Memory leak in SCSI initialization
> > > Submitter : Tetsuo Handa <[email protected]>
> > > Date : 2009-09-22 4:18 (10 days old)
> > > References : http://marc.info/?l=linux-kernel&m=125359311312243&w=4
> > > Handled-By : Michael Ellerman <[email protected]>
> > > Patch : http://patchwork.kernel.org/patch/49258/
>
> Thanks for the update.
http://patchwork.kernel.org/patch/49258/ should be replaced by
an updated patch at http://lkml.org/lkml/2009/10/2/335
Regards.
Eric Dumazet wrote:
> Eric Dumazet wrote:
>> Eric Dumazet wrote:
>>> Rafael J. Wysocki wrote:
>>>> This message has been generated automatically as a part of a report
>>>> of regressions introduced between 2.6.30 and 2.6.31.
>>>>
>>>> The following bug entry is on the current list of known regressions
>>>> introduced between 2.6.30 and 2.6.31. Please verify if it still should
>>>> be listed and let me know (either way).
>>>>
>>>>
>>>> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14301
>>>> Subject : WARNING: at net/ipv4/af_inet.c:154
>>>> Submitter : Ralf Hildebrandt <[email protected]>
>>>> Date : 2009-09-30 12:24 (2 days old)
>>>> References : http://marc.info/?l=linux-kernel&m=125431350218137&w=4
>>>>
>
> Investigation still needed...
>
OK, my latest (possibly buggy ???) hunch is about commit 95766fff6b9a78d1
[UDP]: Add memory accounting.
(It's a two-year-old patch, oh well...)
The problem is in udp_poll():
We check that the first frame to be dequeued from sk_receive_queue has a good checksum.
If it doesn't, we drop the frame (calling kfree_skb(skb);).
The problem is that now that we perform memory accounting on UDP, this kfree_skb()
should be done with the socket locked, but are we allowed to
call lock_sock() from this udp_poll() context?
unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait)
{
	unsigned int mask = datagram_poll(file, sock, wait);
	struct sock *sk = sock->sk;
	int is_lite = IS_UDPLITE(sk);
	/* Check for false positives due to checksum errors */
	if ((mask & POLLRDNORM) &&
	    !(file->f_flags & O_NONBLOCK) &&
	    !(sk->sk_shutdown & RCV_SHUTDOWN)) {
		struct sk_buff_head *rcvq = &sk->sk_receive_queue;
		struct sk_buff *skb;
		spin_lock_bh(&rcvq->lock);
		while ((skb = skb_peek(rcvq)) != NULL &&
		       udp_lib_checksum_complete(skb)) {
			UDP_INC_STATS_BH(sock_net(sk),
					UDP_MIB_INERRORS, is_lite);
			__skb_unlink(skb, rcvq);
<<HERE>>		kfree_skb(skb);
		}
		spin_unlock_bh(&rcvq->lock);
David, Herbert, any idea how to solve this problem ?
1) Allow false positives
Or
2) Maybe we should finally convert sk_forward_alloc to an atomic_t after all...
This would make things easier, and speed up UDP (no more need for lock_sock())
Or
3) ???
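Option 2 in rough outline: if the forward-allocation counter were atomic, the free path could return memory to it without owning the socket lock. The userspace sketch below uses C11 atomics with a compare-and-swap loop; the struct and function names are hypothetical, and it only illustrates the idea, not how sk_forward_alloc is actually managed in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-socket budget of pre-charged receive memory, in bytes. */
struct fwd_budget {
	atomic_int forward_alloc;
};

/* Consume 'size' bytes if enough is available, without taking any lock. */
static bool budget_charge(struct fwd_budget *b, int size)
{
	int old = atomic_load(&b->forward_alloc);

	assert(size >= 0);
	while (old >= size) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&b->forward_alloc, &old,
						 old - size))
			return true;
	}
	return false;
}

/* Give 'size' bytes back, e.g. when a queued buffer is freed. */
static void budget_uncharge(struct fwd_budget *b, int size)
{
	atomic_fetch_add(&b->forward_alloc, size);
}
```

The trade-off is that every charge and uncharge becomes an atomic read-modify-write on a shared cache line, instead of a plain update done under a lock the caller already holds.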
On Fri, Oct 02, 2009 at 03:13:07PM -0700, Jeff Kirsher wrote:
> >> > Patch : http://patchwork.kernel.org/patch/50277/
> >>
> > Most likely because it's not in the Linus' tree yet.
> >
> > [e1000e maintainers, we have a regression fix to merge, please.]
>
> Sorry, I forgot to send this patch out last night. I will send it now.
Do we have a status update on the progress of this patch to mainline? Thanks,
- Ted
On Wed, Oct 7, 2009 at 11:34, Theodore Tso <[email protected]> wrote:
> On Fri, Oct 02, 2009 at 03:13:07PM -0700, Jeff Kirsher wrote:
>> >> > Patch : http://patchwork.kernel.org/patch/50277/
>> >>
>> > Most likely because it's not in the Linus' tree yet.
>> >
>> > [e1000e maintainers, we have a regression fix to merge, please.]
>>
>> Sorry, I forgot to send this patch out last night. I will send it now.
>
> Do we have a status update on the progress of this patch to mainline? Thanks,
>
> - Ted
The patch has been submitted and accepted into David Miller's net-2.6
tree. I will submit the patch for 2.6.31 stable tree once it makes it
into Linus's tree later this week.
--
Cheers,
Jeff
On Wednesday 07 October 2009, Tetsuo Handa wrote:
> Rafael J. Wysocki wrote:
> > > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14258
> > > > Subject : Memory leak in SCSI initialization
> > > > Submitter : Tetsuo Handa <[email protected]>
> > > > Date : 2009-09-22 4:18 (10 days old)
> > > > References : http://marc.info/?l=linux-kernel&m=125359311312243&w=4
> > > > Handled-By : Michael Ellerman <[email protected]>
> > > > Patch : http://patchwork.kernel.org/patch/49258/
> >
> > Thanks for the update.
> http://patchwork.kernel.org/patch/49258/ should be replaced by
> an updated patch at http://lkml.org/lkml/2009/10/2/335
Thanks, updated.
Rafael
Eric Dumazet wrote:
> Eric Dumazet wrote:
>> Eric Dumazet wrote:
>>> Eric Dumazet wrote:
>>>> Rafael J. Wysocki wrote:
>>>>> This message has been generated automatically as a part of a report
>>>>> of regressions introduced between 2.6.30 and 2.6.31.
>>>>>
>>>>> The following bug entry is on the current list of known regressions
>>>>> introduced between 2.6.30 and 2.6.31. Please verify if it still should
>>>>> be listed and let me know (either way).
>>>>>
>>>>>
>>>>> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14301
>>>>> Subject : WARNING: at net/ipv4/af_inet.c:154
>>>>> Submitter : Ralf Hildebrandt <[email protected]>
>>>>> Date : 2009-09-30 12:24 (2 days old)
>>>>> References : http://marc.info/?l=linux-kernel&m=125431350218137&w=4
>>>>>
>> Investigation still needed...
>>
>
> OK, my latest (possibly buggy ???) hunch is about commit 95766fff6b9a78d1
>
> [UDP]: Add memory accounting.
>
> (It's a two-year-old patch, oh well...)
>
> The problem is in udp_poll():
>
> We check that the first frame to be dequeued from sk_receive_queue has a good checksum.
>
> If it doesn't, we drop the frame (calling kfree_skb(skb);).
>
> The problem is that now that we perform memory accounting on UDP, this kfree_skb()
> should be done with the socket locked, but are we allowed to
> call lock_sock() from this udp_poll() context?
>
It seems we can lock_sock() from udp_poll() context, so here is a patch.
[PATCH] udp: Fix udp_poll()
udp_poll() can in some circumstances drop frames with incorrect checksums.
The problem is that we now have to lock the socket while dropping frames,
or risk sk_forward_alloc corruption.
This bug has been present since commit 95766fff6b9a78d1
([UDP]: Add memory accounting.)
While we are at it, we can correct ioctl(SIOCINQ) to also drop bad frames.
Signed-off-by: Eric Dumazet <[email protected]>
---
net/ipv4/udp.c | 73 +++++++++++++++++++++++++++--------------------
1 files changed, 43 insertions(+), 30 deletions(-)
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 6ec6a8a..d0d436d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -841,6 +841,42 @@ out:
 	return ret;
 }
+
+/**
+ * first_packet_length - return length of first packet in receive queue
+ * @sk: socket
+ *
+ * Drops all bad checksum frames, until a valid one is found.
+ * Returns the length of found skb, or 0 if none is found.
+ */
+static unsigned int first_packet_length(struct sock *sk)
+{
+	struct sk_buff_head list_kill, *rcvq = &sk->sk_receive_queue;
+	struct sk_buff *skb;
+	unsigned int res;
+
+	__skb_queue_head_init(&list_kill);
+
+	spin_lock_bh(&rcvq->lock);
+	while ((skb = skb_peek(rcvq)) != NULL &&
+		udp_lib_checksum_complete(skb)) {
+		UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS,
+				 IS_UDPLITE(sk));
+		__skb_unlink(skb, rcvq);
+		__skb_queue_tail(&list_kill, skb);
+	}
+	res = skb ? skb->len : 0;
+	spin_unlock_bh(&rcvq->lock);
+
+	if (!skb_queue_empty(&list_kill)) {
+		lock_sock(sk);
+		__skb_queue_purge(&list_kill);
+		sk_mem_reclaim_partial(sk);
+		release_sock(sk);
+	}
+	return res;
+}
+
 /*
  * IOCTL requests applicable to the UDP protocol
  */
@@ -857,21 +893,16 @@ int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 	case SIOCINQ:
 	{
-		struct sk_buff *skb;
-		unsigned long amount;
+		unsigned int amount = first_packet_length(sk);
-		amount = 0;
-		spin_lock_bh(&sk->sk_receive_queue.lock);
-		skb = skb_peek(&sk->sk_receive_queue);
-		if (skb != NULL) {
+		if (amount)
 			/*
 			 * We will only return the amount
 			 * of this packet since that is all
 			 * that will be read.
 			 */
-			amount = skb->len - sizeof(struct udphdr);
-		}
-		spin_unlock_bh(&sk->sk_receive_queue.lock);
+			amount -= sizeof(struct udphdr);
+
 		return put_user(amount, (int __user *)arg);
 	}
@@ -1540,29 +1571,11 @@ unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait)
 {
 	unsigned int mask = datagram_poll(file, sock, wait);
 	struct sock *sk = sock->sk;
-	int is_lite = IS_UDPLITE(sk);
 	/* Check for false positives due to checksum errors */
-	if ((mask & POLLRDNORM) &&
-	    !(file->f_flags & O_NONBLOCK) &&
-	    !(sk->sk_shutdown & RCV_SHUTDOWN)) {
-		struct sk_buff_head *rcvq = &sk->sk_receive_queue;
-		struct sk_buff *skb;
-
-		spin_lock_bh(&rcvq->lock);
-		while ((skb = skb_peek(rcvq)) != NULL &&
-		       udp_lib_checksum_complete(skb)) {
-			UDP_INC_STATS_BH(sock_net(sk),
-					UDP_MIB_INERRORS, is_lite);
-			__skb_unlink(skb, rcvq);
-			kfree_skb(skb);
-		}
-		spin_unlock_bh(&rcvq->lock);
-
-		/* nothing to see, move along */
-		if (skb == NULL)
-			mask &= ~(POLLIN | POLLRDNORM);
-	}
+	if ((mask & POLLRDNORM) && !(file->f_flags & O_NONBLOCK) &&
+	    !(sk->sk_shutdown & RCV_SHUTDOWN) && !first_packet_length(sk))
+		mask &= ~(POLLIN | POLLRDNORM);
 	return mask;
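The patch follows a common pattern for this kind of lock conflict: while holding the queue spinlock, bad entries are only unlinked onto a private list; the spinlock is then dropped and the socket lock taken to free them and fix up the accounting. Below is a rough userspace analogue of first_packet_length() under those assumptions: two pthread mutexes stand in for rcvq->lock and lock_sock(), a plain counter stands in for the memory accounting, and all names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct pkt {
	struct pkt *next;
	unsigned int len;
	bool csum_ok;
};

struct rxq {
	pthread_mutex_t qlock;	/* stands in for the rcvq->lock spinlock */
	pthread_mutex_t owner;	/* stands in for lock_sock()/release_sock() */
	struct pkt *head;	/* receive queue, oldest first */
	long charged;		/* accounted bytes; only touched under 'owner' */
};

static void rxq_init(struct rxq *q)
{
	pthread_mutex_init(&q->qlock, NULL);
	pthread_mutex_init(&q->owner, NULL);
	q->head = NULL;
	q->charged = 0;
}

/* Charge a packet's length, then append it at the tail of the queue. */
static void rxq_enqueue(struct rxq *q, unsigned int len, bool csum_ok)
{
	struct pkt *p = malloc(sizeof(*p)), **pp;

	assert(p != NULL);
	p->next = NULL;
	p->len = len;
	p->csum_ok = csum_ok;

	pthread_mutex_lock(&q->owner);
	q->charged += len;
	pthread_mutex_unlock(&q->owner);

	pthread_mutex_lock(&q->qlock);
	for (pp = &q->head; *pp != NULL; pp = &(*pp)->next)
		;
	*pp = p;
	pthread_mutex_unlock(&q->qlock);
}

/*
 * Analogue of first_packet_length(): drop leading bad packets and
 * return the length of the first good one (0 if none). Bad packets
 * are only unlinked under qlock; freeing and uncharging happen later,
 * under 'owner', after qlock has been released.
 */
static unsigned int first_good_len(struct rxq *q)
{
	struct pkt *kill = NULL, *p;
	unsigned int res;

	pthread_mutex_lock(&q->qlock);
	while ((p = q->head) != NULL && !p->csum_ok) {
		q->head = p->next;	/* unlink from the queue */
		p->next = kill;		/* park on the private list */
		kill = p;
	}
	res = p ? p->len : 0;
	pthread_mutex_unlock(&q->qlock);

	if (kill != NULL) {
		pthread_mutex_lock(&q->owner);
		while (kill != NULL) {
			p = kill;
			kill = p->next;
			q->charged -= p->len;
			free(p);
		}
		pthread_mutex_unlock(&q->owner);
	}
	return res;
}
```

Never holding the queue lock while waiting for the owner lock keeps the two locks from being nested in this path, which is the property the kernel patch relies on, since lock_sock() may sleep and must not be called under a spinlock.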
Mikael Pettersson writes:
> Rafael J. Wysocki writes:
> > On Sunday 04 October 2009, Mikael Pettersson wrote:
> > > Rafael J. Wysocki wrote:
> > > > The following bug entry is on the current list of known regressions
> > > > introduced between 2.6.30 and 2.6.31. Please verify if it still should
> > > > be listed and let me know (either way).
> > > >
> > > >
> > > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14256
> > > > Subject : kernel BUG at fs/ext3/super.c:435
> > > > Submitter : Mikael Pettersson <[email protected]>
> > > > Date : 2009-09-21 7:29 (11 days old)
> > > > References : http://marc.info/?l=linux-kernel&m=125351816109264&w=4
> > >
> > > The exact same bug (same cause, same symptom) just hit me again in 2.6.32-rc1.
> >
> > Thanks for the update.
> >
> > Could you check the current Linus' tree, please? There are some known
> > regression fixes in there.
>
> I tried simplified versions of the bug trigger on two machines
> running 2.6.32-rc1-git6, and neither triggered the kernel bug.
>
> The original recipe involved rebuilding glibc, running its test
> suite, installing it, and rebooting. Today, however, machine 1 was already
> doing a rebuild, so after the rebuild it rebooted into the new
> kernel before the install. The second machine booted the new kernel
> directly to install the binary packages from the first machine.
>
> I'll re-run the full bug trigger recipe on a third machine later next
> week (it must rebuild glibc itself anyway due to arch differences).
Not fixed in 2.6.32-rc3. A glibc rebuild + install triggered the
exact same bug on the third machine.
/Mikael
On Friday 09 October 2009, Mikael Pettersson wrote:
> Mikael Pettersson writes:
> > Rafael J. Wysocki writes:
> > > On Sunday 04 October 2009, Mikael Pettersson wrote:
> > > > Rafael J. Wysocki wrote:
> > > > > The following bug entry is on the current list of known regressions
> > > > > introduced between 2.6.30 and 2.6.31. Please verify if it still should
> > > > > be listed and let me know (either way).
> > > > >
> > > > >
> > > > > Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14256
> > > > > Subject : kernel BUG at fs/ext3/super.c:435
> > > > > Submitter : Mikael Pettersson <[email protected]>
> > > > > Date : 2009-09-21 7:29 (11 days old)
> > > > > References : http://marc.info/?l=linux-kernel&m=125351816109264&w=4
> > > >
> > > > The exact same bug (same cause, same symptom) just hit me again in 2.6.32-rc1.
> > >
> > > Thanks for the update.
> > >
> > > Could you check the current Linus' tree, please? There are some known
> > > regression fixes in there.
> >
> > I tried simplified versions of the bug trigger on two machines
> > running 2.6.32-rc1-git6, and neither triggered the kernel bug.
> >
> > The original recipe involved rebuilding glibc, running its test
> > suite, installing it, and rebooting. Today, however, machine 1 was already
> > doing a rebuild, so after the rebuild it rebooted into the new
> > kernel before the install. The second machine booted the new kernel
> > directly to install the binary packages from the first machine.
> >
> > I'll re-run the full bug trigger recipe on a third machine later next
> > week (it must rebuild glibc itself anyway due to arch differences).
>
> Not fixed in 2.6.32-rc3. A glibc rebuild + install triggered the
> exact same bug on the third machine.
Thanks for the update.
Rafael
Sorry for going quiet on this issue for a few days, but I have been
spending *a lot* of time on it. I've done what amounts to 5 bisection
rounds at ~20 minutes per iteration and in total over 80 boots.
The problem with my first bisection was that there are *at least two*
changes at the root of this issue, both committed between .30 and .31-rc1.
Because of this a normal bisection will not lead to a reliable result and
even with my last effort I can only narrow it down to two different areas,
and not 100% to specific commits.
The two identified areas are:
1) a wireless merge which causes the SKB errors to appear in the first
place, but not always;
2) an mm merge which makes the SKB errors occur *much* quicker; IMHO this
is the change that also causes the regressions reported by Pekka and
Karol.
Below are my results. The issue is both complex and subtle. Now it's up to
you, domain experts for both mm *and* wireless/networking, to make sense of
it all and come up with suggestions on how to proceed.
I've improved my test and it's now a lot more reliable, but there are still
timing influences. Also, because this is all merge-window stuff, I'm
hitting quite a few minor and major regressions between commits that can
affect tests.
Please study the information below carefully. I know it's long, but I think
this issue justifies that.
On Monday 05 October 2009, Frans Pop wrote:
> This looks conclusive. I tested .30 and .32-rc3 from clean reboots and
> only starting gitk. I only started music playing in the background
> (amarok) from an NFS share to ensure network activity.
>
> With .32-rc3 I got 4 SKB allocation errors while starting the *second*
> gitk instance. And the system was completely frozen with music stopped
> until gitk finished loading.
With .32-rc3, .31.1 and vanilla .31 I will get multiple SKB allocation
errors the *first time* I run the test, *every* time.
> With .30 I was able to start *three* gitk's (which meant 2 of them got
> (partially) swapped out) without any allocation errors. And with the
> system remaining relatively responsive. There was a short break in the
> music while I started the 2nd instance, but it just continued playing
> afterwards. There was also some mild latency in the mouse cursor, but
> nothing like the full desktop freeze I get with .32-rc3.
With both .30.2 and vanilla .30 I have *never* been able to get any SKB
allocation errors. No matter how often I repeat the test.
So, the start and end positions are 100% reproducible. The problem is that
this changes during the bisection. At some point the test will pass (no SKB
errors) the first time I run it, but fail on the second or third
attempt.
Apparently at some point memory must already be fragmented (or higher
orders already used up) to some extent for the errors to trigger.
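To make "fragmented" concrete: an order-n allocation needs 2^n free pages that are contiguous and naturally aligned, so a large total of free pages does not guarantee that such a block exists. A toy model over a free-page bitmap, ignoring zones, migrate types and watermarks; the helper names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Count free pages in the bitmap. */
static size_t count_free_pages(const bool *free_page, size_t npages)
{
	size_t i, n = 0;

	for (i = 0; i < npages; i++)
		n += free_page[i];
	return n;
}

/*
 * Largest order n for which some naturally aligned run of 2^n pages is
 * entirely free, as a buddy allocator would need; -1 if nothing is free.
 */
static int largest_free_order(const bool *free_page, size_t npages)
{
	int best = -1;
	size_t block;

	for (block = 1; block <= npages; block <<= 1) {
		size_t start, i;
		bool found = false;

		for (start = 0; !found && start + block <= npages; start += block) {
			for (i = 0; i < block && free_page[start + i]; i++)
				;
			found = (i == block);
		}
		if (!found)
			break;	/* no aligned block of this size, so none larger */
		best++;
	}
	return best;
}
```

With 64 pages where every other page is free, half of memory is free, yet the largest available block is order 0, so even an order-1 request would fail until something is reclaimed or compacted.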
TEST METHOD
-----------
As a normal bisection (I tried 3 times...) did not lead anywhere, I had to
think of an alternative approach. I decided to start by manually selecting
merges by Linus into mainline. The advantage is that that makes the
bisection linear and makes it a lot easier to see patterns.
After narrowing down to a specific merge, I bisected (again semi-manually)
inside that merge.
Because I suspected there were multiple changes involved, I deliberately
tried to find two points:
- where do I first start seeing SKB errors at all, even if it is only at
the second or third try;
- where do I start getting SKB errors reliably on the first try.
I worked from "good" to "bad", i.e. I started at .30. The merges were not
chosen completely randomly. From the first 3 bisections I strongly
suspected the first 'net-next' merge and the first 'akpm' merge, but I did
make sure to confirm that suspicion.
TEST DESCRIPTION
----------------
The test I've ended up using is:
1) clean boot
2) start music in amarok from NFS share; use a very long song to avoid file
changes and thus ensure a steady stream of network data during the test
3) start 'gitk v2.6.29..master &' - to use up some memory
4) start first 'gitk master &' - after this all normal memory is as good as
used up, with minor swap; this never resulted in SKB errors
5) start second 'gitk master &' - this causes heavy swapping (>700 MB) and
is the real test
6) if there were no SKB errors after 5), kill the gitk processes and repeat
steps 3) to 5). I've done this up to 4 times in some cases
7) if the results are not clear or when there is doubt later, repeat from
step 1) with same kernel
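The decision rule in steps 6) and 7) matters for bisection: a kernel only counts as good after several clean runs, because a single clean run can hide a bad kernel. A sketch of that rule combined with a binary search over an ordered list of kernels; the callback, the toy model and the assumption that badness is monotonic along the list are illustrations, not part of the actual test setup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hook: true if run 'attempt' on kernel 'idx' showed SKB errors. */
typedef bool (*run_test_fn)(size_t idx, int attempt, void *ctx);

/* A kernel counts as bad as soon as any of up to max_tries runs shows errors. */
static bool kernel_is_bad(run_test_fn run, size_t idx, int max_tries, void *ctx)
{
	int attempt;

	for (attempt = 0; attempt < max_tries; attempt++)
		if (run(idx, attempt, ctx))
			return true;
	return false;
}

/*
 * First bad kernel in an ordered list of n >= 2 kernels, assuming index 0
 * is good, index n - 1 is bad, and badness is monotonic along the list.
 */
static size_t first_bad(run_test_fn run, size_t n, int max_tries, void *ctx)
{
	size_t good = 0, bad = n - 1;

	assert(n >= 2);
	while (bad - good > 1) {
		size_t mid = good + (bad - good) / 2;

		if (kernel_is_bad(run, mid, max_tries, ctx))
			bad = mid;
		else
			good = mid;
	}
	return bad;
}

/* Toy model: kernels from index *ctx onward only fail from the second run on. */
static bool toy_run(size_t idx, int attempt, void *ctx)
{
	return idx >= *(const size_t *)ctx && attempt >= 1;
}
```

With the toy model and the culprit at index 5 of 12, four tries per kernel find index 5, while a single try per kernel marks every tested kernel good and wrongly blames the last one.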
Memory after initial 'gitk v2.6.29..master &':
             total       used       free     shared    buffers     cached
Mem:       2030776    1153008     877768          0      41572     333968
-/+ buffers/cache:     777468    1253308
Swap:      2097144          0    2097144
Memory after first 'gitk master &':
             total       used       free     shared    buffers     cached
Mem:       2030776    1979040      51736          0      35684     238420
-/+ buffers/cache:    1704936     325840
Swap:      2097144      21876    2075268
Memory after second 'gitk master &' (with .30.2):
             total       used       free     shared    buffers     cached
Mem:       2030776    2011608      19168          0      21836      92336
-/+ buffers/cache:    1897436     133340
Swap:      2097144     776160    1320984
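The "-/+ buffers/cache" row in these tables is derived from the "Mem:" row: used minus buffers and cached, and free plus buffers and cached. A small check against the figures above (values in KiB, as free(1) reports by default); the struct and function names are hypothetical:

```c
#include <assert.h>

/* The "Mem:" row of free(1) output, in KiB. */
struct mem_row {
	unsigned long total, used, free, shared, buffers, cached;
};

/* "-/+ buffers/cache" used column: used minus buffers and page cache. */
static unsigned long used_without_cache(const struct mem_row *m)
{
	assert(m->used >= m->buffers + m->cached);
	return m->used - m->buffers - m->cached;
}

/* "-/+ buffers/cache" free column: free plus buffers and page cache. */
static unsigned long free_with_cache(const struct mem_row *m)
{
	return m->free + m->buffers + m->cached;
}
```

By the same units, the 776160 KiB of swap in use after the second gitk is about 758 MiB, consistent with the ">700 MB" of swapping mentioned in step 5).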
OVERVIEW OF RESULTS
-------------------
Below I list the most relevant merges and commits. Note that they are
listed in commit order; my kernel version shows the order of testing.
For the commits I tested the test results are listed on the next line.
The first number on that line consists of the test series + the iteration
(and also identifies the kernel I used).
A "+" means I got no SKB errors, a "-" that I did get them. A "|" means I
rebooted for a second series of tests.
v2.6.30-2330-gdb8e7f1 'x86-fixes-for-linus' of linux-2.6-tip
1.1 +++ iwlagn sw-error during first test
v2.6.30-4127-g0fa2133 'merge' of powerpc (last merge before net-next-2.6)
1.2 +++
v2.6.30-5398-g2ed0e21 net-next-2.6 (mega-merge!)
1.4 +- system reboot fails after testing
v2.6.30-5517-g609106b 'merge' of powerpc
1.3 +- system reboot fails after testing
v2.6.30-5927-gf83b1e6 'for-linus' of linux1394-2.6 (last merge before akpm)
2.2 ++-
v2.6.30-6111-g517d086 'akpm'
2.1 -|-
BISECTION OF net-next-2.6 MERGE
-------------------------------
Note that this merge was based not on .30 vanilla, but partly on
v2.6.30-rc1 and partly on v2.6.30-rc6.
I think this had an influence on the latencies I saw (i.e. because some
post-rc6 bug fixes were not present, the general behavior of the system
during swapping changes). For example: with v2.6.30-4127-g0fa2133 the
system remained more responsive (smaller music skips) than with
v2.6.30-rc1-1219-g82d0481.
I started again by testing merges, this time those by David.
v2.6.30-rc1-1219-g82d0481 'master' of wireless-next-2.6
1.5 ++++ bad latencies
v2.6.30-rc6-660-gbb803cf 'master' of net-2.6
v2.6.30-rc6-808-g45ea4ea 'master' of wireless-next-2.6
v2.6.30-rc6-850-gc649c0e 'master' of net-2.6
v2.6.30-rc6-922-g3f1f39c 'linux-2.6.31.y' of wimax
v2.6.30-rc6-999-gb2f8f75 'master' of net-2.6
v2.6.30-rc6-1028-ga8c617e 'net-next' of lksctp-dev
1.7 ++++|++++|++++
I went back to this one twice because the bisection inside the
next merge (see below) did not give a clear result.
v2.6.30-rc6-1103-gb1bc81a 'master' of wireless-next-2.6
1.8 +-
v2.6.30-rc6-1224-g84503dd 'master' of wireless-next-2.6
1.6 +-
So the problem started in the v2.6.30-rc6-1103-gb1bc81a merge.
I was unable to narrow it down to an exact commit; AFAICT the remaining
ones (between v2.6.30-rc6-1028-g8fc0fee and v2.6.30-rc6-1032-g7ba10a8) are
uninteresting. But it *must* be in this area!
For a good overview of the area, use 'gitk 3f1f39c4..b1bc81a0'.
v2.6.30-rc6-1028-g8fc0fee cfg80211: use key size constants
1.11 ++++
v2.6.30-rc6-1031-g1bb5633 iwmc3200wifi: fix printk format
1.14 +++- not quite conclusive...
v2.6.30-rc6-1032-g7ba10a8 mac80211: fix transposed min/max CW values
1.13 -
This is a bugfix for aa837ee1d from an earlier merge! Could this maybe
influence the test results in between? There are various SKB related
changes there, for example: dfbf97f3..e5b9215e.
v2.6.30-rc6-1037-g2c5b9e5 wireless: libertas: fix unaligned accesses
1.12 +-
v2.6.30-rc6-1044-g729e9c7 cfg80211: fix for duplicate userspace replies
1.10 +-
v2.6.30-rc6-1075-gc587de0 iwlwifi: unify station management
1.9 ++-|+-
v2.6.30-rc6-1076-gd14d444 iwl3945: port allow skb allocation in tasklet
I thought this was a prime candidate, but as you can see several commits
before failed too. Still worth looking at I think!
BISECTION of akpm (mm) MERGE
----------------------------
So here I went looking for "where does the test start failing on the first
try". Again, I was unable to narrow it down to a single commit.
For a good overview of the area, use 'gitk f83b1e61..517d0869'.
v2.6.30-5466-ga1dd268 mm: use alloc_pages_exact in alloc_large_system_hash
2.3 +-
v2.6.30-5478-ge9bb35d mm: setup_per_zone_inactive_ratio - fix comment and..
2.5 +-
v2.6.30-5486-g35282a2 migration: only migrate_prep() once per move_pages()
2.6 -|+|- not quite conclusive...
v2.6.30-5492-gbce7394 page-allocator: reset wmark_min and inactive ratio..
2.4 -|-
WHERE NEXT?
===========
I think the results confirm there is definitely an issue here and that my
test is reliable and consistent enough to show it. And as it currently is
the only test we have...
I hope that the info above is enough for the mm and wireless domain
experts to identify likely candidates in the areas I've identified.
The next step could be trying specific reverts or debug patches, either on
top of current git, or 2.6.31, or inside the identified areas.
I'll run anything you care to throw at me and will try to provide any
additional info you need, but at this point it's up to you.
Cheers,
FJP
On Monday 12 October 2009, Frans Pop wrote:
> BISECTION of akpm (mm) MERGE
> ----------------------------
> So here I went looking for "where does the test start failing on the
> first try". Again, I was unable to narrow it down to a single commit.
Note that this merge is based on mainline at v2.6.30-5415-g03347e2, so a
number of merges "drop out" once I started bisecting into this merge. But
that point is still *after* the net-next-2.6 merge, which is all that's
really relevant for this issue.
> For a good overview of the area, use 'gitk f83b1e61..517d0869'.
>
> v2.6.30-5466-ga1dd268 mm: use alloc_pages_exact in alloc_large_system_hash
> 2.3 +-
> v2.6.30-5478-ge9bb35d mm: setup_per_zone_inactive_ratio - fix comment and..
> 2.5 +-
> v2.6.30-5486-g35282a2 migration: only migrate_prep() once per move_pages()
> 2.6 -|+|- not quite conclusive...
> v2.6.30-5492-gbce7394 page-allocator: reset wmark_min and inactive ratio..
> 2.4 -|-
On Mon, Oct 12, 2009 at 01:10:25AM +0200, Frans Pop wrote:
> Sorry for going quiet on this issue for a few days, but I have been
> spending *a lot* of time on it. I've done what amounts to 5 bisection
> rounds at ~20 minutes per iteration and in total over 80 boots.
>
> The problem with my first bisection was that there are *at least two*
> changes at the root of this issue, both committed between .30 and .31-rc1.
> Because of this a normal bisection will not lead to a reliable result and
> even with my last effort I can only narrow it down to two different areas,
> and not 100% to specific commits.
>
Thanks very much for your detailed work on this.
> The two identified areas are:
> 1) a wireless merge which causes the SKB errors to appear in the first
> place, but not always;
> 2) an mm merge which makes the SKB errors occur *much* quicker; IMHO this
> is the change that also causes the regressions reported by Pekka and
> Karol.
>
> Below are my results. The issue is both complex and subtle. Now it's up to
> you, domain experts for both mm *and* wireless/networking, to make sense of
> it all and come up with suggestions on how to proceed.
>
> I've improved my test and it's now a lot more reliable, but there are still
> timing influences.
The timing influences are probably because kswapd is working from the
time memory gets full. High-order allocation failures would cause it to
start reclaiming at that order, so it's always a race to see whether it can
do its work before an atomic allocation fails.
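To make "order" concrete: an order-n allocation asks the page allocator for 2^n physically contiguous pages. The sketch below is a user-space illustration only, assuming 4 KiB pages; size_to_order() is a hypothetical helper that mirrors what the kernel's get_order() computes, not the kernel's implementation. Under that assumption, the order-2 requests mentioned later in the thread need 16 KiB of contiguous memory.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch: 4 KiB pages (PAGE_SHIFT 12, as on x86). */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/* Smallest order n such that (SKETCH_PAGE_SIZE << n) >= size, i.e. the
 * request needs 2^n physically contiguous pages. */
static unsigned int size_to_order(size_t size)
{
    size_t pages;
    unsigned int order = 0;

    assert(size > 0);
    /* Round the request up to whole pages. */
    pages = (size + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
    /* Grow the power-of-two block until it covers those pages. */
    while (((size_t)1 << order) < pages)
        order++;
    return order;
}
```

For example, size_to_order(16384) is 2 while size_to_order(16385) is 3, so a request just over a power-of-two boundary doubles the contiguous block that has to be found.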
> Also, because this is all merge-window stuff, I'm
> hitting quite a few minor and major regressions between commits that can
> affect tests.
>
> Please study the information below carefully. I know it's long, but I think
> this issue justifies that.
>
Agreed. I'll be looking at commits, both wireless and mm, but obviously
anything I say about wireless needs to be taken with a generous dose of
salt.
> On Monday 05 October 2009, Frans Pop wrote:
> > This looks conclusive. I tested .30 and .32-rc3 from clean reboots and
> > only starting gitk. I only started music playing in the background
> > (amarok) from an NFS share to ensure network activity.
> >
> > With .32-rc3 I got 4 SKB allocation errors while starting the *second*
> > gitk instance. And the system was completely frozen with music stopped
> > until gitk finished loading.
>
> With .32-rc3, .31.1 and vanilla .31 I will get multiple SKB allocation
> errors the *first time* I run the test, *every* time.
>
So, this remains a current problem that wasn't solved by accident.
> > With .30 I was able to start *three* gitk's (which meant 2 of them got
> > (partially) swapped out) without any allocation errors. And with the
> > system remaining relatively responsive. There was a short break in the
> > music while I started the 2nd instance, but it just continued playing
> > afterwards. There was also some mild latency in the mouse cursor, but
> > nothing like the full desktop freeze I get with .32-rc3.
>
> With both .30.2 and vanilla .30 I have *never* been able to get any SKB
> allocation errors. No matter how often I repeat the test.
>
> So, the start and end positions are 100% reproducible. The problem is that
> this changes during the bisection. At some point the test will pass (no SKB
> errors) the first time I run it, but it will fail on the second or third
> attempt.
> Apparently at some point memory must already be fragmented (or higher
> orders already used up) to some extent for the errors to trigger.
>
That is a reasonable assessment. It could be because
1. Something in the intervening commits greatly increases the number of
GFP_ATOMIC allocations that are occurring. It's a pity that the allocator
tracepoints are not available in those kernels. It would have made
investigating this theory easier.
2. kswapd is no longer reclaiming high-order pages as well as it used
to, due to changes in kswapd itself or in lumpy reclaim.
3. Fragmentation avoidance has been broken in some subtle manner.
I think 3 is particularly unlikely and am expecting it to be 1 or 2.
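Hypotheses 2 and 3 both hinge on the difference between free memory and free contiguous memory. The following is a toy model, not the kernel's buddy allocator: find_free_block() and count_free() are hypothetical helpers over a simple free-page map. It shows how plenty of pages can be free while a high-order request still fails, because a block must be a naturally aligned run of 2^order free pages.

```c
#include <stdbool.h>
#include <stddef.h>

/* Count free pages in a toy free-page map (true = page is free). */
static size_t count_free(const bool *free_map, size_t npages)
{
    size_t n = 0;

    for (size_t i = 0; i < npages; i++)
        n += free_map[i];
    return n;
}

/* Return the index of the first naturally aligned run of (1 << order)
 * free pages, or -1 if no such block exists. */
static long find_free_block(const bool *free_map, size_t npages,
                            unsigned int order)
{
    size_t block = (size_t)1 << order;

    /* Only aligned starting points count, as in a buddy allocator. */
    for (size_t start = 0; start + block <= npages; start += block) {
        size_t i = 0;

        while (i < block && free_map[start + i])
            i++;
        if (i == block)
            return (long)start;
    }
    return -1;
}
```

With every other page free, count_free() reports half the map as free, yet find_free_block() finds no order-1 block; two adjacent free pages that straddle an alignment boundary do not help either.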
> TEST METHOD
> -----------
> As a normal bisection (I tried 3 times...) did not lead anywhere, I had to
> think of an alternative approach. I decided to start by manually selecting
> merges by Linus into mainline. The advantage is that that makes the
> bisection linear and makes it a lot easier to see patterns.
> After narrowing down to a specific merge, I bisected (again semi-manually)
> inside that merge.
>
> Because I suspected there were multiple changes involved, I deliberately
> tried to find two points:
> - where do I first start seeing SKB errors at all, even if it is only at
> the second or third try;
> - where do I start getting SKB errors reliably on the first try.
>
> I worked from "good" to "bad", i.e. I started at .30. The merges were not
> chosen completely randomly. From the first 3 bisections I strongly
> suspected the first 'net-next' merge and the first 'akpm' merge, but I did
> make sure to confirm that suspicion.
>
A very good approach.
> TEST DESCRIPTION
> ----------------
> The test I've ended up using is:
> 1) clean boot
> 2) start music in amarok from NFS share; use very long song to avoid file
> changes and thus ensure a fluent stream of network data during the test
> 3) start 'gitk v2.6.29..master &' - to use up some memory
> 4) start first 'gitk master &' - after this all normal memory is as good as
> used up, with minor swap; this never resulted in SKB errors
> 5) start second 'gitk master &' - this causes heavy swapping (>700 MB) and
> is the real test
> 6) if there were no SKB errors after 5), kill the gitk processes and repeat
> steps 3) to 5). I've done this up to 4 times in some cases
> 7) if the results are not clear or when there is doubt later, repeat from
> step 1) with same kernel
>
> Memory after initial 'gitk v2.6.29..master &':
>              total       used       free     shared    buffers     cached
> Mem:       2030776    1153008     877768          0      41572     333968
> -/+ buffers/cache:     777468    1253308
> Swap:      2097144          0    2097144
>
> Memory after first 'gitk master &':
>              total       used       free     shared    buffers     cached
> Mem:       2030776    1979040      51736          0      35684     238420
> -/+ buffers/cache:    1704936     325840
> Swap:      2097144      21876    2075268
>
> Memory after second 'gitk master &' (with .30.2):
>              total       used       free     shared    buffers     cached
> Mem:       2030776    2011608      19168          0      21836      92336
> -/+ buffers/cache:    1897436     133340
> Swap:      2097144     776160    1320984
>
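As a cross-check on the tables above: in this (older procps) free(1) layout, the -/+ buffers/cache line is derived from the Mem: line by moving buffers and page cache from "used" to "free". A small sketch, with a hypothetical struct meminfo_kb holding the KiB figures copied from the tables, reproduces those lines:

```c
#include <assert.h>

/* KiB figures as printed on the Mem: line of free(1). */
struct meminfo_kb {
    unsigned long used, free, buffers, cached;
};

/* First column of the "-/+ buffers/cache" line: memory used by
 * processes once buffers and page cache are discounted. */
static unsigned long used_minus_cache(const struct meminfo_kb *m)
{
    assert(m->used >= m->buffers + m->cached);
    return m->used - m->buffers - m->cached;
}

/* Second column: memory that is free or reclaimable from cache. */
static unsigned long free_plus_cache(const struct meminfo_kb *m)
{
    return m->free + m->buffers + m->cached;
}
```

For the first table, 1153008 - 41572 - 333968 = 777468 KiB; for the last, 1897436 KiB. So memory actually in use by processes grew from roughly 760 MiB to roughly 1.8 GiB, with about 760 MiB pushed to swap.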
> OVERVIEW OF RESULTS
> -------------------
> Below I list the most relevant merges and commits. Note that they are
> listed in commit order; my kernel version shows the order of testing.
>
> For the commits I tested the test results are listed on the next line.
> The first number on that line consists of the test series + the iteration
> (and also identifies the kernel I used).
> A "+" means I got no SKB errors, a "-" that I did get them. A "|" means I
> rebooted for a second series of tests.
>
> v2.6.30-2330-gdb8e7f1 'x86-fixes-for-linus' of linux-2.6-tip
> 1.1 +++ iwlagn sw-error during first test
> v2.6.30-4127-g0fa2133 'merge' of powerpc (last merge before net-next-2.6)
> 1.2 +++
> v2.6.30-5398-g2ed0e21 net-next-2.6 (mega-merge!)
> 1.4 +- system reboot fails after testing
> v2.6.30-5517-g609106b 'merge' of powerpc
> 1.3 +- system reboot fails after testing
> v2.6.30-5927-gf83b1e6 'for-linus' of linux1394-2.6 (last merge before akpm)
> 2.2 ++-
> v2.6.30-6111-g517d086 'akpm'
> 2.1 -|-
>
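The result notation above can be read mechanically. The sketch below is a hypothetical helper, not something from the thread, that classifies a result string along the two questions the bisection asked: do SKB errors appear at all, and do they appear on the first try of every series? Each '|' starts a new series after a reboot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum run_class {
    RUN_CLEAN,       /* no '-' anywhere: no SKB errors seen */
    RUN_FIRST_TRY,   /* every series starts with '-' */
    RUN_EVENTUALLY,  /* every series has a '-', not all on the first try */
    RUN_MIXED        /* some series failed, others stayed clean */
};

/* Classify a result string such as "++-|+-":
 * '+' = no SKB errors, '-' = SKB errors, '|' = reboot, new series. */
static enum run_class classify_results(const char *s)
{
    bool any_fail = false, all_series_fail = true, all_first_fail = true;
    bool series_fail = false, first = true;

    assert(s != NULL);
    for (;; s++) {
        if (*s == '|' || *s == '\0') {
            /* Close the current series. */
            if (!series_fail)
                all_series_fail = false;
            if (*s == '\0')
                break;
            series_fail = false;
            first = true;
            continue;
        }
        if (*s == '-')
            any_fail = series_fail = true;
        if (first && *s != '-')
            all_first_fail = false;
        first = false;
    }

    if (!any_fail)
        return RUN_CLEAN;
    if (!all_series_fail)
        return RUN_MIXED;
    return all_first_fail ? RUN_FIRST_TRY : RUN_EVENTUALLY;
}
```

Applied to the list above, v2.6.30-5927-gf83b1e6 ("++-") fails only eventually, while v2.6.30-6111-g517d086 ("-|-") fails on the first try of every series, which is the distinction used to separate the wireless and mm merges.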
> BISECTION OF net-next-2.6 MERGE
> -------------------------------
> Note that this merge was based not on .30 vanilla, but partly on
> v2.6.30-rc1 and partly on v2.6.30-rc6.
> I think this had an influence on the latencies I saw (i.e. because some
> post-rc6 bug fixes were not present, the general behavior of the system
> during swapping changes). For example: with v2.6.30-4127-g0fa2133 the
> system remained more responsive (smaller music skips) than with
> v2.6.30-rc1-1219-g82d0481.
>
> I started again by testing merges, this time those by David.
>
> v2.6.30-rc1-1219-g82d0481 'master' of wireless-next-2.6
> 1.5 ++++ bad latencies
The bad latencies might imply that there are a lot more allocations
going on than there used to be. Then again, it may just have been a
wireless bug that was later fixed.
> v2.6.30-rc6-660-gbb803cf 'master' of net-2.6
> v2.6.30-rc6-808-g45ea4ea 'master' of wireless-next-2.6
> v2.6.30-rc6-850-gc649c0e 'master' of net-2.6
> v2.6.30-rc6-922-g3f1f39c 'linux-2.6.31.y' of wimax
> v2.6.30-rc6-999-gb2f8f75 'master' of net-2.6
> v2.6.30-rc6-1028-ga8c617e 'net-next' of lksctp-dev
> 1.7 ++++|++++|++++
> I went back to this one twice because the bisection inside the
> next merge (see below) did not give a clear result.
> v2.6.30-rc6-1103-gb1bc81a 'master' of wireless-next-2.6
> 1.8 +-
> v2.6.30-rc6-1224-g84503dd 'master' of wireless-next-2.6
> 1.6 +-
>
> So the problem started in the v2.6.30-rc6-1103-gb1bc81a merge.
> I was unable to narrow it down to an exact commit; AFAICT the remaining
> ones (between v2.6.30-rc6-1028-g8fc0fee and v2.6.30-rc6-1032-g7ba10a8) are
> uninteresting. But it *must* be in this area!
>
> For a good overview of the area, use 'gitk 3f1f39c4..b1bc81a0'.
>
> v2.6.30-rc6-1028-g8fc0fee cfg80211: use key size constants
> 1.11 ++++
> v2.6.30-rc6-1031-g1bb5633 iwmc3200wifi: fix printk format
> 1.14 +++- not quite conclusive...
> v2.6.30-rc6-1032-g7ba10a8 mac80211: fix transposed min/max CW values
> 1.13 -
> This is a bugfix for aa837ee1d from an earlier merge! Could this maybe
> influence the test results in between? There are various SKB related
> changes there, for example: dfbf97f3..e5b9215e.
Maybe. Your commit IDs are different to what I see. Maybe it's because your
tree has been shuffled around a bit, but after some digging around in this
general area, I saw this patch
4752c93c30 iwlcore: Allow skb allocation from tasklet
This patch increases the number of GFP_ATOMIC allocations that can occur by
allocating GFP_ATOMIC in some cases and GFP_KERNEL in others. Previously,
only GFP_KERNEL was used and I didn't realise this allocation method was
so recent. Problems of this sort have cropped up before and while there
are later changes that suppress some of these warnings, I believe this is
a strong candidate for where the allocation failures started appearing.
> v2.6.30-rc6-1037-g2c5b9e5 wireless: libertas: fix unaligned accesses
> 1.12 +-
> v2.6.30-rc6-1044-g729e9c7 cfg80211: fix for duplicate userspace replies
> 1.10 +-
> v2.6.30-rc6-1075-gc587de0 iwlwifi: unify station management
> 1.9 ++-|+-
> v2.6.30-rc6-1076-gd14d444 iwl3945: port allow skb allocation in tasklet
> I thought this was a prime candidate, but as you can see several commits
> before failed too. Still worth looking at I think!
>
Your commit IDs are different to what I see, but it's the merge commit at
b1bc81a0ef86b86fa410dd303d84c8c7bd09a64d. I agree that the last commit
(d14d44407b9f06e3cf967fcef28ccb780caf0583) could make the problem worse
because it expands the use of GFP_ATOMIC for another driver.
> BISECTION of akpm (mm) MERGE
> ----------------------------
> So here I went looking for "where does the test start failing on the first
> try". Again, I was unable to narrow it down to a single commit.
>
> For a good overview of the area, use 'gitk f83b1e61..517d0869'.
>
> v2.6.30-5466-ga1dd268 mm: use alloc_pages_exact in alloc_large_system_hash
> 2.3 +-
> v2.6.30-5478-ge9bb35d mm: setup_per_zone_inactive_ratio - fix comment and..
> 2.5 +-
> v2.6.30-5486-g35282a2 migration: only migrate_prep() once per move_pages()
> 2.6 -|+|- not quite conclusive...
> v2.6.30-5492-gbce7394 page-allocator: reset wmark_min and inactive ratio..
> 2.4 -|-
>
While I didn't spot anything too out of the ordinary here, they did occur
shortly after a number of other page allocator related patches. One small
thing I noticed there is that kswapd is getting woken up less now than it did
previously. Generally, I wouldn't have expected it to make a difference but
it's possible that kswapd is not being woken up to reclaim at a higher order
than it was previously. I have a patch for this below. It'd be nice if you
could apply it and see whether fewer allocation failures occur on current
mainline.
> WHERE NEXT?
> ===========
> I think the results confirm there is definitely an issue here and that my
> test is reliable and consistent enough to show it. And as it currently is
> the only test we have...
>
> I hope that the info above is enough for the mm and wireless domain
> experts to identify likely candidates in the areas I've identified.
>
> The next step could be trying specific reverts or debug patches, either on
> top of current git, or 2.6.31, or inside the identified areas.
> I'll run anything you care to throw at me and will try to provide any
> additional info you need, but at this point it's up to you.
>
For the wireless people in mainline - iwl_rx_replenish_now() is doing
a GFP_ATOMIC allocation that does not use __GFP_NOWARN. As part of
investigating allocation failures, iwl_rx_allocate() was taught to
distinguish between a benign and a serious allocation failure, serious
meaning there are very few RX buffers left and packet loss could occur soon
(see commit f82a924cc88a5541df1d4b9d38a0968cd077a051). I think this GFP mask
should be made GFP_ATOMIC|__GFP_NOWARN so that warnings only appear when the
failure is serious; dump the stack after the warning if you need it. I have a
feeling that almost all these warnings have been benign and are related to
the introduction of GFP_ATOMIC being used so heavily to move more expensive
allocations to the tasklet (presumably to reduce user-visible latency).
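A minimal user-space sketch of the reporting policy proposed here. SKETCH_GFP_ATOMIC, SKETCH_GFP_NOWARN, struct refill_report and report_failure() are hypothetical stand-ins for the kernel's gfp_t flags and the logic in iwl_rx_allocate(), and rate limiting (net_ratelimit()) is ignored. The point is that the two messages are independent: NOWARN silences the page allocator's generic failure warning, while the driver's own critical message still fires when the pool drops to the low watermark.

```c
#include <stdbool.h>

/* Hypothetical flag bits for this sketch only; they are NOT the
 * kernel's gfp_t values. */
#define SKETCH_GFP_ATOMIC  0x1u
#define SKETCH_GFP_NOWARN  0x2u

struct refill_report {
    bool allocator_warning;  /* generic "page allocation failure" warning */
    bool driver_critical;    /* driver's "only N free buffers" message */
};

/* What gets reported when one RX buffer allocation fails. */
static struct refill_report report_failure(unsigned int gfp,
                                           unsigned int free_count,
                                           unsigned int low_watermark)
{
    struct refill_report r;

    /* The page allocator warns unless the caller asked for NOWARN. */
    r.allocator_warning = !(gfp & SKETCH_GFP_NOWARN);
    /* The driver only complains when the RX pool is nearly empty. */
    r.driver_critical = free_count <= low_watermark;
    return r;
}
```

With a (hypothetical) watermark of 8, an atomic refill that fails while 100 buffers are still free reports nothing once NOWARN is set, whereas the same failure with 3 buffers left still produces the driver-level message.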
Frans, could you try the following kswapd-related patch please? I'd be
interested in seeing whether the number of allocation failure warnings is
reduced with it. After that, could you edit
drivers/net/wireless/iwlwifi/iwl-rx.c, make the GFP_ATOMIC in
iwl_rx_replenish_now() GFP_ATOMIC|__GFP_NOWARN and see whether any of the
"serious" allocation failure messages appear.
Thanks again for your persistence.
==== CUT HERE ====
From 5296f50ce7ee6b276723ca21fa50d6db3d266075 Mon Sep 17 00:00:00 2001
From: Mel Gorman <[email protected]>
Date: Mon, 12 Oct 2009 14:21:52 +0100
Subject: [PATCH] page-allocator: Always wake kswapd when restarting an allocation attempt after direct reclaim failed
If a direct reclaim makes no forward progress, it considers whether it should
go OOM or not. Whether OOM is triggered or not, it may retry the allocation
afterwards. In times past, this would always wake kswapd as well but
currently, kswapd is not woken up after direct reclaim fails. For
order-0 allocations, this makes little difference but if there is a heavy
mix of higher-order allocations that direct reclaim is failing for, it
might mean that kswapd is not reclaiming for higher orders as much as it
did previously.
This patch wakes up kswapd when an allocation is being retried after a direct
reclaim failure. It would be expected that kswapd is already awake, but
this has the effect of telling kswapd to reclaim at the higher order as well.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bf72055..dfa4362 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1817,9 +1817,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
 		goto nopage;
 
+restart:
 	wake_all_kswapd(order, zonelist, high_zoneidx);
 
-restart:
 	/*
 	 * OK, we're below the kswapd watermark and have kicked background
 	 * reclaim. Now things get more complex, so set up alloc_flags according
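The effect of moving the restart: label can be modelled in a few lines. In this toy user-space sketch, wakeups_old(), wakeups_new() and the retries parameter are hypothetical, and wake_all_kswapd() is replaced by a counter; each retry jumps back to restart:, as the slow path may do after direct reclaim makes no progress. Before the patch kswapd is woken once per allocation; afterwards it is woken on every pass.

```c
/* Before the patch: the wakeup sits above the label, so retries that
 * jump back to restart: skip it. */
static unsigned int wakeups_old(unsigned int retries)
{
    unsigned int wakes = 0, attempt = 0;

    wakes++;                /* stands in for wake_all_kswapd() */
restart:
    /* ... allocation attempt, direct reclaim, OOM decision ... */
    if (attempt++ < retries)
        goto restart;
    return wakes;
}

/* After the patch: the wakeup sits below the label, so every retry
 * tells kswapd again which order is being requested. */
static unsigned int wakeups_new(unsigned int retries)
{
    unsigned int wakes = 0, attempt = 0;

restart:
    wakes++;                /* stands in for wake_all_kswapd() */
    /* ... allocation attempt, direct reclaim, OOM decision ... */
    if (attempt++ < retries)
        goto restart;
    return wakes;
}
```

With three retries, wakeups_old(3) returns 1 and wakeups_new(3) returns 4; with no retries both return 1, which matches the commit message's point that order-0 behaviour barely changes.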
On Monday 12 October 2009, Mel Gorman wrote:
> Maybe. Your commit id's are different to what I see. Maybe it's because
> your tree has been shuffled around a bit
No, the commit IDs should be identical. My tree is just plain mainline.
Just to make sure... You did remove the "g" from the IDs, right?
So v2.6.30-rc6-1103-gb1bc81a becomes 'b1bc81a' and if you do
'git describe b1bc81a' you really should end up with the same IDs I have.
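The ID convention described here is mechanical: git describe prints <tag>-<count>-g<abbreviated hash>, and the part after the final "-g" is the hash. A small, hypothetical helper (describe_to_abbrev() is not part of git) that extracts it:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Given `git describe` output of the form <tag>-<count>-g<abbrev>,
 * return a pointer to <abbrev>, or NULL if the string does not end in
 * that form (e.g. a bare tag such as "v2.6.30"). */
static const char *describe_to_abbrev(const char *desc)
{
    const char *dash;

    assert(desc != NULL);
    dash = strrchr(desc, '-');
    if (dash == NULL || dash[1] != 'g' || dash[2] == '\0')
        return NULL;
    return dash + 2;   /* skip the "-g" prefix */
}
```

So v2.6.30-rc6-1103-gb1bc81a yields b1bc81a. The helper returns NULL for strings with a suffix such as -dirty, since the last dash no longer precedes the hash.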
> but after some digging around in this general area, I saw this patch
>
> 4752c93c30 iwlcore: Allow skb allocation from tasklet
That is v2.6.30-rc6-773-g4752c93, which is part of the first wireless
merge I tested and where I saw no issues. But see below.
> This patch increases the number of GFP_ATOMIC allocations that can occur
> by allocating GFP_ATOMIC in some cases and GFP_KERNEL in others.
> Previously, only GFP_KERNEL was used and I didn't realise this
> allocation method was so recent. Problems of this sort have cropped up
> before and while there are later changes that suppress some of these
> warnings, I believe this is a strong candidate for where the allocation
> failures started appearing.
>
> > v2.6.30-rc6-1032-g7ba10a8 mac80211: fix transposed min/max CW values
> > 1.13 -
> > This is a bugfix for aa837ee1d from an earlier merge! Could this maybe
There's a typo here. That ID should be: aa837e1d.
> > influence the test results in between? There are various SKB related
> > changes there, for example: dfbf97f3..e5b9215e.
> > v2.6.30-rc6-1037-g2c5b9e5 wireless: libertas: fix unaligned accesses
> > 1.12 +-
> > v2.6.30-rc6-1044-g729e9c7 cfg80211: fix for duplicate userspace replies
> > 1.10 +-
> > v2.6.30-rc6-1075-gc587de0 iwlwifi: unify station management
> > 1.9 ++-|+-
> > v2.6.30-rc6-1076-gd14d444 iwl3945: port allow skb allocation in tasklet
> > I thought this was a prime candidate, but as you can see
> > several commits before failed too. Still worth looking at I think!
>
> Your commit IDs are different to what I see but it's the commit merge at
> b1bc81a0ef86b86fa410dd303d84c8c7bd09a64d. I agree that the last commit
> (d14d44407b9f06e3cf967fcef28ccb780caf0583) could make the problem worse
> because it expands the use of GFP_ATOMIC for another driver.
No, that was a mistake of mine. d14d444 is in a driver I don't even compile.
The one you identified (which is the same change for iwlagn) is much more
interesting.
I really do think that v2.6.30-rc6-1032-g7ba10a8 could play a role here.
That's a fix for v2.6.30-rc1-1131-gaa837e1. So that bug was introduced
_before_ the merge 82d0481 and may thus well explain both the latencies I
saw _and_ why that merge tested without problems. And that would also go a
long way to explain my test results.
So I'm going to retest 82d0481 with 7ba10a8 cherry-picked on top.
> > BISECTION of akpm (mm) MERGE
> > ----------------------------
[...]
> While I didn't spot anything too out of the ordinary here, they did
> occur shortly after a number of other page allocator related patches.
> One small thing I noticed there is that kswapd is getting woken up less
> now than it did previously. Generally, I wouldn't have expected it to
> make a difference but it's possible that kswapd is not being woken up to
> reclaim at a higher order than it was previously. I have a patch for
> this below. It'd be nice if you could apply it and see do fewer
> allocation failures occur on current mainline.
I'll give that patch a try and report back.
On Mon, Oct 12, 2009 at 07:32:11PM +0200, Frans Pop wrote:
> On Monday 12 October 2009, Mel Gorman wrote:
> > Maybe. Your commit id's are different to what I see. Maybe it's because
> > your tree has been shuffled around a bit
>
> No, the commit IDs should be identical. My tree is just plain mainline.
>
> Just to make sure... You did remove the "g" from the IDs, right?
> So v2.6.30-rc6-1103-gb1bc81a becomes 'b1bc81a' and if you do
> 'git describe b1bc81a' you really should end up with the same IDs I have.
>
Bah, that's what I was doing all right. No excuse, that was just plain
stupid of me.
> > but after some digging around in this general area, I saw this patch
> >
> > 4752c93c30 iwlcore: Allow skb allocation from tasklet
>
> That is v2.6.30-rc6-773-g4752c93, which is part of the first wireless
> merge I tested and where I saw no issues. But see below.
>
While there were no issues at that point, I think it might have been the
beginning of a few patches that made things progressively worse. It is
possible there is more than one patch causing trouble here and bisecting
each of them is unlikely to be an option. More on this later though.
> > This patch increases the number of GFP_ATOMIC allocations that can occur
> > by allocating GFP_ATOMIC in some cases and GFP_KERNEL in others.
> > Previously, only GFP_KERNEL was used and I didn't realise this
> > allocation method was so recent. Problems of this sort have cropped up
> > before and while there are later changes that suppress some of these
> > warnings, I believe this is a strong candidate for where the allocation
> > failures started appearing.
> >
> > > v2.6.30-rc6-1032-g7ba10a8 mac80211: fix transposed min/max CW values
> > > 1.13 -
> > > This is a bugfix for aa837ee1d from an earlier merge! Could this maybe
>
> There's a typo here. That ID should be: aa837e1d.
>
> > > influence the test results in between? There are various SKB related
> > > changes there, for example: dfbf97f3..e5b9215e.
> > > v2.6.30-rc6-1037-g2c5b9e5 wireless: libertas: fix unaligned accesses
> > > 1.12 +-
> > > v2.6.30-rc6-1044-g729e9c7 cfg80211: fix for duplicate userspace replies
> > > 1.10 +-
> > > v2.6.30-rc6-1075-gc587de0 iwlwifi: unify station management
> > > 1.9 ++-|+-
> > > v2.6.30-rc6-1076-gd14d444 iwl3945: port allow skb allocation in tasklet
> > > I thought this was a prime candidate, but as you can see
> > > several commits before failed too. Still worth looking at I think!
> >
> > Your commit IDs are different to what I see but it's the commit merge at
> > b1bc81a0ef86b86fa410dd303d84c8c7bd09a64d. I agree that the last commit
> > (d14d44407b9f06e3cf967fcef28ccb780caf0583) could make the problem worse
> > because it expands the use of GFP_ATOMIC for another driver.
>
> No, that was a mistake of mine. d14d444 is in a driver I don't even compile.
> The one you identified (which is the same change for iwlagn) is much more
> interesting.
>
I had forgotten what model your card was and assumed it must have been based
on this driver for the problem to get worse for you at that point.
> I really do think that v2.6.30-rc6-1032-g7ba10a8 could play a role here.
> That's a fix for v2.6.30-rc1-1131-gaa837e1. So that bug was introduced
> _before_ the merge 82d0481 and may thus well explain both the latencies I
> saw _and_ why that merge tested without problems. And that would also go a
> long way to explain my test results.
Very good point.
> So I'm going to retest 82d0481 with 7ba10a8 cherry-picked on top.
>
Great.
> > > BISECTION of akpm (mm) MERGE
> > > ----------------------------
> [...]
> > While I didn't spot anything too out of the ordinary here, they did
> > occur shortly after a number of other page allocator related patches.
> > One small thing I noticed there is that kswapd is getting woken up less
> > now than it did previously. Generally, I wouldn't have expected it to
> > make a difference but it's possible that kswapd is not being woken up to
> > reclaim at a higher order than it was previously. I have a patch for
> > this below. It'd be nice if you could apply it and see do fewer
> > allocation failures occur on current mainline.
>
> I'll give that patch a try and report back.
>
Thanks a lot.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
From: Eric Dumazet <[email protected]>
Date: Fri, 09 Oct 2009 16:43:40 +0200
> [PATCH] udp: Fix udp_poll()
>
> udp_poll() can in some circumstances drop frames with incorrect checksums.
>
> Problem is we now have to lock the socket while dropping frames, or risk
> sk_forward corruption.
>
> This bug is present since commit 95766fff6b9a78d1
> ([UDP]: Add memory accounting.)
>
> While we are at it, we can correct ioctl(SIOCINQ) to also drop bad frames.
>
> Signed-off-by: Eric Dumazet <[email protected]>
Looks good, applied, thanks Eric!
On Monday 12 October 2009, Frans Pop wrote:
> On Monday 12 October 2009, Mel Gorman wrote:
> > but after some digging around in this general area, I saw this patch
> >
> > 4752c93c30 iwlcore: Allow skb allocation from tasklet
>
> That is v2.6.30-rc6-773-g4752c93, which is part of the first wireless
> merge I tested and where I saw no issues. But see below.
>
> > This patch increases the number of GFP_ATOMIC allocations that can
> > occur by allocating GFP_ATOMIC in some cases and GFP_KERNEL in others.
> > Previously, only GFP_KERNEL was used and I didn't realise this
> > allocation method was so recent. Problems of this sort have cropped up
> > before and while there are later changes that suppress some of these
> > warnings, I believe this is a strong candidate for where the
> > allocation failures started appearing.
I have tried reverting this patch and that does make a significant
difference, but the results are still not really conclusive.
I tested the revert on top of:
- the first net-next-2.6 merge (2ed0e21), i.e. before the mm merge
- 2.6.31.1
In both cases I no longer get SKB errors, but instead (?) I get firmware
errors:
iwlagn 0000:10:00.0: Microcode SW error detected. Restarting 0x2000000.
So on the wireless side it does look as if there is more than one change
involved. Remember that with .30 I don't get any errors, only relatively
mild latencies and skips in the music.
> I really do think that v2.6.30-rc6-1032-g7ba10a8 could play a role here.
> That's a fix for v2.6.30-rc1-1131-gaa837e1. So that bug was introduced
> _before_ the merge 82d0481 and may thus well explain both the latencies
> I saw _and_ why that merge tested without problems. And that would also
> go a long way to explain my test results.
> So I'm going to retest 82d0481 with 7ba10a8 cherry-picked on top.
^^^^^^^-- should be 45ea4ea
I've tried this but still don't get any SKB errors, so that bug does not
seem to make a difference.
> > > BISECTION of akpm (mm) MERGE
> > > ----------------------------
> > While I didn't spot anything too out of the ordinary here, they did
> > occur shortly after a number of other page allocator related patches.
> > One small thing I noticed there is that kswapd is getting woken up
> > less now than it did previously. Generally, I wouldn't have expected
> > it to make a difference but it's possible that kswapd is not being
> > woken up to reclaim at a higher order than it was previously. I have a
> > patch for this below. It'd be nice if you could apply it and see do
> > fewer allocation failures occur on current mainline.
>
> I'll give that patch a try and report back.
With your patch on .32-rc4 I still get the SKB errors, so it does not seem
to help. The only change there may have been is that the desktop was
frozen longer than without the patch, but that is an impression, not a
hard fact.
Although identifying the problem on the wireless side is important, I still
feel that tracing the mm change should have priority as it influences much
more than just iwlagn, as the other reports prove.
> > After that, could you edit drivers/net/wireless/iwlwifi/iwl-rx.c and
> > make the GFP_ATOMIC in iwl_rx_replenish_now() GFP_ATOMIC|__GFP_NOWARN
> > and see do any of the "serious" allocation failure messages appear.
For the above reason I've not yet tried this. It seems to me that this
change will not really solve anything, but just suppress errors.
Cheers,
FJP
On Tue, Oct 13, 2009 at 10:38:37PM +0200, Frans Pop wrote:
> On Monday 12 October 2009, Frans Pop wrote:
> > On Monday 12 October 2009, Mel Gorman wrote:
> > > but after some digging around in this general area, I saw this patch
> > >
> > > 4752c93c30 iwlcore: Allow skb allocation from tasklet
> >
> > That is v2.6.30-rc6-773-g4752c93, which is part of the first wireless
> > merge I tested and where I saw no issues. But see below.
> >
> > > This patch increases the number of GFP_ATOMIC allocations that can
> > > occur by allocating GFP_ATOMIC in some cases and GFP_KERNEL in others.
> > > Previously, only GFP_KERNEL was used and I didn't realise this
> > > allocation method was so recent. Problems of this sort have cropped up
> > > before and while there are later changes that suppress some of these
> > > warnings, I believe this is a strong candidate for where the
> > > allocation failures started appearing.
>
> I have tried reverting this patch and that does make a significant
> difference, but the results are still not really conclusive.
> I tested the revert on top of:
> - the first net-next-2.6 merge (2ed0e21), i.e. before the mm merge
> - 2.6.31.1
>
I think this is very significant. Either that change needs to be backed
out or more likely, __GFP_NOWARN needs to be specified and warnings
*only* printed when the RX buffers are really low. My expectation would
be that some GFP_ATOMIC allocations fail during refill but the fact they
fail wakes kswapd to reclaim order-2 pages while the RX buffers in the
pool are consumed.
> In both cases I no longer get SKB errors, but instead (?) I get firmware
> errors:
> iwlagn 0000:10:00.0: Microcode SW error detected. Restarting 0x2000000.
>
I am no wireless expert, but that looks like a separate problem to me.
I don't see how an allocation failure could trigger errors in the
microcode.
I really really hate to say it, but this might need a separate bisection
with 4752c93c30 either reverted or patched as I do below.
> So on the wireless side it does look as if there is more than one change
> involved. Remember that with .30 I don't get any errors, only relatively
> mild latencies and skips in the music.
>
2.6.31 does not appear to have done wireless any favours.
> > I really do think that v2.6.30-rc6-1032-g7ba10a8 could play a role here.
> > That's a fix for v2.6.30-rc1-1131-gaa837e1. So that bug was introduced
> > _before_ the merge 82d0481 and may thus well explain both the latencies
> > I saw _and_ why that merge tested without problems. And that would also
> > go a long way to explain my test results.
> > So I'm going to retest 82d0481 with 7ba10a8 cherry-picked on top.
> ^^^^^^^-- should be 45ea4ea
>
> I've tried this but still don't get any SKB errors, so that bug does not
> seem to make a difference.
>
> > > > BISECTION of akpm (mm) MERGE
> > > > ----------------------------
> > > While I didn't spot anything too out of the ordinary here, they did
> > > occur shortly after a number of other page allocator related patches.
> > > One small thing I noticed there is that kswapd is getting woken up
> > > less now than it did previously. Generally, I wouldn't have expected
> > > it to make a difference but it's possible that kswapd is not being
> > > woken up to reclaim at a higher order than it was previously. I have a
> > > patch for this below. It'd be nice if you could apply it and see do
> > > fewer allocation failures occur on current mainline.
> >
> > I'll give that patch a try and report back.
>
> With your patch on .32-rc4 I still get the SKB errors, so it does not seem
> to help. The only change there may have been is that the desktop was
> frozen longer than without the patch, but that is an impression, not a
> hard fact.
>
Actually, that's fairly interesting and I think it justifies pushing the
patch. Direct reclaim can stall processes in a user-visible manner, which
kswapd is meant to avoid in the majority of cases. That is tricky to quantify
without instrumenting the kernel to measure direct reclaim frequency and
latency (I have tracepoints for this, but they are still a WIP). If you
notice shorter stalls with the patch applied, it means that kswapd really
did need to be informed of the problems.
> Although identifying the problem on the wireless side is important, I still
> feel that tracing the mm change should have priority as it influences much
> more than just iwlagn, as the other reports prove.
>
No mm change that makes fragmentation significantly worse has been
identified yet. The majority of the wireless reports have been in this
driver and I think we have the problem commit there. The only other report
is a firmware loading problem in e100 after resume, where an atomic order-5
allocation fails. It's possible that something has changed in resume in the
2.6.31 window there, maybe drivers now reload during resume where they
didn't previously, or less memory is being pushed to swap during resume.
> > > After that, could you edit drivers/net/wireless/iwlwifi/iwl-rx.c and
> > > make the GFP_ATOMIC in iwl_rx_replenish_now() GFP_ATOMIC|__GFP_NOWARN
> > > and see do any of the "serious" allocation failure messages appear.
>
> For the above reason I've not yet tried this. It seems to me that this
> change will not really solve anything, but just suppress errors.
>
I disagree. Harmless allocation errors get suppressed but it still warns when
things get really bad. See the following patch that suppresses the
warnings from GFP_ATOMIC but warns for GFP_KERNEL failures and dumps a
stack on serious allocation failure.
We either need a patch like this or the
GFP_ATOMIC-direct-with-refills-from-tasklet patch needs to be reverted.
=== CUT HERE ===
From 5fb9f897117bf2701f9fdebe4d008dbe34358ab9 Mon Sep 17 00:00:00 2001
From: Mel Gorman <[email protected]>
Date: Wed, 14 Oct 2009 11:19:57 +0100
Subject: [PATCH] iwlwifi: Suppress warnings related to GFP_ATOMIC allocations that do not matter
iwlwifi refills RX buffers in two ways - a direct method using GFP_ATOMIC
and a tasklet method using GFP_KERNEL. There are a number of RX buffers and
there are only serious issues when there are no RX buffers left. The driver
explicitly warns when refills are failing and the buffers are low but it
always warns when a GFP_ATOMIC allocation fails even when there is no
packet loss as a result.
This patch specifies __GFP_NOWARN for the direct refill method that uses
GFP_ATOMIC. To help identify where allocation failures might be coming
from, the stack is dumped when the RX queue is dangerously low.
Signed-off-by: Mel Gorman <[email protected]>
---
drivers/net/wireless/iwlwifi/iwl-rx.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
old mode 100644
new mode 100755
diff --git a/drivers/net/wireless/iwlwifi/iwl-rx.c b/drivers/net/wireless/iwlwifi/iwl-rx.c
index 8e1bb53..f91a108 100644
--- a/drivers/net/wireless/iwlwifi/iwl-rx.c
+++ b/drivers/net/wireless/iwlwifi/iwl-rx.c
@@ -260,10 +260,12 @@ void iwl_rx_allocate(struct iwl_priv *priv, gfp_t priority)
if (net_ratelimit())
IWL_DEBUG_INFO(priv, "Failed to allocate SKB buffer.\n");
if ((rxq->free_count <= RX_LOW_WATERMARK) &&
- net_ratelimit())
+ net_ratelimit()) {
IWL_CRIT(priv, "Failed to allocate SKB buffer with %s. Only %u free buffers remaining.\n",
priority == GFP_ATOMIC ? "GFP_ATOMIC" : "GFP_KERNEL",
rxq->free_count);
+ dump_stack();
+ }
/* We don't reschedule replenish work here -- we will
* call the restock method and if it still needs
* more buffers it will schedule replenish */
@@ -320,7 +322,7 @@ EXPORT_SYMBOL(iwl_rx_replenish);
void iwl_rx_replenish_now(struct iwl_priv *priv)
{
- iwl_rx_allocate(priv, GFP_ATOMIC);
+ iwl_rx_allocate(priv, GFP_ATOMIC|__GFP_NOWARN);
iwl_rx_queue_restock(priv);
}
On Wednesday 14 October 2009, Mel Gorman wrote:
> I think this is very significant. Either that change needs to be backed
> out or more likely, __GFP_NOWARN needs to be specified and warnings
> *only* printed when the RX buffers are really low. My expectation would
> be that some GFP_ATOMIC allocations fail during refill but the fact they
> fail wakes kswapd to reclaim order-2 pages while the RX buffers in the
> pool are consumed.
Sorry I did not actually mention this, but the SKB failures I get with .32
have loads of the "Failed to allocate SKB buffer with GFP_ATOMIC. Only 0
free buffers remaining." errors. That's why I don't think your patch will
help anything.
zgrep "Only 0 free buffers remaining" /var/log/kern.log* | wc -l
84
OK, they are all GFP_ATOMIC and not GFP_KERNEL, but they also almost all
have "0 free buffers"! Next to the 84 warnings for 0 remaining I only have
one with "3 free buffers" and one with "1 free buffers".
And that does not even count the rate limiting:
Oct 12 20:15:07 aragorn kernel: __ratelimit: 45 callbacks suppressed
Oct 12 20:25:19 aragorn kernel: __ratelimit: 27 callbacks suppressed
Oct 12 20:25:20 aragorn kernel: __ratelimit: 2 callbacks suppressed
Attached the kernel log for one test I did with .32.
> > In both cases I no longer get SKB errors, but instead (?) I get
> > firmware errors:
> > iwlagn 0000:10:00.0: Microcode SW error detected. Restarting
> > 0x2000000.
>
> I am no wireless expert, but that looks like a separate problem to me.
> I don't see how an allocation failure could trigger errors in the
> microcode.
Yes, it is a separate problem, but it is still significant that reverting
that patch triggers them in the extreme swap situation.
> > With your patch on .32-rc4 I still get the SKB errors, so it does not
> > seem to help. The only change there may have been is that the desktop
> > was frozen longer than without the patch, but that is an impression,
> > not a hard fact.
>
> Actually, that's fairly interesting and I think justifies pushing the
> patch. Direct reclaim can stall processes in a user-visible manner which
> kswapd is meant to avoid in the majority of cases but is tricky to
> quantify without instrumenting the kernel to measure direct reclaim
> frequency and latency (I have WIP tracepoints for this but it's still a
> WIP). If you notice shorter stalls with the patch applied, it means that
> kswapd really did need to be informed of the problems.
No, I thought I saw _longer_ stalls with your patch applied...
> There still has not been a mm-change identified that makes fragmentation
> significantly worse.
My bisection shows a very clear point, even if not an individual commit, in
the 'akpm' merge where SKB errors suddenly become *much* more frequent and
easy to trigger.
I'm sorry to say this, but the fact that nothing has been identified yet is
IMO the result of a lack of effort, not because there is no such change.
> The majority of the wireless reports have been in
> this driver and I think we have the problem commit there. The only other
> is a firmware loading problem in e100 after resume, where an atomic
> order-5 allocation fails.
Not exactly true. Bartlomiej's report was about ipw2200, so there are at
least 3 different drivers involved, two wireless and one wired. Besides
that, one report is related to heavy swap, one to resume and one to driver
reload.
So it's much more likely that there is some common regression (in mm) that
affected all three than that there are three unrelated regressions.
And although both of the others did extremely high allocations, they both
started appearing in the same timeframe. And Bart's very first report
linked it to mm changes.
> It's possible that something has changed in resume
> in the 2.6.31 window there - maybe something like drivers now reload
> during resume where they didn't previously or less memory being pushed
> to swap during resume.
IMO you're sticking your head in the sand here.
I'm not saying that mm is the only issue here, but I'm convinced that there
_is_ an mm change that has contributed in a major way to these issues,
even if we've not yet been able to identify it.
> - net_ratelimit())
> + net_ratelimit()) {
> IWL_CRIT(priv, "Failed to allocate SKB buffer with %s. Only %u free
> buffers remaining.\n", priority == GFP_ATOMIC ? "GFP_ATOMIC" :
> "GFP_KERNEL",
Haven't you broken the test 'priority == GFP_ATOMIC' here by setting
priority to GFP_ATOMIC|__GFP_NOWARN?
Cheers,
FJP
On Wed, Oct 14, 2009 at 03:10:08PM +0200, Frans Pop wrote:
> On Wednesday 14 October 2009, Mel Gorman wrote:
> > I think this is very significant. Either that change needs to be backed
> > out or more likely, __GFP_NOWARN needs to be specified and warnings
> > *only* printed when the RX buffers are really low. My expectation would
> > be that some GFP_ATOMIC allocations fail during refill but the fact they
> > fail wakes kswapd to reclaim order-2 pages while the RX buffers in the
> > pool are consumed.
>
> Sorry I did not actually mention this, but the SKB failures I get with .32
> have loads of the "Failed to allocate SKB buffer with GFP_ATOMIC. Only 0
> free buffers remaining." errors. That's why I don't think your patch will
> help anything.
>
> zgrep "Only 0 free buffers remaining" /var/log/kern.log* | wc -l
> 84
>
> OK, they are all GFP_ATOMIC and not GFP_KERNEL, but they also almost all
> have "0 free buffers"! Next to the 84 warnings for 0 remaining I only have
> one with "3 free buffers" and one with "1 free buffers".
>
This is fairly important. It shows that the refills are not keeping up
with the GFP_ATOMIC usage. I'm not sure what to do with this. As it was
the driver change that introduced GFP_ATOMIC usage in the first place, I'm
tempted to say revert the changes in the driver that make use of GFP_ATOMIC,
but I'm not the maintainer. They could also consider an approach that is
GFP_ATOMIC-optimistic, falls back to allocating directly with GFP_KERNEL if
no buffers are free, and always does GFP_KERNEL refills in the tasklet.
However, that might just be avoiding the MM problem on my part. It's possible
that if I figure out what went wrong in mm, the drivers' use of GFP_ATOMIC
will be swept under the carpet.
> And that does not even count the rate limiting:
> Oct 12 20:15:07 aragorn kernel: __ratelimit: 45 callbacks suppressed
> Oct 12 20:25:19 aragorn kernel: __ratelimit: 27 callbacks suppressed
> Oct 12 20:25:20 aragorn kernel: __ratelimit: 2 callbacks suppressed
>
> Attached the kernel log for one test I did with .32.
>
> > > In both cases I no longer get SKB errors, but instead (?) I get
> > > firmware errors:
> > > iwlagn 0000:10:00.0: Microcode SW error detected. Restarting
> > > 0x2000000.
> >
> > I am no wireless expert, but that looks like a separate problem to me.
> > I don't see how an allocation failure could trigger errors in the
> > microcode.
>
> Yes, it is a separate problem, but it is still significant that reverting
> that patch triggers them in the extreme swap situation.
>
True.
> > > With your patch on .32-rc4 I still get the SKB errors, so it does not
> > > seem to help. The only change there may have been is that the desktop
> > > was frozen longer than without the patch, but that is an impression,
> > > not a hard fact.
> >
> > Actually, that's fairly interesting and I think justifies pushing the
> > patch. Direct reclaim can stall processes in a user-visible manner which
> > kswapd is meant to avoid in the majority of cases but is tricky to
> > quantify without instrumenting the kernel to measure direct reclaim
> > frequency and latency (I have WIP tracepoints for this but it's still a
> > WIP). If you notice shorter stalls with the patch applied, it means that
> > kswapd really did need to be informed of the problems.
>
> No, I thought I saw _longer_ stalls with your patch applied...
>
Sorry, I misinterpreted. If the stalls are longer, it likely means that
kswapd is doing more work and causing more IO with the patch applied as it
tries to free order-2 pages. You said you still got SKB errors. Was there
any significant change in the number of failures, or is that hard to tell?
> > There still has not been a mm-change identified that makes fragmentation
> > significantly worse.
>
> My bisection shows a very clear point, even if not an individual commit, in
> the 'akpm' merge where SKB errors suddenly become *much* more frequent and
> easy to trigger.
> I'm sorry to say this, but the fact that nothing has been identified yet is
> IMO the result of a lack of effort, not because there is no such change.
>
I apologise if I've given that impression. I've been staring at the commits
but could not find an obvious candidate within the page allocator itself, which
is why I've been looking at other areas. I put together a hack that allocated
order-2 atomics at a constant rate and order-5 atomics at a lower rate to
try to replicate the problem without drivers. I ran some workloads but I wasn't
able to get reliable figures that would have allowed me to investigate further.
> > The majority of the wireless reports have been in
> > this driver and I think we have the problem commit there. The only other
> > is a firmware loading problem in e100 after resume, where an atomic
> > order-5 allocation fails.
>
> Not exactly true. Bartlomiej's report was about ipw2200, so there are at
> least 3 different drivers involved, two wireless and one wired. Besides
> that one report is related to heavy swap, one to resume and one to driver
> reload.
> So it's much more likely that there is some common regression (in mm) that
> affected all three than that there are three unrelated regressions.
Very very likely, I'm not denying this.
> And although both of the others did extremely high allocations, they both
> started appearing in the same timeframe. And Bart's very first report
> linked it to mm changes.
>
> > It's possible that something has changed in resume
> > in the 2.6.31 window there - maybe something like drivers now reload
> > during resume where they didn't previously or less memory being pushed
> > to swap during resume.
>
> IMO you're sticking your head in the sand here.
No. If I was sticking my head in the sand, I would have dismissed this
entirely as "GFP_ATOMIC allocations can fail boo hoo hoo deal with it".
What I'm trying to identify is what changed that would affect fragmentation
but that is not within the page allocator itself - largely because with
the exception of the patch I gave you, I couldn't find obvious breakage.
You highlighted the first akpm merge, so let's look closer at that as I don't
think there is anything more I can do with the wireless driver other than the
suggestions made already. I looked at this already but I felt fixing GFP_ATOMIC
in wireless was the more likely fix.
Here is what you said about the merge.
====
For a good overview of the area, use 'gitk f83b1e61..517d0869'.
v2.6.30-5466-ga1dd268 mm: use alloc_pages_exact in alloc_large_system_hash
2.3 +-
v2.6.30-5478-ge9bb35d mm: setup_per_zone_inactive_ratio - fix comment and..
2.5 +-
v2.6.30-5486-g35282a2 migration: only migrate_prep() once per move_pages()
2.6 -|+|- not quite conclusive...
v2.6.30-5492-gbce7394 page-allocator: reset wmark_min and inactive ratio..
2.4 -|-
====
This is what I found. The following were the possible commits that might
be causing the problem.
d239171..72807a7 -- page allocator
These are the bulk of the page-allocator changes that happened in
the 2.6.30..2.6.31 cycle. It's also the location of the change to
kswapd that I sent you a patch for. If there was a marked increase
in the number of failures before and after this patchset, it means
that I was wrong about the problem not being in the page allocator
and I have to go back and keep looking. However, you report that
commit e9bb35d mm: setup_per_zone_inactive_ratio - fix comment
had relatively good results - relatively being that it didn't fail
on the first try. In my head, these patches have been struck off the
list of possibilities, which is why I've been looking in other subsystems.
56e49d2..f166777 -- reclaim
I would have considered these strong candidates except, again, the last
good commit happened after this point. If other obvious candidates
don't crop up, it might be worth double checking within this range, particularly
commit 56e49d2 vmscan: evict use-once pages first
as it is targeted at streaming-IO workloads which would include
your music workload. This commit will also revert cleanly on
mainline, so it is relatively easy to test.
5c87ead..e9bb35d -- inactive ratio changes
These patches should be harmless but just in case, please
compare the output of
# grep inactive_ratio /proc/zoneinfo
on 2.6.30 and 2.6.31 and make sure the ratios are the same.
e9bb35d..bce7394 -- various changes
According to your analysis, this is the most likely location of
the problem commit.
Commit b70d94e altered how zonelists were selected during
allocation. This was tested fairly heavily but if the testing
missed something, it would mean that some allocations are not
using the zones they should be. However, my expectation would
be that mistakes here would have severe consequences affecting a
large number of people. This does not revert cleanly but there is
an untested patch below that should do the job. While it's hard to
imagine this patch being the problem, it's the most likely commit
within the range of commits your analysis identified.
Commit bc75d33 is totally harmless but it mentions min_free_kbytes. I
checked on my machine to make sure min_free_kbytes was the same on both
2.6.30 and 2.6.31. Can you check that this is true for your machine? If
min_free_kbytes decreased, it could explain GFP_ATOMIC failures.
An extremely unlikely candidate is 75927af8. For this to be a problem,
much of your userspace would have to be calling madvise() with
stupid parameters and depending on it silently ignoring those
parameters.
A vague potential candidate for swapless systems is 69c85481 but
your machine has swap so it can't be this.
Commit bce7394 affects min_free_kbytes but only on hotplug so it
can't be this either for your machine.
After this point, your analysis indicates that things are already broken
but let's look at some of the candidates anyway. Out of curiosity,
was CONFIG_UNEVICTABLE_LRU unset in your .config for 2.6.30? I could
only find your 2.6.31 .config. If it was, it might be worth reverting
6837765963f1723e80ca97b1fae660f3a60d77df and unsetting it in 2.6.31 and
seeing what happens.
Commit 8cab4754d24a0f2e05920170c845bd84472814c6 keeps pages on the active
lists for longer than 2.6.30 did. It's possible that making fewer reclaim
decisions is delaying lumpy reclaim.
CONFIG_NUMA is not set in your config, so the zone_reclaim() changes
around 24cf72518c79cdcda486ed26074ff8151291cf65 can be discounted.
Commit ee993b135ec75a93bd5c45e636bb210d2975159b altered how lumpy
reclaim works but it should have been harmless. It does not cleanly
revert but it's easy to manually revert.
I didn't spot any other patches that might be potential problems in the
commits.
> I'm not saying that mm is the only issue here, but I'm convinced that there
> _is_ an mm change that has contributed in a major way to these issues,
> even if we've not yet been able to identify it.
>
> > - net_ratelimit())
> > + net_ratelimit()) {
> > IWL_CRIT(priv, "Failed to allocate SKB buffer with %s. Only %u free
> > buffers remaining.\n", priority == GFP_ATOMIC ? "GFP_ATOMIC" :
> > "GFP_KERNEL",
>
> Haven't you broken the test 'priority == GFP_ATOMIC' here by setting
> priority to GFP_ATOMIC|__GFP_NOWARN?
>
Yes, I did, but since you say this error message shows up when the buffers
are all depleted, it's not even close to being the right fix. It would only
be relevant if that error message were showing up with buffers remaining in
the queue.
Revert commit b70d94ee
---
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 557bdad..3a94e4b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -21,8 +21,7 @@ struct vm_area_struct;
#define __GFP_DMA ((__force gfp_t)0x01u)
#define __GFP_HIGHMEM ((__force gfp_t)0x02u)
#define __GFP_DMA32 ((__force gfp_t)0x04u)
-#define __GFP_MOVABLE ((__force gfp_t)0x08u) /* Page is movable */
-#define GFP_ZONEMASK (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)
+
/*
* Action modifiers - doesn't change the zoning
*
@@ -52,6 +51,7 @@ struct vm_area_struct;
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
#define __GFP_THISNODE ((__force gfp_t)0x40000u)/* No fallback, no policies */
#define __GFP_RECLAIMABLE ((__force gfp_t)0x80000u) /* Page is reclaimable */
+#define __GFP_MOVABLE ((__force gfp_t)0x100000u) /* Page is movable */
#ifdef CONFIG_KMEMCHECK
#define __GFP_NOTRACK ((__force gfp_t)0x200000u) /* Don't track with kmemcheck */
@@ -128,105 +128,24 @@ static inline int allocflags_to_migratetype(gfp_t gfp_flags)
((gfp_flags & __GFP_RECLAIMABLE) != 0);
}
-#ifdef CONFIG_HIGHMEM
-#define OPT_ZONE_HIGHMEM ZONE_HIGHMEM
-#else
-#define OPT_ZONE_HIGHMEM ZONE_NORMAL
-#endif
-
+static inline enum zone_type gfp_zone(gfp_t flags)
+{
#ifdef CONFIG_ZONE_DMA
-#define OPT_ZONE_DMA ZONE_DMA
-#else
-#define OPT_ZONE_DMA ZONE_NORMAL
+ if (flags & __GFP_DMA)
+ return ZONE_DMA;
#endif
-
#ifdef CONFIG_ZONE_DMA32
-#define OPT_ZONE_DMA32 ZONE_DMA32
-#else
-#define OPT_ZONE_DMA32 ZONE_NORMAL
-#endif
-
-/*
- * GFP_ZONE_TABLE is a word size bitstring that is used for looking up the
- * zone to use given the lowest 4 bits of gfp_t. Entries are ZONE_SHIFT long
- * and there are 16 of them to cover all possible combinations of
- * __GFP_DMA, __GFP_DMA32, __GFP_MOVABLE and __GFP_HIGHMEM
- *
- * The zone fallback order is MOVABLE=>HIGHMEM=>NORMAL=>DMA32=>DMA.
- * But GFP_MOVABLE is not only a zone specifier but also an allocation
- * policy. Therefore __GFP_MOVABLE plus another zone selector is valid.
- * Only 1bit of the lowest 3 bit (DMA,DMA32,HIGHMEM) can be set to "1".
- *
- * bit result
- * =================
- * 0x0 => NORMAL
- * 0x1 => DMA or NORMAL
- * 0x2 => HIGHMEM or NORMAL
- * 0x3 => BAD (DMA+HIGHMEM)
- * 0x4 => DMA32 or DMA or NORMAL
- * 0x5 => BAD (DMA+DMA32)
- * 0x6 => BAD (HIGHMEM+DMA32)
- * 0x7 => BAD (HIGHMEM+DMA32+DMA)
- * 0x8 => NORMAL (MOVABLE+0)
- * 0x9 => DMA or NORMAL (MOVABLE+DMA)
- * 0xa => MOVABLE (Movable is valid only if HIGHMEM is set too)
- * 0xb => BAD (MOVABLE+HIGHMEM+DMA)
- * 0xc => DMA32 (MOVABLE+HIGHMEM+DMA32)
- * 0xd => BAD (MOVABLE+DMA32+DMA)
- * 0xe => BAD (MOVABLE+DMA32+HIGHMEM)
- * 0xf => BAD (MOVABLE+DMA32+HIGHMEM+DMA)
- *
- * ZONES_SHIFT must be <= 2 on 32 bit platforms.
- */
-
-#if 16 * ZONES_SHIFT > BITS_PER_LONG
-#error ZONES_SHIFT too large to create GFP_ZONE_TABLE integer
+ if (flags & __GFP_DMA32)
+ return ZONE_DMA32;
#endif
-
-#define GFP_ZONE_TABLE ( \
- (ZONE_NORMAL << 0 * ZONES_SHIFT) \
- | (OPT_ZONE_DMA << __GFP_DMA * ZONES_SHIFT) \
- | (OPT_ZONE_HIGHMEM << __GFP_HIGHMEM * ZONES_SHIFT) \
- | (OPT_ZONE_DMA32 << __GFP_DMA32 * ZONES_SHIFT) \
- | (ZONE_NORMAL << __GFP_MOVABLE * ZONES_SHIFT) \
- | (OPT_ZONE_DMA << (__GFP_MOVABLE | __GFP_DMA) * ZONES_SHIFT) \
- | (ZONE_MOVABLE << (__GFP_MOVABLE | __GFP_HIGHMEM) * ZONES_SHIFT)\
- | (OPT_ZONE_DMA32 << (__GFP_MOVABLE | __GFP_DMA32) * ZONES_SHIFT)\
-)
-
-/*
- * GFP_ZONE_BAD is a bitmap for all combination of __GFP_DMA, __GFP_DMA32
- * __GFP_HIGHMEM and __GFP_MOVABLE that are not permitted. One flag per
- * entry starting with bit 0. Bit is set if the combination is not
- * allowed.
- */
-#define GFP_ZONE_BAD ( \
- 1 << (__GFP_DMA | __GFP_HIGHMEM) \
- | 1 << (__GFP_DMA | __GFP_DMA32) \
- | 1 << (__GFP_DMA32 | __GFP_HIGHMEM) \
- | 1 << (__GFP_DMA | __GFP_DMA32 | __GFP_HIGHMEM) \
- | 1 << (__GFP_MOVABLE | __GFP_HIGHMEM | __GFP_DMA) \
- | 1 << (__GFP_MOVABLE | __GFP_DMA32 | __GFP_DMA) \
- | 1 << (__GFP_MOVABLE | __GFP_DMA32 | __GFP_HIGHMEM) \
- | 1 << (__GFP_MOVABLE | __GFP_DMA32 | __GFP_DMA | __GFP_HIGHMEM)\
-)
-
-static inline enum zone_type gfp_zone(gfp_t flags)
-{
- enum zone_type z;
- int bit = flags & GFP_ZONEMASK;
-
- z = (GFP_ZONE_TABLE >> (bit * ZONES_SHIFT)) &
- ((1 << ZONES_SHIFT) - 1);
-
- if (__builtin_constant_p(bit))
- MAYBE_BUILD_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
- else {
-#ifdef CONFIG_DEBUG_VM
- BUG_ON((GFP_ZONE_BAD >> bit) & 1);
+ if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
+ (__GFP_HIGHMEM | __GFP_MOVABLE))
+ return ZONE_MOVABLE;
+#ifdef CONFIG_HIGHMEM
+ if (flags & __GFP_HIGHMEM)
+ return ZONE_HIGHMEM;
#endif
- }
- return z;
+ return ZONE_NORMAL;
}
/*
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Wednesday 14 October 2009, Mel Gorman wrote:
> You highlighted the first akpm merge so lets look closer at that as I
> don't think there is anything more I can do with the wireless driver
> other than the suggestions made already.
Thanks a lot for that analysis Mel. I'll see if I can come up with
additional data based on the info you provide here.
Hi Mel,
On Wed, 2009-10-14 at 03:30 -0700, Mel Gorman wrote:
> From 5fb9f897117bf2701f9fdebe4d008dbe34358ab9 Mon Sep 17 00:00:00 2001
> From: Mel Gorman <[email protected]>
> Date: Wed, 14 Oct 2009 11:19:57 +0100
> Subject: [PATCH] iwlwifi: Suppress warnings related to GFP_ATOMIC allocations that do not matter
>
> iwlwifi refills RX buffers in two ways - a direct method using GFP_ATOMIC
> and a tasklet method using GFP_KERNEL. There are a number of RX buffers and
> there are only serious issues when there are no RX buffers left. The driver
> explicitly warns when refills are failing and the buffers are low but it
> always warns when a GFP_ATOMIC allocation fails even when there is no
> packet loss as a result.
No, it does not always warn when a GFP_ATOMIC allocation fails. Please
check earlier in iwl_rx_allocate() we have:
if (rxq->free_count > RX_LOW_WATERMARK)
priority |= __GFP_NOWARN;
So it will suppress warnings as long as we have buffers available.
We do want to see warnings if memory is below watermark and allocation
fails - your patch prevents these warnings from appearing.
> This patch specifies __GFP_NOWARN for the direct refill method that uses
> GFP_ATOMIC. To help identify where allocation failures might be coming
> from, the stack is dumped when the RX queue is dangerously low.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> drivers/net/wireless/iwlwifi/iwl-rx.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
> old mode 100644
> new mode 100755
> diff --git a/drivers/net/wireless/iwlwifi/iwl-rx.c b/drivers/net/wireless/iwlwifi/iwl-rx.c
> index 8e1bb53..f91a108 100644
> --- a/drivers/net/wireless/iwlwifi/iwl-rx.c
> +++ b/drivers/net/wireless/iwlwifi/iwl-rx.c
> @@ -260,10 +260,12 @@ void iwl_rx_allocate(struct iwl_priv *priv, gfp_t priority)
> if (net_ratelimit())
> IWL_DEBUG_INFO(priv, "Failed to allocate SKB buffer.\n");
> if ((rxq->free_count <= RX_LOW_WATERMARK) &&
> - net_ratelimit())
> + net_ratelimit()) {
> IWL_CRIT(priv, "Failed to allocate SKB buffer with %s. Only %u free buffers remaining.\n",
> priority == GFP_ATOMIC ? "GFP_ATOMIC" : "GFP_KERNEL",
> rxq->free_count);
> + dump_stack();
> + }
> /* We don't reschedule replenish work here -- we will
> * call the restock method and if it still needs
> * more buffers it will schedule replenish */
> @@ -320,7 +322,7 @@ EXPORT_SYMBOL(iwl_rx_replenish);
>
> void iwl_rx_replenish_now(struct iwl_priv *priv)
> {
> - iwl_rx_allocate(priv, GFP_ATOMIC);
> + iwl_rx_allocate(priv, GFP_ATOMIC|__GFP_NOWARN);
>
> iwl_rx_queue_restock(priv);
> }
On Wed, 2009-10-14 at 06:10 -0700, Frans Pop wrote:
> On Wednesday 14 October 2009, Mel Gorman wrote:
> > The majority of the wireless reports have been in
> > this driver and I think we have the problem commit there. The only other
> > is a firmware loading problem in e100 after resume, where an atomic
> > order-5 allocation fails.
>
> Not exactly true. Bartlomiej's report was about ipw2200, so there are at
> least 3 different drivers involved, two wireless and one wired. Besides
> that one report is related to heavy swap, one to resume and one to driver
> reload.
Another report arrived today. Please see
http://thread.gmane.org/gmane.linux.kernel.wireless.general/40858 - it
is an order-5 allocation failure during driver reload.
Reinette
On Wed, Oct 14, 2009 at 09:28:00AM -0700, reinette chatre wrote:
> Hi Mel,
>
> On Wed, 2009-10-14 at 03:30 -0700, Mel Gorman wrote:
> > From 5fb9f897117bf2701f9fdebe4d008dbe34358ab9 Mon Sep 17 00:00:00 2001
> > From: Mel Gorman <[email protected]>
> > Date: Wed, 14 Oct 2009 11:19:57 +0100
> > Subject: [PATCH] iwlwifi: Suppress warnings related to GFP_ATOMIC allocations that do not matter
> >
> > iwlwifi refills RX buffers in two ways - a direct method using GFP_ATOMIC
> > and a tasklet method using GFP_KERNEL. There are a number of RX buffers and
> > there are only serious issues when there are no RX buffers left. The driver
> > explicitly warns when refills are failing and the buffers are low but it
> > always warns when a GFP_ATOMIC allocation fails even when there is no
> > packet loss as a result.
>
>
> No, it does not always warn when a GFP_ATOMIC allocation fails. Please
> check earlier in iwl_rx_allocate() we have:
>
> if (rxq->free_count > RX_LOW_WATERMARK)
> priority |= __GFP_NOWARN;
>
> So it will suppress warnings as long as we have buffers available.
>
> We do want to see warnings if memory is below watermark and allocation
> fails - your patch prevents these warnings from appearing.
>
Yeah, the patch is balls and is not the way forward.
What is your take on GFP_ATOMIC-direct depleting the pool before the tasklet
can refill it with GFP_KERNEL? Should direct allocation be falling back to
calling with GFP_KERNEL when the pool has been depleted instead of failing?
--
Mel Gorman
Some initial results; all negative I'm afraid.
On Wednesday 14 October 2009, Mel Gorman wrote:
> This is what I found. The following were the possible commits that might
> be causing the problem.
> 56e49d2..f166777 -- reclaim
>     I would have considered these strong candidates except, again, the
>     last good commit happened after this point. If other obvious
>     candidates don't crop up, it might be worth double checking
>     within this range, particularly commit 56e49d2 vmscan: evict
>     use-once pages first as it is targeted at streaming-IO workloads
>     which would include your music workload.
Reverted 56e49d2 on top of .31.1; no change.
> 5c87ead..e9bb35d -- inactive ratio changes
>     These patches should be harmless but just in case, please
>     compare the output of
>     # grep inactive_ratio /proc/zoneinfo
>     on 2.6.30 and 2.6.31 and make sure the ratios are the same.
The same for both (and for .32). DMA: 1; DMA32: 3
>     Commit b70d94e altered how zonelists were selected during
>     allocation. This was tested fairly heavily but if the testing
>     missed something, it would mean that some allocations are not
>     using the zones they should be.
Reverted on top of .31.1; no change.
>     Commit bc75d33 is totally harmless but it mentions
>     min_free_kbytes. I checked on my machine to make sure
>     min_free_kbytes was the same on both 2.6.30 and 2.6.31. Can you
>     check that this is true for your machine? If min_free_kbytes
>     decreased, it could explain GFP_ATOMIC failures.
Virtually identical. .30: 5704; .31/.32: 5711
> After this point, your analysis indicates that things are already broken
> but let's look at some of the candidates anyway. Out of curiosity,
> was CONFIG_UNEVICTABLE_LRU unset in your .config for 2.6.30? I could
> only find your 2.6.31 .config. If it was, it might be worth reverting
> 6837765963f1723e80ca97b1fae660f3a60d77df and unsetting it in 2.6.31 and
> seeing what happens.
CONFIG_UNEVICTABLE_LRU was set and during bisections I've always accepted
the default, which was "y".
> Commit ee993b135ec75a93bd5c45e636bb210d2975159b altered how lumpy
> reclaim works but it should have been harmless. It does not cleanly
> revert but it's easy to manually revert.
Reverted on top of .31.1; no change.
I'll do some more digging in the 'akpm' merge.
On Wed, 2009-10-14 at 09:50 -0700, Mel Gorman wrote:
> What is your take on GFP_ATOMIC-direct depleting the pool before the tasklet
> can refill it with GFP_KERNEL?
I am not sure I understand your question. We attempt to reclaim a
received buffer on every receive, and with a queue size of 256 + 64 we
assume we have a pretty big buffer to deal with cases when allocations
fail. So, technically, for us to get into this situation where we start
seeing these allocation failures, there would have been more than 200
times in which GFP_ATOMIC allocations failed that we did _not_ see, since
we only see those warnings when there are 8 or fewer free buffers
remaining. More on this below ...
> Should direct allocation be falling back to
> calling with GFP_KERNEL when the pool has been depleted instead of failing?
This is the intention of the current implementation. In the tasklet we
run iwl_rx_replenish_now(), which attempts the GFP_ATOMIC allocations
first by calling iwl_rx_allocate() with the GFP_ATOMIC flag. No
particular action is taken when this fails (apart from the error
message), but if the buffers are running low then iwl_rx_queue_restock()
(which is also called from iwl_rx_replenish_now()) will queue work that
will do the allocation with GFP_KERNEL.
We do queue the GFP_KERNEL allocations when there are only a few buffers
remaining in the queue (8 right now) ... maybe we can make this higher?
I am not sure if this will help in what you are trying to figure out
here, but would it help to play with the numbers here? That is, in
iwl_rx_queue_restock() we have:
	if (rxq->free_count <= RX_LOW_WATERMARK)
		queue_work(priv->workqueue, &priv->rx_replenish);
Would it help here to make that value higher? Maybe queue the GFP_KERNEL
allocation when there are, for example, 50 or 100 free buffers
remaining?
Reinette
On Wednesday 14 October 2009, reinette chatre wrote:
> We do queue the GFP_KERNEL allocations when there are only a few buffers
> remaining in the queue (8 right now) ...
Are you sure of this? I have zero messages in my logs about allocation
failures with GFP_KERNEL, but I do have plenty with "Only 0 free buffers
remaining" with GFP_ATOMIC.
Does that indicate a bug or could they fall under the ratelimit somehow?
Or do I misunderstand the logic?
On Wed, 2009-10-14 at 14:33 -0700, Frans Pop wrote:
> On Wednesday 14 October 2009, reinette chatre wrote:
> > We do queue the GFP_KERNEL allocations when there are only a few buffers
> > remaining in the queue (8 right now) ...
>
> Are you sure of this? I have zero messages in my logs about allocation
> failures with GFP_KERNEL, but I do have plenty with "Only 0 free buffers
> remaining" with GFP_ATOMIC.
That does make sense to me. We do not expect allocations with GFP_KERNEL
to fail. Given my understanding of how things work, I suspect the
following scenario:
* start with system low on available memory
* now introduce incoming traffic (causing the RX code to run)
* upon receipt of frame we attempt an allocation (to reclaim the buffer)
with GFP_ATOMIC (state: num RX buffer free > watermark)
* this fails since memory is not available
* num RX buffer free reduces
* does _not_ queue replenishment of buffers with GFP_KERNEL
* repeat above until we hit the watermark (currently 8)
* upon receipt of frame we attempt an allocation (to reclaim the buffer)
with GFP_ATOMIC (state: num RX buffer free <= watermark)
* this fails (now user sees big warning)
* queue replenishment of buffers with GFP_KERNEL
Essentially what I suspect could happen is that
we do attempt to replenish the buffers with GFP_KERNEL after several
failures with GFP_ATOMIC, but at that point we have already run out
completely.
One way to test this theory is to queue the GFP_KERNEL allocation
earlier, while we still have a significant number of RX buffers
available; 8 may turn out to be too small.
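A rough way to see why the watermark matters: if every GFP_ATOMIC refill fails, the pool shrinks by one buffer per received frame, and the GFP_KERNEL work is only queued once the count reaches the watermark. The sketch below (a hypothetical function; the initial pool size is an illustrative parameter, e.g. the 64 of RX_FREE_BUFFERS) counts how many buffers are consumed before that happens. Whatever is left, at most the watermark itself, is the headroom the work item has to run before the pool is empty.

```c
#include <assert.h>

/* Hypothetical simulation of the scenario above. Every received frame
 * consumes one free RX buffer and its GFP_ATOMIC refill fails, so the
 * pool only shrinks. Returns how many buffers are consumed (each one a
 * failed atomic refill) before the GFP_KERNEL refill work is queued. */
static int frames_until_refill_queued(int initial_free, int watermark)
{
	int free_count = initial_free;
	int frames = 0;

	assert(initial_free >= 0 && watermark >= 0);
	while (free_count > watermark) {
		free_count--;  /* buffer handed up the stack, not replaced */
		frames++;      /* GFP_ATOMIC refill attempted and failed */
	}
	/* free_count <= watermark here: the refill work gets queued */
	return frames;
}
```

With 64 buffers, a watermark of 8 lets 56 atomic failures go by before the refill is queued and leaves 8 buffers of headroom; a watermark of 50 queues it after 14 and leaves 50.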
> Does that indicate a bug or could they fall under the ratelimit somehow?
In your kernel log I do see that the driver's error messages related to
GFP_ATOMIC are rate limited (we see many more "order-2 allocation
failure" messages than the "Failed to allocate" messages). All of these
allocation failures are from the "replenish_now" code though, which is
GFP_ATOMIC. So even though we do not see the "Failed to allocate" errors
(which are rate limited) it seems that all allocation failures are from
that (the GFP_ATOMIC) code.
Reinette
On Wed, Oct 14, 2009 at 08:34:56PM +0200, Frans Pop wrote:
> Some initial results; all negative I'm afraid.
>
These are highly unlikely candidates. I say highly unlikely because they
were merged before the page allocator patches, at a point where your
analysis indicated things were OK.
Commit 70ac23c readahead: sequential mmap readahead
        This affects readahead for mmap() and could have an impact on the
        number of allocations made by the streaming IO. This might be
        generating more bursty network traffic in 2.6.31 than in 2.6.30,
        affecting the allocation pattern enough to cause problems.
Commit 2fad6f5 readahead: enforce full readahead size on async mmap readahead
        Another readahead change that may affect the rate of network
        traffic being generated when streaming IO over the network.
Commit 10be0b3 readahead: introduce context readahead algorithm
        By using readahead in more situations, it again may be affecting
        the burst rate of network traffic and the rate of GFP_ATOMIC arrivals.
Commit 78dc583 vmscan: low order lumpy reclaim also should use PAGEOUT_IO_SYNC
        Very low probability that this is a problem, but it affects
        lumpy reclaim and so has to be considered. It's an awkward
        revert, but I think the most important part is just to revert the
        condition that checks whether congestion_wait() should be called.
I also re-examined the page allocator patches themselves, just in case. Of
the patches in there, I came up with:
Commit 11e33f6 page allocator: break up the allocator entry point into fast and slow paths
        This is possibly the most disruptive patch in the set. It should
        not have affected behaviour, but the complexity of the patch is
        quite high. I did spot an oddity whereby an exiting process making
        a __GFP_NOFAIL allocation can ignore watermarks. It's unlikely
        this is the problem, but as the journal layer uses __GFP_NOFAIL,
        you never know; it might be pushing things down low enough for
        other watermark checks to fail. Patch is below. This is also the
        patch that causes kswapd to wake up less. I sent a patch for that
        problem, but I still don't know if it reduced the number of
        failures for you or not.
Commit f2260e6 page allocator: update NR_FREE_PAGES only as necessary
        This patch affects the timing of when NR_FREE_PAGES is updated.
        The reclaim algorithm makes decisions based on this NR_FREE_PAGES
        value. Crucially, the value can determine whether the anon list is
        force-scanned or not. The window during which this can make a
        difference should be extremely small, but maybe it's enough to matter.
Outside the range of commits suspected of causing problems was the
following. It's an extremely low-probability candidate:
Commit 8aa7e84 Fix congestion_wait() sync/async vs read/write confusion
        This patch alters the call to congestion_wait() in the page
        allocator. Frankly, I don't get the change, but it might be worth
        checking whether replacing BLK_RW_ASYNC with WRITE on top of 2.6.31
        makes any difference.
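One reading of what 8aa7e84 changes, sketched in C under stated assumptions: in 2.6.31, READ/WRITE are believed to be 0/1 (include/linux/fs.h) and BLK_RW_ASYNC/BLK_RW_SYNC 0/1 (include/linux/backing-dev.h), and congestion_wait() is assumed to use its first argument directly to index a two-entry array of wait queues. If so, the old congestion_wait(WRITE, ...) call numerically selected the sync queue, while the BLK_RW_ASYNC call selects the async one, so "replacing BLK_RW_ASYNC with WRITE" would restore the old wait. The MODEL_ names are hypothetical stand-ins; the values should be checked against the tree.

```c
#include <assert.h>

/* Assumed 2.6.31 values: READ/WRITE from include/linux/fs.h and
 * BLK_RW_ASYNC/BLK_RW_SYNC from include/linux/backing-dev.h. */
enum { MODEL_READ = 0, MODEL_WRITE = 1 };
enum { MODEL_BLK_RW_ASYNC = 0, MODEL_BLK_RW_SYNC = 1 };

/* The two congestion wait queues, indexed the same way. */
enum waitq_model { WAITQ_ASYNC = MODEL_BLK_RW_ASYNC, WAITQ_SYNC = MODEL_BLK_RW_SYNC };

/* Which queue a congestion_wait(arg, ...) caller would sleep on,
 * assuming the argument is used directly as the array index. */
static enum waitq_model congestion_queue_for(int arg)
{
	assert(arg == 0 || arg == 1);
	return arg == MODEL_BLK_RW_SYNC ? WAITQ_SYNC : WAITQ_ASYNC;
}
```

Under these assumptions, the pre-8aa7e84 call congestion_queue_for(MODEL_WRITE) yields WAITQ_SYNC, while the post-8aa7e84 congestion_queue_for(MODEL_BLK_RW_ASYNC) yields WAITQ_ASYNC.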
After a lot more eyeballing, the best next candidate within mm is the
following patch. It should be tested on its own and in combination with
the wakeup-kswapd patch sent before.
====
From 4e8b5217f51a00caee527e4e8d8e46fe9f82b482 Mon Sep 17 00:00:00 2001
From: Mel Gorman <[email protected]>
Date: Thu, 15 Oct 2009 00:17:05 +0100
Subject: [PATCH] page allocator: Direct reclaim should always obey watermarks
ALLOC_NO_WATERMARKS should be cleared when trying to allocate from the
free-lists after a direct reclaim.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3694609..619933d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1920,7 +1920,8 @@ rebalance:
page = __alloc_pages_direct_reclaim(gfp_mask, order,
zonelist, high_zoneidx,
nodemask,
- alloc_flags, preferred_zone,
+ alloc_flags & ~ALLOC_NO_WATERMARKS,
+ preferred_zone,
migratetype, &did_some_progress);
if (page)
goto got_pg;
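What the patch changes can be illustrated with a small bit-mask model. The flag values and may_allocate() below are assumptions for illustration (a stand-in for the free-list watermark check in mm/page_alloc.c), not the kernel's definitions. The point is that `alloc_flags & ~ALLOC_NO_WATERMARKS` clears only that one bit, so the retry after direct reclaim goes back to honouring watermarks while every other flag is preserved.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; values are assumptions, not copied from the tree. */
#define MODEL_ALLOC_NO_WATERMARKS 0x04u  /* skip the watermark check */
#define MODEL_ALLOC_HARDER        0x10u  /* some other, unrelated flag */

/* Hypothetical stand-in for the free-list watermark check. */
static bool may_allocate(unsigned int alloc_flags, long free_pages, long min_wmark)
{
	assert(min_wmark >= 0);
	if (alloc_flags & MODEL_ALLOC_NO_WATERMARKS)
		return true;  /* watermarks ignored entirely */
	return free_pages > min_wmark;
}

/* The change in the patch: clear only the NO_WATERMARKS bit before the
 * post-reclaim retry, leaving every other flag untouched. */
static unsigned int flags_for_post_reclaim_retry(unsigned int alloc_flags)
{
	return alloc_flags & ~MODEL_ALLOC_NO_WATERMARKS;
}
```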
On Wednesday 14 October 2009, reinette chatre wrote:
> We do queue the GFP_KERNEL allocations when there are only a few buffers
> remaining in the queue (8 right now) ... maybe we can make this higher?
I've tried increasing it to 50. Here's the result for a single test:
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 25 free buffers remaining.
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 48 free buffers remaining.
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 48 free buffers remaining.
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 48 free buffers remaining.
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
__ratelimit: 1 callbacks suppressed
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
__ratelimit: 97 callbacks suppressed
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 44 free buffers remaining.
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
This is with current mainline (v2.6.32-rc4-149-ga3ccf63).
The log file timestamps don't tell much as the logging gets delayed,
so they all end up at the same time. Maybe I should enable the kernel
timestamps so we can see how far apart these failures are.
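If kernel timestamps are enabled (CONFIG_PRINTK_TIME, or, assuming it is available on this kernel, the printk.time=1 boot parameter), the gaps between failures can be read straight from dmesg. A small C helper for that; printk_stamp() and stamp_gap() are hypothetical names, and the only assumption is that each line starts with a "[ seconds.micros]" prefix.

```c
#include <assert.h>
#include <stdio.h>

/* Extracts the printk timestamp (seconds) from a line such as
 * "[  232.845116] iwlagn ...". Returns 1 on success, 0 if the line
 * has no leading timestamp. */
static int printk_stamp(const char *line, double *secs)
{
	assert(line != NULL && secs != NULL);
	return sscanf(line, " [ %lf]", secs) == 1;
}

/* Seconds elapsed between two timestamped lines, or -1.0 if either
 * line lacks a timestamp. */
static double stamp_gap(const char *earlier, const char *later)
{
	double a, b;

	if (!printk_stamp(earlier, &a) || !printk_stamp(later, &b))
		return -1.0;
	return b - a;
}
```

For two lines stamped 232.845116 and 240.124006, stamp_gap() returns roughly 7.279 seconds.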
On Wed, 2009-10-14 at 19:02 -0700, Frans Pop wrote:
> On Wednesday 14 October 2009, reinette chatre wrote:
> > We do queue the GFP_KERNEL allocations when there are only a few buffers
> > remaining in the queue (8 right now) ... maybe we can make this higher?
>
> I've tried increasing it to 50. Here's the result for a single test:
> iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 25 free buffers remaining.
> iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 48 free buffers remaining.
> iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
> iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
> iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 48 free buffers remaining.
> iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
> iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 48 free buffers remaining.
> iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
> __ratelimit: 1 callbacks suppressed
> iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
> iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
> iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
> iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
> iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
> __ratelimit: 97 callbacks suppressed
> iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 44 free buffers remaining.
> iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
> iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
>
> This is with current mainline (v2.6.32-rc4-149-ga3ccf63).
> The log file timestamps don't tell much as the logging gets delayed,
> so they all end up at the same time. Maybe I should enable the kernel
> timestamps so we can see how far apart these failures are.
If you can get accurate timing it will be very useful. I am interested
to see how quickly it goes from "48 free buffers" to "0 free buffers".
Thank you
Reinette
On Thursday 15 October 2009, reinette chatre wrote:
> > The log file timestamps don't tell much as the logging gets delayed,
> > so they all end up at the same time. Maybe I should enable the kernel
> > timestamps so we can see how far apart these failures are.
>
> If you can get accurate timing it will be very useful. I am interested
> to see how quickly it goes from "48 free buffers" to "0 free buffers".
Attached is the dmesg for three consecutive test runs (i.e. without
rebooting). Note that the 2nd one includes only "0 free buffers" messages,
even though the behavior (the point where the desktop freezes and music
stops) looked similar.
Not sure if you can tell all that much from the data.
N.B. You may want to clean this up in iwlwifi code:
iwl-dev.h:#include "iwl-fh.h"
iwl-dev.h:#define RX_LOW_WATERMARK 8
iwl-fh.h:#define RX_LOW_WATERMARK 8
I.e. RX_LOW_WATERMARK is defined in iwl-dev.h even though that file includes
iwl-fh.h, where it is also defined. The same may be true for other defines.
I think this gave me an incorrect result the first time I increased the
limit, as I only changed one of the two files (iwl-dev.h IIRC).
Cheers,
FJP
On Thursday 15 October 2009, Mel Gorman wrote:
> After a lot more eyeballing, the best next candidate within mm is the
> following patch. Should be tested on it's own and in combination with
> the wakeup-kswapd patch sent before.
>
> From 4e8b5217f51a00caee527e4e8d8e46fe9f82b482 Mon Sep 17 00:00:00 2001
> From: Mel Gorman <[email protected]>
> Date: Thu, 15 Oct 2009 00:17:05 +0100
> Subject: [PATCH] page allocator: Direct reclaim should always obey
> watermarks
>
> ALLOC_NO_WATERMARKS should be cleared when trying to allocate from the
> free-lists after a direct reclaim.
I've tested the two patches together and this seems like a definite
improvement. I still get SKB errors on the first test, but the desktop
freezes are a lot shorter and the total time needed to load the 3rd gitk
goes down from ~2:15 to ~1:15. The counter in gitk of the number of
loaded commits goes up visibly faster and with fewer halts.
This is on top of current mainline with the RX_LOW_WATERMARK in iwlagn
at its current value (8).
Here are the allocation failures for 2 consecutive tests. Note that the
first test still shows quite a lot of failures, but the second test hardly
had any at all (I still had music skips though).
[ 232.845116] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
[ 232.845116] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
[ 232.873009] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
[ 232.884545] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
[ 240.121577] __ratelimit: 26 callbacks suppressed
[ 240.121634] __ratelimit: 6 callbacks suppressed
[ 240.124006] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 6 free buffers remaining.
[ 304.335767] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
[ 304.335767] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
[ 304.374729] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
[ 309.446481] __ratelimit: 5 callbacks suppressed
[ 309.450197] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
[ 525.912934] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 5 free buffers remaining.
[ 525.953939] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 7 free buffers remaining.
[ 536.058171] __ratelimit: 1 callbacks suppressed
Thanks,
FJP
On Thu, Oct 15, 2009 at 10:15:09PM +0200, Frans Pop wrote:
> On Thursday 15 October 2009, Mel Gorman wrote:
> > After a lot more eyeballing, the best next candidate within mm is the
> > following patch. Should be tested on it's own and in combination with
> > the wakeup-kswapd patch sent before.
> >
> > From 4e8b5217f51a00caee527e4e8d8e46fe9f82b482 Mon Sep 17 00:00:00 2001
> > From: Mel Gorman <[email protected]>
> > Date: Thu, 15 Oct 2009 00:17:05 +0100
> > Subject: [PATCH] page allocator: Direct reclaim should always obey
> > watermarks
> >
> > ALLOC_NO_WATERMARKS should be cleared when trying to allocate from the
> > free-lists after a direct reclaim.
>
> I've tested the two patches together and this seems like a definite
> improvement.
You probably don't need the mental image, but this made me do a happy
dance. Certainly helps my cold!
> I still get SKB errors on the first test, but the desktop
> freezes are a lot shorter and the total time needed to load the 3rd gitk
> goes down from ~2:15 to ~1:15. The counter in gitk of the number of
> loaded commits goes up visibly faster and with fewer halts.
>
This brings us close to the state of affairs before the akpm merge.
There might still be something missing in either the mm area or the wireless
driver but any improvement is better than none.
> This is on top of current mainline with the RX_LOW_WATERMARK in iwlagn
> at it's current value (8).
>
> Here are the allocation failures for 2 consecutive tests. Note that the
> first test still shows quite a lot of failures, but the second test hardly
> had any at all (I still had music skips though).
>
So, we are still dealing with three problems.
1. GFP_ATOMIC allocations were introduced to the wireless driver in the
   2.6.30..2.6.31 timeframe. The problem has been more or less identified
   as the tasklet off-loading and the pools being depleted too easily.
   This still needs to be fixed.
2. There is also a firmware reloading problem of unknown source.
3. There was an mm regression that made GFP_ATOMIC failures much worse.
   This is at least partially due to exiting tasks being able to go below
   the watermarks and kswapd not being woken up when it should be. This
   could be the source of the allocation failures on resume that have
   nothing to do with the iwlagn wireless driver.
I am going to put together the pair of patches against mainline with a
recommendation that they also be applied to 2.6.31.5. I'll keep looking to
see if I can spot another possible candidate for GFP_ATOMIC failures being
worse than they were.
> [ 232.845116] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
> [ 232.845116] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
> [ 232.873009] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
> [ 232.884545] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
> [ 240.121577] __ratelimit: 26 callbacks suppressed
> [ 240.121634] __ratelimit: 6 callbacks suppressed
> [ 240.124006] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 6 free buffers remaining.
> [ 304.335767] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
> [ 304.335767] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
> [ 304.374729] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
> [ 309.446481] __ratelimit: 5 callbacks suppressed
> [ 309.450197] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 0 free buffers remaining.
>
> [ 525.912934] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 5 free buffers remaining.
> [ 525.953939] iwlagn 0000:10:00.0: Failed to allocate SKB buffer with GFP_ATOMIC. Only 7 free buffers remaining.
> [ 536.058171] __ratelimit: 1 callbacks suppressed
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Thu, 2009-10-15 at 12:41 -0700, Frans Pop wrote:
> On Thursday 15 October 2009, reinette chatre wrote:
> > > The log file timestamps don't tell much as the logging gets delayed,
> > > so they all end up at the same time. Maybe I should enable the kernel
> > > timestamps so we can see how far apart these failures are.
> >
> > If you can get accurate timing it will be very useful. I am interested
> > to see how quickly it goes from "48 free buffers" to "0 free buffers".
>
> Attached the dmesg for three consecutive test runs (i.e. without
> rebooting). Not that the 2nd one includes only "0 free buffers" messages,
> even though the behavior (point where desktop freezes and music stops)
> looked similar.
Thank you very much. I am studying it.
> Not sure if you can tell all that much from the data.
>
> N.B. You may want to clean this up in iwlwifi code:
> iwl-dev.h:#include "iwl-fh.h"
> iwl-dev.h:#define RX_LOW_WATERMARK 8
> iwl-fh.h:#define RX_LOW_WATERMARK 8
>
> I.e: RX_LOW_WATERMARK is defined in iwl-dev.h even though that includes
> iwl-fh.h where it's also defined. The same may be true for other defines.
Sorry about that. The patch below will fix that. I will send it
separately to wireless list.
From 7cc8e6482b359eef5ce099457037a237d355b5b1 Mon Sep 17 00:00:00 2001
From: Reinette Chatre <[email protected]>
Date: Fri, 16 Oct 2009 10:11:10 -0700
Subject: [PATCH] iwlwifi: remove duplicate defines
RX_FREE_BUFFERS and RX_LOW_WATERMARK are currently defined in four places.
Based on how files are included we only need the definition in iwl-fh.h
Signed-off-by: Reinette Chatre <[email protected]>
Reported-by: Frans Pop <[email protected]>
---
drivers/net/wireless/iwlwifi/iwl-3945-hw.h | 6 ------
drivers/net/wireless/iwlwifi/iwl-3945.h | 6 ------
drivers/net/wireless/iwlwifi/iwl-dev.h | 6 ------
3 files changed, 0 insertions(+), 18 deletions(-)
diff --git a/drivers/net/wireless/iwlwifi/iwl-3945-hw.h b/drivers/net/wireless/iwlwifi/iwl-3945-hw.h
index ccdac69..6fd10d4 100644
--- a/drivers/net/wireless/iwlwifi/iwl-3945-hw.h
+++ b/drivers/net/wireless/iwlwifi/iwl-3945-hw.h
@@ -248,12 +248,6 @@ struct iwl3945_eeprom {
#define TFD_CTL_PAD_SET(n) (n << 28)
#define TFD_CTL_PAD_GET(ctl) (ctl >> 28)
-/*
- * RX related structures and functions
- */
-#define RX_FREE_BUFFERS 64
-#define RX_LOW_WATERMARK 8
-
/* Sizes and addresses for instruction and data memory (SRAM) in
* 3945's embedded processor. Driver access is via HBUS_TARG_MEM_* regs. */
#define IWL39_RTC_INST_LOWER_BOUND (0x000000)
diff --git a/drivers/net/wireless/iwlwifi/iwl-3945.h b/drivers/net/wireless/iwlwifi/iwl-3945.h
index f3907c1..84fa0d7 100644
--- a/drivers/net/wireless/iwlwifi/iwl-3945.h
+++ b/drivers/net/wireless/iwlwifi/iwl-3945.h
@@ -130,12 +130,6 @@ struct iwl3945_frame {
#define SN_TO_SEQ(ssn) (((ssn) << 4) & IEEE80211_SCTL_SEQ)
#define MAX_SN ((IEEE80211_SCTL_SEQ) >> 4)
-/*
- * RX related structures and functions
- */
-#define RX_FREE_BUFFERS 64
-#define RX_LOW_WATERMARK 8
-
#define SUP_RATE_11A_MAX_NUM_CHANNELS 8
#define SUP_RATE_11B_MAX_NUM_CHANNELS 4
#define SUP_RATE_11G_MAX_NUM_CHANNELS 12
diff --git a/drivers/net/wireless/iwlwifi/iwl-dev.h b/drivers/net/wireless/iwlwifi/iwl-dev.h
index 1378654..0fa0cf5 100644
--- a/drivers/net/wireless/iwlwifi/iwl-dev.h
+++ b/drivers/net/wireless/iwlwifi/iwl-dev.h
@@ -406,12 +406,6 @@ struct iwl_host_cmd {
u8 id;
};
-/*
- * RX related structures and functions
- */
-#define RX_FREE_BUFFERS 64
-#define RX_LOW_WATERMARK 8
-
#define SUP_RATE_11A_MAX_NUM_CHANNELS 8
#define SUP_RATE_11B_MAX_NUM_CHANNELS 4
#define SUP_RATE_11G_MAX_NUM_CHANNELS 12
--
1.5.6.3
Hi Frans,
On Thu, 2009-10-15 at 12:41 -0700, Frans Pop wrote:
> On Thursday 15 October 2009, reinette chatre wrote:
> > > The log file timestamps don't tell much as the logging gets delayed,
> > > so they all end up at the same time. Maybe I should enable the kernel
> > > timestamps so we can see how far apart these failures are.
> >
> > If you can get accurate timing it will be very useful. I am interested
> > to see how quickly it goes from "48 free buffers" to "0 free buffers".
>
> Attached the dmesg for three consecutive test runs (i.e. without
> rebooting). Not that the 2nd one includes only "0 free buffers" messages,
> even though the behavior (point where desktop freezes and music stops)
> looked similar.
>
> Not sure if you can tell all that much from the data.
>
Prompted by this thread, we are in the process of moving the RX allocation
to paged skbs. This will definitely reduce the allocation size (from order 2
to order 1) and hopefully help with this problem as well. Could you please
try the attached two patches? They are based on 2.6.32-rc4.
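For scale: an order-n allocation needs 2^n physically contiguous pages, so with 4 KiB pages order 2 means a 16 KiB run and order 1 an 8 KiB run. The helper below (a hypothetical name, with the page size as a parameter) computes the order that a single contiguous buffer of a given size rounds up to. A buffer even slightly over 8 KiB, for example an 8 KiB receive area plus some per-buffer overhead (an assumed figure), already lands in order 2.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order n such that (page_size << n) >= size, i.e. the
 * buddy-allocator block a single contiguous allocation needs.
 * The page size is a parameter; 4096 bytes is assumed for x86. */
static unsigned int alloc_order(size_t size, size_t page_size)
{
	unsigned int order = 0;

	assert(page_size > 0 && size > 0);
	while ((page_size << order) < size)
		order++;
	return order;
}
```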
Thank you very much
Reinette
Another long mail, sorry.
On Wednesday 14 October 2009, Frans Pop wrote:
> > There still has not been a mm-change identified that makes
> > fragmentation significantly worse.
>
> My bisection shows a very clear point, even if not an individual commit,
> in the 'akpm' merge where SKB errors suddenly become *much* more
> frequent and easy to trigger.
> I'm sorry to say this, but the fact that nothing has been identified yet
> is IMO the result of a lack of effort, not because there is no such
> change.
I was wrong. It turns out that I was creating the variations in the test
results around the akpm merge myself by tiny changes in the way I ran the
tests. It took another round of about 30 compilations and tests purely in
this range to show that, but those same tests also made me aware of other
patterns I should look at.
Until a few days ago I was concentrating on "do I see SKB allocation errors
or not". Since then I've also been looking more consciously at when they
happen, at disk access patterns and at desktop freeze patterns.
I think I did mention before that this whole issue is rather subtle :-/
So, my apologies for fingering the wrong area for so long, but it looked
solid given the information available at the time.
On Thursday 15 October 2009, Mel Gorman wrote:
> Outside the range of commits suspected of causing problems was the
> following. It's extremely low probability
>
> Commit 8aa7e84 Fix congestion_wait() sync/async vs read/write confusion
> This patch alters the call to congestion_wait() in the page
> allocator. Frankly, I don't get the change but it might worth
> checking if replacing BLK_RW_ASYNC with WRITE on top of 2.6.31
> makes any difference
This is the real culprit. Mel: thanks very much for looking beyond the
area I identified. Your overview of mm changes was exactly what I needed
and really helped a lot during my later tests.
This commit definitely causes most of the problems; confirmed by reverting
it on top of 2.6.31 (also requires reverting 373c0a7e, which is a later
build fix).
The rest of this mail gives details on my tests and how I reached the above
conclusion.
TEST BASELINE (2.6.30)
======================
I mentioned in an earlier mail that I run three instances of gitk for my
tests. Loading gitk seems to consist of 3 phases:
1) general initial scan of the repository (branches?)
2) reading commits: commit counter increases
3) reading references (including bisection good/bad points) and
uncommitted changes
Below are the times and comments per phase when the test is run with 2.6.30.
As my test starts after a clean boot, buffers are mostly empty.
1st instance: 'gitk v2.6.29..master' (preparation)
1) ~20 seconds; user interface is mostly blank
2) ~5 seconds to read 35,000 commits; user interface is updated and the
   counter increases steadily as they are read
3) ~10 seconds; "branch"/"follows"/"precedes" info and tags are filled
   in; fairly heavy disk activity
2nd instance: 'gitk master' (preparation)
1) 0 seconds (because data is already buffered)
2) ~25 seconds to read 167,500 commits; counter increases steadily
3) 1-2 seconds (because data is already buffered)
3rd instance: 'gitk master' (the actual test)
1) 0 seconds (because data is already buffered)
2) ~55 seconds due to swapping overhead; minor music skip around commit
   110,000; counter slower after 90,000, some short halts, but generally
   increases steadily; moderate disk activity
3) ~55-60 seconds; because buffers have been emptied, data must be read
   again, with swapping; very heavy disk activity; fairly long music
   skip (15-20 seconds), but no SKB allocation errors
So, loading the 3rd instance takes about 1.5 minutes longer than the
2nd because of the swapping, and phase 3) is the most affected by it.
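The "1.5 minutes" follows from the phase timings above. Keeping each estimate as a range, the arithmetic can be sketched as follows (struct span and its helpers exist only for this illustration).

```c
#include <assert.h>

/* A duration estimate in seconds, kept as a [lo, hi] range. */
struct span { int lo, hi; };

static struct span add_spans(struct span a, struct span b)
{
	struct span s = { a.lo + b.lo, a.hi + b.hi };
	return s;
}

/* Widest possible range for a - b. */
static struct span sub_spans(struct span a, struct span b)
{
	assert(a.lo <= a.hi && b.lo <= b.hi);
	struct span s = { a.lo - b.hi, a.hi - b.lo };
	return s;
}
```

The 2nd instance sums to 26-27 seconds and the 3rd to 110-115 seconds, a difference of 83-89 seconds, i.e. roughly 1.5 minutes. The 3rd-instance total is consistent with the ~1:55 mentioned later.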
AFTER WIRELESS CHANGE
=====================
After commit 4752c93c30 ("iwlcore: Allow skb allocation from tasklet") I
start getting the SKB errors. They can be triggered reliably if the whole
test is repeated 1 or 2 times, but generally not the first time the test
is run.
Or so I thought for a long time.
It turns out that I will get SKB errors during the first run if I'm
"sloppy" in the test execution. For example if I wait too long before
switching from the last gitk instance to konsole where I have
a 'tail -f /var/log/kern.log' running.
Another factor is the state of the repository: do I have master checked
out, or an older branch, or am I in the middle of a bisection. This
influences how data is read from the disk and thus the test results.
A last factor may be the size of the kernel I'm using: my test/bisect
kernel is significantly smaller than my regular kernel.
If the test is run completely cleanly, I will not get SKB errors during the
first run. Also, this change does not affect the timings of the test at
all: the total load time of the 3rd instance is still ~1:55 and music
skips happen in roughly the same places. The pattern of disk activity also
remains unchanged.
If I do *not* run the test cleanly, any SKB errors during the first test
run will always be during phase 3), never during phase 2). This is what I
saw during tests in the 'akpm' range, and explains the inconsistent
results there.
After discovering this, I made a copy of the git repo so that I always
test using the exact same state, and I tightened my test procedure.
AFTER congestion_wait CHANGE
============================
If I test commit 9f2d8be, which is just before the congestion_wait()
change, I still get the same pattern as described above. But when I test
with 8aa7e84 ("Fix congestion_wait() sync/async vs read/write confusion"),
things change dramatically when the 3rd gitk instance is started.
During the 2nd phase I see the first SKB allocation errors, with a music
skip between reading commits 95,000 and 110,000.
Around commit 115,000 there is a very long pause during which the counter
does not increase, music stops and the desktop freezes completely. During
the first 30 seconds of that freeze there is only very low disk activity
(which seems strange); during the next 25 seconds there is suddenly very
high disk activity, during which things gradually unfreeze and more SKB
errors are displayed. After that the commit counter runs up fairly
steadily again.
Phase 2) ends at ~1:45. Phase 3) (with more SKB errors) ends at ~2:05.
So this change almost doubles the time needed for phase 2) and causes SKB
allocation errors to occur during that phase. Also, before this commit the
desktop freezes are much shorter and less severe. With this change the
desktop is completely unusable for almost a minute during phase 2), with
even the mouse pointer frozen solid.
Note that phase 3) becomes shorter, but that the total time needed to load
the 3rd instance increases by about 10-15 seconds.
Note: -rc2 and -rc3 had broken NFS, so I had to cherry-pick 3 NFS commits
from -rc4 on top of the commits I wanted to test.
WITH congestion_wait CHANGE REVERTED
====================================
I've done quite a few tests of 2.6.31 with 373c0a7e and 8aa7e847 reverted
to confirm that's really the culprit. I've done this for .31-rc3, .31-rc4,
.31-rc5, .31 and .31.1.
In all cases the huge freeze in phase 2) is gone, and the general behavior
and timings are again as they were after the wireless change. During most
tests I did not get any SKB allocation errors during phase 2) or phase 3).
However, with .31-rc5, .31 and .31.1 I have had some tests where I would see
a few SKB allocation errors during phase 3) (which is somewhat likely),
but also during phase 2). At this point I'm unsure whether this is just
noise or a minor influence from some change merged after .31-rc4.
Looking through the commits, there are several mm/page allocation changes.
For now I suggest ignoring this, though, as the impact (if any) is very
minor and it is not reliably reproducible.
Next I'll retest Mel's patches and also test Reinette's patches.
Cheers,
FJP
(Adding Jens to CC.)
On Wednesday 14 October 2009, Frans Pop wrote:
> > > There still has not been a mm-change identified that makes
> > > fragmentation significantly worse.
On Mon, 2009-10-19 at 01:33 +0200, Frans Pop wrote:
> > My bisection shows a very clear point, even if not an individual commit,
> > in the 'akpm' merge where SKB errors suddenly become *much* more
> > frequent and easy to trigger.
> > I'm sorry to say this, but the fact that nothing has been identified yet
> > is IMO the result of a lack of effort, not because there is no such
> > change.
>
> I was wrong. It turns out that I was creating the variations in the test
> results around the akpm merge myself by tiny changes in the way I ran the
> tests. It took another round of about 30 compilations and tests purely in
> this range to show that, but those same tests also made me aware of other
> patterns I should look at.
>
> Until a few days ago I was concentrating on "do I see SKB allocation errors
> or not". Since then I've also been looking more consciously at when they
> happen, at disk access patterns and at desktop freeze patterns.
>
> I think I did mention before that this whole issue is rather subtle :-/
> So, my apologies for fingering the wrong area for so long, but it looked
> solid given the info available at the time.
>
> On Thursday 15 October 2009, Mel Gorman wrote:
> > Outside the range of commits suspected of causing problems was the
> > following. It's extremely low probability
> >
> > Commit 8aa7e84 Fix congestion_wait() sync/async vs read/write confusion
> > This patch alters the call to congestion_wait() in the page
> > allocator. Frankly, I don't get the change but it might be worth
> > checking if replacing BLK_RW_ASYNC with WRITE on top of 2.6.31
> > makes any difference
>
> This is the real culprit. Mel: thanks very much for looking beyond the
> area I identified. Your overview of mm changes was exactly what I needed
> and really helped a lot during my later tests.
>
> This commit definitely causes most of the problems; confirmed by reverting
> it on top of 2.6.31 (also requires reverting 373c0a7e, which is a later
> build fix).
Mel/Jens, any ideas why commit 8aa7e84 makes us run out of high order
pages? Should we be using BLK_RW_SYNC in mm/page_alloc.c instead of
BLK_RW_ASYNC?
Pekka
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bf72055..fa8380a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1727,7 +1727,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
preferred_zone, migratetype);
if (!page && gfp_mask & __GFP_NOFAIL)
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(BLK_RW_SYNC, HZ/50);
} while (!page && (gfp_mask & __GFP_NOFAIL));
return page;
@@ -1898,7 +1898,7 @@ rebalance:
pages_reclaimed += did_some_progress;
if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
/* Wait for some write requests to complete then retry */
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(BLK_RW_SYNC, HZ/50);
goto rebalance;
}
On Monday 19 October 2009, Pekka Enberg wrote:
> On Wednesday 14 October 2009, Frans Pop wrote:
> > On Thursday 15 October 2009, Mel Gorman wrote:
> > > Outside the range of commits suspected of causing problems was the
> > > following. It's extremely low probability
> > >
> > > Commit 8aa7e84 Fix congestion_wait() sync/async vs read/write
> > > confusion This patch alters the call to congestion_wait() in the
> > > page allocator. Frankly, I don't get the change but it might be worth
> > > checking if replacing BLK_RW_ASYNC with WRITE on top of 2.6.31 makes
> > > any difference
> >
> > This is the real culprit. Mel: thanks very much for looking beyond the
> > area I identified. Your overview of mm changes was exactly what I
> > needed and really helped a lot during my later tests.
> >
> > This commit definitely causes most of the problems; confirmed by
> > reverting it on top of 2.6.31 (also requires reverting 373c0a7e, which
> > is a later build fix).
>
> Mel/Jens, any ideas why commit 8aa7e84 makes us run out of high order
> pages? Should we be using BLK_RW_SYNC in mm/page_alloc.c instead of
> BLK_RW_ASYNC?
I'm starting to think that this commit may not be directly related to high
order allocation failures. The fact that I'm seeing SKB allocation
failures earlier because of this commit could be just a side effect.
It could be that instead the main impact of this commit is on encrypted
file system and/or encrypted swap (kcryptd).
Besides mm the commit also touches dm-crypt (and nfs/write.c, but as I'm
only reading from NFS that's unlikely).
Reason for thinking this is that reverting it makes no difference for Karol
[1]. It will be interesting to see if it does make a difference for Sven
Geggus [2].
/me wonders if we'll ever get to the bottom of this...
[1] http://lkml.org/lkml/2009/10/18/138
[2] http://lkml.org/lkml/2009/10/17/113
On Mon, Oct 19 2009, Pekka Enberg wrote:
> (Adding Jens to CC.)
>
> On Wednesday 14 October 2009, Frans Pop wrote:
> > > > There still has not been a mm-change identified that makes
> > > > fragmentation significantly worse.
>
> On Mon, 2009-10-19 at 01:33 +0200, Frans Pop wrote:
> > > My bisection shows a very clear point, even if not an individual commit,
> > > in the 'akpm' merge where SKB errors suddenly become *much* more
> > > frequent and easy to trigger.
> > > I'm sorry to say this, but the fact that nothing has been identified yet
> > > is IMO the result of a lack of effort, not because there is no such
> > > change.
> >
> > I was wrong. It turns out that I was creating the variations in the test
> > results around the akpm merge myself by tiny changes in the way I ran the
> > tests. It took another round of about 30 compilations and tests purely in
> > this range to show that, but those same tests also made me aware of other
> > patterns I should look at.
> >
> > Until a few days ago I was concentrating on "do I see SKB allocation errors
> > or not". Since then I've also been looking more consciously at when they
> > happen, at disk access patterns and at desktop freeze patterns.
> >
> > I think I did mention before that this whole issue is rather subtle :-/
> > So, my apologies for fingering the wrong area for so long, but it looked
> > solid given the info available at the time.
> >
> > On Thursday 15 October 2009, Mel Gorman wrote:
> > > Outside the range of commits suspected of causing problems was the
> > > following. It's extremely low probability
> > >
> > > Commit 8aa7e84 Fix congestion_wait() sync/async vs read/write confusion
> > > This patch alters the call to congestion_wait() in the page
> > > allocator. Frankly, I don't get the change but it might be worth
> > > checking if replacing BLK_RW_ASYNC with WRITE on top of 2.6.31
> > > makes any difference
> >
> > This is the real culprit. Mel: thanks very much for looking beyond the
> > area I identified. Your overview of mm changes was exactly what I needed
> > and really helped a lot during my later tests.
> >
> > This commit definitely causes most of the problems; confirmed by reverting
> > it on top of 2.6.31 (also requires reverting 373c0a7e, which is a later
> > build fix).
>
> Mel/Jens, any ideas why commit 8aa7e84 makes us run out of high order
> pages? Should we be using BLK_RW_SYNC in mm/page_alloc.c instead of
> BLK_RW_ASYNC?
No, I think that is definitely broken since the page freeing should be
using async writes. If the commit in question is making the difference
and the below does indeed fix it, I think that's primarily due to timing
issues and the general brokenness of the congestion bits. With the below
change, you are essentially guaranteed to be waiting 20ms every time and
it's quite likely that that is enough to change the picture.
So I'd look elsewhere for the real problem; it's not likely to be caused
by the sync vs async bits themselves.
--
Jens Axboe
Today Frans Pop wrote:
>
> I'm starting to think that this commit may not be directly related to high
> order allocation failures. The fact that I'm seeing SKB allocation
> failures earlier because of this commit could be just a side effect.
> It could be that instead the main impact of this commit is on encrypted
> file system and/or encrypted swap (kcryptd).
>
> Besides mm the commit also touches dm-crypt (and nfs/write.c, but as I'm
> only reading from NFS that's unlikely).
I have updated a fileserver to 2.6.31 today and I see page
allocation failures from several parts of the system ... mostly nfs though ... (it is an NFS server).
So I guess the problem must be quite generic:
Oct 19 07:10:02 johan kernel: [23565.684110] swapper: page allocation failure. order:5, mode:0x4020 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684118] Pid: 0, comm: swapper Not tainted 2.6.31-02063104-generic #02063104 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684121] Call Trace: [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684124] <IRQ> [<ffffffff810da5a2>] __alloc_pages_slowpath+0x3b2/0x4c0 [kern.warning]
Oct 19 08:59:16 johan kernel: [30120.685647] __ratelimit: 13 callbacks suppressed [kern.warning]
Oct 19 08:59:16 johan kernel: [30120.685654] nfsd: page allocation failure. order:5, mode:0x4020 [kern.warning]
Oct 19 08:59:16 johan kernel: [30120.685660] Pid: 6071, comm: nfsd Not tainted 2.6.31-02063104-generic #02063104 [kern.warning]
Oct 19 08:59:16 johan kernel: [30120.685663] Call Trace: [kern.warning]
Oct 19 08:59:16 johan kernel: [30120.685666] <IRQ> [<ffffffff810da5a2>] __alloc_pages_slowpath+0x3b2/0x4c0 [kern.warning]
Oct 19 09:36:31 johan kernel: [32355.708345] __ratelimit: 16 callbacks suppressed [kern.warning]
Oct 19 09:36:31 johan kernel: [32355.708352] nfsd: page allocation failure. order:5, mode:0x4020 [kern.warning]
Oct 19 09:36:31 johan kernel: [32355.708358] Pid: 6087, comm: nfsd Not tainted 2.6.31-02063104-generic #02063104 [kern.warning]
Oct 19 09:36:31 johan kernel: [32355.708361] Call Trace: [kern.warning]
Oct 19 09:36:31 johan kernel: [32355.708364] <IRQ> [<ffffffff810da5a2>] __alloc_pages_slowpath+0x3b2/0x4c0 [kern.warning]
Oct 19 10:52:01 johan kernel: [36885.358312] __ratelimit: 31 callbacks suppressed [kern.warning]
Oct 19 10:52:01 johan kernel: [36885.358319] nfsd: page allocation failure. order:5, mode:0x4020 [kern.warning]
Oct 19 10:52:01 johan kernel: [36885.358325] Pid: 6057, comm: nfsd Not tainted 2.6.31-02063104-generic #02063104 [kern.warning]
Oct 19 10:52:01 johan kernel: [36885.358327] Call Trace: [kern.warning]
Oct 19 10:52:01 johan kernel: [36885.358331] <IRQ> [<ffffffff810da5a2>] __alloc_pages_slowpath+0x3b2/0x4c0 [kern.warning]
Oct 19 11:12:01 johan kernel: [38085.163831] events/3: page allocation failure. order:5, mode:0x4020 [kern.warning]
Oct 19 11:12:01 johan kernel: [38085.163840] Pid: 18, comm: events/3 Not tainted 2.6.31-02063104-generic #02063104 [kern.warning]
Oct 19 11:12:01 johan kernel: [38085.163843] Call Trace: [kern.warning]
Oct 19 11:12:01 johan kernel: [38085.163846] <IRQ> [<ffffffff810da5a2>] __alloc_pages_slowpath+0x3b2/0x4c0 [kern.warning]
--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900
On Mon, 2009-10-19 at 11:49 +0200, Tobi Oetiker wrote:
> Today Frans Pop wrote:
>
> >
> > I'm starting to think that this commit may not be directly related to high
> > order allocation failures. The fact that I'm seeing SKB allocation
> > failures earlier because of this commit could be just a side effect.
> > It could be that instead the main impact of this commit is on encrypted
> > file system and/or encrypted swap (kcryptd).
> >
> > Besides mm the commit also touches dm-crypt (and nfs/write.c, but as I'm
> > only reading from NFS that's unlikely).
>
> I have updated a fileserver to 2.6.31 today and I see page
> allocation failures from several parts of the system ... mostly nfs though ... (it is an NFS server).
> So I guess the problem must be quite generic:
Yup, it almost certainly is. Does this patch help?
http://lkml.org/lkml/2009/10/16/89
Frans, did you ever get around to retesting with just the above patch
applied?
Pekka
On Mon, Oct 19, 2009 at 11:49:08AM +0200, Tobi Oetiker wrote:
> Today Frans Pop wrote:
>
> >
> > I'm starting to think that this commit may not be directly related to high
> > order allocation failures. The fact that I'm seeing SKB allocation
> > failures earlier because of this commit could be just a side effect.
> > It could be that instead the main impact of this commit is on encrypted
> > file system and/or encrypted swap (kcryptd).
> >
> > Besides mm the commit also touches dm-crypt (and nfs/write.c, but as I'm
> > only reading from NFS that's unlikely).
>
> I have updated a fileserver to 2.6.31 today and I see page
> allocation failures from several parts of the system ... mostly nfs though ... (it is an NFS server).
> So I guess the problem must be quite generic:
>
>
> Oct 19 07:10:02 johan kernel: [23565.684110] swapper: page allocation failure. order:5, mode:0x4020 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684118] Pid: 0, comm: swapper Not tainted 2.6.31-02063104-generic #02063104 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684121] Call Trace: [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684124] <IRQ> [<ffffffff810da5a2>] __alloc_pages_slowpath+0x3b2/0x4c0 [kern.warning]
>
What's the rest of the stack trace? I'm wondering where a large number
of order-5 GFP_ATOMIC allocations are coming from. It seems different to
the e100 problem where there is one GFP_ATOMIC allocation while the
firmware is being loaded.
Thanks
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Hi Mel,
Today Mel Gorman wrote:
> On Mon, Oct 19, 2009 at 11:49:08AM +0200, Tobi Oetiker wrote:
> > Today Frans Pop wrote:
> >
> > >
> > > I'm starting to think that this commit may not be directly related to high
> > > order allocation failures. The fact that I'm seeing SKB allocation
> > > failures earlier because of this commit could be just a side effect.
> > > It could be that instead the main impact of this commit is on encrypted
> > > file system and/or encrypted swap (kcryptd).
> > >
> > > Besides mm the commit also touches dm-crypt (and nfs/write.c, but as I'm
> > > only reading from NFS that's unlikely).
> >
> > I have updated a fileserver to 2.6.31 today and I see page
> > allocation failures from several parts of the system ... mostly nfs though ... (it is an NFS server).
> > So I guess the problem must be quite generic:
> >
> >
> > Oct 19 07:10:02 johan kernel: [23565.684110] swapper: page allocation failure. order:5, mode:0x4020 [kern.warning]
> > Oct 19 07:10:02 johan kernel: [23565.684118] Pid: 0, comm: swapper Not tainted 2.6.31-02063104-generic #02063104 [kern.warning]
> > Oct 19 07:10:02 johan kernel: [23565.684121] Call Trace: [kern.warning]
> > Oct 19 07:10:02 johan kernel: [23565.684124] <IRQ> [<ffffffff810da5a2>] __alloc_pages_slowpath+0x3b2/0x4c0 [kern.warning]
> >
>
> What's the rest of the stack trace? I'm wondering where a large number
> of order-5 GFP_ATOMIC allocations are coming from. It seems different to
> the e100 problem where there is one GFP_ATOMIC allocation while the
> firmware is being loaded.
Oct 19 07:10:02 johan kernel: [23565.684110] swapper: page allocation failure. order:5, mode:0x4020 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684118] Pid: 0, comm: swapper Not tainted 2.6.31-02063104-generic #02063104 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684121] Call Trace: [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684124] <IRQ> [<ffffffff810da5a2>] __alloc_pages_slowpath+0x3b2/0x4c0 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684157] [<ffffffff810da7e5>] __alloc_pages_nodemask+0x135/0x140 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684164] [<ffffffff815065b4>] ? _spin_unlock_bh+0x14/0x20 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684170] [<ffffffff8110b368>] kmalloc_large_node+0x68/0xc0 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684175] [<ffffffff8110f15a>] __kmalloc_node_track_caller+0x11a/0x180 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684181] [<ffffffff8140ffd2>] ? skb_copy+0x32/0xa0 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684185] [<ffffffff8140d8b6>] __alloc_skb+0x76/0x180 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684205] [<ffffffff8140ffd2>] skb_copy+0x32/0xa0 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684221] [<ffffffffa050f33c>] vboxNetFltLinuxPacketHandler+0x5c/0xd0 [vboxnetflt] [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684227] [<ffffffff81416a6d>] dev_queue_xmit_nit+0x10d/0x170 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684231] [<ffffffff81416f79>] dev_hard_start_xmit+0x189/0x1c0 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684236] [<ffffffff8142f071>] __qdisc_run+0x1a1/0x230 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684240] [<ffffffff81418a88>] dev_queue_xmit+0x238/0x310 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684246] [<ffffffff8144864b>] ip_finish_output+0x11b/0x2f0 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684250] [<ffffffff814488a9>] ip_output+0x89/0xd0 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684254] [<ffffffff814478c0>] ip_local_out+0x20/0x30 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684258] [<ffffffff814481ab>] ip_queue_xmit+0x22b/0x3f0 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684264] [<ffffffff8145d5e5>] tcp_transmit_skb+0x345/0x4e0 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684269] [<ffffffff8145eaf6>] tcp_write_xmit+0xb6/0x2e0 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684273] [<ffffffff8145ed8b>] __tcp_push_pending_frames+0x2b/0xa0 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684277] [<ffffffff8145b249>] tcp_rcv_established+0x459/0x6d0 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684282] [<ffffffff814630bd>] tcp_v4_do_rcv+0x12d/0x140 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684285] [<ffffffff8146365e>] tcp_v4_rcv+0x58e/0x7c0 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684289] [<ffffffff8144276d>] ip_local_deliver_finish+0x11d/0x2b0 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684293] [<ffffffff8144293b>] ip_local_deliver+0x3b/0x90 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684297] [<ffffffff81442ad6>] ip_rcv_finish+0x146/0x420 [kern.warning]
Oct 19 07:10:02 johan kernel: [23565.684301] [<ffffffff8144304b>] ip_rcv+0x29b/0x370 [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684304] [<ffffffff81418f9a>] netif_receive_skb+0x38a/0x4d0 [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684308] [<ffffffff81419268>] napi_skb_finish+0x48/0x60 [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684311] [<ffffffff81419724>] napi_gro_receive+0x34/0x40 [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684330] [<ffffffffa006b623>] tg3_rx+0x373/0x4b0 [tg3] [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684339] [<ffffffffa006cbf0>] tg3_poll_work+0x70/0xf0 [tg3] [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684347] [<ffffffffa006ccae>] tg3_poll+0x3e/0xe0 [tg3] [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684350] [<ffffffff814198d2>] net_rx_action+0x102/0x210 [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684357] [<ffffffff81061d24>] __do_softirq+0xc4/0x1f0 [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684362] [<ffffffff8101314c>] call_softirq+0x1c/0x30 [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684365] [<ffffffff81014945>] do_softirq+0x55/0x90 [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684369] [<ffffffff8106116b>] irq_exit+0x7b/0x90 [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684372] [<ffffffff81013e93>] do_IRQ+0x73/0xe0 [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684378] [<ffffffff81012993>] ret_from_intr+0x0/0x11 [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684381] <EOI> [<ffffffff810318b6>] ? native_safe_halt+0x6/0x10 [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684391] [<ffffffff81019cd8>] ? default_idle+0x48/0xe0 [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684396] [<ffffffff8150929d>] ? __atomic_notifier_call_chain+0xd/0x10 [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684400] [<ffffffff815092b1>] ? atomic_notifier_call_chain+0x11/0x20 [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684404] [<ffffffff810107c8>] ? cpu_idle+0x98/0xe0 [kern.warning]
Oct 19 07:10:04 johan kernel: [23565.684410] [<ffffffff81500d95>] ? start_secondary+0x95/0xc0 [kern.warning]
if you need more, I can send you a whole bunch of them ...
cheers
tobi
--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900
On Mon, Oct 19, 2009 at 12:54:11PM +0300, Pekka Enberg wrote:
> On Mon, 2009-10-19 at 11:49 +0200, Tobi Oetiker wrote:
> > I have updated a fileserver to 2.6.31 today and I see page
> > allocation failures from several parts of the system ... mostly nfs though ... (it is an NFS server).
> > So I guess the problem must be quite generic:
>
> Yup, it almost certainly is. Does this patch help?
>
> http://lkml.org/lkml/2009/10/16/89
This patch seems to help in some cases. Before applying this patch I
was able to trigger alloc failures on a different machine by booting the
kernel with "mem=256MB" and doing:
$ gitk on-full-tree &
# rmmod e100
... wait for a few MBs in swap
# modprobe e100; ifup --force ethX
So here this patch helped -- with it, I was unable to trigger page
allocation failures (testing was short, though). However, as I said
here[1], I applied both of Mel's patches (including this one) and that
didn't help my original issue (failures after suspend).
[1] http://lkml.org/lkml/2009/10/17/109
Thanks.
On Mon, Oct 19, 2009 at 01:33:29AM +0200, Frans Pop wrote:
> Another long mail, sorry.
>
> On Wednesday 14 October 2009, Frans Pop wrote:
> > > There still has not been a mm-change identified that makes
> > > fragmentation significantly worse.
> >
> > My bisection shows a very clear point, even if not an individual commit,
> > in the 'akpm' merge where SKB errors suddenly become *much* more
> > frequent and easy to trigger.
> > I'm sorry to say this, but the fact that nothing has been identified yet
> > is IMO the result of a lack of effort, not because there is no such
> > change.
>
> I was wrong. It turns out that I was creating the variations in the test
> results around the akpm merge myself by tiny changes in the way I ran the
> tests. It took another round of about 30 compilations and tests purely in
> this range to show that, but those same tests also made me aware of other
> patterns I should look at.
>
Once again, thanks for persisting with this for so long. That much testing
and searching is a miserable undertaking.
> Until a few days ago I was concentrating on "do I see SKB allocation errors
> or not". Since then I've also been looking more consciously at when they
> happen, at disk access patterns and at desktop freeze patterns.
>
> I think I did mention before that this whole issue is rather subtle :-/
Indeed
> So, my apologies for fingering the wrong area for so long, but it looked
> solid given the info available at the time.
>
> On Thursday 15 October 2009, Mel Gorman wrote:
> > Outside the range of commits suspected of causing problems was the
> > following. It's extremely low probability
> >
> > Commit 8aa7e84 Fix congestion_wait() sync/async vs read/write confusion
> > This patch alters the call to congestion_wait() in the page
> > allocator. Frankly, I don't get the change but it might be worth
> > checking if replacing BLK_RW_ASYNC with WRITE on top of 2.6.31
> > makes any difference
>
> This is the real culprit. Mel: thanks very much for looking beyond the
> area I identified. Your overview of mm changes was exactly what I needed
> and really helped a lot during my later tests.
>
I'm surprised this made such a big difference which is why I described
it as "extremely low probability". It implies that the real problem isn't
fragmentation per-se but the timing of when pages get consumed.
Maybe what has really changed is how long direct reclaimers wait before trying
to allocate again. After the commit, if direct reclaimers are waiting longer
between direct reclaim attempts, it might mean that the GFP_KERNEL reclaimers
of high-order pages are doing less work before and hurting parallel GFP_ATOMIC
users. Jens, does this sound plausible?
> This commit definitely causes most of the problems; confirmed by reverting
> it on top of 2.6.31 (also requires reverting 373c0a7e, which is a later
> build fix).
>
> The rest of this mail gives details on my tests and how I reached the above
> conclusion.
>
> TEST BASELINE (2.6.30)
> ======================
> I mentioned in an earlier mail that I run three instances of gitk for my
> tests. Loading gitk seems to consist of 3 phases:
> 1) general initial scan of the repository (branches?)
> 2) reading commits: commit counter increases
> 3) reading references (including bisection good/bad points) and
> uncommitted changes
>
> Below are times and comments per stage when the test is run with 2.6.30. As my
> test starts after a clean boot, buffers are mostly empty.
>
> 1st instance: 'gitk v2.6.29..master' (preparation)
> 1) ~20 seconds; user interface is mostly blank
> 2) ~5 seconds to read 35.000 commits; user interface is updated and counter
> increases steadily as they are read
> 3) ~10 seconds; "branch"/"follows"/"precedes" info and tags are filled
> in; fairly heavy disk activity
>
> 2nd instance: 'gitk master' (preparation)
> 1) 0 seconds (because data is already buffered)
> 2) ~25 seconds to read 167500 commits; counter increases steadily
> 3) 1-2 seconds (because data is already buffered)
>
> 3rd instance: 'gitk master' (the actual test)
> 1) 0 seconds because data is already buffered
> 2) ~55 seconds due to swapping overhead; minor music skip around commit
> 110.000; counter slower after 90.000, some short halts, but generally
> increases steadily; moderate disk activity
> 3) ~55-60 seconds; because buffers have been emptied data must be read
> again, with swapping; very heavy disk activity; fairly long music
> skip (15-20 seconds), but no SKB allocation errors
>
> So, the loading of the 3rd instance takes 1.5 minutes longer than the
> second because of the swapping. And phase 3) is most affected by it.
>
> AFTER WIRELESS CHANGE
> =====================
> After commit 4752c93c30 ("iwlcore: Allow skb allocation from tasklet") I
> start getting the SKB errors. They can be triggered reliably if the whole
> test is repeated 1 or 2 times, but generally not the first time the test
> is run.
It's up to the wireless driver maintainer what to do here, but it seems
like that patch needs to be reverted and thought about some more before
trying again.
>
> Or so I thought for a long time.
> It turns out that I will get SKB errors during the first run if I'm
> "sloppy" in the test execution. For example if I wait too long before
> switching from the last gitk instance to konsole where I have
> a 'tail -f /var/log/kern.log' running.
So the timing of when the high-order atomic allocations start kicking
in is critical.
> Another factor is the state of the repository: do I have master checked
> out, or an older branch, or am I in the middle of a bisection. This
> influences how data is read from the disk and thus the test results.
> A last factor may be the size of the kernel I'm using: my test/bisect
> kernel is significantly smaller than my regular kernel.
>
> If the test is run completely cleanly, I will not get SKB errors during the
> first run. Also, this change does not affect the timings of the test at
> all: the total load time of the 3rd instance is still ~1:55 and music
> skips happen in roughly the same places. The pattern of disk activity also
> remains unchanged.
>
> If I do *not* run the test cleanly, any SKB errors during the first test
> run will always be during phase 3), never during phase 2). This is what I
> saw during tests in the 'akpm' range, and explains the inconsistent
> results there.
>
> After discovering this I've made a copy of the git repo so that I always
> test using the exact same state and tightened my test procedure.
>
> AFTER congestion_wait CHANGE
> ============================
> If I test commit 9f2d8be, which is just before the congestion_wait()
> change, I still get the same pattern as described above. But when I test
> with 8aa7e84 ("Fix congestion_wait() sync/async vs read/write confusion"),
> things change dramatically when the 3rd gitk instance is started.
>
So, assuming this is a timing problem, this commit affects the timing of
when pages are consumed by processes doing direct reclaim.
> During the 2nd phase I see the first SKB allocation errors with a music
> skip between reading commits 95.000 and 110.000.
> Around commit 115.000 there is a very long pause during which the counter
> does not increase, music stops and the desktop freezes completely. During
> the first 30 seconds of that freeze there is only very low disk activity
> (which seems strange);
I'm just going to have to depend on Jens here. Jens, after the commit
congestion_wait() waits on BLK_RW_ASYNC. Reclaim usually writes pages
asynchronously, but lumpy reclaim actually waits for pages to be written
out synchronously, so it's not always async.
Either way, reclaim is usually worried about writing pages but it would appear
after this change that a lot of read activity can also stall a process in
direct reclaim. What might be happening in Frans's particular case is that the
tasklet that allocates high-order pages for the RX buffers is getting stalled
by congestion caused by other processes doing reads from the filesystem.
While it makes sense from a congestion point of view to halt the IO, the
reclaim operations from direct reclaimers are getting delayed for long
enough to cause problems for GFP_ATOMIC allocations.
Does this sound plausible to you? If so, what's the best way of
addressing this? Changing congestion_wait back to WRITE (assuming that
works for Frans)? Changing it to SYNC (again, assuming it actually
works) or a revert?
> during the next 25 seconds there is suddenly very high disk
> activity during which things gradually unfreeze and more SKB errors are
> displayed. After that the commit counter runs up fairly steadily again.
>
> Phase 2) ends at ~1:45. Phase 3) (with more SKB errors) ends at ~2:05.
>
> So this change almost doubles the time needed for phase 2) and causes SKB
> allocation errors to occur during that phase. Also, before this commit the
> desktop freezes are much shorter and less severe. With this change the
> desktop is completely unusable for almost a minute during phase 2), with
> even the mouse pointer frozen solid.
> Note that phase 3) becomes shorter, but that the total time needed to load
> the 3rd instance increases by about 10-15 seconds.
>
> Note: -rc2 and -rc3 had broken NFS, so I had to cherry-pick 3 NFS commits
> from -rc4 on top of the commits I wanted to test.
>
> WITH congestion_wait CHANGE REVERTED
> ====================================
> I've done quite a few tests of 2.6.31 with 373c0a7e and 8aa7e847 reverted
> to confirm that's really the culprit. I've done this for .31-rc3, .31-rc4,
> .31-rc5, .31 and .31.1.
>
> In all cases the huge freeze in phase 2) is gone and the general behavior
> and timings are again as they were after the wireless change. During most
> tests I did not get any SKB allocation errors during phase 2) or phase 3).
>
> However with .31-rc5, .31 and .31.1 I have had some tests where I would see
> a few SKB allocation errors during phase 3) (which is somewhat likely),
> but also during phase 2). At this point I'm unsure whether this is just
> noise, or maybe a minor influence from some change merged after .31-rc4.
> Looking through the commits there are several mm/page allocation changes.
>
It could still be kswapd not being woken up often enough after direct
reclaimers. I took a look through the commits but none of the mm or
allocator changes struck me as likely candidates for making
fragmentation worse or altering the timing.
> For now I suggest ignoring this though, as the impact (if any) is very minor
> and it cannot be reproduced reliably enough.
>
> Next I'll retest Mel's patches and also test Reinette's patches.
>
Of the two patches, only the kswapd one should have any significance. As
David pointed out, the second patch is essentially a no-op as it should
not have been possible to enter direct reclaim with ALLOC_NO_WATERMARKS
set.
--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab
On Mon, Oct 19, 2009 at 04:01:45PM +0200, Karol Lewandowski wrote:
> On Mon, Oct 19, 2009 at 12:54:11PM +0300, Pekka Enberg wrote:
> > On Mon, 2009-10-19 at 11:49 +0200, Tobi Oetiker wrote:
> > > I have updated a fileserver to 2.6.31 today and I see page
> > > allocation failures from several parts of the system ... mostly nfs though ... (it is a nfs server).
> > > So I guess the problem must be quite generic:
> >
> > Yup, it almost certainly is. Does this patch help?
> >
> > http://lkml.org/lkml/2009/10/16/89
>
> This patch seems to help in some cases. Before applying this patch I
> was able to trigger alloc failures on different machine by booting
> kernel with "mem=256MB" and doing:
>
> $ gitk on-full-tree &
> # rmmod e100
> ... wait for few MBs in swap
> # modprobe e100; ifup --force ethX
>
> So here this patch helped -- with it, I was unable to trigger page
> allocation failures (testing was short, though). However, as I said
> here[1], I applied both of Mel's patches (including this one) and that
> didn't help my original issue (failures after suspend).
>
> [1] http://lkml.org/lkml/2009/10/17/109
>
Can you test with my kswapd patch applied and commits 373c0a7e and
8aa7e847 reverted please?
--
Mel Gorman
On Mon, Oct 19, 2009 at 03:40:05PM +0200, Tobias Oetiker wrote:
> Hi Mel,
>
> Today Mel Gorman wrote:
>
> > On Mon, Oct 19, 2009 at 11:49:08AM +0200, Tobi Oetiker wrote:
> > > Today Frans Pop wrote:
> > >
> > > >
> > > > I'm starting to think that this commit may not be directly related to high
> > > > order allocation failures. The fact that I'm seeing SKB allocation
> > > > failures earlier because of this commit could be just a side effect.
> > > > It could be that instead the main impact of this commit is on encrypted
> > > > file system and/or encrypted swap (kcryptd).
> > > >
> > > > Besides mm the commit also touches dm-crypt (and nfs/write.c, but as I'm
> > > > only reading from NFS that's unlikely).
> > >
> > > I have updated a fileserver to 2.6.31 today and I see page
> > > allocation failures from several parts of the system ... mostly nfs though ... (it is a nfs server).
> > > So I guess the problem must be quite generic:
> > >
> > >
> > > Oct 19 07:10:02 johan kernel: [23565.684110] swapper: page allocation failure. order:5, mode:0x4020 [kern.warning]
> > > Oct 19 07:10:02 johan kernel: [23565.684118] Pid: 0, comm: swapper Not tainted 2.6.31-02063104-generic #02063104 [kern.warning]
> > > Oct 19 07:10:02 johan kernel: [23565.684121] Call Trace: [kern.warning]
> > > Oct 19 07:10:02 johan kernel: [23565.684124] <IRQ> [<ffffffff810da5a2>] __alloc_pages_slowpath+0x3b2/0x4c0 [kern.warning]
> > >
> >
> > What's the rest of the stack trace? I'm wondering where a large number
> > of order-5 GFP_ATOMIC allocations are coming from. It seems different to
> > the e100 problem where there is one GFP_ATOMIC allocation while the
> > firmware is being loaded.
>
> Oct 19 07:10:02 johan kernel: [23565.684110] swapper: page allocation failure. order:5, mode:0x4020 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684118] Pid: 0, comm: swapper Not tainted 2.6.31-02063104-generic #02063104 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684121] Call Trace: [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684124] <IRQ> [<ffffffff810da5a2>] __alloc_pages_slowpath+0x3b2/0x4c0 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684157] [<ffffffff810da7e5>] __alloc_pages_nodemask+0x135/0x140 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684164] [<ffffffff815065b4>] ? _spin_unlock_bh+0x14/0x20 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684170] [<ffffffff8110b368>] kmalloc_large_node+0x68/0xc0 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684175] [<ffffffff8110f15a>] __kmalloc_node_track_caller+0x11a/0x180 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684181] [<ffffffff8140ffd2>] ? skb_copy+0x32/0xa0 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684185] [<ffffffff8140d8b6>] __alloc_skb+0x76/0x180 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684205] [<ffffffff8140ffd2>] skb_copy+0x32/0xa0 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684221] [<ffffffffa050f33c>] vboxNetFltLinuxPacketHandler+0x5c/0xd0 [vboxnetflt] [kern.warning]
Is the MTU set very high between the host and virtualised machine?
Can you test please with the patch at http://lkml.org/lkml/2009/10/16/89
applied and with commits 373c0a7e and 8aa7e847 reverted please?
> Oct 19 07:10:02 johan kernel: [23565.684231] [<ffffffff81416f79>] dev_hard_start_xmit+0x189/0x1c0 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684236] [<ffffffff8142f071>] __qdisc_run+0x1a1/0x230 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684240] [<ffffffff81418a88>] dev_queue_xmit+0x238/0x310 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684246] [<ffffffff8144864b>] ip_finish_output+0x11b/0x2f0 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684250] [<ffffffff814488a9>] ip_output+0x89/0xd0 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684254] [<ffffffff814478c0>] ip_local_out+0x20/0x30 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684258] [<ffffffff814481ab>] ip_queue_xmit+0x22b/0x3f0 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684264] [<ffffffff8145d5e5>] tcp_transmit_skb+0x345/0x4e0 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684269] [<ffffffff8145eaf6>] tcp_write_xmit+0xb6/0x2e0 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684273] [<ffffffff8145ed8b>] __tcp_push_pending_frames+0x2b/0xa0 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684277] [<ffffffff8145b249>] tcp_rcv_established+0x459/0x6d0 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684282] [<ffffffff814630bd>] tcp_v4_do_rcv+0x12d/0x140 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684285] [<ffffffff8146365e>] tcp_v4_rcv+0x58e/0x7c0 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684289] [<ffffffff8144276d>] ip_local_deliver_finish+0x11d/0x2b0 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684293] [<ffffffff8144293b>] ip_local_deliver+0x3b/0x90 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684297] [<ffffffff81442ad6>] ip_rcv_finish+0x146/0x420 [kern.warning]
> Oct 19 07:10:02 johan kernel: [23565.684301] [<ffffffff8144304b>] ip_rcv+0x29b/0x370 [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684304] [<ffffffff81418f9a>] netif_receive_skb+0x38a/0x4d0 [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684308] [<ffffffff81419268>] napi_skb_finish+0x48/0x60 [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684311] [<ffffffff81419724>] napi_gro_receive+0x34/0x40 [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684330] [<ffffffffa006b623>] tg3_rx+0x373/0x4b0 [tg3] [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684339] [<ffffffffa006cbf0>] tg3_poll_work+0x70/0xf0 [tg3] [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684347] [<ffffffffa006ccae>] tg3_poll+0x3e/0xe0 [tg3] [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684350] [<ffffffff814198d2>] net_rx_action+0x102/0x210 [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684357] [<ffffffff81061d24>] __do_softirq+0xc4/0x1f0 [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684362] [<ffffffff8101314c>] call_softirq+0x1c/0x30 [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684365] [<ffffffff81014945>] do_softirq+0x55/0x90 [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684369] [<ffffffff8106116b>] irq_exit+0x7b/0x90 [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684372] [<ffffffff81013e93>] do_IRQ+0x73/0xe0 [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684378] [<ffffffff81012993>] ret_from_intr+0x0/0x11 [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684381] <EOI> [<ffffffff810318b6>] ? native_safe_halt+0x6/0x10 [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684391] [<ffffffff81019cd8>] ? default_idle+0x48/0xe0 [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684396] [<ffffffff8150929d>] ? __atomic_notifier_call_chain+0xd/0x10 [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684400] [<ffffffff815092b1>] ? atomic_notifier_call_chain+0x11/0x20 [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684404] [<ffffffff810107c8>] ? cpu_idle+0x98/0xe0 [kern.warning]
> Oct 19 07:10:04 johan kernel: [23565.684410] [<ffffffff81500d95>] ? start_secondary+0x95/0xc0 [kern.warning]
>
> if you need more, I can send you a whole bunch of them ...
>
I'm assuming they are all more or less the same.
--
Mel Gorman
Hi Mel,
Today Mel Gorman wrote:
> Is the MTU set very high between the host and virtualised machine?
>
> Can you test please with the patch at http://lkml.org/lkml/2009/10/16/89
> applied and with commits 373c0a7e and 8aa7e847 reverted please?
if you can send me a consolidated patch which does apply to
2.6.31.4 I will be glad to try ...
your patch in http://lkml.org/lkml/2009/10/16/89 does not seem to apply
to 2.6.31 ... I assume it would look like this, but then again I don't really
understand the code, so this is just pattern matching ...
--- a/mm/page_alloc.c 2009-10-05 19:12:06.000000000 +0200
+++ b/mm/page_alloc.c 2009-10-19 14:52:15.000000000 +0200
@@ -1763,6 +1763,7 @@
if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
goto nopage;
+restart:
wake_all_kswapd(order, zonelist, high_zoneidx);
/*
@@ -1772,7 +1773,6 @@
*/
alloc_flags = gfp_to_alloc_flags(gfp_mask);
-restart:
/* This is the last chance, in general, before the goto nopage. */
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
cheers
tobi
--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900
On Mon, Oct 19, 2009 at 04:16:36PM +0200, Tobias Oetiker wrote:
> Hi Mel,
>
> if you can send me a consolidated patch which does apply to
> 2.6.31.4 I will be glad to try ...
>
Sure
==== CUT HERE ====
From 6c0215af3b7c39ef7b8083ea38ca3ad93cd3f51f Mon Sep 17 00:00:00 2001
From: Mel Gorman <[email protected]>
Date: Mon, 19 Oct 2009 15:40:43 +0100
Subject: [PATCH] Kick off kswapd after direct reclaim and revert congestion changes
The following patch is http://lkml.org/lkml/2009/10/16/89 applied on top of
2.6.31.4, with commits 373c0a7e and 8aa7e847 reverted.
---
arch/x86/lib/usercopy_32.c | 2 +-
drivers/block/pktcdvd.c | 10 ++++------
drivers/md/dm-crypt.c | 2 +-
fs/fat/file.c | 2 +-
fs/fuse/dev.c | 8 ++++----
fs/nfs/write.c | 8 +++-----
fs/reiserfs/journal.c | 2 +-
fs/xfs/linux-2.6/kmem.c | 4 ++--
fs/xfs/linux-2.6/xfs_buf.c | 2 +-
include/linux/backing-dev.h | 11 +++--------
include/linux/blkdev.h | 13 +++++++++----
mm/backing-dev.c | 7 ++++---
mm/memcontrol.c | 2 +-
mm/page-writeback.c | 8 ++++----
mm/page_alloc.c | 15 ++++++++-------
mm/vmscan.c | 8 ++++----
16 files changed, 51 insertions(+), 53 deletions(-)
diff --git a/arch/x86/lib/usercopy_32.c b/arch/x86/lib/usercopy_32.c
index 1f118d4..7c8ca91 100644
--- a/arch/x86/lib/usercopy_32.c
+++ b/arch/x86/lib/usercopy_32.c
@@ -751,7 +751,7 @@ survive:
if (retval == -ENOMEM && is_global_init(current)) {
up_read(&current->mm->mmap_sem);
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
goto survive;
}
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 99a506f..83650e0 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -1372,10 +1372,8 @@ try_next_bio:
wakeup = (pd->write_congestion_on > 0
&& pd->bio_queue_size <= pd->write_congestion_off);
spin_unlock(&pd->lock);
- if (wakeup) {
- clear_bdi_congested(&pd->disk->queue->backing_dev_info,
- BLK_RW_ASYNC);
- }
+ if (wakeup)
+ clear_bdi_congested(&pd->disk->queue->backing_dev_info, WRITE);
pkt->sleep_time = max(PACKET_WAIT_TIME, 1);
pkt_set_state(pkt, PACKET_WAITING_STATE);
@@ -2594,10 +2592,10 @@ static int pkt_make_request(struct request_queue *q, struct bio *bio)
spin_lock(&pd->lock);
if (pd->write_congestion_on > 0
&& pd->bio_queue_size >= pd->write_congestion_on) {
- set_bdi_congested(&q->backing_dev_info, BLK_RW_ASYNC);
+ set_bdi_congested(&q->backing_dev_info, WRITE);
do {
spin_unlock(&pd->lock);
- congestion_wait(BLK_RW_ASYNC, HZ);
+ congestion_wait(WRITE, HZ);
spin_lock(&pd->lock);
} while(pd->bio_queue_size > pd->write_congestion_off);
}
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index ed10381..c72a8dd 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -776,7 +776,7 @@ static void kcryptd_crypt_write_convert(struct dm_crypt_io *io)
* But don't wait if split was due to the io size restriction
*/
if (unlikely(out_of_pages))
- congestion_wait(BLK_RW_ASYNC, HZ/100);
+ congestion_wait(WRITE, HZ/100);
/*
* With async crypto it is unsafe to share the crypto context
diff --git a/fs/fat/file.c b/fs/fat/file.c
index f042b96..b28ea64 100644
--- a/fs/fat/file.c
+++ b/fs/fat/file.c
@@ -134,7 +134,7 @@ static int fat_file_release(struct inode *inode, struct file *filp)
if ((filp->f_mode & FMODE_WRITE) &&
MSDOS_SB(inode->i_sb)->options.flush) {
fat_flush_inodes(inode->i_sb, inode, NULL);
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
}
return 0;
}
diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 6484eb7..f58ecbc 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -286,8 +286,8 @@ __releases(&fc->lock)
}
if (fc->num_background == FUSE_CONGESTION_THRESHOLD &&
fc->connected && fc->bdi_initialized) {
- clear_bdi_congested(&fc->bdi, BLK_RW_SYNC);
- clear_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
+ clear_bdi_congested(&fc->bdi, READ);
+ clear_bdi_congested(&fc->bdi, WRITE);
}
fc->num_background--;
fc->active_background--;
@@ -414,8 +414,8 @@ static void fuse_request_send_nowait_locked(struct fuse_conn *fc,
fc->blocked = 1;
if (fc->num_background == FUSE_CONGESTION_THRESHOLD &&
fc->bdi_initialized) {
- set_bdi_congested(&fc->bdi, BLK_RW_SYNC);
- set_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
+ set_bdi_congested(&fc->bdi, READ);
+ set_bdi_congested(&fc->bdi, WRITE);
}
list_add_tail(&req->list, &fc->bg_queue);
flush_bg_queue(fc);
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index a34fae2..5693fcd 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -200,10 +200,8 @@ static int nfs_set_page_writeback(struct page *page)
struct nfs_server *nfss = NFS_SERVER(inode);
if (atomic_long_inc_return(&nfss->writeback) >
- NFS_CONGESTION_ON_THRESH) {
- set_bdi_congested(&nfss->backing_dev_info,
- BLK_RW_ASYNC);
- }
+ NFS_CONGESTION_ON_THRESH)
+ set_bdi_congested(&nfss->backing_dev_info, WRITE);
}
return ret;
}
@@ -215,7 +213,7 @@ static void nfs_end_page_writeback(struct page *page)
end_page_writeback(page);
if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
- clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+ clear_bdi_congested(&nfss->backing_dev_info, WRITE);
}
/*
diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
index 9062220..77f5bb7 100644
--- a/fs/reiserfs/journal.c
+++ b/fs/reiserfs/journal.c
@@ -997,7 +997,7 @@ static int reiserfs_async_progress_wait(struct super_block *s)
DEFINE_WAIT(wait);
struct reiserfs_journal *j = SB_JOURNAL(s);
if (atomic_read(&j->j_async_throttle))
- congestion_wait(BLK_RW_ASYNC, HZ / 10);
+ congestion_wait(WRITE, HZ / 10);
return 0;
}
diff --git a/fs/xfs/linux-2.6/kmem.c b/fs/xfs/linux-2.6/kmem.c
index 2d3f90a..1cd3b55 100644
--- a/fs/xfs/linux-2.6/kmem.c
+++ b/fs/xfs/linux-2.6/kmem.c
@@ -53,7 +53,7 @@ kmem_alloc(size_t size, unsigned int __nocast flags)
printk(KERN_ERR "XFS: possible memory allocation "
"deadlock in %s (mode:0x%x)\n",
__func__, lflags);
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
} while (1);
}
@@ -130,7 +130,7 @@ kmem_zone_alloc(kmem_zone_t *zone, unsigned int __nocast flags)
printk(KERN_ERR "XFS: possible memory allocation "
"deadlock in %s (mode:0x%x)\n",
__func__, lflags);
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
} while (1);
}
diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 965df12..178c20c 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -412,7 +412,7 @@ _xfs_buf_lookup_pages(
XFS_STATS_INC(xb_page_retries);
xfsbufd_wakeup(0, gfp_mask);
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
goto retry;
}
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 1d52425..0ec2c59 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -229,14 +229,9 @@ static inline int bdi_rw_congested(struct backing_dev_info *bdi)
(1 << BDI_async_congested));
}
-enum {
- BLK_RW_ASYNC = 0,
- BLK_RW_SYNC = 1,
-};
-
-void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
-void set_bdi_congested(struct backing_dev_info *bdi, int sync);
-long congestion_wait(int sync, long timeout);
+void clear_bdi_congested(struct backing_dev_info *bdi, int rw);
+void set_bdi_congested(struct backing_dev_info *bdi, int rw);
+long congestion_wait(int rw, long timeout);
static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 69103e0..998c8e0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -70,6 +70,11 @@ enum rq_cmd_type_bits {
REQ_TYPE_ATA_PC,
};
+enum {
+ BLK_RW_ASYNC = 0,
+ BLK_RW_SYNC = 1,
+};
+
/*
* For request of type REQ_TYPE_LINUX_BLOCK, rq->cmd[0] is the opcode being
* sent down (similar to how REQ_TYPE_BLOCK_PC means that ->cmd[] holds a
@@ -775,18 +780,18 @@ extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t,
* congested queues, and wake up anyone who was waiting for requests to be
* put back.
*/
-static inline void blk_clear_queue_congested(struct request_queue *q, int sync)
+static inline void blk_clear_queue_congested(struct request_queue *q, int rw)
{
- clear_bdi_congested(&q->backing_dev_info, sync);
+ clear_bdi_congested(&q->backing_dev_info, rw);
}
/*
* A queue has just entered congestion. Flag that in the queue's VM-visible
* state flags and increment the global gounter of congested queues.
*/
-static inline void blk_set_queue_congested(struct request_queue *q, int sync)
+static inline void blk_set_queue_congested(struct request_queue *q, int rw)
{
- set_bdi_congested(&q->backing_dev_info, sync);
+ set_bdi_congested(&q->backing_dev_info, rw);
}
extern void blk_start_queue(struct request_queue *q);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c86edd2..493b468 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -283,6 +283,7 @@ static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};
+
void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
enum bdi_state bit;
@@ -307,18 +308,18 @@ EXPORT_SYMBOL(set_bdi_congested);
/**
* congestion_wait - wait for a backing_dev to become uncongested
- * @sync: SYNC or ASYNC IO
+ * @rw: READ or WRITE
* @timeout: timeout in jiffies
*
* Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit
* write congestion. If no backing_devs are congested then just wait for the
* next write to be completed.
*/
-long congestion_wait(int sync, long timeout)
+long congestion_wait(int rw, long timeout)
{
long ret;
DEFINE_WAIT(wait);
- wait_queue_head_t *wqh = &congestion_wqh[sync];
+ wait_queue_head_t *wqh = &congestion_wqh[rw];
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = io_schedule_timeout(timeout);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fd4529d..834509f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1990,7 +1990,7 @@ try_to_free:
if (!progress) {
nr_retries--;
/* maybe some writeback is necessary */
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
}
}
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 81627eb..7687879 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -575,7 +575,7 @@ static void balance_dirty_pages(struct address_space *mapping)
if (pages_written >= write_chunk)
break; /* We've done our duty */
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
}
if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
@@ -669,7 +669,7 @@ void throttle_vm_writeout(gfp_t gfp_mask)
if (global_page_state(NR_UNSTABLE_NFS) +
global_page_state(NR_WRITEBACK) <= dirty_thresh)
break;
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
/*
* The caller might hold locks which can prevent IO completion
@@ -715,7 +715,7 @@ static void background_writeout(unsigned long _min_pages)
if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
/* Wrote less than expected */
if (wbc.encountered_congestion || wbc.more_io)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
else
break;
}
@@ -787,7 +787,7 @@ static void wb_kupdate(unsigned long arg)
writeback_inodes(&wbc);
if (wbc.nr_to_write > 0) {
if (wbc.encountered_congestion || wbc.more_io)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
else
break; /* All the old data is written */
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0b3c6cb..489a187 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1673,7 +1673,7 @@ __alloc_pages_high_priority(gfp_t gfp_mask, unsigned int order,
preferred_zone, migratetype);
if (!page && gfp_mask & __GFP_NOFAIL)
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
} while (!page && (gfp_mask & __GFP_NOFAIL));
return page;
@@ -1763,16 +1763,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
if (NUMA_BUILD && (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
goto nopage;
- wake_all_kswapd(order, zonelist, high_zoneidx);
-
/*
- * OK, we're below the kswapd watermark and have kicked background
- * reclaim. Now things get more complex, so set up alloc_flags according
- * to how we want to proceed.
+ * OK, we're below the kswapd watermark and now things get more
+ * complex, so set up alloc_flags according to how we want to
+ * proceed.
*/
alloc_flags = gfp_to_alloc_flags(gfp_mask);
restart:
+ /* Kick background reclaim */
+ wake_all_kswapd(order, zonelist, high_zoneidx);
+
/* This is the last chance, in general, before the goto nopage. */
page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
@@ -1844,7 +1845,7 @@ rebalance:
pages_reclaimed += did_some_progress;
if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
/* Wait for some write requests to complete then retry */
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+ congestion_wait(WRITE, HZ/50);
goto rebalance;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 94e86dd..9219beb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1109,7 +1109,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
*/
if (nr_freed < nr_taken && !current_is_kswapd() &&
lumpy_reclaim) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
/*
* The attempt at page out may have made some
@@ -1726,7 +1726,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
/* Take a nap, wait for some writeback to complete */
if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
}
/* top priority shrink_zones still had more to do? don't OOM, then */
if (!sc->all_unreclaimable && scanning_global_lru(sc))
@@ -1965,7 +1965,7 @@ loop_again:
* another pass across the zones.
*/
if (total_scanned && priority < DEF_PRIORITY - 2)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ congestion_wait(WRITE, HZ/10);
/*
* We do this so kswapd doesn't build up large priorities for
@@ -2238,7 +2238,7 @@ unsigned long shrink_all_memory(unsigned long nr_pages)
goto out;
if (sc.nr_scanned && prio < DEF_PRIORITY - 2)
- congestion_wait(BLK_RW_ASYNC, HZ / 10);
+ congestion_wait(WRITE, HZ / 10);
}
}
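The diff above switches every caller from `BLK_RW_ASYNC` back to `WRITE`. `congestion_wait()` uses that argument directly as an index into the two-entry `congestion_wqh[]` array, so the constant only chooses a slot. The userspace sketch below shows what that index means on each side of the change. It assumes the 2.6.31-era values `READ == 0`, `WRITE == 1`, `BLK_RW_ASYNC == 0` and `BLK_RW_SYNC == 1`, which are redefined locally rather than taken from kernel headers. The names `slot_before`, `slot_after` and `unconverted_caller_waits_on_sync()` are hypothetical.

```c
#include <stdbool.h>

/* Local copies of the constants discussed in this thread (assumed values). */
enum { READ = 0, WRITE = 1 };
enum { BLK_RW_ASYNC = 0, BLK_RW_SYNC = 1 };

/* What each slot of the two-entry congestion_wqh[] array stands for
 * before the sync/async conversion... */
static const char *const slot_before[2] = {
    [READ]  = "read congestion",
    [WRITE] = "write congestion",
};

/* ...and after it. */
static const char *const slot_after[2] = {
    [BLK_RW_ASYNC] = "async congestion",
    [BLK_RW_SYNC]  = "sync congestion",
};

/* congestion_wait() indexes congestion_wqh[] with its first argument, so a
 * caller that still passes WRITE under the new numbering lands on the
 * sync slot rather than the async one. */
static bool unconverted_caller_waits_on_sync(void)
{
    return WRITE == BLK_RW_SYNC;
}
```

Under these assumed values, changing only the callers back to `WRITE`, without also restoring the read/write meaning of the slots, would make them sleep on sync congestion.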
On Mon, Oct 19, 2009 at 03:01:52PM +0100, Mel Gorman wrote:
>
> > During the 2nd phase I see the first SKB allocation errors with a music
> > skip between reading commits 95.000 and 110.000.
> > About commit 115.000 there is a very long pause during which the counter
> > does not increase, music stops and the desktop freezes completely. The
> > first 30 seconds of that freeze there is only very low disk activity (which
> > seems strange);
>
> I'm just going to have to depend on Jens here. Jens, the congestion_wait() is
> on BLK_RW_ASYNC after the commit. Reclaim usually writes pages asynchronously
> but lumpy reclaim actually waits on pages to write out synchronously so
> it's not always async.
Waiting doesn't make it synchronous from the elevator point of view ;)
If you're using WB_SYNC_NONE, it's an async write. WB_SYNC_ALL makes it
a sync write. I only see WB_SYNC_NONE in vmscan.c, so we should be
using the async congestion wait. (the exception is xfs which always
does async writes).
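The rule of thumb above can be written as a small userspace model. The enum names mirror `WB_SYNC_NONE` and `WB_SYNC_ALL` from the writeback control structure. `io_class` and `classify_writeback()` are hypothetical, though. In the kernel the decision is made in the filesystem's writepage path when it picks the request flags, not in a single helper like this one.

```c
#include <stdbool.h>

/* Mirrors the two writeback modes named in the thread. */
enum wb_sync_mode { WB_SYNC_NONE, WB_SYNC_ALL };

/* Hypothetical classification of a write as the elevator sees it. */
enum io_class { IO_ASYNC = 0, IO_SYNC = 1 };

/* Only the writeback mode decides the class. Whether the submitter later
 * blocks (for example in wait_on_page_writeback() during lumpy reclaim)
 * is deliberately ignored: waiting afterwards does not make the write sync. */
static enum io_class classify_writeback(enum wb_sync_mode mode, bool caller_waits)
{
    (void)caller_waits;
    return mode == WB_SYNC_ALL ? IO_SYNC : IO_ASYNC;
}
```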
But I'm honestly not 100% sure. Looking back through the emails, the
test case is doing IO on top of a whole lot of things on top of
dm-crypt? I just tried to figure out if dm-crypt is turning the async
IO into sync IOs, but didn't quite make sense of it.
Could you also please include which filesystems were being abused during
the test and how? Reading through the emails, I think you've got:
gitk being run 3 times on some FS (NFS?)
streaming reads on NFS
swap on dm-crypt
If other filesystems are being used, please correct me. Also please
include if they are on crypto or straight block device.
>
> Either way, reclaim is usually worried about writing pages but it would appear
> after this change that a lot of read activity can also stall a process in
> direct reclaim. What might be happening in Frans's particular case is that the
> tasklet that allocates high-order pages for the RX buffers is getting stalled
> by congestion caused by other processes doing reads from the filesystem.
> While it makes sense from a congestion point of view to halt the IO, the
> reclaim operations from direct reclaimers are getting delayed for long enough
> to cause problems for GFP_ATOMIC.
The congestion_wait code either waits for congestion to clear or for
a given timeout. The part that isn't clear is if before the patch
we waited a very short time (congestion cleared quickly) or a very long
time (we hit the timeout or congestion cleared slowly).
The easiest way to tell is to just replace the congestion_wait() calls
in direct reclaim with schedule_timeout_interruptible(10), test, then
schedule_timeout_interruptible(HZ/20), then test again.
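The two outcomes described above can be made concrete with a userspace model of the `congestion_wait()` contract. The caller sleeps until congestion clears or the timeout expires. It then gets back the jiffies that were left, which is what `io_schedule_timeout()` returns, with 0 meaning the full timeout elapsed. `model_congestion_wait()`, `model_delay()` and the `clears_after` parameter are hypothetical. The real function sleeps on a wait queue instead of computing the answer.

```c
/* Hypothetical model: congestion clears `clears_after` jiffies from now
 * (a negative value means it does not clear within the window). Returns
 * the jiffies remaining when the caller wakes, 0 if the timeout expired. */
static long model_congestion_wait(long clears_after, long timeout)
{
    if (clears_after >= 0 && clears_after < timeout)
        return timeout - clears_after;  /* woken early: time remains */
    return 0;                           /* slept for the full timeout */
}

/* Time actually spent asleep: the quantity the instrumentation logs. */
static long model_delay(long clears_after, long timeout)
{
    return timeout - model_congestion_wait(clears_after, timeout);
}
```

A fixed sleep such as `schedule_timeout_interruptible(HZ/20)` always takes the timeout branch. That is why swapping it in separates "we woke up too early" from "we slept too long".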
>
> Does this sound plausible to you? If so, what's the best way of
> addressing this? Changing congestion_wait back to WRITE (assuming that
> works for Frans)? Changing it to SYNC (again, assuming it actually
> works) or a revert?
I don't think changing it to SYNC is a good plan unless we're actually
doing sync io. It would be better to just wait on one of the pages that
you've sent down (or its hashed waitqueue since the page can go away).
-chris
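Waiting on "its hashed waitqueue" works because page waiters do not sleep on a queue stored inside the page. The queue is chosen by hashing the page's address into a fixed table, so it outlives the page. The userspace sketch below illustrates that idea with Fibonacci (golden-ratio multiplicative) hashing. The table size, the multiplier and `page_waitqueue_index()` are illustrative assumptions, not the kernel's own hash.

```c
#include <stdint.h>

#define WAIT_TABLE_BITS 8u                      /* assumed: 256 queues */
#define WAIT_TABLE_SIZE (1u << WAIT_TABLE_BITS)

/* Map a page address to a bucket in a fixed wait-queue table. Different
 * pages can share a bucket, so a woken waiter must re-check the state of
 * the page it cares about before returning. */
static unsigned int page_waitqueue_index(const void *page)
{
    uint64_t key = (uint64_t)(uintptr_t)page;

    return (unsigned int)((key * 0x9E3779B97F4A7C15ull) >> (64 - WAIT_TABLE_BITS));
}
```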
On Tue, Oct 20, 2009 at 01:18:15AM +0900, Chris Mason wrote:
> Waiting doesn't make it synchronous from the elevator point of view ;)
> If you're using WB_SYNC_NONE, it's an async write. WB_SYNC_ALL makes it
> a sync write. I only see WB_SYNC_NONE in vmscan.c, so we should be
> using the async congestion wait. (the exception is xfs which always
> does async writes).
That's only because those people who did the global sweep did not bother
to convert it or even tell the list about it. I have a patch in my
QA queue to change it..
On Mon, Oct 19, 2009 at 03:06:19PM +0100, Mel Gorman wrote:
> Can you test with my kswapd patch applied and commits 373c0a7e,8aa7e847
> reverted please?
It seems that your patch and Frans' reverts together *do* make
difference.
With these patches I haven't been able to trigger failures so far
(in about 6 attempts). I'll continue testing and let you know if
anything changes.
If nothing changes, this looks like a fix for my problem.
Thanks. Thanks a lot!
Hi Mel,
Today Mel Gorman wrote:
> >
> > if you can send me a consolidated patch which does apply to
> > 2.6.31.4 I will be glad to try ...
> >
>
> Sure
>
> ==== CUT HERE ====
>
> From 6c0215af3b7c39ef7b8083ea38ca3ad93cd3f51f Mon Sep 17 00:00:00 2001
> From: Mel Gorman <[email protected]>
> Date: Mon, 19 Oct 2009 15:40:43 +0100
> Subject: [PATCH] Kick off kswapd after direct reclaim and revert congestion changes
>
> The following patch is http://lkml.org/lkml/2009/10/16/89 on top of
> 2.6.31.4 as well as patches 373c0a7e and 8aa7e847 reverted.
it seems to help ... the server has been running for 3 hours now
without incident, but then again it is not as active as during the
day, ... will report tomorrow.
cheers
tobi
--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900
Hi Mel,
Today Tobias Oetiker wrote:
> Hi Mel,
>
> Today Mel Gorman wrote:
>
> > >
> > > if you can send me a consolidated patch which does apply to
> > > 2.6.31.4 I will be glad to try ...
> > >
> >
> > Sure
> >
> > ==== CUT HERE ====
> >
> > From 6c0215af3b7c39ef7b8083ea38ca3ad93cd3f51f Mon Sep 17 00:00:00 2001
> > From: Mel Gorman <[email protected]>
> > Date: Mon, 19 Oct 2009 15:40:43 +0100
> > Subject: [PATCH] Kick off kswapd after direct reclaim and revert congestion changes
> >
> > The following patch is http://lkml.org/lkml/2009/10/16/89 on top of
> > 2.6.31.4 as well as patches 373c0a7e and 8aa7e847 reverted.
>
> it seems to help ... the server has been running for 3 hours now
> without incident, but then again it is not as active as during the
> day, ... will report tomorrow.
while I was writing, the system found that the patch does not really
help:
Oct 19 22:09:52 johan kernel: [11157.121506] smtpd: page allocation failure. order:5, mode:0x4020 [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121514] Pid: 19324, comm: smtpd Tainted: G D 2.6.31.4-oep #1 [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121518] Call Trace: [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121521] <IRQ> [<ffffffff810cb599>] __alloc_pages_nodemask+0x549/0x650 [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121563] [<ffffffffa02bde3b>] ? __nf_ct_refresh_acct+0xab/0x110 [nf_conntrack] [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121572] [<ffffffffa02a8337>] ? ipt_do_table+0x2f7/0x610 [ip_tables] [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121580] [<ffffffff810fac18>] kmalloc_large_node+0x68/0xc0 [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121585] [<ffffffff810fe90a>] __kmalloc_node_track_caller+0x11a/0x180 [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121592] [<ffffffff813ebd42>] ? skb_copy+0x32/0xa0 [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121596] [<ffffffff813e9606>] __alloc_skb+0x76/0x180 [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121600] [<ffffffff813ebd42>] skb_copy+0x32/0xa0 [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121615] [<ffffffffa07dd33c>] vboxNetFltLinuxPacketHandler+0x5c/0xd0 [vboxnetflt] [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121620] [<ffffffff813f2512>] dev_hard_start_xmit+0x142/0x320 [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121632] [<ffffffff8140a2c1>] __qdisc_run+0x1a1/0x230 [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121637] [<ffffffff813f41e0>] dev_queue_xmit+0x2b0/0x3a0 [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121642] [<ffffffff8142349b>] ip_finish_output+0x11b/0x2f0 [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121646] [<ffffffff814236f9>] ip_output+0x89/0xd0 [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121650] [<ffffffff81422710>] ip_local_out+0x20/0x30 [kern.warning]
Oct 19 22:09:52 johan kernel: [11157.121654] [<ffffffff81422ffb>] ip_queue_xmit+0x22b/0x3f0 [kern.warning]
cheers
tobi
--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900
On Mon, Oct 19, 2009 at 01:01:15PM -0400, Christoph Hellwig wrote:
> On Tue, Oct 20, 2009 at 01:18:15AM +0900, Chris Mason wrote:
> > Waiting doesn't make it synchronous from the elevator point of view ;)
> > If you're using WB_SYNC_NONE, it's an async write. WB_SYNC_ALL makes it
> > a sync write. I only see WB_SYNC_NONE in vmscan.c, so we should be
> > using the async congestion wait. (the exception is xfs which always
> > does async writes).
>
> That's only because those people who did the global sweep did not bother
> to convert it or even tell the list about it. I have a patch in my
> QA queue to change it..
Yes, we just didn't realize XFS was missed. Sorry. I wasn't trying to
blame xfs for being behind, just mentioning that we've got about 10
different variables here and I'm having a hard time figuring out which
ones to push on.
-chris
On Mon, Oct 19, 2009 at 07:09:47PM +0200, Karol Lewandowski wrote:
> On Mon, Oct 19, 2009 at 03:06:19PM +0100, Mel Gorman wrote:
> > Can you test with my kswapd patch applied and commits 373c0a7e,8aa7e847
> > reverted please?
>
> It seems that your patch and Frans' reverts together *do* make
> difference.
>
> With these patches I haven't been able to trigger failures so far
> (in about 6 attempts). I'll continue testing and let you know if
> anything changes.
Damn it.
I'm sorry to inform you that yes, I still get failures (less often,
but still).
Thanks.
e100: Intel(R) PRO/100 Network Driver, 3.5.24-k2-NAPI
e100: Copyright(c) 1999-2006 Intel Corporation
e100 0000:00:03.0: PCI INT A -> Link[LNKC] -> GSI 9 (level, low) -> IRQ 9
e100 0000:00:03.0: PME# disabled
e100: eth0: e100_probe: addr 0xe8120000, irq 9, MAC addr 00:10:a4:89:e8:84
ifconfig: page allocation failure. order:5, mode:0x8020
Pid: 5151, comm: ifconfig Not tainted 2.6.31+frans2+mel-00002-g90702f9-dirty #2
Call Trace:
[<c015c4e1>] ? __alloc_pages_nodemask+0x423/0x468
[<c0104de7>] ? dma_generic_alloc_coherent+0x4a/0xab
[<c0104d9d>] ? dma_generic_alloc_coherent+0x0/0xab
[<d1614b6f>] ? e100_alloc_cbs+0xc7/0x174 [e100]
[<d1615bfe>] ? e100_up+0x1b/0xf5 [e100]
[<d1615cef>] ? e100_open+0x17/0x41 [e100]
[<c02f871f>] ? dev_open+0x8f/0xc5
[<c02f7ed9>] ? dev_change_flags+0xa2/0x155
[<c032daa6>] ? devinet_ioctl+0x22a/0x51c
[<c02ebabe>] ? sock_ioctl+0x0/0x1e4
[<c02ebc7e>] ? sock_ioctl+0x1c0/0x1e4
[<c02ebabe>] ? sock_ioctl+0x0/0x1e4
[<c017f23a>] ? vfs_ioctl+0x16/0x4a
[<c017fb01>] ? do_vfs_ioctl+0x48a/0x4c1
[<c0168137>] ? handle_mm_fault+0x1e0/0x42c
[<c0348c6b>] ? do_page_fault+0x2ce/0x2e4
[<c017fb64>] ? sys_ioctl+0x2c/0x42
[<c0102748>] ? sysenter_do_call+0x12/0x26
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Normal per-cpu:
CPU 0: hi: 90, btch: 15 usd: 35
Active_anon:14778 active_file:10836 inactive_anon:22033
inactive_file:11854 unevictable:0 dirty:6 writeback:0 unstable:0
free:1031 slab:2083 mapped:6193 pagetables:417 bounce:0
DMA free:1096kB min:124kB low:152kB high:184kB active_anon:528kB inactive_anon:3440kB active_file:1076kB inactive_file:5580kB unevictable:0kB present:15868kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 238 238
Normal free:3028kB min:1908kB low:2384kB high:2860kB active_anon:58584kB inactive_anon:84692kB active_file:42268kB inactive_file:41836kB unevictable:0kB present:243776kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 46*4kB 0*8kB 5*16kB 0*32kB 1*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1096kB
Normal: 135*4kB 213*8kB 21*16kB 4*32kB 5*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3028kB
25927 total pagecache pages
3010 pages in swap cache
Swap cache stats: add 205613, delete 202603, find 63665/79800
Free swap = 485236kB
Total swap = 514040kB
65520 pages RAM
1663 pages reserved
14633 pages shared
52919 pages non-shared
ifconfig: page allocation failure. order:5, mode:0x8020
Pid: 5151, comm: ifconfig Not tainted 2.6.31+frans2+mel-00002-g90702f9-dirty #2
Call Trace:
[<c015c4e1>] ? __alloc_pages_nodemask+0x423/0x468
[<c0104de7>] ? dma_generic_alloc_coherent+0x4a/0xab
[<c0104d9d>] ? dma_generic_alloc_coherent+0x0/0xab
[<d1614b6f>] ? e100_alloc_cbs+0xc7/0x174 [e100]
[<d1615bfe>] ? e100_up+0x1b/0xf5 [e100]
[<d1615cef>] ? e100_open+0x17/0x41 [e100]
[<c02f871f>] ? dev_open+0x8f/0xc5
[<c02f7ed9>] ? dev_change_flags+0xa2/0x155
[<c032daa6>] ? devinet_ioctl+0x22a/0x51c
[<c02ebabe>] ? sock_ioctl+0x0/0x1e4
[<c02ebc7e>] ? sock_ioctl+0x1c0/0x1e4
[<c02ebabe>] ? sock_ioctl+0x0/0x1e4
[<c017f23a>] ? vfs_ioctl+0x16/0x4a
[<c017fb01>] ? do_vfs_ioctl+0x48a/0x4c1
[<c0175fd1>] ? vfs_write+0xf4/0x105
[<c017fb64>] ? sys_ioctl+0x2c/0x42
[<c0102748>] ? sysenter_do_call+0x12/0x26
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Normal per-cpu:
CPU 0: hi: 90, btch: 15 usd: 67
Active_anon:14760 active_file:10798 inactive_anon:22052
inactive_file:11862 unevictable:0 dirty:6 writeback:30 unstable:0
free:1031 slab:2083 mapped:6187 pagetables:417 bounce:0
DMA free:1096kB min:124kB low:152kB high:184kB active_anon:528kB inactive_anon:3440kB active_file:1076kB inactive_file:5580kB unevictable:0kB present:15868kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 238 238
Normal free:3028kB min:1908kB low:2384kB high:2860kB active_anon:58512kB inactive_anon:84768kB active_file:42116kB inactive_file:41868kB unevictable:0kB present:243776kB pages_scanned:100 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 46*4kB 0*8kB 5*16kB 0*32kB 1*64kB 0*128kB 1*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 1096kB
Normal: 135*4kB 213*8kB 21*16kB 4*32kB 5*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3028kB
25924 total pagecache pages
3037 pages in swap cache
Swap cache stats: add 205644, delete 202607, find 63666/79802
Free swap = 485116kB
Total swap = 514040kB
65520 pages RAM
1663 pages reserved
14638 pages shared
52896 pages non-shared
e100 0000:00:03.0: firmware: requesting e100/d101s_ucode.bin
ADDRCONF(NETDEV_UP): eth0: link is not ready
e100: eth0 NIC Link is Up 100 Mbps Full Duplex
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
eth0: no IPv6 routers present
On Tue, Oct 20, 2009 at 01:18:15AM +0900, Chris Mason wrote:
> On Mon, Oct 19, 2009 at 03:01:52PM +0100, Mel Gorman wrote:
> >
> > > During the 2nd phase I see the first SKB allocation errors with a music
> > > skip between reading commits 95.000 and 110.000.
> > > About commit 115.000 there is a very long pause during which the counter
> > > does not increase, music stops and the desktop freezes completely. The
> > > first 30 seconds of that freeze there is only very low disk activity (which
> > > seems strange);
> >
> > I'm just going to have to depend on Jens here. Jens, the congestion_wait() is
> > on BLK_RW_ASYNC after the commit. Reclaim usually writes pages asynchronously
> > but lumpy reclaim actually waits on pages to write out synchronously so
> > it's not always async.
>
> Waiting doesn't make it synchronous from the elevator point of view ;)
> If you're using WB_SYNC_NONE, it's an async write. WB_SYNC_ALL makes it
> a sync write. I only see WB_SYNC_NONE in vmscan.c, so we should be
> using the async congestion wait. (the exception is xfs which always
> does async writes).
>
Right, reclaim always queues the pages for async IO but for lumpy reclaim,
it calls wait_on_page_writeback() but as you say, from an elevator point of
view, it's still async.
> But I'm honestly not 100% sure. Looking back through the emails, the
> test case is doing IO on top of a whole lot of things on top of
> dm-crypt? I just tried to figure out if dm-crypt is turning the async
> IO into sync IOs, but didn't quite make sense of it.
>
I'm not overly sure either.
> Could you also please include which filesystems were being abused during
> the test and how? Reading through the emails, I think you've got:
>
> gitk being run 3 times on some FS (NFS?)
> streaming reads on NFS
> swap on dm-crypt
>
> If other filesystems are being used, please correct me. Also please
> include if they are on crypto or straight block device.
>
I've attached a patch below that should allow us to cheat. When it's applied,
it outputs who called congestion_wait(), how long the timeout was and how
long it waited for. By comparing before and after sleep times, we should
be able to see which of the callers has significantly changed and if
it's something easily addressable.
> > Either way, reclaim is usually worried about writing pages but it would appear
> > after this change that a lot of read activity can also stall a process in
> > direct reclaim. What might be happening in Frans's particular case is that the
> > tasklet that allocates high-order pages for the RX buffers is getting stalled
> > by congestion caused by other processes doing reads from the filesystem.
> > While it makes sense from a congestion point of view to halt the IO, the
> > reclaim operations from direct reclaimers are getting delayed for long enough
> > to cause problems for GFP_ATOMIC.
>
> The congestion_wait code either waits for congestion to clear or for
> a given timeout. The part that isn't clear is if before the patch
> we waited a very short time (congestion cleared quickly) or a very long
> time (we hit the timeout or congestion cleared slowly).
>
Using the instrumentation patch, I found with a very basic test that we
are waiting for short periods of time more often with the patch applied:
1 congestion_wait rw=1 delay 6 timeout 25 :: before commit
7 kswapd congestion_wait rw=1 delay 0 timeout 25 :: before commit
32 kswapd congestion_wait sync=0 delay 0 timeout 25 :: after commit
61 kswapd congestion_wait rw=1 delay 1 timeout 25 :: before commit
133 kswapd congestion_wait sync=0 delay 1 timeout 25 :: after commit
16 kswapd congestion_wait rw=1 delay 2 timeout 25 :: before commit
70 kswapd congestion_wait sync=0 delay 2 timeout 25 :: after commit
1 try_to_free_pages congestion_wait sync=0 delay 2 timeout 25 :: after commit
17 kswapd congestion_wait rw=1 delay 3 timeout 25 :: before commit
28 kswapd congestion_wait sync=0 delay 3 timeout 25 :: after commit
1 try_to_free_pages congestion_wait sync=0 delay 3 timeout 25 :: after commit
23 kswapd congestion_wait rw=1 delay 4 timeout 25 :: before commit
16 kswapd congestion_wait sync=0 delay 4 timeout 25 :: after commit
5 try_to_free_pages congestion_wait sync=0 delay 4 timeout 25 :: after commit
20 kswapd congestion_wait rw=1 delay 5 timeout 25 :: before commit
18 kswapd congestion_wait sync=0 delay 5 timeout 25 :: after commit
3 try_to_free_pages congestion_wait sync=0 delay 5 timeout 25 :: after commit
21 kswapd congestion_wait rw=1 delay 6 timeout 25 :: before commit
8 kswapd congestion_wait sync=0 delay 6 timeout 25 :: after commit
2 try_to_free_pages congestion_wait sync=0 delay 6 timeout 25 :: after commit
13 kswapd congestion_wait rw=1 delay 7 timeout 25 :: before commit
12 kswapd congestion_wait sync=0 delay 7 timeout 25 :: after commit
2 try_to_free_pages congestion_wait sync=0 delay 7 timeout 25 :: after commit
8 kswapd congestion_wait rw=1 delay 8 timeout 25 :: before commit
7 kswapd congestion_wait sync=0 delay 8 timeout 25 :: after commit
9 kswapd congestion_wait rw=1 delay 9 timeout 25 :: before commit
5 kswapd congestion_wait sync=0 delay 9 timeout 25 :: after commit
2 try_to_free_pages congestion_wait sync=0 delay 9 timeout 25 :: after commit
4 kswapd congestion_wait rw=1 delay 10 timeout 25 :: before commit
5 kswapd congestion_wait sync=0 delay 10 timeout 25 :: after commit
1 try_to_free_pages congestion_wait sync=0 delay 10 timeout 25 :: after commit
[... remaining output snipped ...]
The "before commit" and "after commit" runs are really 2.6.31 with the patches reverted and vanilla 2.6.31, respectively.
The first column is how many times we delayed for that length of time.
To generate the output, I just took the console log from both kernels with
a basic test, put the congestion_wait lines into two separate files and
cat congestion-*-sorted | sort -n -k5 | uniq -c
to give a count of how many times we delayed for a particular caller.
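For larger logs, the `sort | uniq -c` step can also be done in a few lines of C. The sketch below assumes each relevant line has already been stripped of any console timestamp. It also assumes the line has the shape printed by the instrumentation patch in this mail, `<caller> congestion_wait sync=<n> delay <d> timeout <t>`, or the `rw=<n>` variant seen in the "before commit" lines. `count_delays()` and `MAX_DELAY` are hypothetical names. Lines with an empty caller field do not match and are skipped.

```c
#include <stdio.h>
#include <string.h>

#define MAX_DELAY 64  /* delays at or above this fold into the last bucket */

/* Tally how many congestion_wait() lines report each delay (in jiffies).
 * Accepts either "sync=<n>" or "rw=<n>" for the third field. Lines that
 * do not match the format are skipped. Returns the number of lines counted. */
static int count_delays(const char *const *lines, int nlines,
                        unsigned int hist[MAX_DELAY])
{
    int counted = 0;

    memset(hist, 0, MAX_DELAY * sizeof(hist[0]));
    for (int i = 0; i < nlines; i++) {
        char caller[64];
        int arg;
        unsigned long delay;
        long timeout;

        if (sscanf(lines[i], "%63s congestion_wait %*[a-z]=%d delay %lu timeout %ld",
                   caller, &arg, &delay, &timeout) != 4)
            continue;
        hist[delay < MAX_DELAY ? delay : MAX_DELAY - 1]++;
        counted++;
    }
    return counted;
}
```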
> The easiest way to tell is to just replace the congestion_wait() calls
> in direct reclaim with schedule_timeout_interruptible(10), test, then
> schedule_timeout_interruptible(HZ/20), then test again.
>
Reclaim can also call congestion_wait() and maybe the problem isn't
within the page allocator at all but that it's indirectly affected by
timing.
> >
> > Does this sound plausible to you? If so, what's the best way of
> > addressing this? Changing congestion_wait back to WRITE (assuming that
> > works for Frans)? Changing it to SYNC (again, assuming it actually
> > works) or a revert?
>
> I don't think changing it to SYNC is a good plan unless we're actually
> doing sync io. It would be better to just wait on one of the pages that
> you've sent down (or its hashed waitqueue since the page can go away).
>
Frans, is there any chance you could apply the following patch and get
the console logs for a vanilla kernel and with the congestion patches
reverted? I'm hoping it'll be able to tell us which of the callers has
significantly changed in timing. If there is one caller that has
significantly changed, it might be enough to address just that caller.
=====
From 757999066dc41f2e053d59589c673052fc7c1a65 Mon Sep 17 00:00:00 2001
From: Mel Gorman <[email protected]>
Date: Tue, 20 Oct 2009 11:01:57 +0100
Subject: [PATCH] Instrument congestion_wait
This patch instruments how long congestion_wait() really waited for a
given caller.
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 3d3accb..fc945e0 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -10,6 +10,7 @@
#include <linux/module.h>
#include <linux/writeback.h>
#include <linux/device.h>
+#include <linux/kallsyms.h>
void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
{
@@ -729,6 +730,11 @@ EXPORT_SYMBOL(set_bdi_congested);
*/
long congestion_wait(int sync, long timeout)
{
+ unsigned long jiffies_start = jiffies;
+ char *module;
+ char buf[128];
+ const char *symbol;
+ unsigned long offset, symbolsize;
long ret;
DEFINE_WAIT(wait);
wait_queue_head_t *wqh = &congestion_wqh[sync];
@@ -736,6 +742,13 @@ long congestion_wait(int sync, long timeout)
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = io_schedule_timeout(timeout);
finish_wait(wqh, &wait);
+
+ symbol = kallsyms_lookup(_RET_IP_, &symbolsize, &offset, &module, buf);
+ printk(KERN_INFO "%-20s congestion_wait sync=%d delay %lu timeout %ld\n",
+ symbol,
+ sync,
+ jiffies - jiffies_start,
+ timeout);
return ret;
}
EXPORT_SYMBOL(congestion_wait);
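The patch above measures the sleep as `jiffies - jiffies_start`. Both values are `unsigned long`, and unsigned arithmetic in C is modulo 2^N. As a result, the difference stays correct even if the jiffies counter wraps during the sleep, provided the real interval is shorter than the counter's period. The userspace check below demonstrates this. `elapsed_jiffies()` is a hypothetical helper, not a kernel function.

```c
#include <limits.h>

/* Elapsed ticks between two samples of a free-running unsigned counter.
 * Unsigned subtraction wraps modulo ULONG_MAX + 1, so a single wraparound
 * between `start` and `now` still yields the true interval. */
static unsigned long elapsed_jiffies(unsigned long start, unsigned long now)
{
    return now - start;
}
```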
On Mon, Oct 19, 2009 at 10:17:06PM +0200, Tobias Oetiker wrote:
> Hi Mel,
>
> Today Tobias Oetiker wrote:
>
> > Hi Mel,
> >
> > Today Mel Gorman wrote:
> >
> > > >
> > > > if you can send me a consolidated patch which does apply to
> > > > 2.6.31.4 I will be glad to try ...
> > > >
> > >
> > > Sure
> > >
> > > ==== CUT HERE ====
> > >
> > > From 6c0215af3b7c39ef7b8083ea38ca3ad93cd3f51f Mon Sep 17 00:00:00 2001
> > > From: Mel Gorman <[email protected]>
> > > Date: Mon, 19 Oct 2009 15:40:43 +0100
> > > Subject: [PATCH] Kick off kswapd after direct reclaim and revert congestion changes
> > >
> > > The following patch is http://lkml.org/lkml/2009/10/16/89 on top of
> > > 2.6.31.4 as well as patches 373c0a7e and 8aa7e847 reverted.
> >
> > it seems to help ... the server has been running for 3 hours now
> > without incident, but then again it is not as active as during the
> > day, ... will report tomorrow.
>
> while I was writing, the system found that the patch does not really
> help:
>
> Oct 19 22:09:52 johan kernel: [11157.121506] smtpd: page allocation failure. order:5, mode:0x4020 [kern.warning]
> Oct 19 22:09:52 johan kernel: [11157.121514] Pid: 19324, comm: smtpd Tainted: G D 2.6.31.4-oep #1 [kern.warning]
> Oct 19 22:09:52 johan kernel: [11157.121518] Call Trace: [kern.warning]
> Oct 19 22:09:52 johan kernel: [11157.121521] <IRQ> [<ffffffff810cb599>] __alloc_pages_nodemask+0x549/0x650 [kern.warning]
> Oct 19 22:09:52 johan kernel: [11157.121563] [<ffffffffa02bde3b>] ? __nf_ct_refresh_acct+0xab/0x110 [nf_conntrack] [kern.warning]
> Oct 19 22:09:52 johan kernel: [11157.121572] [<ffffffffa02a8337>] ? ipt_do_table+0x2f7/0x610 [ip_tables] [kern.warning]
> Oct 19 22:09:52 johan kernel: [11157.121580] [<ffffffff810fac18>] kmalloc_large_node+0x68/0xc0 [kern.warning]
> Oct 19 22:09:52 johan kernel: [11157.121585] [<ffffffff810fe90a>] __kmalloc_node_track_caller+0x11a/0x180 [kern.warning]
> Oct 19 22:09:52 johan kernel: [11157.121592] [<ffffffff813ebd42>] ? skb_copy+0x32/0xa0 [kern.warning]
> Oct 19 22:09:52 johan kernel: [11157.121596] [<ffffffff813e9606>] __alloc_skb+0x76/0x180 [kern.warning]
> Oct 19 22:09:52 johan kernel: [11157.121600] [<ffffffff813ebd42>] skb_copy+0x32/0xa0 [kern.warning]
> Oct 19 22:09:52 johan kernel: [11157.121615] [<ffffffffa07dd33c>] vboxNetFltLinuxPacketHandler+0x5c/0xd0 [vboxnetflt] [kern.warning]
> Oct 19 22:09:52 johan kernel: [11157.121620] [<ffffffff813f2512>] dev_hard_start_xmit+0x142/0x320 [kern.warning]
Is the number of failures at least reduced, or are they occurring at the
same rate? Also, what was the last kernel that worked for you with this
configuration?
Thanks
> Oct 19 22:09:52 johan kernel: [11157.121632] [<ffffffff8140a2c1>] __qdisc_run+0x1a1/0x230 [kern.warning]
> Oct 19 22:09:52 johan kernel: [11157.121637] [<ffffffff813f41e0>] dev_queue_xmit+0x2b0/0x3a0 [kern.warning]
> Oct 19 22:09:52 johan kernel: [11157.121642] [<ffffffff8142349b>] ip_finish_output+0x11b/0x2f0 [kern.warning]
> Oct 19 22:09:52 johan kernel: [11157.121646] [<ffffffff814236f9>] ip_output+0x89/0xd0 [kern.warning]
> Oct 19 22:09:52 johan kernel: [11157.121650] [<ffffffff81422710>] ip_local_out+0x20/0x30 [kern.warning]
> Oct 19 22:09:52 johan kernel: [11157.121654] [<ffffffff81422ffb>] ip_queue_xmit+0x22b/0x3f0 [kern.warning]
>
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Hi Mel,
Today Mel Gorman wrote:
> On Mon, Oct 19, 2009 at 10:17:06PM +0200, Tobias Oetiker wrote:
> > Oct 19 22:09:52 johan kernel: [11157.121600] [<ffffffff813ebd42>] skb_copy+0x32/0xa0 [kern.warning]
> > Oct 19 22:09:52 johan kernel: [11157.121615] [<ffffffffa07dd33c>] vboxNetFltLinuxPacketHandler+0x5c/0xd0 [vboxnetflt] [kern.warning]
> > Oct 19 22:09:52 johan kernel: [11157.121620] [<ffffffff813f2512>] dev_hard_start_xmit+0x142/0x320 [kern.warning]
>
> Is the number of failures at least reduced, or are they occurring at the
> same rate?
not that it would have any statistical significance, but I had 5
failures (clusters) yesterday morning and 5 this morning ...
the failures often show up in groups; I saved one at
http://tobi.oetiker.ch/cluster-2009-10-20-08-31.txt
> Also, what was the last kernel that worked for you with this
> configuration?
that would be 2.6.24 ... I have not upgraded in quite some time.
But since the io performance of 2.6.31 is about double in my tests
I thought it would be a good thing to do ...
cheers
tobi
> Thanks
>
> > Oct 19 22:09:52 johan kernel: [11157.121632] [<ffffffff8140a2c1>] __qdisc_run+0x1a1/0x230 [kern.warning]
> > Oct 19 22:09:52 johan kernel: [11157.121637] [<ffffffff813f41e0>] dev_queue_xmit+0x2b0/0x3a0 [kern.warning]
> > Oct 19 22:09:52 johan kernel: [11157.121642] [<ffffffff8142349b>] ip_finish_output+0x11b/0x2f0 [kern.warning]
> > Oct 19 22:09:52 johan kernel: [11157.121646] [<ffffffff814236f9>] ip_output+0x89/0xd0 [kern.warning]
> > Oct 19 22:09:52 johan kernel: [11157.121650] [<ffffffff81422710>] ip_local_out+0x20/0x30 [kern.warning]
> > Oct 19 22:09:52 johan kernel: [11157.121654] [<ffffffff81422ffb>] ip_queue_xmit+0x22b/0x3f0 [kern.warning]
> >
>
>
--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
http://it.oetiker.ch [email protected] ++41 62 775 9902 / sb: -9900
On Tue, Oct 20, 2009 at 01:44:50PM +0200, Tobias Oetiker wrote:
> Hi Mel,
>
> Today Mel Gorman wrote:
>
> > On Mon, Oct 19, 2009 at 10:17:06PM +0200, Tobias Oetiker wrote:
>
> > > Oct 19 22:09:52 johan kernel: [11157.121600] [<ffffffff813ebd42>] skb_copy+0x32/0xa0 [kern.warning]
> > > Oct 19 22:09:52 johan kernel: [11157.121615] [<ffffffffa07dd33c>] vboxNetFltLinuxPacketHandler+0x5c/0xd0 [vboxnetflt] [kern.warning]
> > > Oct 19 22:09:52 johan kernel: [11157.121620] [<ffffffff813f2512>] dev_hard_start_xmit+0x142/0x320 [kern.warning]
> >
> > Is the number of failures at least reduced, or are they occurring at the
> > same rate?
>
> Not that it would have any statistical significance, but I had 5
> failures (clusters) yesterday morning and 5 this morning...
>
Before the patches were applied, how many failures were you seeing in
the morning?
> The failures often show up in groups; I saved one at
> http://tobi.oetiker.ch/cluster-2009-10-20-08-31.txt
>
> > Also, what was the last kernel that worked for you with this
> > configuration?
>
> That would be 2.6.24... I have not upgraded in quite some time.
> But since the I/O performance of 2.6.31 is about double in my tests,
> I thought it would be a good thing to do...
>
That significant a difference in performance may explain differences in timing
as well, i.e. the allocator is being put under more pressure now than it
was previously, as more processes make forward progress.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
Hi Mel,
Today Mel Gorman wrote:
> On Tue, Oct 20, 2009 at 01:44:50PM +0200, Tobias Oetiker wrote:
> > Hi Mel,
> >
> > Today Mel Gorman wrote:
> >
> > > On Mon, Oct 19, 2009 at 10:17:06PM +0200, Tobias Oetiker wrote:
> >
> > > > Oct 19 22:09:52 johan kernel: [11157.121600] [<ffffffff813ebd42>] skb_copy+0x32/0xa0 [kern.warning]
> > > > Oct 19 22:09:52 johan kernel: [11157.121615] [<ffffffffa07dd33c>] vboxNetFltLinuxPacketHandler+0x5c/0xd0 [vboxnetflt] [kern.warning]
> > > > Oct 19 22:09:52 johan kernel: [11157.121620] [<ffffffff813f2512>] dev_hard_start_xmit+0x142/0x320 [kern.warning]
> > >
> > > Is the number of failures at least reduced, or are they occurring at the
> > > same rate?
> >
> > Not that it would have any statistical significance, but I had 5
> > failures (clusters) yesterday morning and 5 this morning...
> >
>
> Before the patches were applied, how many failures were you seeing in
> the morning?
5 as well... before and after...
> > The failures often show up in groups; I saved one at
> > http://tobi.oetiker.ch/cluster-2009-10-20-08-31.txt
> >
> > > Also, what was the last kernel that worked for you with this
> > > configuration?
> >
> > That would be 2.6.24... I have not upgraded in quite some time.
> > But since the I/O performance of 2.6.31 is about double in my tests,
> > I thought it would be a good thing to do...
> >
>
> That significant a difference in performance may explain differences in timing
> as well, i.e. the allocator is being put under more pressure now than it
> was previously, as more processes make forward progress.
So you are saying that the problem might be even older?
We do have 8 GB of RAM and 16 GB of swap, so it should not fail to allocate
all that often.
top - 14:58:34 up 19:54, 6 users, load average: 2.09, 1.94, 1.97
Tasks: 451 total, 1 running, 449 sleeping, 0 stopped, 1 zombie
Cpu(s): 3.5%us, 15.5%sy, 2.0%ni, 72.2%id, 6.5%wa, 0.1%hi, 0.3%si, 0.0%st
Mem: 8198504k total, 7599132k used, 599372k free, 1212636k buffers
Swap: 16777208k total, 83568k used, 16693640k free, 610136k cached
cheers
tobi
--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
On Tue, Oct 20, 2009 at 02:58:53PM +0200, Tobias Oetiker wrote:
> Hi Mel,
>
> Today Mel Gorman wrote:
>
> > On Tue, Oct 20, 2009 at 01:44:50PM +0200, Tobias Oetiker wrote:
> > > Hi Mel,
> > >
> > > Today Mel Gorman wrote:
> > >
> > > > On Mon, Oct 19, 2009 at 10:17:06PM +0200, Tobias Oetiker wrote:
> > >
> > > > > Oct 19 22:09:52 johan kernel: [11157.121600] [<ffffffff813ebd42>] skb_copy+0x32/0xa0 [kern.warning]
> > > > > Oct 19 22:09:52 johan kernel: [11157.121615] [<ffffffffa07dd33c>] vboxNetFltLinuxPacketHandler+0x5c/0xd0 [vboxnetflt] [kern.warning]
> > > > > Oct 19 22:09:52 johan kernel: [11157.121620] [<ffffffff813f2512>] dev_hard_start_xmit+0x142/0x320 [kern.warning]
> > > >
> > > > Is the number of failures at least reduced, or are they occurring at the
> > > > same rate?
> > >
> > > Not that it would have any statistical significance, but I had 5
> > > failures (clusters) yesterday morning and 5 this morning...
> > >
> >
> > Before the patches were applied, how many failures were you seeing in
> > the morning?
>
> 5 as well... before and after...
>
> > > The failures often show up in groups; I saved one at
> > > http://tobi.oetiker.ch/cluster-2009-10-20-08-31.txt
> > >
> > > > Also, what was the last kernel that worked for you with this
> > > > configuration?
> > >
> > > That would be 2.6.24... I have not upgraded in quite some time.
> > > But since the I/O performance of 2.6.31 is about double in my tests,
> > > I thought it would be a good thing to do...
> > >
> >
> > That significant a difference in performance may explain differences in timing
> > as well, i.e. the allocator is being put under more pressure now than it
> > was previously, as more processes make forward progress.
>
> So you are saying that the problem might be even older?
>
> We do have 8 GB of RAM and 16 GB of swap, so it should not fail to allocate
> all that often.
>
> top - 14:58:34 up 19:54, 6 users, load average: 2.09, 1.94, 1.97
> Tasks: 451 total, 1 running, 449 sleeping, 0 stopped, 1 zombie
> Cpu(s): 3.5%us, 15.5%sy, 2.0%ni, 72.2%id, 6.5%wa, 0.1%hi, 0.3%si, 0.0%st
> Mem: 8198504k total, 7599132k used, 599372k free, 1212636k buffers
> Swap: 16777208k total, 83568k used, 16693640k free, 610136k cached
>
High-order atomic allocations of the type you are trying at that frequency
were always a very long shot. The most likely outcome is that something
has changed so that a burst of allocations triggers an allocation failure,
whereas before, processes would delay long enough for the system not to notice.
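For reference, an order-N request asks the buddy allocator for 2^N physically contiguous pages, so the order-5 allocations discussed here need 32 contiguous pages (128 KiB with 4 KiB pages), and GFP_ATOMIC callers cannot enter direct reclaim to create such a block. A minimal userspace sketch of the size arithmetic, assuming a 4 KiB page size; the helper names are hypothetical and only mirror the idea behind the kernel's get_order():

```c
#include <assert.h>

/* Assumption: 4 KiB base pages, as on x86 and x86-64. */
#define SKETCH_PAGE_SIZE 4096UL

/* Bytes covered by one order-N block: 2^N contiguous pages. */
static unsigned long order_to_bytes(unsigned int order)
{
	return SKETCH_PAGE_SIZE << order;
}

/* Smallest order whose block can hold `size` bytes. */
static unsigned int size_to_order(unsigned long size)
{
	unsigned int order = 0;

	assert(size > 0);
	while (order_to_bytes(order) < size)
		order++;
	return order;
}
```

A request of 100000 bytes, for example, rounds up to order 5, so it fails whenever no free 128 KiB contiguous block is available, however much fragmented memory is free.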
1. Have MTU settings changed?
2. As order-5 allocations are required to succeed, I'm surprised in a
sense that there are only 5 failures because it implies the machine is
actually recovering and continuing on as normal. Can you think of what
happens in the morning that causes a burst of allocations to occur?
3. Other than the failures, have you noticed any other problems with the
machine or does it continue along happily?
4. Does the following patch help by any chance?
Thanks
==== CUT HERE ====
vmscan: Force kswapd to take notice faster when high-order watermarks are being hit
When a high-order allocation fails, kswapd is kicked so that it reclaims
at a higher order to avoid stalling direct reclaimers and to help GFP_ATOMIC
allocations. Something has changed in recent kernels that affects the timing,
so that high-order GFP_ATOMIC allocations are now failing more frequently,
particularly under pressure. This patch forces kswapd to notice sooner that
high-order allocations are occurring by checking when watermarks are hit early
and by having kswapd restart quickly when the reclaim order is increased.
Not-signed-off-by-because-this-is-a-hatchet-job: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 14 ++++++++++++--
mm/vmscan.c | 9 +++++++++
2 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2fd7b20..fdbf8c9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1907,6 +1906,17 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
zonelist, high_zoneidx, nodemask,
preferred_zone, migratetype);
+ /*
+ * If after a high-order allocation we are now below watermarks,
+ * pre-emptively kick kswapd rather than having the next allocation
+ * fail and have to wake up kswapd, potentially failing GFP_ATOMIC
+ * allocations or entering direct reclaim
+ */
+ if (unlikely(order) && page && !zone_watermark_ok(preferred_zone, order,
+ preferred_zone->watermark[ALLOC_WMARK_LOW],
+ zone_idx(preferred_zone), ALLOC_WMARK_LOW))
+ wake_all_kswapd(order, zonelist, high_zoneidx);
+
return page;
}
EXPORT_SYMBOL(__alloc_pages_nodemask);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9219beb..0e66a6b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1925,6 +1925,15 @@ loop_again:
priority != DEF_PRIORITY)
continue;
+ /*
+ * Exit quickly to restart if it has been indicated
+ * that higher orders are required
+ */
+ if (pgdat->kswapd_max_order > order) {
+ all_zones_ok = 1;
+ goto out;
+ }
+
if (!zone_watermark_ok(zone, order,
high_wmark_pages(zone), end_zone, 0))
all_zones_ok = 0;
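The first hunk depends on zone_watermark_ok() refusing a high-order request when plenty of memory is free but little of it sits in large enough contiguous blocks. The following is a simplified userspace model of that check as the 2.6.31-era function performs it, as far as I understand the code; struct sketch_zone and the other names are hypothetical, and the ALLOC_HIGH/ALLOC_HARDER adjustments are omitted:

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_ORDER 11	/* assumption: the common MAX_ORDER default */

/* Hypothetical, heavily simplified stand-in for struct zone. */
struct sketch_zone {
	long nr_free_pages;			/* total free base pages */
	long nr_free[SKETCH_MAX_ORDER];		/* free blocks per order */
	long lowmem_reserve;			/* pages held back for lower zones */
};

/*
 * After taking the request out of the free count, every order below
 * the requested one stops counting towards the total, and the
 * required minimum halves at each step.
 */
static bool sketch_watermark_ok(const struct sketch_zone *z,
				unsigned int order, long mark)
{
	long min = mark;
	long free_pages = z->nr_free_pages - (1L << order) + 1;
	unsigned int o;

	assert(order < SKETCH_MAX_ORDER);
	if (free_pages <= min + z->lowmem_reserve)
		return false;

	for (o = 0; o < order; o++) {
		/* Blocks of order o are too small for this request. */
		free_pages -= z->nr_free[o] << o;
		min >>= 1;
		if (free_pages <= min)
			return false;
	}
	return true;
}
```

In the model, a zone holding 10000 free pages that are all single order-0 pages passes the check at order 0 but fails it at order 5 with a mark of 128 pages, while the same amount of free memory held mostly in order-5 blocks passes. That fragmented state is what the first hunk tries to catch early, by waking kswapd as soon as a successful high-order allocation leaves the preferred zone below the low watermark for that order.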
Hi Mel,
Today Mel Gorman wrote:
> On Tue, Oct 20, 2009 at 02:58:53PM +0200, Tobias Oetiker wrote:
> > So you are saying that the problem might be even older?
> >
> > We do have 8 GB of RAM and 16 GB of swap, so it should not fail to allocate
> > all that often.
> >
> > top - 14:58:34 up 19:54, 6 users, load average: 2.09, 1.94, 1.97
> > Tasks: 451 total, 1 running, 449 sleeping, 0 stopped, 1 zombie
> > Cpu(s): 3.5%us, 15.5%sy, 2.0%ni, 72.2%id, 6.5%wa, 0.1%hi, 0.3%si, 0.0%st
> > Mem: 8198504k total, 7599132k used, 599372k free, 1212636k buffers
> > Swap: 16777208k total, 83568k used, 16693640k free, 610136k cached
> >
>
> High-order atomic allocations of the type you are trying at that frequency
> were always a very long shot. The most likely outcome is that something
> has changed so that a burst of allocations triggers an allocation failure,
> whereas before, processes would delay long enough for the system not to notice.
>
> 1. Have MTU settings changed?
No, not to my knowledge.
> 2. As order-5 allocations are required to succeed, I'm surprised in a
> sense that there are only 5 failures because it implies the machine is
> actually recovering and continuing on as normal. Can you think of what
> happens in the morning that causes a burst of allocations to occur?
The bursts occur all day while the machine is in use; it's just
that I was writing this at noon, so only the morning had passed. So
I compared things to the day before...
> 3. Other than the failures, have you noticed any other problems with the
> machine or does it continue along happily?
The machine seems to be fine.
> 4. Does the following patch help by any chance?
Should I try this on vanilla 2.6.31.4 or on top of your previous
patch?
We are running VirtualBox 3.0.8 on this machine; VirtualBox uses
the physical network interface in bridge mode to access the network.
Could this have something to do with the problem?
cheers
tobi
--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
On Tue, Oct 20, 2009 at 03:50:12PM +0200, Tobias Oetiker wrote:
> Hi Mel,
>
> Today Mel Gorman wrote:
>
> > On Tue, Oct 20, 2009 at 02:58:53PM +0200, Tobias Oetiker wrote:
> > > So you are saying that the problem might be even older?
> > >
> > > We do have 8 GB of RAM and 16 GB of swap, so it should not fail to allocate
> > > all that often.
> > >
> > > top - 14:58:34 up 19:54, 6 users, load average: 2.09, 1.94, 1.97
> > > Tasks: 451 total, 1 running, 449 sleeping, 0 stopped, 1 zombie
> > > Cpu(s): 3.5%us, 15.5%sy, 2.0%ni, 72.2%id, 6.5%wa, 0.1%hi, 0.3%si, 0.0%st
> > > Mem: 8198504k total, 7599132k used, 599372k free, 1212636k buffers
> > > Swap: 16777208k total, 83568k used, 16693640k free, 610136k cached
> > >
> >
> > High-order atomic allocations of the type you are trying at that frequency
> > were always a very long shot. The most likely outcome is that something
> > has changed so that a burst of allocations triggers an allocation failure,
> > whereas before, processes would delay long enough for the system not to notice.
> >
> > 1. Have MTU settings changed?
>
> No, not to my knowledge.
>
> > 2. As order-5 allocations are required to succeed, I'm surprised in a
> > sense that there are only 5 failures because it implies the machine is
> > actually recovering and continuing on as normal. Can you think of what
> > happens in the morning that causes a burst of allocations to occur?
>
> The bursts occur all day while the machine is in use; it's just
> that I was writing this at noon, so only the morning had passed. So
> I compared things to the day before...
>
Over the course of a day, how many would you see? By and large, it seems
that the problems you and Frans are seeing are similar, except his is a lot
more severe.
> > 3. Other than the failures, have you noticed any other problems with the
> > machine or does it continue along happily?
>
> The machine seems to be fine.
>
> > 4. Does the following patch help by any chance?
>
> Should I try this on vanilla 2.6.31.4 or on top of your previous
> patch?
>
Try it on top of vanilla 2.6.31.4 first, please, and if failures still occur,
then on top of the previous patch.
> We are running VirtualBox 3.0.8 on this machine; VirtualBox uses
> the physical network interface in bridge mode to access the network.
> Could this have something to do with the problem?
>
I do not know for sure. I'm assuming the configuration is the same on
both kernels so it's unlikely to be the issue.
--
Mel Gorman
Hi Mel,
Today Mel Gorman wrote:
>
> Over the course of a day, how many would you see? By and large, it seems
> that the problems you and Frans are seeing are similar, except his is a lot
> more severe.
Yesterday it was 19 in 24 hours; today it is 9 in 16 hours (the day
is not done yet).
> Try it on top of vanilla 2.6.31.4 first, please, and if failures still occur,
> then on top of the previous patch.
ok
> > We are running VirtualBox 3.0.8 on this machine; VirtualBox uses
> > the physical network interface in bridge mode to access the network.
> > Could this have something to do with the problem?
> >
>
> I do not know for sure. I'm assuming the configuration is the same on
> both kernels so it's unlikely to be the issue.
Just to be on the safe side, I created a ticket with the VirtualBox
people: http://www.virtualbox.org/ticket/5260
cheers
tobi
--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
On Thu, Oct 01, 2009 at 09:56:04PM +0200, Rafael J. Wysocki wrote:
> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=14265
> Subject : ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
> Submitter : Karol Lewandowski <[email protected]>
> Date : 2009-09-15 12:05 (17 days old)
> References : http://marc.info/?l=linux-kernel&m=125301636509517&w=4
Guys, could anyone check if the patch below helps? I think I've finally
found the culprit of all the allocation failures (but I might be wrong
too... ;-)
Thanks.
commit d6849591e042bceb66f1b4513a1df6740d2ad762
Author: Karol Lewandowski <[email protected]>
Date: Wed Oct 21 21:01:20 2009 +0200
SLUB: Don't drop __GFP_NOFAIL completely from allocate_slab()
Commit ba52270d18fb17ce2cf176b35419dab1e43fe4a3 unconditionally
cleared __GFP_NOFAIL flag on all allocations.
Preserve this flag on second attempt to allocate page (with possibly
decreased order).
This should help with bugs #14265, #14141 and similar.
Signed-off-by: Karol Lewandowski <[email protected]>
diff --git a/mm/slub.c b/mm/slub.c
index b627675..ac5db65 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1084,7 +1084,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
{
struct page *page;
struct kmem_cache_order_objects oo = s->oo;
- gfp_t alloc_gfp;
+ gfp_t alloc_gfp, nofail;
flags |= s->allocflags;
@@ -1092,6 +1092,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
* Let the initial higher-order allocation fail under memory pressure
* so we fall-back to the minimum order allocation.
*/
+ nofail = flags & __GFP_NOFAIL;
alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
page = alloc_slab_page(alloc_gfp, node, oo);
@@ -1100,8 +1101,10 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
/*
* Allocation may have failed due to fragmentation.
* Try a lower order alloc if possible
+ *
+ * Preserve __GFP_NOFAIL flag if previous allocation failed.
*/
- page = alloc_slab_page(flags, node, oo);
+ page = alloc_slab_page(flags | nofail, node, oo);
if (!page)
return NULL;
On Wed, 21 Oct 2009, Karol Lewandowski wrote:
> commit d6849591e042bceb66f1b4513a1df6740d2ad762
> Author: Karol Lewandowski <[email protected]>
> Date: Wed Oct 21 21:01:20 2009 +0200
>
> SLUB: Don't drop __GFP_NOFAIL completely from allocate_slab()
>
> Commit ba52270d18fb17ce2cf176b35419dab1e43fe4a3 unconditionally
> cleared __GFP_NOFAIL flag on all allocations.
>
No, it clears __GFP_NOFAIL from the first allocation of oo_order(s->oo).
If that fails (and it's easy to fail, it has __GFP_NORETRY), another
allocation is attempted with oo_order(s->min), for which __GFP_NOFAIL
would be preserved if that's the slab cache's allocflags.
> Preserve this flag on second attempt to allocate page (with possibly
> decreased order).
>
> This should help with bugs #14265, #14141 and similar.
>
> Signed-off-by: Karol Lewandowski <[email protected]>
>
> diff --git a/mm/slub.c b/mm/slub.c
> index b627675..ac5db65 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1084,7 +1084,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> {
> struct page *page;
> struct kmem_cache_order_objects oo = s->oo;
> - gfp_t alloc_gfp;
> + gfp_t alloc_gfp, nofail;
>
> flags |= s->allocflags;
>
> @@ -1092,6 +1092,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> * Let the initial higher-order allocation fail under memory pressure
> * so we fall-back to the minimum order allocation.
> */
> + nofail = flags & __GFP_NOFAIL;
> alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
>
> page = alloc_slab_page(alloc_gfp, node, oo);
> @@ -1100,8 +1101,10 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> /*
> * Allocation may have failed due to fragmentation.
> * Try a lower order alloc if possible
> + *
> + * Preserve __GFP_NOFAIL flag if previous allocation failed.
> */
> - page = alloc_slab_page(flags, node, oo);
> + page = alloc_slab_page(flags | nofail, node, oo);
> if (!page)
> return NULL;
>
>
This does nothing. You may have missed that the lower order allocation is
passing 'flags' (which is a union of the gfp flags passed to
allocate_slab() based on the allocation context and the cache's
allocflags), and not alloc_gfp where __GFP_NOFAIL is masked.
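David's point can be checked mechanically: flags & __GFP_NOFAIL can only contain bits already present in flags, so OR-ing it back in on the fallback attempt is a no-op. A userspace sketch of the two attempts as described above; the flag values are arbitrary illustrative bits rather than the kernel's, and all names are hypothetical:

```c
#include <assert.h>

typedef unsigned int sketch_gfp_t;

/* Arbitrary distinct bits for illustration, NOT the kernel's values. */
#define SK_GFP_NOWARN	0x1u
#define SK_GFP_NOFAIL	0x2u
#define SK_GFP_NORETRY	0x4u

/* First, opportunistic attempt at the preferred (higher) order:
 * it may fail quietly, so __GFP_NOFAIL is masked out. */
static sketch_gfp_t first_attempt_flags(sketch_gfp_t flags)
{
	return (flags | SK_GFP_NOWARN | SK_GFP_NORETRY) & ~SK_GFP_NOFAIL;
}

/* Fallback at the minimum order: the caller's flags (already combined
 * with the cache's allocflags) are passed through unchanged. */
static sketch_gfp_t fallback_flags(sketch_gfp_t flags)
{
	return flags;
}

/* What the proposed patch passes on the fallback attempt. */
static sketch_gfp_t patched_fallback_flags(sketch_gfp_t flags)
{
	sketch_gfp_t nofail = flags & SK_GFP_NOFAIL;

	return flags | nofail;
}
```

Comparing patched_fallback_flags() with fallback_flags() over every combination of these bits finds no difference, which matches the Nack.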
Nack.
Note: slub isn't going to be a culprit in order 5 allocation failures
since they have kmalloc passthrough to the page allocator.
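The passthrough David mentions means large kmalloc() requests under SLUB never touch a slab cache. A sketch of that routing decision, assuming 4 KiB pages and a 2.6.31-era SLUB cut-off of two pages (SLUB_MAX_SIZE in include/linux/slub_def.h; check it against the tree in use); the names below are hypothetical:

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL				/* assumption: 4 KiB pages */
#define SKETCH_SLUB_MAX_SIZE (2 * SKETCH_PAGE_SIZE)	/* assumed SLUB cut-off */

enum kmalloc_route { ROUTE_SLUB_CACHE, ROUTE_PAGE_ALLOCATOR };

/* Requests larger than the cut-off bypass SLUB entirely. */
static enum kmalloc_route route_for_size(unsigned long size)
{
	return size > SKETCH_SLUB_MAX_SIZE ? ROUTE_PAGE_ALLOCATOR
					   : ROUTE_SLUB_CACHE;
}

/* Order requested from the page allocator for a passthrough request. */
static unsigned int passthrough_order(unsigned long size)
{
	unsigned int order = 0;

	assert(route_for_size(size) == ROUTE_PAGE_ALLOCATOR);
	while ((SKETCH_PAGE_SIZE << order) < size)
		order++;
	return order;
}
```

Under these assumptions an order-5 kmalloc() request reaches the page allocator directly under SLUB, so SLUB's own order fallback in allocate_slab() is not where such a failure originates.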
On Wed, Oct 21, 2009 at 02:06:41PM -0700, David Rientjes wrote:
> On Wed, 21 Oct 2009, Karol Lewandowski wrote:
>
> > commit d6849591e042bceb66f1b4513a1df6740d2ad762
> > Author: Karol Lewandowski <[email protected]>
> > Date: Wed Oct 21 21:01:20 2009 +0200
> >
> > SLUB: Don't drop __GFP_NOFAIL completely from allocate_slab()
> >
> > Commit ba52270d18fb17ce2cf176b35419dab1e43fe4a3 unconditionally
> > cleared __GFP_NOFAIL flag on all allocations.
> >
>
> No, it clears __GFP_NOFAIL from the first allocation of oo_order(s->oo).
> If that fails (and it's easy to fail, it has __GFP_NORETRY), another
> allocation is attempted with oo_order(s->min), for which __GFP_NOFAIL
> would be preserved if that's the slab cache's allocflags.
Right, the patch is junk.
However, I haven't been able to trigger failures since I switched
to the SLAB allocator. That patch seemed related (and wrong), but it
wasn't.
> > */
> > - page = alloc_slab_page(flags, node, oo);
> > + page = alloc_slab_page(flags | nofail, node, oo);
> > if (!page)
> > return NULL;
> >
> >
>
> This does nothing. You may have missed that the lower order allocation is
> passing 'flags' (which is a union of the gfp flags passed to
> allocate_slab() based on the allocation context and the cache's
> allocflags), and not alloc_gfp where __GFP_NOFAIL is masked.
Right, I missed that.
> Nack.
>
> Note: slub isn't going to be a culprit in order 5 allocation failures
> since they have kmalloc passthrough to the page allocator.
However, it might change fragmentation somewhat, I guess. This might
make the problem more or less visible.
On Wed, Oct 21, 2009 at 11:20:34PM +0200, Karol Lewandowski wrote:
> On Wed, Oct 21, 2009 at 02:06:41PM -0700, David Rientjes wrote:
> > On Wed, 21 Oct 2009, Karol Lewandowski wrote:
> >
> > > commit d6849591e042bceb66f1b4513a1df6740d2ad762
> > > Author: Karol Lewandowski <[email protected]>
> > > Date: Wed Oct 21 21:01:20 2009 +0200
> > >
> > > SLUB: Don't drop __GFP_NOFAIL completely from allocate_slab()
> > >
> > > Commit ba52270d18fb17ce2cf176b35419dab1e43fe4a3 unconditionally
> > > cleared __GFP_NOFAIL flag on all allocations.
> > >
> >
> > No, it clears __GFP_NOFAIL from the first allocation of oo_order(s->oo).
> > If that fails (and it's easy to fail, it has __GFP_NORETRY), another
> > allocation is attempted with oo_order(s->min), for which __GFP_NOFAIL
> > would be preserved if that's the slab cache's allocflags.
>
> Right, the patch is junk.
>
> However, I haven't been able to trigger failures since I switched
> to the SLAB allocator. That patch seemed related (and wrong), but it
> wasn't.
>
Interesting. Pekka, I looked for SLUB commits in the 2.6.30..2.6.31
range for patches that might affect what order of pages SLUB allocates
but didn't spot anything obvious. Can you think of any changes that
might have altered how SLUB uses memory?
> > > */
> > > - page = alloc_slab_page(flags, node, oo);
> > > + page = alloc_slab_page(flags | nofail, node, oo);
> > > if (!page)
> > > return NULL;
> > >
> > >
> >
> > This does nothing. You may have missed that the lower order allocation is
> > passing 'flags' (which is a union of the gfp flags passed to
> > allocate_slab() based on the allocation context and the cache's
> > allocflags), and not alloc_gfp where __GFP_NOFAIL is masked.
>
> Right, I missed that.
>
> > Nack.
> >
> > Note: slub isn't going to be a culprit in order 5 allocation failures
> > since they have kmalloc passthrough to the page allocator.
>
> However, it might change fragmentation somewhat, I guess. This might
> make the problem more or less visible.
>
Did you have CONFIG_KMEMCHECK set by any chance?
--
Mel Gorman
Hi Mel,
Tuesday Mel Gorman wrote:
> 4. Does the following patch help by any chance?
>
> Thanks
>
> ==== CUT HERE ====
> vmscan: Force kswapd to take notice faster when high-order watermarks are being hit
>
> When a high-order allocation fails, kswapd is kicked so that it reclaims
> at a higher order to avoid stalling direct reclaimers and to help GFP_ATOMIC
> allocations. Something has changed in recent kernels that affects the timing,
> so that high-order GFP_ATOMIC allocations are now failing more frequently,
> particularly under pressure. This patch forces kswapd to notice sooner that
> high-order allocations are occurring by checking when watermarks are hit early
> and by having kswapd restart quickly when the reclaim order is increased.
>
> Not-signed-off-by-because-this-is-a-hatchet-job: Mel Gorman <[email protected]>
> ---
It does seem to help... I have been running it from 6am to 12am on
our server now and have not yet seen any issues...
I will shout if I do...
cheers
tobi
--
Tobi Oetiker, OETIKER+PARTNER AG, Aarweg 15 CH-4600 Olten, Switzerland
On Thu, Oct 22, 2009 at 11:20:14AM +0100, Mel Gorman wrote:
> On Wed, Oct 21, 2009 at 11:20:34PM +0200, Karol Lewandowski wrote:
> > > Note: slub isn't going to be a culprit in order 5 allocation failures
> > > since they have kmalloc passthrough to the page allocator.
> >
> > However, it might change fragmentation somewhat, I guess. This might
> > make the problem more or less visible.
> >
>
> Did you have CONFIG_KMEMCHECK set by any chance?
No, kmemcheck (and kmemleak) were always disabled.
It's likely that it's possible to trigger allocation failures with SLAB;
I just haven't been successful at it. The lack of a good testcase is really
a problem here: even if I can't trigger failures, I can never be sure
that they won't appear at some strange moment.
BTW I'll test your patches (from another thread) shortly.
Thanks.
Sorry for the delayed reply.
On Monday 19 October 2009, Chris Mason wrote:
> On Mon, Oct 19, 2009 at 03:01:52PM +0100, Mel Gorman wrote:
> > > During the 2nd phase I see the first SKB allocation errors with a
> > > music skip between reading commits 95.000 and 110.000.
> > > About commit 115.000 there is a very long pause during which the
> > > counter does not increase, music stops and the desktop freezes
> > > completely. The first 30 seconds of that freeze there is only very
> > > low disk activity (which seems strange);
> >
> > I'm just going to have to depend on Jens here. Jens, the
> > congestion_wait() is on BLK_RW_ASYNC after the commit. Reclaim usually
> > writes pages asynchronously but lumpy reclaim actually waits for pages
> > to write out synchronously so it's not always async.
>
> Waiting doesn't make it synchronous from the elevator point of view ;)
> If you're using WB_SYNC_NONE, it's an async write. WB_SYNC_ALL makes it
> a sync write. I only see WB_SYNC_NONE in vmscan.c, so we should be
> using the async congestion wait. (The exception is xfs, which always
> does async writes.)
>
> But I'm honestly not 100% sure. Looking back through the emails, the
> test case is doing IO on top of a whole lot of things on top of
> dm-crypt? I just tried to figure out if dm-crypt is turning the async
> IO into sync IOs, but didn't quite make sense of it.
>
> Could you also please include which filesystems were being abused during
> the test and how? Reading through the emails, I think you've got:
>
> gitk being run 3 times on some FS (NFS?)
gitk is run on an ext3 logical volume in a volume group that's on a LUKS
encrypted partition of the local hard disk.
So it's: SATA harddisk -> dm-crypt (dmsetup) -> LVM (lvm2) -> ext3
> streaming reads on NFS
Correct. My music share is a remote (nfs4) read-only mounted ext3
partition.
> swap on dm-crypt
Correct. Swap is another logical volume in the same volume group as
mentioned above.
So kcrypt gets to encrypt and decrypt both the gitk data *and* any swapping
caused by that [1].
> If other filesystems are being used, please correct me. Also please
> include whether they are on crypto or a straight block device.
All my file systems are ext3. Nothing newfangled or exotic ;-)
There are some bind mounts involved, but I expect that's transparent.
Cheers,
FJP
[1] I've plans to move some of my data outside the encrypted volume, but
currently everything except /boot is in the encrypted VG.
On Tuesday 20 October 2009, Mel Gorman wrote:
> I've attached a patch below that should allow us to cheat. When it's
> applied, it outputs who called congestion_wait(), how long the timeout
> was and how long it waited for. By comparing before and after sleep
> times, we should be able to see which of the callers has significantly
> changed and if it's something easily addressable.
The results from this look fairly interesting (although I may be a bad
judge as I don't really know what I'm looking at ;-).
I've tested with two kernels:
1) 2.6.31.1: 1 test run
2) 2.6.31.1 + congestion_wait() reverts: 2 test runs
The 1st kernel had the expected "freeze" while reading commits in gitk;
reading commits with the 2nd kernel was more fluent.
I did 2 runs with the 2nd kernel as the first run had a fairly long music
skip and more SKB errors than expected. The second run was fairly normal
with no music skips at all even though it had a few SKB errors.
Data for the tests:
                           1st kernel   2nd kernel 1   2nd kernel 2
end reading commits        1:15         1:00           0:55
"freeze"                   yes          no             no
branch data shown          1:55         1:15           1:10
system quiet               2:25         1:50           1:45
# SKB allocation errors    10           53             5
Note that the test is substantially faster with the 2nd kernel and that the
SKB errors don't really affect the duration of the test.
Attached is a tarball with the kernel logs, both the full logs and a stripped
version with only the lines generated during the actual test.
Something like this will extract the debug data from the logs:
$ grep "delay " <file> | sed "s/^.*\] //"
Also attached is an ODF spreadsheet with a summary of the data for all 3 tests.
I've dropped the congestion_wait and sync/rw= columns as they were always
the same (sync=0 for the 1st kernel and rw=1 for the 2nd kernel).
I've added a column "weighed delay" and totals for that column and the
count column.
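The per-caller totals described above can also be produced without a spreadsheet. A minimal sketch that parses stripped debug lines in the format used in this thread ("caller congestion_wait sync=0 delay N timeout M") and accumulates a count and a total delay per caller; reading "weighed delay" as count multiplied by delay, summed per caller, is an assumption, as are all the names:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_CALLERS 32

/* Per-caller totals; total_delay is the sum of all delays (jiffies). */
struct caller_stats {
	char name[64];
	long count;
	long total_delay;
};

/*
 * Parse one line such as
 *   "kswapd congestion_wait sync=0 delay 25 timeout 25"
 * and add it to stats[]. Returns 1 if the line was counted, 0 if it
 * did not match the format or the table is full.
 */
static int add_line(struct caller_stats *stats, int *n, const char *line)
{
	char caller[64], mode[16];
	long delay, timeout;
	int i;

	if (sscanf(line, "%63s congestion_wait %15s delay %ld timeout %ld",
		   caller, mode, &delay, &timeout) != 4)
		return 0;

	for (i = 0; i < *n; i++) {
		if (strcmp(stats[i].name, caller) == 0)
			break;
	}
	if (i == *n) {
		if (*n == MAX_CALLERS)
			return 0;
		/* New caller: start a fresh row (caller fits in name[]). */
		strcpy(stats[i].name, caller);
		stats[i].count = 0;
		stats[i].total_delay = 0;
		(*n)++;
	}
	stats[i].count++;
	stats[i].total_delay += delay;
	return 1;
}
```

Feeding it the output of the grep/sed pipeline above, one line at a time, gives the count and total-delay columns directly; the mode field (sync=0 or rw=1) is parsed but not used, since it was constant within each test.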
My layman's observations are:
- without the revert 'background_writeout' is called a lot less frequently,
but when it's called it gets long delays
- without the revert you have 'wb_kupdate', which is relatively expensive
- with the revert 'shrink_list' is relatively expensive, although not
really in absolute terms
You people may want to look at exactly what happens directly around the SKB
allocation errors. I've only looked at totals.
Cheers,
FJP
Sorry for the delay in replying.
On Saturday 17 October 2009, reinette chatre wrote:
> Prompted by this thread we are in the process of moving allocation to paged
> skb. This will definitely reduce the allocation size (from order 2 to
> order 1) and hopefully help with this problem also. Could you please try
> with the attached two patches? They are based on 2.6.32-rc4.
Looks very good! With these patches I no longer get any SKB allocation
errors, even during the heaviest freezes while gitk is loading. I do still
get (long) music skips during the freezes, but that's not unexpected.
AFAICT the wireless connection is stable.
Tested on top of current mainline git: v2.6.32-rc5-81-g964fe08.
Please add, if you feel it's appropriate, my:
Reported-and-tested-by: Frans Pop <[email protected]>
Cheers,
FJP
On Mon, Oct 26, 2009 at 10:06:09PM +0100, Frans Pop wrote:
> On Tuesday 20 October 2009, Mel Gorman wrote:
> > I've attached a patch below that should allow us to cheat. When it's
> > applied, it outputs who called congestion_wait(), how long the timeout
> > was and how long it waited for. By comparing before and after sleep
> > times, we should be able to see which of the callers has significantly
> > changed and if it's something easily addressable.
>
> The results from this look fairly interesting (although I may be a bad
> judge as I don't really know what I'm looking at ;-).
>
> I've tested with two kernels:
> 1) 2.6.31.1: 1 test run
> 2) 2.6.31.1 + congestion_wait() reverts: 2 test runs
>
> The 1st kernel had the expected "freeze" while reading commits in gitk;
> reading commits with the 2nd kernel was more fluent.
> I did 2 runs with the 2nd kernel as the first run had a fairly long music
> skip and more SKB errors than expected. The second run was fairly normal
> with no music skips at all even though it had a few SKB errors.
>
> Data for the tests:
>                            1st kernel   2nd kernel 1   2nd kernel 2
> end reading commits        1:15         1:00           0:55
> "freeze"                   yes          no             no
> branch data shown          1:55         1:15           1:10
> system quiet               2:25         1:50           1:45
> # SKB allocation errors    10           53             5
>
> Note that the test is substantially faster with the 2nd kernel and that the
> SKB errors don't really affect the duration of the test.
>
Ok. I think that despite expectations, the writeback changes have
changed the timing significantly enough to be worth examining closer.
>
> - without the revert 'background_writeout' is called a lot less frequently,
> but when it's called it gets long delays
> - without the revert you have 'wb_kupdate', which is relatively expensive
> - with the revert 'shrink_list' is relatively expensive, although not
> really in absolute terms
>
Let's look at the callers that waited in congestion_wait() for at least
25 jiffies.
2.6.31.1-async-sync-congestion-wait i.e. vanilla kernel
generated with: cat kern.log_1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c
     24  background_writeout  congestion_wait sync=0 delay 25 timeout 25
    203  kswapd               congestion_wait sync=0 delay 25 timeout 25
      5  shrink_list          congestion_wait sync=0 delay 25 timeout 25
    155  try_to_free_pages    congestion_wait sync=0 delay 25 timeout 25
    145  wb_kupdate           congestion_wait sync=0 delay 25 timeout 25
      2  kswapd               congestion_wait sync=0 delay 26 timeout 25
      8  wb_kupdate           congestion_wait sync=0 delay 26 timeout 25
      1  try_to_free_pages    congestion_wait sync=0 delay 54 timeout 25
2.6.31.1-write-congestion-wait i.e. kernel with patch reverted
generated with: cat kern.log_2.1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c
      2  background_writeout  congestion_wait rw=1 delay 25 timeout 25
    188  kswapd               congestion_wait rw=1 delay 25 timeout 25
     14  shrink_list          congestion_wait rw=1 delay 25 timeout 25
    181  try_to_free_pages    congestion_wait rw=1 delay 25 timeout 25
      5  kswapd               congestion_wait rw=1 delay 26 timeout 25
     10  try_to_free_pages    congestion_wait rw=1 delay 26 timeout 25
      3  try_to_free_pages    congestion_wait rw=1 delay 27 timeout 25
      1  kswapd               congestion_wait rw=1 delay 29 timeout 25
      1  __alloc_pages_nodemask congestion_wait rw=1 delay 30 timeout 5
      1  try_to_free_pages    congestion_wait rw=1 delay 31 timeout 25
      1  try_to_free_pages    congestion_wait rw=1 delay 35 timeout 25
      1  kswapd               congestion_wait rw=1 delay 51 timeout 25
      1  try_to_free_pages    congestion_wait rw=1 delay 56 timeout 25
So, wb_kupdate and background_writeout are the big movers in terms of waiting,
not the direct reclaimers as we had expected. Of those big
movers, wb_kupdate is the most interesting; compare the following:
$ cat kern.log_2.1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c | grep wb_kup
[ no output ]
$ cat kern.log_1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c | grep wb_kup
      1  wb_kupdate           congestion_wait sync=0 delay 15 timeout 25
      1  wb_kupdate           congestion_wait sync=0 delay 23 timeout 25
    145  wb_kupdate           congestion_wait sync=0 delay 25 timeout 25
      8  wb_kupdate           congestion_wait sync=0 delay 26 timeout 25
The vanilla kernel is not waiting in wb_kupdate at all.
Jens, before the congestion_wait() changes, wb_kupdate was waiting on
congestion and afterwards it's not. Furthermore, look at the number of pages
that are queued for writeback in the two page allocation failure reports.
without-revert: writeback:65653
with-revert:    writeback:21713
So, after the move to async/sync, a lot more pages are getting queued
for writeback - more than three times the number of pages are queued for
writeback with the vanilla kernel. This amount of congestion might be why
direct reclaimers and kswapd's timings have changed so much.
Chris Mason hinted at this, but I didn't quite "get it" at the time. Is it
possible that writeback_inodes() is converting what is expected to be async
IO into sync IO? One way of checking this is for Frans to test the patch
below, which makes wb_kupdate wait on sync instead of async.
If this makes a difference, I think the three main areas of trouble we
are now seeing are
1. page allocator regressions - mostly fixed hopefully
2. page writeback change in timing - theory yet to be confirmed
3. drivers using more atomics - iwlagn specific, being dealt with
Of course, the big problem is if the changes are due to major timing
differences in page writeback, then mainline is a totally different
shape of problem as pdflush has been replaced there.
====
Have wb_kupdate wait on sync IO congestion instead of async
wb_kupdate is expected to only have queued up pages for async IO.
However, something screwy is happening because it never appears to go to
sleep. Frans, can you test with this patch instead of the revert please?
Preferably, keep the verbose-congestion_wait patch applied so we can
still get an idea who is going to sleep and for how long when calling
congestion_wait. Thanks.
Not-signed-off-hacket-job: Mel Gorman <[email protected]>
---
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 81627eb..cb646dd 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -787,7 +787,7 @@ static void wb_kupdate(unsigned long arg)
                 writeback_inodes(&wbc);
                 if (wbc.nr_to_write > 0) {
                         if (wbc.encountered_congestion || wbc.more_io)
-                                congestion_wait(BLK_RW_ASYNC, HZ/10);
+                                congestion_wait(BLK_RW_SYNC, HZ/10);
                         else
                                 break;  /* All the old data is written */
                 }
2009/10/27 Mel Gorman <[email protected]>:
> On Mon, Oct 26, 2009 at 10:06:09PM +0100, Frans Pop wrote:
>> On Tuesday 20 October 2009, Mel Gorman wrote:
>> > I've attached a patch below that should allow us to cheat. When it's
>> > applied, it outputs who called congestion_wait(), how long the timeout
>> > was and how long it waited for. By comparing before and after sleep
>> > times, we should be able to see which of the callers has significantly
>> > changed and if it's something easily addressable.
>>
>> The results from this look fairly interesting (although I may be a bad
>> judge as I don't really know what I'm looking at ;-).
>>
>> I've tested with two kernels:
>> 1) 2.6.31.1: 1 test run
>> 2) 2.6.31.1 + congestion_wait() reverts: 2 test runs
>>
>> The 1st kernel had the expected "freeze" while reading commits in gitk;
>> reading commits with the 2nd kernel was more fluent.
>> I did 2 runs with the 2nd kernel as the first run had a fairly long music
>> skip and more SKB errors than expected. The second run was fairly normal
>> with no music skips at all even though it had a few SKB errors.
>>
>> Data for the tests:
>>                               1st kernel      2nd kernel 1    2nd kernel 2
>> end reading commits           1:15            1:00            0:55
>>   "freeze"                    yes             no              no
>> branch data shown             1:55            1:15            1:10
>> system quiet                  2:25            1:50            1:45
>> # SKB allocation errors       10              53              5
>>
>> Note that the test is substantially faster with the 2nd kernel and that the
>> SKB errors don't really affect the duration of the test.
>>
>
> Ok. I think that despite expectations, the writeback changes have
> changed the timing significantly enough to be worth examining closer.
>
>>
>> - without the revert 'background_writeout' is called a lot less frequently,
>>   but when it's called it gets long delays
>> - without the revert you have 'wb_kupdate', which is relatively expensive
>> - with the revert 'shrink_list' is relatively expensive, although not
>>   really in absolute terms
>>
>
> Let's look at the callers that waited in congestion_wait() for at least
> 25 jiffies.
>
> 2.6.31.1-async-sync-congestion-wait i.e. vanilla kernel
> generated with: cat kern.log_1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c
>      24  background_writeout  congestion_wait sync=0 delay 25 timeout 25
>     203  kswapd               congestion_wait sync=0 delay 25 timeout 25
>       5  shrink_list          congestion_wait sync=0 delay 25 timeout 25
>     155  try_to_free_pages    congestion_wait sync=0 delay 25 timeout 25
>     145  wb_kupdate           congestion_wait sync=0 delay 25 timeout 25
>       2  kswapd               congestion_wait sync=0 delay 26 timeout 25
>       8  wb_kupdate           congestion_wait sync=0 delay 26 timeout 25
>       1  try_to_free_pages    congestion_wait sync=0 delay 54 timeout 25
>
> 2.6.31.1-write-congestion-wait i.e. kernel with patch reverted
> generated with: cat kern.log_2.1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c
>       2  background_writeout  congestion_wait rw=1 delay 25 timeout 25
>     188  kswapd               congestion_wait rw=1 delay 25 timeout 25
>      14  shrink_list          congestion_wait rw=1 delay 25 timeout 25
>     181  try_to_free_pages    congestion_wait rw=1 delay 25 timeout 25
>       5  kswapd               congestion_wait rw=1 delay 26 timeout 25
>      10  try_to_free_pages    congestion_wait rw=1 delay 26 timeout 25
>       3  try_to_free_pages    congestion_wait rw=1 delay 27 timeout 25
>       1  kswapd               congestion_wait rw=1 delay 29 timeout 25
>       1  __alloc_pages_nodemask congestion_wait rw=1 delay 30 timeout 5
>       1  try_to_free_pages    congestion_wait rw=1 delay 31 timeout 25
>       1  try_to_free_pages    congestion_wait rw=1 delay 35 timeout 25
>       1  kswapd               congestion_wait rw=1 delay 51 timeout 25
>       1  try_to_free_pages    congestion_wait rw=1 delay 56 timeout 25
>
> So, wb_kupdate and background_writeout are the big movers in terms of waiting,
> not the direct reclaimers as we had expected. Of those big
> movers, wb_kupdate is the most interesting; compare the following:
>
> $ cat kern.log_2.1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c | grep wb_kup
> [ no output ]
> $ cat kern.log_1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c | grep wb_kup
>       1  wb_kupdate           congestion_wait sync=0 delay 15 timeout 25
>       1  wb_kupdate           congestion_wait sync=0 delay 23 timeout 25
>     145  wb_kupdate           congestion_wait sync=0 delay 25 timeout 25
>       8  wb_kupdate           congestion_wait sync=0 delay 26 timeout 25
>
> The vanilla kernel is not waiting in wb_kupdate at all.
>
> Jens, before the congestion_wait() changes, wb_kupdate was waiting on
> congestion and afterwards it's not. Furthermore, look at the number of pages
> that are queued for writeback in the two page allocation failure reports.
>
> without-revert: writeback:65653
> with-revert:    writeback:21713
>
> So, after the move to async/sync, a lot more pages are getting queued
> for writeback - more than three times the number of pages are queued for
> writeback with the vanilla kernel. This amount of congestion might be why
> direct reclaimers and kswapd's timings have changed so much.
>
> Chris Mason hinted at this, but I didn't quite "get it" at the time. Is it
> possible that writeback_inodes() is converting what is expected to be async
> IO into sync IO? One way of checking this is for Frans to test the patch
> below, which makes wb_kupdate wait on sync instead of async.
>
> If this makes a difference, I think the three main areas of trouble we
> are now seeing are
>
>         1. page allocator regressions - mostly fixed hopefully
>         2. page writeback change in timing - theory yet to be confirmed
>         3. drivers using more atomics - iwlagn specific, being dealt with
>
> Of course, the big problem is if the changes are due to major timing
> differences in page writeback, then mainline is a totally different
> shape of problem as pdflush has been replaced there.
>
> ====
> Have wb_kupdate wait on sync IO congestion instead of async
>
> wb_kupdate is expected to only have queued up pages for async IO.
> However, something screwy is happening because it never appears to go to
> sleep. Frans, can you test with this patch instead of the revert please?
> Preferably, keep the verbose-congestion_wait patch applied so we can
> still get an idea who is going to sleep and for how long when calling
> congestion_wait. Thanks.
>
> Not-signed-off-hacket-job: Mel Gorman <[email protected]>
> ---
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 81627eb..cb646dd 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -787,7 +787,7 @@ static void wb_kupdate(unsigned long arg)
>                  writeback_inodes(&wbc);
>                  if (wbc.nr_to_write > 0) {
>                          if (wbc.encountered_congestion || wbc.more_io)
> -                                congestion_wait(BLK_RW_ASYNC, HZ/10);
> +                                congestion_wait(BLK_RW_SYNC, HZ/10);
>                          else
>                                  break;  /* All the old data is written */
>                  }
Hmm, this doesn't look correct to me.
BLK_RW_ASYNC means async write.
BLK_RW_SYNC means read and sync write.
wb_kupdate uses WB_SYNC_NONE, so it's an async write.
On Wed, Oct 28, 2009 at 12:16:30AM +0900, KOSAKI Motohiro wrote:
> 2009/10/27 Mel Gorman <[email protected]>:
> > On Mon, Oct 26, 2009 at 10:06:09PM +0100, Frans Pop wrote:
> >> On Tuesday 20 October 2009, Mel Gorman wrote:
> >> > I've attached a patch below that should allow us to cheat. When it's
> >> > applied, it outputs who called congestion_wait(), how long the timeout
> >> > was and how long it waited for. By comparing before and after sleep
> >> > times, we should be able to see which of the callers has significantly
> >> > changed and if it's something easily addressable.
> >>
> >> The results from this look fairly interesting (although I may be a bad
> >> judge as I don't really know what I'm looking at ;-).
> >>
> >> I've tested with two kernels:
> >> 1) 2.6.31.1: 1 test run
> >> 2) 2.6.31.1 + congestion_wait() reverts: 2 test runs
> >>
> >> The 1st kernel had the expected "freeze" while reading commits in gitk;
> >> reading commits with the 2nd kernel was more fluent.
> >> I did 2 runs with the 2nd kernel as the first run had a fairly long music
> >> skip and more SKB errors than expected. The second run was fairly normal
> >> with no music skips at all even though it had a few SKB errors.
> >>
> >> Data for the tests:
> >>                               1st kernel      2nd kernel 1    2nd kernel 2
> >> end reading commits           1:15            1:00            0:55
> >>   "freeze"                    yes             no              no
> >> branch data shown             1:55            1:15            1:10
> >> system quiet                  2:25            1:50            1:45
> >> # SKB allocation errors       10              53              5
> >>
> >> Note that the test is substantially faster with the 2nd kernel and that the
> >> SKB errors don't really affect the duration of the test.
> >>
> >
> > Ok. I think that despite expectations, the writeback changes have
> > changed the timing significantly enough to be worth examining closer.
> >
> >>
> >> - without the revert 'background_writeout' is called a lot less frequently,
> >>   but when it's called it gets long delays
> >> - without the revert you have 'wb_kupdate', which is relatively expensive
> >> - with the revert 'shrink_list' is relatively expensive, although not
> >>   really in absolute terms
> >>
> >
> > Let's look at the callers that waited in congestion_wait() for at least
> > 25 jiffies.
> >
> > 2.6.31.1-async-sync-congestion-wait i.e. vanilla kernel
> > generated with: cat kern.log_1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c
> >      24  background_writeout  congestion_wait sync=0 delay 25 timeout 25
> >     203  kswapd               congestion_wait sync=0 delay 25 timeout 25
> >       5  shrink_list          congestion_wait sync=0 delay 25 timeout 25
> >     155  try_to_free_pages    congestion_wait sync=0 delay 25 timeout 25
> >     145  wb_kupdate           congestion_wait sync=0 delay 25 timeout 25
> >       2  kswapd               congestion_wait sync=0 delay 26 timeout 25
> >       8  wb_kupdate           congestion_wait sync=0 delay 26 timeout 25
> >       1  try_to_free_pages    congestion_wait sync=0 delay 54 timeout 25
> >
> > 2.6.31.1-write-congestion-wait i.e. kernel with patch reverted
> > generated with: cat kern.log_2.1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c
> >       2  background_writeout  congestion_wait rw=1 delay 25 timeout 25
> >     188  kswapd               congestion_wait rw=1 delay 25 timeout 25
> >      14  shrink_list          congestion_wait rw=1 delay 25 timeout 25
> >     181  try_to_free_pages    congestion_wait rw=1 delay 25 timeout 25
> >       5  kswapd               congestion_wait rw=1 delay 26 timeout 25
> >      10  try_to_free_pages    congestion_wait rw=1 delay 26 timeout 25
> >       3  try_to_free_pages    congestion_wait rw=1 delay 27 timeout 25
> >       1  kswapd               congestion_wait rw=1 delay 29 timeout 25
> >       1  __alloc_pages_nodemask congestion_wait rw=1 delay 30 timeout 5
> >       1  try_to_free_pages    congestion_wait rw=1 delay 31 timeout 25
> >       1  try_to_free_pages    congestion_wait rw=1 delay 35 timeout 25
> >       1  kswapd               congestion_wait rw=1 delay 51 timeout 25
> >       1  try_to_free_pages    congestion_wait rw=1 delay 56 timeout 25
> >
> > So, wb_kupdate and background_writeout are the big movers in terms of waiting,
> > not the direct reclaimers as we had expected. Of those big
> > movers, wb_kupdate is the most interesting; compare the following:
> >
> > $ cat kern.log_2.1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c | grep wb_kup
> > [ no output ]
> > $ cat kern.log_1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c | grep wb_kup
> >       1  wb_kupdate           congestion_wait sync=0 delay 15 timeout 25
> >       1  wb_kupdate           congestion_wait sync=0 delay 23 timeout 25
> >     145  wb_kupdate           congestion_wait sync=0 delay 25 timeout 25
> >       8  wb_kupdate           congestion_wait sync=0 delay 26 timeout 25
> >
> > The vanilla kernel is not waiting in wb_kupdate at all.
> >
> > Jens, before the congestion_wait() changes, wb_kupdate was waiting on
> > congestion and afterwards it's not. Furthermore, look at the number of pages
> > that are queued for writeback in the two page allocation failure reports.
> >
> > without-revert: writeback:65653
> > with-revert:    writeback:21713
> >
> > So, after the move to async/sync, a lot more pages are getting queued
> > for writeback - more than three times the number of pages are queued for
> > writeback with the vanilla kernel. This amount of congestion might be why
> > direct reclaimers and kswapd's timings have changed so much.
> >
> > Chris Mason hinted at this, but I didn't quite "get it" at the time. Is it
> > possible that writeback_inodes() is converting what is expected to be async
> > IO into sync IO? One way of checking this is for Frans to test the patch
> > below, which makes wb_kupdate wait on sync instead of async.
> >
> > If this makes a difference, I think the three main areas of trouble we
> > are now seeing are
> >
> >         1. page allocator regressions - mostly fixed hopefully
> >         2. page writeback change in timing - theory yet to be confirmed
> >         3. drivers using more atomics - iwlagn specific, being dealt with
> >
> > Of course, the big problem is if the changes are due to major timing
> > differences in page writeback, then mainline is a totally different
> > shape of problem as pdflush has been replaced there.
> >
> > ====
> > Have wb_kupdate wait on sync IO congestion instead of async
> >
> > wb_kupdate is expected to only have queued up pages for async IO.
> > However, something screwy is happening because it never appears to go to
> > sleep. Frans, can you test with this patch instead of the revert please?
> > Preferably, keep the verbose-congestion_wait patch applied so we can
> > still get an idea who is going to sleep and for how long when calling
> > congestion_wait. Thanks.
> >
> > Not-signed-off-hacket-job: Mel Gorman <[email protected]>
> > ---
> >
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 81627eb..cb646dd 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -787,7 +787,7 @@ static void wb_kupdate(unsigned long arg)
> >                  writeback_inodes(&wbc);
> >                  if (wbc.nr_to_write > 0) {
> >                          if (wbc.encountered_congestion || wbc.more_io)
> > -                                congestion_wait(BLK_RW_ASYNC, HZ/10);
> > +                                congestion_wait(BLK_RW_SYNC, HZ/10);
> >                          else
> >                                  break;  /* All the old data is written */
> >                  }
>
> Hmm, this doesn't look correct to me.
>
> BLK_RW_ASYNC means async write.
> BLK_RW_SYNC means read and sync write.
>
> wb_kupdate uses WB_SYNC_NONE, so it's an async write.
>
I don't think it's correct either which is why I described it as
"something screwy is happening because it never appears to go to sleep".
This is despite there being a whole lot of pages queued for writeback
according to the page allocation failure reports.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
On Tue, Oct 27, 2009 at 02:54:35PM +0000, Mel Gorman wrote:
> On Mon, Oct 26, 2009 at 10:06:09PM +0100, Frans Pop wrote:
> > On Tuesday 20 October 2009, Mel Gorman wrote:
> > > I've attached a patch below that should allow us to cheat. When it's
> > > applied, it outputs who called congestion_wait(), how long the timeout
> > > was and how long it waited for. By comparing before and after sleep
> > > times, we should be able to see which of the callers has significantly
> > > changed and if it's something easily addressable.
> >
> > The results from this look fairly interesting (although I may be a bad
> > judge as I don't really know what I'm looking at ;-).
> >
> > I've tested with two kernels:
> > 1) 2.6.31.1: 1 test run
> > 2) 2.6.31.1 + congestion_wait() reverts: 2 test runs
> >
> > The 1st kernel had the expected "freeze" while reading commits in gitk;
> > reading commits with the 2nd kernel was more fluent.
> > I did 2 runs with the 2nd kernel as the first run had a fairly long music
> > skip and more SKB errors than expected. The second run was fairly normal
> > with no music skips at all even though it had a few SKB errors.
> >
> > Data for the tests:
> >                               1st kernel      2nd kernel 1    2nd kernel 2
> > end reading commits           1:15            1:00            0:55
> >   "freeze"                    yes             no              no
> > branch data shown             1:55            1:15            1:10
> > system quiet                  2:25            1:50            1:45
> > # SKB allocation errors       10              53              5
> >
> > Note that the test is substantially faster with the 2nd kernel and that the
> > SKB errors don't really affect the duration of the test.
> >
>
> Ok. I think that despite expectations, the writeback changes have
> changed the timing significantly enough to be worth examining closer.
>
> >
> > - without the revert 'background_writeout' is called a lot less frequently,
> > but when it's called it gets long delays
> > - without the revert you have 'wb_kupdate', which is relatively expensive
> > - with the revert 'shrink_list' is relatively expensive, although not
> > really in absolute terms
> >
>
> Let's look at the callers that waited in congestion_wait() for at least
> 25 jiffies.
>
> 2.6.31.1-async-sync-congestion-wait i.e. vanilla kernel
> generated with: cat kern.log_1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c
>      24  background_writeout  congestion_wait sync=0 delay 25 timeout 25
>     203  kswapd               congestion_wait sync=0 delay 25 timeout 25
>       5  shrink_list          congestion_wait sync=0 delay 25 timeout 25
>     155  try_to_free_pages    congestion_wait sync=0 delay 25 timeout 25
>     145  wb_kupdate           congestion_wait sync=0 delay 25 timeout 25
>       2  kswapd               congestion_wait sync=0 delay 26 timeout 25
>       8  wb_kupdate           congestion_wait sync=0 delay 26 timeout 25
>       1  try_to_free_pages    congestion_wait sync=0 delay 54 timeout 25
>
> 2.6.31.1-write-congestion-wait i.e. kernel with patch reverted
> generated with: cat kern.log_2.1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c
>       2  background_writeout  congestion_wait rw=1 delay 25 timeout 25
>     188  kswapd               congestion_wait rw=1 delay 25 timeout 25
>      14  shrink_list          congestion_wait rw=1 delay 25 timeout 25
>     181  try_to_free_pages    congestion_wait rw=1 delay 25 timeout 25
>       5  kswapd               congestion_wait rw=1 delay 26 timeout 25
>      10  try_to_free_pages    congestion_wait rw=1 delay 26 timeout 25
>       3  try_to_free_pages    congestion_wait rw=1 delay 27 timeout 25
>       1  kswapd               congestion_wait rw=1 delay 29 timeout 25
>       1  __alloc_pages_nodemask congestion_wait rw=1 delay 30 timeout 5
>       1  try_to_free_pages    congestion_wait rw=1 delay 31 timeout 25
>       1  try_to_free_pages    congestion_wait rw=1 delay 35 timeout 25
>       1  kswapd               congestion_wait rw=1 delay 51 timeout 25
>       1  try_to_free_pages    congestion_wait rw=1 delay 56 timeout 25
>
> So, wb_kupdate and background_writeout are the big movers in terms of waiting,
> not the direct reclaimers as we had expected. Of those big
> movers, wb_kupdate is the most interesting; compare the following:
>
Bah, this part is right, but I got the next section the wrong way
around. I should have renamed the damn things instead of trying to remember
what was 1 and what was 2.
1 == vanilla
2 == with-revert
> $ cat kern.log_2.1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c | grep wb_kup
> [ no output ]
> $ cat kern.log_1_test | awk -F ] '{print $2}' | sort -k 5 -n | uniq -c | grep wb_kup
>       1  wb_kupdate           congestion_wait sync=0 delay 15 timeout 25
>       1  wb_kupdate           congestion_wait sync=0 delay 23 timeout 25
>     145  wb_kupdate           congestion_wait sync=0 delay 25 timeout 25
>       8  wb_kupdate           congestion_wait sync=0 delay 26 timeout 25
>
> The vanilla kernel is not waiting in wb_kupdate at all.
>
The vanilla kernel *is* waiting. The reverted kernel is not. If my patch
makes any difference, it's not for the right reasons.
> Jens, before the congestion_wait() changes, wb_kupdate was waiting on
> congestion and afterwards it's not. Furthermore, look at the number of pages
> that are queued for writeback in the two page allocation failure reports.
>
> without-revert: writeback:65653
> with-revert:    writeback:21713
>
And here I got it the right way around again:
kernel 1 == vanilla kernel == without-revert writeback:65653
kernel 2 == revert kernel == with-revert writeback:21713
> So, after the move to async/sync, a lot more pages are getting queued
> for writeback - more than three times the number of pages are queued for
> writeback with the vanilla kernel. This amount of congestion might be why
> direct reclaimers and kswapd's timings have changed so much.
>
Or more accurately, the vanilla kernel has queued up a lot more pages for
IO than when the patch is reverted. I'm not seeing yet why this is.
> Chris Mason hinted at this, but I didn't quite "get it" at the time. Is it
> possible that writeback_inodes() is converting what is expected to be async
> IO into sync IO? One way of checking this is for Frans to test the patch
> below, which makes wb_kupdate wait on sync instead of async.
>
This reasoning is rubbish. If the patch makes any difference, it's because
it changes timing. It's probably more important to figure out a) whether the
different number of pages queued for writeback is relevant and, if so, b) why
it has changed.
--
Mel Gorman
On Tue, Oct 27, 2009 at 03:52:24PM +0000, Mel Gorman wrote:
>
> > So, after the move to async/sync, a lot more pages are getting queued
> > for writeback - more than three times the number of pages are queued for
> > writeback with the vanilla kernel. This amount of congestion might be why
> > direct reclaimers and kswapd's timings have changed so much.
> >
>
> Or more accurately, the vanilla kernel has queued up a lot more pages for
> IO than when the patch is reverted. I'm not seeing yet why this is.
[ sympathies over confusion about congestion...lots of variables here ]
If wb_kupdate has been able to queue more writes it is because the
congestion logic isn't stopping it. We have congestion_wait(), but
before calling it, the writeback paths ask "are you congested?" and
back off if the answer is yes.
Ideally, direct reclaim will never do writeback. We want it to be able
to find clean pages that kupdate and friends have already processed.
Waiting for congestion is a funny thing: it only tells us the device has
managed to finish some IO or that a timeout has passed. Neither event has
any relation to whether the IO for the reclaimable pages has
finished.
One option is to have the VM remember the hashed waitqueue for one of
the pages it direct reclaims and then wait on it.
-chris
Hi Frans,
On Tue, 2009-10-27 at 04:10 -0700, Frans Pop wrote:
> Sorry for the delay in replying.
>
> On Saturday 17 October 2009, reinette chatre wrote:
> > Prompted by this thread we are in process of moving allocation to paged
> > skb. This will definitely reduce the allocation size (from order 2 to
> > order 1) and hopefully help with this problem also. Could you please try
> > with the attached two patches? They are based on 2.6.32-rc4.
>
> Looks very good! With these patches I no longer get any SKB allocation
> errors, even during the heaviest freezes while gitk is loading. I do still
> get (long) music skips during the freezes, but that's not unexpected.
> AFAICT the wireless connection is stable.
>
> Tested on top of current mainline git: v2.6.32-rc5-81-g964fe08.
>
> Please add, if you feel it's appropriate, my:
> Reported-and-tested-by: Frans Pop <[email protected]>
Thank you very much for testing these patches so thoroughly. They are
both on their way upstream already, so I am not able to add your
signature at this time. Since these are pretty big changes, these patches
will be in 2.6.33.
Reinette
On Tuesday 27 October 2009, Chris Mason wrote:
> On Tue, Oct 27, 2009 at 03:52:24PM +0000, Mel Gorman wrote:
> > > So, after the move to async/sync, a lot more pages are getting
> > > queued for writeback - more than three times the number of pages are
> > > queued for writeback with the vanilla kernel. This amount of
> > > congestion might be why direct reclaimers and kswapd's timings have
> > > changed so much.
> >
> > Or more accurately, the vanilla kernel has queued up a lot more pages
> > for IO than when the patch is reverted. I'm not seeing yet why this
> > is.
>
> [ sympathies over confusion about congestion...lots of variables here ]
>
> If wb_kupdate has been able to queue more writes it is because the
> congestion logic isn't stopping it. We have congestion_wait(), but
> before calling it, the writeback paths ask "are you congested?"
> and back off if the answer is yes.
>
> Ideally, direct reclaim will never do writeback. We want it to be able
> to find clean pages that kupdate and friends have already processed.
>
> Waiting for congestion is a funny thing: it only tells us the device has
> managed to finish some IO or that a timeout has passed. Neither event
> has any relation to whether the IO for the reclaimable pages has
> finished.
>
> One option is to have the VM remember the hashed waitqueue for one of
> the pages it direct reclaims and then wait on it.
What people should be aware of is the behavior of the system I see at this
point. I've already mentioned this in other mails, but it's probably good
to repeat it here.
While gitk is reading commits with vanilla .31 and .32 kernels there is at
some point a fairly long period (10-20 seconds) where I see:
- a completely frozen desktop, including frozen mouse cursor
- really very little disk activity (HD led flashes very briefly less than
once per second)
- reading commits stops completely during this period
- no music.
After that there is a period (another 5-15 seconds) with a huge amount of
disk activity during which the system gradually becomes responsive again
and in gitk the count of commits that have been read starts increasing
again (without a jump in the counter which confirms that no commits were
read during the freeze).
I cannot really tell what the system is doing during those freezes. Because
of the frozen desktop I cannot for example see CPU usage. I suspect that,
as there is hardly any disk activity, the system must be reorganizing RAM
or something. But it seems quite bad that that gets "bunched up" instead
of happening more gradually.
With the congestion_wait() change reverted I never see these freezes, only
much more normal minor latencies (< 2 seconds; mostly < 0.5 seconds),
which is probably unavoidable during heavy swapping.
Hth,
FJP
On Monday 26 October 2009, Frans Pop wrote:
> On Tuesday 20 October 2009, Mel Gorman wrote:
> > I've attached a patch below that should allow us to cheat. When it's
> > applied, it outputs who called congestion_wait(), how long the timeout
> > was and how long it waited for. By comparing before and after sleep
> > times, we should be able to see which of the callers has significantly
> > changed and if it's something easily addressable.
>
> The results from this look fairly interesting (although I may be a bad
> judge as I don't really know what I'm looking at ;-).
>
> I've tested with two kernels:
> 1) 2.6.31.1: 1 test run
> 2) 2.6.31.1 + congestion_wait() reverts: 2 test runs
I've taken another look at the data from this debug patch, resulting in
these graphs: http://people.debian.org/~fjp/tmp/kernel/congestion.pdf
I think the graph may show the reason for the congestion_wait() regression.
Horizontal axis shows time; vertical axis shows the number of logged
congestion_wait() calls per type.
The top chart is without the revert, the bottom one after the revert.
Note how before the revert the graph shows distinct steps: first you get
almost exclusively kswapd, followed by almost exclusively alloc_pages and
try_to_free. I suspect the periods where the kswapd line is almost
horizontal correspond to the freezes.
With the revert the lines for the different functions are almost straight
and the calls are much better interspersed.
Cheers,
FJP
On Thursday 05 November 2009, Frans Pop wrote:
> I've taken another look at the data from this debug patch, resulting in
> these graphs: http://people.debian.org/~fjp/tmp/kernel/congestion.pdf
>
> I think the graph may show the reason for the congestion_wait()
> regression. Horizontal axis shows time, vertical axis shows number of
> logged congestion_wait calls per type.
I'm sorry. My initial version had a skewed time axis (it showed occurrences
instead of actual time). I've now uploaded a corrected version:
http://people.debian.org/~fjp/tmp/kernel/congestion.pdf
I've also uploaded a second version that shows cumulative delay per type,
which probably gives a better insight:
http://people.debian.org/~fjp/tmp/kernel/congestion2.pdf
For both, the top chart is without the revert and the bottom one is after
the revert.
On Fri, Nov 06, 2009 at 10:51:37AM +0100, Frans Pop wrote:
> On Thursday 05 November 2009, Frans Pop wrote:
> > I've taken another look at the data from this debug patch, resulting in
> > these graphs: http://people.debian.org/~fjp/tmp/kernel/congestion.pdf
> >
> > I think the graph may show the reason for the congestion_wait()
> > regression. Horizontal axis shows time, vertical axis shows number of
> > logged congestion_wait calls per type.
>
> I'm sorry. My initial version had a skewed time axis (showed occurrences
> instead of actual time). I've now uploaded a corrected version:
> http://people.debian.org/~fjp/tmp/kernel/congestion.pdf
>
> I've also uploaded a second version that shows cumulative delay per type,
> which probably gives a better insight:
> http://people.debian.org/~fjp/tmp/kernel/congestion2.pdf
>
> For both the top chart is without the revert, the bottom one after the
> revert.
>
I'm looking into this at the moment. There are some definite
differences, not only in how long congestion_wait() waits but also in
what the callers are doing. I've more or less reproduced your results
locally and am slowly plodding through each caller to see what has
changed significantly. No patches yet, though.
--
Mel Gorman
Part-time PhD Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab