2006-08-06 10:08:16

by Andrew Morton

Subject: 2.6.18-rc3-mm2


ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/

- 2.6.18-rc3-mm1 gets mysterious udev timeouts during boot and crashes in
NFS. This kernel reverts the patches which were causing that.



Changes since 2.6.18-rc3-mm1:


+revert-x86_64-mm-i386-remove-lock-section.patch

Revert patch which causes udev timeouts.

-knfsd-make-rpc-threads-pools-numa-aware-fix.patch

Folded into knfsd-make-rpc-threads-pools-numa-aware.patch

+revert-knfsd-make-rpc-threads-pools-numa-aware.patch

Revert patch which causes nfs crashes.



All 1136 patches:

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/patch-list



2006-08-06 11:09:27

by Michal Piotrowski

Subject: Re: 2.6.18-rc3-mm2

Hi,

On 06/08/06, Andrew Morton <[email protected]> wrote:
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>

I get this error during the build.

kernel/built-in.o: In function `bacct_add_tsk':
/usr/src/linux-mm/kernel/tsacct.c:39: undefined reference to `__divdi3'
make[1]: *** [.tmp_vmlinux1] Error 1
make: *** [_all] Error 2

I'll try with CONFIG_TASKSTATS disabled.

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/wiki/)

2006-08-06 13:33:56

by Mattia Dongili

Subject: Re: 2.6.18-rc3-mm2

On Sun, Aug 06, 2006 at 03:08:09AM -0700, Andrew Morton wrote:
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/

There's something more: I got a load of the following while playing with
UML. Full dmesg and config are at:
http://oioio.altervista.org/linux/config-2.6.18-rc3-mm2-1
http://oioio.altervista.org/linux/dmesg-2.6.18-rc3-mm2-1

[ 781.988000] ------------[ cut here ]------------
[ 781.988000] kernel BUG at mm/vmscan.c:383!
[ 781.988000] invalid opcode: 0000 [#1]
[ 781.988000] 4K_STACKS PREEMPT
[ 781.988000] last sysfs file: /devices/system/cpu/cpu0/cpufreq/ondemand/ignore_nice_load
[ 781.988000] Modules linked in: ipv6 nfsd exportfs lockd sunrpc ipt_MASQUERADE iptable_nat ip_nat xt_tcpudp xt_state ip_conntrack iptable_filter ip_tables x_tables jfs aes dm_crypt dm_mod rtc sony_acpi tun psmouse sonypi speedstep_ich speedstep_lib freq_table cpufreq_conservative cpufreq_ondemand cpufreq_powersave sd_mod usb_storage scsi_mod usbhid pcmcia snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_timer intel_agp agpgart i2c_i801 uhci_hcd usbcore evdev e100 mii yenta_socket rsrc_nonstatic pcmcia_core snd soundcore snd_page_alloc pcspkr
[ 781.988000] CPU: 0
[ 781.988000] EIP: 0060:[<c014c4d8>] Not tainted VLI
[ 781.988000] EFLAGS: 00210203 (2.6.18-rc3-mm2-1 #1)
[ 781.988000] EIP is at remove_mapping+0xe8/0x120
[ 781.988000] eax: c0374120 ebx: c11e2a80 ecx: c0374120 edx: 000000d0
[ 781.988000] esi: c0374120 edi: cfea0f78 ebp: cfea0e04 esp: cfea0df8
[ 781.988000] ds: 007b es: 007b ss: 0068
[ 781.988000] Process kswapd0 (pid: 134, ti=cfea0000 task=cfe9e030 task.ti=cfea0000)
[ 781.988000] Stack: c11e2a80 c11e2a80 c0374120 cfea0f14 c014cbab c0374120 c11e2a80 cfea0f78
[ 781.988000] c0373d60 c0373e2c 00000020 00000020 00000000 00000020 00000000 00000000
[ 781.988000] c0374120 00000001 00000000 c101a860 c0373c20 00000000 00000001 c0463168
[ 781.988000] Call Trace:
[ 781.988000] [<c014cbab>] shrink_inactive_list+0x69b/0x920
[ 781.988000] [<c014cec2>] shrink_zone+0x92/0xe0
[ 781.988000] [<c014d1f1>] kswapd+0x2e1/0x430
[ 781.988000] [<c012ee26>] kthread+0xe6/0xf0
[ 781.988000] [<c0101005>] kernel_thread_helper+0x5/0x10
[ 781.988000] DWARF2 unwinder stuck at kernel_thread_helper+0x5/0x10
[ 781.988000] Leftover inexact backtrace:
[ 781.988000] [<c0103a06>] show_stack_log_lvl+0xb6/0x100
[ 781.988000] [<c0103c2f>] show_registers+0x1df/0x290
[ 781.988000] [<c01041aa>] die+0x13a/0x310
[ 781.988000] [<c01047dd>] do_trap+0x9d/0x100
[ 781.988000] [<c0104c41>] do_invalid_op+0xa1/0xb0
[ 781.988000] [<c031a4a9>] error_code+0x39/0x40
[ 781.988000] [<c014cbab>] shrink_inactive_list+0x69b/0x920
[ 781.988000] [<c014cec2>] shrink_zone+0x92/0xe0
[ 781.988000] [<c014d1f1>] kswapd+0x2e1/0x430
[ 781.988000] [<c012ee26>] kthread+0xe6/0xf0
[ 781.988000] [<c0101005>] kernel_thread_helper+0x5/0x10
[ 781.988000] Code: 89 e0 25 00 f0 ff ff ff 48 14 8b 40 08 31 d2 a8 08 74 bc e8 6b be 1c 00 31 d2 eb b3 8d b4 26 00 00 00 00 8b 53 0c e9 51 ff ff ff <0f> 0b 7f 01 4e 66 33 c0 e9 2c ff ff ff 0f 0b 7e 01 4e 66 33 c0
[ 781.988000] EIP: [<c014c4d8>] remove_mapping+0xe8/0x120 SS:ESP 0068:cfea0df8
[ 781.988000] <0>------------[ cut here ]------------
[ 782.292000] kernel BUG at mm/vmscan.c:383!
...
[ 782.292000] <0>------------[ cut here ]------------
[ 782.564000] kernel BUG at mm/vmscan.c:383!
...
[ 809.588000] ------------[ cut here ]------------
[ 809.588000] kernel BUG at mm/vmscan.c:383!
...
[ 809.588000] <0>------------[ cut here ]------------
[ 811.748000] kernel BUG at mm/vmscan.c:383!
...
[ 811.748000] <0>------------[ cut here ]------------
[ 814.128000] kernel BUG at mm/vmscan.c:383!
...
[ 814.128000] <0>------------[ cut here ]------------
[ 815.272000] kernel BUG at mm/vmscan.c:383!
...
[ 815.272000] <0>------------[ cut here ]------------
[ 816.116000] kernel BUG at mm/vmscan.c:383!
...
[ 816.856000] <0>------------[ cut here ]------------
[ 817.120000] kernel BUG at mm/vmscan.c:383!

--
mattia
:wq!

2006-08-06 14:11:32

by Reuben Farrelly

Subject: Re: 2.6.18-rc3-mm2



On 6/08/2006 10:08 p.m., Andrew Morton wrote:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>
> - 2.6.18-rc3-mm1 gets mysterious udev timeouts during boot and crashes in
> NFS. This kernel reverts the patches which were causing that.
>
>
>
> Changes since 2.6.18-rc3-mm1:
>
>
> +revert-x86_64-mm-i386-remove-lock-section.patch
>
> Revert patch which causes udev timeouts.
>
> -knfsd-make-rpc-threads-pools-numa-aware-fix.patch
>
> Folded into knfsd-make-rpc-threads-pools-numa-aware.patch
>
> +revert-knfsd-make-rpc-threads-pools-numa-aware.patch
>
> Revert patch which causes nfs crashes.

Seems to work well.

The only outstanding issue I have is with the "Generic ATA support" option, which
I believe should be detecting and driving my ATA DVD-RW. However, it still gives
this on boot, and it has never worked:

ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode
ahci 0000:00:1f.2: flags: 64bit ncq led clo pio slum part
ata1: SATA max UDMA/133 cmd 0xFFFFC2000000E100 ctl 0x0 bmdma 0x0 irq 314
ata2: SATA max UDMA/133 cmd 0xFFFFC2000000E180 ctl 0x0 bmdma 0x0 irq 314
ata3: SATA max UDMA/133 cmd 0xFFFFC2000000E200 ctl 0x0 bmdma 0x0 irq 314
ata4: SATA max UDMA/133 cmd 0xFFFFC2000000E280 ctl 0x0 bmdma 0x0 irq 314
scsi0 : ahci
ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata1.00: ATA-7, max UDMA/133, 586072368 sectors: LBA48 NCQ (depth 31/32)
ata1.00: ata1: dev 0 multi count 16
ata1.00: configured for UDMA/133
scsi1 : ahci
ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata2.00: ATA-6, max UDMA/133, 156301488 sectors: LBA48 NCQ (depth 31/32)
ata2.00: ata2: dev 0 multi count 16
ata2.00: configured for UDMA/133
scsi2 : ahci
ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata3.00: ATA-7, max UDMA/133, 586072368 sectors: LBA48 NCQ (depth 31/32)
ata3.00: ata3: dev 0 multi count 16
ata3.00: configured for UDMA/133
scsi3 : ahci
ata4: SATA link down (SStatus 0 SControl 300)
Vendor: ATA Model: ST3300622AS Rev: 3.AA
Type: Direct-Access ANSI SCSI revision: 05
Vendor: ATA Model: ST380817AS Rev: 3.42
Type: Direct-Access ANSI SCSI revision: 05
Vendor: ATA Model: ST3300622AS Rev: 3.AA
Type: Direct-Access ANSI SCSI revision: 05
ata_piix 0000:00:1f.1: version 2.00ac6
ACPI: PCI Interrupt 0000:00:1f.1[A] -> GSI 18 (level, low) -> IRQ 18
PCI: Setting latency timer of device 0000:00:1f.1 to 64
ata5: PATA max UDMA/133 cmd 0x1F0 ctl 0x3F6 bmdma 0x30B0 irq 14
scsi4 : ata_piix
ata5.00: ATAPI, max UDMA/66
ata5.00: configured for UDMA/66
ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata5.00: (BMDMA stat 0x24)
ata5.00: tag 0 cmd 0xa0 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata5: soft resetting port
ata5.00: configured for UDMA/66
ata5: EH complete
ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata5.00: (BMDMA stat 0x24)
ata5.00: tag 0 cmd 0xa0 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata5: soft resetting port
ata5.00: configured for UDMA/66
ata5: EH complete
ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata5.00: (BMDMA stat 0x24)
ata5.00: tag 0 cmd 0xa0 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata5: soft resetting port
ata5.00: configured for UDMA/66
ata5: EH complete
ata5.00: limiting speed to UDMA/44
ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata5.00: (BMDMA stat 0x24)
ata5.00: tag 0 cmd 0xa0 Emask 0x4 stat 0x40 err 0x0 (timeout)
ata5: soft resetting port
ata5.00: configured for UDMA/44
ata5: EH complete
SCSI device sda: 586072368 512-byte hdwr sectors (300069 MB)

And no DVD-RW :-(

I posted some information about it to LKML on 10/07/06

ATAPI CD-ROM, with removable media
Model Number: PIONEER DVD-RW DVR-111D
Serial Number: FADC005671WL
Firmware Revision: 1.23
+ more

but had no feedback.

Should I continue to ask/report it or should I just disable it for now and try
again in a few months to see if it works?

Reuben

2006-08-06 14:56:46

by Hugh Dickins

Subject: Re: 2.6.18-rc3-mm2 [BUG at mm/vmscan.c:383!]

On Sun, 6 Aug 2006, Mattia Dongili wrote:
> [ 781.988000] kernel BUG at mm/vmscan.c:383!
> [ 781.988000] EIP is at remove_mapping+0xe8/0x120

You are so right: the minor fix below is needed.

> [ 781.988000] DWARF2 unwinder stuck at kernel_thread_helper+0x5/0x10

Sorry, someone else will have to help with all that nuisance.


remove_mapping() must check against page_mapping(page):
&swapper_space is implicit, never actually stored in page->mapping.

Signed-off-by: Hugh Dickins <[email protected]>

--- 2.6.18-rc3-mm2/mm/vmscan.c 2006-08-06 12:25:40.000000000 +0100
+++ linux/mm/vmscan.c 2006-08-06 15:40:34.000000000 +0100
@@ -380,7 +380,7 @@ static pageout_t pageout(struct page *pa
int remove_mapping(struct address_space *mapping, struct page *page)
{
BUG_ON(!PageLocked(page));
- BUG_ON(mapping != page->mapping);
+ BUG_ON(mapping != page_mapping(page));

write_lock_irq(&mapping->tree_lock);
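
To illustrate why the two checks differ, here is a small userspace model, not
kernel code. The struct layout, the in_swap_cache flag and every name ending in
_model are assumptions made for this sketch; only the idea that swapper space is
implied rather than stored in the page comes from the explanation above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel types; all names here are illustrative. */
struct address_space { int id; };

struct page {
	bool in_swap_cache;             /* models PageSwapCache(page) */
	struct address_space *mapping;  /* raw field; never holds the swapper space */
};

static struct address_space swapper_space_model = { 0 };

/* Models page_mapping(): swap-cache pages map to swapper space implicitly. */
static struct address_space *page_mapping_model(const struct page *page)
{
	assert(page != NULL);
	if (page->in_swap_cache)
		return &swapper_space_model;
	return page->mapping;
}

/* The fixed check: compare against the derived mapping, not the raw field. */
static bool mapping_matches(const struct address_space *mapping,
			    const struct page *page)
{
	return mapping == page_mapping_model(page);
}
```

In this model a swap-cache page passes mapping_matches() against the swapper
space even though its raw mapping field holds something else, which is the case
the original BUG_ON tripped over.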

Subject: Re: 2.6.18-rc3-mm2

Hi,

I have found a dependency error while building the 2.6.18-rc3-mm2 kernel into
a separate output directory...


estibi@amilo /home/place/linux-2.6.18-rc3-mm2> make V=1
O=../linux-2.6.18-rc3-mm2_amilo_obj menuconfig

make -C /home/place/linux-2.6.18-rc3-mm2_amilo_obj \
KBUILD_SRC=/home/place/linux-2.6.18-rc3-mm2 \
KBUILD_EXTMOD="" -f /home/place/linux-2.6.18-rc3-mm2/Makefile menuconfig
make -f /home/place/linux-2.6.18-rc3-mm2/scripts/Makefile.build
obj=scripts/basic
/bin/sh /home/place/linux-2.6.18-rc3-mm2/scripts/mkmakefile \
/home/place/linux-2.6.18-rc3-mm2
/home/place/linux-2.6.18-rc3-mm2_amilo_obj 2 6
GEN /home/place/linux-2.6.18-rc3-mm2_amilo_obj/Makefile
mkdir -p include/linux include/config
make -f /home/place/linux-2.6.18-rc3-mm2/scripts/Makefile.build
obj=scripts/kconfig menuconfig
gcc -Wp,-MD,scripts/kconfig/lxdialog/.checklist.o.d -Iscripts/kconfig
-Wall -Wstrict-prototypes -O2 -fomit-frame-pointer
-DCURSES_LOC="<ncurses.h>" -DLOCALE -c -o
scripts/kconfig/lxdialog/checklist.o
/home/place/linux-2.6.18-rc3-mm2/scripts/kconfig/lxdialog/checklist.c
/home/place/linux-2.6.18-rc3-mm2/scripts/kconfig/lxdialog/checklist.c:325:
fatal error: opening dependency file
scripts/kconfig/lxdialog/.checklist.o.d: No such file or directory
compilation terminated.
make[2]: *** [scripts/kconfig/lxdialog/checklist.o] Error 1
make[1]: *** [menuconfig] Error 2
make: *** [menuconfig] Error 2



Best Regards!

Piotr Jasiukajtis

2006-08-06 17:03:09

by Mattia Dongili

Subject: Re: 2.6.18-rc3-mm2 [BUG at mm/vmscan.c:383!]

On Sun, Aug 06, 2006 at 03:55:43PM +0100, Hugh Dickins wrote:
> On Sun, 6 Aug 2006, Mattia Dongili wrote:
> > [ 781.988000] kernel BUG at mm/vmscan.c:383!
> > [ 781.988000] EIP is at remove_mapping+0xe8/0x120
>
> You are so right: the minor fix below is needed.

Thanks, it now runs OK (for ~30 minutes so far).
Hot-fix? :)

--
mattia
:wq!

2006-08-06 19:09:08

by Andrew Morton

Subject: Re: 2.6.18-rc3-mm2

On Sun, 6 Aug 2006 17:48:52 +0200
"Fabio Comolli" <[email protected]> wrote:

> This kernel does not detect my HP laptop Alps touchpad. Also keyboard
> seems to be detected but does not work, with the only exception of the
> power button (I can use it to perform a clean shutdown).
>
> 2.6.18-rc1-mm1 works perfectly.

hum.

-tycho kernel: ata1.00: configured for UDMA/33
+tycho kernel: ata1.00: configured for UDMA/100

That looks nice.

-tycho kernel: ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0x18C8 irq 15
-tycho kernel: scsi1 : ata_piix
-tycho kernel: ata2: port disabled. ignoring.
-tycho kernel: ATA: abnormal status 0xFF on port 0x177

So does that.

-tycho kernel: input: PS/2 Mouse as /class/input/input1
-tycho kernel: input: AlpsPS/2 ALPS GlidePoint as /class/input/input2

That's not so good.


Dmitry, do you have anything in there which might have caused that?

Perhaps hdaps-handle-errors-from-input_register_device.patch is triggering
for some reason. Fabio, it'd be useful if you could add this, see if it
triggers:


--- a/drivers/input/input.c~input_register_device-debug
+++ a/drivers/input/input.c
@@ -1007,6 +1007,10 @@ int input_register_device(struct input_d
fail3: sysfs_remove_group(&dev->cdev.kobj, &input_dev_id_attr_group);
fail2: sysfs_remove_group(&dev->cdev.kobj, &input_dev_attr_group);
fail1: class_device_del(&dev->cdev);
+ if (error) {
+ printk(KERN_ERR "%s failed: %d\n", __FUNCTION__, error);
+ dump_stack();
+ }
return error;
}
EXPORT_SYMBOL(input_register_device);
_
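
The debug patch above hooks into a goto-based unwind path. As a standalone
sketch of that pattern, the userspace model below runs three setup steps and
rolls back the completed ones in reverse order, reporting the error at a single
point. The step hooks and the name register_device_model are hypothetical, and
standard __func__ stands in for the patch's __FUNCTION__; there is no
dump_stack() equivalent here.

```c
#include <assert.h>
#include <stdio.h>

#define NSTEPS 3

/* Hypothetical hooks: preset each step's result, record each rollback. */
static int step_result[NSTEPS];
static int rolled_back[NSTEPS];

static int do_step(int i)
{
	assert(i >= 0 && i < NSTEPS);
	return step_result[i];
}

static void undo_step(int i)
{
	assert(i >= 0 && i < NSTEPS);
	rolled_back[i] = 1;
}

/* Runs three setup steps; on failure unwinds the completed ones in reverse. */
static int register_device_model(void)
{
	int error;

	error = do_step(0);
	if (error)
		goto fail0;
	error = do_step(1);
	if (error)
		goto fail1;
	error = do_step(2);
	if (error)
		goto fail2;
	return 0;

 fail2:	undo_step(1);
 fail1:	undo_step(0);
 fail0:
	/* Single reporting point: every failure path passes through here. */
	fprintf(stderr, "%s failed: %d\n", __func__, error);
	return error;
}
```

Putting the report after the last label means any failing step is logged
without touching the individual error branches.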

2006-08-06 22:43:13

by Rafael J. Wysocki

Subject: Re: 2.6.18-rc3-mm2

On Sunday 06 August 2006 12:08, Andrew Morton wrote:
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/

My box's (Asus L5D, x86_64) keyboard doesn't work on this kernel at all, even
if I boot with init=/bin/bash. On the 2.6.18-rc2-mm1 it worked.

Unfortunately I have no indication what can be wrong, no oopses, no error
messages in dmesg, nothing.

Right now I'm doing a binary search for the offending patch.

Greetings,
Rafael
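
For reference, a binary search over an ordered patch series can be sketched as
below. This is a generic illustration, not the procedure actually used here: the
callback type, the bad_from() sample predicate and the assumption of a single
good-to-bad transition are all made up for the example. With the 1136 patches in
this release it would need at most 11 test boots.

```c
#include <assert.h>
#include <stddef.h>

/* Callback: nonzero if the tree with patches [0, applied_through] misbehaves. */
typedef int (*is_bad_fn)(size_t applied_through, void *ctx);

/*
 * Returns the index of the first patch whose application makes the tree bad.
 * Assumes one good-to-bad transition and that applying all n patches is bad.
 */
static size_t first_bad_patch(size_t n, is_bad_fn is_bad, void *ctx)
{
	assert(n > 0 && is_bad != NULL);

	size_t lo = 0, hi = n - 1;	/* invariant: the culprit is in [lo, hi] */

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (is_bad(mid, ctx))
			hi = mid;	/* culprit is at mid or earlier */
		else
			lo = mid + 1;	/* culprit is after mid */
	}
	return lo;
}

/* Sample predicate for experiments: bad from the index stored in ctx onward. */
static int bad_from(size_t applied_through, void *ctx)
{
	return applied_through >= *(size_t *)ctx;
}
```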

2006-08-06 22:54:59

by Andrew Morton

Subject: Re: 2.6.18-rc3-mm2

On Mon, 7 Aug 2006 00:42:10 +0200
"Rafael J. Wysocki" <[email protected]> wrote:

> On Sunday 06 August 2006 12:08, Andrew Morton wrote:
> >
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>
> My box's (Asus L5D, x86_64) keyboard doesn't work on this kernel at all, even
> if I boot with init=/bin/bash. On the 2.6.18-rc2-mm1 it worked.
>
> Unfortunately I have no indication what can be wrong, no oopses, no error
> messages in dmesg, nothing.
>
> Right now I'm doing a binary search for the offending patch.
>

Thanks. I'd zoom in on
hdaps-handle-errors-from-input_register_device.patch and git-input.patch.

2006-08-07 02:07:40

by Grant Coady

Subject: Re: 2.6.18-rc3-mm2

On Sun, 6 Aug 2006 03:08:09 -0700, Andrew Morton <[email protected]> wrote:

>
>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/

Okay here, done some fdisk partition manipulation and didn't lose
any filesystems or any other nasties. ;) Dual boot 'doze, so
stuffing around with NTFS (ro) as well as NFS (rw).

Some odd looking IRQ reassignments (Via chipset), I've put up
-rc3 -> -rc3-mm2 dmesg diff, as well as dmesg and config on
<http://bugsplatter.mine.nu/test/linux-2.6/sempro/> if anyone is
curious.

Grant.

2006-08-07 02:18:10

by Dmitry Torokhov

Subject: Re: 2.6.18-rc3-mm2

On Sunday 06 August 2006 15:09, Andrew Morton wrote:
> -tycho kernel: input: PS/2 Mouse as /class/input/input1
> -tycho kernel: input: AlpsPS/2 ALPS GlidePoint as /class/input/input2
>
> That's not so good.
>
>
> Dmitry, do you have anything in there which might have caused that?
>
> Perhaps hdaps-handle-errors-from-input_register_device.patch is triggering
> for some reason.

Hmm, I'd be more concerned with i8042-get-rid-of-polling-timer patch...
Anyway, can I have dmesg from boot with i8042.debug=1, please? Make sure
you have a big log buffer.

--
Dmitry

2006-08-07 02:19:00

by Dmitry Torokhov

Subject: Re: 2.6.18-rc3-mm2

On Sunday 06 August 2006 18:42, Rafael J. Wysocki wrote:
> On Sunday 06 August 2006 12:08, Andrew Morton wrote:
> >
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>
> My box's (Asus L5D, x86_64) keyboard doesn't work on this kernel at all, even
> if I boot with init=/bin/bash. On the 2.6.18-rc2-mm1 it worked.
>
> Unfortunately I have no indication what can be wrong, no oopses, no error
> messages in dmesg, nothing.
>
> Right now I'm doing a binary search for the offending patch.
>

Can I please have dmesg with i8042.debug=1?

--
Dmitry

2006-08-07 02:20:56

by Dmitry Torokhov

Subject: Re: 2.6.18-rc3-mm2

On Sunday 06 August 2006 22:18, Dmitry Torokhov wrote:
> On Sunday 06 August 2006 18:42, Rafael J. Wysocki wrote:
> > On Sunday 06 August 2006 12:08, Andrew Morton wrote:
> > >
> > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> >
> > My box's (Asus L5D, x86_64) keyboard doesn't work on this kernel at all, even
> > if I boot with init=/bin/bash. On the 2.6.18-rc2-mm1 it worked.
> >
> > Unfortunately I have no indication what can be wrong, no oopses, no error
> > messages in dmesg, nothing.
> >
> > Right now I'm doing a binary search for the offending patch.
> >
>
> Can I please have dmesg with i8042.debug=1?
>

Btw, does 2.6.18-rc4 work?

--
Dmitry

2006-08-07 09:16:44

by Rafael J. Wysocki

Subject: Re: 2.6.18-rc3-mm2

On Monday 07 August 2006 00:54, Andrew Morton wrote:
> On Mon, 7 Aug 2006 00:42:10 +0200
> "Rafael J. Wysocki" <[email protected]> wrote:
>
> > On Sunday 06 August 2006 12:08, Andrew Morton wrote:
> > >
> > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> >
> > My box's (Asus L5D, x86_64) keyboard doesn't work on this kernel at all, even
> > if I boot with init=/bin/bash. On the 2.6.18-rc2-mm1 it worked.
> >
> > Unfortunately I have no indication what can be wrong, no oopses, no error
> > messages in dmesg, nothing.
> >
> > Right now I'm doing a binary search for the offending patch.
> >
>
> Thanks. I'd zoom in on
> hdaps-handle-errors-from-input_register_device.patch and git-input.patch.

None of these, but close: remove-polling-timer-from-i8042-v2.patch breaks
things here. [FYI, the box is booted with "noapic", because the IRQ sharing
doesn't work otherwise due to a BIOS issue, so it may be related.]

Attached is the dmesg output with i8042.debug=1 for Dmitry. It's from
2.6.18-rc3 with -mm2 partially applied (up to and including
logips2pp-fix-mx300-button-layout.patch). I'll apply the rest tonight, after
I find the patch that broke suspend for me.

BTW, I couldn't test -rc4, because I don't use git and there's no standalone
version so far. I hope it will be available?

[Now, I have an emergency to handle, so I won't be reachable before tonight,
I think.]

Greetings,
Rafael


Attachments:
(No filename) (1.42 kB)
dmesg.log.gz (22.81 kB)

2006-08-07 09:28:15

by Jiri Slaby

Subject: swsusp regression [Was: 2.6.18-rc3-mm2]

Andrew Morton wrote:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/

I tried it and guess what :)... swsusp doesn't work :@.

This time I was able to dump process states with sysrq-t:
http://www.fi.muni.cz/~xslaby/sklad/ide2.gif

My guess is that ide2/2.0 dies (hpt370 driver), since the last thing the kernel
prints is suspending device 2.0

diff of dmesgs:
--- rc2 2006-08-07 11:13:34.000000000 +0200
+++ rc3 2006-08-07 11:13:39.000000000 +0200
@@ -1,4 +1,4 @@
-Linux version 2.6.18-rc2-mm1 (ku@bellona) (gcc version 4.1.1 20060721 (Red Hat
4.1.1-13)) #155 SMP Tue Aug 1 01:17:45 CEST 2006
+Linux version 2.6.18-rc3-mm2 (ku@bellona) (gcc version 4.1.1 20060802 (Red Hat
4.1.1-14)) #157 SMP Sun Aug 6 19:38:53 CEST 2006
BIOS-provided physical RAM map:
sanitize start
sanitize end
@@ -49,7 +49,7 @@
Enabling APIC mode: Flat. Using 1 I/O APICs
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 50000000 (gap: 40000000:bec00000)
-Detected 2736.278 MHz processor.
+Detected 2736.289 MHz processor.
Built 1 zonelists. Total pages: 262128
Kernel command line: ro root=/dev/hda2 reboot=w vga=1 2
mapped APIC to ffffd000 (fee00000)
@@ -57,14 +57,22 @@
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
-CPU 0 irqstacks, hard=c0505000 soft=c0502000
+CPU 0 irqstacks, hard=c0509000 soft=c0506000
PID hash table entries: 4096 (order: 12, 16384 bytes)
Console: colour VGA+ 80x50
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
-Memory: 1034488k/1048512k available (2514k kernel code, 13456k reserved, 1349k
data, 200k init, 131008k highmem)
+Memory: 1034472k/1048512k available (2522k kernel code, 13472k reserved, 1353k
data, 204k init, 131008k highmem)
+virtual kernel memory layout:
+ fixmap : 0xfff90000 - 0xfffff000 ( 444 kB)
+ pkmap : 0xff800000 - 0xffc00000 (4096 kB)
+ vmalloc : 0xf8800000 - 0xff7fe000 ( 111 MB)
+ lowmem : 0xc0000000 - 0xf8000000 ( 896 MB)
+ .init : 0xc04ce000 - 0xc0501000 ( 204 kB)
+ .data : 0xc03768d2 - 0xc04c8ff8 (1353 kB)
+ .text : 0xc0100000 - 0xc03768d2 (2522 kB)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
-Calibrating delay using timer specific routine.. 5476.47 BogoMIPS (lpj=10952942)
+Calibrating delay using timer specific routine.. 5476.48 BogoMIPS (lpj=10952969)
Mount-cache hash table entries: 512
CPU: After generic identify, caps: bfebfbff 00000000 00000000 00000000
00004400 00000000 00000000
CPU: After vendor identify, caps: bfebfbff 00000000 00000000 00000000 00004400
00000000 00000000
@@ -82,9 +90,9 @@
CPU0: Intel(R) Pentium(R) 4 CPU 2.60GHz stepping 09
SMP alternatives: switching to SMP code
Booting processor 1/1 eip 3000
-CPU 1 irqstacks, hard=c0506000 soft=c0503000
+CPU 1 irqstacks, hard=c050a000 soft=c0507000
Initializing CPU#1
-Calibrating delay using timer specific routine.. 5472.77 BogoMIPS (lpj=10945546)
+Calibrating delay using timer specific routine.. 5472.79 BogoMIPS (lpj=10945581)
CPU: After generic identify, caps: bfebfbff 00000000 00000000 00000000
00004400 00000000 00000000
CPU: After vendor identify, caps: bfebfbff 00000000 00000000 00000000 00004400
00000000 00000000
CPU: Trace cache: 12K uops, L1 D cache: 8K
@@ -96,15 +104,15 @@
CPU1: Intel P4/Xeon Extended MCE MSRs (12) available
CPU1: Thermal monitoring enabled
CPU1: Intel(R) Pentium(R) 4 CPU 2.60GHz stepping 09
-Total of 2 processors activated (10949.24 BogoMIPS).
+Total of 2 processors activated (10949.27 BogoMIPS).
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=-1 pin2=-1
checking TSC synchronization across 2 CPUs: passed.
Brought up 2 CPUs
-migration_cost=111
+migration_cost=1
NET: Registered protocol family 16
ACPI: bus type pci registered
-PCI: PCI BIOS revision 2.10 entry at 0xfb670, last bus=2
+PCI: Using configuration type 1
Setting up standard PCI resources
ACPI: Interpreter enabled
ACPI: Using IOAPIC for interrupt routing
@@ -189,7 +197,7 @@
ACPI: PCI Interrupt 0000:02:01.0[A] -> GSI 21 (level, low) -> IRQ 19
HPT370: chipset revision 3
HPT370: no clock data saved by BIOS
-HPT370: DPLL base: 48 MHz, f_CNT: 146, assuming 33 MHz PCI
+HPT370: DPLL base: 48 MHz, f_CNT: 148, assuming 33 MHz PCI
HPT370: using 33 MHz PCI clock
HPT370: 100% native mode on irq 19
ide2: BM-DMA at 0x9000-0x9007, BIOS settings: hde:DMA, hdf:pio
@@ -243,7 +251,7 @@
usb usb1: new device found, idVendor=0000, idProduct=0000
usb usb1: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb1: Product: EHCI Host Controller
-usb usb1: Manufacturer: Linux 2.6.18-rc2-mm1 ehci_hcd
+usb usb1: Manufacturer: Linux 2.6.18-rc3-mm2 ehci_hcd
usb usb1: SerialNumber: 0000:00:1d.7
usb usb1: configuration #1 chosen from 1 choice
hub 1-0:1.0: USB hub found
@@ -257,7 +265,7 @@
usb usb2: new device found, idVendor=0000, idProduct=0000
usb usb2: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb2: Product: UHCI Host Controller
-usb usb2: Manufacturer: Linux 2.6.18-rc2-mm1 uhci_hcd
+usb usb2: Manufacturer: Linux 2.6.18-rc3-mm2 uhci_hcd
usb usb2: SerialNumber: 0000:00:1d.0
usb usb2: configuration #1 chosen from 1 choice
hub 2-0:1.0: USB hub found
@@ -270,7 +278,7 @@
usb usb3: new device found, idVendor=0000, idProduct=0000
usb usb3: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb3: Product: UHCI Host Controller
-usb usb3: Manufacturer: Linux 2.6.18-rc2-mm1 uhci_hcd
+usb usb3: Manufacturer: Linux 2.6.18-rc3-mm2 uhci_hcd
usb usb3: SerialNumber: 0000:00:1d.1
usb usb3: configuration #1 chosen from 1 choice
hub 3-0:1.0: USB hub found
@@ -283,7 +291,7 @@
usb usb4: new device found, idVendor=0000, idProduct=0000
usb usb4: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb4: Product: UHCI Host Controller
-usb usb4: Manufacturer: Linux 2.6.18-rc2-mm1 uhci_hcd
+usb usb4: Manufacturer: Linux 2.6.18-rc3-mm2 uhci_hcd
usb usb4: SerialNumber: 0000:00:1d.2
usb usb4: configuration #1 chosen from 1 choice
hub 4-0:1.0: USB hub found
@@ -296,7 +304,7 @@
usb usb5: new device found, idVendor=0000, idProduct=0000
usb usb5: new device strings: Mfr=3, Product=2, SerialNumber=1
usb usb5: Product: UHCI Host Controller
-usb usb5: Manufacturer: Linux 2.6.18-rc2-mm1 uhci_hcd
+usb usb5: Manufacturer: Linux 2.6.18-rc3-mm2 uhci_hcd
usb usb5: SerialNumber: 0000:00:1d.3
usb usb5: configuration #1 chosen from 1 choice
hub 5-0:1.0: USB hub found
@@ -312,8 +320,8 @@
input: Wacom Graphire2 4x5 as /class/input/input0
usbcore: registered new interface driver wacom
/l/latest/xxx/drivers/usb/input/wacom.c: v1.45:USB Wacom Graphire and Wacom
Intuos tablet driver
-serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
+serio: i8042 AUX port at 0x60,0x64 irq 12
mice: PS/2 mouse device common for all mice
it87: Found IT8712F chip at 0x290, revision 5
md: raid0 personality registered for level 0
@@ -325,6 +333,7 @@
No soundcards found.
oprofile: using NMI interrupt.
ip_conntrack version 2.4 (8191 buckets, 65528 max) - 208 bytes per conntrack
+input: AT Translated Set 2 keyboard as /class/input/input1
ip_tables: (C) 2000-2006 Netfilter Core Team
TCP bic registered
NET: Registered protocol family 1
@@ -336,13 +345,15 @@
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
+EXT3-fs: INFO: recovery required on readonly filesystem.
+EXT3-fs: write access will be enabled during recovery.
kjournald starting. Commit interval 5 seconds
+EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
-Freeing unused kernel memory: 200k freed
-input: AT Translated Set 2 keyboard as /class/input/input1
+Freeing unused kernel memory: 204k freed
ieee1394: Initialized config rom entry `ip1394'
-hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
+hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
Uniform CD-ROM driver Revision: 3.20
hdd: ATAPI 40X DVD-ROM CD-R/RW drive, 2048kB Cache, UDMA(33)
ACPI: PCI Interrupt 0000:02:05.0[A] -> GSI 21 (level, low) -> IRQ 19
@@ -387,3 +398,5 @@
EXT3 FS on md0, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Adding 506036k swap on /dev/hda3. Priority:-1 extents:1 across:506036k
+JBD: barrier-based sync failed on hda2 - disabling barriers
+JBD: barrier-based sync failed on md0 - disabling barriers

regards,
--
<a href="http://www.fi.muni.cz/~xslaby/">Jiri Slaby</a>
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E

2006-08-07 09:53:08

by Balbir Singh

Subject: Re: 2.6.18-rc3-mm2

Michal Piotrowski wrote:
> Hi,
>
> On 06/08/06, Andrew Morton <[email protected]> wrote:
>>
>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>>
>>
>
> I get this error during the build.
>
> kernel/built-in.o: In function `bacct_add_tsk':
> /usr/src/linux-mm/kernel/tsacct.c:39: undefined reference to `__divdi3'
> make[1]: *** [.tmp_vmlinux1] Error 1
> make: *** [_all] Error 2
>
> I'll try with CONFIG_TASKSTATS disabled.
>
> Regards,
> Michal
>

Sounds like we are trying to do a 64-bit division, since timespec_to_ns()
returns a 64-bit value.

Here's a compile-tested patch to fix the problem.

Signed-off-by: Balbir Singh <[email protected]>
---

kernel/tsacct.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletion(-)

diff -puN kernel/tsacct.c~tsacct-build-fix kernel/tsacct.c
--- linux-2.6.18-rc3/kernel/tsacct.c~tsacct-build-fix 2006-08-07
14:20:58.000000000 +0530
+++ linux-2.6.18-rc3-balbir/kernel/tsacct.c 2006-08-07 14:51:44.000000000 +0530
@@ -36,7 +36,8 @@ void bacct_add_tsk(struct taskstats *sta
do_posix_clock_monotonic_gettime(&uptime);
ts = timespec_sub(uptime, current->group_leader->start_time);
/* rebase elapsed time to usec */
- stats->ac_etime = (timespec_to_ns(&ts))/NSEC_PER_USEC;
+ stats->ac_etime = (ts.tv_sec * USEC_PER_SEC) +
+ (ts.tv_nsec / NSEC_PER_USEC);
stats->ac_btime = xtime.tv_sec - ts.tv_sec;
if (thread_group_leader(tsk)) {
stats->ac_exitcode = tsk->exit_code;
_
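
As a standalone sketch of the idea in the patch above: __divdi3 is the libgcc
helper for signed 64-bit division, which a 32-bit kernel build does not link
against, so the conversion is done per field instead. The struct
timespec_model type and the elapsed_usec() name are made up for this userspace
example. It also widens tv_sec to 64 bits before multiplying, a precaution
added here that is not part of the posted patch.

```c
#include <assert.h>
#include <stdint.h>

#define USEC_PER_SEC	1000000L
#define NSEC_PER_USEC	1000L

/* Userspace stand-in for struct timespec; illustrative only. */
struct timespec_model {
	long tv_sec;
	long tv_nsec;
};

/* Converts a non-negative, normalized interval to microseconds. */
static uint64_t elapsed_usec(struct timespec_model ts)
{
	assert(ts.tv_sec >= 0);
	assert(ts.tv_nsec >= 0 && ts.tv_nsec < 1000000000L);

	/*
	 * The only division is on the long-sized tv_nsec field. The 64-bit
	 * multiply is widened first so it cannot overflow a 32-bit long.
	 */
	return (uint64_t)ts.tv_sec * USEC_PER_SEC +
	       (uint64_t)(ts.tv_nsec / NSEC_PER_USEC);
}
```

On a 32-bit target the division compiles to an ordinary 32-bit divide, so no
libgcc helper should be needed for it.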



--
Regards,
Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-08-07 12:16:07

by Michal Piotrowski

Subject: Re: 2.6.18-rc3-mm2

Hi,

On 07/08/06, Balbir Singh <[email protected]> wrote:
> Michal Piotrowski wrote:
> > Hi,
> >
> > On 06/08/06, Andrew Morton <[email protected]> wrote:
> >>
> >> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> >>
> >>
> >
> > I get this error during the build.
> >
> > kernel/built-in.o: In function `bacct_add_tsk':
> > /usr/src/linux-mm/kernel/tsacct.c:39: undefined reference to `__divdi3'
> > make[1]: *** [.tmp_vmlinux1] Error 1
> > make: *** [_all] Error 2
> >
> > I'll try with CONFIG_TASKSTATS disabled.
> >
> > Regards,
> > Michal
> >
>
> Sounds like we are trying to do a 64-bit division, since timespec_to_ns()
> returns a 64-bit value.
>
> Here's a compile-tested patch to fix the problem.
>

It doesn't apply
cat patches/tsacct1.patch | patch -p1 --dry-run
patching file kernel/tsacct.c
Hunk #1 FAILED at 36.
1 out of 1 hunk FAILED -- saving rejects to file kernel/tsacct.c.rej

Andrew's csa-basic-accounting-over-taskstats-fix.patch fixes the compilation problem.

> --
> Regards,
> Balbir Singh,
> Linux Technology Center,
> IBM Software Labs
>

Regards,
Michal

--
Michal K. K. Piotrowski
LTG - Linux Testers Group
(http://www.stardust.webpages.pl/ltg/wiki/)

2006-08-07 13:42:28

by Andy Whitcroft

Subject: x86_64 command line truncated

It seems that the command line on x86_64 is being truncated during boot:

Bootdata ok (command line is root=/dev/sda1 ro profile=2 console=tty0
console=ttyS0,57600 autobench_args: root=/dev/sda1 ABAT:1154470592
profile=2)
[...]
Kernel command line: root=/dev/sda1 ro profile=2 console=tty0
console=ttyS0,57600 autobench_args: root=/dev/sda1 ABAT:1154470592 profile=2
[...]
elm3b6:~# cat /proc/cmdline
root=/dev/sda1

This seems to be occurring around the parse_args area.

Will try and track it down.

-apw
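
One common way a saved command line ends up cut at the first space is a parser
that splits tokens by writing NULs into the buffer it was handed. Whether that
is what happens here is not established; the userspace sketch below only shows
the effect and the usual remedy of parsing a scratch copy. Both function names
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Splits buf at the first space by writing a NUL, as in-place parsers do. */
static char *first_token_in_place(char *buf)
{
	assert(buf != NULL);

	char *sp = strchr(buf, ' ');

	if (sp)
		*sp = '\0';	/* everything after the first token is now hidden */
	return buf;
}

/* Safer variant: parse a scratch copy so the saved line stays intact. */
static void first_token_copy(const char *saved, char *scratch, size_t n)
{
	assert(saved != NULL && scratch != NULL && n > 0);

	strncpy(scratch, saved, n - 1);
	scratch[n - 1] = '\0';
	first_token_in_place(scratch);
}
```

After first_token_in_place() runs on the saved buffer, printing it yields only
the first argument, which matches the shape of the /proc/cmdline output above.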

2006-08-07 14:06:04

by Andi Kleen

Subject: Re: x86_64 command line truncated

Andy Whitcroft <[email protected]> writes:

> It seems that the command line on x86_64 is being truncated during boot:

in mm right?
> Will try and track it down.

Don't bother, it is likely "early-param" (the patch from
hell). I'll investigate.

-Andi

2006-08-07 14:05:18

by Balbir Singh

Subject: Re: 2.6.18-rc3-mm2

Michal Piotrowski wrote:
> Hi,
>
> On 07/08/06, Balbir Singh <[email protected]> wrote:
>> Michal Piotrowski wrote:
>> > Hi,
>> >
>> > On 06/08/06, Andrew Morton <[email protected]> wrote:
>> >>
>> >>
>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>>
>> >>
>> >>
>> >
>> > I get this error during the build.
>> >
>> > kernel/built-in.o: In function `bacct_add_tsk':
>> > /usr/src/linux-mm/kernel/tsacct.c:39: undefined reference to `__divdi3'
>> > make[1]: *** [.tmp_vmlinux1] Error 1
>> > make: *** [_all] Error 2
>> >
>> > I'll try with CONFIG_TASKSTATS disabled.
>> >
>> > Regards,
>> > Michal
>> >
>>
>> Sounds like we are trying to do a 64-bit division, since timespec_to_ns()
>> returns a 64-bit value.
>>
>> Here's a compile tested patch to fix the problem
>>
>
> It doesn't apply
> cat patches/tsacct1.patch | patch -p1 --dry-run
> patching file kernel/tsacct.c
> Hunk #1 FAILED at 36.
> 1 out of 1 hunk FAILED -- saving rejects to file kernel/tsacct.c.rej
>
> Andrew's csa-basic-accounting-over-taskstats-fix.patch fixes the compilation
> problem.
>

Yeah, that's it! I did not see the fix in mm-commits.

Thanks for pointing to the fix.
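
For readers following along: on 32-bit targets gcc turns a plain 64-bit
signed division into a call to libgcc's __divdi3, and the kernel is not
linked against libgcc, hence the undefined reference above. Kernel code
normally uses do_div() for this instead. The sketch below is a userspace
stand-in that only illustrates the do_div() calling convention (quotient
stored in place, remainder returned); do_div_sketch() and
ns_to_usec_sketch() are hypothetical names, and this is not the content
of Andrew's fix patch.

```c
#include <stdint.h>

/*
 * Userspace stand-in for the kernel's do_div(): divides *n by a 32-bit
 * base in place and returns the remainder. Hypothetical helper; it only
 * mirrors the semantics, not the kernel's arch-specific implementation.
 */
static uint32_t do_div_sketch(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/* Convert nanoseconds to whole microseconds via the helper above. */
static uint64_t ns_to_usec_sketch(uint64_t ns)
{
	do_div_sketch(&ns, 1000);	/* 1000 ns per microsecond */
	return ns;
}
```

The point of the in-place interface is that the divisor is limited to
32 bits, which lets each architecture supply a cheap implementation.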

--

Balbir Singh,
Linux Technology Center,
IBM Software Labs

2006-08-07 14:37:36

by Andi Kleen

Subject: Re: x86_64 command line truncated II

Andi Kleen <[email protected]> writes:

> Andy Whitcroft <[email protected]> writes:
>
> > It seems that the command line on x86_64 is being truncated during boot:
>
> in mm right?
> > Will try and track it down.
>
> Don't bother, it is likely "early-param" (the patch from
> hell). I'll investigate.

Following up myself ...

Are you sure it's a regression? 2.6.17 does the same
and we always had that 255 character limit (I tried
to increase it once, but it broke some old lilo setups)

i386 should be the same btw.

-Andi

2006-08-07 14:40:31

by Andy Whitcroft

Subject: Re: x86_64 command line truncated

Andi Kleen wrote:
> Andy Whitcroft <[email protected]> writes:
>
>> It seems that the command line on x86_64 is being truncated during boot:
>
> in mm right?
>> Will try and track it down.
>
> Don't bother, it is likely "early-param" (the patch from
> hell). I'll investigate.
>
> -Andi

Well I've narrowed it down to the following patch from Andrew:

x86_64-mm-early-param.patch

Basically, that leads setup_arch to return saved_command_line as _the_
command_line. We then run parse_args() against it which assumes it may
irrevocably change command_line. Previous to this patch
saved_command_line and command_line were separate and this was not an issue.

It feels like we should be following the model in the newly added
parse_early_parms() and taking a local copy of the command_line here.

-apw

2006-08-07 14:44:08

by Andy Whitcroft

Subject: Re: x86_64 command line truncated II

Andi Kleen wrote:
> Andi Kleen <[email protected]> writes:
>
>> Andy Whitcroft <[email protected]> writes:
>>
>>> It seems that the command line on x86_64 is being truncated during boot:
>> in mm right?
>>> Will try and track it down.
>> Don't bother, it is likely "early-param" (the patch from
>> hell). I'll investigate.
>
> Following up myself ...
>
> Are you sure it's a regression? 2.6.17 does the same
> and we always had that 255 character limit (I tried
> to increase it once, but it broke some old lilo setups)
>
> i386 should be the same btw.

It's not being truncated at 255 characters, it's being truncated at the
first space. This is coming out of parse_args, which dumps '\0's into
the command_line as it rips it apart. We now only have one copy of the
command line (in x86_64) instead of two, so we now expose this trashed
copy in /proc/cmdline.
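
To illustrate the effect in userspace: the sketch below is not the
kernel's parse_args() (which also handles quoting and parameter
dispatch), just a minimal in-place splitter under the assumption that
tokens are separated by single spaces. split_in_place() and parse_copy()
are hypothetical names, and the 256-byte scratch size is an assumption
as well. The first function shows why the buffer reads back as only its
first token afterwards; the second shows the dual-buffer approach of
parsing a scratch copy.

```c
#include <string.h>

#define CMDLINE_SIZE_SKETCH 256	/* hypothetical scratch buffer size */

/*
 * Split on spaces in place, the way a destructive parser does.
 * Returns the number of tokens found.
 */
static int split_in_place(char *line)
{
	int count = 0;
	char *p = line;

	while (*p) {
		while (*p == ' ')
			p++;
		if (!*p)
			break;
		count++;
		while (*p && *p != ' ')
			p++;
		if (*p)
			*p++ = '\0';	/* this write truncates the string */
	}
	return count;
}

/* Parse a scratch copy so that `saved` stays intact for later readers. */
static int parse_copy(const char *saved)
{
	char scratch[CMDLINE_SIZE_SKETCH];

	strncpy(scratch, saved, sizeof(scratch) - 1);
	scratch[sizeof(scratch) - 1] = '\0';
	return split_in_place(scratch);
}
```

Anything that later prints the parsed buffer as a C string stops at the
first '\0', which matches a /proc/cmdline that ends after the first
parameter.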

-apw

2006-08-07 14:46:59

by Andi Kleen

Subject: Re: x86_64 command line truncated II

On Monday 07 August 2006 16:42, Andy Whitcroft wrote:
> Andi Kleen wrote:
> > Andi Kleen <[email protected]> writes:
> >
> >> Andy Whitcroft <[email protected]> writes:
> >>
> >>> It seems that the command line on x86_64 is being truncated during boot:
> >> in mm right?
> >>> Will try and track it down.
> >> Don't bother, it is likely "early-param" (the patch from
> >> hell). I'll investigate.
> >
> > Following up myself ...
> >
> > Are you sure it's a regression? 2.6.17 does the same
> > and we always had that 255 character limit (I tried
> > to increase it once, but it broke some old lilo setups)
> >
> > i386 should be the same btw.
>
> It's not being truncated at 255 characters, it's being truncated at the
> first space. This is coming out of parse_args, which dumps '\0's into
> the command_line as it rips it apart. We now only have one copy of the
> command line (in x86_64) instead of two, so we now expose this trashed
> copy in /proc/cmdline.

I don't see this in my version, so it's likely fixed already. I have made quite
a lot of changes to this patch already.

Please test

ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/patches/early-param

-Andi

2006-08-07 15:06:31

by Andy Whitcroft

Subject: Re: x86_64 command line truncated II

Andi Kleen wrote:
> On Monday 07 August 2006 16:42, Andy Whitcroft wrote:
>> Andi Kleen wrote:
>>> Andi Kleen <[email protected]> writes:
>>>
>>>> Andy Whitcroft <[email protected]> writes:
>>>>
>>>>> It seems that the command line on x86_64 is being truncated during boot:
>>>> in mm right?
>>>>> Will try and track it down.
>>>> Don't bother, it is likely "early-param" (the patch from
>>>> hell). I'll investigate.
>>> Following up myself ...
>>>
>>> Are you sure it's a regression? 2.6.17 does the same
>>> and we always had that 255 character limit (I tried
>>> to increase it once, but it broke some old lilo setups)
>>>
>>> i386 should be the same btw.
>> It's not being truncated at 255 characters, it's being truncated at the
>> first space. This is coming out of parse_args, which dumps '\0's into
>> the command_line as it rips it apart. We now only have one copy of the
>> command line (in x86_64) instead of two, so we now expose this trashed
>> copy in /proc/cmdline.
>
> I don't see this in my version; so it's likely fixed already. I did quite
> a lot of changes on this patch already.
>
> Please test
>
> ftp://ftp.firstfloor.org/pub/ak/x86_64/quilt/patches/early-param

Easier said than done as the original version is unwilling to revert.
Looking at the replacement patch it has the same fix I have been testing
to restore the original dual buffer semantic. So I think it would fix
the problem we're seeing here. I'll follow up to this email with the
incremental patch I tested with 2.6.18-rc2-mm2.

-apw

2006-08-07 15:15:29

by Andrew Morton

Subject: Re: x86_64 command line truncated

On Mon, 07 Aug 2006 15:38:49 +0100
Andy Whitcroft <[email protected]> wrote:

> Andi Kleen wrote:
> > Andy Whitcroft <[email protected]> writes:
> >
> >> It seems that the command line on x86_64 is being truncated during boot:
> >
> > in mm right?
> >> Will try and track it down.
> >
> > Don't bother, it is likely "early-param" (the patch from
> > hell). I'll investigate.
> >
> > -Andi
>
> Well I've narrowed it down to the following patch from Andrew:
>
> x86_64-mm-early-param.patch

Not me. My only contribution to that patch was to scrog the changelog ;)
I'll be fixing that sometime.

I think that patch doesn't have a future, although Andi hasn't yet dropped it.

> Basically, that leads setup_arch to return saved_command_line as _the_
> command_line. We then run parse_args() against it which assumes it may
> irrevocably change command_line. Previous to this patch
> saved_command_line and command_line were separate and this was not an issue.
>
> It feels like we should be following the model in the newly added
> parse_early_parms() and taking a local copy of the command_line here.
>

2006-08-07 15:49:41

by Adrian Bunk

Subject: [-mm patch] make arch/i386/kernel/acpi/boot.c:acpi_force static

acpi_force can become static.

Signed-off-by: Adrian Bunk <[email protected]>

--- linux-2.6.18-rc3-mm2-full/arch/i386/kernel/acpi/boot.c.old 2006-08-07 15:56:19.000000000 +0200
+++ linux-2.6.18-rc3-mm2-full/arch/i386/kernel/acpi/boot.c 2006-08-07 15:56:28.000000000 +0200
@@ -37,7 +37,7 @@
#include <asm/io.h>
#include <asm/mpspec.h>

-int __initdata acpi_force = 0;
+static int __initdata acpi_force = 0;

#ifdef CONFIG_ACPI
int acpi_disabled = 0;

2006-08-07 15:49:45

by Adrian Bunk

Subject: [-mm patch] make arch/i386/kernel/apic.c:enable_local_apic static

enable_local_apic can now become static.

Signed-off-by: Adrian Bunk <[email protected]>

---

arch/i386/kernel/apic.c | 13 ++++++++++++-
include/asm-i386/apic.h | 12 ------------
2 files changed, 12 insertions(+), 13 deletions(-)

--- linux-2.6.18-rc3-mm2-full/include/asm-i386/apic.h.old 2006-08-07 16:10:45.000000000 +0200
+++ linux-2.6.18-rc3-mm2-full/include/asm-i386/apic.h 2006-08-07 16:12:37.000000000 +0200
@@ -16,20 +16,8 @@
#define APIC_VERBOSE 1
#define APIC_DEBUG 2

-extern int enable_local_apic;
extern int apic_verbosity;

-static inline void lapic_disable(void)
-{
- enable_local_apic = -1;
- clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability);
-}
-
-static inline void lapic_enable(void)
-{
- enable_local_apic = 1;
-}
-
/*
* Define the default level of output to be very little
* This can be turned up by using apic=verbose for more
--- linux-2.6.18-rc3-mm2-full/arch/i386/kernel/apic.c.old 2006-08-07 16:11:08.000000000 +0200
+++ linux-2.6.18-rc3-mm2-full/arch/i386/kernel/apic.c 2006-08-07 16:12:57.000000000 +0200
@@ -52,7 +52,18 @@
/*
* Knob to control our willingness to enable the local APIC.
*/
-int enable_local_apic __initdata = 0; /* -1=force-disable, +1=force-enable */
+static int enable_local_apic __initdata = 0; /* -1=force-disable, +1=force-enable */
+
+static inline void lapic_disable(void)
+{
+ enable_local_apic = -1;
+ clear_bit(X86_FEATURE_APIC, boot_cpu_data.x86_capability);
+}
+
+static inline void lapic_enable(void)
+{
+ enable_local_apic = 1;
+}

/*
* Debug level

2006-08-07 15:50:12

by Adrian Bunk

Subject: [-mm patch] drivers/crypto/geode-aes.c: cleanups

This patch contains the following cleanups:
- make needlessly global code static
- use C99 struct initializers

Signed-off-by: Adrian Bunk <[email protected]>

---

The {cia,geode_aes}_{setkey,encrypt,decrypt} prototype confusion that both
sparse and gcc are giving warnings about should also be fixed.

drivers/crypto/geode-aes.c | 12 ++++++------
drivers/crypto/geode-aes.h | 2 --
2 files changed, 6 insertions(+), 8 deletions(-)

--- linux-2.6.18-rc3-mm2-full/drivers/crypto/geode-aes.h.old 2006-08-07 16:23:25.000000000 +0200
+++ linux-2.6.18-rc3-mm2-full/drivers/crypto/geode-aes.h 2006-08-07 16:23:51.000000000 +0200
@@ -37,6 +37,4 @@
u8 iv[AES_IV_LENGTH];
};

-unsigned int geode_aes_crypt(struct geode_aes_op *);
-
#endif
--- linux-2.6.18-rc3-mm2-full/drivers/crypto/geode-aes.c.old 2006-08-07 16:24:03.000000000 +0200
+++ linux-2.6.18-rc3-mm2-full/drivers/crypto/geode-aes.c 2006-08-07 16:50:41.000000000 +0200
@@ -114,7 +114,7 @@
AWRITE((status & 0xFF) | AES_INTRA_PENDING, AES_INTR_REG);
}

-unsigned int
+static unsigned int
geode_aes_crypt(struct geode_aes_op *op)
{
u32 flags = 0;
@@ -361,7 +361,7 @@
return ret;
}

-struct pci_device_id geode_aes_tbl[] = {
+static struct pci_device_id geode_aes_tbl[] = {
{ PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_LX_AES, PCI_ANY_ID, PCI_ANY_ID} ,
{ 0, }
};
@@ -369,10 +369,10 @@
MODULE_DEVICE_TABLE(pci, geode_aes_tbl);

static struct pci_driver geode_aes_driver = {
- name: "Geode LX AES",
- id_table: geode_aes_tbl,
- probe: geode_aes_probe,
- remove: __devexit_p(geode_aes_remove)
+ .name = "Geode LX AES",
+ .id_table = geode_aes_tbl,
+ .probe = geode_aes_probe,
+ .remove = __devexit_p(geode_aes_remove)
};

static int __devinit
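
As a side note on the initializer change in the last hunk: the
"field: value" form is an obsolete GNU extension, while ".field = value"
is the standard C99 designated initializer. A minimal standalone sketch,
using a hypothetical struct driver_sketch that is only loosely shaped
like struct pci_driver (none of these names come from the patch):

```c
#include <stddef.h>

/* Hypothetical driver descriptor, loosely shaped like struct pci_driver. */
struct driver_sketch {
	const char *name;
	int (*probe)(void);
	void (*remove)(void);
};

/* Hypothetical probe callback that always reports success. */
static int probe_sketch(void)
{
	return 0;
}

/* C99 form: ".field = value". The obsolete GNU form was "field: value". */
static const struct driver_sketch sketch_driver = {
	.name  = "sketch",
	.probe = probe_sketch,
	/* .remove is not named, so it is initialized to NULL */
};
```

Members that are not named are zero-initialized, and the initializer
keeps working if fields are later reordered or added.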

2006-08-07 15:50:08

by Adrian Bunk

Subject: [-mm patch] net/: make code static

This patch makes needlessly global code static.

Signed-off-by: Adrian Bunk <[email protected]>

---

BTW:
It doesn't seem to be intended that the new
ipv4/fib_rules.c:fib4_rules_cleanup() is completely unused?

include/net/ip6_fib.h | 4 ----
net/ipv4/cipso_ipv4.c | 2 +-
net/ipv4/fib_rules.c | 4 ++--
net/ipv6/fib6_rules.c | 4 ++--
net/ipv6/ip6_fib.c | 6 +++---
net/ipv6/route.c | 6 +++---
net/netlabel/netlabel_domainhash.c | 4 ++--
7 files changed, 13 insertions(+), 17 deletions(-)

--- linux-2.6.18-rc3-mm2-full/net/ipv4/cipso_ipv4.c.old 2006-08-07 16:39:05.000000000 +0200
+++ linux-2.6.18-rc3-mm2-full/net/ipv4/cipso_ipv4.c 2006-08-07 16:39:15.000000000 +0200
@@ -60,7 +60,7 @@
* if in practice there are a lot of different DOIs this list should
* probably be turned into a hash table or something similar so we
* can do quick lookups. */
-DEFINE_SPINLOCK(cipso_v4_doi_list_lock);
+static DEFINE_SPINLOCK(cipso_v4_doi_list_lock);
static struct list_head cipso_v4_doi_list = LIST_HEAD_INIT(cipso_v4_doi_list);

/* Label mapping cache */
--- linux-2.6.18-rc3-mm2-full/net/ipv4/fib_rules.c.old 2006-08-07 16:39:33.000000000 +0200
+++ linux-2.6.18-rc3-mm2-full/net/ipv4/fib_rules.c 2006-08-07 16:39:51.000000000 +0200
@@ -101,8 +101,8 @@
return err;
}

-int fib4_rule_action(struct fib_rule *rule, struct flowi *flp, int flags,
- struct fib_lookup_arg *arg)
+static int fib4_rule_action(struct fib_rule *rule, struct flowi *flp,
+ int flags, struct fib_lookup_arg *arg)
{
int err = -EAGAIN;
struct fib_table *tbl;
--- linux-2.6.18-rc3-mm2-full/net/ipv6/fib6_rules.c.old 2006-08-07 16:41:07.000000000 +0200
+++ linux-2.6.18-rc3-mm2-full/net/ipv6/fib6_rules.c 2006-08-07 16:41:16.000000000 +0200
@@ -66,8 +66,8 @@
return (struct dst_entry *) arg.result;
}

-int fib6_rule_action(struct fib_rule *rule, struct flowi *flp,
- int flags, struct fib_lookup_arg *arg)
+static int fib6_rule_action(struct fib_rule *rule, struct flowi *flp,
+ int flags, struct fib_lookup_arg *arg)
{
struct rt6_info *rt = NULL;
struct fib6_table *table;
--- linux-2.6.18-rc3-mm2-full/include/net/ip6_fib.h.old 2006-08-07 16:41:36.000000000 +0200
+++ linux-2.6.18-rc3-mm2-full/include/net/ip6_fib.h 2006-08-07 16:41:43.000000000 +0200
@@ -192,10 +192,6 @@
struct in6_addr *daddr, int dst_len,
struct in6_addr *saddr, int src_len);

-extern void fib6_clean_tree(struct fib6_node *root,
- int (*func)(struct rt6_info *, void *arg),
- int prune, void *arg);
-
extern void fib6_clean_all(int (*func)(struct rt6_info *, void *arg),
int prune, void *arg);

--- linux-2.6.18-rc3-mm2-full/net/ipv6/ip6_fib.c.old 2006-08-07 16:41:51.000000000 +0200
+++ linux-2.6.18-rc3-mm2-full/net/ipv6/ip6_fib.c 2006-08-07 16:42:05.000000000 +0200
@@ -1169,9 +1169,9 @@
* ignoring pure split nodes) will be scanned.
*/

-void fib6_clean_tree(struct fib6_node *root,
- int (*func)(struct rt6_info *, void *arg),
- int prune, void *arg)
+static void fib6_clean_tree(struct fib6_node *root,
+ int (*func)(struct rt6_info *, void *arg),
+ int prune, void *arg)
{
struct fib6_cleaner_t c;

--- linux-2.6.18-rc3-mm2-full/net/ipv6/route.c.old 2006-08-07 16:42:24.000000000 +0200
+++ linux-2.6.18-rc3-mm2-full/net/ipv6/route.c 2006-08-07 16:43:05.000000000 +0200
@@ -613,8 +613,8 @@
return rt;
}

-struct rt6_info *ip6_pol_route_input(struct fib6_table *table, struct flowi *fl,
- int flags)
+static struct rt6_info *ip6_pol_route_input(struct fib6_table *table,
+ struct flowi *fl, int flags)
{
struct fib6_node *fn;
struct rt6_info *rt, *nrt;
@@ -872,7 +872,7 @@
}

static struct dst_entry *ndisc_dst_gc_list;
-DEFINE_SPINLOCK(ndisc_lock);
+static DEFINE_SPINLOCK(ndisc_lock);

struct dst_entry *ndisc_dst_alloc(struct net_device *dev,
struct neighbour *neigh,
--- linux-2.6.18-rc3-mm2-full/net/netlabel/netlabel_domainhash.c.old 2006-08-07 16:43:27.000000000 +0200
+++ linux-2.6.18-rc3-mm2-full/net/netlabel/netlabel_domainhash.c 2006-08-07 16:43:53.000000000 +0200
@@ -50,11 +50,11 @@
/* Domain hash table */
/* XXX - updates should be so rare that having one spinlock for the entire
* hash table should be okay */
-DEFINE_SPINLOCK(netlbl_domhsh_lock);
+static DEFINE_SPINLOCK(netlbl_domhsh_lock);
static struct netlbl_domhsh_tbl *netlbl_domhsh = NULL;

/* Default domain mapping */
-DEFINE_SPINLOCK(netlbl_domhsh_def_lock);
+static DEFINE_SPINLOCK(netlbl_domhsh_def_lock);
static struct netlbl_dom_map *netlbl_domhsh_def = NULL;

/*

2006-08-07 15:58:10

by Andi Kleen

Subject: Re: x86_64 command line truncated

Andrew Morton <[email protected]> writes:

> On Mon, 07 Aug 2006 15:38:49 +0100
> Andy Whitcroft <[email protected]> wrote:
>
> > Andi Kleen wrote:
> > > Andy Whitcroft <[email protected]> writes:
> > >
> > >> It seems that the command line on x86_64 is being truncated during boot:
> > >
> > > in mm right?
> > >> Will try and track it down.
> > >
> > > Don't bother, it is likely "early-param" (the patch from
> > > hell). I'll investigate.
> > >
> > > -Andi
> >
> > Well I've narrowed it down to the following patch from Andrew:
> >
> > x86_64-mm-early-param.patch
>
> Not me. My only contribution to that patch was to scrog the changelog ;)
> I'll be fixing that sometime.
>
> I think that patch doesn't have a future, although Andi hasn't yet dropped it.

I fixed all known bugs (the fixes haven't reached your tree yet), and right
now it looks good enough not to be dropped.

Of course more testing will tell.

-Andi

2006-08-07 16:07:24

by Andi Kleen

Subject: Re: [-mm patch] make arch/i386/kernel/acpi/boot.c:acpi_force static

On Monday 07 August 2006 17:49, Adrian Bunk wrote:
> acpi_force can become static.

Both patches added, thanks.

-Andi

2006-08-07 16:23:37

by Jason Lunz

Subject: Re: swsusp regression [Was: 2.6.18-rc3-mm2]

In gmane.linux.kernel, you wrote:
>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>
> I tried it and guess what :)... swsusp doesn't work :@.
>
> This time I was able to dump process states with sysrq-t:
> http://www.fi.muni.cz/~xslaby/sklad/ide2.gif
>
> My guess is ide2/2.0 dies (hpt370 driver), since last thing kernel prints is
> suspending device 2.0

Does it go away if you revert this?
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/broken-out/ide-reprogram-disk-pio-timings-on-resume.patch

That should only affect resume, not suspend, but it does mess around
with ide power management. Is this maybe happening on the *second*
suspend?

> -hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
> +hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)

This looks suspicious. -mm does have several ide-fix-hpt3xx patches.

Jason

2006-08-07 18:47:04

by Fabio Comolli

Subject: Re: 2.6.18-rc3-mm2

Hi.

On 8/7/06, Dmitry Torokhov <[email protected]> wrote:
> On Sunday 06 August 2006 15:09, Andrew Morton wrote:
> > -tycho kernel: input: PS/2 Mouse as /class/input/input1
> > -tycho kernel: input: AlpsPS/2 ALPS GlidePoint as /class/input/input2
> >
> > That's not so good.
> >
> >
> > Dmitry, do you have anything in there which might have caused that?
> >
> > Perhaps hdaps-handle-errors-from-input_register_device.patch is triggering
> > for some reason.
>
> Hmm, I'd be more concerned with i8042-get-rid-of-polling-timer patch...

Bingo! Reverting remove-polling-timer-from-i8042-v2.patch did the
trick. Now I'm running 2.6.18-rc3-mm2 + hot-fixes :-)

Still interested in dmesg with i8042.debug=1 ?

Ciao.
Fabio


> Anyway, can I have dmesg from boot with i8042.debug=1, please? Make sure
> you have a big log buffer.
>
> --
> Dmitry
>

2006-08-07 19:01:00

by Dmitry Torokhov

Subject: Re: 2.6.18-rc3-mm2

On 8/7/06, Fabio Comolli <[email protected]> wrote:
> Hi.
>
> On 8/7/06, Dmitry Torokhov <[email protected]> wrote:
> > On Sunday 06 August 2006 15:09, Andrew Morton wrote:
> > > -tycho kernel: input: PS/2 Mouse as /class/input/input1
> > > -tycho kernel: input: AlpsPS/2 ALPS GlidePoint as /class/input/input2
> > >
> > > That's not so good.
> > >
> > >
> > > Dmitry, do you have anything in there which might have caused that?
> > >
> > > Perhaps hdaps-handle-errors-from-input_register_device.patch is triggering
> > > for some reason.
> >
> > Hmm, I'd be more concerned with i8042-get-rid-of-polling-timer patch...
>
> Bingo! Reverting remove-polling-timer-from-i8042-v2.patch did the
> trick. Now I'm running 2.6.18-rc3-mm2 + hot-fixes :-)
>
> Still interested in dmesg with i8042.debug=1 ?
>

Yes, _with_ the i8042 polling patch applied. Do you have PNP support enabled?

--
Dmitry

2006-08-07 19:39:15

by Mattia Dongili

Subject: resume from S3 regression [Was: 2.6.18-rc3-mm2]

Hello,

After resume from RAM (tested in single user), I can type commands for a
few seconds (the time varies), then the processes get stuck in io_schedule.
Poorman's screenshots are here:
http://oioio.altervista.org/linux/dsc03448.jpg
http://oioio.altervista.org/linux/dsc03449.jpg

.config:
http://oioio.altervista.org/linux/config-2.6.18-rc3-mm2-1

Anything useful I could add?
--
mattia
:wq!

2006-08-07 20:02:30

by Andrew Morton

Subject: Re: resume from S3 regression [Was: 2.6.18-rc3-mm2]

On Mon, 7 Aug 2006 21:38:36 +0200
Mattia Dongili <[email protected]> wrote:

> after resume from ram (tested in single user), I can type commands for a
> few seconds (time is variable), the processes get stuck in io_schedule.
> Poorman's screenshots are here:
> http://oioio.altervista.org/linux/dsc03448.jpg
> http://oioio.altervista.org/linux/dsc03449.jpg

That probably means that the device or device driver has got itself into a
sick state and IO completions aren't occurring.

Which storage device (and which device driver) is being used here?

2006-08-07 20:35:19

by Rafael J. Wysocki

Subject: Re: 2.6.18-rc3-mm2

On Monday 07 August 2006 11:15, Rafael J. Wysocki wrote:
> On Monday 07 August 2006 00:54, Andrew Morton wrote:
> > On Mon, 7 Aug 2006 00:42:10 +0200
> > "Rafael J. Wysocki" <[email protected]> wrote:
> >
> > > On Sunday 06 August 2006 12:08, Andrew Morton wrote:
> > > >
> > > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> > >
> > > My box's (Asus L5D, x86_64) keyboard doesn't work on this kernel at all, even
> > > if I boot with init=/bin/bash. On the 2.6.18-rc2-mm1 it worked.
> > >
> > > Unfortunately I have no indication what can be wrong, no oopses, no error
> > > messages in dmesg, nothing.
> > >
> > > Right now I'm doing a binary search for the offending patch.
> > >
> >
> > Thanks. I'd zoom in on
> > hdaps-handle-errors-from-input_register_device.patch and git-input.patch.
>
> None of these, but close: remove-polling-timer-from-i8042-v2.patch breaks
> things here. [FYI, the box is booted with "noapic", because the IRQ sharing
> doesn't work otherwise due to a BIOS issue, so it may be related.]
>
> Attached is the dmesg output with i8042.debug=1 for Dmitry. It's from
> 2.6.18-rc3 with -mm2 partially applied (up to and including
> logips2pp-fix-mx300-button-layout.patch). I'll apply the rest tonight, after
> I find the patch that broke suspend for me.

Unfortunately this one is git-block.patch. I have no idea which part of it
may break the suspend.

It hangs during suspend, right after the memory has been shrunk, when devices
should be suspended. After pressing SysRq-P it shows it's spinning in the
idle thread and then hangs hard.
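
For anyone repeating this kind of hunt on the broken-out series, the
search can be sketched as below. This is only an illustration of the
bisection logic, not a tool that exists in the tree: first_bad_patch()
and demo_is_bad() are hypothetical names, is_bad(n) stands for "apply
the first n patches of the series, build, boot and test", and the sketch
assumes that zero patches is good, the full series is bad, and a single
patch introduces the failure.

```c
/*
 * Binary search over an ordered patch series. is_bad(n) is a hypothetical
 * callback reporting whether a tree with the first n patches applied
 * shows the failure. Assumes is_bad(0) is false and is_bad(total) is true.
 * Returns the 1-based index of the first patch that introduces the failure.
 */
static int first_bad_patch(int total, int (*is_bad)(int applied))
{
	int good = 0;		/* known-good number of applied patches */
	int bad = total;	/* known-bad number of applied patches */

	while (bad - good > 1) {
		int mid = good + (bad - good) / 2;

		if (is_bad(mid))
			bad = mid;
		else
			good = mid;
	}
	return bad;
}

/* Demo predicate for illustration only: pretend patch 700 is the culprit. */
static int demo_is_bad(int applied)
{
	return applied >= 700;
}
```

With the 1136 patches in this release that comes to roughly eleven
build-and-boot cycles, provided every intermediate tree builds.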

Greetings,
Rafael

2006-08-07 20:48:54

by Rafael J. Wysocki

Subject: Re: swsusp regression [Was: 2.6.18-rc3-mm2]

On Monday 07 August 2006 18:23, Jason Lunz wrote:
> In gmane.linux.kernel, you wrote:
> >> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> >
> > I tried it and guess what :)... swsusp doesn't work :@.
> >
> > This time I was able to dump process states with sysrq-t:
> > http://www.fi.muni.cz/~xslaby/sklad/ide2.gif
> >
> > My guess is ide2/2.0 dies (hpt370 driver), since last thing kernel prints is
> > suspending device 2.0
>
> Does it go away if you revert this?
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/broken-out/ide-reprogram-disk-pio-timings-on-resume.patch
>
> That should only affect resume, not suspend, but it does mess around
> with ide power management. Is this maybe happening on the *second*
> suspend?
>
> > -hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
> > +hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
>
> This looks suspicious. -mm does have several ide-fix-hpt3xx patches.

I found that git-block.patch broke the suspend for me. Still have no idea
what's up with it.

Rafael

2006-08-07 20:55:48

by Andrew Morton

Subject: Re: 2.6.18-rc3-mm2

On Mon, 7 Aug 2006 22:34:12 +0200
"Rafael J. Wysocki" <[email protected]> wrote:

> On Monday 07 August 2006 11:15, Rafael J. Wysocki wrote:
> > On Monday 07 August 2006 00:54, Andrew Morton wrote:
> > > On Mon, 7 Aug 2006 00:42:10 +0200
> > > "Rafael J. Wysocki" <[email protected]> wrote:
> > >
> > > > On Sunday 06 August 2006 12:08, Andrew Morton wrote:
> > > > >
> > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> > > >
> > > > My box's (Asus L5D, x86_64) keyboard doesn't work on this kernel at all, even
> > > > if I boot with init=/bin/bash. On the 2.6.18-rc2-mm1 it worked.
> > > >
> > > > Unfortunately I have no indication what can be wrong, no oopses, no error
> > > > messages in dmesg, nothing.
> > > >
> > > > Right now I'm doing a binary search for the offending patch.
> > > >
> > >
> > > Thanks. I'd zoom in on
> > > hdaps-handle-errors-from-input_register_device.patch and git-input.patch.
> >
> > None of these, but close: remove-polling-timer-from-i8042-v2.patch breaks
> > things here. [FYI, the box is booted with "noapic", because the IRQ sharing
> > doesn't work otherwise due to a BIOS issue, so it may be related.]
> >
> > Attached is the dmesg output with i8042.debug=1 for Dmitry. It's from
> > 2.6.18-rc3 with -mm2 partially applied (up to and including
> > logips2pp-fix-mx300-button-layout.patch). I'll apply the rest tonight, after
> > I find the patch that broke suspend for me.
>
> Unfortunately this one is git-block.patch. I have no idea which part of it
> may break the suspend.

ow, that tree is pretty huge at present.

> It hangs during suspend, right after the memory has been shrunk, when devices
> should be suspended. After pressing SysRq-P it shows it's spinning in the
> idle thread and then hangs hard.

OK, thanks for doing that. I'll drop git-block until we can get it sorted.

2006-08-07 20:57:47

by Mattia Dongili

Subject: Re: resume from S3 regression [Was: 2.6.18-rc3-mm2]

On Mon, Aug 07, 2006 at 01:02:08PM -0700, Andrew Morton wrote:
> On Mon, 7 Aug 2006 21:38:36 +0200
> Mattia Dongili <[email protected]> wrote:
>
> > after resume from ram (tested in single user), I can type commands for a
> > few seconds (time is variable), the processes get stuck in io_schedule.
> > Poorman's screenshots are here:
> > http://oioio.altervista.org/linux/dsc03448.jpg
> > http://oioio.altervista.org/linux/dsc03449.jpg
>
> That probably means that the device or device driver has got itself into a
> sick state and IO completions aren't occurring.

BTW: I tried to reverse ide-reprogram-disk-pio-timings-on-resume.patch
with no luck.

> Which storage device (and which device driver) is being used here?

A dmesg is available here (apart from the already resolved BUGs the boot
process is meaningful):
http://oioio.altervista.org/linux/dmesg-2.6.18-rc3-mm2-1
[ 3.168000] ICH3M: chipset revision 1
[ 3.168000] ICH3M: not 100% native mode: will probe irqs later
[ 3.168000] ide0: BM-DMA at 0x1860-0x1867, BIOS settings: hda:DMA, hdb:pio
[ 3.168000] ide1: BM-DMA at 0x1868-0x186f, BIOS settings: hdc:pio, hdd:pio
[ 3.168000] Probing IDE interface ide0...
[ 3.460000] hda: FUJITSU MHV2080AH, ATA DISK drive
[ 4.132000] ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
[ 4.136000] Probing IDE interface ide1...
[ 4.704000] Probing IDE interface ide1...
[ 5.272000] hda: max request size: 128KiB
[ 5.344000] hda: 156301488 sectors (80026 MB) w/8192KiB Cache, CHS=65535/16/63, UDMA(100)
[ 5.348000] hda: cache flushes supported
[ 5.352000] hda: hda1 hda2 hda3 hda4 < hda5 hda6 >

lspci reports:
00:1f.1 IDE interface: Intel Corporation 82801CAM IDE U100 (rev 01) (prog-if 8a [Master SecP PriP])
Subsystem: Sony Corporation VAIO PCG-GR214EP/GR214MP/GR215MP/GR314MP/GR315MP
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
Status: Cap- 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR-
Latency: 0
Interrupt: pin A routed to IRQ 255
Region 0: I/O ports at <ignored>
Region 1: I/O ports at <ignored>
Region 2: I/O ports at <ignored>
Region 3: I/O ports at <ignored>
Region 4: I/O ports at 1860 [size=16]
Region 5: Memory at d0000000 (32-bit, non-prefetchable) [size=1K]

--
mattia
:wq!

2006-08-07 21:04:19

by Adrian Bunk

Subject: [RFC: -mm patch] bcm43xx_main.c: remove 3 functions

This patch removes three no longer used functions (that are even
generating gcc warnings).

This patch doesn't look right, but it is the result of
58e5528ee464d38040b9489e10033c9387a10d56 in git-netdev...

Signed-off-by: Adrian Bunk <[email protected]>

---

drivers/net/wireless/bcm43xx/bcm43xx_main.c | 33 --------------------
1 file changed, 33 deletions(-)

--- linux-2.6.18-rc3-mm2-full/drivers/net/wireless/bcm43xx/bcm43xx_main.c.old 2006-08-07 18:21:31.000000000 +0200
+++ linux-2.6.18-rc3-mm2-full/drivers/net/wireless/bcm43xx/bcm43xx_main.c 2006-08-07 18:23:36.000000000 +0200
@@ -3194,39 +3194,6 @@
bcm43xx_clear_keys(bcm);
}

-static int bcm43xx_rng_read(struct hwrng *rng, u32 *data)
-{
- struct bcm43xx_private *bcm = (struct bcm43xx_private *)rng->priv;
- unsigned long flags;
-
- spin_lock_irqsave(&(bcm)->irq_lock, flags);
- *data = bcm43xx_read16(bcm, BCM43xx_MMIO_RNG);
- spin_unlock_irqrestore(&(bcm)->irq_lock, flags);
-
- return (sizeof(u16));
-}
-
-static void bcm43xx_rng_exit(struct bcm43xx_private *bcm)
-{
- hwrng_unregister(&bcm->rng);
-}
-
-static int bcm43xx_rng_init(struct bcm43xx_private *bcm)
-{
- int err;
-
- snprintf(bcm->rng_name, ARRAY_SIZE(bcm->rng_name),
- "%s_%s", KBUILD_MODNAME, bcm->net_dev->name);
- bcm->rng.name = bcm->rng_name;
- bcm->rng.data_read = bcm43xx_rng_read;
- bcm->rng.priv = (unsigned long)bcm;
- err = hwrng_register(&bcm->rng);
- if (err)
- printk(KERN_ERR PFX "RNG init failed (%d)\n", err);
-
- return err;
-}
-
static int bcm43xx_shutdown_all_wireless_cores(struct bcm43xx_private *bcm)
{
int ret = 0;

2006-08-07 21:09:15

by Jiri Slaby

Subject: Re: swsusp regression [Was: 2.6.18-rc3-mm2]

Jason Lunz wrote:
> In gmane.linux.kernel, you wrote:
>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>> I tried it and guess what :)... swsusp doesn't work :@.
>>
>> This time I was able to dump process states with sysrq-t:
>> http://www.fi.muni.cz/~xslaby/sklad/ide2.gif
>>
>> My guess is ide2/2.0 dies (hpt370 driver), since last thing kernel prints is
>> suspending device 2.0
>
> Does it go away if you revert this?
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/broken-out/ide-reprogram-disk-pio-timings-on-resume.patch

No change.

> That should only affect resume, not suspend, but it does mess around
> with ide power management. Is this maybe happening on the *second*
> suspend?

Nope, the first one.

>> -hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
>> +hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
>
> This looks suspicious. -mm does have several ide-fix-hpt3xx patches.

But hdc is not on the hpt3xx controller.

regards,
--
<a href="http://www.fi.muni.cz/~xslaby/">Jiri Slaby</a>
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E

2006-08-07 22:10:06

by Mattia Dongili

Subject: Re: resume from S3 regression [Was: 2.6.18-rc3-mm2]

On Mon, Aug 07, 2006 at 10:57:08PM +0200, Mattia Dongili wrote:
> On Mon, Aug 07, 2006 at 01:02:08PM -0700, Andrew Morton wrote:
> > On Mon, 7 Aug 2006 21:38:36 +0200
> > Mattia Dongili <[email protected]> wrote:
> >
> > > after resume from ram (tested in single user), I can type commands for a
> > > few seconds (time is variable), the processes get stuck in io_schedule.
> > > Poorman's screenshots are here:
> > > http://oioio.altervista.org/linux/dsc03448.jpg
> > > http://oioio.altervista.org/linux/dsc03449.jpg
> >
> > > That probably means that the device or device driver has got itself into a
> > sick state and IO completions aren't occurring.
>
> BTW: I tried to reverse ide-reprogram-disk-pio-timings-on-resume.patch
> with no luck.

reverting git-block.patch (plus a couple more to make the thing build)
let me resume correctly (2 cycles already).

Suggestion taken from the "swsusp regression" sub-thread.

--
mattia
:wq!

2006-08-08 04:51:51

by David Miller

Subject: Re: [-mm patch] net/: make code static

From: Adrian Bunk <[email protected]>
Date: Mon, 7 Aug 2006 17:49:47 +0200

> This patch makes needlessly global code static.
>
> Signed-off-by: Adrian Bunk <[email protected]>

Looks reasonable, applied.

> It doesn't seem to be intended that the new
> ipv4/fib_rules.c:fib4_rules_cleanup() is completely unused?

I'll kill it off.

IPv4 can't be built as a module and therefore there is no
relevant exit or module load error path for ipv4 for which
this function should be called.

Thanks.

2006-08-08 05:20:14

by Jens Axboe

Subject: Re: 2.6.18-rc3-mm2

On Mon, Aug 07 2006, Andrew Morton wrote:
> On Mon, 7 Aug 2006 22:34:12 +0200
> "Rafael J. Wysocki" <[email protected]> wrote:
>
> > On Monday 07 August 2006 11:15, Rafael J. Wysocki wrote:
> > > On Monday 07 August 2006 00:54, Andrew Morton wrote:
> > > > On Mon, 7 Aug 2006 00:42:10 +0200
> > > > "Rafael J. Wysocki" <[email protected]> wrote:
> > > >
> > > > > On Sunday 06 August 2006 12:08, Andrew Morton wrote:
> > > > > >
> > > > > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> > > > >
> > > > > My box's (Asus L5D, x86_64) keyboard doesn't work on this kernel at all, even
> > > > > if I boot with init=/bin/bash. On the 2.6.18-rc2-mm1 it worked.
> > > > >
> > > > > Unfortunately I have no indication what can be wrong, no oopses, no error
> > > > > messages in dmesg, nothing.
> > > > >
> > > > > Right now I'm doing a binary search for the offending patch.
> > > > >
> > > >
> > > > Thanks. I'd zoom in on
> > > > hdaps-handle-errors-from-input_register_device.patch and git-input.patch.
> > >
> > > None of these, but close: remove-polling-timer-from-i8042-v2.patch breaks
> > > things here. [FYI, the box is booted with "noapic", because the IRQ sharing
> > > doesn't work otherwise due to a BIOS issue, so it may be related.]
> > >
> > > Attached is the dmesg output with i8042.debug=1 for Dmitry. It's from
> > > 2.6.18-rc3 with -mm2 partially applied (up to and including
> > > logips2pp-fix-mx300-button-layout.patch). I'll apply the rest tonight, after
> > > I find the patch that broke suspend for me.
> >
> > Unfortunately this one is git-block.patch. I have no idea which part of it
> > may break the suspend.
>
> ow, that tree is pretty huge at present.
>
> > It hangs during suspend, right after the memory has been shrunk, when devices
> > should be suspended. After pressing SysRq-P it shows it's spinning in the
> > idle thread and then hangs hard.
>
> OK, thanks for doing that. I'll drop git-block until we can get it sorted.

I think I know what it is, hang on.

--
Jens Axboe

2006-08-08 08:40:09

by Jens Axboe

Subject: Re: swsusp regression [Was: 2.6.18-rc3-mm2]

On Mon, Aug 07 2006, Rafael J. Wysocki wrote:
> On Monday 07 August 2006 18:23, Jason Lunz wrote:
> > In gmane.linux.kernel, you wrote:
> > >> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> > >
> > > I tried it and guess what :)... swsusp doesn't work :@.
> > >
> > > This time I was able to dump process states with sysrq-t:
> > > http://www.fi.muni.cz/~xslaby/sklad/ide2.gif
> > >
> > > My guess is ide2/2.0 dies (hpt370 driver), since last thing kernel prints is
> > > suspending device 2.0
> >
> > Does it go away if you revert this?
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/broken-out/ide-reprogram-disk-pio-timings-on-resume.patch
> >
> > That should only affect resume, not suspend, but it does mess around
> > with ide power management. Is this maybe happening on the *second*
> > suspend?
> >
> > > -hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
> > > +hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
> >
> > This looks suspicious. -mm does have several ide-fix-hpt3xx patches.
>
> I found that git-block.patch broke the suspend for me. Still have no idea
> what's up with it.

Can you apply this on top of -mm and see if that fixes it?

diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
index d2339e9..db647a9 100644
--- a/drivers/ide/ide-io.c
+++ b/drivers/ide/ide-io.c
@@ -390,7 +390,7 @@ void ide_end_drive_cmd (ide_drive_t *dri
args[5] = hwif->INB(IDE_HCYL_REG);
args[6] = hwif->INB(IDE_SELECT_REG);
}
- } else if (rq->cmd_type & REQ_TYPE_ATA_TASKFILE) {
+ } else if (rq->cmd_type == REQ_TYPE_ATA_TASKFILE) {
ide_task_t *args = (ide_task_t *) rq->special;
if (rq->errors == 0)
rq->errors = !OK_STAT(stat,READY_STAT,BAD_STAT);

--
Jens Axboe
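
For readers following the one-line change above, here is a minimal sketch of why `&` and `==` can disagree. It assumes the request types are sequential enum values rather than independent bit flags; the names and numbers below are hypothetical and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request types numbered sequentially (NOT one bit each).
 * Names and values are illustrative only, not taken from the kernel. */
enum cmd_type {
	TYPE_FS           = 1,   /* binary 0001 */
	TYPE_PM_SUSPEND   = 4,   /* binary 0100 */
	TYPE_ATA_TASKFILE = 12   /* binary 1100 */
};

/* Old-style test: true whenever the two values share any set bit. */
bool is_taskfile_bitwise(enum cmd_type t)
{
	return (t & TYPE_ATA_TASKFILE) != 0;
}

/* Fixed test: true only for the exact type. */
bool is_taskfile_exact(enum cmd_type t)
{
	return t == TYPE_ATA_TASKFILE;
}

/* Small self-check showing where the two tests disagree. */
void check_cmd_type_tests(void)
{
	/* 4 & 12 == 4, so the bitwise test reports a false positive. */
	assert(is_taskfile_bitwise(TYPE_PM_SUSPEND));
	assert(!is_taskfile_exact(TYPE_PM_SUSPEND));
	assert(is_taskfile_exact(TYPE_ATA_TASKFILE));
}
```

With values like these, a request of an unrelated type would take the taskfile branch under the `&` test. Whether that is exactly what happened in -mm2 depends on the real enum values, which this thread does not show.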

2006-08-08 09:48:29

by Jiri Slaby

Subject: Re: swsusp regression [Was: 2.6.18-rc3-mm2]

Jens Axboe wrote:
> On Mon, Aug 07 2006, Rafael J. Wysocki wrote:
>> On Monday 07 August 2006 18:23, Jason Lunz wrote:
>>> In gmane.linux.kernel, you wrote:
>>>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>>>> I tried it and guess what :)... swsusp doesn't work :@.
>>>>
>>>> This time I was able to dump process states with sysrq-t:
>>>> http://www.fi.muni.cz/~xslaby/sklad/ide2.gif
>>>>
>>>> My guess is ide2/2.0 dies (hpt370 driver), since last thing kernel prints is
>>>> suspending device 2.0
>>> Does it go away if you revert this?
>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/broken-out/ide-reprogram-disk-pio-timings-on-resume.patch
>>>
>>> That should only affect resume, not suspend, but it does mess around
>>> with ide power management. Is this maybe happening on the *second*
>>> suspend?
>>>
>>>> -hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
>>>> +hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
>>> This looks suspicious. -mm does have several ide-fix-hpt3xx patches.
>> I found that git-block.patch broke the suspend for me. Still have no idea
>> what's up with it.
>
> Can you apply this on top of -mm and see if that fixes it?

It doesn't solve the problem for me.

> diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
> index d2339e9..db647a9 100644
> --- a/drivers/ide/ide-io.c
> +++ b/drivers/ide/ide-io.c
> @@ -390,7 +390,7 @@ void ide_end_drive_cmd (ide_drive_t *dri
> args[5] = hwif->INB(IDE_HCYL_REG);
> args[6] = hwif->INB(IDE_SELECT_REG);
> }
> - } else if (rq->cmd_type & REQ_TYPE_ATA_TASKFILE) {
> + } else if (rq->cmd_type == REQ_TYPE_ATA_TASKFILE) {
> ide_task_t *args = (ide_task_t *) rq->special;
> if (rq->errors == 0)
> rq->errors = !OK_STAT(stat,READY_STAT,BAD_STAT);
>

regards,
--
<a href="http://www.fi.muni.cz/~xslaby/">Jiri Slaby</a>
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E

2006-08-08 10:07:31

by Jiri Slaby

Subject: Re: swsusp regression [Was: 2.6.18-rc3-mm2]

Rafael J. Wysocki wrote:
> On Monday 07 August 2006 18:23, Jason Lunz wrote:
>> In gmane.linux.kernel, you wrote:
>>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>>> I tried it and guess what :)... swsusp doesn't work :@.
>>>
>>> This time I was able to dump process states with sysrq-t:
>>> http://www.fi.muni.cz/~xslaby/sklad/ide2.gif
>>>
>>> My guess is ide2/2.0 dies (hpt370 driver), since last thing kernel prints is
>>> suspending device 2.0
>> Does it go away if you revert this?
>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/broken-out/ide-reprogram-disk-pio-timings-on-resume.patch
>>
>> That should only affect resume, not suspend, but it does mess around
>> with ide power management. Is this maybe happening on the *second*
>> suspend?
>>
>>> -hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
>>> +hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
>> This looks suspicious. -mm does have several ide-fix-hpt3xx patches.
>
> I found that git-block.patch broke the suspend for me. Still have no idea
> what's up with it.

I suspect the elevator changes. The wait_for_completion in ide-io is not woken by
ll_rw_blk. But I don't understand the block layer very well. Where should
blk_end_sync_rq be called from (and why is it not called at all)?

regards,
--
<a href="http://www.fi.muni.cz/~xslaby/">Jiri Slaby</a>
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E

2006-08-08 10:41:56

by Jens Axboe

Subject: Re: swsusp regression [Was: 2.6.18-rc3-mm2]

On Tue, Aug 08 2006, Jiri Slaby wrote:
> Jens Axboe wrote:
> >On Mon, Aug 07 2006, Rafael J. Wysocki wrote:
> >>On Monday 07 August 2006 18:23, Jason Lunz wrote:
> >>>In gmane.linux.kernel, you wrote:
> >>>>>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> >>>>I tried it and guess what :)... swsusp doesn't work :@.
> >>>>
> >>>>This time I was able to dump process states with sysrq-t:
> >>>>http://www.fi.muni.cz/~xslaby/sklad/ide2.gif
> >>>>
> >>>>My guess is ide2/2.0 dies (hpt370 driver), since last thing kernel
> >>>>prints is suspending device 2.0
> >>>Does it go away if you revert this?
> >>>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/broken-out/ide-reprogram-disk-pio-timings-on-resume.patch
> >>>
> >>>That should only affect resume, not suspend, but it does mess around
> >>>with ide power management. Is this maybe happening on the *second*
> >>>suspend?
> >>>
> >>>>-hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
> >>>>+hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
> >>>This looks suspicious. -mm does have several ide-fix-hpt3xx patches.
> >>I found that git-block.patch broke the suspend for me. Still have no idea
> >>what's up with it.
> >
> >Can you apply this on top of -mm and see if that fixes it?
>
> It doesn't solve the problem for me.

Ok, thanks for testing, I'll try and reproduce it here.

--
Jens Axboe

2006-08-08 10:44:57

by Jens Axboe

Subject: Re: swsusp regression [Was: 2.6.18-rc3-mm2]

On Tue, Aug 08 2006, Jiri Slaby wrote:
> Rafael J. Wysocki wrote:
> >On Monday 07 August 2006 18:23, Jason Lunz wrote:
> >>In gmane.linux.kernel, you wrote:
> >>>>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> >>>I tried it and guess what :)... swsusp doesn't work :@.
> >>>
> >>>This time I was able to dump process states with sysrq-t:
> >>>http://www.fi.muni.cz/~xslaby/sklad/ide2.gif
> >>>
> >>>My guess is ide2/2.0 dies (hpt370 driver), since last thing kernel
> >>>prints is suspending device 2.0
> >>Does it go away if you revert this?
> >>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/broken-out/ide-reprogram-disk-pio-timings-on-resume.patch
> >>
> >>That should only affect resume, not suspend, but it does mess around
> >>with ide power management. Is this maybe happening on the *second*
> >>suspend?
> >>
> >>>-hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
> >>>+hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
> >>This looks suspicious. -mm does have several ide-fix-hpt3xx patches.
> >
> >I found that git-block.patch broke the suspend for me. Still have no idea
> >what's up with it.
>
> I suspect elevator changes. The wait_for_completion is not woken in
> ide-io by ll_rw_blk. But I don't understand block layer too much.

The ide changes are far more likely, it's probably missing a completion.

> Where should blk_end_sync_rq be called from (and why is it not called at
> all)?

It's called from ->end_io() in end_that_request_last().

--
Jens Axboe
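
The completion path described above can be sketched in user-space C. This is only an illustration under assumptions: the struct layout is simplified, `finish_request()` and `end_sync_rq()` are stand-ins for `end_that_request_last()` and `blk_end_sync_rq()`, and a plain `done` flag replaces a real kernel completion.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a kernel completion. */
struct completion { int done; };

/* Simplified request: a completion callback plus its private data. */
struct request {
	void (*end_io)(struct request *rq, int error);
	void *end_io_data;
};

/* Stand-in for blk_end_sync_rq(): signal the waiter kept in end_io_data. */
void end_sync_rq(struct request *rq, int error)
{
	struct completion *waiting = rq->end_io_data;

	(void)error;
	assert(waiting != NULL);
	rq->end_io_data = NULL;
	waiting->done = 1;
}

/* Stand-in for end_that_request_last(): run the callback if one is set. */
void finish_request(struct request *rq, int error)
{
	if (rq->end_io)
		rq->end_io(rq, error);
}

/* Returns 1 if the waiter was signalled, 0 if nobody woke it. */
int submit_and_finish(int install_callback)
{
	struct completion wait = { 0 };
	struct request rq = { NULL, &wait };

	if (install_callback)
		rq.end_io = end_sync_rq;
	finish_request(&rq, 0);
	return wait.done;
}
```

In this sketch, `submit_and_finish(0)` leaves the waiter unsignalled. That is the kind of "missing completion" that would leave a caller blocked in wait_for_completion.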

2006-08-08 11:00:30

by Rafael J. Wysocki

Subject: Re: swsusp regression [Was: 2.6.18-rc3-mm2]

On Tuesday 08 August 2006 12:43, Jens Axboe wrote:
> On Tue, Aug 08 2006, Jiri Slaby wrote:
> > Rafael J. Wysocki wrote:
> > >On Monday 07 August 2006 18:23, Jason Lunz wrote:
> > >>In gmane.linux.kernel, you wrote:
> > >>>>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> > >>>I tried it and guess what :)... swsusp doesn't work :@.
> > >>>
> > >>>This time I was able to dump process states with sysrq-t:
> > >>>http://www.fi.muni.cz/~xslaby/sklad/ide2.gif
> > >>>
> > >>>My guess is ide2/2.0 dies (hpt370 driver), since last thing kernel
> > >>>prints is suspending device 2.0
> > >>Does it go away if you revert this?
> > >>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/broken-out/ide-reprogram-disk-pio-timings-on-resume.patch
> > >>
> > >>That should only affect resume, not suspend, but it does mess around
> > >>with ide power management. Is this maybe happening on the *second*
> > >>suspend?
> > >>
> > >>>-hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
> > >>>+hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
> > >>This looks suspicious. -mm does have several ide-fix-hpt3xx patches.
> > >
> > >I found that git-block.patch broke the suspend for me. Still have no idea
> > >what's up with it.
> >
> > I suspect elevator changes. The wait_for_completion is not woken in
> > ide-io by ll_rw_blk. But I don't understand block layer too much.
>
> The ide changes are far more likely, it's probably missing a completion.

Actually I think the commit f74bf2e6b415588e562fdcfdd454d587eb33cd46
(Remove ->waiting member from struct request) is wrong, because
generic_ide_suspend() uses the end_io_data member of rq to pass the PM data
to ide_do_drive_cmd() where the pointer gets overwritten by &wait (must_wait
is "true", because action == ide_wait). Previously &wait was stored in
rq->waiting and it didn't overwrite the PM data.

Haven't tested yet, though.

Greetings,
Rafael
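
A rough user-space sketch of the clobbering described above, under stated assumptions: the structs are heavily simplified, `prepare_wait()` is a hypothetical stand-in for the must_wait branch of ide_do_drive_cmd(), and the second field (called `data` here) plays the role that a separate slot such as the old rq->waiting used to play.

```c
#include <assert.h>
#include <stddef.h>

struct pm_state   { int pm_step; };
struct completion { int done; };

/* Simplified request with two independent pointer slots. */
struct request {
	void *end_io_data; /* claimed by the wait path */
	void *data;        /* separate slot for driver-private data */
};

/* Stand-in for the must_wait branch: always stores &wait in end_io_data. */
void prepare_wait(struct request *rq, struct completion *wait)
{
	assert(rq != NULL && wait != NULL);
	rq->end_io_data = wait;
}

/* PM data kept in end_io_data: returns 1 if it survives the wait setup. */
int pm_survives_in_shared_slot(void)
{
	struct pm_state pm = { 1 };
	struct completion wait = { 0 };
	struct request rq = { &pm, NULL };

	prepare_wait(&rq, &wait);
	return rq.end_io_data == (void *)&pm;
}

/* PM data kept in its own slot: returns 1 if it survives the wait setup. */
int pm_survives_in_own_slot(void)
{
	struct pm_state pm = { 1 };
	struct completion wait = { 0 };
	struct request rq = { NULL, &pm };

	prepare_wait(&rq, &wait);
	return rq.data == (void *)&pm;
}
```

Here the first helper returns 0 because the PM pointer is overwritten by &wait, while the second returns 1. Any later code that reads the shared slot as PM state would be looking at the completion instead.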

2006-08-08 11:03:44

by Jens Axboe

Subject: Re: swsusp regression [Was: 2.6.18-rc3-mm2]

On Tue, Aug 08 2006, Rafael J. Wysocki wrote:
> On Tuesday 08 August 2006 12:43, Jens Axboe wrote:
> > On Tue, Aug 08 2006, Jiri Slaby wrote:
> > > Rafael J. Wysocki wrote:
> > > >On Monday 07 August 2006 18:23, Jason Lunz wrote:
> > > >>In gmane.linux.kernel, you wrote:
> > > >>>>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> > > >>>I tried it and guess what :)... swsusp doesn't work :@.
> > > >>>
> > > >>>This time I was able to dump process states with sysrq-t:
> > > >>>http://www.fi.muni.cz/~xslaby/sklad/ide2.gif
> > > >>>
> > > >>>My guess is ide2/2.0 dies (hpt370 driver), since last thing kernel
> > > >>>prints is suspending device 2.0
> > > >>Does it go away if you revert this?
> > > >>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/broken-out/ide-reprogram-disk-pio-timings-on-resume.patch
> > > >>
> > > >>That should only affect resume, not suspend, but it does mess around
> > > >>with ide power management. Is this maybe happening on the *second*
> > > >>suspend?
> > > >>
> > > >>>-hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
> > > >>>+hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
> > > >>This looks suspicious. -mm does have several ide-fix-hpt3xx patches.
> > > >
> > > >I found that git-block.patch broke the suspend for me. Still have no idea
> > > >what's up with it.
> > >
> > > I suspect elevator changes. The wait_for_completion is not woken in
> > > ide-io by ll_rw_blk. But I don't understand block layer too much.
> >
> > The ide changes are far more likely, it's probably missing a completion.
>
> Actually I think the commit f74bf2e6b415588e562fdcfdd454d587eb33cd46
> (Remove ->waiting member from struct request) is wrong, because
> generic_ide_suspend() uses the end_io_data member of rq to pass the PM data
> to ide_do_drive_cmd() where the pointer gets overwritten by &wait (must_wait
> is "true", because action == ide_wait). Previously &wait was stored in
> rq->waiting and it didn't overwrite the PM data.

Indeed, that looks broken now. That must be what is screwing it up. With
the former patch applied, did cdrom detection still look funny to you?

I'll concoct a fix for that breakage.

--
Jens Axboe

2006-08-08 11:06:21

by Jens Axboe

Subject: Re: swsusp regression [Was: 2.6.18-rc3-mm2]

On Tue, Aug 08 2006, Jens Axboe wrote:
> On Tue, Aug 08 2006, Rafael J. Wysocki wrote:
> > On Tuesday 08 August 2006 12:43, Jens Axboe wrote:
> > > On Tue, Aug 08 2006, Jiri Slaby wrote:
> > > > Rafael J. Wysocki wrote:
> > > > >On Monday 07 August 2006 18:23, Jason Lunz wrote:
> > > > >>In gmane.linux.kernel, you wrote:
> > > > >>>>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> > > > >>>I tried it and guess what :)... swsusp doesn't work :@.
> > > > >>>
> > > > >>>This time I was able to dump process states with sysrq-t:
> > > > >>>http://www.fi.muni.cz/~xslaby/sklad/ide2.gif
> > > > >>>
> > > > >>>My guess is ide2/2.0 dies (hpt370 driver), since last thing kernel
> > > > >>>prints is suspending device 2.0
> > > > >>Does it go away if you revert this?
> > > > >>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/broken-out/ide-reprogram-disk-pio-timings-on-resume.patch
> > > > >>
> > > > >>That should only affect resume, not suspend, but it does mess around
> > > > >>with ide power management. Is this maybe happening on the *second*
> > > > >>suspend?
> > > > >>
> > > > >>>-hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
> > > > >>>+hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
> > > > >>This looks suspicious. -mm does have several ide-fix-hpt3xx patches.
> > > > >
> > > > >I found that git-block.patch broke the suspend for me. Still have no idea
> > > > >what's up with it.
> > > >
> > > > I suspect elevator changes. The wait_for_completion is not woken in
> > > > ide-io by ll_rw_blk. But I don't understand block layer too much.
> > >
> > > The ide changes are far more likely, it's probably missing a completion.
> >
> > Actually I think the commit f74bf2e6b415588e562fdcfdd454d587eb33cd46
> > (Remove ->waiting member from struct request) is wrong, because
> > generic_ide_suspend() uses the end_io_data member of rq to pass the PM data
> > to ide_do_drive_cmd() where the pointer gets overwritten by &wait (must_wait
> > is "true", because action == ide_wait). Previously &wait was stored in
> > rq->waiting and it didn't overwrite the PM data.
>
> Indeed, that looks broken now. That must be what is screwing it up. With
> the former patch applied, did cdrom detection still look funny to you?
>
> I'll concoct a fix for that breakage.

Something like this.

diff --git a/drivers/ide/ide-io.c b/drivers/ide/ide-io.c
index db647a9..38479a2 100644
--- a/drivers/ide/ide-io.c
+++ b/drivers/ide/ide-io.c
@@ -141,7 +141,7 @@ enum {

static void ide_complete_power_step(ide_drive_t *drive, struct request *rq, u8 stat, u8 error)
{
- struct request_pm_state *pm = rq->end_io_data;
+ struct request_pm_state *pm = rq->data;

if (drive->media != ide_disk)
return;
@@ -164,7 +164,7 @@ static void ide_complete_power_step(ide_

static ide_startstop_t ide_start_power_step(ide_drive_t *drive, struct request *rq)
{
- struct request_pm_state *pm = rq->end_io_data;
+ struct request_pm_state *pm = rq->data;
ide_task_t *args = rq->special;

memset(args, 0, sizeof(*args));
@@ -421,7 +421,7 @@ void ide_end_drive_cmd (ide_drive_t *dri
}
}
} else if (blk_pm_request(rq)) {
- struct request_pm_state *pm = rq->end_io_data;
+ struct request_pm_state *pm = rq->data;
#ifdef DEBUG_PM
printk("%s: complete_power_step(step: %d, stat: %x, err: %x)\n",
drive->name, rq->pm->pm_step, stat, err);
@@ -933,7 +933,7 @@ #endif

static void ide_check_pm_state(ide_drive_t *drive, struct request *rq)
{
- struct request_pm_state *pm = rq->end_io_data;
+ struct request_pm_state *pm = rq->data;

if (blk_pm_suspend_request(rq) &&
pm->pm_step == ide_pm_state_start_suspend)
@@ -1018,7 +1018,7 @@ #endif
rq->cmd_type == REQ_TYPE_ATA_TASKFILE)
return execute_drive_cmd(drive, rq);
else if (blk_pm_request(rq)) {
- struct request_pm_state *pm = rq->end_io_data;
+ struct request_pm_state *pm = rq->data;
#ifdef DEBUG_PM
printk("%s: start_power_step(step: %d)\n",
drive->name, rq->pm->pm_step);
diff --git a/drivers/ide/ide.c b/drivers/ide/ide.c
index d7b4499..0fd1e1c 100644
--- a/drivers/ide/ide.c
+++ b/drivers/ide/ide.c
@@ -1219,7 +1219,7 @@ static int generic_ide_suspend(struct de
memset(&args, 0, sizeof(args));
rq.cmd_type = REQ_TYPE_PM_SUSPEND;
rq.special = &args;
- rq.end_io_data = &rqpm;
+ rq.data = &rqpm;
rqpm.pm_step = ide_pm_state_start_suspend;
rqpm.pm_state = state.event;

@@ -1238,7 +1238,7 @@ static int generic_ide_resume(struct dev
memset(&args, 0, sizeof(args));
rq.cmd_type = REQ_TYPE_PM_RESUME;
rq.special = &args;
- rq.end_io_data = &rqpm;
+ rq.data = &rqpm;
rqpm.pm_step = ide_pm_state_start_resume;
rqpm.pm_state = PM_EVENT_ON;


--
Jens Axboe

2006-08-08 11:16:57

by Rafael J. Wysocki

Subject: Re: swsusp regression [Was: 2.6.18-rc3-mm2]

On Tuesday 08 August 2006 13:07, Jens Axboe wrote:
> On Tue, Aug 08 2006, Jens Axboe wrote:
> > On Tue, Aug 08 2006, Rafael J. Wysocki wrote:
> > > On Tuesday 08 August 2006 12:43, Jens Axboe wrote:
> > > > On Tue, Aug 08 2006, Jiri Slaby wrote:
> > > > > Rafael J. Wysocki wrote:
> > > > > >On Monday 07 August 2006 18:23, Jason Lunz wrote:
> > > > > >>In gmane.linux.kernel, you wrote:
> > > > > >>>>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> > > > > >>>I tried it and guess what :)... swsusp doesn't work :@.
> > > > > >>>
> > > > > >>>This time I was able to dump process states with sysrq-t:
> > > > > >>>http://www.fi.muni.cz/~xslaby/sklad/ide2.gif
> > > > > >>>
> > > > > >>>My guess is ide2/2.0 dies (hpt370 driver), since last thing kernel
> > > > > >>>prints is suspending device 2.0
> > > > > >>Does it go away if you revert this?
> > > > > >>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/broken-out/ide-reprogram-disk-pio-timings-on-resume.patch
> > > > > >>
> > > > > >>That should only affect resume, not suspend, but it does mess around
> > > > > >>with ide power management. Is this maybe happening on the *second*
> > > > > >>suspend?
> > > > > >>
> > > > > >>>-hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
> > > > > >>>+hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
> > > > > >>This looks suspicious. -mm does have several ide-fix-hpt3xx patches.
> > > > > >
> > > > > >I found that git-block.patch broke the suspend for me. Still have no idea
> > > > > >what's up with it.
> > > > >
> > > > > I suspect elevator changes. The wait_for_completion is not woken in
> > > > > ide-io by ll_rw_blk. But I don't understand block layer too much.
> > > >
> > > > The ide changes are far more likely, it's probably missing a completion.
> > >
> > > Actually I think the commit f74bf2e6b415588e562fdcfdd454d587eb33cd46
> > > (Remove ->waiting member from struct request) is wrong, because
> > > generic_ide_suspend() uses the end_io_data member of rq to pass the PM data
> > > to ide_do_drive_cmd() where the pointer gets overwritten by &wait (must_wait
> > > is "true", because action == ide_wait). Previously &wait was stored in
> > > rq->waiting and it didn't overwrite the PM data.
> >
> > Indeed, that looks broken now. That must be what is screwing it up. With
> > the former patch applied, did cdrom detection still look funny to you?

Hm, I'm not sure what you mean ...

> >
> > I'll concoct a fix for that breakage.
>
> Something like this.

Looks good, I'll give it a try.

Rafael

2006-08-08 11:18:19

by Jens Axboe

Subject: Re: swsusp regression [Was: 2.6.18-rc3-mm2]

On Tue, Aug 08 2006, Rafael J. Wysocki wrote:
> On Tuesday 08 August 2006 13:07, Jens Axboe wrote:
> > On Tue, Aug 08 2006, Jens Axboe wrote:
> > > On Tue, Aug 08 2006, Rafael J. Wysocki wrote:
> > > > On Tuesday 08 August 2006 12:43, Jens Axboe wrote:
> > > > > On Tue, Aug 08 2006, Jiri Slaby wrote:
> > > > > > Rafael J. Wysocki wrote:
> > > > > > >On Monday 07 August 2006 18:23, Jason Lunz wrote:
> > > > > > >>In gmane.linux.kernel, you wrote:
> > > > > > >>>>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> > > > > > >>>I tried it and guess what :)... swsusp doesn't work :@.
> > > > > > >>>
> > > > > > >>>This time I was able to dump process states with sysrq-t:
> > > > > > >>>http://www.fi.muni.cz/~xslaby/sklad/ide2.gif
> > > > > > >>>
> > > > > > >>>My guess is ide2/2.0 dies (hpt370 driver), since last thing kernel
> > > > > > >>>prints is suspending device 2.0
> > > > > > >>Does it go away if you revert this?
> > > > > > >>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/broken-out/ide-reprogram-disk-pio-timings-on-resume.patch
> > > > > > >>
> > > > > > >>That should only affect resume, not suspend, but it does mess around
> > > > > > >>with ide power management. Is this maybe happening on the *second*
> > > > > > >>suspend?
> > > > > > >>
> > > > > > >>>-hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
> > > > > > >>>+hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
> > > > > > >>This looks suspicious. -mm does have several ide-fix-hpt3xx patches.
> > > > > > >
> > > > > > >I found that git-block.patch broke the suspend for me. Still have no idea
> > > > > > >what's up with it.
> > > > > >
> > > > > > I suspect elevator changes. The wait_for_completion is not woken in
> > > > > > ide-io by ll_rw_blk. But I don't understand block layer too much.
> > > > >
> > > > > The ide changes are far more likely, it's probably missing a completion.
> > > >
> > > > Actually I think the commit f74bf2e6b415588e562fdcfdd454d587eb33cd46
> > > > (Remove ->waiting member from struct request) is wrong, because
> > > > generic_ide_suspend() uses the end_io_data member of rq to pass the PM data
> > > > to ide_do_drive_cmd() where the pointer gets overwritten by &wait (must_wait
> > > > is "true", because action == ide_wait). Previously &wait was stored in
> > > > rq->waiting and it didn't overwrite the PM data.
> > >
> > > Indeed, that looks broken now. That must be what is screwing it up. With
> > > the former patch applied, did cdrom detection still look funny to you?
>
> Hm, I'm not sure what you mean ...

-hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
+hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)

But perhaps that wasn't you?

> > > I'll concoct a fix for that breakage.
> >
> > Something like this.
>
> Looks good, I'll give it a try.

Thanks!

--
Jens Axboe

2006-08-08 13:51:33

by Rafael J. Wysocki

Subject: Re: swsusp regression [Was: 2.6.18-rc3-mm2]

On Tuesday 08 August 2006 13:19, Jens Axboe wrote:
> On Tue, Aug 08 2006, Rafael J. Wysocki wrote:
> > On Tuesday 08 August 2006 13:07, Jens Axboe wrote:
> > > On Tue, Aug 08 2006, Jens Axboe wrote:
> > > > On Tue, Aug 08 2006, Rafael J. Wysocki wrote:
> > > > > On Tuesday 08 August 2006 12:43, Jens Axboe wrote:
> > > > > > On Tue, Aug 08 2006, Jiri Slaby wrote:
> > > > > > > Rafael J. Wysocki wrote:
> > > > > > > >On Monday 07 August 2006 18:23, Jason Lunz wrote:
> > > > > > > >>In gmane.linux.kernel, you wrote:
> > > > > > > >>>>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> > > > > > > >>>I tried it and guess what :)... swsusp doesn't work :@.
> > > > > > > >>>
> > > > > > > >>>This time I was able to dump process states with sysrq-t:
> > > > > > > >>>http://www.fi.muni.cz/~xslaby/sklad/ide2.gif
> > > > > > > >>>
> > > > > > > >>>My guess is ide2/2.0 dies (hpt370 driver), since last thing kernel
> > > > > > > >>>prints is suspending device 2.0
> > > > > > > >>Does it go away if you revert this?
> > > > > > > >>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/broken-out/ide-reprogram-disk-pio-timings-on-resume.patch
> > > > > > > >>
> > > > > > > >>That should only affect resume, not suspend, but it does mess around
> > > > > > > >>with ide power management. Is this maybe happening on the *second*
> > > > > > > >>suspend?
> > > > > > > >>
> > > > > > > >>>-hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
> > > > > > > >>>+hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
> > > > > > > >>This looks suspicious. -mm does have several ide-fix-hpt3xx patches.
> > > > > > > >
> > > > > > > >I found that git-block.patch broke the suspend for me. Still have no idea
> > > > > > > >what's up with it.
> > > > > > >
> > > > > > > I suspect elevator changes. The wait_for_completion is not woken in
> > > > > > > ide-io by ll_rw_blk. But I don't understand block layer too much.
> > > > > >
> > > > > > The ide changes are far more likely, it's probably missing a completion.
> > > > >
> > > > > Actually I think the commit f74bf2e6b415588e562fdcfdd454d587eb33cd46
> > > > > (Remove ->waiting member from struct request) is wrong, because
> > > > > generic_ide_suspend() uses the end_io_data member of rq to pass the PM data
> > > > > to ide_do_drive_cmd() where the pointer gets overwritten by &wait (must_wait
> > > > > is "true", because action == ide_wait). Previously &wait was stored in
> > > > > rq->waiting and it didn't overwrite the PM data.
> > > >
> > > > Indeed, that looks broken now. That must be what is screwing it up. With
> > > > the former patch applied, did cdrom detection still look funny to you?
> >
> > Hm, I'm not sure what you mean ...
>
> -hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
> +hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)

Ah, that.

> But perhaps that wasn't you?

No, that wasn't me. :-)

> > > > I'll concoct a fix for that breakage.
> > >
> > > Something like this.
> >
> > Looks good, I'll give it a try.
>
> Thanks!

It fixes this particular issue for me, but your first patch (appended) is also
needed to prevent the box from hanging later during the resume (when it
tries to save the image).

Thanks,
Rafael


--
drivers/ide/ide-io.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.18-rc3-mm2/drivers/ide/ide-io.c
===================================================================
--- linux-2.6.18-rc3-mm2.orig/drivers/ide/ide-io.c
+++ linux-2.6.18-rc3-mm2/drivers/ide/ide-io.c
@@ -402,7 +402,7 @@ void ide_end_drive_cmd (ide_drive_t *dri
args[5] = hwif->INB(IDE_HCYL_REG);
args[6] = hwif->INB(IDE_SELECT_REG);
}
- } else if (rq->cmd_type & REQ_TYPE_ATA_TASKFILE) {
+ } else if (rq->cmd_type == REQ_TYPE_ATA_TASKFILE) {
ide_task_t *args = (ide_task_t *) rq->special;
if (rq->errors == 0)
rq->errors = !OK_STAT(stat,READY_STAT,BAD_STAT);
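
The one-character change in the hunk above (`&` to `==`) presumably matters because cmd_type holds one enumerated value rather than a set of flag bits, so a bitwise test can match unrelated request types. A minimal sketch of that failure mode follows. The enum, its numeric values, and the helper names are invented for illustration and are not the kernel's real REQ_TYPE_* definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request types; the numeric values are invented for
 * illustration and are NOT the kernel's real REQ_TYPE_* values. */
enum demo_cmd_type {
	DEMO_TYPE_FS           = 1,
	DEMO_TYPE_PM_SUSPEND   = 4,
	DEMO_TYPE_ATA_TASKFILE = 12,
};

/* Buggy form: treats an enumerated value as if it were a bitmask. */
bool is_taskfile_by_mask(enum demo_cmd_type type)
{
	return (type & DEMO_TYPE_ATA_TASKFILE) != 0;
}

/* Fixed form: an enumerated value has to be compared for equality. */
bool is_taskfile_by_equality(enum demo_cmd_type type)
{
	return type == DEMO_TYPE_ATA_TASKFILE;
}

/* Small self-check showing where the two forms disagree. */
void demo_cmd_type_self_check(void)
{
	/* 4 & 12 is non-zero, so the mask test misfires on a PM request. */
	assert(is_taskfile_by_mask(DEMO_TYPE_PM_SUSPEND));
	assert(!is_taskfile_by_equality(DEMO_TYPE_PM_SUSPEND));
}
```

With values like these, a suspend request would be handled as a taskfile request by the mask test, which is the kind of misclassification the `==` comparison rules out.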

2006-08-08 14:04:53

by Jens Axboe

Subject: Re: swsusp regression [Was: 2.6.18-rc3-mm2]

On Tue, Aug 08 2006, Rafael J. Wysocki wrote:
> On Tuesday 08 August 2006 13:19, Jens Axboe wrote:
> > On Tue, Aug 08 2006, Rafael J. Wysocki wrote:
> > > On Tuesday 08 August 2006 13:07, Jens Axboe wrote:
> > > > On Tue, Aug 08 2006, Jens Axboe wrote:
> > > > > On Tue, Aug 08 2006, Rafael J. Wysocki wrote:
> > > > > > On Tuesday 08 August 2006 12:43, Jens Axboe wrote:
> > > > > > > On Tue, Aug 08 2006, Jiri Slaby wrote:
> > > > > > > > Rafael J. Wysocki wrote:
> > > > > > > > >On Monday 07 August 2006 18:23, Jason Lunz wrote:
> > > > > > > > >>In gmane.linux.kernel, you wrote:
> > > > > > > > >>>>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> > > > > > > > >>>I tried it and guess what :)... swsusp doesn't work :@.
> > > > > > > > >>>
> > > > > > > > >>>This time I was able to dump process states with sysrq-t:
> > > > > > > > >>>http://www.fi.muni.cz/~xslaby/sklad/ide2.gif
> > > > > > > > >>>
> > > > > > > > >>>My guess is ide2/2.0 dies (hpt370 driver), since last thing kernel
> > > > > > > > >>>prints is suspending device 2.0
> > > > > > > > >>Does it go away if you revert this?
> > > > > > > > >>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/broken-out/ide-reprogram-disk-pio-timings-on-resume.patch
> > > > > > > > >>
> > > > > > > > >>That should only affect resume, not suspend, but it does mess around
> > > > > > > > >>with ide power management. Is this maybe happening on the *second*
> > > > > > > > >>suspend?
> > > > > > > > >>
> > > > > > > > >>>-hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
> > > > > > > > >>>+hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
> > > > > > > > >>This looks suspicious. -mm does have several ide-fix-hpt3xx patches.
> > > > > > > > >
> > > > > > > > >I found that git-block.patch broke the suspend for me. Still have no idea
> > > > > > > > >what's up with it.
> > > > > > > >
> > > > > > > > I suspect elevator changes. The wait_for_completion is not woken in
> > > > > > > > ide-io by ll_rw_blk. But I don't understand block layer too much.
> > > > > > >
> > > > > > > The ide changes are far more likely, it's probably missing a completion.
> > > > > >
> > > > > > Actually I think the commit f74bf2e6b415588e562fdcfdd454d587eb33cd46
> > > > > > (Remove ->waiting member from struct request) is wrong, because
> > > > > > generic_ide_suspend() uses the end_of_io member of rq to pass the PM data
> > > > > > to ide_do_drive_cmd() where the pointer gets overwritten by &wait (must_wait
> > > > > > is "true", because action == ide_wait). Previously &wait was stored in
> > > > > > rq->waiting and it didn't overwrite the PM data.
> > > > >
> > > > > Indeed, that looks broken now. That must be what is screwing it up. With
> > > > > the former patch applied, did cdrom detection still look funny to you?
> > >
> > > Hm, I'm not sure what you mean ...
> >
> > -hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
> > +hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
>
> Ah, that.
>
> > But perhaps that wasn't you?
>
> No, that wasn't me. :-)
>
> > > > > I'll concoct a fix for that breakage.
> > > >
> > > > Something like this.
> > >
> > > Looks good, I'll give it a try.
> >
> > Thanks!
>
> It fixes this particular issue for me, but your first patch (appended)
> is also needed to prevent the box from hanging later during the resume
> (when it tries to save the image).

Yes certainly, that's a separate bug, sorry if I didn't make that clear.
Both fixes are in the block repo now, so next -mm should work fine
again.

--
Jens Axboe
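
The pointer clash Rafael describes in the quoted text above can be sketched in a few lines. This is only an illustration under stated assumptions: the struct layouts, the field names (`end_io_data`, `waiting`) and the helper functions are hypothetical stand-ins, not the real struct request or ide_do_drive_cmd():

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the request before and after the
 * "Remove ->waiting member" change; field names are illustrative. */
struct demo_request_old {
	void *end_io_data;	/* carries the caller's PM data */
	void *waiting;		/* carries the completion to wake */
};

struct demo_request_new {
	void *end_io_data;	/* now has to carry both */
};

/* Old behaviour: the completion gets its own slot. */
void demo_wait_old(struct demo_request_old *rq, void *wait)
{
	rq->waiting = wait;
}

/* New behaviour: the completion overwrites whatever the caller stored. */
void demo_wait_new(struct demo_request_new *rq, void *wait)
{
	rq->end_io_data = wait;
}
```

In the old layout the PM pointer survives the call. In the new one it is replaced by the address of the completion, so any later code that expects PM data in that field reads the wrong object.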

2006-08-08 14:40:44

by Rafael J. Wysocki

Subject: 2.6.18-rc3-mm2: reiserfs problem?

Hi,

I get something like the appended on every attempt to unmount the reiserfs
filesystem mounted on /tmp. The other reiserfs filesystems don't have such
problems, and this one didn't have them either with 2.6.18-rc2-mm1.


BUG: Dentry ffff810037c573e8{i=3,n=.reiserfs_priv} still in use (1) [unmount of reiserfs hdc7]
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/dcache.c:611
invalid opcode: 0000 [1] PREEMPT
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 0
Modules linked in: ide_cd cdrom xt_pkttype ipt_LOG xt_limit usbserial asus_acpi thermal processor fan button battery ac snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device af_packet bcm43xx ieee80211softmac ieee80211 ieee80211_crypt pcmcia firmware_class ohci1394 ieee1394 skge yenta_socket rsrc_nonstatic pcmcia_core usbhid ff_memless ip6t_REJECT xt_tcpudp ipt_REJECT xt_state snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd iptable_mangle soundcore iptable_nat ip_nat iptable_filter snd_page_alloc ip6table_mangle ehci_hcd ip_conntrack i2c_nforce2 i2c_core ip_tables ohci_hcd ip6table_filter ip6_tables x_tables ipv6 parport_pc lp parport dm_mod
Pid: 9478, comm: umount Not tainted 2.6.18-rc3-mm2 #7
RIP: 0010:[<ffffffff802a6eb7>] [<ffffffff802a6eb7>] shrink_dcache_for_umount_subtree+0x1d7/0x2b0
RSP: 0018:ffff810059291da8 EFLAGS: 00010296
RAX: 0000000000000062 RBX: ffff810037c573e8 RCX: 0000000000000003
RDX: 0000000000000008 RSI: ffff810037c627d8 RDI: 0000000000000001
RBP: ffff810059291dc8 R08: 0000000000000002 R09: ffffffff8022de59
R10: 0000000000000000 R11: 0000000000000001 R12: ffff810037c573e8
R13: ffff81005ddc4800 R14: ffff81005f539250 R15: ffff81005f0a8688
FS: 00002afc38e00b00(0000) GS:ffffffff808c2000(0000) knlGS:00000000558b4d00
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002ac8a49a1d40 CR3: 000000005f546000 CR4: 00000000000006e0
Process umount (pid: 9478, threadinfo ffff810059290000, task ffff810037c62080)
Stack: ffff81005f0a8b10 ffff81005f0a8688 ffffffff80577a20 ffff810059291ea8
ffff810059291de8 ffffffff802a6fc4 ffff81005f0a8688 ffffffff80577a20
ffff810059291e18 ffffffff80293bb4 ffff81005f539250 ffff81005e09d140
Call Trace:
[<ffffffff802a6fc4>] shrink_dcache_for_umount+0x34/0x70
[<ffffffff80293bb4>] generic_shutdown_super+0x24/0x110
[<ffffffff80293cd0>] kill_block_super+0x30/0x50
[<ffffffff80293f81>] deactivate_super+0x81/0xa0
[<ffffffff802ac008>] mntput_no_expire+0x58/0xa0
[<ffffffff8029b83d>] path_release_on_umount+0x1d/0x30
[<ffffffff802ad3f4>] sys_umount+0x274/0x290
[<ffffffff80209d0e>] system_call+0x7e/0x83
DWARF2 unwinder stuck at system_call+0x7e/0x83
Leftover inexact backtrace:


Code: 0f 0b 68 41 33 4a 80 c2 63 02 49 8b 5c 24 68 49 39 dc 75 05
RIP [<ffffffff802a6eb7>] shrink_dcache_for_umount_subtree+0x1d7/0x2b0
RSP <ffff810059291da8>

2006-08-08 14:42:20

by Rafael J. Wysocki

Subject: Re: 2.6.18-rc3-mm2

On Monday 07 August 2006 21:00, Dmitry Torokhov wrote:
> On 8/7/06, Fabio Comolli <[email protected]> wrote:
]--snip--[
> >
> > Still interested in dmesg with i8042.debug=1 ?
> >
>
> Yes, _with_ the i8042 polling patch applied.

I've got one for you (attached).

Greetings,
Rafael


Attachments:
(No filename) (288.00 B)
dmesg.log.gz (9.14 kB)

2006-08-08 15:12:15

by Andrew Morton

Subject: Re: 2.6.18-rc3-mm2: reiserfs problem?

On Tue, 8 Aug 2006 16:39:38 +0200
"Rafael J. Wysocki" <[email protected]> wrote:

> Hi,
>
> I get something like the appended on every attempt to unmount the reiserfs
> filesystem mounted on /tmp. The other reiserfs filesystems don't have such
> problems and this one didn't have them too with 2.6.18-rc2-mm1.
>
>
> BUG: Dentry ffff810037c573e8{i=3,n=.reiserfs_priv} still in use (1) [unmount of reiserfs hdc7]
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at fs/dcache.c:611
> invalid opcode: 0000 [1] PREEMPT
> last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
> CPU 0
> Modules linked in: ide_cd cdrom xt_pkttype ipt_LOG xt_limit usbserial asus_acpi thermal processor fan button battery ac snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device af_packet bcm43xx ieee80211softmac ieee80211 ieee80211_crypt pcmcia firmware_class ohci1394 ieee1394 skge yenta_socket rsrc_nonstatic pcmcia_core usbhid ff_memless ip6t_REJECT xt_tcpudp ipt_REJECT xt_state snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer snd iptable_mangle soundcore iptable_nat ip_nat iptable_filter snd_page_alloc ip6table_mangle ehci_hcd ip_conntrack i2c_nforce2 i2c_core ip_tables ohci_hcd ip6table_filter ip6_tables x_tables ipv6 parport_pc lp parport dm_mod
> Pid: 9478, comm: umount Not tainted 2.6.18-rc3-mm2 #7
> RIP: 0010:[<ffffffff802a6eb7>] [<ffffffff802a6eb7>] shrink_dcache_for_umount_subtree+0x1d7/0x2b0

Thanks, Rafael.
vfs-destroy-the-dentries-contributed-by-a-superblock-on-unmounting.patch
added that BUG_ON().

> RSP: 0018:ffff810059291da8 EFLAGS: 00010296
> RAX: 0000000000000062 RBX: ffff810037c573e8 RCX: 0000000000000003
> RDX: 0000000000000008 RSI: ffff810037c627d8 RDI: 0000000000000001
> RBP: ffff810059291dc8 R08: 0000000000000002 R09: ffffffff8022de59
> R10: 0000000000000000 R11: 0000000000000001 R12: ffff810037c573e8
> R13: ffff81005ddc4800 R14: ffff81005f539250 R15: ffff81005f0a8688
> FS: 00002afc38e00b00(0000) GS:ffffffff808c2000(0000) knlGS:00000000558b4d00
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00002ac8a49a1d40 CR3: 000000005f546000 CR4: 00000000000006e0
> Process umount (pid: 9478, threadinfo ffff810059290000, task ffff810037c62080)
> Stack: ffff81005f0a8b10 ffff81005f0a8688 ffffffff80577a20 ffff810059291ea8
> ffff810059291de8 ffffffff802a6fc4 ffff81005f0a8688 ffffffff80577a20
> ffff810059291e18 ffffffff80293bb4 ffff81005f539250 ffff81005e09d140
> Call Trace:
> [<ffffffff802a6fc4>] shrink_dcache_for_umount+0x34/0x70
> [<ffffffff80293bb4>] generic_shutdown_super+0x24/0x110
> [<ffffffff80293cd0>] kill_block_super+0x30/0x50
> [<ffffffff80293f81>] deactivate_super+0x81/0xa0
> [<ffffffff802ac008>] mntput_no_expire+0x58/0xa0
> [<ffffffff8029b83d>] path_release_on_umount+0x1d/0x30
> [<ffffffff802ad3f4>] sys_umount+0x274/0x290
> [<ffffffff80209d0e>] system_call+0x7e/0x83
> DWARF2 unwinder stuck at system_call+0x7e/0x83
> Leftover inexact backtrace:
>
>
> Code: 0f 0b 68 41 33 4a 80 c2 63 02 49 8b 5c 24 68 49 39 dc 75 05
> RIP [<ffffffff802a6eb7>] shrink_dcache_for_umount_subtree+0x1d7/0x2b0
> RSP <ffff810059291da8>

2006-08-08 16:40:34

by Jiri Slaby

Subject: Re: swsusp regression [Was: 2.6.18-rc3-mm2]

Jens Axboe wrote:
> On Tue, Aug 08 2006, Rafael J. Wysocki wrote:
>> On Tuesday 08 August 2006 13:19, Jens Axboe wrote:
>>> On Tue, Aug 08 2006, Rafael J. Wysocki wrote:
>>>> On Tuesday 08 August 2006 13:07, Jens Axboe wrote:
>>>>> On Tue, Aug 08 2006, Jens Axboe wrote:
>>>>>>> Indeed, that looks broken now. That must be what is screwing it up. With
>>>>>> the former patch applied, did cdrom detection still look funny to you?
>>>> Hm, I'm not sure what you mean ...
>>> -hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
>>> +hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
>> Ah, that.
>>
>>> But perhaps that wasn't you?
>> No, that wasn't me. :-)

It was me and it's OK.

>>>>>> I'll concoct a fix for that breakage.
>>>>> Something like this.
>>>> Looks good, I'll give it a try.
>>> Thanks!
>> It fixes this particular issue for me, but your first patch (appended)
>> is also needed to prevent the box from hanging later during the resume
>> (when it tries to save the image).
>
> Yes certainly, that's a separate bug, sorry if I didn't make that clear.
> Both fixes are in the block repo now, so next -mm should work fine
> again.

And even this is OK.

I'm just curious, what
@@ -387,3 +398,5 @@
EXT3 FS on md0, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Adding 506036k swap on /dev/hda3. Priority:-1 extents:1 across:506036k
+JBD: barrier-based sync failed on hda2 - disabling barriers
+JBD: barrier-based sync failed on md0 - disabling barriers

means. Another bug?

thanks,
--
<a href="http://www.fi.muni.cz/~xslaby/">Jiri Slaby</a>
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E

2006-08-08 17:42:39

by Dmitry Torokhov

Subject: Re: 2.6.18-rc3-mm2

On 8/8/06, Rafael J. Wysocki <[email protected]> wrote:
> On Monday 07 August 2006 21:00, Dmitry Torokhov wrote:
> > On 8/7/06, Fabio Comolli <[email protected]> wrote:
> ]--snip--[
> > >
> > > Still interested in dmesg with i8042.debug=1 ?
> > >
> >
> > Yes, _with_ the i8042 polling patch applied.
>
> I've got one for you (attached).
>

Thank you, I think I see what the problem is. Rafael, could you please
try booting with i8042.nomux and tell me if the mouse starts working?

Fabio, do you have a multiplexing controller as well?

--
Dmitry

2006-08-08 17:52:34

by Jens Axboe

Subject: Re: swsusp regression [Was: 2.6.18-rc3-mm2]

On Tue, Aug 08 2006, Jiri Slaby wrote:
> Jens Axboe wrote:
> >On Tue, Aug 08 2006, Rafael J. Wysocki wrote:
> >>On Tuesday 08 August 2006 13:19, Jens Axboe wrote:
> >>>On Tue, Aug 08 2006, Rafael J. Wysocki wrote:
> >>>>On Tuesday 08 August 2006 13:07, Jens Axboe wrote:
> >>>>>On Tue, Aug 08 2006, Jens Axboe wrote:
> >>>>>>>Indeed, that looks broken now. That must be what is screwing it up.
> >>>>>>>With
> >>>>>>the former patch applied, did cdrom detection still look funny to you?
> >>>>Hm, I'm not sure what you mean ...
> >>>-hdc: ATAPI 63X DVD-ROM DVD-R CD-R/RW drive, 2048kB Cache, UDMA(33)
> >>>+hdc: ATAPI CD-ROM drive, 0kB Cache, UDMA(33)
> >>Ah, that.
> >>
> >>>But perhaps that wasn't you?
> >>No, that wasn't me. :-)
>
> It was me and it's OK.
>
> >>>>>>I'll concoct a fix for that breakage.
> >>>>>Something like this.
> >>>>Looks good, I'll give it a try.
> >>>Thanks!
> >>It fixes this particular issue for me, but your first patch (appended)
> >>is also needed to prevent the box from hanging later during the resume
> >>(when it tries to save the image).
> >
> >Yes certainly, that's a separate bug, sorry if I didn't make that clear.
> >Both fixes are in the block repo now, so next -mm should work fine
> >again.
>
> And even this is OK.

Good.

> I'm just curious, what
> @@ -387,3 +398,5 @@
> EXT3 FS on md0, internal journal
> EXT3-fs: mounted filesystem with ordered data mode.
> Adding 506036k swap on /dev/hda3. Priority:-1 extents:1 across:506036k
> +JBD: barrier-based sync failed on hda2 - disabling barriers
> +JBD: barrier-based sync failed on md0 - disabling barriers

I think that -mm also added barriers on by default for ext3, so I don't
think it's anything to worry about.

--
Jens Axboe

2006-08-08 18:14:12

by Fabio Comolli

Subject: Re: 2.6.18-rc3-mm2

Hi.

On 8/7/06, Dmitry Torokhov <[email protected]> wrote:
> On 8/7/06, Fabio Comolli <[email protected]> wrote:
> > Hi.
> >
> > On 8/7/06, Dmitry Torokhov <[email protected]> wrote:
> > > On Sunday 06 August 2006 15:09, Andrew Morton wrote:
> > > > -tycho kernel: input: PS/2 Mouse as /class/input/input1
> > > > -tycho kernel: input: AlpsPS/2 ALPS GlidePoint as /class/input/input2
> > > >
> > > > That's not so good.
> > > >
> > > >
> > > > Dmitry, do you have anything in there which might have caused that?
> > > >
> > > > Perhaps hdaps-handle-errors-from-input_register_device.patch is triggering
> > > > for some reason.
> > >
> > > Hmm, I'd be more concerned with i8042-get-rid-of-polling-timer patch...
> >
> > Bingo! Reverting remove-polling-timer-from-i8042-v2.patch did the
> > trick. Now I'm running 2.6.18-rc3-mm2 + hot-fixes :-)
> >
> > Still interested in dmesg with i8042.debug=1 ?
> >
>
> Yes, _with_ the i8042 polling patch applied. Do you have PNP support enabled?
>
> --
> Dmitry
>

Please find the compressed log attached. And no, I don't have PNP
support enabled.
Hope this helps.

Fabio


Attachments:
(No filename) (1.10 kB)
rc3-mm2.i8042_debug.gz (7.82 kB)

2006-08-08 18:16:59

by Fabio Comolli

Subject: Re: 2.6.18-rc3-mm2

Hi Dmitry.

On 8/8/06, Dmitry Torokhov <[email protected]> wrote:

> Fabio, do you have a multiplexing controller as well?

Well, I don't even know what this means :-(
How do I know?

However, it's an HP laptop, model name Pavilion DV4378EA.

>
> --
> Dmitry
>

Fabio

2006-08-08 18:24:36

by Dmitry Torokhov

Subject: Re: 2.6.18-rc3-mm2

On 8/8/06, Fabio Comolli <[email protected]> wrote:
> Hi Dmitry.
>
> On 8/8/06, Dmitry Torokhov <[email protected]> wrote:
>
> > Fabio, do you have a multiplexing controller as well?
>
> Well, I don't even know what this means :-(
> How do I know?
>
> However, it's a HP laptop, model name Pavillion DV4378EA.
>

Yep, you do have it:

> i8042.c: Detected active multiplexing controller, rev 1.1.

Could you please try booting with i8042.nomux and tell me if it works?

Thanks!

--
Dmitry

2006-08-08 18:33:59

by Michael Büsch

Subject: Re: [RFC: -mm patch] bcm43xx_main.c: remove 3 functions

On Monday 07 August 2006 23:04, Adrian Bunk wrote:
> This patch removes three no longer used functions (that are even
> generating gcc warnings).
>
> This patch doesn't look right, but it is the result of
> 58e5528ee464d38040b9489e10033c9387a10d56 in git-netdev...

Hm, can't find that commit in a tree.
I looked at linus', netdev-2.6.

But one thing is for sure. This patch is _wrong_. ;)

> Signed-off-by: Adrian Bunk <[email protected]>

NACK.

> drivers/net/wireless/bcm43xx/bcm43xx_main.c | 33 --------------------
> 1 file changed, 33 deletions(-)
>
> --- linux-2.6.18-rc3-mm2-full/drivers/net/wireless/bcm43xx/bcm43xx_main.c.old 2006-08-07 18:21:31.000000000 +0200
> +++ linux-2.6.18-rc3-mm2-full/drivers/net/wireless/bcm43xx/bcm43xx_main.c 2006-08-07 18:23:36.000000000 +0200
> @@ -3194,39 +3194,6 @@
> bcm43xx_clear_keys(bcm);
> }
>
> -static int bcm43xx_rng_read(struct hwrng *rng, u32 *data)
> -{
> - struct bcm43xx_private *bcm = (struct bcm43xx_private *)rng->priv;
> - unsigned long flags;
> -
> - spin_lock_irqsave(&(bcm)->irq_lock, flags);
> - *data = bcm43xx_read16(bcm, BCM43xx_MMIO_RNG);
> - spin_unlock_irqrestore(&(bcm)->irq_lock, flags);
> -
> - return (sizeof(u16));
> -}
> -
> -static void bcm43xx_rng_exit(struct bcm43xx_private *bcm)
> -{
> - hwrng_unregister(&bcm->rng);
> -}
> -
> -static int bcm43xx_rng_init(struct bcm43xx_private *bcm)
> -{
> - int err;
> -
> - snprintf(bcm->rng_name, ARRAY_SIZE(bcm->rng_name),
> - "%s_%s", KBUILD_MODNAME, bcm->net_dev->name);
> - bcm->rng.name = bcm->rng_name;
> - bcm->rng.data_read = bcm43xx_rng_read;
> - bcm->rng.priv = (unsigned long)bcm;
> - err = hwrng_register(&bcm->rng);
> - if (err)
> - printk(KERN_ERR PFX "RNG init failed (%d)\n", err);
> -
> - return err;
> -}
> -
> static int bcm43xx_shutdown_all_wireless_cores(struct bcm43xx_private *bcm)
> {

--
Greetings Michael.

2006-08-08 18:36:21

by Fabio Comolli

Subject: Re: 2.6.18-rc3-mm2

Hi.

On 8/8/06, Dmitry Torokhov <[email protected]> wrote:
> On 8/8/06, Fabio Comolli <[email protected]> wrote:
> > Hi Dmitry.
> >
> > On 8/8/06, Dmitry Torokhov <[email protected]> wrote:
> >
> > > Fabio, do you have a multiplexing controller as well?
> >
> > Well, I don't even know what this means :-(
> > How do I know?
> >
> > However, it's a HP laptop, model name Pavillion DV4378EA.
> >
>
> Yep, you do have it:
>
> > i8042.c: Detected active multiplexing controller, rev 1.1.
>
> Could you please try booting with i8042.nomux and tell me if it works?
>

Yup, it works.

> Thanks!
>
> --
> Dmitry
>

Ciao.
Fabio

2006-08-08 19:42:36

by Adrian Bunk

Subject: Re: [RFC: -mm patch] bcm43xx_main.c: remove 3 functions

On Tue, Aug 08, 2006 at 08:32:37PM +0200, Michael Buesch wrote:
> On Monday 07 August 2006 23:04, Adrian Bunk wrote:
> > This patch removes three no longer used functions (that are even
> > generating gcc warnings).
> >
> > This patch doesn't look right, but it is the result of
> > 58e5528ee464d38040b9489e10033c9387a10d56 in git-netdev...
>
> Hm, can't find that commit in a tree.
> I looked at linus', netdev-2.6.

It's in netdev-2.6.git#ALL that gets included in -mm.

> But one thing is for sure. This patch is _wrong_. ;)
>...

And it seems to be your fault. ;-)


commit 58e5528ee464d38040b9489e10033c9387a10d56
Author: Michael Buesch <[email protected]>
Date: Sat Jul 8 22:02:18 2006 +0200

[PATCH] bcm43xx: init routine rewrite

Rewrite of the bcm43xx initialization routines.
This fixes several issues:
* up-down-up-down-up... stale data issue
(May fix some DHCP issues)
* Fix the init vs IRQ handler race (and remove the workaround)
* Fix init for cards with multiple cores (APHY)
As softmac has no internal PHY handling (unlike dscape),
this adds the file "phymode" to sysfs.
The active PHY can be selected by writing either a, b or g
to this file. Current PHY can be determined by reading from it.
* Fix the controller restart code.
Controller restart can now also be triggered through
echo 1 > /debug/bcm43xx/ethX/restart

Signed-off-by: Michael Buesch <[email protected]>
Signed-off-by: John W. Linville <[email protected]>


> Greetings Michael.

cu
Adrian

--

"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed

2006-08-08 20:33:19

by Rafael J. Wysocki

Subject: Re: 2.6.18-rc3-mm2

On Tuesday 08 August 2006 19:42, Dmitry Torokhov wrote:
> On 8/8/06, Rafael J. Wysocki <[email protected]> wrote:
> > On Monday 07 August 2006 21:00, Dmitry Torokhov wrote:
> > > On 8/7/06, Fabio Comolli <[email protected]> wrote:
> > ]--snip--[
> > > >
> > > > Still interested in dmesg with i8042.debug=1 ?
> > > >
> > >
> > > Yes, _with_ the i8042 polling patch applied.
> >
> > I've got one for you (attached).
> >
>
> Thank you, I think I see what the problem is. Rafael, could you please
> try booting with i8042.nomux and tell me if the mouse starts working?

It's a touchpad, but I guess that doesn't make a difference?

Rafael

2006-08-08 22:14:12

by Jeff Garzik

Subject: Re: [RFC: -mm patch] bcm43xx_main.c: remove 3 functions

Michael Buesch wrote:
> On Monday 07 August 2006 23:04, Adrian Bunk wrote:
>> This patch removes three no longer used functions (that are even
>> generating gcc warnings).
>>
>> This patch doesn't look right, but it is the result of
>> 58e5528ee464d38040b9489e10033c9387a10d56 in git-netdev...
>
> Hm, can't find that commit in a tree.
> I looked at linus', netdev-2.6.

It's clearly in netdev-2.6.git#upstream:

commit 58e5528ee464d38040b9489e10033c9387a10d56
Author: Michael Buesch <[email protected]>
Date: Sat Jul 8 22:02:18 2006 +0200

[PATCH] bcm43xx: init routine rewrite

Rewrite of the bcm43xx initialization routines.
This fixes several issues:
* up-down-up-down-up... stale data issue
(May fix some DHCP issues)
* Fix the init vs IRQ handler race (and remove the workaround)
* Fix init for cards with multiple cores (APHY)
As softmac has no internal PHY handling (unlike dscape),
this adds the file "phymode" to sysfs.
The active PHY can be selected by writing either a, b or g
to this file. Current PHY can be determined by reading from it.
* Fix the controller restart code.
Controller restart can now also be triggered through
echo 1 > /debug/bcm43xx/ethX/restart

Signed-off-by: Michael Buesch <[email protected]>
Signed-off-by: John W. Linville <[email protected]>

2006-08-09 03:47:25

by Dmitry Torokhov

Subject: Re: 2.6.18-rc3-mm2

On Tuesday 08 August 2006 14:36, Fabio Comolli wrote:
> Hi.
>
> On 8/8/06, Dmitry Torokhov <[email protected]> wrote:
> > On 8/8/06, Fabio Comolli <[email protected]> wrote:
> > > Hi Dmitry.
> > >
> > > On 8/8/06, Dmitry Torokhov <[email protected]> wrote:
> > >
> > > > Fabio, do you have a multiplexing controller as well?
> > >
> > > Well, I don't even know what this means :-(
> > > How do I know?
> > >
> > > However, it's a HP laptop, model name Pavillion DV4378EA.
> > >
> >
> > Yep, you do have it:
> >
> > > i8042.c: Detected active multiplexing controller, rev 1.1.
> >
> > Could you please try booting with i8042.nomux and tell me if it works?
> >
>
> Yup, it works.
>

Fabio, Rafael,

Could you please try applying the patch below on top of -rc3-mm2 and
see if it works without needing i8042.nomux?

Thank you!

--
Dmitry

Signed-off-by: Dmitry Torokhov <[email protected]>
---

drivers/input/serio/i8042.c | 2 +-
1 files changed, 1 insertion(+), 1 deletion(-)

Index: work/drivers/input/serio/i8042.c
===================================================================
--- work.orig/drivers/input/serio/i8042.c
+++ work/drivers/input/serio/i8042.c
@@ -435,7 +435,7 @@ static int i8042_enable_mux_ports(void)
i8042_command(&param, I8042_CMD_AUX_ENABLE);
}

- return 0;
+ return i8042_enable_aux_port();
}

/*

2006-08-09 04:47:41

by Michael Büsch

Subject: Re: [RFC: -mm patch] bcm43xx_main.c: remove 3 functions

On Tuesday 08 August 2006 21:42, you wrote:
> And it seems to be your fault. ;-)

Uh, oh. I'm trapped.

> commit 58e5528ee464d38040b9489e10033c9387a10d56
> Author: Michael Buesch <[email protected]>
> Date: Sat Jul 8 22:02:18 2006 +0200
>
> [PATCH] bcm43xx: init routine rewrite

Ah, I guessed it.
This was caused by some merge-race ;)
Will send a fix for this, soon.

--
Greetings Michael.

2006-08-09 07:12:31

by Rafael J. Wysocki

Subject: Re: 2.6.18-rc3-mm2

On Wednesday 09 August 2006 05:47, Dmitry Torokhov wrote:
> On Tuesday 08 August 2006 14:36, Fabio Comolli wrote:
> > Hi.
> >
> > On 8/8/06, Dmitry Torokhov <[email protected]> wrote:
> > > On 8/8/06, Fabio Comolli <[email protected]> wrote:
> > > > Hi Dmitry.
> > > >
> > > > On 8/8/06, Dmitry Torokhov <[email protected]> wrote:
> > > >
> > > > > Fabio, do you have a multiplexing controller as well?
> > > >
> > > > Well, I don't even know what this means :-(
> > > > How do I know?
> > > >
> > > > However, it's a HP laptop, model name Pavillion DV4378EA.
> > > >
> > >
> > > Yep, you do have it:
> > >
> > > > i8042.c: Detected active multiplexing controller, rev 1.1.
> > >
> > > Could you please try booting with i8042.nomux and tell me if it works?
> > >
> >
> > Yup, it works.
> >
>
> Fabio, Rafael,
>
> Could you please try applying the patch below on top of -rc3-mm2 and
> see if it works without needing i8042.nomux?

Yes, it does.

Thanks,
Rafael

2006-08-09 19:06:41

by Valdis Klētnieks

Subject: 2.6.18-rc3-mm2 - ext3 locking issue?

On Sun, 06 Aug 2006 03:08:09 PDT, Andrew Morton said:

> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/

Yum managed to get wedged: 'echo t > /proc/sysrq-trigger' says:

[ 4514.840000] yum D D5C32AA0 0 4747 4430 (NOTLB)
[ 4514.840000] d5c3dda4 d5c3dd78 00000007 d5c32aa0 bd3ddd00 00000338 00000000 d5c32bc0
[ 4514.840000] c1601628 d5c3dd9c 64600300 0000001f d5c3ddd8 d5c3ddd8 c1601628 d5c3ddac
[ 4514.840000] c034fef8 d5c3ddb4 c0136e8e d5c3ddcc c0350026 c0136e58 d5c3ddd8 00000000
[ 4514.840000] Call Trace:
[ 4514.840000] [<c034fef8>] io_schedule+0x25/0x44
[ 4514.840000] [<c0136e8e>] sync_page+0x36/0x3a
[ 4514.840000] [<c0350026>] __wait_on_bit_lock+0x30/0x58
[ 4514.840000] [<c0136e44>] __lock_page+0x51/0x59
[ 4514.840000] [<c013f099>] truncate_inode_pages_range+0x1de/0x230
[ 4514.840000] [<c013f0f7>] truncate_inode_pages+0xc/0x11
[ 4514.840000] [<c018ea12>] ext3_delete_inode+0x16/0xbd
[ 4514.840000] [<c016798f>] generic_delete_inode+0xb6/0x130
[ 4514.840000] [<c0167a1b>] generic_drop_inode+0x12/0x166
[ 4514.840000] [<c01673f1>] iput+0x67/0x6a
[ 4514.840000] [<c0165662>] dentry_iput+0x97/0xcc
[ 4514.840000] [<c016613d>] dput+0x183/0x19c
[ 4514.840000] [<c015f64f>] sys_renameat+0x17a/0x1d3
[ 4514.840000] [<c015f6ba>] sys_rename+0x12/0x14
[ 4514.840000] [<c0102849>] sysenter_past_esp+0x56/0x79

A careful check of the dmesg doesn't reveal anything particularly helpful,
like an oops or other relevant kernel message.


Attachments:
(No filename) (226.00 B)

2006-08-09 19:47:22

by Fabio Comolli

Subject: Re: 2.6.18-rc3-mm2

Hi Dmitry.

On 8/9/06, Dmitry Torokhov <[email protected]> wrote:
> Could you please try applying the patch below on top of -rc3-mm2 and
> see if it works without needing i8042.nomux?
>

Yes, it works for me too. However, Andrew put a revert patch for
remove-polling-timer-from-i8042-v2.patch in his hot-fixes directory.
So, which one should be considered the correct fix?

> Thank you!
>
> --
> Dmitry

Ciao.
Fabio



>
> Signed-off-by: Dmitry Torokhov <[email protected]>
> ---
>
> drivers/input/serio/i8042.c | 2 +-
> 1 files changed, 1 insertion(+), 1 deletion(-)
>
> Index: work/drivers/input/serio/i8042.c
> ===================================================================
> --- work.orig/drivers/input/serio/i8042.c
> +++ work/drivers/input/serio/i8042.c
> @@ -435,7 +435,7 @@ static int i8042_enable_mux_ports(void)
> i8042_command(&param, I8042_CMD_AUX_ENABLE);
> }
>
> - return 0;
> + return i8042_enable_aux_port();
> }
>
> /*
>

2006-08-09 20:01:54

by Andrew Morton

Subject: Re: 2.6.18-rc3-mm2 - ext3 locking issue?

On Wed, 09 Aug 2006 15:06:35 -0400
[email protected] wrote:

> On Sun, 06 Aug 2006 03:08:09 PDT, Andrew Morton said:
>
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>
> Yum managed to get wedged: 'echo t > /proc/sysrq-trigger' says:
>
> [ 4514.840000] yum D D5C32AA0 0 4747 4430 (NOTLB)
> [ 4514.840000] d5c3dda4 d5c3dd78 00000007 d5c32aa0 bd3ddd00 00000338 00000000 d5c32bc0
> [ 4514.840000] c1601628 d5c3dd9c 64600300 0000001f d5c3ddd8 d5c3ddd8 c1601628 d5c3ddac
> [ 4514.840000] c034fef8 d5c3ddb4 c0136e8e d5c3ddcc c0350026 c0136e58 d5c3ddd8 00000000
> [ 4514.840000] Call Trace:
> [ 4514.840000] [<c034fef8>] io_schedule+0x25/0x44
> [ 4514.840000] [<c0136e8e>] sync_page+0x36/0x3a
> [ 4514.840000] [<c0350026>] __wait_on_bit_lock+0x30/0x58
> [ 4514.840000] [<c0136e44>] __lock_page+0x51/0x59
> [ 4514.840000] [<c013f099>] truncate_inode_pages_range+0x1de/0x230
> [ 4514.840000] [<c013f0f7>] truncate_inode_pages+0xc/0x11
> [ 4514.840000] [<c018ea12>] ext3_delete_inode+0x16/0xbd
> [ 4514.840000] [<c016798f>] generic_delete_inode+0xb6/0x130
> [ 4514.840000] [<c0167a1b>] generic_drop_inode+0x12/0x166
> [ 4514.840000] [<c01673f1>] iput+0x67/0x6a
> [ 4514.840000] [<c0165662>] dentry_iput+0x97/0xcc
> [ 4514.840000] [<c016613d>] dput+0x183/0x19c
> [ 4514.840000] [<c015f64f>] sys_renameat+0x17a/0x1d3
> [ 4514.840000] [<c015f6ba>] sys_rename+0x12/0x14
> [ 4514.840000] [<c0102849>] sysenter_past_esp+0x56/0x79
>
> A careful check of the dmesg doesn't reveal anything particularly helpful,
> like an oops or other relevant kernel message.

Usually this means that there's an IO request in flight and it got lost
somewhere. Device driver bug, IO scheduler bug, etc. Conceivably a
lost interrupt (hardware bug, PCI setup bug, etc).

Which device driver and which IO sched are you using?

2006-08-09 20:13:41

by Dmitry Torokhov

Subject: Re: 2.6.18-rc3-mm2

On 8/9/06, Fabio Comolli <[email protected]> wrote:
> Hi Dmitry.
>
> On 8/9/06, Dmitry Torokhov <[email protected]> wrote:
> > Could you please try applying the patch below on top of -rc3-mm2 and
> > see if it works without needing i8042.nomux?
> >
>
> Yes, it works for me too.

Thank you for testing.

> However, Andrew put a revert patch for
> remove-polling-timer-from-i8042-v2.patch in his hot-fixes directory.
> So, which one should be considered the correct fix?

I'd rather have him replace the reverting patch with this one. Removing
the polling timer is needed for tickless operation.

--
Dmitry

2006-08-09 20:43:29

by Valdis Klētnieks

Subject: Re: 2.6.18-rc3-mm2 - ext3 locking issue?

On Wed, 09 Aug 2006 13:01:51 PDT, Andrew Morton said:

> > Yum managed to get wedged: 'echo t > /proc/sysrq-trigger' says:
> >
> > [ 4514.840000] yum D D5C32AA0 0 4747 4430 (NOTLB)
> > [ 4514.840000] d5c3dda4 d5c3dd78 00000007 d5c32aa0 bd3ddd00 00000338 00000000 d5c32bc0
> > [ 4514.840000] c1601628 d5c3dd9c 64600300 0000001f d5c3ddd8 d5c3ddd8 c1601628 d5c3ddac
> > [ 4514.840000] c034fef8 d5c3ddb4 c0136e8e d5c3ddcc c0350026 c0136e58 d5c3ddd8 00000000
> > [ 4514.840000] Call Trace:
> > [ 4514.840000] [<c034fef8>] io_schedule+0x25/0x44
> > [ 4514.840000] [<c0136e8e>] sync_page+0x36/0x3a
> > [ 4514.840000] [<c0350026>] __wait_on_bit_lock+0x30/0x58
> > [ 4514.840000] [<c0136e44>] __lock_page+0x51/0x59
> > [ 4514.840000] [<c013f099>] truncate_inode_pages_range+0x1de/0x230
> > [ 4514.840000] [<c013f0f7>] truncate_inode_pages+0xc/0x11
> > [ 4514.840000] [<c018ea12>] ext3_delete_inode+0x16/0xbd
> > [ 4514.840000] [<c016798f>] generic_delete_inode+0xb6/0x130
> > [ 4514.840000] [<c0167a1b>] generic_drop_inode+0x12/0x166
> > [ 4514.840000] [<c01673f1>] iput+0x67/0x6a
> > [ 4514.840000] [<c0165662>] dentry_iput+0x97/0xcc
> > [ 4514.840000] [<c016613d>] dput+0x183/0x19c
> > [ 4514.840000] [<c015f64f>] sys_renameat+0x17a/0x1d3
> > [ 4514.840000] [<c015f6ba>] sys_rename+0x12/0x14
> > [ 4514.840000] [<c0102849>] sysenter_past_esp+0x56/0x79
> >
> > A careful check of the dmesg doesn't reveal anything particularly helpful,
> > like an oops or other relevant kernel message.
>
> Usually this means that there's an IO request in flight and it got lost
> somewhere. Device driver bug, IO scheduler bug, etc. Conceivably a
> lost interrupt (hardware bug, PCI setup bug, etc).
>
> Which device driver and which IO sched are you using?

Aug 9 13:33:13 turing-police kernel: [ 11.297507] libata version 2.00 loaded.
Aug 9 13:33:14 turing-police kernel: [ 11.297763] ata_piix 0000:00:1f.1: version 2.00ac6
Aug 9 13:33:14 turing-police kernel: [ 11.297780] PCI: Enabling device 0000:00:1f.1 (0005 -> 0007)
Aug 9 13:33:14 turing-police kernel: [ 11.299245] ACPI: PCI Interrupt Link [LNKA] enabled at IRQ 11
Aug 9 13:33:14 turing-police kernel: [ 11.299786] ACPI: PCI Interrupt 0000:00:1f.1[A] -> Link [LNKA] -> GSI 11 (level, low) -> IRQ 11
Aug 9 13:33:14 turing-police kernel: [ 11.300638] PCI: Setting latency timer of device 0000:00:1f.1 to 64
Aug 9 13:33:15 turing-police kernel: [ 11.300720] ata1: PATA max UDMA/100 cmd 0x1F0 ctl 0x3F6 bmdma 0xBFA0 irq 14
Aug 9 13:33:15 turing-police kernel: [ 11.301381] scsi0 : ata_piix
...

The disk was running with the 'cfq' scheduler. I checked the dmesg, and the
only odd thing was this:

Aug 9 14:30:24 turing-police kernel: [ 3535.720000] end_request: I/O error, dev fd0, sector 0

Weird though - floppy and ATA are on different IRQs according to /proc/interrupts:

           CPU0
  0:   11122651    XT-PIC-level  timer
  1:      12532    XT-PIC-level  i8042
  2:          0    XT-PIC-level  cascade
  5:     190651    XT-PIC-level  Intel 82801CA-ICH3
  6:          5    XT-PIC-level  floppy
  8:          1    XT-PIC-level  rtc
  9:          1    XT-PIC-level  acpi
 11:     238728    XT-PIC-level  uhci_hcd:usb1, uhci_hcd:usb2, uhci_hcd:usb3, yenta, yenta, yenta, ohci1394, pcmcia2.0, eth3
 12:        114    XT-PIC-level  i8042
 14:     172656    XT-PIC-level  libata
 15:          0    XT-PIC-level  libata
NMI:          0
LOC:          0
ERR:          1
MIS:          0

(For the record, the laptop doesn't even *have* a floppy drive installed at the
moment)



2006-08-10 03:33:11

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: 2.6.18-rc3-mm2 - ext3 locking issue?

On Wed, 09 Aug 2006 16:43:20 EDT, [email protected] said:

> > Usually this means that there's an IO request in flight and it got lost
> > somewhere. Device driver bug, IO scheduler bug, etc. Conceivably a
> > lost interrupt (hardware bug, PCI setup bug, etc).

> Aug 9 14:30:24 turing-police kernel: [ 3535.720000] end_request: I/O error, dev fd0, sector 0

Red herring. yum just wedged again, this time with no reference to floppy drive.
Same traceback. Anybody have anything to suggest before I start playing
hunt-the-wumpus with a -mm bisection?




2006-08-10 09:03:26

by Laurent Riffard

[permalink] [raw]
Subject: Re: 2.6.18-rc3-mm2 - OOM storm

[this is a resend, as the original message may be too big to reach the list...]

On 06.08.2006 12:08, Andrew Morton wrote:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/

Hello,

On my system, a cron job runs every day to check the integrity of
installed RPMs. It runs "rpm -V" on each package, which computes the
MD5 hash of each installed file and compares it, along with the file
size and modification time, with the values stored in the RPM database.

This is the workload. Since 2.6.18-rc3-mm2, this process eats
all the memory and triggers the OOM killer.

On my system, "free -t" output normally looks like this ("cached" value
is about half of RAM):
# free -t
             total       used       free     shared    buffers     cached
Mem:        515032     508512       6520          0      22992     256032
-/+ buffers/cache:     229488     285544
Swap:      1116428        324    1116104
Total:     1631460     508836    1122624

After the rpm database check, "free -t" says:
             total       used       free     shared    buffers     cached
Mem:        515032     507124       7908          0       8132     398296
-/+ buffers/cache:     100696     414336
Swap:      1116428      34896    1081532
Total:     1631460     542020    1089440

And the value of "cached" won't decrease.


This evening, this process triggered the OOM killer. Here is its first report:

syslogd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
[show_trace+13/16] show_trace+0xd/0x10
[<c0104c18>] show_trace+0xd/0x10
[dump_stack+25/29] dump_stack+0x19/0x1d
[<c0104c34>] dump_stack+0x19/0x1d
[out_of_memory+93/422] out_of_memory+0x5d/0x1a6
[<c013be03>] out_of_memory+0x5d/0x1a6
[__alloc_pages+505/633] __alloc_pages+0x1f9/0x279
[<c013d25f>] __alloc_pages+0x1f9/0x279
[__do_page_cache_readahead+165/495] __do_page_cache_readahead+0xa5/0x1ef
[<c013e71b>] __do_page_cache_readahead+0xa5/0x1ef
[do_page_cache_readahead+66/80] do_page_cache_readahead+0x42/0x50
[<c013ec64>] do_page_cache_readahead+0x42/0x50
[filemap_nopage+412/882] filemap_nopage+0x19c/0x372
[<c013afbe>] filemap_nopage+0x19c/0x372
[__handle_mm_fault+540/1772] __handle_mm_fault+0x21c/0x6ec
[<c014435d>] __handle_mm_fault+0x21c/0x6ec
[do_page_fault+397/1158] do_page_fault+0x18d/0x486
[<c0111e1f>] do_page_fault+0x18d/0x486
[error_code+57/64] error_code+0x39/0x40
[<c0293079>] error_code+0x39/0x40
Mem-info:
DMA per-cpu:
cpu 0 hot: high 0, batch 1 used:0
cpu 0 cold: high 0, batch 1 used:0
Normal per-cpu:
cpu 0 hot: high 186, batch 31 used:63
cpu 0 cold: high 62, batch 15 used:61
Active:1621 inactive:97987 dirty:0 writeback:33 unstable:0 free:1215 slab:23388 mapped:3 pagetables:446
DMA free:2068kB min:88kB low:108kB high:132kB active:0kB inactive:7432kB present:16384kB pages_scanned:11284 all_unreclaimable? yes
lowmem_reserve[]: 0 495
Normal free:2792kB min:2804kB low:3504kB high:4204kB active:6484kB inactive:384516kB present:507824kB pages_scanned:670357 all_unreclaimable? yes
lowmem_reserve[]: 0 0
DMA: 1*4kB 0*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 1*2048kB 0*4096kB = 2068kB
Normal: 0*4kB 1*8kB 6*16kB 2*32kB 1*64kB 0*128kB 0*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 2792kB
Swap cache: add 109576, delete 109542, find 12933/22258, race 0+8
Free swap = 936452kB
Total swap = 1116428kB
Free swap: 936452kB
131052 pages of RAM
0 pages of HIGHMEM
2358 reserved pages
2668 pages shared
34 pages swap cached
0 pages dirty
33 pages writeback
3 pages mapped
23388 pages slab
446 pages pagetables
Out of Memory: Kill process 23392 (seamonkey-bin) score 48523 and children.
Out of memory: Killed process 23392 (seamonkey-bin).


I gathered some data before the rpm database check and near the end of it:
- /proc/slabinfo
- /proc/slab_allocators
- /proc/meminfo
- free -t

Please look in http://laurent.riffard.free.fr/2.6.18-rc3-mm2. You'll
find dmesg and .config too.

For information:

/proc/sys/vm/block_dump:0
/proc/sys/vm/dirty_background_ratio:10
/proc/sys/vm/dirty_expire_centisecs:3000
/proc/sys/vm/dirty_ratio:40
/proc/sys/vm/dirty_writeback_centisecs:500
/proc/sys/vm/drop_caches:0
/proc/sys/vm/laptop_mode:0
/proc/sys/vm/legacy_va_layout:0
/proc/sys/vm/lowmem_reserve_ratio:256
/proc/sys/vm/max_map_count:65536
/proc/sys/vm/min_free_kbytes:2896
/proc/sys/vm/nr_pdflush_threads:2
/proc/sys/vm/overcommit_memory:0
/proc/sys/vm/overcommit_ratio:50
/proc/sys/vm/page-cluster:3
/proc/sys/vm/panic_on_oom:0
/proc/sys/vm/percpu_pagelist_fraction:0
/proc/sys/vm/readahead_hit_rate:1
/proc/sys/vm/readahead_ratio:50
/proc/sys/vm/swappiness:60
/proc/sys/vm/swap_prefetch:1
/proc/sys/vm/swap_token_timeout:300
/proc/sys/vm/vdso_enabled:1
/proc/sys/vm/vfs_cache_pressure:100

# cat /proc/mounts
rootfs / rootfs rw 0 0
/dev /dev tmpfs rw 0 0
/dev/vglinux1/lvroot / ext3 rw,data=ordered 0 0
/proc /proc proc rw 0 0
/sys /sys sysfs rw 0 0
none /dev/pts devpts rw 0 0
none /dev/shm tmpfs rw 0 0
none /proc/bus/usb usbfs rw 0 0
/dev/hda2 /boot ext2 rw 0 0
/dev/vglinux1/lvhome /home reiserfs rw 0 0
/dev/vglinux1/lvusr /usr reiserfs ro 0 0
/dev/vglinux1/lvvar /var ext3 rw,data=ordered 0 0
none /proc/sys/fs/binfmt_misc binfmt_misc rw 0 0
automount(pid1949) /vol autofs rw,fd=4,pgrp=1949,timeout=5,minproto=2,maxproto=4,indirect 0 0

~~
laurent


2006-08-10 09:20:27

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.18-rc3-mm2 - OOM storm

On Thu, 10 Aug 2006 11:04:36 +0200
Laurent Riffard <[email protected]> wrote:

> On 06.08.2006 12:08, Andrew Morton wrote:
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>
> Hello,
>
> On my system, a cron job runs every day to check the integrity of
> installed RPMs. It runs "rpm -V" on each package, which computes the
> MD5 hash of each installed file and compares it, along with the file
> size and modification time, with the values stored in the RPM database.
>
> This is the workload. Since 2.6.18-rc3-mm2, this process eats
> all the memory and triggers the OOM killer.
>
> On my system, "free -t" output normally looks like this ("cached" value
> is about half of RAM):
> # free -t
>              total       used       free     shared    buffers     cached
> Mem:        515032     508512       6520          0      22992     256032
> -/+ buffers/cache:     229488     285544
> Swap:      1116428        324    1116104
> Total:     1631460     508836    1122624
>
> After the rpm database check, "free -t" says:
>              total       used       free     shared    buffers     cached
> Mem:        515032     507124       7908          0       8132     398296
> -/+ buffers/cache:     100696     414336
> Swap:      1116428      34896    1081532
> Total:     1631460     542020    1089440
>
> And the value of "cached" won't decrease.
>

Yes, I was just trying to reproduce this. No luck so far. Will try your
.config tomorrow.

It would be interesting to try disabling CONFIG_ADAPTIVE_READAHEAD -
perhaps that got broken.

Also, are you able to determine whether the problem is specific to `rpm
-V'? Are you able to make the leak trigger using other filesystem
workloads?

If it's specific to `rpm -V' then perhaps direct-io is somehow causing
pagecache leakage. That would be a bit odd.



btw, it's not necessary to go all the way to oom to work out if the
pagecache leak is happening. After booting, do

echo 3 > /proc/sys/vm/drop_caches

and record the `Cached' figure in /proc/meminfo. After running some test,
run `echo 3 > /proc/sys/vm/drop_caches' again and check
/proc/meminfo:Cached. If it doesn't go down to a similarly low figure,
we're leaking pagecache.
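
To make that comparison less error-prone, the `Cached' figure can be pulled
out of the /proc/meminfo text programmatically. This is only a sketch: the
helper name meminfo_cached_kb() is made up here, and it assumes the usual
"Cached:   <n> kB" line format.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan /proc/meminfo-style text for the "Cached:" line and return its
 * value in kB, or -1 if no such line is found.
 * meminfo_cached_kb() is a hypothetical helper name.
 */
long meminfo_cached_kb(const char *text)
{
	const char *p = text;

	while (p && *p) {
		long kb;

		/* Match only at the start of a line, so "SwapCached:" is skipped. */
		if (strncmp(p, "Cached:", 7) == 0 &&
		    sscanf(p + 7, "%ld", &kb) == 1)
			return kb;
		p = strchr(p, '\n');
		if (p)
			p++;	/* step past the newline to the next line */
	}
	return -1;
}
```

Read the file into a buffer once after the first drop and once after the
second, and compare the two return values.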

btw2: please use /proc/meminfo output rather than free(1), because free(1)
shows less info, and it does mysterious mangling of the info which it does
read in ways which confuse me.
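
For what it's worth, the "-/+ buffers/cache" row in the free(1) output
quoted above appears to be plain arithmetic on the "Mem:" row: used minus
buffers and cached, and free plus buffers and cached. A small sketch that
checks this against the two quoted samples; the struct and function names
are invented for illustration.

```c
/* One "Mem:" row as printed by free(1); all values are in kB. */
struct mem_row {
	long total, used, free, buffers, cached;
};

/* "used" column of the "-/+ buffers/cache" row. */
long used_minus_cache(const struct mem_row *m)
{
	return m->used - m->buffers - m->cached;
}

/* "free" column of the "-/+ buffers/cache" row. */
long free_plus_cache(const struct mem_row *m)
{
	return m->free + m->buffers + m->cached;
}
```

With the first sample (used 508512, free 6520, buffers 22992, cached 256032)
this gives 229488 and 285544; with the second (used 507124, free 7908,
buffers 8132, cached 398296) it gives 100696 and 414336, matching the quoted
rows.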

Thanks.

2006-08-10 11:39:25

by Jiri Slaby

[permalink] [raw]
Subject: Re: 2.6.18-rc3-mm2 - ext3 locking issue?

[email protected] wrote:
> On Wed, 09 Aug 2006 16:43:20 EDT, [email protected] said:
>
>>> Usually this means that there's an IO request in flight and it got lost
>>> somewhere. Device driver bug, IO scheduler bug, etc. Conceivably a
>>> lost interrupt (hardware bug, PCI setup bug, etc).
>
>> Aug 9 14:30:24 turing-police kernel: [ 3535.720000] end_request: I/O error, dev fd0, sector 0
>
> Red herring. yum just wedged again, this time with no reference to floppy drive.
> Same traceback. Anybody have anything to suggest before I start playing
> hunt-the-wumpus with a -mm bisection?

Hmm, I have exactly the same problem...
yum + CFQ + BLK_DEV_PIIX + nothing odd in dmesg

[ 3438.574864] yum D 00000000 0 21659 3838 (NOTLB)
[ 3438.575098] e5c09d24 00000001 c180f5a8 00000000 e5c09ce0 c01683e8 fe37c0bc 000002c4
[ 3438.575388] 00001000 00000001 c18fbbd0 0023001f 00000007 f26cc560 c1913560 fe4166d5
[ 3438.575713] 000002c4 0009a619 00000001 f26cc66c c180ec40 c04ff140 e5c09d14 c01fad44
[ 3438.576039] Call Trace:
[ 3438.576113] [<c0373d3b>] io_schedule+0x26/0x30
[ 3438.576187] [<c014653c>] sync_page+0x39/0x45
[ 3438.576260] [<c0374401>] __wait_on_bit_lock+0x41/0x64
[ 3438.576333] [<c01464ef>] __lock_page+0x57/0x5f
[ 3438.576405] [<c014f5f2>] truncate_inode_pages_range+0x1b6/0x304
[ 3438.576480] [<c014f76f>] truncate_inode_pages+0x2f/0x40
[ 3438.576553] [<c01a7bc4>] ext3_delete_inode+0x29/0xf7
[ 3438.576627] [<c017f26b>] generic_delete_inode+0x65/0xe7
[ 3438.576701] [<c017f3aa>] generic_drop_inode+0xbd/0x173
[ 3438.576774] [<c017ed25>] iput+0x6b/0x7b
[ 3438.576846] [<c017cc57>] dentry_iput+0x68/0xb3
[ 3438.576919] [<c017d99e>] dput+0x4f/0x19f
[ 3438.576990] [<c0176164>] sys_renameat+0x1e0/0x212
[ 3438.577063] [<c01761be>] sys_rename+0x28/0x2a
[ 3438.577135] [<c01030fb>] syscall_call+0x7/0xb

regards,
--
<a href="http://www.fi.muni.cz/~xslaby/">Jiri Slaby</a>
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E

2006-08-10 12:13:43

by Frederik Deweerdt

[permalink] [raw]
Subject: [patch] Use rwsems instead of custom locking scheme in net/socket.c and net/dccp/ccid.c

On Sun, Aug 06, 2006 at 03:08:09AM -0700, Andrew Morton wrote:
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>
Hi Andrew,

This patch aims at removing two implementations (spotted by Masatake YAMATO) of
pseudo-rwlocks using a spinlock_t and an atomic_t. One in net/socket.c
and another in net/bluetooth/af_bluetooth.c. I think that both could be
converted to rwsems, saving some lines of code.

Regards,
Frederik


Signed-off-by: Frederik Deweerdt <[email protected]>

net/dccp/ccid.c | 63 ++++++++++++------------------------------------------------
net/socket.c | 58 +++++++------------------------------------------------
2 files changed, 21 insertions(+), 100 deletions(-)

diff --git a/net/dccp/ccid.c b/net/dccp/ccid.c
--- a/net/dccp/ccid.c
+++ b/net/dccp/ccid.c
@@ -12,48 +12,11 @@
*/

#include "ccid.h"
+#include <linux/rwsem.h>

static struct ccid_operations *ccids[CCID_MAX];
-#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
-static atomic_t ccids_lockct = ATOMIC_INIT(0);
-static DEFINE_SPINLOCK(ccids_lock);
+static DECLARE_RWSEM(ccids_sem);

-/*
- * The strategy is: modifications ccids vector are short, do not sleep and
- * veeery rare, but read access should be free of any exclusive locks.
- */
-static void ccids_write_lock(void)
-{
- spin_lock(&ccids_lock);
- while (atomic_read(&ccids_lockct) != 0) {
- spin_unlock(&ccids_lock);
- yield();
- spin_lock(&ccids_lock);
- }
-}
-
-static inline void ccids_write_unlock(void)
-{
- spin_unlock(&ccids_lock);
-}
-
-static inline void ccids_read_lock(void)
-{
- atomic_inc(&ccids_lockct);
- spin_unlock_wait(&ccids_lock);
-}
-
-static inline void ccids_read_unlock(void)
-{
- atomic_dec(&ccids_lockct);
-}
-
-#else
-#define ccids_write_lock() do { } while(0)
-#define ccids_write_unlock() do { } while(0)
-#define ccids_read_lock() do { } while(0)
-#define ccids_read_unlock() do { } while(0)
-#endif

static kmem_cache_t *ccid_kmem_cache_create(int obj_size, const char *fmt,...)
{
@@ -103,13 +66,13 @@ int ccid_register(struct ccid_operations
if (ccid_ops->ccid_hc_tx_slab == NULL)
goto out_free_rx_slab;

- ccids_write_lock();
+ down_write(&ccids_sem);
err = -EEXIST;
if (ccids[ccid_ops->ccid_id] == NULL) {
ccids[ccid_ops->ccid_id] = ccid_ops;
err = 0;
}
- ccids_write_unlock();
+ up_write(&ccids_sem);
if (err != 0)
goto out_free_tx_slab;

@@ -131,9 +94,9 @@ EXPORT_SYMBOL_GPL(ccid_register);

int ccid_unregister(struct ccid_operations *ccid_ops)
{
- ccids_write_lock();
+ down_write(&ccids_sem);
ccids[ccid_ops->ccid_id] = NULL;
- ccids_write_unlock();
+ up_write(&ccids_sem);

ccid_kmem_cache_destroy(ccid_ops->ccid_hc_tx_slab);
ccid_ops->ccid_hc_tx_slab = NULL;
@@ -152,15 +115,15 @@ struct ccid *ccid_new(unsigned char id,
struct ccid_operations *ccid_ops;
struct ccid *ccid = NULL;

- ccids_read_lock();
+ down_read(&ccids_sem);
#ifdef CONFIG_KMOD
if (ccids[id] == NULL) {
/* We only try to load if in process context */
- ccids_read_unlock();
+ up_read(&ccids_sem);
if (gfp & GFP_ATOMIC)
goto out;
request_module("net-dccp-ccid-%d", id);
- ccids_read_lock();
+ down_read(&ccids_sem);
}
#endif
ccid_ops = ccids[id];
@@ -170,7 +133,7 @@ #endif
if (!try_module_get(ccid_ops->ccid_owner))
goto out_unlock;

- ccids_read_unlock();
+ up_read(&ccids_sem);

ccid = kmem_cache_alloc(rx ? ccid_ops->ccid_hc_rx_slab :
ccid_ops->ccid_hc_tx_slab, gfp);
@@ -191,7 +154,7 @@ #endif
out:
return ccid;
out_unlock:
- ccids_read_unlock();
+ up_read(&ccids_sem);
goto out;
out_free_ccid:
kmem_cache_free(rx ? ccid_ops->ccid_hc_rx_slab :
@@ -235,10 +198,10 @@ static void ccid_delete(struct ccid *cci
ccid_ops->ccid_hc_tx_exit(sk);
kmem_cache_free(ccid_ops->ccid_hc_tx_slab, ccid);
}
- ccids_read_lock();
+ down_read(&ccids_sem);
if (ccids[ccid_ops->ccid_id] != NULL)
module_put(ccid_ops->ccid_owner);
- ccids_read_unlock();
+ up_read(&ccids_sem);
}

void ccid_hc_rx_delete(struct ccid *ccid, struct sock *sk)
diff --git a/net/socket.c b/net/socket.c
index 53cb85b..bc52aeb 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -85,6 +85,7 @@ #include <linux/compat.h>
#include <linux/kmod.h>
#include <linux/audit.h>
#include <linux/wireless.h>
+#include <linux/rwsem.h>

#include <asm/uaccess.h>
#include <asm/unistd.h>
@@ -143,50 +144,7 @@ #endif

static struct net_proto_family *net_families[NPROTO];

-#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
-static atomic_t net_family_lockct = ATOMIC_INIT(0);
-static DEFINE_SPINLOCK(net_family_lock);
-
-/* The strategy is: modifications net_family vector are short, do not
- sleep and veeery rare, but read access should be free of any exclusive
- locks.
- */
-
-static void net_family_write_lock(void)
-{
- spin_lock(&net_family_lock);
- while (atomic_read(&net_family_lockct) != 0) {
- spin_unlock(&net_family_lock);
-
- yield();
-
- spin_lock(&net_family_lock);
- }
-}
-
-static __inline__ void net_family_write_unlock(void)
-{
- spin_unlock(&net_family_lock);
-}
-
-static __inline__ void net_family_read_lock(void)
-{
- atomic_inc(&net_family_lockct);
- spin_unlock_wait(&net_family_lock);
-}
-
-static __inline__ void net_family_read_unlock(void)
-{
- atomic_dec(&net_family_lockct);
-}
-
-#else
-#define net_family_write_lock() do { } while(0)
-#define net_family_write_unlock() do { } while(0)
-#define net_family_read_lock() do { } while(0)
-#define net_family_read_unlock() do { } while(0)
-#endif
-
+static DECLARE_RWSEM(net_family_sem);

/*
* Statistics counters of the socket lists
@@ -1132,7 +1090,7 @@ #if defined(CONFIG_KMOD)
}
#endif

- net_family_read_lock();
+ down_read(&net_family_sem);
if (net_families[family] == NULL) {
err = -EAFNOSUPPORT;
goto out;
@@ -1185,7 +1143,7 @@ #endif
goto out_release;

out:
- net_family_read_unlock();
+ up_read(&net_family_sem);
return err;
out_module_put:
module_put(net_families[family]->owner);
@@ -2034,13 +1992,13 @@ int sock_register(struct net_proto_famil
printk(KERN_CRIT "protocol %d >= NPROTO(%d)\n", ops->family, NPROTO);
return -ENOBUFS;
}
- net_family_write_lock();
+ down_write(&net_family_sem);
err = -EEXIST;
if (net_families[ops->family] == NULL) {
net_families[ops->family]=ops;
err = 0;
}
- net_family_write_unlock();
+ up_write(&net_family_sem);
printk(KERN_INFO "NET: Registered protocol family %d\n",
ops->family);
return err;
@@ -2057,9 +2015,9 @@ int sock_unregister(int family)
if (family < 0 || family >= NPROTO)
return -1;

- net_family_write_lock();
+ down_write(&net_family_sem);
net_families[family]=NULL;
- net_family_write_unlock();
+ up_write(&net_family_sem);
printk(KERN_INFO "NET: Unregistered protocol family %d\n",
family);
return 0;

2006-08-10 12:57:21

by David Miller

[permalink] [raw]
Subject: Re: [patch] Use rwsems instead of custom locking scheme in net/socket.c and net/dccp/ccid.c

From: Frederik Deweerdt <[email protected]>
Date: Thu, 10 Aug 2006 14:13:36 +0200

> This patch aims at removing two implementations (spotted by Masatake YAMATO) of
> pseudo-rwlocks using a spinlock_t and an atomic_t. One in net/socket.c
> and another in net/bluetooth/af_bluetooth.c. I think that both could be
> converted to rwsems, saving some lines of code.

The net/socket.c one has been converted to RCU by Stephen
Hemminger already.

If the bluetooth case is in an important code path it should
use RCU as well.

2006-08-10 13:19:20

by Frederik Deweerdt

[permalink] [raw]
Subject: Re: [patch] Use rwsems instead of custom locking scheme in net/socket.c and net/dccp/ccid.c

On Thu, Aug 10, 2006 at 05:57:11AM -0700, David Miller wrote:
> From: Frederik Deweerdt <[email protected]>
> Date: Thu, 10 Aug 2006 14:13:36 +0200
>
> > This patch aims at removing two implementations (spotted by Masatake YAMATO) of
> > pseudo-rwlocks using a spinlock_t and an atomic_t. One in net/socket.c
> > and another in net/bluetooth/af_bluetooth.c. I think that both could be
> > converted to rwsems, saving some lines of code.
>
> The net/socket.c one has been converted to RCU by Stephen
> Hemminger already.
>
> If the bluetooth case is in an important code path it should
> use RCU as well.
Sorry, I made a mistake there: net/bluetooth/af_bluetooth.c should read
net/dccp/ccid.c. Does your comment regarding af_bluetooth.c apply to
ccid.c as well?
Also, is there a place where I can find Stephen Hemminger's work?
- Note, this is pure curiosity, it can wait a kernel release or two :) -

Thanks,
Frederik

2006-08-10 13:44:13

by Reuben Farrelly

[permalink] [raw]
Subject: Re: 2.6.18-rc3-mm2 [oops: shrink_dcache_for_umount_subtree ?]



On 6/08/2006 10:08 p.m., Andrew Morton wrote:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>
> - 2.6.18-rc3-mm1 gets mysterious udev timeouts during boot and crashes in
> NFS. This kernel reverts the patches which were causing that.

Just hit this one upon shutdown (no traces logged before then):

INIT: Sending processes the TERM signal
INITStopping clamd: [FAILED]
Starting killall: Stopping clamd: [FAILED]
[ OK ]
Sending all processes the TERM signal...
Sending all processes the KILL signal...
Saving random seed:
Syncing hardware clock to system time
Turning off swap:
Unmounting file systems: umount2: Device or resource busy
umount: /var/www/html: device is busy
umount2: Device or resource busy
umount: /var/www/html: device is busy
BUG: Dentry ffff81003d0f34f0{i=3,n=.reiserfs_priv} still in use (1) [unmount of
reiserfs sdc8]
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/dcache.c:611
invalid opcode: 0000 [1] SMP
last sysfs file:
/devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2/2-1.2:1.0/bInterfaceProtocol
CPU 0
Modules linked in: ipv6 ip_gre binfmt_misc i2c_i801 iTCO_wdt serio_raw
Pid: 22715, comm: umount Not tainted 2.6.18-rc3-mm2 #1
RIP: 0010:[<ffffffff802ce943>] [<ffffffff802ce943>]
shrink_dcache_for_umount_subtree+0x1a3/0x2a7
RSP: 0018:ffff81002ec6fd98 EFLAGS: 00010292
RAX: 0000000000000062 RBX: ffff81003d0f34f0 RCX: 0000000000000003
RDX: 0000000000000008 RSI: ffff810035224740 RDI: ffff810035224040
RBP: ffff81002ec6fdb8 R08: 0000000000000001 R09: 0000000000000001
R10: ffffffff80216800 R11: 0000000000000000 R12: ffff81003d0f34f0
R13: ffff8100025b2ce8 R14: ffff81002f936d30 R15: 0000000000000000
FS: 00002b532ecdd4b0(0000) GS:ffffffff808b5000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b532ecd0000 CR3: 000000003273e000 CR4: 00000000000006e0
Process umount (pid: 22715, threadinfo ffff81002ec6e000, task ffff810035224040)
Stack: ffff81003d29c980 ffff81003d29c588 ffffffff80595640 ffff81002ec6fea8
ffff81002ec6fdd8 ffffffff802ceea9 ffffffff805955e0 ffff81003d29c588
ffff81002ec6fe08 ffffffff802c6944 ffff81002f936d30 ffff81003e99e2c0
Call Trace:
[<ffffffff802ceea9>] shrink_dcache_for_umount+0x37/0x6e
[<ffffffff802c6944>] generic_shutdown_super+0x24/0x151
[<ffffffff802c6a97>] kill_block_super+0x26/0x3b
[<ffffffff802c6b65>] deactivate_super+0x4c/0x67
[<ffffffff8022d061>] mntput_no_expire+0x58/0x92
[<ffffffff80232562>] path_release_on_umount+0x1d/0x2b
[<ffffffff802d1182>] sys_umount+0x252/0x29b
[<ffffffff8025f45e>] system_call+0x7e/0x83
DWARF2 unwinder stuck at system_call+0x7e/0x83
Leftover inexact backtrace:


Code: 0f 0b 68 c9 47 4c 80 c2 63 02 4c 8b 63 50 49 39 dc 75 05 45
RIP [<ffffffff802ce943>] shrink_dcache_for_umount_subtree+0x1a3/0x2a7
RSP <ffff81002ec6fd98>
/etc/rc6.d/S01reboot: line 14: 22715 Segmentation fault "$@"

/var/www/html: c
/var: mcm
Unmounting file systems (retry):<3>BUG: Dentry
ffff81003ef61e80{i=3,n=.reiserfs_priv} still in use (1) [unmount of reiserfs sda8]
umount2: Devic----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at fs/dcache.c:611
invalid opcode: 0000 [2] SMP
last sysfs file:
/devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2/2-1.2:1.0/bInterfaceProtocol
CPU 1
Modules linked in: ipv6 ip_gre binfmt_misc i2c_i801 iTCO_wdt serio_raw
Pid: 22722, comm: umount Not tainted 2.6.18-rc3-mm2 #1
RIP: 0010:[<ffffffff802ce943>] e or resource bu [<ffffffff802ce943>]
shrink_dcache_for_umount_subtree+0x1a3/0x2a7
RSP: 0018:ffff810027e1dd98 EFLAGS: 00010292
RAX: 0000000000000062 RBX: ffff81003ef61e80 RCX: 0000000000000000
sy
umount: /varRDX: ffff810015f99140 RSI: 0000000000000001 RDI: 0000000000000000
RBP: ffff810027e1ddb8 R08: 0000000000000002 R09: 0000000000000001
R10: ffffffff80216800 R11: 0000000000000001 R12: ffff81003ef61e80
/www/html: devicR13: ffff8100131f3648 R14: ffff81002f936e18 R15: 0000000000000000
FS: 00002b52520af4b0(0000) GS:ffff81003f6eb430(0000) knlGS:0000000000000000
e is busy
umounCS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fff58a02ea0 CR3: 000000002ec49000 CR4: 00000000000006e0
t2: Device or reProcess umount (pid: 22722, threadinfo ffff810027e1c000, task
ffff810015f99140)
Stack: ffff81003d29d198 ffff81003d29cda0 ffffffff80595640 ffff810027e1dea8
ffff810027e1ddd8 ffffffff802ceea9 ffffffff805955e0 ffff81003d29cda0
ffff810027e1de08 ffffffff802c6944 ffff81002f936e18 ffff81003ebaa938
Call Trace:
source busy
umo [<ffffffff802ceea9>] shrink_dcache_for_umount+0x37/0x6e
unt: /var/www/ht [<ffffffff802c6944>] generic_shutdown_super+0x24/0x151
ml: device is bu [<ffffffff802c6a97>] kill_block_super+0x26/0x3b
sy
[<ffffffff802c6b65>] deactivate_super+0x4c/0x67
[<ffffffff8022d061>] mntput_no_expire+0x58/0x92
[<ffffffff80232562>] path_release_on_umount+0x1d/0x2b
[<ffffffff802d1182>] sys_umount+0x252/0x29b
[<ffffffff8025f45e>] system_call+0x7e/0x83
DWARF2 unwinder stuck at system_call+0x7e/0x83
Leftover inexact backtrace:


Code: 0f 0b 68 c9 47 4c 80 c2 63 02 4c 8b 63 50 49 39 dc 75 05 45
RIP [<ffffffff802ce943>] shrink_dcache_for_umount_subtree+0x1a3/0x2a7
RSP <ffff810027e1dd98>
/etc/rc6.d/S01reboot: line 14: 22722 Segmentation fault "$@"

/var/www/html: c
/var: mcm

Yes, there are bits of the shutdown mixed in which doesn't really help readability.

The reason I shut the box down was due to yum hanging and becoming a 'D' process
which was unkillable.

What is strange is that /var/www/html should not be busy as there are no mounts
underneath it. It's just a standard ext3 partition.

[root@tornado ~]# mount
/dev/md0 on / type ext3 (rw)
none on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
none on /dev/shm type tmpfs (rw)
/dev/sda1 on /boot type ext3 (rw)
/dev/md1 on /home type ext3 (rw)
/dev/md2 on /var type ext3 (rw)
/dev/md3 on /var/www/html type ext3 (rw)
/dev/md4 on /var/www/cgi-bin type ext3 (rw)
/dev/md5 on /store type ext3 (rw)
/dev/sda8 on /var/spool/squid-1 type reiserfs (rw,noatime,notail)
/dev/sdc8 on /var/spool/squid-2 type reiserfs (rw,noatime,notail)
/dev/sda9 on /tmp type ext3 (rw)
/dev/shm on /var/spool/amavisd/tmp type tmpfs (rw,size=25m,mode=700,uid=101,gid=511)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
[root@tornado ~]#

Looks identical to
http://www.uwsg.iu.edu/hypermail/linux/kernel/0606.3/2802.html which hasn't
appeared since then. I remember it was reproducible at the time, but it
disappeared for a while and has only just come back.

Reuben



2006-08-10 15:27:54

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.18-rc3-mm2 - ext3 locking issue?

On Thu, 10 Aug 2006 13:39:11 +0159
Jiri Slaby <[email protected]> wrote:

> [email protected] wrote:
> > On Wed, 09 Aug 2006 16:43:20 EDT, [email protected] said:
> >
> >>> Usually this means that there's an IO request in flight and it got lost
> >>> somewhere. Device driver bug, IO scheduler bug, etc. Conceivably a
> >>> lost interrupt (hardware bug, PCI setup bug, etc).
> >
> >> Aug 9 14:30:24 turing-police kernel: [ 3535.720000] end_request: I/O error, dev fd0, sector 0
> >
> > Red herring. yum just wedged again, this time with no reference to floppy drive.
> > Same traceback. Anybody have anything to suggest before I start playing
> > hunt-the-wumpus with a -mm bisection?
>
> Hmm, I have exactly the same problem...
> yum + CFQ + BLK_DEV_PIIX + nothing odd in dmesg
>
> [ 3438.574864] yum D 00000000 0 21659 3838 (NOTLB)
> [ 3438.575098] e5c09d24 00000001 c180f5a8 00000000 e5c09ce0 c01683e8 fe37c0bc 000002c4
> [ 3438.575388] 00001000 00000001 c18fbbd0 0023001f 00000007 f26cc560 c1913560 fe4166d5
> [ 3438.575713] 000002c4 0009a619 00000001 f26cc66c c180ec40 c04ff140 e5c09d14 c01fad44
> [ 3438.576039] Call Trace:
> [ 3438.576113] [<c0373d3b>] io_schedule+0x26/0x30
> [ 3438.576187] [<c014653c>] sync_page+0x39/0x45
> [ 3438.576260] [<c0374401>] __wait_on_bit_lock+0x41/0x64
> [ 3438.576333] [<c01464ef>] __lock_page+0x57/0x5f
> [ 3438.576405] [<c014f5f2>] truncate_inode_pages_range+0x1b6/0x304
> [ 3438.576480] [<c014f76f>] truncate_inode_pages+0x2f/0x40
> [ 3438.576553] [<c01a7bc4>] ext3_delete_inode+0x29/0xf7
> [ 3438.576627] [<c017f26b>] generic_delete_inode+0x65/0xe7
> [ 3438.576701] [<c017f3aa>] generic_drop_inode+0xbd/0x173
> [ 3438.576774] [<c017ed25>] iput+0x6b/0x7b
> [ 3438.576846] [<c017cc57>] dentry_iput+0x68/0xb3
> [ 3438.576919] [<c017d99e>] dput+0x4f/0x19f
> [ 3438.576990] [<c0176164>] sys_renameat+0x1e0/0x212
> [ 3438.577063] [<c01761be>] sys_rename+0x28/0x2a
> [ 3438.577135] [<c01030fb>] syscall_call+0x7/0xb
>

Is yum the only process which was stuck in D state?

If so, I'd still be expecting a device driver/iosched bug.

If not, it's probably a vfs/fs deadlock.
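
One way to answer that without sysrq-t is to walk /proc/<pid>/stat and look
at the state field of each task. The parsing step is sketched below in C.
The helper name proc_stat_state() is hypothetical, and the only assumption
about the file format is the documented "pid (comm) state ..." layout.

```c
#include <string.h>

/*
 * Return the one-letter state field from a /proc/<pid>/stat line, or 0
 * if the line is malformed.  The command name sits in parentheses and
 * may itself contain spaces or ')', so search for the last ')'.
 * proc_stat_state() is a hypothetical helper name.
 */
char proc_stat_state(const char *stat_line)
{
	const char *rp = strrchr(stat_line, ')');

	if (!rp || rp[1] != ' ' || rp[2] == '\0')
		return 0;
	return rp[2];
}
```

A task is in uninterruptible sleep when this returns 'D'. Counting how many
tasks report 'D' gives a first hint as to which of the two cases applies.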

2006-08-10 15:38:16

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.6.18-rc3-mm2 [oops: shrink_dcache_for_umount_subtree ?]

On Fri, 11 Aug 2006 01:43:53 +1200
Reuben Farrelly <[email protected]> wrote:

>
>
> On 6/08/2006 10:08 p.m., Andrew Morton wrote:
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> >
> > - 2.6.18-rc3-mm1 gets mysterious udev timeouts during boot and crashes in
> > NFS. This kernel reverts the patches which were causing that.
>
> Just hit this one upon shutdown (no traces logged before then):
>
> INIT: Sending processes the TERM signal
> INITStopping clamd: [FAILED]
> Starting killall: Stopping clamd: [FAILED]
> [ OK ]
> Sending all processes the TERM signal...
> Sending all processes the KILL signal...
> Saving random seed:
> Syncing hardware clock to system time
> Turning off swap:
> Unmounting file systems: umount2: Device or resource busy
> umount: /var/www/html: device is busy
> umount2: Device or resource busy
> umount: /var/www/html: device is busy
> BUG: Dentry ffff81003d0f34f0{i=3,n=.reiserfs_priv} still in use (1) [unmount of
> reiserfs sdc8]
> ----------- [cut here ] --------- [please bite here ] ---------
> Kernel BUG at fs/dcache.c:611
> invalid opcode: 0000 [1] SMP
> last sysfs file:
> /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.2/2-1.2:1.0/bInterfaceProtocol
> CPU 0
> Modules linked in: ipv6 ip_gre binfmt_misc i2c_i801 iTCO_wdt serio_raw
> Pid: 22715, comm: umount Not tainted 2.6.18-rc3-mm2 #1
> RIP: 0010:[<ffffffff802ce943>] [<ffffffff802ce943>]
> shrink_dcache_for_umount_subtree+0x1a3/0x2a7
> RSP: 0018:ffff81002ec6fd98 EFLAGS: 00010292
> RAX: 0000000000000062 RBX: ffff81003d0f34f0 RCX: 0000000000000003
> RDX: 0000000000000008 RSI: ffff810035224740 RDI: ffff810035224040
> RBP: ffff81002ec6fdb8 R08: 0000000000000001 R09: 0000000000000001
> R10: ffffffff80216800 R11: 0000000000000000 R12: ffff81003d0f34f0
> R13: ffff8100025b2ce8 R14: ffff81002f936d30 R15: 0000000000000000
> FS: 00002b532ecdd4b0(0000) GS:ffffffff808b5000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> CR2: 00002b532ecd0000 CR3: 000000003273e000 CR4: 00000000000006e0
> Process umount (pid: 22715, threadinfo ffff81002ec6e000, task ffff810035224040)
> Stack: ffff81003d29c980 ffff81003d29c588 ffffffff80595640 ffff81002ec6fea8
> ffff81002ec6fdd8 ffffffff802ceea9 ffffffff805955e0 ffff81003d29c588
> ffff81002ec6fe08 ffffffff802c6944 ffff81002f936d30 ffff81003e99e2c0
> Call Trace:
> [<ffffffff802ceea9>] shrink_dcache_for_umount+0x37/0x6e
> [<ffffffff802c6944>] generic_shutdown_super+0x24/0x151
> [<ffffffff802c6a97>] kill_block_super+0x26/0x3b
> [<ffffffff802c6b65>] deactivate_super+0x4c/0x67
> [<ffffffff8022d061>] mntput_no_expire+0x58/0x92
> [<ffffffff80232562>] path_release_on_umount+0x1d/0x2b
> [<ffffffff802d1182>] sys_umount+0x252/0x29b
> [<ffffffff8025f45e>] system_call+0x7e/0x83

yup, thanks. We're expecting that
ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/hot-fixes/reiserfs-make-sure-all-dentries-refs-are-released-before-calling-kill_block_super-try-2.patch
will fix this.

2006-08-10 17:34:21

by Mattia Dongili

Subject: Re: 2.6.18-rc3-mm2 - ext3 locking issue?

On Thu, Aug 10, 2006 at 08:27:49AM -0700, Andrew Morton wrote:
> On Thu, 10 Aug 2006 13:39:11 +0159
> Jiri Slaby <[email protected]> wrote:
>
> > [email protected] wrote:
> > > On Wed, 09 Aug 2006 16:43:20 EDT, [email protected] said:
> > >
> > >>> Usually this means that there's an IO request in flight and it got lost
> > >>> somewhere. Device driver bug, IO scheduler bug, etc. Conceivably a
> > >>> lost interrupt (hardware bug, PCI setup bug, etc).
> > >
> > >> Aug 9 14:30:24 turing-police kernel: [ 3535.720000] end_request: I/O error, dev fd0, sector 0
> > >
> > > Red herring. yum just wedged again, this time with no reference to floppy drive.
> > > Same traceback. Anybody have anything to suggest before I start playing
> > > hunt-the-wumpus with a -mm bisection?
> >
> > > Hmm, I have exactly the same problem...
> > yum + CFQ + BLK_DEV_PIIX + nothing odd in dmesg

oooh, same setup and same trace here, but no yum, see some screenshots
here:
http://oioio.altervista.org/linux/dsc03448.jpg
http://oioio.altervista.org/linux/dsc03449.jpg

The use case for me was simply:
- boot (in single user for the 2 shots)
- suspend
- resume
- wait some seconds and do anything that accesses the disk

[...]
> Is yum the only process which was stuck in D state?

in my case anything accessing the disk, leading to lockup shortly

> If so, I'd still be expecting a device driver/iosched bug.
>
> If not, it's probably a vfs/fs deadlock.

I reverted the full git-block.patch and have been running rc3-mm2 since
then, suspending to RAM and disk and using my laptop for daily stuff:

reboot system boot 2.6.18-rc3-mm2-1 Tue Aug 8 00:02 - 19:30 (2+19:27)

PS: my previous posts are here: http://lkml.org/lkml/2006/8/7/264
probably an unfortunate Cc list :)

--
mattia
:wq!

2006-08-10 17:39:24

by Valdis Klētnieks

Subject: Re: 2.6.18-rc3-mm2 - IPV6_MULTIPLE_TABLES borked....

On Sun, 06 Aug 2006 03:08:09 PDT, Andrew Morton said:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/

Building a kernel with IPV6_MULTIPLE_TABLES=y breaks my IPv6 connectivity
quite badly. It basically totally refuses to answer an IPv6 Neighbor Solicit
packet or IPv6 Echo Request packet. I run a 'tcpdump -n ipv6', and I see the
requests come in, and no packets leaving. Interestingly enough, if I try to
ping6 *out* of the box, it's totally willing to send a Neighbor Solicit outbound
(although it appears to totally ignore the Neighbor Advert packet that comes
back). Of course, things don't work very well at all with busticated Neighbor
Solicit.

A kernel built with IPV6_MULTIPLE_TABLES=n works just fine.

The relevant ifconfig (eth3 is a 100mbit port, eth5 is a wireless card):

eth3 Link encap:Ethernet HWaddr 00:06:5B:EA:8E:4E
inet addr:128.173.14.107 Bcast:128.173.15.255 Mask:255.255.252.0
inet6 addr: 2001:468:c80:2103:206:5bff:feea:8e4e/64 Scope:Global
inet6 addr: fe80::206:5bff:feea:8e4e/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:15529 errors:0 dropped:0 overruns:1 frame:0
TX packets:2073 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2333290 (2.2 MiB) TX bytes:228862 (223.4 KiB)
Interrupt:11 Base address:0x6800

eth5 Link encap:Ethernet HWaddr 00:02:2D:5C:11:48
inet addr:198.82.168.129 Bcast:198.82.168.255 Mask:255.255.255.0
inet6 addr: 2001:468:c80:2181:202:2dff:fe5c:1148/64 Scope:Global
inet6 addr: fe80::202:2dff:fe5c:1148/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2096 errors:0 dropped:0 overruns:0 frame:0
TX packets:144 errors:1 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:280919 (274.3 KiB) TX bytes:22184 (21.6 KiB)
Interrupt:11 Base address:0xe100

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:1583 errors:0 dropped:0 overruns:0 frame:0
TX packets:1583 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:642598 (627.5 KiB) TX bytes:642598 (627.5 KiB)

A working routing table:

netstat -r -n -A inet6
Kernel IPv6 routing table
Destination Next Hop Flags Metric Ref Use Iface
::1/128 :: U 0 12 1 lo
2001:468:c80:2103:206:5bff:feea:8e4e/128 :: U 0 4 1 lo
2001:468:c80:2103::/64 :: UA 256 113 0 eth3
2001:468:c80:2181:202:2dff:fe5c:1148/128 :: U 0 0 1 lo
2001:468:c80:2181::/64 :: UA 256 11 0 eth5
fe80::202:2dff:fe5c:1148/128 :: U 0 0 1 lo
fe80::206:5bff:feea:8e4e/128 :: U 0 2 1 lo
fe80::/64 :: U 256 0 0 eth3
fe80::/64 :: U 256 0 0 eth5
ff02::1/128 ff02::1 UC 0 113 0 eth3
ff02::1/128 ff02::1 UC 0 1 0 eth5
ff00::/8 :: U 256 0 0 eth3
ff00::/8 :: U 256 0 0 eth5
::/0 fe80::20f:35ff:fe3e:d41a UGDA 1024 1 0 eth3
::/0 fe80::20f:35ff:fe3e:d41a UGDA 1024 1 0 eth5




2006-08-10 17:43:59

by Jiri Slaby

Subject: Re: 2.6.18-rc3-mm2 - ext3 locking issue?

On 8/10/06, Mattia Dongili <[email protected]> wrote:
> On Thu, Aug 10, 2006 at 08:27:49AM -0700, Andrew Morton wrote:
> > On Thu, 10 Aug 2006 13:39:11 +0159
> > Jiri Slaby <[email protected]> wrote:
> >
> > > [email protected] wrote:
> > > > On Wed, 09 Aug 2006 16:43:20 EDT, [email protected] said:
> > > >
> > > >>> Usually this means that there's an IO request in flight and it got lost
> > > >>> somewhere. Device driver bug, IO scheduler bug, etc. Conceivably a
> > > >>> lost interrupt (hardware bug, PCI setup bug, etc).
> > > >
> > > >> Aug 9 14:30:24 turing-police kernel: [ 3535.720000] end_request: I/O error, dev fd0, sector 0
> > > >
> > > > Red herring. yum just wedged again, this time with no reference to floppy drive.
> > > > Same traceback. Anybody have anything to suggest before I start playing
> > > > hunt-the-wumpus with a -mm bisection?
> > >
> > > > Hmm, I have exactly the same problem...
> > > yum + CFQ + BLK_DEV_PIIX + nothing odd in dmesg
>
> oooh, same setup and same trace here, but no yum, see some screenshots
> here:
> http://oioio.altervista.org/linux/dsc03448.jpg
> http://oioio.altervista.org/linux/dsc03449.jpg

This is reiserfs ^^?! So can we exclude the fs? I have this behaviour on ext3.

regards,
--
<a href="http://www.fi.muni.cz/~xslaby/">Jiri Slaby</a>
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E

2006-08-10 17:44:52

by Valdis Klētnieks

Subject: Re: 2.6.18-rc3-mm2 - ext3 locking issue?

On Thu, 10 Aug 2006 19:33:13 +0200, Mattia Dongili said:

> oooh, same setup and same trace here, but no yum, see some screenshots
> here:
> http://oioio.altervista.org/linux/dsc03448.jpg
> http://oioio.altervista.org/linux/dsc03449.jpg

Not quite the same trace - the first few lines are the same, but your call to
__lock_page() comes in via do_generic_mapping_read(), while Jiri and I are
seeing the call to __lock_page() coming from truncate_inode_pages_range()....
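
The comparison above (who called __lock_page()) can be done mechanically on a pasted trace. A minimal sketch in Python, assuming frames look like "[<address>] symbol+0xoff/0xlen" as in the traces in this thread; the helper names are made up for illustration:

```python
import re

# Matches "[<c01464ef>] __lock_page+0x57/0x5f" and captures the symbol name.
# Timestamps such as "[ 3438.576333]" have no "<" and are skipped.
FRAME = re.compile(r"\[<[0-9a-f]+>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def frames(trace):
    """Return symbol names in the order they appear (innermost first)."""
    return FRAME.findall(trace)


def caller_of(trace, func):
    """Return the frame listed right after `func`, or None if unknown."""
    names = frames(trace)
    if func not in names:
        return None
    i = names.index(func)
    return names[i + 1] if i + 1 < len(names) else None


# Three frames copied from the yum trace quoted earlier in the thread.
yum_trace = """
> [ 3438.576260] [<c0374401>] __wait_on_bit_lock+0x41/0x64
> [ 3438.576333] [<c01464ef>] __lock_page+0x57/0x5f
> [ 3438.576405] [<c014f5f2>] truncate_inode_pages_range+0x1b6/0x304
"""

print(caller_of(yum_trace, "__lock_page"))
```

Running the same helper on two traces and comparing the results shows quickly whether two reports share a code path or only the final wait.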



2006-08-10 20:02:40

by Patrick McHardy

Subject: Re: 2.6.18-rc3-mm2 - IPV6_MULTIPLE_TABLES borked....

[IPV6]: Fix policy routing lookup

When the lookup in a table returns ip6_null_entry the policy routing lookup
returns it instead of continuing in the next table, which effectively means
it only searches the local table.

Signed-off-by: Patrick McHardy <[email protected]>
Signed-off-by: David S. Miller <[email protected]>

---
commit 2b885e76c2b2c74d2dfe86a8140f0b41149f327c
tree 767711f03ea3e990ce02b3720718b77490027793
parent 5bd721a145d02a89a9b69adf3ede9d0b3647ae8b
author Patrick McHardy <[email protected]> Sun, 06 Aug 2006 22:24:08 -0700
committer David S. Miller <[email protected]> Sun, 06 Aug 2006 22:24:08 -0700

net/ipv6/fib6_rules.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index c3c8195..94a46ec 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -94,8 +94,10 @@ int fib6_rule_action(struct fib_rule *ru

 	if (rt != &ip6_null_entry)
 		goto out;
-
 	dst_release(&rt->u.dst);
+	rt = NULL;
+	goto out;
+
 discard_pkt:
 	dst_hold(&rt->u.dst);
 out:
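
The effect described in the changelog can be illustrated outside the kernel. The following is only a toy model in Python, not the kernel code: tables are plain dicts with exact-match keys (real FIB lookups are longest-prefix matches), and NULL_ENTRY, lookup_buggy and lookup_fixed are names invented for this sketch.

```python
NULL_ENTRY = object()  # stands in for ip6_null_entry ("no route here")


def lookup_buggy(tables, dest):
    """Old behaviour: whatever the first table says is final."""
    for table in tables:
        return table.get(dest, NULL_ENTRY)  # never reaches the next table
    return NULL_ENTRY


def lookup_fixed(tables, dest):
    """Patched behaviour: a null entry means 'try the next table'."""
    for table in tables:
        rt = table.get(dest, NULL_ENTRY)
        if rt is not NULL_ENTRY:
            return rt
    return NULL_ENTRY


# Hypothetical table contents, loosely based on the routing table posted above.
local_table = {"::1/128": "lo"}
main_table = {"2001:468:c80:2103::/64": "eth3"}
tables = [local_table, main_table]

print(lookup_buggy(tables, "2001:468:c80:2103::/64") is NULL_ENTRY)
print(lookup_fixed(tables, "2001:468:c80:2103::/64"))
```

In the buggy variant only the local table is ever consulted, which matches the symptom reported earlier: addresses on the box itself work, everything that needs the main table does not.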


Attachments:
x (1.04 kB)

2006-08-10 21:45:13

by Valdis Klētnieks

Subject: Re: 2.6.18-rc3-mm2 - IPV6_MULTIPLE_TABLES borked....

On Thu, 10 Aug 2006 22:02:03 +0200, Patrick McHardy said:

> [email protected] wrote:
> > On Sun, 06 Aug 2006 03:08:09 PDT, Andrew Morton said:
> >
> >>ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> >
> >
> > Building a kernel with IPV6_MULTIPLE_TABLES=y breaks my IPv6 connectivity

> It should be fixed by this patch (already contained in net-2.6.19).

Confirmed fixed, thanks...



2006-08-10 23:18:58

by Laurent Riffard

Subject: Re: 2.6.18-rc3-mm2 - OOM storm



On 10.08.2006 11:19, Andrew Morton wrote:
> On Thu, 10 Aug 2006 11:04:36 +0200
> Laurent Riffard <[email protected]> wrote:
>
>> On 06.08.2006 12:08, Andrew Morton wrote:
>>> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>> Hello,
>>
>> On my system, a cron job runs every day to check the integrity of
>> installed RPMs. It runs "rpm -V" on each package, which computes an
>> MD5 hash for each installed file and compares this result, the file
>> size and modification time with the values stored in the RPM database.
>>
>> This is the workload. Since 2.6.18-rc3-mm2, this process eats
>> all the memory and triggers OOM.
>>
>> On my system, "free -t" output normally looks like this ("cached" value
>> is about half of RAM):
>> # free -t
>> total used free shared buffers cached
>> Mem: 515032 508512 6520 0 22992 256032
>> -/+ buffers/cache: 229488 285544
>> Swap: 1116428 324 1116104
>> Total: 1631460 508836 1122624
>>
>> After the rpm database check, "free -t" says:
>> total used free shared buffers cached
>> Mem: 515032 507124 7908 0 8132 398296
>> -/+ buffers/cache: 100696 414336
>> Swap: 1116428 34896 1081532
>> Total: 1631460 542020 1089440
>>
>> And the value of "cached" won't decrease.
>>
>
> Yes, I was just trying to reproduce this. No luck so far. Will try your
> .config tomorrow.
>
> It would be interesting to try disabling CONFIG_ADAPTIVE_READAHEAD -
> perhaps that got broken.

I just tried it: with CONFIG_ADAPTIVE_READAHEAD disabled,
/proc/meminfo:Cached is stable and never exceeded 230,000 kB, and the
system didn't even try to swap.

$ cat /proc/meminfo # taken a few minutes after the end of rpm -V
MemTotal: 515032 kB
MemFree: 6612 kB
Buffers: 42212 kB
Cached: 182236 kB
SwapCached: 0 kB
Active: 376256 kB
Inactive: 75468 kB
SwapTotal: 1116428 kB
SwapFree: 1116428 kB
Dirty: 272 kB
Writeback: 0 kB
AnonPages: 227260 kB
Mapped: 62812 kB
Slab: 44968 kB
PageTables: 2152 kB
NFS Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 1373944 kB
Committed_AS: 637400 kB
VmallocTotal: 515796 kB
VmallocUsed: 6916 kB
VmallocChunk: 508760 kB

> Also, are you able to determine whether the problem is specific to `rpm
> -V'? Are you able to make the leak trigger using other filesystem
> workloads?

Will try...

> If it's specific to `rpm -V' then perhaps direct-io is somehow causing
> pagecache leakage. That would be a bit odd.
>
>
>
> btw, it's not necessary to go all the way to oom to work out if the
> pagecache leak is happening. After booting, do
>
> echo 3 > /proc/sys/vm/drop_caches
>
> and record the `Cached' figure in /proc/meminfo. After running some test,
> run `echo 3 > /proc/sys/vm/drop_caches' again and check
> /proc/meminfo:Cached. If it doesn't go down to a similarly low figure,
> we're leaking pagecache.

I played with these values and, as far as I can remember, I got only a
slight improvement. Will try to gather some data.
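
The before/after comparison suggested above is easy to script. A minimal sketch in Python, assuming the usual "Name: value kB" layout of /proc/meminfo; the function names and the 1.5x slack factor are arbitrary choices for illustration, not something from this thread:

```python
def parse_meminfo(text):
    """Turn /proc/meminfo-style text into a {field name: value in kB} dict."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cached_kb(text):
    """Return the Cached figure in kB."""
    return parse_meminfo(text)["Cached"]


def looks_leaky(baseline_kb, after_drop_kb, slack=1.5):
    """True if Cached after a drop is still well above the post-boot baseline.

    `slack` is an illustrative threshold, not a kernel-defined value.
    """
    return after_drop_kb > baseline_kb * slack


sample = "MemTotal: 515032 kB\nCached: 182236 kB\nNFS Unstable: 0 kB\n"
print(cached_kb(sample))
```

In practice one would feed it the contents of /proc/meminfo read right after each "echo 3 > /proc/sys/vm/drop_caches" and compare the two Cached values.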

> btw2: please use /proc/meminfo output rather than free(1). Because free(1)
> shows less info, and it does mysterious mangling of the info which it does
> read in ways which confuse me.

Ok
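
For reference, the "-/+ buffers/cache" row printed by free(1) in the samples above is plain arithmetic on the Mem row: buffers and cached are subtracted from "used" and added to "free". A short Python check using the figures quoted earlier in this message (the function name is made up for the example):

```python
def free_adjusted(used, free, buffers, cached):
    """Reproduce the '-/+ buffers/cache' row from the 'Mem:' row (all in kB)."""
    reclaimable = buffers + cached
    return used - reclaimable, free + reclaimable


# First sample ("normal" state): used, free, buffers, cached
before = free_adjusted(508512, 6520, 22992, 256032)
# Second sample (after the rpm database check)
after = free_adjusted(507124, 7908, 8132, 398296)

print(before)  # (229488, 285544)
print(after)   # (100696, 414336)
```

Both results match the rows in the quoted output, so free(1) adds nothing that /proc/meminfo does not already contain.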

--
laurent

2006-08-11 02:16:14

by Valdis Klētnieks

Subject: 2.6.18-rc3-mm2 - BUG in rt6_lookup() from ipv6_del_addr()

On Sun, 06 Aug 2006 03:08:09 PDT, Andrew Morton said:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/

After applying the patch that Patrick McHardy pointed me at, it lived
longer. However, I'm now seeing problems at system shutdown (or anytime
you try to 'ifdown ethX' where ethX has an IPv6 address attached to it):

[ 196.346000] BUG: unable to handle kernel NULL pointer dereference at virtual address 00000014
[ 196.347000] printing eip:
[ 196.348000] c032c436
[ 196.348000] *pde = 00000000
[ 196.349000] Oops: 0000 [#1]
[ 196.349000] 4K_STACKS PREEMPT
[ 196.349000] last sysfs file: /class/net/eth1/address
[ 196.349000] Modules linked in: thermal sony_acpi processor fan button battery ac nfnetlink i8k floppy nvram orinoco_cs orinoco hermes pcmcia firmware_class ohci1394 ieee1394 intel_agp agpgart iTCO_wdt yenta_socket rsrc_nonstatic pcmcia_core rtc
[ 196.349000] CPU: 0
[ 196.349000] EIP: 0060:[<c032c436>] Not tainted VLI
[ 196.349000] EFLAGS: 00010246 (2.6.18-rc3-mm2 #4)
[ 196.349000] EIP is at rt6_lookup+0x47/0x83
[ 196.349000] eax: 00000000 ebx: 00000000 ecx: 00000005 edx: 00000000
[ 196.349000] esi: e8b25c98 edi: e8b25c20 ebp: e8b25c78 esp: e8b25c20
[ 196.349000] ds: 007b es: 007b ss: 0068
[ 196.349000] Process ip (pid: 2511, ti=e8b25000 task=effb0aa0 task.ti=e8b25000)
[ 196.349000] Stack: 00000005 00000000 000080fe 00000000 00000000 00000000 00000000 00000000
[ 196.349000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 196.349000] 00000000 00000000 00000000 00000008 eb6e98c8 e8b25ca8 e8b25cb4 c0327c04
[ 196.349000] Call Trace:
[ 196.349000] [<c0327c04>] ipv6_del_addr+0x2ef/0x3a7
[ 196.349000] [<c0327d3f>] inet6_addr_del+0x83/0xbb
[ 196.349000] [<c0327dd6>] inet6_rtm_deladdr+0x5f/0x6b
[ 196.349000] [<c02da097>] rtnetlink_rcv_msg+0x1b3/0x1d6
[ 196.349000] [<c02e011c>] netlink_run_queue+0x5a/0xc6
[ 196.349000] [<c02d9e9d>] rtnetlink_rcv+0x29/0x42
[ 196.349000] [<c02e0576>] netlink_data_ready+0x12/0x49
[ 196.349000] [<c02df518>] netlink_sendskb+0x1c/0x4d
[ 196.349000] [<c02dfea0>] netlink_unicast+0x1c4/0x1d0
[ 196.349000] [<c02e0557>] netlink_sendmsg+0x274/0x281
[ 196.349000] [<c02ca57e>] sock_sendmsg+0xeb/0x106
[ 196.349000] [<c02cad99>] sys_sendto+0xbe/0xdc
[ 196.349000] [<c02cb522>] sys_socketcall+0xfb/0x186
[ 196.349000] [<c0102849>] sysenter_past_esp+0x56/0x79
[ 196.349000] DWARF2 unwinder stuck at sysenter_past_esp+0x56/0x79
[ 196.349000] Leftover inexact backtrace:
[ 196.349000] [<c01036c7>] show_stack_log_lvl+0x8c/0x97
[ 196.349000] [<c010381f>] show_registers+0x14d/0x1de
[ 196.349000] [<c0103a5b>] die+0x1ab/0x26d
[ 196.349000] [<c0352205>] do_page_fault+0x3f8/0x4c5
[ 196.349000] [<c0351271>] error_code+0x39/0x40
[ 196.349000] [<c0327c04>] ipv6_del_addr+0x2ef/0x3a7
[ 196.349000] [<c0327d3f>] inet6_addr_del+0x83/0xbb
[ 196.349000] [<c0327dd6>] inet6_rtm_deladdr+0x5f/0x6b
[ 196.349000] [<c02da097>] rtnetlink_rcv_msg+0x1b3/0x1d6
[ 196.349000] [<c02e011c>] netlink_run_queue+0x5a/0xc6
[ 196.349000] [<c02d9e9d>] rtnetlink_rcv+0x29/0x42
[ 196.349000] [<c02e0576>] netlink_data_ready+0x12/0x49
[ 196.349000] [<c02df518>] netlink_sendskb+0x1c/0x4d
[ 196.349000] [<c02dfea0>] netlink_unicast+0x1c4/0x1d0
[ 196.349000] [<c02e0557>] netlink_sendmsg+0x274/0x281
[ 196.349000] [<c02ca57e>] sock_sendmsg+0xeb/0x106
[ 196.349000] [<c02cad99>] sys_sendto+0xbe/0xdc
[ 196.349000] [<c02cb522>] sys_socketcall+0xfb/0x186
[ 196.349000] [<c0102849>] sysenter_past_esp+0x56/0x79
[ 196.349000] Code: eb ff 89 5d a8 8d 45 b0 b9 10 00 00 00 89 f2 e8 c9 e0 eb ff 31 d2 83 7d 08 00 0f 95 c2 b9 ad cc 32 c0 89 f8 e8 47 7c 01 00 89 c3 <66> 83 7b 14 00 74 2d 8b 43 04 85 c0 7f 21 68 c4 19 37 c0 68 99
[ 196.349000] EIP: [<c032c436>] rt6_lookup+0x47/0x83 SS:ESP 0068:e8b25c20

The unlucky 'ip' process then gets a SIGSEGV and dies while holding a lock
of some sort, so later 'ip' processes get hung in 'D' state.
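
The faulting address can be cross-checked against the "Code:" line. A small Python sketch that decodes only the marked instruction (the bytes after <66>), assuming the simple register-plus-8-bit-displacement form; it does not handle SIB bytes or negative displacements, and the helper name is made up:

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_modrm(byte):
    """Split an x86 ModRM byte into its (mod, reg, rm) fields."""
    return (byte >> 6) & 0x3, (byte >> 3) & 0x7, byte & 0x7


# "<66> 83 7b 14 00": operand-size prefix, opcode 0x83, ModRM, disp8, imm8
code = bytes.fromhex("66837b1400")
mod, reg, rm = decode_modrm(code[2])
disp8 = code[3]

# Register values taken from the oops above.
regs = {"eax": 0x00000000, "ebx": 0x00000000}

# mod == 1 means "register + 8-bit displacement"; opcode 0x83 with reg == 7 is CMP.
base = REG32[rm]
fault_address = regs[base] + disp8
print("cmpw $0x%x,0x%x(%%%s) -> address %08x" % (code[4], disp8, base, fault_address))
```

The base register is ebx, which is 0 in the register dump, and the displacement is 0x14, which agrees with "NULL pointer dereference at virtual address 00000014": rt6_lookup() was handed a NULL pointer and read a 16-bit field at offset 0x14 from it.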

Checking the lkml and netdev archives didn't find any useful hits for
'ipv6_del_addr'...



2006-08-11 04:20:33

by David Miller

Subject: Re: 2.6.18-rc3-mm2 - BUG in rt6_lookup() from ipv6_del_addr()

From: [email protected]
Date: Thu, 10 Aug 2006 22:15:26 -0400

> On Sun, 06 Aug 2006 03:08:09 PDT, Andrew Morton said:
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>
> After applying the patch that Patrick McHardy pointed me at, it lived
> longer. However, I'm now seeing problems at system shutdown (or anytime
> you try to 'ifdown ethX' where ethX has an IPv6 address attached to it):

This is cured by yet another fix already in the net-2.6.19
tree:

From 7a3a5e6b0e6847749c756cbe4bf554eda063a577 Mon Sep 17 00:00:00 2001
From: Ville Nuorvala <[email protected]>
Date: Tue, 8 Aug 2006 16:44:17 -0700
Subject: [PATCH] [IPV6]: Make sure fib6_rule_lookup doesn't return NULL

The callers of fib6_rule_lookup don't expect it to return NULL,
therefore it must return ip6_null_entry whenever fib_rule_lookup fails.

Signed-off-by: Ville Nuorvala <[email protected]>
Signed-off-by: David S. Miller <[email protected]>
---
net/ipv6/fib6_rules.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/ipv6/fib6_rules.c b/net/ipv6/fib6_rules.c
index bf9bba8..22a2fdb 100644
--- a/net/ipv6/fib6_rules.c
+++ b/net/ipv6/fib6_rules.c
@@ -63,7 +63,11 @@ struct dst_entry *fib6_rule_lookup(struc
 	if (arg.rule)
 		fib_rule_put(arg.rule);
 
-	return (struct dst_entry *) arg.result;
+	if (arg.result)
+		return (struct dst_entry *) arg.result;
+
+	dst_hold(&ip6_null_entry.u.dst);
+	return &ip6_null_entry.u.dst;
 }
 
 static int fib6_rule_action(struct fib_rule *rule, struct flowi *flp,
--
1.4.2.rc2.g3e042
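
The idea of the patch, returning a held reference to a shared "null" entry instead of NULL so callers never have to check, can be modelled in a few lines. This is a Python illustration only; the Route class, its refcnt field and rule_lookup() are invented for the sketch and are not kernel interfaces:

```python
class Route:
    """Toy stand-in for a refcounted dst entry."""

    def __init__(self, name):
        self.name = name
        self.refcnt = 0

    def hold(self):
        """Take a reference, like dst_hold()."""
        self.refcnt += 1
        return self


NULL_ENTRY = Route("ip6_null_entry")


def rule_lookup(result):
    """Return the lookup result, or a held NULL_ENTRY when there is none."""
    if result is not None:
        return result
    return NULL_ENTRY.hold()


eth3_route = Route("eth3")
print(rule_lookup(eth3_route).name)
print(rule_lookup(None).name)
```

Callers such as the rt6_lookup() path in the oops above can then dereference the result unconditionally, at the cost of one extra reference on the shared entry per failed lookup.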


2006-08-11 06:17:28

by Andrew Morton

Subject: Re: 2.6.18-rc3-mm2 - ext3 locking issue?

On Thu, 10 Aug 2006 13:44:33 -0400
[email protected] wrote:

> On Thu, 10 Aug 2006 19:33:13 +0200, Mattia Dongili said:
>
> > oooh, same setup and same trace here, but no yum, see some screenshots
> > here:
> > http://oioio.altervista.org/linux/dsc03448.jpg
> > http://oioio.altervista.org/linux/dsc03449.jpg
>
> Not quite the same trace - the first few lines are the same, but your call to
> __lock_page() comes in via do_generic_mapping_read(), while Jiri and I are
> seeing the call to __lock_page() coming from truncate_inode_pages_range()....
>

The suspend+resume->hang bug is known and reputedly fixed.

The stuck-in-lock_page-without-having-done-resume bug is not known.
Someone please try the deadline scheduler, or AS.

2006-08-11 06:26:29

by Mike Galbraith

Subject: Re: 2.6.18-rc3-mm2 - OOM storm

On Thu, 2006-08-10 at 02:19 -0700, Andrew Morton wrote:
> On Thu, 10 Aug 2006 11:04:36 +0200
> Laurent Riffard <[email protected]> wrote:
>
> > On 06.08.2006 12:08, Andrew Morton wrote:
> > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> >
> > Hello,
> >
> > On my system, a cron job runs every day to check the integrity of
> > installed RPMs. It runs "rpm -V" on each package, which computes an
> > MD5 hash for each installed file and compares this result, the file
> > size and modification time with the values stored in the RPM database.
> >
> > This is the workload. Since 2.6.18-rc3-mm2, this process eats
> > all the memory and triggers OOM.
> >
> > On my system, "free -t" output normally looks like this ("cached" value
> > is about half of RAM):
> > # free -t
> > total used free shared buffers cached
> > Mem: 515032 508512 6520 0 22992 256032
> > -/+ buffers/cache: 229488 285544
> > Swap: 1116428 324 1116104
> > Total: 1631460 508836 1122624
> >
> > After the rpm database check, "free -t" says:
> > total used free shared buffers cached
> > Mem: 515032 507124 7908 0 8132 398296
> > -/+ buffers/cache: 100696 414336
> > Swap: 1116428 34896 1081532
> > Total: 1631460 542020 1089440
> >
> > And the value of "cached" won't decrease.
> >
>
> Yes, I was just trying to reproduce this. No luck so far. Will try your
> .config tomorrow.
>
> It would be interesting to try disabling CONFIG_ADAPTIVE_READAHEAD -
> perhaps that got broken.

I get no oom-killer action, but as soon as memory gets tight, I get
something even more effective. rpm -qaV reliably emits the below.

kernel BUG at mm/vmscan.c:383!
invalid opcode: 0000 [#1]
4K_STACKS PREEMPT SMP
last sysfs file: /devices/pci0000:00/0000:00:01.0/0000:01:00.0/resource
Modules linked in: xt_pkttype ipt_LOG xt_limit snd_pcm_oss snd_mixer_oss snd_seq_midi snd_seq_midi_event eeprom snd_seq edd ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat ip_nat iptable_filter ip6table_mangle ip_conntrack nfnetlink ip_tables ip6table_filter ip6_tables x_tables nls_iso8859_1 nls_cp437 nls_utf8 saa7134_dvb mt352 video_buf_dvb nxt200x tda1004x tuner snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer sd_mod saa7134 bt878 i2c_i801 snd_page_alloc prism54 ir_kbd_i2c bttv video_buf ir_common ohci1394 snd_mpu401 snd_mpu401_uart btcx_risc tveeprom ieee1394 snd_rawmidi snd_seq_device snd soundcore
CPU: 1
EIP: 0060:[<c105a166>] Not tainted VLI
EFLAGS: 00210203 (2.6.18-rc3-mm2-smp #162)
EIP is at remove_mapping+0xa3/0xbf
eax: 80008009 ebx: c1e48200 ecx: c14ad9c0 edx: c1e48200
esi: c1e48200 edi: c14ad9c0 ebp: dffb7e14 esp: dffb7e08
ds: 007b es: 007b ss: 0068
Process kswapd0 (pid: 196, ti=dffb7000 task=dffb9a90 task.ti=dffb7000)
Stack: c1e48200 c1e48218 c14ad9c0 dffb7f28 c105a818 dffb7f18 00000000 dffb7f08
dffb7f10 c14ad680 c14ad690 dffb7f84 c14ad100 00000020 00000000 00000000
00000020 00000000 00000000 c14ad9c0 00000000 00000020 00000000 00000001
Call Trace:
[<c105a818>] shrink_inactive_list+0x696/0x8dc
[<c105aaf0>] shrink_zone+0x92/0xe5
[<c105b125>] kswapd+0x300/0x40e
[<c10361d6>] kthread+0xe4/0xe8
[<c1001005>] kernel_thread_helper+0x5/0xb
DWARF2 unwinder stuck at kernel_thread_helper+0x5/0xb
Leftover inexact backtrace:
[<c1003f83>] show_stack_log_lvl+0xa6/0xcb
[<c1004180>] show_registers+0x1d8/0x286
[<c100437f>] die+0x151/0x333
[<c10045d9>] do_trap+0x78/0xa3
[<c1004f16>] do_invalid_op+0x97/0xa1
[<c13e0369>] error_code+0x39/0x40
[<c105a818>] shrink_inactive_list+0x696/0x8dc
[<c105aaf0>] shrink_zone+0x92/0xe5
[<c105b125>] kswapd+0x300/0x40e
[<c10361d6>] kthread+0xe4/0xe8
[<c1001005>] kernel_thread_helper+0x5/0xb
Code: f0 e8 46 88 ff ff 89 f8 e8 ba 5d 38 00 f0 ff 4e 04 b8 01 00 00 00 5b 5e 5f 5d c3 89 f8 e8 a5 5d 38 00 31 c0 eb d4 8b 56 0c eb 8d <0f> 0b 7f 01 6f 75 42 c1 89 f6 e9 6b ff ff ff 0f 0b 7e 01 6f 75
EIP: [<c105a166>] remove_mapping+0xa3/0xbf SS:ESP 0068:dffb7e08
<0>------------[ cut here ]------------
kernel BUG at mm/vmscan.c:383!
invalid opcode: 0000 [#2]
4K_STACKS PREEMPT SMP
last sysfs file: /devices/pci0000:00/0000:00:01.0/0000:01:00.0/resource
Modules linked in: xt_pkttype ipt_LOG xt_limit snd_pcm_oss snd_mixer_oss snd_seq_midi snd_seq_midi_event eeprom snd_seq edd ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat ip_nat iptable_filter ip6table_mangle ip_conntrack nfnetlink ip_tables ip6table_filter ip6_tables x_tables nls_iso8859_1 nls_cp437 nls_utf8 saa7134_dvb mt352 video_buf_dvb nxt200x tda1004x tuner snd_intel8x0 snd_ac97_codec snd_ac97_bus snd_pcm snd_timer sd_mod saa7134 bt878 i2c_i801 snd_page_alloc prism54 ir_kbd_i2c bttv video_buf ir_common ohci1394 snd_mpu401 snd_mpu401_uart btcx_risc tveeprom ieee1394 snd_rawmidi snd_seq_device snd soundcore
CPU: 0
EIP: 0060:[<c105a166>] Not tainted VLI
EFLAGS: 00210203 (2.6.18-rc3-mm2-smp #162)
EIP is at remove_mapping+0xa3/0xbf
eax: 80008009 ebx: c1e784a0 ecx: c14ad9c0 edx: c1e784a0
esi: c1e784a0 edi: c14ad9c0 ebp: dfda3ba0 esp: dfda3b94
ds: 007b es: 007b ss: 0068
Process rpm (pid: 6150, ti=dfda3000 task=dffca560 task.ti=dfda3000)
Stack: c1e784a0 c1e784b8 c14ad9c0 dfda3cb4 c105a818 dfda3ca4 00000008 dfda3c94
dfda3c9c c14ad680 c14ad690 dfda3cf4 c14ad100 00000020 00000000 00000000
00000020 00000000 00000000 c14ad9c0 00000000 00000020 00000000 00000001
Call Trace:
[<c105a818>] shrink_inactive_list+0x696/0x8dc
[<c105aaf0>] shrink_zone+0x92/0xe5
[<c105b68b>] try_to_free_pages+0x157/0x254
[<c1055c9b>] __alloc_pages+0x155/0x2b4
[<c1057595>] __do_page_cache_readahead+0x120/0x2a3
[<c1057806>] ra_dispatch+0xee/0x100
[<c1057d83>] page_cache_readahead_adaptive+0x3f4/0xb77
[<c105349e>] filemap_nopage+0x41d/0x4ad
[<c105e80d>] __handle_mm_fault+0x12e/0x8fb
[<c101966a>] do_page_fault+0xdc/0x51f
[<c13e0369>] error_code+0x39/0x40
[<b7bc89cf>] 0xb7bc89cf
[<c1003f83>] show_stack_log_lvl+0xa6/0xcb
[<c1004180>] show_registers+0x1d8/0x286
[<c100437f>] die+0x151/0x333
[<c10045d9>] do_trap+0x78/0xa3
[<c1004f16>] do_invalid_op+0x97/0xa1
[<c13e0369>] error_code+0x39/0x40
[<c105a818>] shrink_inactive_list+0x696/0x8dc
[<c105aaf0>] shrink_zone+0x92/0xe5
[<c105b68b>] try_to_free_pages+0x157/0x254
[<c1055c9b>] __alloc_pages+0x155/0x2b4
[<c1057595>] __do_page_cache_readahead+0x120/0x2a3
[<c1057806>] ra_dispatch+0xee/0x100
[<c1057d83>] page_cache_readahead_adaptive+0x3f4/0xb77
[<c105349e>] filemap_nopage+0x41d/0x4ad
[<c105e80d>] __handle_mm_fault+0x12e/0x8fb
[<c101966a>] do_page_fault+0xdc/0x51f
[<c13e0369>] error_code+0x39/0x40
Code: f0 e8 46 88 ff ff 89 f8 e8 ba 5d 38 00 f0 ff 4e 04 b8 01 00 00 00 5b 5e 5f 5d c3 89 f8 e8 a5 5d 38 00 31 c0 eb d4 8b 56 0c eb 8d <0f> 0b 7f 01 6f 75 42 c1 89 f6 e9 6b ff ff ff 0f 0b 7e 01 6f 75
EIP: [<c105a166>] remove_mapping+0xa3/0xbf SS:ESP 0068:dfda3b94



2006-08-11 06:55:38

by Andrew Morton

Subject: Re: 2.6.18-rc3-mm2 - OOM storm

On Fri, 11 Aug 2006 08:33:51 +0000
Mike Galbraith <[email protected]> wrote:

> kernel BUG at mm/vmscan.c:383!

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/hot-fixes/ ;)

2006-08-11 07:30:15

by Mike Galbraith

Subject: Re: 2.6.18-rc3-mm2 - OOM storm

On Thu, 2006-08-10 at 23:55 -0700, Andrew Morton wrote:
> On Fri, 11 Aug 2006 08:33:51 +0000
> Mike Galbraith <[email protected]> wrote:
>
> > kernel BUG at mm/vmscan.c:383!
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/hot-fixes/ ;)
>

Duh, I should have thought to look there first. Sorry.

Anyhoo, I can reproduce the problem. I now have ~800MB of cache that
echo 3 > drop_caches doesn't help with, and I just started swapping.

MemTotal: 1032656 kB
MemFree: 42704 kB
Buffers: 648 kB
Cached: 825468 kB
SwapCached: 29312 kB
Active: 31196 kB
Inactive: 830144 kB
HighTotal: 131008 kB
HighFree: 3056 kB
LowTotal: 901648 kB
LowFree: 39648 kB
SwapTotal: 1028152 kB
SwapFree: 961356 kB
Dirty: 156 kB
Writeback: 0 kB
AnonPages: 17240 kB
Mapped: 10240 kB
Slab: 118536 kB
PageTables: 1876 kB
NFS Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 1544480 kB
Committed_AS: 266884 kB
VmallocTotal: 114680 kB
VmallocUsed: 5372 kB
VmallocChunk: 109216 kB


2006-08-11 12:30:24

by Laurent Riffard

Subject: Re: 2.6.18-rc3-mm2 - OOM storm


[root@antares test4]# date
ven aoû 11 10:50:10 CEST 2006

[root@antares test4]# cat /proc/meminfo
MemTotal: 515032 kB
MemFree: 7756 kB
Buffers: 1348 kB
Cached: 376276 kB
SwapCached: 14852 kB
Active: 38408 kB
Inactive: 374000 kB
SwapTotal: 1116428 kB
SwapFree: 967164 kB
Dirty: 64 kB
Writeback: 0 kB
AnonPages: 30164 kB
Mapped: 7028 kB
Slab: 84120 kB
PageTables: 1956 kB
NFS Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 1373944 kB
Committed_AS: 493124 kB
VmallocTotal: 515796 kB
VmallocUsed: 6860 kB
VmallocChunk: 508788 kB

[root@antares test4]# echo 3 > /proc/sys/vm/drop_caches

[root@antares test4]# cat /proc/meminfo
MemTotal: 515032 kB
MemFree: 6000 kB
Buffers: 1336 kB
Cached: 380164 kB
SwapCached: 12680 kB
Active: 40532 kB
Inactive: 373024 kB
SwapTotal: 1116428 kB
SwapFree: 964720 kB
Dirty: 12 kB
Writeback: 0 kB
AnonPages: 31600 kB
Mapped: 7676 kB
Slab: 84016 kB
PageTables: 1960 kB
NFS Unstable: 0 kB
Bounce: 0 kB
CommitLimit: 1373944 kB
Committed_AS: 493132 kB
VmallocTotal: 515796 kB
VmallocUsed: 6860 kB
VmallocChunk: 508788 kB

[root@antares test4]# date
ven aoû 11 10:50:56 CEST 2006

[root@antares test4]# vmstat 3
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 7 162140 5848 1572 390100 85 127 2205 213 418 514 9 8 22 61
0 4 162084 16916 1428 380376 481 0 5819 8 490 388 1 5 0 95
0 2 162084 5292 1588 386112 2135 0 4054 53 524 494 1 2 0 98
0 1 162832 11388 1372 375764 1424 497 1963 540 462 395 1 3 0 96



Attachments:
vmstat.log (34.41 kB)
typescript (1.87 kB)

2006-08-11 13:30:36

by Valdis Klētnieks

Subject: Re: 2.6.18-rc3-mm2 - ext3 locking issue?

On Thu, 10 Aug 2006 23:17:18 PDT, Andrew Morton said:

> The suspend+resume->hang bug is known and reputedly fixed.
>
> The stuck-in-lock_page-without-having-done-resume bug is not known.
> Someone please try the deadline scheduler, or AS.

echo anticipatory >| /sys/block/sda/queue/scheduler
yum -C update yum-updatesd

And yum still hung with the exact same backtrace, so it's not a CFQ bug.
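
To confirm which elevator is actually in effect after such a switch, the same sysfs file can be read back; the active scheduler is the one shown in square brackets. A minimal Python sketch of parsing that format (the function names are made up, and the sample string is an illustration rather than output captured from this machine):

```python
import re


def active_scheduler(contents):
    """Return the bracketed (active) scheduler name, or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", contents)
    return match.group(1) if match else None


def available_schedulers(contents):
    """Return every scheduler name listed, active or not."""
    return contents.replace("[", "").replace("]", "").split()


# Illustrative contents of /sys/block/sda/queue/scheduler after the echo above.
sample = "noop [anticipatory] deadline cfq\n"
print(active_scheduler(sample))
print(available_schedulers(sample))
```

Checking the file before rerunning the test rules out the case where the write silently failed and CFQ was still active.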



2006-08-11 18:11:56

by Mark Haverkamp

Subject: Re: 2.6.18-rc3-mm2

On Sun, 2006-08-06 at 03:08 -0700, Andrew Morton wrote:
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/

I am seeing a problem loading modules at boot time. My initrd tries to
load scsi_mod and percpu_modalloc prints this:

Could not allocate 16 bytes percpu data

This is a 2 processor x86_64 machine. I have attached the output from
the serial console and the config file.

It is related to the mm patches. I can boot OK from the main kernel
tree and the scsi trees.




--
Mark Haverkamp <[email protected]>


Attachments:
mm-insmod-failure.txt (12.96 kB)
config (62.64 kB)

2006-08-11 18:36:09

by Andrew Morton

Subject: Re: 2.6.18-rc3-mm2

On Fri, 11 Aug 2006 11:11:40 -0700
Mark Haverkamp <[email protected]> wrote:

> On Sun, 2006-08-06 at 03:08 -0700, Andrew Morton wrote:
> > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
>
> I am seeing a problem loading modules at boot time. My initrd tries to
> load scsi_mod and percpu_modalloc prints this:
>
> Could not allocate 16 bytes percpu data
>
> This is a 2 processor x86_64 machine. I have attached the output from
> the serial console and the config file.
>
> It is related to the mm patches. I can boot OK from the main kernel
> tree and the scsi trees.

Yeah, sorry - this is almost certainly due to the increase in NR_IRQS. It
made this, in include/linux/kernel_stat.h

DECLARE_PER_CPU(struct kernel_stat, kstat);

really big and we consume all the per-cpu memory.


NR_IRQS is (sometimes) calculated from NR_CPUS via complex means. Reducing
your NR_CPUS should fix things up.

2006-08-11 19:43:12

by Mike Galbraith

Subject: Re: 2.6.18-rc3-mm2 - OOM storm

On Fri, 2006-08-11 at 14:31 +0200, Laurent Riffard wrote:
> >> Also, are you able to determine whether the problem is specific to `rpm
> >> -V'? Are you able to make the leak trigger using other filesystem
> >> workloads?
> >
> > Will try...
>
> No luck. For example, "find /usr -type f -print0 | xargs -0 cat > /dev/null"
> does not trigger the problem.

I spent some time looking over what I thought was the obvious candidate,
but alas, no cigar. Not surprising since Andrew can't reproduce it.

> # mount
> /dev/mapper/vglinux1-lvroot on / type ext3 (rw)
> /dev/mapper/vglinux1-lvusr on /usr type reiserfs (ro)
> /dev/mapper/vglinux1-lvvar on /var type ext3 (rw)

Mine is the plainest ext3 config imaginable.

> >> If it's specific to `rpm -V' then perhaps direct-io is somehow causing
> >> pagecache leakage. That would be a bit odd.

It seems odd at the moment.

-Mike

2006-08-11 20:31:23

by Mark Haverkamp

Subject: Re: 2.6.18-rc3-mm2

On Fri, 2006-08-11 at 11:36 -0700, Andrew Morton wrote:
> On Fri, 11 Aug 2006 11:11:40 -0700
> Mark Haverkamp <[email protected]> wrote:
>
> > On Sun, 2006-08-06 at 03:08 -0700, Andrew Morton wrote:
> > > ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.18-rc3/2.6.18-rc3-mm2/
> >
> > I am seeing a problem loading modules at boot time. My initrd tries to
> > load scsi_mod and percpu_modalloc prints this:
> >
> > Could not allocate 16 bytes percpu data
> >
> > This is a 2 processor x86_64 machine. I have attached the output from
> > the serial console and the config file.
> >
> > It is related to the mm patches. I can boot OK from the main kernel
> > tree and the scsi trees.
>
> Yeah, sorry - this is almost certainly due to the increase in NR_IRQS. It
> made this, in include/linux/kernel_stat.h
>
> DECLARE_PER_CPU(struct kernel_stat, kstat);
>
> really big and we consume all the per-cpu memory.
>
>
> NR_IRQS is (sometimes) calculated from NR_CPUS via complex means. Reducing
> your NR_CPUS should fix things up.

It helps. I set NR_CPUS to 8 and got past that problem. Now I can't
get the root to mount.

Here is some output. I had to copy it from the VGA since this doesn't
show up on the serial output.

Creating root device
Mounting root filesystem
mount: error 6 mounting ext3
Switching to new root
ERROR opening /dev/console!!!!:2
error dup2'ing fd of 0 to 0
error dup2'ing fd of 0 to 1
error dup2'ing fd of 0 to 2
umounting old /proc
unmounting old /sys
Switchroot: mount failed: 22
Kernel Panic ....


--
Mark Haverkamp <[email protected]>


Attachments:
mm-mount-root.txt (28.05 kB)

2006-08-11 22:38:23

by Laurent Riffard

Subject: Re: 2.6.18-rc3-mm2 - ext3 locking issue?

On 10.08.2006 13:40, Jiri Slaby wrote:
> [email protected] wrote:
>> On Wed, 09 Aug 2006 16:43:20 EDT, [email protected] said:
>>
>>>> Usually this means that there's an IO request in flight and it got lost
>>>> somewhere. Device driver bug, IO scheduler bug, etc. Conceivably a
>>>> lost interrupt (hardware bug, PCI setup bug, etc).
>>
>>> Aug 9 14:30:24 turing-police kernel: [ 3535.720000] end_request: I/O error, dev fd0, sector 0
>>
>> Red herring. yum just wedged again, this time with no reference to
>> floppy drive.
>> Same traceback. Anybody have anything to suggest before I start playing
>> hunt-the-wumpus with a -mm bisection?
>
> Hmm, I have exactly the same problem...
> yum + CFQ + BLK_DEV_PIIX + nothing odd in dmesg
>
> [ 3438.574864] yum D 00000000 0 21659 3838 (NOTLB)
> [ 3438.575098] e5c09d24 00000001 c180f5a8 00000000 e5c09ce0 c01683e8 fe37c0bc 000002c4
> [ 3438.575388] 00001000 00000001 c18fbbd0 0023001f 00000007 f26cc560 c1913560 fe4166d5
> [ 3438.575713] 000002c4 0009a619 00000001 f26cc66c c180ec40 c04ff140 e5c09d14 c01fad44
> [ 3438.576039] Call Trace:
> [ 3438.576113] [<c0373d3b>] io_schedule+0x26/0x30
> [ 3438.576187] [<c014653c>] sync_page+0x39/0x45
> [ 3438.576260] [<c0374401>] __wait_on_bit_lock+0x41/0x64
> [ 3438.576333] [<c01464ef>] __lock_page+0x57/0x5f
> [ 3438.576405] [<c014f5f2>] truncate_inode_pages_range+0x1b6/0x304
> [ 3438.576480] [<c014f76f>] truncate_inode_pages+0x2f/0x40
> [ 3438.576553] [<c01a7bc4>] ext3_delete_inode+0x29/0xf7
> [ 3438.576627] [<c017f26b>] generic_delete_inode+0x65/0xe7
> [ 3438.576701] [<c017f3aa>] generic_drop_inode+0xbd/0x173
> [ 3438.576774] [<c017ed25>] iput+0x6b/0x7b
> [ 3438.576846] [<c017cc57>] dentry_iput+0x68/0xb3
> [ 3438.576919] [<c017d99e>] dput+0x4f/0x19f
> [ 3438.576990] [<c0176164>] sys_renameat+0x1e0/0x212
> [ 3438.577063] [<c01761be>] sys_rename+0x28/0x2a
> [ 3438.577135] [<c01030fb>] syscall_call+0x7/0xb
>
> regards,

Same problem here, with urpmi:

urpmi D CAC9EAA0 6112 29146 30655 (NOTLB)
c813dda0 c0291e70 00000001 cac9eaa0 fe7d7800 000008ad 00000000 cac9ebac
c813ddd4 c1404e00 07efea00 00000005 c813ddd4 c813ddd4 c1404e00 c813dda8
c028fb6b c813ddb0 c01390ed c813ddc8 c02902ea c01390b7 c813ddd4 00000000
Call Trace:
[<c028fb6b>] io_schedule+0xe/0x16
[<c01390ed>] sync_page+0x36/0x3a
[<c02902ea>] __wait_on_bit_lock+0x30/0x58
[<c01390a3>] __lock_page+0x51/0x59
[<c01403f4>] truncate_inode_pages_range+0x1f8/0x24a
[<c0140452>] truncate_inode_pages+0xc/0x12
[<c018a22a>] ext3_delete_inode+0x16/0xc0
[<c0168e93>] generic_delete_inode+0x61/0xcf
[<c0168f13>] generic_drop_inode+0x12/0x13e
[<c0168994>] iput+0x67/0x6a
[<c0166bab>] dentry_iput+0x7c/0x97
[<c016792d>] dput+0x152/0x16b
[<c0160fe5>] sys_renameat+0x17a/0x1dd
[<c016105a>] sys_rename+0x12/0x14
[<c0102c39>] sysenter_past_esp+0x56/0x8d

This is 2.6.18-rc3-mm2 + 6 hot-fixes. CFQ scheduler. The RPM DB is located on /var which is an ext3 FS.

I think I broke my RPM db :-(.
~~
laurent

2006-08-11 22:58:45

by Andrew Morton

Subject: Re: 2.6.18-rc3-mm2

On Fri, 11 Aug 2006 13:31:03 -0700
Mark Haverkamp <[email protected]> wrote:

> > NR_IRQS is (sometimes) calculated from NR_CPUS via complex means. Reducing
> > your NR_CPUS should fix things up.
>
> It helps. I set NR_CPUS to 8 and got past that problem. Now I can't
> get the root to mount.
>
> Here is some output. I had to copy it from the VGA since this doesn't
> show up on the serial output.
>
> Creating root device
> Mounting root filesystem
> mount: error 6 mounting ext3
> Switching to new root
> ERROR opening /dev/console!!!!:2
> error dup2'ing fd of 0 to 0
> error dup2'ing fd of 0 to 1
> error dup2'ing fd of 0 to 2
> umounting old /proc
> unmounting old /sys
> Switchroot: mount failed: 22
> Kernel Panic ....

Looks like early userspace got ENXIO when trying to mount the root fs.

Don't know, sorry. What distro is this running?

It might be useful to diff this kernel's boot log with 2.6.18-rc4's, see if
we can spot the problem that way.

2006-08-12 12:59:31

by Mike Galbraith

Subject: [patch] Re: 2.6.18-rc3-mm2 - OOM storm

On Thu, 2006-08-10 at 02:19 -0700, Andrew Morton wrote:

> It would be interesting to try disabling CONFIG_ADAPTIVE_READAHEAD -
> perhaps that got broken.

A typo was pinning pagecache. Fixes leak encountered with rpm -qaV.

Signed-off-by: Mike Galbraith <[email protected]>

--- linux-2.6.18-rc3-mm2/mm/filemap.c.org	2006-08-12 14:04:14.000000000 +0000
+++ linux-2.6.18-rc3-mm2/mm/filemap.c	2006-08-12 14:07:53.000000000 +0000
@@ -1498,7 +1498,7 @@ retry_find:
 			page_cache_readahead_adaptive(mapping, ra,
 						file, NULL, NULL,
 						pgoff, pgoff, pgoff + 1);
-			page = find_lock_page(mapping, pgoff);
+			page = find_get_page(mapping, pgoff);
 		} else if (PageReadahead(page)) {
 			page_cache_readahead_adaptive(mapping, ra,
 						file, NULL, page,


2006-08-12 21:25:10

by Laurent Riffard

Subject: Re: [patch] Re: 2.6.18-rc3-mm2 - OOM storm


On 12.08.2006 17:07, Mike Galbraith wrote:
> On Thu, 2006-08-10 at 02:19 -0700, Andrew Morton wrote:
>
>> It would be interesting to try disabling CONFIG_ADAPTIVE_READAHEAD -
>> perhaps that got broken.
>
> A typo was pinning pagecache. Fixes leak encountered with rpm -qaV.

Problem fixed here too. Thanks

> Signed-off-by: Mike Galbraith <[email protected]>
>
> --- linux-2.6.18-rc3-mm2/mm/filemap.c.org	2006-08-12 14:04:14.000000000 +0000
> +++ linux-2.6.18-rc3-mm2/mm/filemap.c	2006-08-12 14:07:53.000000000 +0000
> @@ -1498,7 +1498,7 @@ retry_find:
>  			page_cache_readahead_adaptive(mapping, ra,
>  						file, NULL, NULL,
>  						pgoff, pgoff, pgoff + 1);
> -			page = find_lock_page(mapping, pgoff);
> +			page = find_get_page(mapping, pgoff);
>  		} else if (PageReadahead(page)) {
>  			page_cache_readahead_adaptive(mapping, ra,
>  						file, NULL, page,
>
>

--
laurent

2006-08-15 23:38:44

by Valdis Klētnieks

Subject: Re: 2.6.18-rc3-mm2 - ext3 locking issue?

On Wed, 09 Aug 2006 15:06:35 EDT, [email protected] said:
> Yum managed to get wedged: 'echo t > /proc/sysrq-trigger' says:
>
> [ 4514.840000] yum D D5C32AA0 0 4747 4430 (NOTLB)
> [ 4514.840000] d5c3dda4 d5c3dd78 00000007 d5c32aa0 bd3ddd00 00000338 00000000 d5c32bc0
> [ 4514.840000] c1601628 d5c3dd9c 64600300 0000001f d5c3ddd8 d5c3ddd8 c1601628 d5c3ddac
> [ 4514.840000] c034fef8 d5c3ddb4 c0136e8e d5c3ddcc c0350026 c0136e58 d5c3ddd8 00000000
> [ 4514.840000] Call Trace:
> [ 4514.840000] [<c034fef8>] io_schedule+0x25/0x44
> [ 4514.840000] [<c0136e8e>] sync_page+0x36/0x3a
> [ 4514.840000] [<c0350026>] __wait_on_bit_lock+0x30/0x58
> [ 4514.840000] [<c0136e44>] __lock_page+0x51/0x59
> [ 4514.840000] [<c013f099>] truncate_inode_pages_range+0x1de/0x230
> [ 4514.840000] [<c013f0f7>] truncate_inode_pages+0xc/0x11
> [ 4514.840000] [<c018ea12>] ext3_delete_inode+0x16/0xbd
> [ 4514.840000] [<c016798f>] generic_delete_inode+0xb6/0x130
> [ 4514.840000] [<c0167a1b>] generic_drop_inode+0x12/0x166
> [ 4514.840000] [<c01673f1>] iput+0x67/0x6a
> [ 4514.840000] [<c0165662>] dentry_iput+0x97/0xcc
> [ 4514.840000] [<c016613d>] dput+0x183/0x19c
> [ 4514.840000] [<c015f64f>] sys_renameat+0x17a/0x1d3
> [ 4514.840000] [<c015f6ba>] sys_rename+0x12/0x14
> [ 4514.840000] [<c0102849>] sysenter_past_esp+0x56/0x79

Well, after a detour into hardware issues (a dying fan ended up escalating
into swapping a motherboard), I built 2.6.18-rc4-mm1 - unable to replicate
the 'yum' hang on that. Somehow, I'm not feeling very motivated to do a
bisect of -rc3-mm2 to find it, unless somebody thinks we should track it down
just in case it's just in hiding....



2006-08-23 17:02:39

by Mark Haverkamp

Subject: Re: 2.6.18-rc3-mm2

On Fri, 2006-08-11 at 15:58 -0700, Andrew Morton wrote:
> On Fri, 11 Aug 2006 13:31:03 -0700
> Mark Haverkamp <[email protected]> wrote:
>
> > > NR_IRQS is (sometimes) calculated from NR_CPUS via complex means. Reducing
> > > your NR_CPUS should fix things up.
> >
> > It helps. I set NR_CPUS to 8 and got past that problem. Now I can't
> > get the root to mount.
> >
> > Here is some output. I had to copy it from the VGA since this doesn't
> > show up on the serial output.
> >
> > Creating root device
> > Mounting root filesystem
> > mount: error 6 mounting ext3
> > Switching to new root
> > ERROR opening /dev/console!!!!:2
> > error dup2'ing fd of 0 to 0
> > error dup2'ing fd of 0 to 1
> > error dup2'ing fd of 0 to 2
> > umounting old /proc
> > unmounting old /sys
> > Switchroot: mount failed: 22
> > Kernel Panic ....
>
> Looks like early userspace got ENXIO when trying to mount the root fs.
>
> Don't know, sorry. What distro is this running?
>
> It might be useful to diff this kernel's boot log with 2.6.18-rc4's, see if
> we can spot the problem that way.

Sorry for taking so long to respond. When I got back to it this week I
noticed that there was a new mm kernel patch. I updated to it and now I
can boot my system OK.

Mark.


--
Mark Haverkamp <[email protected]>