2009-03-12 05:33:25

by Gene Heskett

[permalink] [raw]
Subject: I just got got another Oops

Greetings all;

This is the 2nd or 3rd one of these I've seen.

This one may have a better set of clues as it occured about 1 second
after amanda started the backup run:
==============================
Mar 12 01:15:01 coyote xinetd[2583]: START: amanda pid=8970 from=::ffff:192.168.71.3
Mar 12 01:16:21 coyote kernel: [10353.910914] BUG: unable to handle kernel NULL pointer dereference at 00000006
Mar 12 01:16:21 coyote kernel: [10353.910921] IP: [<c046520b>] get_page_from_freelist+0x24b/0x4c0
Mar 12 01:16:21 coyote kernel: [10353.910929] *pdpt = 0000000000b38001 *pde = 0000000000000000
Mar 12 01:16:21 coyote kernel: [10353.910934] Oops: 0000 [#1] PREEMPT SMP
Mar 12 01:16:21 coyote kernel: [10353.910937] last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/irq
Mar 12 01:16:21 coyote kernel: [10353.910940] Modules linked in: nls_utf8 cifs radeon drm nfsd lockd nfs_acl auth_rpcgss rfcomm
exportfs sco bridge stp llc bnep l2cap autofs4 sunrpc ipv6 or51132 cx88_dvb snd_emu10k1_synth videobuf_dvb snd_emux_synth
dvb_core snd_seq_virmidi snd_seq_midi_emul tuner_simple tuner_types snd_emu10k1 tda9887 snd_rawmidi tda8290 snd_ac97_codec
ac97_bus snd_seq_dummy snd_seq_oss tuner snd_seq_midi_event snd_seq snd_pcm_oss cx8800 cx8802 cx88xx snd_mixer_oss snd_pcm
snd_seq_device ir_common v4l2_common videodev i2c_algo_bit snd_timer snd_page_alloc tveeprom v4l1_compat snd_util_mem btcx_risc
snd_hwdep ftdi_sio videobuf_dma_sg usb_storage snd forcedethfirewire_ohci floppy videobuf_core btusb firewire_core usbserial
sr_mod usblp bluetooth cdrom sg soundcore i2c_nforce2 pcspkr crc_itu_t joydev i2c_core evdev button ahci pata_jmicron pata_amd
ata_generic pata_acpi sata_nv libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded:
scsi_wait_scan]
Mar 12 01:16:21 coyote kernel: [10353.910986]
Mar 12 01:16:21 coyote kernel: [10353.910989] Pid: 9222, comm: tar Not tainted (2.6.29-rc7 #4) System Product Name
Mar 12 01:16:21 coyote kernel: [10353.910992] EIP: 0060:[<c046520b>] EFLAGS: 00210202 CPU: 2
Mar 12 01:16:21 coyote kernel: [10353.910995] EIP is at get_page_from_freelist+0x24b/0x4c0
Mar 12 01:16:21 coyote kernel: [10353.910997] EAX: ffffffff EBX: 80004000 ECX: 00000001 EDX: 00000002
Mar 12 01:16:21 coyote kernel: [10353.910999] ESI: c28fc260 EDI: 00000000 EBP: c0b35d5c ESP: c0b35cfc
Mar 12 01:16:21 coyote kernel: [10353.911001] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Mar 12 01:16:21 coyote kernel: [10353.911004] Process tar (pid: 9222, ti=c0b35000 task=f2817520 task.ti=c0b35000)
Mar 12 01:16:21 coyote kernel: [10353.911005] Stack:
Mar 12 01:16:21 coyote kernel: [10353.911006] 00000002 00000044 00000000 00000000 00000000 c0744b80 c06d6480 00000002
Mar 12 01:16:21 coyote kernel: [10353.911011] 00000000 00000000 001201d2 00000002 00200246 00000001 c06d6900 00000100
Mar 12 01:16:21 coyote kernel: [10353.911015] 00000000 c0b35dac c06d7484 c06d6480 c06d6480 c06d6480 f2817520 00002f66
Mar 12 01:16:21 coyote kernel: [10353.911020] Call Trace:
Mar 12 01:16:21 coyote kernel: [10353.911022] [<c04655fe>] ? __alloc_pages_internal+0xae/0x430
Mar 12 01:16:21 coyote kernel: [10353.911028] [<f82b5c60>] ? ext3_readpages+0x0/0x20 [ext3]
Mar 12 01:16:21 coyote kernel: [10353.911040] [<c04676b4>] ? __do_page_cache_readahead+0xe4/0x1e0
Mar 12 01:16:21 coyote kernel: [10353.911044] [<c0467a9b>] ? ondemand_readahead+0x15b/0x180
Mar 12 01:16:21 coyote kernel: [10353.911047] [<c0467b38>] ? page_cache_async_readahead+0x78/0x90
Mar 12 01:16:21 coyote kernel: [10353.911050] [<c0461434>] ? generic_file_aio_read+0x314/0x670
Mar 12 01:16:21 coyote kernel: [10353.911058] [<c048a011>] ? do_sync_read+0xd1/0x110
Mar 12 01:16:21 coyote kernel: [10353.911061] [<c04133a3>] ? lapic_next_event+0x13/0x20
Mar 12 01:16:21 coyote kernel: [10353.911065] [<c043c3c0>] ? autoremove_wake_function+0x0/0x50
Mar 12 01:16:21 coyote kernel: [10353.911069] [<c0447c4b>] ? tick_program_event+0x2b/0x40
Mar 12 01:16:21 coyote kernel: [10353.911072] [<c0440086>] ? hrtimer_interrupt+0xd6/0x220
Mar 12 01:16:21 coyote kernel: [10353.911075] [<c048aac9>] ? vfs_read+0x99/0x140
Mar 12 01:16:21 coyote kernel: [10353.911077] [<c0458e20>] ? audit_syscall_exit+0x1e0/0x3e0
Mar 12 01:16:21 coyote kernel: [10353.911081] [<c0489f40>] ? do_sync_read+0x0/0x110
Mar 12 01:16:21 coyote kernel: [10353.911083] [<c048ac2d>] ? sys_read+0x3d/0x70
Mar 12 01:16:21 coyote kernel: [10353.911085] [<c040336d>] ? sysenter_do_call+0x12/0x21
Mar 12 01:16:21 coyote kernel: [10353.911088] Code: ff 75 d0 9d 89 e0 25 00 f0 ff ff 83 68 14 01 f6 40 08 08 0f 85 30 02 00 00
8b 1e 89 f2 8b 46 08 8b 7e 10 f6 c7 40 74 03 8b 56 0c <8b> 4a 04 31 d2 85 ff 0f 95 c2 83 c0 01 09 c2 31 c0 85 c9 0f 95
Mar 12 01:16:21 coyote kernel: [10353.911112] EIP: [<c046520b>] get_page_from_freelist+0x24b/0x4c0 SS:ESP 0068:c0b35cfc
Mar 12 01:16:21 coyote kernel: [10353.911117] ---[ end trace 8d9559cb0f5d9d82 ]---
========================

And it was accompanied by some sort of an alert mechanism in F10 that said
'KDE Write Daemon' in a series of 22 stacked vertically boxes, and duplicated
many but not all, of the above messages. They weren't clipboardable. :(

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
It's difficult to see the picture when you are inside the frame.


2009-03-12 06:29:05

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: I just got got another Oops

On Thu, 12 Mar 2009 01:33:11 -0400
Gene Heskett <[email protected]> wrote:

> Greetings all;
>
> This is the 2nd or 3rd one of these I've seen.
>
Could you show me .config and disassemble of get_page_from_freelist()?

Thanks,
-Kame

> This one may have a better set of clues as it occured about 1 second
> after amanda started the backup run:
> ==============================
> Mar 12 01:15:01 coyote xinetd[2583]: START: amanda pid=8970 from=::ffff:192.168.71.3
> Mar 12 01:16:21 coyote kernel: [10353.910914] BUG: unable to handle kernel NULL pointer dereference at 00000006
> Mar 12 01:16:21 coyote kernel: [10353.910921] IP: [<c046520b>] get_page_from_freelist+0x24b/0x4c0
> Mar 12 01:16:21 coyote kernel: [10353.910929] *pdpt = 0000000000b38001 *pde = 0000000000000000
> Mar 12 01:16:21 coyote kernel: [10353.910934] Oops: 0000 [#1] PREEMPT SMP
> Mar 12 01:16:21 coyote kernel: [10353.910937] last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/irq
> Mar 12 01:16:21 coyote kernel: [10353.910940] Modules linked in: nls_utf8 cifs radeon drm nfsd lockd nfs_acl auth_rpcgss rfcomm
> exportfs sco bridge stp llc bnep l2cap autofs4 sunrpc ipv6 or51132 cx88_dvb snd_emu10k1_synth videobuf_dvb snd_emux_synth
> dvb_core snd_seq_virmidi snd_seq_midi_emul tuner_simple tuner_types snd_emu10k1 tda9887 snd_rawmidi tda8290 snd_ac97_codec
> ac97_bus snd_seq_dummy snd_seq_oss tuner snd_seq_midi_event snd_seq snd_pcm_oss cx8800 cx8802 cx88xx snd_mixer_oss snd_pcm
> snd_seq_device ir_common v4l2_common videodev i2c_algo_bit snd_timer snd_page_alloc tveeprom v4l1_compat snd_util_mem btcx_risc
> snd_hwdep ftdi_sio videobuf_dma_sg usb_storage snd forcedethfirewire_ohci floppy videobuf_core btusb firewire_core usbserial
> sr_mod usblp bluetooth cdrom sg soundcore i2c_nforce2 pcspkr crc_itu_t joydev i2c_core evdev button ahci pata_jmicron pata_amd
> ata_generic pata_acpi sata_nv libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded:
> scsi_wait_scan]
> Mar 12 01:16:21 coyote kernel: [10353.910986]
> Mar 12 01:16:21 coyote kernel: [10353.910989] Pid: 9222, comm: tar Not tainted (2.6.29-rc7 #4) System Product Name
> Mar 12 01:16:21 coyote kernel: [10353.910992] EIP: 0060:[<c046520b>] EFLAGS: 00210202 CPU: 2
> Mar 12 01:16:21 coyote kernel: [10353.910995] EIP is at get_page_from_freelist+0x24b/0x4c0
> Mar 12 01:16:21 coyote kernel: [10353.910997] EAX: ffffffff EBX: 80004000 ECX: 00000001 EDX: 00000002
> Mar 12 01:16:21 coyote kernel: [10353.910999] ESI: c28fc260 EDI: 00000000 EBP: c0b35d5c ESP: c0b35cfc
> Mar 12 01:16:21 coyote kernel: [10353.911001] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Mar 12 01:16:21 coyote kernel: [10353.911004] Process tar (pid: 9222, ti=c0b35000 task=f2817520 task.ti=c0b35000)
> Mar 12 01:16:21 coyote kernel: [10353.911005] Stack:
> Mar 12 01:16:21 coyote kernel: [10353.911006] 00000002 00000044 00000000 00000000 00000000 c0744b80 c06d6480 00000002
> Mar 12 01:16:21 coyote kernel: [10353.911011] 00000000 00000000 001201d2 00000002 00200246 00000001 c06d6900 00000100
> Mar 12 01:16:21 coyote kernel: [10353.911015] 00000000 c0b35dac c06d7484 c06d6480 c06d6480 c06d6480 f2817520 00002f66
> Mar 12 01:16:21 coyote kernel: [10353.911020] Call Trace:
> Mar 12 01:16:21 coyote kernel: [10353.911022] [<c04655fe>] ? __alloc_pages_internal+0xae/0x430
> Mar 12 01:16:21 coyote kernel: [10353.911028] [<f82b5c60>] ? ext3_readpages+0x0/0x20 [ext3]
> Mar 12 01:16:21 coyote kernel: [10353.911040] [<c04676b4>] ? __do_page_cache_readahead+0xe4/0x1e0
> Mar 12 01:16:21 coyote kernel: [10353.911044] [<c0467a9b>] ? ondemand_readahead+0x15b/0x180
> Mar 12 01:16:21 coyote kernel: [10353.911047] [<c0467b38>] ? page_cache_async_readahead+0x78/0x90
> Mar 12 01:16:21 coyote kernel: [10353.911050] [<c0461434>] ? generic_file_aio_read+0x314/0x670
> Mar 12 01:16:21 coyote kernel: [10353.911058] [<c048a011>] ? do_sync_read+0xd1/0x110
> Mar 12 01:16:21 coyote kernel: [10353.911061] [<c04133a3>] ? lapic_next_event+0x13/0x20
> Mar 12 01:16:21 coyote kernel: [10353.911065] [<c043c3c0>] ? autoremove_wake_function+0x0/0x50
> Mar 12 01:16:21 coyote kernel: [10353.911069] [<c0447c4b>] ? tick_program_event+0x2b/0x40
> Mar 12 01:16:21 coyote kernel: [10353.911072] [<c0440086>] ? hrtimer_interrupt+0xd6/0x220
> Mar 12 01:16:21 coyote kernel: [10353.911075] [<c048aac9>] ? vfs_read+0x99/0x140
> Mar 12 01:16:21 coyote kernel: [10353.911077] [<c0458e20>] ? audit_syscall_exit+0x1e0/0x3e0
> Mar 12 01:16:21 coyote kernel: [10353.911081] [<c0489f40>] ? do_sync_read+0x0/0x110
> Mar 12 01:16:21 coyote kernel: [10353.911083] [<c048ac2d>] ? sys_read+0x3d/0x70
> Mar 12 01:16:21 coyote kernel: [10353.911085] [<c040336d>] ? sysenter_do_call+0x12/0x21
> Mar 12 01:16:21 coyote kernel: [10353.911088] Code: ff 75 d0 9d 89 e0 25 00 f0 ff ff 83 68 14 01 f6 40 08 08 0f 85 30 02 00 00
> 8b 1e 89 f2 8b 46 08 8b 7e 10 f6 c7 40 74 03 8b 56 0c <8b> 4a 04 31 d2 85 ff 0f 95 c2 83 c0 01 09 c2 31 c0 85 c9 0f 95
> Mar 12 01:16:21 coyote kernel: [10353.911112] EIP: [<c046520b>] get_page_from_freelist+0x24b/0x4c0 SS:ESP 0068:c0b35cfc
> Mar 12 01:16:21 coyote kernel: [10353.911117] ---[ end trace 8d9559cb0f5d9d82 ]---
> ========================
>
> And it was accompanied by some sort of an alert mechanism in F10 that said
> 'KDE Write Daemon' in a series of 22 stacked vertically boxes, and duplicated
> many but not all, of the above messages. They weren't clipboardable. :(
>
> --
> Cheers, Gene
> "There are four boxes to be used in defense of liberty:
> soap, ballot, jury, and ammo. Please use in that order."
> -Ed Howdershelt (Author)
> It's difficult to see the picture when you are inside the frame.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2009-03-12 08:36:50

by David Newall

[permalink] [raw]
Subject: Re: I just got got another Oops

Gene Heskett wrote:
> Mar 12 01:15:01 coyote xinetd[2583]: START: amanda pid=8970 from=::ffff:192.168.71.3
> Mar 12 01:16:21 coyote kernel: [10353.910914] BUG: unable to handle kernel NULL pointer dereference at 00000006
> Mar 12 01:16:21 coyote kernel: [10353.910921] IP: [<c046520b>] get_page_from_freelist+0x24b/0x4c0
> Mar 12 01:16:21 coyote kernel: [10353.910929] *pdpt = 0000000000b38001 *pde = 0000000000000000
> Mar 12 01:16:21 coyote kernel: [10353.910934] Oops: 0000 [#1] PREEMPT SMP

I think there must be a zone at 0x0. Try adding

if (!zone) goto try_next_zone;


in mm/page_alloc.c, (near line 1469), before

if (NUMA_BUILD && zlc_active &&

2009-03-12 17:29:19

by Gene Heskett

[permalink] [raw]
Subject: Re: I just got got another Oops

On Thursday 12 March 2009, David Newall wrote:
>Gene Heskett wrote:
>> Mar 12 01:15:01 coyote xinetd[2583]: START: amanda pid=8970
>> from=::ffff:192.168.71.3 Mar 12 01:16:21 coyote kernel: [10353.910914]
>> BUG: unable to handle kernel NULL pointer dereference at 00000006 Mar 12
>> 01:16:21 coyote kernel: [10353.910921] IP: [<c046520b>]
>> get_page_from_freelist+0x24b/0x4c0 Mar 12 01:16:21 coyote kernel:
>> [10353.910929] *pdpt = 0000000000b38001 *pde = 0000000000000000 Mar 12
>> 01:16:21 coyote kernel: [10353.910934] Oops: 0000 [#1] PREEMPT SMP
>
>I think there must be a zone at 0x0. Try adding
>
> if (!zone) goto try_next_zone;
>
>
>in mm/page_alloc.c, (near line 1469), before
>
> if (NUMA_BUILD && zlc_active &&

This line is at #1417 in my mm/page_alloc.c, (in 2.6.29-rc7) but I have added
it, and a rebuild is in progress. I'll post again if I see it again. The
machine is still up, and functioning apparently only semi-normally ATM. I
don't know it its related, but setroubleshoot's daemon is also exiting
frequently. And can't be restarted with the 'service' utility. Off to reboot
next.

Thanks David.

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
You're dead, Jim.
-- McCoy, "Amok Time", stardate 3372.7

2009-03-12 18:32:44

by Gene Heskett

[permalink] [raw]
Subject: Re: I just got got another Oops

On Thursday 12 March 2009, David Newall wrote:
>Gene Heskett wrote:
>> Mar 12 01:15:01 coyote xinetd[2583]: START: amanda pid=8970
>> from=::ffff:192.168.71.3 Mar 12 01:16:21 coyote kernel: [10353.910914]
>> BUG: unable to handle kernel NULL pointer dereference at 00000006 Mar 12
>> 01:16:21 coyote kernel: [10353.910921] IP: [<c046520b>]
>> get_page_from_freelist+0x24b/0x4c0 Mar 12 01:16:21 coyote kernel:
>> [10353.910929] *pdpt = 0000000000b38001 *pde = 0000000000000000 Mar 12
>> 01:16:21 coyote kernel: [10353.910934] Oops: 0000 [#1] PREEMPT SMP
>
>I think there must be a zone at 0x0. Try adding
>
> if (!zone) goto try_next_zone;
>
>
>in mm/page_alloc.c, (near line 1469), before
>
> if (NUMA_BUILD && zlc_active &&
I don't believe that was the fix, David. I just got another Oops
And this time word wrap is turned off before I paste from the log:

I thought it was the last time, sigh...

==============================
Mar 12 14:15:02 coyote kernel: [ 2656.832576] BUG: unable to handle kernel NULL pointer dereference at 00000006
Mar 12 14:15:02 coyote kernel: [ 2656.832583] IP: [<c046520b>] get_page_from_freelist+0x24b/0x4c0
Mar 12 14:15:02 coyote kernel: [ 2656.832613] *pdpt = 000000003216c001 *pde = 0000000000000000
Mar 12 14:15:02 coyote kernel: [ 2656.832618] Oops: 0000 [#1] PREEMPT SMP
Mar 12 14:15:02 coyote kernel: [ 2656.832620] last sysfs file: /sys/devices/pci0000:00/0000:00:09.0/irq
Mar 12 14:15:02 coyote kernel: [ 2656.832623] Modules linked in: radeon drm nls_utf8 cifs nfsd lockd nfs_acl auth_rpcgss
exportfs rfcomm sco bridge stp llc bnep l2cap autofs4 sunrpc ipv6 snd_emu10k1_synth or51132 snd_emux_synth cx88_dvb
videobuf_dvb snd_seq_virmidi dvb_core snd_seq_midi_emul tuner_simple snd_emu10k1 tuner_types snd_rawmidi snd_ac97_codec
ac97_bus tda9887 snd_seq_dummy tda8290 snd_seq_oss snd_seq_midi_event snd_seq tuner snd_pcm_oss cx8800 snd_mixer_oss cx8802
cx88xx snd_pcm snd_seq_device ir_common v4l2_common snd_timer i2c_algo_bit videodev snd_page_alloc snd_util_mem v4l1_compat
tveeprom snd_hwdep btcx_risc snd firewire_ohci videobuf_dma_sg ftdi_sio usb_storage floppy videobuf_core firewire_core
forcedeth sr_mod btusb cdrom sg usbserial i2c_nforce2 soundcore crc_itu_t usblp pcspkr joydev bluetooth i2c_core evdev button
ahci pata_jmicron pata_amd ata_generic pata_acpi sata_nv libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
[last unloaded: drm]
Mar 12 14:15:02 coyote kernel: [ 2656.832669]
Mar 12 14:15:02 coyote kernel: [ 2656.832672] Pid: 18877, comm: kmail Not tainted (2.6.29-rc7 #5) System Product Name
Mar 12 14:15:02 coyote kernel: [ 2656.832675] EIP: 0060:[<c046520b>] EFLAGS: 00210202 CPU: 0
Mar 12 14:15:02 coyote kernel: [ 2656.832678] EIP is at get_page_from_freelist+0x24b/0x4c0
Mar 12 14:15:02 coyote kernel: [ 2656.832680] EAX: ffffffff EBX: 80004000 ECX: 00000001 EDX: 00000002
Mar 12 14:15:02 coyote kernel: [ 2656.832682] ESI: c28fc260 EDI: 00000000 EBP: f2168d5c ESP: f2168cfc
Mar 12 14:15:02 coyote kernel: [ 2656.832684] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Mar 12 14:15:02 coyote kernel: [ 2656.832686] Process kmail (pid: 18877, ti=f2168000 task=f22018b0 task.ti=f2168000)
Mar 12 14:15:02 coyote kernel: [ 2656.832688] Stack:
Mar 12 14:15:02 coyote kernel: [ 2656.832689] 00000002 00000044 c28fc060 00000000 f1463ca4 c0744b80 c06d6480 00000002
Mar 12 14:15:02 coyote kernel: [ 2656.832693] 00000000 00000000 001201d2 00000002 00200246 00000001 c06d6900 00000100
Mar 12 14:15:02 coyote kernel: [ 2656.832698] 00000000 80000000 c06d7484 c06d6480 c06d6480 c06d6480 f22018b0 00000129
Mar 12 14:15:02 coyote kernel: [ 2656.832708] Call Trace:
Mar 12 14:15:02 coyote kernel: [ 2656.832710] [<c04655fe>] ? __alloc_pages_internal+0xae/0x430
Mar 12 14:15:02 coyote kernel: [ 2656.832713] [<c0472503>] ? __do_fault+0x163/0x410
Mar 12 14:15:02 coyote kernel: [ 2656.832717] [<c04676b4>] ? __do_page_cache_readahead+0xe4/0x1e0
Mar 12 14:15:02 coyote kernel: [ 2656.832721] [<c0467a9b>] ? ondemand_readahead+0x15b/0x180
Mar 12 14:15:02 coyote kernel: [ 2656.832724] [<c0467b38>] ? page_cache_async_readahead+0x78/0x90
Mar 12 14:15:02 coyote kernel: [ 2656.832727] [<c0461434>] ? generic_file_aio_read+0x314/0x670
Mar 12 14:15:02 coyote kernel: [ 2656.832733] [<c048a011>] ? do_sync_read+0xd1/0x110
Mar 12 14:15:02 coyote kernel: [ 2656.832735] [<c0478f80>] ? vma_merge+0x190/0x270
Mar 12 14:15:02 coyote kernel: [ 2656.832738] [<c043c3c0>] ? autoremove_wake_function+0x0/0x50
Mar 12 14:15:02 coyote kernel: [ 2656.832742] [<c04039a7>] ? common_interrupt+0x27/0x2c
Mar 12 14:15:02 coyote kernel: [ 2656.832745] [<c048aac9>] ? vfs_read+0x99/0x140
Mar 12 14:15:02 coyote kernel: [ 2656.832747] [<c0489f40>] ? do_sync_read+0x0/0x110
Mar 12 14:15:02 coyote kernel: [ 2656.832749] [<c048ac2d>] ? sys_read+0x3d/0x70
Mar 12 14:15:02 coyote kernel: [ 2656.832752] [<c040336d>] ? sysenter_do_call+0x12/0x21
Mar 12 14:15:02 coyote kernel: [ 2656.832754] Code: ff 75 d0 9d 89 e0 25 00 f0 ff ff 83 68 14 01 f6 40 08 08 0f 85 30 02 00 00
8b 1e 89 f2 8b 46 08 8b 7e 10 f6 c7 40 74 03 8b 56 0c <8b> 4a 04 31 d2 85 ff 0f 95 c2 83 c0 01 09 c2 31 c0 85 c9 0f 95
Mar 12 14:15:02 coyote kernel: [ 2656.832781] EIP: [<c046520b>] get_page_from_freelist+0x24b/0x4c0 SS:ESP 0068:f2168cfc
Mar 12 14:15:02 coyote kernel: [ 2656.832789] ---[ end trace 4e78f421815113f4 ]---
================================

And this time whatever kde uses for an alert popped up another stack of them,
and then kmail went away so I had to kill its last process to get that screen
back & restart kmail.

Call me puzzled. Not sure if I'll answer, but... :)

David;
That patch you sent, should I take it back out & wait for a better one?

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
descramble code needed from software company

2009-03-12 18:49:25

by Gene Heskett

[permalink] [raw]
Subject: Re: I just got got another Oops

On Thursday 12 March 2009, David Newall wrote:
>Gene Heskett wrote:
>> Mar 12 01:15:01 coyote xinetd[2583]: START: amanda pid=8970
>> from=::ffff:192.168.71.3 Mar 12 01:16:21 coyote kernel: [10353.910914]
>> BUG: unable to handle kernel NULL pointer dereference at 00000006 Mar 12
>> 01:16:21 coyote kernel: [10353.910921] IP: [<c046520b>]
>> get_page_from_freelist+0x24b/0x4c0 Mar 12 01:16:21 coyote kernel:
>> [10353.910929] *pdpt = 0000000000b38001 *pde = 0000000000000000 Mar 12
>> 01:16:21 coyote kernel: [10353.910934] Oops: 0000 [#1] PREEMPT SMP
>
>I think there must be a zone at 0x0. Try adding
>
> if (!zone) goto try_next_zone;
>
>
>in mm/page_alloc.c, (near line 1469), before
>
> if (NUMA_BUILD && zlc_active &&

Rebooted to the above now, still scratching head though. Specifically,
setroubleshoot will not return anything but a null for any 'service' command.
In fact, none of the selinux stuff is running ANAICT. Did we break other
stuff?

Thanks David.

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
I think that I shall never see
A billboard lovely as a tree.
Indeed, unless the billboards fall
I'll never see a tree at all.
-- Ogden Nash

2009-03-12 19:37:41

by David Newall

[permalink] [raw]
Subject: Re: I just got got another Oops

Gene Heskett wrote:
> Rebooted to the above now, still scratching head though. Specifically,
> setroubleshoot will not return anything but a null for any 'service' command.
> In fact, none of the selinux stuff is running ANAICT. Did we break other
> stuff?
>

It's more that we haven't fixed anything, yet, just tested an
hypothesis. Except that looking deeply, I now see that that hypothesis
can't be right. So this is the point that I ask you to post your
.config and perhaps also as disassembly (objump -d) of mm/pagealloc.o

Sorry to send you off on a wild goose chase.

2009-03-12 23:28:42

by Gene Heskett

[permalink] [raw]
Subject: Re: I just got got another Oops

On Thursday 12 March 2009, David Newall wrote:
>Gene Heskett wrote:
>> Rebooted to the above now, still scratching head though. Specifically,
>> setroubleshoot will not return anything but a null for any 'service'
>> command. In fact, none of the selinux stuff is running ANAICT. Did we
>> break other stuff?
>
>It's more that we haven't fixed anything, yet, just tested an
>hypothesis. Except that looking deeply, I now see that that hypothesis
>can't be right. So this is the point that I ask you to post your
>.config and perhaps also as disassembly (objump -d) of mm/pagealloc.o
>
The .config was posted once before, but is attached again, also the dis of
page_alloc.o

>Sorry to send you off on a wild goose chase.

NP David, I'll go take that extra line back out & wait further instructions.
I also apologize for miss-spelling your name in another post, I see there are
two el's now.

Secondary question: I see by that .config, it appears selinux is out again as
its not mentioned. Or do I have yet another "make oldconfig" breakage?, which
won't be the first time...

Thanks David.

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
Delta: A real man lands where he wants to. -- David Letterman


Attachments:
(No filename) (1.29 kB)
.config (57.41 kB)
page_alloc.dis (291.08 kB)
Download all attachments

2009-03-13 04:55:27

by David Newall

[permalink] [raw]
Subject: Re: I just got got another Oops

gcc -g -Wp,-MD,mm/.page_alloc.o.d -nostdinc -isystem /usr/lib/gcc/i486-linux-gnu/4.1.3/include -Iinclude -I/home/davidn/NAS/linux.trees.git/arch/x86/include -include include/linux/autoconf.h -D__KERNEL__ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -m32 -msoft-float -mregparm=3 -freg-struct-return -mpreferred-stack-boundary=2 -march=k8 -mtune=generic -Wa,-mtune=generic32 -ffreestanding -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -fno-stack-protector -fno-omit-frame-pointer -fno-optimize-sibling-calls -Wdeclaration-after-statement -Wno-pointer-sign -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(page_alloc)" -D"KBUILD_MODNAME=KBUILD_STR(page_alloc)" -c -o mm/page_alloc.o mm/page_alloc.c

scripts/basic/fixdep mm/.page_alloc.o.d mm/page_alloc.o 'gcc etc' > mm/.page_alloc.o.tmp
rm -f mm/.page_alloc.o.d
mv -f mm/.page_alloc.o.tmp mm/.page_alloc.o.cmd


Attachments:
compile_page_alloc (993.00 B)

2009-03-16 02:56:44

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: I just got got another Oops

On Thu, 12 Mar 2009 14:31:49 -0400
Gene Heskett <[email protected]> wrote:

> Mar 12 14:15:02 coyote kernel: [ 2656.832669]
> Mar 12 14:15:02 coyote kernel: [ 2656.832672] Pid: 18877, comm: kmail Not tainted (2.6.29-rc7 #5) System Product Name
> Mar 12 14:15:02 coyote kernel: [ 2656.832675] EIP: 0060:[<c046520b>] EFLAGS: 00210202 CPU: 0
> Mar 12 14:15:02 coyote kernel: [ 2656.832678] EIP is at get_page_from_freelist+0x24b/0x4c0
> Mar 12 14:15:02 coyote kernel: [ 2656.832680] EAX: ffffffff EBX: 80004000 ECX: 00000001 EDX: 00000002
> Mar 12 14:15:02 coyote kernel: [ 2656.832682] ESI: c28fc260 EDI: 00000000 EBP: f2168d5c ESP: f2168cfc
> Mar 12 14:15:02 coyote kernel: [ 2656.832684] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Mar 12 14:15:02 coyote kernel: [ 2656.832686] Process kmail (pid: 18877, ti=f2168000 task=f22018b0 task.ti=f2168000)
> Mar 12 14:15:02 coyote kernel: [ 2656.832688] Stack:
> Mar 12 14:15:02 coyote kernel: [ 2656.832689] 00000002 00000044 c28fc060 00000000 f1463ca4 c0744b80 c06d6480 00000002
> Mar 12 14:15:02 coyote kernel: [ 2656.832693] 00000000 00000000 001201d2 00000002 00200246 00000001 c06d6900 00000100
> Mar 12 14:15:02 coyote kernel: [ 2656.832698] 00000000 80000000 c06d7484 c06d6480 c06d6480 c06d6480 f22018b0 00000129

Added linux-mm to CC:

22a9: 8b 1e mov (%esi),%ebx #ebx=80004000 = page->flags
22ab: 89 f2 mov %esi,%edx #remember "page"
22ad: 8b 46 08 mov 0x8(%esi),%eax #esi+8=-1 page->mapcount
22b0: 8b 7e 10 mov 0x10(%esi),%edi #esi+16=0 page->mapping
22b3: f6 c7 40 test $0x40,%bh
22b6: 74 03 je 22bb <get_page_from_freelist+0x24b>
22b8: 8b 56 0c mov 0xc(%esi),%edx #page = page->first_page
22bb: 8b 4a 04 mov 0x4(%edx),%ecx #page->_count

Thank you for disassemble list, from above....

In prep_new_page()
610 static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
611 {
612 if (unlikely(page_mapcount(page) |
613 (page->mapping != NULL) |
614 (page_count(page) != 0) |
615 (page->flags & PAGE_FLAGS_CHECK_AT_PREP)))
616 bad_page(page);

page->mapping = NULL, (VALID)
page->mapcount = -1 (VALID)
page->count ==> NULL access because PageTail() is set, see below.
(Note: from .config, CONFIG_PAGEFLAGS_EXTENDED is set.)

==
288 static inline int page_count(struct page *page)
289 {
290 return atomic_read(&compound_head(page)->_count);
291 }

281 static inline struct page *compound_head(struct page *page)
282 {
283 if (unlikely(PageTail(page)))
284 return page->first_page;
285 return page;
286 }
==

PageTail() is true (this is invalid) and page->first_page contains obsolete data.
But, here, PG_tail should not be there...

Hmm ?

Regards,
-Kame





2009-03-16 03:24:34

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: I just got got another Oops

On Mon, 16 Mar 2009 11:55:09 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Thu, 12 Mar 2009 14:31:49 -0400
> Gene Heskett <[email protected]> wrote:
>
> > Mar 12 14:15:02 coyote kernel: [ 2656.832669]
> > Mar 12 14:15:02 coyote kernel: [ 2656.832672] Pid: 18877, comm: kmail Not tainted (2.6.29-rc7 #5) System Product Name
> > Mar 12 14:15:02 coyote kernel: [ 2656.832675] EIP: 0060:[<c046520b>] EFLAGS: 00210202 CPU: 0
> > Mar 12 14:15:02 coyote kernel: [ 2656.832678] EIP is at get_page_from_freelist+0x24b/0x4c0
> > Mar 12 14:15:02 coyote kernel: [ 2656.832680] EAX: ffffffff EBX: 80004000 ECX: 00000001 EDX: 00000002
> > Mar 12 14:15:02 coyote kernel: [ 2656.832682] ESI: c28fc260 EDI: 00000000 EBP: f2168d5c ESP: f2168cfc
> > Mar 12 14:15:02 coyote kernel: [ 2656.832684] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> > Mar 12 14:15:02 coyote kernel: [ 2656.832686] Process kmail (pid: 18877, ti=f2168000 task=f22018b0 task.ti=f2168000)
> > Mar 12 14:15:02 coyote kernel: [ 2656.832688] Stack:
> > Mar 12 14:15:02 coyote kernel: [ 2656.832689] 00000002 00000044 c28fc060 00000000 f1463ca4 c0744b80 c06d6480 00000002
> > Mar 12 14:15:02 coyote kernel: [ 2656.832693] 00000000 00000000 001201d2 00000002 00200246 00000001 c06d6900 00000100
> > Mar 12 14:15:02 coyote kernel: [ 2656.832698] 00000000 80000000 c06d7484 c06d6480 c06d6480 c06d6480 f22018b0 00000129
>
> Added linux-mm to CC:
>
> 22a9: 8b 1e mov (%esi),%ebx #ebx=80004000 = page->flags
> 22ab: 89 f2 mov %esi,%edx #remember "page"
> 22ad: 8b 46 08 mov 0x8(%esi),%eax #esi+8=-1 page->mapcount
> 22b0: 8b 7e 10 mov 0x10(%esi),%edi #esi+16=0 page->mapping
> 22b3: f6 c7 40 test $0x40,%bh
> 22b6: 74 03 je 22bb <get_page_from_freelist+0x24b>
> 22b8: 8b 56 0c mov 0xc(%esi),%edx #page = page->first_page
> 22bb: 8b 4a 04 mov 0x4(%edx),%ecx #page->_count
>
> Thank you for disassemble list, from above....
>
> In prep_new_page()
> 610 static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> 611 {
> 612 if (unlikely(page_mapcount(page) |
> 613 (page->mapping != NULL) |
> 614 (page_count(page) != 0) |
> 615 (page->flags & PAGE_FLAGS_CHECK_AT_PREP)))
> 616 bad_page(page);
>
> page->mapping = NULL, (VALID)
> page->mapcount = -1 (VALID)
> page->count ==> NULL access because PageTail() is set, see below.
> (Note: from .config, CONFIG_PAGEFLAGS_EXTENDED is set.)
>
> ==
> 288 static inline int page_count(struct page *page)
> 289 {
> 290 return atomic_read(&compound_head(page)->_count);
> 291 }
>
> 281 static inline struct page *compound_head(struct page *page)
> 282 {
> 283 if (unlikely(PageTail(page)))
> 284 return page->first_page;
> 285 return page;
> 286 }
> ==
>
> PageTail() is true (this is invalid) and page->first_page contains obsolete data.
> But, here, PG_tail should not be there...
>

Gene-san, could you set CONFIG_DEBUG_VM (and other debug option ?)
I think it can give us another view.

-Kame


2009-03-16 08:05:43

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: BUG?: PAGE_FLAGS_CHECK_AT_PREP seems to be cleared too early (Was Re: I just got got another Oops

Hi,
I'm sorry if I miss something..

>From this patch
==
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=79f4b7bf393e67bbffec807cc68caaefc72b82ee
==
#define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1)
...
@@ -468,16 +467,16 @@ static inline int free_pages_check(struct page *page)
(page_count(page) != 0) |
(page->flags & PAGE_FLAGS_CHECK_AT_FREE)))
....
+ if (PageReserved(page))
+ return 1;
+ if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
+ page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+ return 0;
}
==

PAGE_FLAGS_CHECK_AT_PREP is cleared by free_pages_check().

This means PG_head/PG_tail(PG_compound) flags are cleared here and Compound page
will never be freed in sane way.

Regards,
-Kame

2009-03-16 21:46:27

by Hugh Dickins

[permalink] [raw]
Subject: Re: BUG?: PAGE_FLAGS_CHECK_AT_PREP seems to be cleared too early (Was Re: I just got got another Oops

On Mon, 16 Mar 2009, KAMEZAWA Hiroyuki wrote:
> Hi,
> I'm sorry if I miss something..

I think it's me who missed something, and needs to say sorry.

>
> >From this patch
> ==
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=79f4b7bf393e67bbffec807cc68caaefc72b82ee
> ==
> #define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1)
> ...
> @@ -468,16 +467,16 @@ static inline int free_pages_check(struct page *page)
> (page_count(page) != 0) |
> (page->flags & PAGE_FLAGS_CHECK_AT_FREE)))
> ....
> + if (PageReserved(page))
> + return 1;
> + if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
> + page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> + return 0;
> }
> ==
>
> PAGE_FLAGS_CHECK_AT_PREP is cleared by free_pages_check().
>
> This means PG_head/PG_tail(PG_compound) flags are cleared here

Yes, well spotted. How embarrassing. I must have got confused
about when the checking occurred when freeing a compound page.

> and Compound page will never be freed in sane way.

But is that so? I'll admit I've not tried this out yet, but my
understanding is that the Compound page actually gets freed fine:
free_compound_page() should have passed the right order down, and this
PAGE_FLAGS_CHECK_AT_PREP clearing should remove the Head/Tail/Compound
flags - doesn't it all work out sanely, without any leaking?

What goes missing is all the destroy_compound_page() checks:
that's at present just dead code.

There's several things we could do about this.

1. We could regard destroy_compound_page() as legacy debugging code
from when compound pages were first introduced, and sanctify my error
by removing it. Obviously that's appealing to me, makes me look like
a prophet rather than idiot! That's not necessarily the right thing to
do, but might appeal also to those cutting overhead from page_alloc.c.

2. We could do the destroy_compound_page() stuff in free_compound_page()
before calling __free_pages_ok(), and add the Head/Tail/Compound flags
into PAGE_FLAGS_CHECK_AT_FREE. That seems a more natural ordering to
me, and would remove the PageCompound check from a hotter path; but
I've a suspicion there's a good reason why it was not done that way,
that I'm overlooking at this moment.

3. We can define a PAGE_FLAGS_CLEAR_AT_FREE which omits the Head/Tail/
Compound flags, and lets destroy_compound_page() be called as before
where it's currently intended.

What do you think? I suspect I'm going to have to spend tomorrow
worrying about something else entirely, and won't return here until
Wednesday.

But as regards the original "I just got got another Oops": my bug
that you point out here doesn't account for that, does it? It's
still a mystery, isn't it, how the PageTail bit came to be set at
that point?

But that Oops does demonstrate that it's a very bad idea to be using
the deceptive page_count() in those bad_page() checks: we need to be
checking page->_count directly.

And in looking at this, I notice something else to worry about:
that CONFIG_HUGETLBFS prep_compound_gigantic_page(), which seems
to exist for a more general case than "p = page + i" - what happens
when such a gigantic page is freed, and arrives at the various
"p = page + i" assumptions on the freeing path?

Hugh

2009-03-16 23:45:37

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: BUG?: PAGE_FLAGS_CHECK_AT_PREP seems to be cleared too early (Was Re: I just got got another Oops

On Mon, 16 Mar 2009 21:44:11 +0000 (GMT)
Hugh Dickins <[email protected]> wrote:

> On Mon, 16 Mar 2009, KAMEZAWA Hiroyuki wrote:
> > Hi,
> > I'm sorry if I miss something..
>
> I think it's me who missed something, and needs to say sorry.
>
> >
> > >From this patch
> > ==
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=79f4b7bf393e67bbffec807cc68caaefc72b82ee
> > ==
> > #define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1)
> > ...
> > @@ -468,16 +467,16 @@ static inline int free_pages_check(struct page *page)
> > (page_count(page) != 0) |
> > (page->flags & PAGE_FLAGS_CHECK_AT_FREE)))
> > ....
> > + if (PageReserved(page))
> > + return 1;
> > + if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
> > + page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > + return 0;
> > }
> > ==
> >
> > PAGE_FLAGS_CHECK_AT_PREP is cleared by free_pages_check().
> >
> > This means PG_head/PG_tail(PG_compound) flags are cleared here
>
> Yes, well spotted. How embarrassing. I must have got confused
> about when the checking occurred when freeing a compound page.
>
> > and Compound page will never be freed in sane way.
>
> But is that so? I'll admit I've not tried this out yet, but my
> understanding is that the Compound page actually gets freed fine:
> free_compound_page() should have passed the right order down, and this
> PAGE_FLAGS_CHECK_AT_PREP clearing should remove the Head/Tail/Compound
> flags - doesn't it all work out sanely, without any leaking?
>
I think it works sanely and pages are freed in valid way.
But bad_page() checking for compound pages (at destroy_compound_page())
is not done.


> What goes missing is all the destroy_compound_page() checks:
> that's at present just dead code.
>
> There's several things we could do about this.
>
> 1. We could regard destroy_compound_page() as legacy debugging code
> from when compound pages were first introduced, and sanctify my error
> by removing it. Obviously that's appealing to me, makes me look like
> a prophet rather than idiot! That's not necessarily the right thing to
> do, but might appeal also to those cutting overhead from page_alloc.c.
>
> 2. We could do the destroy_compound_page() stuff in free_compound_page()
> before calling __free_pages_ok(), and add the Head/Tail/Compound flags
> into PAGE_FLAGS_CHECK_AT_FREE. That seems a more natural ordering to
> me, and would remove the PageCompound check from a hotter path; but
> I've a suspicion there's a good reason why it was not done that way,
> that I'm overlooking at this moment.
>
> 3. We can define a PAGE_FLAGS_CLEAR_AT_FREE which omits the Head/Tail/
> Compound flags, and lets destroy_compound_page() be called as before
> where it's currently intended.
>
> What do you think? I suspect I'm going to have to spend tomorrow
> worrying about something else entirely, and won't return here until
> Wednesday.
>
I like "2".


> But as regards the original "I just got got another Oops": my bug
> that you point out here doesn't account for that, does it? It's
> still a mystery, isn't it, how the PageTail bit came to be set at
> that point?
>
I never find "who set it/where does it set". But page_alloc.c is an only
file which modifies PageTail bit and I'm the last modifier of it.
So, I'm intersted in this Oops.


> But that Oops does demonstrate that it's a very bad idea to be using
> the deceptive page_count() in those bad_page() checks: we need to be
> checking page->_count directly.
>
I think so.

> And in looking at this, I notice something else to worry about:
> that CONFIG_HUGETLBFS prep_compound_gigantic_page(), which seems
> to exist for a more general case than "p = page + i" - what happens
> when such a gigantic page is freed, and arrives at the various
> "p = page + i" assumptions on the freeing path?
>
Ah, I missed that path. I'll look into that today.

Thanks,
-Kame

2009-03-20 15:23:29

by Mel Gorman

[permalink] [raw]
Subject: Re: BUG?: PAGE_FLAGS_CHECK_AT_PREP seems to be cleared too early (Was Re: I just got got another Oops

On Mon, Mar 16, 2009 at 09:44:11PM +0000, Hugh Dickins wrote:
> On Mon, 16 Mar 2009, KAMEZAWA Hiroyuki wrote:
> > Hi,
> > I'm sorry if I miss something..
>
> I think it's me who missed something, and needs to say sorry.
>

Joining the party late as always.

> >
> > >From this patch
> > ==
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=79f4b7bf393e67bbffec807cc68caaefc72b82ee
> > ==
> > #define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1)
> > ...
> > @@ -468,16 +467,16 @@ static inline int free_pages_check(struct page *page)
> > (page_count(page) != 0) |
> > (page->flags & PAGE_FLAGS_CHECK_AT_FREE)))
> > ....
> > + if (PageReserved(page))
> > + return 1;
> > + if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
> > + page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > + return 0;
> > }
> > ==
> >
> > PAGE_FLAGS_CHECK_AT_PREP is cleared by free_pages_check().
> >
> > This means PG_head/PG_tail(PG_compound) flags are cleared here
>
> Yes, well spotted. How embarrassing. I must have got confused
> about when the checking occurred when freeing a compound page.
>

I noticed this actually during the page allocator work and concluded
it didn't matter because free_pages_check() cleared out the bits in
the same way destroy_compound_page() did. The big difference was that
destroy_compound_page() did a lot more sanity checks and was slower.

I accidentally fixed this (because I implemented what I though things
should be doing instead of what they were really doing) at one point and
the overhead was so high of the debugging check that I just made a note to
"deal with this later, it's weird looking but ok".

> > and Compound page will never be freed in sane way.
>
> But is that so? I'll admit I've not tried this out yet, but my
> understanding is that the Compound page actually gets freed fine:
> free_compound_page() should have passed the right order down, and this
> PAGE_FLAGS_CHECK_AT_PREP clearing should remove the Head/Tail/Compound
> flags - doesn't it all work out sanely, without any leaking?
>

That's more or less what I thought. It can't leak but it's not what you
expect from compound page destructors either.

> What goes missing is all the destroy_compound_page() checks:
> that's at present just dead code.
>
> There's several things we could do about this.
>
> 1. We could regard destroy_compound_page() as legacy debugging code
> from when compound pages were first introduced, and sanctify my error
> by removing it. Obviously that's appealing to me, makes me look like
> a prophet rather than idiot! That's not necessarily the right thing to
> do, but might appeal also to those cutting overhead from page_alloc.c.
>

The function is pretty heavy it has to be said. This would be my preferred
option rather than making the allocator go slower.

> 2. We could do the destroy_compound_page() stuff in free_compound_page()
> before calling __free_pages_ok(), and add the Head/Tail/Compound flags
> into PAGE_FLAGS_CHECK_AT_FREE. hat seems a more natural ordering to
> me, and would remove the PageCompound check from a hotter path; but
> I've a suspicion there's a good reason why it was not done that way,
> that I'm overlooking at this moment.
>

I made this change and dropped it on the grounds it slowed things up so
badly. It was part of allowing compound pages to be on the PCP lists.
and ended up looking something like

static void free_compound_page(struct page *page)
{
unsigned int order = compound_order(page);

VM_BUG_ON(!PageCompound(page));
if (unlikely(destroy_compound_page(page, order)))
return;

__free_pages_ok(page, order);
}

> 3. We can define a PAGE_FLAGS_CLEAR_AT_FREE which omits the Head/Tail/
> Compound flags, and lets destroy_compound_page() be called as before
> where it's currently intended.
>

Also did that, slowed things up. Tried fixing destroy_compound_page()
but it was doing the same work as free_pages_check() so it also sucked.

> What do you think? I suspect I'm going to have to spend tomorrow
> worrying about something else entirely, and won't return here until
> Wednesday.
>
> But as regards the original "I just got got another Oops": my bug
> that you point out here doesn't account for that, does it? It's
> still a mystery, isn't it, how the PageTail bit came to be set at
> that point?
>
> But that Oops does demonstrate that it's a very bad idea to be using
> the deceptive page_count() in those bad_page() checks: we need to be
> checking page->_count directly.
>
> And in looking at this, I notice something else to worry about:
> that CONFIG_HUGETLBFS prep_compound_gigantic_page(), which seems
> to exist for a more general case than "p = page + i" - what happens
> when such a gigantic page is freed, and arrives at the various
> "p = page + i" assumptions on the freeing path?
>

That function is a bit confusing I'll give you that. Glancing through,
what happens is that the destuctor gets replaced with a free_huge_page()
which throws the page onto those free lists instead. It never hits the
buddy lists on the grounds they can't handle orders >= MAX_ORDER.

Out of curiousity, here is a patch that was intended for a totally different
purpose but ended up forcing destroy_compound_page() to be used. It sucked
so I ended up unfixing it again. It can't be merged as-is obviously but
you'll see I redefined your flags a bit to exclude the compound flags
and all that jazz. It could be rebased of course but it'd make more sense
to have destroy_compound_page() that only does real work for DEBUG_VM as
free_pages_check() already does enough work.

====

>From 93f9b5ebae0000ae3e7985c98680226f4bdd90a8 Mon Sep 17 00:00:00 2001
From: Mel Gorman <[email protected]>
Date: Mon, 9 Mar 2009 11:56:56 +0000
Subject: [PATCH 32/34] Allow compound pages to be stored on the PCP lists

The SLUB allocator frees and allocates compound pages. The setup costs
for compound pages are noticeable in profiles and incur cache misses as
every struct page has to be checked and written. This patch allows
compound pages to be stored on the PCP list to save on teardown and
setup time.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/page-flags.h | 4 ++-
mm/page_alloc.c | 56 ++++++++++++++++++++++++++++++-------------
2 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 219a523..4177ec1 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -388,7 +388,9 @@ static inline void __ClearPageTail(struct page *page)
* Pages being prepped should not have any flags set. It they are set,
* there has been a kernel bug or struct page corruption.
*/
-#define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1)
+#define PAGE_FLAGS_CHECK_AT_PREP_BUDDY ((1 << NR_PAGEFLAGS) - 1)
+#define PAGE_FLAGS_CHECK_AT_PREP (((1 << NR_PAGEFLAGS) - 1) & \
+ ~(1 << PG_head | 1 << PG_tail))

#endif /* !__GENERATING_BOUNDS_H */
#endif /* PAGE_FLAGS_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 253fd98..2941638 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -280,11 +280,7 @@ out:
* put_page() function. Its ->lru.prev holds the order of allocation.
* This usage means that zero-order pages may not be compound.
*/
-
-static void free_compound_page(struct page *page)
-{
- __free_pages_ok(page, compound_order(page));
-}
+static void free_compound_page(struct page *page);

void prep_compound_page(struct page *page, unsigned long order)
{
@@ -553,7 +549,9 @@ static inline void __free_one_page(struct page *page,
zone->free_area[page_order(page)].nr_free++;
}

-static inline int free_pages_check(struct page *page)
+/* Sanity check a free pages flags */
+static inline int check_freepage_flags(struct page *page,
+ unsigned long prepflags)
{
if (unlikely(page_mapcount(page) |
(page->mapping != NULL) |
@@ -562,8 +560,8 @@ static inline int free_pages_check(struct page *page)
bad_page(page);
return 1;
}
- if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
- page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
+ if (page->flags & prepflags)
+ page->flags &= ~prepflags;
return 0;
}

@@ -602,6 +600,12 @@ static int free_pcppages_bulk(struct zone *zone, int count,
page = list_entry(list->prev, struct page, lru);
freed += 1 << page->index;
list_del(&page->lru);
+
+ /* SLUB can have compound pages to the free lists */
+ if (unlikely(PageCompound(page)))
+ if (unlikely(destroy_compound_page(page, page->index)))
+ continue;
+
__free_one_page(page, zone, page->index, migratetype);
}
spin_unlock(&zone->lock);
@@ -633,8 +637,10 @@ static void __free_pages_ok(struct page *page, unsigned int order)
int bad = 0;
int clearMlocked = PageMlocked(page);

+ VM_BUG_ON(PageCompound(page));
for (i = 0 ; i < (1 << order) ; ++i)
- bad += free_pages_check(page + i);
+ bad += check_freepage_flags(page + i,
+ PAGE_FLAGS_CHECK_AT_PREP_BUDDY);
if (bad)
return;

@@ -738,8 +744,20 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
if (gfp_flags & __GFP_ZERO)
prep_zero_page(page, order, gfp_flags);

- if (order && (gfp_flags & __GFP_COMP))
- prep_compound_page(page, order);
+ /*
+ * If a compound page is requested, we have to check the page being
+ * prepped. If it's already compound, we leave it alone. If a
+ * compound page is not requested but the page being prepped is
+ * compound, then it must be destroyed
+ */
+ if (order) {
+ if ((gfp_flags & __GFP_COMP) && !PageCompound(page))
+ prep_compound_page(page, order);
+
+ if (!(gfp_flags & __GFP_COMP) && PageCompound(page))
+ if (unlikely(destroy_compound_page(page, order)))
+ return 1;
+ }

return 0;
}
@@ -1105,14 +1123,9 @@ static void free_hot_cold_page(struct page *page, int order, int cold)
int migratetype;
int clearMlocked = PageMlocked(page);

- /* SLUB can return lowish-order compound pages that need handling */
- if (order > 0 && unlikely(PageCompound(page)))
- if (unlikely(destroy_compound_page(page, order)))
- return;
-
if (PageAnon(page))
page->mapping = NULL;
- if (free_pages_check(page))
+ if (check_freepage_flags(page, PAGE_FLAGS_CHECK_AT_PREP))
return;

if (!PageHighMem(page)) {
@@ -1160,6 +1173,15 @@ out:
put_cpu();
}

+static void free_compound_page(struct page *page)
+{
+ unsigned int order = compound_order(page);
+ if (order <= PAGE_ALLOC_COSTLY_ORDER)
+ free_hot_cold_page(page, order, 0);
+ else
+ __free_pages_ok(page, order);
+}
+
void free_hot_page(struct page *page)
{
free_hot_cold_page(page, 0, 0);
--
1.5.6.5

2009-03-22 14:57:19

by Hugh Dickins

[permalink] [raw]
Subject: Re: BUG?: PAGE_FLAGS_CHECK_AT_PREP seems to be cleared too early (Was Re: I just got got another Oops

On Fri, 20 Mar 2009, Mel Gorman wrote:
> On Mon, Mar 16, 2009 at 09:44:11PM +0000, Hugh Dickins wrote:
> > On Mon, 16 Mar 2009, KAMEZAWA Hiroyuki wrote:
> > >
> > > PAGE_FLAGS_CHECK_AT_PREP is cleared by free_pages_check().
> > > This means PG_head/PG_tail(PG_compound) flags are cleared here
> >
> > Yes, well spotted. How embarrassing. I must have got confused
> > about when the checking occurred when freeing a compound page.
>
> I noticed this actually during the page allocator work and concluded
> it didn't matter because free_pages_check() cleared out the bits in
> the same way destroy_compound_page() did. The big difference was that
> destroy_compound_page() did a lot more sanity checks and was slower.
>
> I accidentally fixed this (because I implemented what I though things
> should be doing instead of what they were really doing) at one point and
> the overhead was so high of the debugging check that I just made a note to
> "deal with this later, it's weird looking but ok".

I'm surprised the overhead was so high: I'd have imagined that it
was just treading on the same cachelines as free_pages_check()
already did, doing rather less work.

>
> > > and Compound page will never be freed in sane way.
> >
> > But is that so? I'll admit I've not tried this out yet, but my
> > understanding is that the Compound page actually gets freed fine:
> > free_compound_page() should have passed the right order down, and this
> > PAGE_FLAGS_CHECK_AT_PREP clearing should remove the Head/Tail/Compound
> > flags - doesn't it all work out sanely, without any leaking?
> >
>
> That's more or less what I thought. It can't leak but it's not what you
> expect from compound page destructors either.
>
> > What goes missing is all the destroy_compound_page() checks:
> > that's at present just dead code.
> >
> > There's several things we could do about this.
> >
> > 1. We could regard destroy_compound_page() as legacy debugging code
> > from when compound pages were first introduced, and sanctify my error
> > by removing it. Obviously that's appealing to me, makes me look like
> > a prophet rather than idiot! That's not necessarily the right thing to
> > do, but might appeal also to those cutting overhead from page_alloc.c.
> >
>
> The function is pretty heavy it has to be said. This would be my preferred
> option rather than making the allocator go slower.

KAMEZAWA-san has voted for 2, so that was what I was intending to do.
But if destroy_compound_page() really is costly, I'm happy to throw
it out if others agree.

I don't think it actually buys us a great deal: the main thing it checks
(looking forward to the reuse of the pages, rather than just checking
that what was set up is still there) is that the order being freed is
not greater than the order that was allocated; but I think a PG_buddy
or a page->_count in the excess should catch that in free_pages_check().

And we don't have any such check for the much(?) more common case of
freeing a non-compound high-order page.

>
> > 2. We could do the destroy_compound_page() stuff in free_compound_page()
> > before calling __free_pages_ok(), and add the Head/Tail/Compound flags
> > into PAGE_FLAGS_CHECK_AT_FREE. hat seems a more natural ordering to
> > me, and would remove the PageCompound check from a hotter path; but
> > I've a suspicion there's a good reason why it was not done that way,
> > that I'm overlooking at this moment.
> >
>
> I made this change and dropped it on the grounds it slowed things up so
> badly. It was part of allowing compound pages to be on the PCP lists.
> and ended up looking something like
>
> static void free_compound_page(struct page *page)
> {
> unsigned int order = compound_order(page);
>
> VM_BUG_ON(!PageCompound(page));
> if (unlikely(destroy_compound_page(page, order)))
> return;
>
> __free_pages_ok(page, order);
> }

Yes, that's how I was imagining it. But I think we'd also want
to change hugetlb.c's set_compound_page_dtor(page, NULL) to
set_compound_page_dtor(page, free_compound_page), wouldn't we?
So far as I can see, that's the case that led the destroy call
to be sited in __free_one_page(), but I still don't get why it
was done that way.

>
> > 3. We can define a PAGE_FLAGS_CLEAR_AT_FREE which omits the Head/Tail/
> > Compound flags, and lets destroy_compound_page() be called as before
> > where it's currently intended.
> >
>
> Also did that, slowed things up. Tried fixing destroy_compound_page()
> but it was doing the same work as free_pages_check() so it also sucked.
>
> > What do you think? I suspect I'm going to have to spend tomorrow
> > worrying about something else entirely, and won't return here until
> > Wednesday.
> >
> > But as regards the original "I just got got another Oops": my bug
> > that you point out here doesn't account for that, does it? It's
> > still a mystery, isn't it, how the PageTail bit came to be set at
> > that point?
> >
> > But that Oops does demonstrate that it's a very bad idea to be using
> > the deceptive page_count() in those bad_page() checks: we need to be
> > checking page->_count directly.

I notice your/Nick's 20/25 addresses this issue, good - I'd even be
happy to see that change go into 2.6.29, though probably too late now
(and it has been that way forever). But note, it does need one of us
to replace the page_count in bad_page() in the same way, that's missing.

I've given up on trying to understand how that PageTail is set in
Gene's oops. I was thinking that it got left behind somewhere
because of my destroy_compound_page sequence error, but I just
can't see how: I wonder if it's just a corrupt bit in the struct.

I don't now feel that we need to rush a fix for my error into 2.6.29:
it does appear to be working nicely enough with that inadvertent
change, and we're not yet agreed on which way to go from here.

> >
> > And in looking at this, I notice something else to worry about:
> > that CONFIG_HUGETLBFS prep_compound_gigantic_page(), which seems
> > to exist for a more general case than "p = page + i" - what happens
> > when such a gigantic page is freed, and arrives at the various
> > "p = page + i" assumptions on the freeing path?
> >
>
> That function is a bit confusing I'll give you that. Glancing through,
> what happens is that the destuctor gets replaced with a free_huge_page()
> which throws the page onto those free lists instead. It never hits the
> buddy lists on the grounds they can't handle orders >= MAX_ORDER.

Ah yes, thanks a lot, I'd forgotten all that. Yes, there appear to
be adequate MAX_ORDER checks in hugetlb.c to prevent that danger.

>
> Out of curiousity,

My curiosity is very limited at the moment, I'm afraid I've not glanced.

> here is a patch that was intended for a totally different
> purpose but ended up forcing destroy_compound_page() to be used. It sucked
> so I ended up unfixing it again. It can't be merged as-is obviously but
> you'll see I redefined your flags a bit to exclude the compound flags
> and all that jazz. It could be rebased of course but it'd make more sense
> to have destroy_compound_page() that only does real work for DEBUG_VM as
> free_pages_check() already does enough work.

Yes, putting it under DEBUG_VM could be a compromise; though by now I've
persuaded myself that it's of little value, and the times it might catch
something would be out there without DEBUG_VM=y.

Hugh

>
> ====
>
> >From 93f9b5ebae0000ae3e7985c98680226f4bdd90a8 Mon Sep 17 00:00:00 2001
> From: Mel Gorman <[email protected]>
> Date: Mon, 9 Mar 2009 11:56:56 +0000
> Subject: [PATCH 32/34] Allow compound pages to be stored on the PCP lists
>
> The SLUB allocator frees and allocates compound pages. The setup costs
> for compound pages are noticeable in profiles and incur cache misses as
> every struct page has to be checked and written. This patch allows
> compound pages to be stored on the PCP list to save on teardown and
> setup time.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/page-flags.h | 4 ++-
> mm/page_alloc.c | 56 ++++++++++++++++++++++++++++++-------------
> 2 files changed, 42 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 219a523..4177ec1 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -388,7 +388,9 @@ static inline void __ClearPageTail(struct page *page)
> * Pages being prepped should not have any flags set. It they are set,
> * there has been a kernel bug or struct page corruption.
> */
> -#define PAGE_FLAGS_CHECK_AT_PREP ((1 << NR_PAGEFLAGS) - 1)
> +#define PAGE_FLAGS_CHECK_AT_PREP_BUDDY ((1 << NR_PAGEFLAGS) - 1)
> +#define PAGE_FLAGS_CHECK_AT_PREP (((1 << NR_PAGEFLAGS) - 1) & \
> + ~(1 << PG_head | 1 << PG_tail))
>
> #endif /* !__GENERATING_BOUNDS_H */
> #endif /* PAGE_FLAGS_H */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 253fd98..2941638 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -280,11 +280,7 @@ out:
> * put_page() function. Its ->lru.prev holds the order of allocation.
> * This usage means that zero-order pages may not be compound.
> */
> -
> -static void free_compound_page(struct page *page)
> -{
> - __free_pages_ok(page, compound_order(page));
> -}
> +static void free_compound_page(struct page *page);
>
> void prep_compound_page(struct page *page, unsigned long order)
> {
> @@ -553,7 +549,9 @@ static inline void __free_one_page(struct page *page,
> zone->free_area[page_order(page)].nr_free++;
> }
>
> -static inline int free_pages_check(struct page *page)
> +/* Sanity check a free pages flags */
> +static inline int check_freepage_flags(struct page *page,
> + unsigned long prepflags)
> {
> if (unlikely(page_mapcount(page) |
> (page->mapping != NULL) |
> @@ -562,8 +560,8 @@ static inline int free_pages_check(struct page *page)
> bad_page(page);
> return 1;
> }
> - if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
> - page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> + if (page->flags & prepflags)
> + page->flags &= ~prepflags;
> return 0;
> }
>
> @@ -602,6 +600,12 @@ static int free_pcppages_bulk(struct zone *zone, int count,
> page = list_entry(list->prev, struct page, lru);
> freed += 1 << page->index;
> list_del(&page->lru);
> +
> + /* SLUB can have compound pages to the free lists */
> + if (unlikely(PageCompound(page)))
> + if (unlikely(destroy_compound_page(page, page->index)))
> + continue;
> +
> __free_one_page(page, zone, page->index, migratetype);
> }
> spin_unlock(&zone->lock);
> @@ -633,8 +637,10 @@ static void __free_pages_ok(struct page *page, unsigned int order)
> int bad = 0;
> int clearMlocked = PageMlocked(page);
>
> + VM_BUG_ON(PageCompound(page));
> for (i = 0 ; i < (1 << order) ; ++i)
> - bad += free_pages_check(page + i);
> + bad += check_freepage_flags(page + i,
> + PAGE_FLAGS_CHECK_AT_PREP_BUDDY);
> if (bad)
> return;
>
> @@ -738,8 +744,20 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
> if (gfp_flags & __GFP_ZERO)
> prep_zero_page(page, order, gfp_flags);
>
> - if (order && (gfp_flags & __GFP_COMP))
> - prep_compound_page(page, order);
> + /*
> + * If a compound page is requested, we have to check the page being
> + * prepped. If it's already compound, we leave it alone. If a
> + * compound page is not requested but the page being prepped is
> + * compound, then it must be destroyed
> + */
> + if (order) {
> + if ((gfp_flags & __GFP_COMP) && !PageCompound(page))
> + prep_compound_page(page, order);
> +
> + if (!(gfp_flags & __GFP_COMP) && PageCompound(page))
> + if (unlikely(destroy_compound_page(page, order)))
> + return 1;
> + }
>
> return 0;
> }
> @@ -1105,14 +1123,9 @@ static void free_hot_cold_page(struct page *page, int order, int cold)
> int migratetype;
> int clearMlocked = PageMlocked(page);
>
> - /* SLUB can return lowish-order compound pages that need handling */
> - if (order > 0 && unlikely(PageCompound(page)))
> - if (unlikely(destroy_compound_page(page, order)))
> - return;
> -
> if (PageAnon(page))
> page->mapping = NULL;
> - if (free_pages_check(page))
> + if (check_freepage_flags(page, PAGE_FLAGS_CHECK_AT_PREP))
> return;
>
> if (!PageHighMem(page)) {
> @@ -1160,6 +1173,15 @@ out:
> put_cpu();
> }
>
> +static void free_compound_page(struct page *page)
> +{
> + unsigned int order = compound_order(page);
> + if (order <= PAGE_ALLOC_COSTLY_ORDER)
> + free_hot_cold_page(page, order, 0);
> + else
> + __free_pages_ok(page, order);
> +}
> +
> void free_hot_page(struct page *page)
> {
> free_hot_cold_page(page, 0, 0);
> --
> 1.5.6.5

2009-03-23 11:28:15

by Mel Gorman

[permalink] [raw]
Subject: Re: BUG?: PAGE_FLAGS_CHECK_AT_PREP seems to be cleared too early (Was Re: I just got got another Oops

On Sun, Mar 22, 2009 at 02:55:08PM +0000, Hugh Dickins wrote:
> On Fri, 20 Mar 2009, Mel Gorman wrote:
> > On Mon, Mar 16, 2009 at 09:44:11PM +0000, Hugh Dickins wrote:
> > > On Mon, 16 Mar 2009, KAMEZAWA Hiroyuki wrote:
> > > >
> > > > PAGE_FLAGS_CHECK_AT_PREP is cleared by free_pages_check().
> > > > This means PG_head/PG_tail(PG_compound) flags are cleared here
> > >
> > > Yes, well spotted. How embarrassing. I must have got confused
> > > about when the checking occurred when freeing a compound page.
> >
> > I noticed this actually during the page allocator work and concluded
> > it didn't matter because free_pages_check() cleared out the bits in
> > the same way destroy_compound_page() did. The big difference was that
> > destroy_compound_page() did a lot more sanity checks and was slower.
> >
> > I accidentally fixed this (because I implemented what I though things
> > should be doing instead of what they were really doing) at one point and
> > the overhead was so high of the debugging check that I just made a note to
> > "deal with this later, it's weird looking but ok".
>
> I'm surprised the overhead was so high: I'd have imagined that it
> was just treading on the same cachelines as free_pages_check()
> already did, doing rather less work.
>

My recollection is that it looked heavy because I was running netperf which
was allocating on one CPU and freeing on the other, incurring a cache miss
for every page it wrote to. This showed up heavily in profiles as you might
imagine. However, this penalty would also be hit in free_pages_check() if
destroy_compound_page() had not run so that skewed my perception. Still, we are
running over the same array of pages twice, when we could have done it once.

> >
> > > > and Compound page will never be freed in sane way.
> > >
> > > But is that so? I'll admit I've not tried this out yet, but my
> > > understanding is that the Compound page actually gets freed fine:
> > > free_compound_page() should have passed the right order down, and this
> > > PAGE_FLAGS_CHECK_AT_PREP clearing should remove the Head/Tail/Compound
> > > flags - doesn't it all work out sanely, without any leaking?
> > >
> >
> > That's more or less what I thought. It can't leak but it's not what you
> > expect from compound page destructors either.
> >
> > > What goes missing is all the destroy_compound_page() checks:
> > > that's at present just dead code.
> > >
> > > There's several things we could do about this.
> > >
> > > 1. We could regard destroy_compound_page() as legacy debugging code
> > > from when compound pages were first introduced, and sanctify my error
> > > by removing it. Obviously that's appealing to me, makes me look like
> > > a prophet rather than idiot! That's not necessarily the right thing to
> > > do, but might appeal also to those cutting overhead from page_alloc.c.
> > >
> >
> > The function is pretty heavy it has to be said. This would be my preferred
> > option rather than making the allocator go slower.
>
> KAMEZAWA-san has voted for 2, so that was what I was intending to do.
> But if destroy_compound_page() really is costly, I'm happy to throw
> it out if others agree.
>

I withdraw the objection on the grounds that 2 is the more correct
option of the two. Even though it is heavy, it is also possible to hold
compound pages on the PCP lists for a time and can be avoided in more
ways than one.

> I don't think it actually buys us a great deal: the main thing it checks
> (looking forward to the reuse of the pages, rather than just checking
> that what was set up is still there) is that the order being freed is
> not greater than the order that was allocated; but I think a PG_buddy
> or a page->_count in the excess should catch that in free_pages_check().
>
> And we don't have any such check for the much(?) more common case of
> freeing a non-compound high-order page.
>

We have a similar check sortof. It looks like this

for (i = 0 ; i < (1 << order) ; ++i)
bad += free_pages_check(page + i);

This is where we are walking over the array twice. One way of fixing this would
be to move the free_pages_check() higher in the call chain for high-order
pages and have destroy_compound_page() first checkec the tail pages know
where their head is and then call free_pages_check(). That should re-enable
just the debugging check without too much cost.

> > > 2. We could do the destroy_compound_page() stuff in free_compound_page()
> > > before calling __free_pages_ok(), and add the Head/Tail/Compound flags
> > > into PAGE_FLAGS_CHECK_AT_FREE. hat seems a more natural ordering to
> > > me, and would remove the PageCompound check from a hotter path; but
> > > I've a suspicion there's a good reason why it was not done that way,
> > > that I'm overlooking at this moment.
> > >
> >
> > I made this change and dropped it on the grounds it slowed things up so
> > badly. It was part of allowing compound pages to be on the PCP lists.
> > and ended up looking something like
> >
> > static void free_compound_page(struct page *page)
> > {
> > unsigned int order = compound_order(page);
> >
> > VM_BUG_ON(!PageCompound(page));
> > if (unlikely(destroy_compound_page(page, order)))
> > return;
> >
> > __free_pages_ok(page, order);
> > }
>
> Yes, that's how I was imagining it. But I think we'd also want
> to change hugetlb.c's set_compound_page_dtor(page, NULL) to
> set_compound_page_dtor(page, free_compound_page), wouldn't we?

For full correctness, yes. As it is, it happens to work because the
compound flags get cleared and destroy_compound_page() is little more
than a debug check.

> So far as I can see, that's the case that led the destroy call
> to be sited in __free_one_page(), but I still don't get why it
> was done that way.
>

I don't recall any reasoning but probably because it just worked. The
first time huge pages had a destructor set to NULL was commit
41d78ba55037468e6c86c53e3076d1a74841de39 and it appears to have been
carried forward ever since.

> >
> > > 3. We can define a PAGE_FLAGS_CLEAR_AT_FREE which omits the Head/Tail/
> > > Compound flags, and lets destroy_compound_page() be called as before
> > > where it's currently intended.
> > >
> >
> > Also did that, slowed things up. Tried fixing destroy_compound_page()
> > but it was doing the same work as free_pages_check() so it also sucked.
> >
> > > What do you think? I suspect I'm going to have to spend tomorrow
> > > worrying about something else entirely, and won't return here until
> > > Wednesday.
> > >
> > > But as regards the original "I just got got another Oops": my bug
> > > that you point out here doesn't account for that, does it? It's
> > > still a mystery, isn't it, how the PageTail bit came to be set at
> > > that point?
> > >
> > > But that Oops does demonstrate that it's a very bad idea to be using
> > > the deceptive page_count() in those bad_page() checks: we need to be
> > > checking page->_count directly.
>
> I notice your/Nick's 20/25 addresses this issue, good - I'd even be
> happy to see that change go into 2.6.29, though probably too late now
> (and it has been that way forever).

Agreed, although that change is an accident essentially. It's not super
clear to me it would help but I haven't looked closely enough at the oops
to have a useful opinion.

> But note, it does need one of us
> to replace the page_count in bad_page() in the same way, that's missing.
>
> I've given up on trying to understand how that PageTail is set in
> Gene's oops. I was thinking that it got left behind somewhere
> because of my destroy_compound_page sequence error, but I just
> can't see how: I wonder if it's just a corrupt bit in the struct.
>

I can't see how it can be left behind either as it should have been
getting clobbered. If it was something like inappropriate buddy merging,
a lot more would have broken.

> I don't now feel that we need to rush a fix for my error into 2.6.29:
> it does appear to be working nicely enough with that inadvertent
> change, and we're not yet agreed on which way to go from here.
>
> > >
> > > And in looking at this, I notice something else to worry about:
> > > that CONFIG_HUGETLBFS prep_compound_gigantic_page(), which seems
> > > to exist for a more general case than "p = page + i" - what happens
> > > when such a gigantic page is freed, and arrives at the various
> > > "p = page + i" assumptions on the freeing path?
> > >
> >
> > That function is a bit confusing I'll give you that. Glancing through,
> > what happens is that the destuctor gets replaced with a free_huge_page()
> > which throws the page onto those free lists instead. It never hits the
> > buddy lists on the grounds they can't handle orders >= MAX_ORDER.
>
> Ah yes, thanks a lot, I'd forgotten all that. Yes, there appear to
> be adequate MAX_ORDER checks in hugetlb.c to prevent that danger.
>
> >
> > Out of curiousity,
>
> My curiosity is very limited at the moment, I'm afraid I've not glanced.
>

No harm.

> > here is a patch that was intended for a totally different
> > purpose but ended up forcing destroy_compound_page() to be used. It sucked
> > so I ended up unfixing it again. It can't be merged as-is obviously but
> > you'll see I redefined your flags a bit to exclude the compound flags
> > and all that jazz. It could be rebased of course but it'd make more sense
> > to have destroy_compound_page() that only does real work for DEBUG_VM as
> > free_pages_check() already does enough work.
>
> Yes, putting it under DEBUG_VM could be a compromise; though by now I've
> persuaded myself that it's of little value, and the times it might catch
> something would be out there without DEBUG_VM=y.
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab