2014-01-31 21:22:46

by Holger Kiehl

Subject: Need help in bug in isolate_migratepages_range

Hello,

today one of our systems got a kernel bug message. It kept on running,
but more and more processes began to get stuck in D state (e.g. a simple w
command would never return) and I eventually had to reboot. Here is the
full message:

Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
Jan 31 13:07:43 asterix kernel: IP: [<ffffffff810af0ac>] isolate_migratepages_range+0x32d/0x653
Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0
Jan 31 13:07:43 asterix kernel: Oops: 0000 [#1] SMP
Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache coretemp ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 sg ehci_pci i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si ipmi_msghandler usbcore usb_common [last unloaded: microcode]
Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 3.12.9 #1
Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY RX300 S4 /D2519, BIOS 4.06 Rev. 1.04.2519 07/30/2008
Jan 31 13:07:43 asterix kernel: task: ffff8807d30b08c0 ti: ffff8807d30b2000 task.ti: ffff8807d30b2000
Jan 31 13:07:43 asterix kernel: RIP: 0010:[<ffffffff810af0ac>] [<ffffffff810af0ac>] isolate_migratepages_range+0x32d/0x653
Jan 31 13:07:43 asterix kernel: RSP: 0000:ffff8807d30b3928 EFLAGS: 00010286
Jan 31 13:07:43 asterix kernel: RAX: 0000000000000000 RBX: 000000000020ec09 RCX: 0000000000000002
Jan 31 13:07:43 asterix kernel: RDX: 2c00000000008000 RSI: 0000000000000004 RDI: 000000000000006c
Jan 31 13:07:43 asterix kernel: RBP: ffff8807d30b39f8 R08: ffff88083fbde390 R09: 0000000000000001
Jan 31 13:07:43 asterix kernel: R10: 0000000000000000 R11: ffffea000733a000 R12: ffff8807d30b3a58
Jan 31 13:07:43 asterix kernel: R13: ffffea000733a1f8 R14: 0000000000000000 R15: ffff88083ffe1d80
Jan 31 13:07:43 asterix kernel: FS: 00007f9d9e72f910(0000) GS:ffff88083fd40000(0000) knlGS:0000000000000000
Jan 31 13:07:43 asterix kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan 31 13:07:43 asterix kernel: CR2: 000000000000001c CR3: 00000007d3070000 CR4: 00000000000407e0
Jan 31 13:07:43 asterix kernel: Stack:
Jan 31 13:07:43 asterix kernel: 0000000000000009 ffff88083ffe16c0 ffffea00002e6af0 ffff8807d30b3998
Jan 31 13:07:43 asterix kernel: ffff8807d30b2010 00ff8807d30b08c0 ffff8807d30b08c0 000000000020f000
Jan 31 13:07:43 asterix kernel: 0000000000000000 000000000000083b 000000000000000a ffff8807d30b3a68
Jan 31 13:07:43 asterix kernel: Call Trace:
Jan 31 13:07:43 asterix kernel: [<ffffffff810a161f>] ? lru_add_drain_cpu+0x25/0x97
Jan 31 13:07:43 asterix kernel: [<ffffffff810af687>] compact_zone+0x2b5/0x319
Jan 31 13:07:43 asterix kernel: [<ffffffff810da586>] ? put_super+0x20/0x2c
Jan 31 13:07:43 asterix kernel: [<ffffffff810afa4d>] compact_zone_order+0xad/0xc4
Jan 31 13:07:43 asterix kernel: [<ffffffff810afaf5>] try_to_compact_pages+0x91/0xe8
Jan 31 13:07:43 asterix kernel: [<ffffffff8109b92d>] ? page_alloc_cpu_notify+0x3e/0x3e
Jan 31 13:07:43 asterix kernel: [<ffffffff8109da34>] __alloc_pages_direct_compact+0xae/0x195
Jan 31 13:07:43 asterix kernel: [<ffffffff8109e45d>] __alloc_pages_nodemask+0x772/0x7b5
Jan 31 13:07:43 asterix kernel: [<ffffffff810c85a3>] alloc_pages_vma+0xd6/0x101
Jan 31 13:07:43 asterix kernel: [<ffffffff810d47e3>] do_huge_pmd_anonymous_page+0x199/0x2ee
Jan 31 13:07:43 asterix kernel: [<ffffffff810b3884>] handle_mm_fault+0x1b7/0xceb
Jan 31 13:07:43 asterix kernel: [<ffffffff8105dedc>] ? __dequeue_entity+0x2e/0x33
Jan 31 13:07:43 asterix kernel: [<ffffffff8102d8c3>] __do_page_fault+0x3bd/0x3e4
Jan 31 13:07:43 asterix kernel: [<ffffffff810bbe1a>] ? mprotect_fixup+0x1c9/0x1fb
Jan 31 13:07:43 asterix kernel: [<ffffffff810aa0f0>] ? vm_mmap_pgoff+0x6d/0x8f
Jan 31 13:07:43 asterix kernel: [<ffffffff810795f5>] ? SyS_futex+0x103/0x13d
Jan 31 13:07:43 asterix kernel: [<ffffffff8102d8f3>] do_page_fault+0x9/0xb
Jan 31 13:07:43 asterix kernel: [<ffffffff813d3672>] page_fault+0x22/0x30
Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 43 02 00 00 41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 d2 79 04 49 8b 45 30 <8b> 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 c0 48 85 d2
Jan 31 13:07:43 asterix kernel: RIP [<ffffffff810af0ac>] isolate_migratepages_range+0x32d/0x653
Jan 31 13:07:43 asterix kernel: RSP <ffff8807d30b3928>
Jan 31 13:07:43 asterix kernel: CR2: 000000000000001c
Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]---

Kernel is a plain kernel.org kernel 3.12.9 and it uses drbd to replicate
data to another host. Any idea what the cause of this bug is? Could it be
hardware? The system has been running now for five years without any problems.

Please CC me since I am not on the list.

Many thanks in advance.

Regards,
Holger


2014-02-03 12:20:57

by Michal Hocko

Subject: Re: Need help in bug in isolate_migratepages_range

[CCing linux-mm]

Does this ring any bells? I haven't checked very deeply, but it doesn't
seem to have been fixed since 3.12.

Holger, could you post your config, please?

On Fri 31-01-14 21:12:27, Holger Kiehl wrote:
> [...]

--
Michal Hocko
SUSE Labs

2014-02-03 14:29:27

by Holger Kiehl

Subject: Re: Need help in bug in isolate_migratepages_range

I have attached it. Please tell me if you do not get the attachment.

Thank you for looking into this.

Regards,
Holger


On Mon, 3 Feb 2014, Michal Hocko wrote:

> [CCing linux-mm]
>
> Does this ring any bells? I haven't checked very deeply, but it doesn't
> seem to have been fixed since 3.12.
>
> Holger, could you post your config, please?
>
> On Fri 31-01-14 21:12:27, Holger Kiehl wrote:
>> [...]
>
> --
> Michal Hocko
> SUSE Labs
>


Attachments:
.config (74.92 kB)

2014-02-03 16:20:41

by Michal Hocko

Subject: Re: Need help in bug in isolate_migratepages_range

On Mon 03-02-14 14:29:22, Holger Kiehl wrote:
> I have attached it. Please tell me if you do not get the attachment.

I hoped it would help me get compiled code closer to yours, but I am
probably using too different a gcc.
Anyway, I've tried to check whether there is anything I can hook onto and
it seems that this is a race with thp merge/split or something like that.

[...]
> >> Jan 31 13:07:43 asterix kernel: BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
> >> Jan 31 13:07:43 asterix kernel: IP: [<ffffffff810af0ac>] isolate_migratepages_range+0x32d/0x653
> >> Jan 31 13:07:43 asterix kernel: PGD 7d3074067 PUD 7d3073067 PMD 0
> >> Jan 31 13:07:43 asterix kernel: Oops: 0000 [#1] SMP
> >> Jan 31 13:07:43 asterix kernel: Modules linked in: drbd lru_cache coretemp ipmi_devintf bonding nf_conntrack_ftp binfmt_misc usbhid i2c_i801 sg ehci_pci i2c_core ehci_hcd uhci_hcd i5000_edac i5k_amb ipmi_si ipmi_msghandler usbcore usb_common [last unloaded: microcode]
> >> Jan 31 13:07:43 asterix kernel: CPU: 5 PID: 14164 Comm: java Not tainted 3.12.9 #1
> >> Jan 31 13:07:43 asterix kernel: Hardware name: FUJITSU SIEMENS PRIMERGY RX300 S4 /D2519, BIOS 4.06 Rev. 1.04.2519 07/30/2008
> >> Jan 31 13:07:43 asterix kernel: task: ffff8807d30b08c0 ti: ffff8807d30b2000 task.ti: ffff8807d30b2000
> >> Jan 31 13:07:43 asterix kernel: RIP: 0010:[<ffffffff810af0ac>] [<ffffffff810af0ac>] isolate_migratepages_range+0x32d/0x653
> >> Jan 31 13:07:43 asterix kernel: RSP: 0000:ffff8807d30b3928 EFLAGS: 00010286
> >> Jan 31 13:07:43 asterix kernel: RAX: 0000000000000000 RBX: 000000000020ec09 RCX: 0000000000000002
> >> Jan 31 13:07:43 asterix kernel: RDX: 2c00000000008000 RSI: 0000000000000004 RDI: 000000000000006c
> >> Jan 31 13:07:43 asterix kernel: RBP: ffff8807d30b39f8 R08: ffff88083fbde390 R09: 0000000000000001
> >> Jan 31 13:07:43 asterix kernel: R10: 0000000000000000 R11: ffffea000733a000 R12: ffff8807d30b3a58
> >> Jan 31 13:07:43 asterix kernel: R13: ffffea000733a1f8 R14: 0000000000000000 R15: ffff88083ffe1d80
> >> Jan 31 13:07:43 asterix kernel: FS: 00007f9d9e72f910(0000) GS:ffff88083fd40000(0000) knlGS:0000000000000000
> >> Jan 31 13:07:43 asterix kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> >> Jan 31 13:07:43 asterix kernel: CR2: 000000000000001c CR3: 00000007d3070000 CR4: 00000000000407e0
> >> Jan 31 13:07:43 asterix kernel: Stack:
> >> Jan 31 13:07:43 asterix kernel: 0000000000000009 ffff88083ffe16c0 ffffea00002e6af0 ffff8807d30b3998
> >> Jan 31 13:07:43 asterix kernel: ffff8807d30b2010 00ff8807d30b08c0 ffff8807d30b08c0 000000000020f000
> >> Jan 31 13:07:43 asterix kernel: 0000000000000000 000000000000083b 000000000000000a ffff8807d30b3a68
> >> Jan 31 13:07:43 asterix kernel: Call Trace:
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff810a161f>] ? lru_add_drain_cpu+0x25/0x97
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff810af687>] compact_zone+0x2b5/0x319
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff810da586>] ? put_super+0x20/0x2c
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff810afa4d>] compact_zone_order+0xad/0xc4
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff810afaf5>] try_to_compact_pages+0x91/0xe8
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff8109b92d>] ? page_alloc_cpu_notify+0x3e/0x3e
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff8109da34>] __alloc_pages_direct_compact+0xae/0x195
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff8109e45d>] __alloc_pages_nodemask+0x772/0x7b5
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff810c85a3>] alloc_pages_vma+0xd6/0x101
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff810d47e3>] do_huge_pmd_anonymous_page+0x199/0x2ee
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff810b3884>] handle_mm_fault+0x1b7/0xceb
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff8105dedc>] ? __dequeue_entity+0x2e/0x33
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff8102d8c3>] __do_page_fault+0x3bd/0x3e4
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff810bbe1a>] ? mprotect_fixup+0x1c9/0x1fb
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff810aa0f0>] ? vm_mmap_pgoff+0x6d/0x8f
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff810795f5>] ? SyS_futex+0x103/0x13d
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff8102d8f3>] do_page_fault+0x9/0xb
> >> Jan 31 13:07:43 asterix kernel: [<ffffffff813d3672>] page_fault+0x22/0x30
> >> Jan 31 13:07:43 asterix kernel: Code: 00 41 f7 45 00 ff ff ff 01 0f 85 43 02 00 00 41 8b 45 18 85 c0 0f 89 37 02 00 00 49 8b 55 00 4c 89 e8 66 85 d2 79 04 49 8b 45 30 <8b> 40 1c 83 f8 01 0f 85 1b 02 00 00 49 8b 55 08 30 c0 48 85 d2
> >> Jan 31 13:07:43 asterix kernel: RIP [<ffffffff810af0ac>] isolate_migratepages_range+0x32d/0x653
> >> Jan 31 13:07:43 asterix kernel: RSP <ffff8807d30b3928>
> >> Jan 31 13:07:43 asterix kernel: CR2: 000000000000001c
> >> Jan 31 13:07:43 asterix kernel: ---[ end trace fba75c5b0b9175ea ]---

This seems to match:
   17027:  49 8b 17       mov    (%r15),%rdx       # page->flags
   1702a:  4c 89 f8       mov    %r15,%rax
   1702d:  80 e6 80       and    $0x80,%dh         # PageTail test
   17030:  74 04          je     17036 <isolate_migratepages_range+0x2bf>
   17032:  49 8b 47 30    mov    0x30(%r15),%rax   # page = page->first_page
   17036:  8b 40 1c       mov    0x1c(%rax),%eax   <<< page->_count
   17039:  ff c8          dec    %eax

Which seems to be an inlined compound_head. DH is 0x80, so this is a tail
page. This would suggest that the tail page doesn't have first_page set up
properly and it contains NULL.
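
As a sanity check on the reading above, here is a small user-space sketch
in C. It is not kernel code: struct mock_page, MOCK_PG_TAIL, dh_of() and
count_load_address() are hypothetical names, and the layout only assumes
the three offsets visible in the disassembly (flags at 0x0, _count at 0x1c,
first_page at 0x30). It extracts DH from the RDX value in the oops and
computes, without dereferencing anything, the address the final load would
touch when first_page is NULL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct page; only the offsets seen in the
 * disassembly are modelled, everything else is padding. */
struct mock_page {
	uint64_t flags;                /* 0x00: page->flags */
	uint8_t  pad0[0x1c - 0x08];    /* 0x08..0x1b: not modelled */
	int32_t  count;                /* 0x1c: page->_count */
	uint8_t  pad1[0x30 - 0x20];    /* 0x20..0x2f: not modelled */
	struct mock_page *first_page;  /* 0x30: page->first_page */
};

static_assert(offsetof(struct mock_page, count) == 0x1c,
	      "count must sit at the offset used by the faulting load");
static_assert(offsetof(struct mock_page, first_page) == 0x30,
	      "first_page must sit at the offset used by the disassembly");

/* Bit tested by "and $0x80,%dh": bit 7 of DH is bit 15 of RDX. */
#define MOCK_PG_TAIL 0x8000ULL

/* DH is bits 8..15 of RDX. */
static unsigned dh_of(uint64_t rdx)
{
	return (unsigned)((rdx >> 8) & 0xff);
}

/* Address that "mov 0x1c(%rax),%eax" would read, computed without
 * dereferencing the head pointer, so a NULL first_page is safe here. */
static uintptr_t count_load_address(const struct mock_page *page)
{
	const struct mock_page *head = page;

	if (page->flags & MOCK_PG_TAIL)
		head = page->first_page;   /* the compound_head step */

	return (uintptr_t)head + offsetof(struct mock_page, count);
}
```

With RDX = 0x2c00000000008000 from the oops, dh_of() gives 0x80, and a tail
page whose first_page is NULL yields a load address of 0x1c, which matches
the CR2 value in the report.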

But maybe I've just matched the code incorrectly. Could you try to
disassemble your vmlinux and send the generated code, please?

Something like
objdump -d vmlinux > vmlinux.dis
and cut out isolate_migratepages_range function. Or simply upload your
vmlinux.dis somewhere so that we can download it.
--
Michal Hocko
SUSE Labs

2014-02-03 16:52:19

by Vlastimil Babka

Subject: Re: Need help in bug in isolate_migratepages_range

On 02/03/2014 05:20 PM, Michal Hocko wrote:
> On Mon 03-02-14 14:29:22, Holger Kiehl wrote:
>> I have attached it. Please tell me if you do not get the attachment.
>
> I hoped it would help me get compiled code closer to yours, but I am
> probably using too different a gcc.
> Anyway, I've tried to check whether there is anything I can hook onto and
> it seems that this is a race with thp merge/split or something like that.
>
> [...]
>
> This seems to match:
>    17027:  49 8b 17       mov    (%r15),%rdx       # page->flags
>    1702a:  4c 89 f8       mov    %r15,%rax
>    1702d:  80 e6 80       and    $0x80,%dh         # PageTail test
>    17030:  74 04          je     17036 <isolate_migratepages_range+0x2bf>
>    17032:  49 8b 47 30    mov    0x30(%r15),%rax   # page = page->first_page
>    17036:  8b 40 1c       mov    0x1c(%rax),%eax   <<< page->_count
>    17039:  ff c8          dec    %eax
>
> Which seems to be an inlined compound_head. DH is 0x80, so this is a tail
> page. This would suggest that the tail page doesn't have first_page set up
> properly and it contains NULL.

It seems to come from balloon_page_movable() and its test
page_count(page) == 1.
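
One way to probe the race theory is the following user-space sketch in C,
which is not kernel code. It assumes the Code: bytes in the oops decode so
that the check first tests the low flag bits against the mask 0x01ffffff,
then the word at offset 0x18, and only afterwards reaches the tail-bit test
and the comparison of the count with 1. That decoding and all names below
(MOCK_FLAGS_CHECK, MOCK_PG_TAIL and the mock_* helpers) are assumptions
made for illustration. Under that assumption the tail bit lies inside the
mask, so a single unchanged flags value cannot both pass the first test and
later be seen as a tail page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed mask of the first flags test, read from the Code: bytes. */
#define MOCK_FLAGS_CHECK 0x01ffffffu
/* Tail bit as tested in the disassembly (bit 15 of the flags word). */
#define MOCK_PG_TAIL     0x8000u

static_assert((MOCK_PG_TAIL & MOCK_FLAGS_CHECK) != 0,
	      "the tail bit is covered by the first flags test");

/* First step of the modelled check: no flag inside the mask may be set. */
static bool mock_flags_cleared(uint64_t flags)
{
	return ((uint32_t)flags & MOCK_FLAGS_CHECK) == 0;
}

/* Later step: the tail-page test that precedes the count load. */
static bool mock_is_tail(uint64_t flags)
{
	return (flags & MOCK_PG_TAIL) != 0;
}

/* Could one unchanged flags value pass the first step and still take the
 * tail-page path? Always false while the mask covers the tail bit. */
static bool single_snapshot_reaches_tail_path(uint64_t flags)
{
	return mock_flags_cleared(flags) && mock_is_tail(flags);
}
```

If the decoding holds, the flags value held in RDX at the time of the fault
(0x2c00000000008000) would have failed the first test, which would be
consistent with the flags changing between the two loads.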

> But maybe I've just matched the code incorrectly. Could you try to
> disassemble your vmlinux and send the generated code, please?
>
> Something like
> objdump -d vmlinux > vmlinux.dis
> and cut out isolate_migratepages_range function. Or simply upload your
> vmlinux.dis somewhere so that we can download it.
>

2014-02-03 19:50:26

by Holger Kiehl

Subject: Re: Need help in bug in isolate_migratepages_range

On Mon, 3 Feb 2014, Michal Hocko wrote:

> On Mon 03-02-14 14:29:22, Holger Kiehl wrote:
>> I have attached it. Please tell me if you do not get the attachment.
>
> I hoped it would help me get compiled code closer to yours, but I am
> probably using too different a gcc.
>
I have an old gcc; it is 4.4.1-2.

> Anyway, I've tried to check whether there is anything I can hook onto and
> it seems that this is a race with thp merge/split or something like that.
>
> [...]
>
> This seems to match:
> 17027: 49 8b 17 mov (%r15),%rdx # page->flags
> 1702a: 4c 89 f8 mov %r15,%rax
> 1702d: 80 e6 80 and $0x80,%dh # PageTail test
> 17030: 74 04 je 17036 <isolate_migratepages_range+0x2bf>
> 17032: 49 8b 47 30 mov 0x30(%r15),%rax # page = page->first_page
> 17036: 8b 40 1c mov 0x1c(%rax),%eax <<< page->_count
> 17039: ff c8 dec %eax
>
> Which seems to be an inlined compound_head(). DH is 0x80, so this is a tail
> page. This would suggest that the tail page doesn't have first_page set up
> properly and that it contains NULL.
>
> But maybe I've just matched the code incorrectly. Could you try to
> disassemble your vmlinux and send the generated code, please?
>
> Something like
> objdump -d vmlinux > vmlinux.dis
> and cut out isolate_migratepages_range function. Or simply upload your
> vmlinux.dis somewhere so that we can download it.
>
I have attached the cut out. In case you want to see the full version,
you can download it from here:

ftp://ftp.dwd.de/pub/afd/test/vmlinux.dis.xz

Thank you for helping!

Regards,
Holger


Attachments:
vmlinux.dis.isolate_migratepages_range (26.42 kB)

2014-02-04 00:06:54

by David Rientjes

Subject: Re: Need help in bug in isolate_migratepages_range

On Mon, 3 Feb 2014, Vlastimil Babka wrote:

> It seems to come from balloon_page_movable() and its test page_count(page) ==
> 1.
>

Hmm, I think it might be because compound_head() == NULL here. Holger,
this looks like a race condition when allocating a compound page, did you
only see it once or is it actually reproducible?

I think this happens when a new compound page is allocated: PageBuddy is
cleared before prep_compound_page() runs, and we then see PageTail(p) set
while p->first_page is not yet initialized. Is there any way to avoid
memory barriers in compound_head()?

2014-02-04 07:17:29

by Holger Kiehl

Subject: Re: Need help in bug in isolate_migratepages_range

On Mon, 3 Feb 2014, David Rientjes wrote:

> On Mon, 3 Feb 2014, Vlastimil Babka wrote:
>
>> It seems to come from balloon_page_movable() and its test page_count(page) ==
>> 1.
>>
>
> Hmm, I think it might be because compound_head() == NULL here. Holger,
> this looks like a race condition when allocating a compound page, did you
> only see it once or is it actually reproducible?
>
No, this only happened once. It is not reproducible; the system had been
running for four days without problems. And before this kernel, it ran for
five years without any problems.

Thanks,
Holger

2014-02-05 00:02:50

by David Rientjes

Subject: [patch] mm, page_alloc: make first_page visible before PageTail

Commit bf6bddf1924e ("mm: introduce compaction and migration for ballooned
pages") introduces page_count(page) into memory compaction which
dereferences page->first_page if PageTail(page).

Introduce a store memory barrier to ensure page->first_page is properly
initialized so that code that does page_count(page) on pages off the lru
always have a valid p->first_page.

Reported-by: Holger Kiehl <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
mm/page_alloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -369,9 +369,10 @@ void prep_compound_page(struct page *page, unsigned long order)
__SetPageHead(page);
for (i = 1; i < nr_pages; i++) {
struct page *p = page + i;
- __SetPageTail(p);
set_page_count(p, 0);
p->first_page = page;
+ smp_wmb();
+ __SetPageTail(p);
}
}

2014-02-05 00:06:44

by Andrew Morton

Subject: Re: [patch] mm, page_alloc: make first_page visible before PageTail

On Tue, 4 Feb 2014 16:02:39 -0800 (PST) David Rientjes <[email protected]> wrote:

> Commit bf6bddf1924e ("mm: introduce compaction and migration for ballooned
> pages") introduces page_count(page) into memory compaction which
> dereferences page->first_page if PageTail(page).
>
> Introduce a store memory barrier to ensure page->first_page is properly
> initialized so that code that does page_count(page) on pages off the lru
> always have a valid p->first_page.

Could we have a code comment please? Even checkpatch knows this rule!

> Reported-by: Holger Kiehl <[email protected]>

What did Holger report?

2014-02-05 00:14:12

by David Rientjes

Subject: Re: [patch] mm, page_alloc: make first_page visible before PageTail

On Tue, 4 Feb 2014, Andrew Morton wrote:

> > Commit bf6bddf1924e ("mm: introduce compaction and migration for ballooned
> > pages") introduces page_count(page) into memory compaction which
> > dereferences page->first_page if PageTail(page).
> >
> > Introduce a store memory barrier to ensure page->first_page is properly
> > initialized so that code that does page_count(page) on pages off the lru
> > always have a valid p->first_page.
>
> Could we have a code comment please? Even checkpatch knows this rule!
>

Ok.

> > Reported-by: Holger Kiehl <[email protected]>
>
> What did Holger report?
>

A once-in-five-years NULL pointer dereference on the aforementioned
page_count(page).

2014-02-05 00:22:57

by David Rientjes

Subject: [patch v2] mm, page_alloc: make first_page visible before PageTail

Commit bf6bddf1924e ("mm: introduce compaction and migration for ballooned
pages") introduces page_count(page) into memory compaction which
dereferences page->first_page if PageTail(page).

Introduce a store memory barrier to ensure page->first_page is properly
initialized so that code that does page_count(page) on pages off the lru
always have a valid p->first_page.

Reported-by: Holger Kiehl <[email protected]>
Signed-off-by: David Rientjes <[email protected]>
---
v2: with commentary, per checkpatch

mm/page_alloc.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -369,9 +369,11 @@ void prep_compound_page(struct page *page, unsigned long order)
__SetPageHead(page);
for (i = 1; i < nr_pages; i++) {
struct page *p = page + i;
- __SetPageTail(p);
set_page_count(p, 0);
p->first_page = page;
+ /* Make sure p->first_page is always valid for PageTail() */
+ smp_wmb();
+ __SetPageTail(p);
}
}

2014-02-05 08:42:39

by Michal Hocko

Subject: Re: [patch v2] mm, page_alloc: make first_page visible before PageTail

On Tue 04-02-14 16:22:53, David Rientjes wrote:
> Commit bf6bddf1924e ("mm: introduce compaction and migration for ballooned
> pages") introduces page_count(page) into memory compaction which
> dereferences page->first_page if PageTail(page).
>
> Introduce a store memory barrier to ensure page->first_page is properly
> initialized so that code that does page_count(page) on pages off the lru
> always have a valid p->first_page.
>
> Reported-by: Holger Kiehl <[email protected]>
> Signed-off-by: David Rientjes <[email protected]>
> ---
> v2: with commentary, per checkpatch
>
> mm/page_alloc.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -369,9 +369,11 @@ void prep_compound_page(struct page *page, unsigned long order)
> __SetPageHead(page);
> for (i = 1; i < nr_pages; i++) {
> struct page *p = page + i;
> - __SetPageTail(p);
> set_page_count(p, 0);
> p->first_page = page;
> + /* Make sure p->first_page is always valid for PageTail() */
> + smp_wmb();
> + __SetPageTail(p);

Where is the pairing smp_rmb()? I would expect it in compound_head().

> }
> }
>

--
Michal Hocko
SUSE Labs

2014-02-12 09:00:43

by David Rientjes

Subject: [patch -mm] mm: close PageTail race

Since "mm, compaction: avoid isolating pinned pages", it has been
possible for page_count(page) to race with prep_compound_page() by
finding PageTail(page) set with a NULL or dangling page->first_page.

"mm, page_alloc: make first_page visible before PageTail" adds a store
memory barrier to prep_compound_page() to ensure page->first_page is
set, but nothing is preventing compound_head() from seeing a dangling
head page.

This patch uses Andrea's implementation of compound_trans_head() that
deals with such a race and makes it the default compound_head()
implementation. This includes a read memory barrier that ensures that
if PageTail(head) is true that we return a head page that is neither
NULL nor dangling.

This is the safest way to ensure we see the head page that we are
expecting. PageTail(page) is already in the unlikely() path, and the
memory barriers are unfortunately required.

Hugetlbfs is the exception: we don't enforce a store memory barrier
during init since no race is possible.

Signed-off-by: David Rientjes <[email protected]>
---
Note: this is targeted for -mm because there is a prerequisite patch,
mm-page_alloc-make-first_page-visible-before-pagetail.patch, in that
tree which adds a smp_wmb() to prep_compound_page().

drivers/block/aoe/aoecmd.c | 4 ++--
drivers/vfio/vfio_iommu_type1.c | 4 ++--
fs/proc/page.c | 5 ++---
include/linux/huge_mm.h | 41 -----------------------------------------
include/linux/mm.h | 14 ++++++++++++--
mm/ksm.c | 2 +-
mm/memory-failure.c | 2 +-
mm/swap.c | 4 ++--
8 files changed, 22 insertions(+), 54 deletions(-)

diff --git a/drivers/block/aoe/aoecmd.c b/drivers/block/aoe/aoecmd.c
--- a/drivers/block/aoe/aoecmd.c
+++ b/drivers/block/aoe/aoecmd.c
@@ -874,7 +874,7 @@ bio_pageinc(struct bio *bio)
/* Non-zero page count for non-head members of
* compound pages is no longer allowed by the kernel.
*/
- page = compound_trans_head(bv.bv_page);
+ page = compound_head(bv.bv_page);
atomic_inc(&page->_count);
}
}
@@ -887,7 +887,7 @@ bio_pagedec(struct bio *bio)
struct bvec_iter iter;

bio_for_each_segment(bv, bio, iter) {
- page = compound_trans_head(bv.bv_page);
+ page = compound_head(bv.bv_page);
atomic_dec(&page->_count);
}
}
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -186,12 +186,12 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
if (pfn_valid(pfn)) {
bool reserved;
struct page *tail = pfn_to_page(pfn);
- struct page *head = compound_trans_head(tail);
+ struct page *head = compound_head(tail);
reserved = !!(PageReserved(head));
if (head != tail) {
/*
* "head" is not a dangling pointer
- * (compound_trans_head takes care of that)
+ * (compound_head takes care of that)
* but the hugepage may have been split
* from under us (and we may not hold a
* reference count on the head page so it can
diff --git a/fs/proc/page.c b/fs/proc/page.c
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -121,9 +121,8 @@ u64 stable_page_flags(struct page *page)
* just checks PG_head/PG_tail, so we need to check PageLRU/PageAnon
* to make sure a given page is a thp, not a non-huge compound page.
*/
- else if (PageTransCompound(page) &&
- (PageLRU(compound_trans_head(page)) ||
- PageAnon(compound_trans_head(page))))
+ else if (PageTransCompound(page) && (PageLRU(compound_head(page)) ||
+ PageAnon(compound_head(page))))
u |= 1 << KPF_THP;

/*
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -157,46 +157,6 @@ static inline int hpage_nr_pages(struct page *page)
return HPAGE_PMD_NR;
return 1;
}
-/*
- * compound_trans_head() should be used instead of compound_head(),
- * whenever the "page" passed as parameter could be the tail of a
- * transparent hugepage that could be undergoing a
- * __split_huge_page_refcount(). The page structure layout often
- * changes across releases and it makes extensive use of unions. So if
- * the page structure layout will change in a way that
- * page->first_page gets clobbered by __split_huge_page_refcount, the
- * implementation making use of smp_rmb() will be required.
- *
- * Currently we define compound_trans_head as compound_head, because
- * page->private is in the same union with page->first_page, and
- * page->private isn't clobbered. However this also means we're
- * currently leaving dirt into the page->private field of anonymous
- * pages resulting from a THP split, instead of setting page->private
- * to zero like for every other page that has PG_private not set. But
- * anonymous pages don't use page->private so this is not a problem.
- */
-#if 0
-/* This will be needed if page->private will be clobbered in split_huge_page */
-static inline struct page *compound_trans_head(struct page *page)
-{
- if (PageTail(page)) {
- struct page *head;
- head = page->first_page;
- smp_rmb();
- /*
- * head may be a dangling pointer.
- * __split_huge_page_refcount clears PageTail before
- * overwriting first_page, so if PageTail is still
- * there it means the head pointer isn't dangling.
- */
- if (PageTail(page))
- return head;
- }
- return page;
-}
-#else
-#define compound_trans_head(page) compound_head(page)
-#endif

extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd, pmd_t *pmdp);
@@ -226,7 +186,6 @@ static inline int split_huge_page(struct page *page)
do { } while (0)
#define split_huge_page_pmd_mm(__mm, __address, __pmd) \
do { } while (0)
-#define compound_trans_head(page) compound_head(page)
static inline int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -399,8 +399,18 @@ static inline void compound_unlock_irqrestore(struct page *page,

static inline struct page *compound_head(struct page *page)
{
- if (unlikely(PageTail(page)))
- return page->first_page;
+ if (unlikely(PageTail(page))) {
+ struct page *head = page->first_page;
+
+ /*
+ * page->first_page may be a dangling pointer to an old
+ * compound page, so recheck that it is still a tail
+ * page before returning.
+ */
+ smp_rmb();
+ if (likely(PageTail(page)))
+ return head;
+ }
return page;
}

diff --git a/mm/ksm.c b/mm/ksm.c
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -444,7 +444,7 @@ static void break_cow(struct rmap_item *rmap_item)
static struct page *page_trans_compound_anon(struct page *page)
{
if (PageTransCompound(page)) {
- struct page *head = compound_trans_head(page);
+ struct page *head = compound_head(page);
/*
* head may actually be splitted and freed from under
* us but it's ok here.
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1649,7 +1649,7 @@ int soft_offline_page(struct page *page, int flags)
{
int ret;
unsigned long pfn = page_to_pfn(page);
- struct page *hpage = compound_trans_head(page);
+ struct page *hpage = compound_head(page);

if (PageHWPoison(page)) {
pr_info("soft offline: %#lx page already poisoned\n", pfn);
diff --git a/mm/swap.c b/mm/swap.c
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -98,7 +98,7 @@ static void put_compound_page(struct page *page)
}

/* __split_huge_page_refcount can run under us */
- page_head = compound_trans_head(page);
+ page_head = compound_head(page);

/*
* THP can not break up slab pages so avoid taking
@@ -253,7 +253,7 @@ bool __get_page_tail(struct page *page)
*/
unsigned long flags;
bool got;
- struct page *page_head = compound_trans_head(page);
+ struct page *page_head = compound_head(page);

/* Ref to put_compound_page() comment. */
if (!__compound_tail_refcounted(page_head)) {