Hi Michal,
I'm seeing oopses in various locations when hotplugging memory in an x86
vm while running a 32-bit kernel. The config I'm using is attached. To
reproduce I'm using kvm with the memory options "-m
size=512M,slots=3,maxmem=2G". Then in the qemu monitor I run:
object_add memory-backend-ram,id=mem1,size=512M
device_add pc-dimm,id=dimm1,memdev=mem1
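For completeness, the full qemu invocation is roughly along these lines; apart
from the -m options, everything else (binary name, disk image, extra devices)
is just a placeholder for my actual setup:

qemu-system-i386 -enable-kvm -smp 2 \
    -m size=512M,slots=3,maxmem=2G \
    -hda disk.img -monitor stdio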
Not long after that I'll see an oops, not always in the same location
but most often in wp_page_copy, like this one:
[ 24.673623] BUG: unable to handle kernel paging request at dffff000
[ 24.675569] IP: wp_page_copy+0xa8/0x660
[ 24.676792] *pdpt = 0000000004d6a001 *pde = 0000000004e6d067
[ 24.676797] *pte = 0000000000000000
[ 24.678522]
[ 24.680066] Oops: 0002 [#1] SMP
[ 24.681037] Modules linked in: ppdev nls_utf8 isofs kvm_intel kvm irqbypass input_leds joydev parport_pc serio_raw i2c_piix4 mac_hid parport qemu_fw_cfg iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_rec
ov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear cirrus ttm drm_kms_helper psmouse syscopyarea sysfillrect virtio_blk sysimgblt fb_sys_fops drm virtio_net pata_acpi floppy
[ 24.688918] CPU: 1 PID: 819 Comm: sshd Tainted: G W 4.12.0+ #62
[ 24.690131] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 24.691656] task: dbbbcc00 task.stack: dbbea000
[ 24.692484] EIP: wp_page_copy+0xa8/0x660
[ 24.693166] EFLAGS: 00210282 CPU: 1
[ 24.693769] EAX: dffff000 EBX: d2214000 ECX: dffff000 EDX: 0000003e
[ 24.694838] ESI: d2214000 EDI: dffff004 EBP: dbbebe9c ESP: dbbebe60
[ 24.695908] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 24.696865] CR0: 80050033 CR2: dffff000 CR3: 1b985b80 CR4: 000006f0
[ 24.697945] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 24.699010] DR6: fffe0ff0 DR7: 00000400
[ 24.699670] Call Trace:
[ 24.700133] do_wp_page+0x83/0x4f0
[ 24.700762] ? kmap_atomic_prot+0x3c/0x100
[ 24.701421] handle_mm_fault+0x95c/0xe50
[ 24.702053] ? default_send_IPI_single+0x2c/0x30
[ 24.702788] ? resched_curr+0x51/0xc0
[ 24.703382] ? check_preempt_curr+0x75/0x80
[ 24.704081] __do_page_fault+0x209/0x500
[ 24.704732] ? kvm_async_pf_task_wake+0x100/0x100
[ 24.705491] trace_do_page_fault+0x3f/0xe0
[ 24.706151] ? kvm_async_pf_task_wake+0x100/0x100
[ 24.706902] do_async_page_fault+0x55/0x70
[ 24.707571] common_exception+0x6c/0x72
[ 24.708212] EIP: 0xb722676a
[ 24.708677] EFLAGS: 00210282 CPU: 1
[ 24.709235] EAX: bfe086e0 EBX: 01200011 ECX: 00000000 EDX: 00000000
[ 24.710222] ESI: 00000000 EDI: 00000426 EBP: bfe08728 ESP: bfe086e0
[ 24.711215] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
[ 24.712097] Code: 00 00 8b 4d e8 85 c9 0f 84 1e 05 00 00 8b 45 e8 e8 4e d1 ea ff 89 c3 8b 45 e0 89 de e8 42 d1 ea ff 8b 13 8d 78 04 89 c1 83 e7 fc <89> 10 8b 93 fc 0f 00 00 29 f9 29 ce 81 c1 00 10 00 00 c1 e9 02
[ 24.714927] EIP: wp_page_copy+0xa8/0x660 SS:ESP: 0068:dbbebe60
[ 24.715792] CR2: 00000000dffff000
I ran a bisect and landed on a commit of yours, f1dd2cd13c4b "mm,
memory_hotplug: do not associate hotadded memory to zones until online",
as the first commit with this issue.
Thanks,
Seth
Hi,
I am currently at a conference so I will most probably get to this next
week, but I will try to get to it ASAP.
On Tue 19-09-17 11:41:14, Seth Forshee wrote:
> Hi Michal,
>
> I'm seeing oopses in various locations when hotplugging memory in an x86
> vm while running a 32-bit kernel. The config I'm using is attached. To
> reproduce I'm using kvm with the memory options "-m
> size=512M,slots=3,maxmem=2G". Then in the qemu monitor I run:
>
> object_add memory-backend-ram,id=mem1,size=512M
> device_add pc-dimm,id=dimm1,memdev=mem1
>
> Not long after that I'll see an oops, not always in the same location
> but most often in wp_page_copy, like this one:
This is rather surprising. How do you online the memory?
> [ 24.673623] BUG: unable to handle kernel paging request at dffff000
> [ 24.675569] IP: wp_page_copy+0xa8/0x660
could you resolve the IP into the source line?
> [ 24.676792] *pdpt = 0000000004d6a001 *pde = 0000000004e6d067
> [ 24.676797] *pte = 0000000000000000
> [ 24.678522]
> [ 24.680066] Oops: 0002 [#1] SMP
> [ 24.681037] Modules linked in: ppdev nls_utf8 isofs kvm_intel kvm irqbypass input_leds joydev parport_pc serio_raw i2c_piix4 mac_hid parport qemu_fw_cfg iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_rec
> ov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear cirrus ttm drm_kms_helper psmouse syscopyarea sysfillrect virtio_blk sysimgblt fb_sys_fops drm virtio_net pata_acpi floppy
> [ 24.688918] CPU: 1 PID: 819 Comm: sshd Tainted: G W 4.12.0+ #62
> [ 24.690131] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> [ 24.691656] task: dbbbcc00 task.stack: dbbea000
> [ 24.692484] EIP: wp_page_copy+0xa8/0x660
> [ 24.693166] EFLAGS: 00210282 CPU: 1
> [ 24.693769] EAX: dffff000 EBX: d2214000 ECX: dffff000 EDX: 0000003e
> [ 24.694838] ESI: d2214000 EDI: dffff004 EBP: dbbebe9c ESP: dbbebe60
> [ 24.695908] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
> [ 24.696865] CR0: 80050033 CR2: dffff000 CR3: 1b985b80 CR4: 000006f0
> [ 24.697945] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
> [ 24.699010] DR6: fffe0ff0 DR7: 00000400
> [ 24.699670] Call Trace:
> [ 24.700133] do_wp_page+0x83/0x4f0
> [ 24.700762] ? kmap_atomic_prot+0x3c/0x100
> [ 24.701421] handle_mm_fault+0x95c/0xe50
> [ 24.702053] ? default_send_IPI_single+0x2c/0x30
> [ 24.702788] ? resched_curr+0x51/0xc0
> [ 24.703382] ? check_preempt_curr+0x75/0x80
> [ 24.704081] __do_page_fault+0x209/0x500
> [ 24.704732] ? kvm_async_pf_task_wake+0x100/0x100
> [ 24.705491] trace_do_page_fault+0x3f/0xe0
> [ 24.706151] ? kvm_async_pf_task_wake+0x100/0x100
> [ 24.706902] do_async_page_fault+0x55/0x70
> [ 24.707571] common_exception+0x6c/0x72
> [ 24.708212] EIP: 0xb722676a
> [ 24.708677] EFLAGS: 00210282 CPU: 1
> [ 24.709235] EAX: bfe086e0 EBX: 01200011 ECX: 00000000 EDX: 00000000
> [ 24.710222] ESI: 00000000 EDI: 00000426 EBP: bfe08728 ESP: bfe086e0
> [ 24.711215] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b
> [ 24.712097] Code: 00 00 8b 4d e8 85 c9 0f 84 1e 05 00 00 8b 45 e8 e8 4e d1 ea ff 89 c3 8b 45 e0 89 de e8 42 d1 ea ff 8b 13 8d 78 04 89 c1 83 e7 fc <89> 10 8b 93 fc 0f 00 00 29 f9 29 ce 81 c1 00 10 00 00 c1 e9 02
> [ 24.714927] EIP: wp_page_copy+0xa8/0x660 SS:ESP: 0068:dbbebe60
> [ 24.715792] CR2: 00000000dffff000
>
> I ran a bisect and landed on a commit of yours, f1dd2cd13c4b "mm,
> memory_hotplug: do not associate hotadded memory to zones until online",
> as the first commit with this issue.
--
Michal Hocko
SUSE Labs
On Wed, Sep 20, 2017 at 11:29:31AM +0200, Michal Hocko wrote:
> Hi,
> I am currently at a conference so I will most probably get to this next
> week, but I will try to get to it ASAP.
>
> On Tue 19-09-17 11:41:14, Seth Forshee wrote:
> > Hi Michal,
> >
> > I'm seeing oopses in various locations when hotplugging memory in an x86
> > vm while running a 32-bit kernel. The config I'm using is attached. To
> > reproduce I'm using kvm with the memory options "-m
> > size=512M,slots=3,maxmem=2G". Then in the qemu monitor I run:
> >
> > object_add memory-backend-ram,id=mem1,size=512M
> > device_add pc-dimm,id=dimm1,memdev=mem1
> >
> > Not long after that I'll see an oops, not always in the same location
> > but most often in wp_page_copy, like this one:
>
> This is rather surprising. How do you online the memory?
The kernel has CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y.
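So the new blocks come online automatically as soon as the DIMM is added;
nothing in userspace has to do the manual onlining that would otherwise be
needed, i.e. something like:

echo online > /sys/devices/system/memory/memory<N>/state

for each new memory block (<N> being the block number).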
> > [ 24.673623] BUG: unable to handle kernel paging request at dffff000
> > [ 24.675569] IP: wp_page_copy+0xa8/0x660
>
> could you resolve the IP into the source line?
It seems I don't have that kernel anymore, but I've got a 4.14-rc1 build
and the problem still occurs there. It's pointing to the call to
__builtin_memcpy in memcpy (include/linux/string.h line 340), which we
get to via wp_page_copy -> cow_user_page -> copy_user_highpage.
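(For reference, the symbol+offset can be resolved against a vmlinux built with
debug info using the in-tree helper, e.g.:

./scripts/faddr2line vmlinux wp_page_copy+0xa8/0x660

though any addr2line-based approach gets to the same place.)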
Thanks,
Seth
On Thu 21-09-17 00:40:34, Seth Forshee wrote:
> On Wed, Sep 20, 2017 at 11:29:31AM +0200, Michal Hocko wrote:
> > Hi,
> > I am currently at a conference so I will most probably get to this next
> > week, but I will try to get to it ASAP.
> >
> > On Tue 19-09-17 11:41:14, Seth Forshee wrote:
> > > Hi Michal,
> > >
> > > I'm seeing oopses in various locations when hotplugging memory in an x86
> > > vm while running a 32-bit kernel. The config I'm using is attached. To
> > > reproduce I'm using kvm with the memory options "-m
> > > size=512M,slots=3,maxmem=2G". Then in the qemu monitor I run:
> > >
> > > object_add memory-backend-ram,id=mem1,size=512M
> > > device_add pc-dimm,id=dimm1,memdev=mem1
> > >
> > > Not long after that I'll see an oops, not always in the same location
> > > but most often in wp_page_copy, like this one:
> >
> > This is rather surprising. How do you online the memory?
>
> The kernel has CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y.
OK, so the memory gets online automagically at the time when it is
hotadded. Could you send the full dmesg?
> > > [ 24.673623] BUG: unable to handle kernel paging request at dffff000
> > > [ 24.675569] IP: wp_page_copy+0xa8/0x660
> >
> > could you resolve the IP into the source line?
>
> It seems I don't have that kernel anymore, but I've got a 4.14-rc1 build
> and the problem still occurs there. It's pointing to the call to
> __builtin_memcpy in memcpy (include/linux/string.h line 340), which we
> get to via wp_page_copy -> cow_user_page -> copy_user_highpage.
Hmm, this is interesting. That would mean that we have successfully
mapped the destination page but its memory is still not accessible.
Right now I do not see how the patch you have bisected to could make any
difference because it only postponed the onlining to be independent but
your config simply onlines automatically so there shouldn't be any
semantic change. Maybe there is some sort of off-by-one or something.
I will try to investigate some more. Do you think it would be possible
to configure kdump on your system and provide me with the vmcore in some
way?
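A minimal kdump setup should be enough: boot the guest with something like

crashkernel=128M

appended to the kernel command line and enable your distro's kdump service;
the vmcore it writes out after the oops is what I am after (the reservation
size above is only a rough guess for a 512M guest).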
--
Michal Hocko
SUSE Labs
On Mon, Sep 25, 2017 at 02:58:25PM +0200, Michal Hocko wrote:
> On Thu 21-09-17 00:40:34, Seth Forshee wrote:
> > On Wed, Sep 20, 2017 at 11:29:31AM +0200, Michal Hocko wrote:
> > > Hi,
> > > I am currently at a conference so I will most probably get to this next
> > > week, but I will try to get to it ASAP.
> > >
> > > On Tue 19-09-17 11:41:14, Seth Forshee wrote:
> > > > Hi Michal,
> > > >
> > > > I'm seeing oopses in various locations when hotplugging memory in an x86
> > > > vm while running a 32-bit kernel. The config I'm using is attached. To
> > > > reproduce I'm using kvm with the memory options "-m
> > > > size=512M,slots=3,maxmem=2G". Then in the qemu monitor I run:
> > > >
> > > > object_add memory-backend-ram,id=mem1,size=512M
> > > > device_add pc-dimm,id=dimm1,memdev=mem1
> > > >
> > > > Not long after that I'll see an oops, not always in the same location
> > > > but most often in wp_page_copy, like this one:
> > >
> > > This is rather surprising. How do you online the memory?
> >
> > The kernel has CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y.
>
> OK, so the memory gets online automagically at the time when it is
> hotadded. Could you send the full dmesg?
>
> > > > [ 24.673623] BUG: unable to handle kernel paging request at dffff000
> > > > [ 24.675569] IP: wp_page_copy+0xa8/0x660
> > >
> > > could you resolve the IP into the source line?
> >
> > It seems I don't have that kernel anymore, but I've got a 4.14-rc1 build
> > and the problem still occurs there. It's pointing to the call to
> > __builtin_memcpy in memcpy (include/linux/string.h line 340), which we
> > get to via wp_page_copy -> cow_user_page -> copy_user_highpage.
>
> Hmm, this is interesting. That would mean that we have successfully
> mapped the destination page but its memory is still not accessible.
>
> Right now I do not see how the patch you have bisected to could make any
> difference because it only postponed the onlining to be independent but
> your config simply onlines automatically so there shouldn't be any
> semantic change. Maybe there is some sort of off-by-one or something.
>
> I will try to investigate some more. Do you think it would be possible
> to configure kdump on your system and provide me with the vmcore in some
> way?
Sorry, I got busy with other stuff and this kind of fell off my radar.
It came to my attention again recently though.
I was looking through the hotplug rework changes, and I noticed that
32-bit x86 previously was using ZONE_HIGHMEM as a default but after the
rework it doesn't look like it's possible for memory to be associated
with ZONE_HIGHMEM when onlining. So I made the change below against 4.14
and am now no longer seeing the oopses.
I'm sure this isn't the correct fix, but I think it does confirm that
the problem is that the memory should be associated with ZONE_HIGHMEM
but is not.
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index d4b5f29906b9..fddc134c5c3b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -833,6 +833,12 @@ void __ref move_pfn_range_to_zone(struct zone *zone,
 	set_zone_contiguous(zone);
 }
 
+#ifdef CONFIG_HIGHMEM
+static enum zone_type default_zone = ZONE_HIGHMEM;
+#else
+static enum zone_type default_zone = ZONE_NORMAL;
+#endif
+
 /*
  * Returns a default kernel memory zone for the given pfn range.
  * If no kernel zone covers this pfn range it will automatically go
@@ -844,14 +850,14 @@ static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn
 	struct pglist_data *pgdat = NODE_DATA(nid);
 	int zid;
 
-	for (zid = 0; zid <= ZONE_NORMAL; zid++) {
+	for (zid = 0; zid <= default_zone; zid++) {
 		struct zone *zone = &pgdat->node_zones[zid];
 
 		if (zone_intersects(zone, start_pfn, nr_pages))
 			return zone;
 	}
 
-	return &pgdat->node_zones[ZONE_NORMAL];
+	return &pgdat->node_zones[default_zone];
 }
 
 static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
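A quick way to see which zone the hotplugged blocks end up in is to look at
the per-zone accounting and the valid zones reported for the new blocks, for
example:

grep -E 'Node|present' /proc/zoneinfo
cat /sys/devices/system/memory/memory*/valid_zones

which should show whether the new pages are being counted against HighMem or
Normal (the grep is just one convenient way to slice /proc/zoneinfo).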
On Fri 01-12-17 08:23:27, Seth Forshee wrote:
> On Mon, Sep 25, 2017 at 02:58:25PM +0200, Michal Hocko wrote:
> > On Thu 21-09-17 00:40:34, Seth Forshee wrote:
[...]
> > > It seems I don't have that kernel anymore, but I've got a 4.14-rc1 build
> > > and the problem still occurs there. It's pointing to the call to
> > > __builtin_memcpy in memcpy (include/linux/string.h line 340), which we
> > > get to via wp_page_copy -> cow_user_page -> copy_user_highpage.
> >
> > Hmm, this is interesting. That would mean that we have successfully
> > mapped the destination page but its memory is still not accessible.
> >
> > Right now I do not see how the patch you have bisected to could make any
> > difference because it only postponed the onlining to be independent but
> > your config simply onlines automatically so there shouldn't be any
> > semantic change. Maybe there is some sort of off-by-one or something.
> >
> > I will try to investigate some more. Do you think it would be possible
> > to configure kdump on your system and provide me with the vmcore in some
> > way?
>
> Sorry, I got busy with other stuff and this kind of fell off my radar.
> It came to my attention again recently though.
Apologies on my side. This has completely fallen off my radar.
> I was looking through the hotplug rework changes, and I noticed that
> 32-bit x86 previously was using ZONE_HIGHMEM as a default but after the
> rework it doesn't look like it's possible for memory to be associated
> with ZONE_HIGHMEM when onlining. So I made the change below against 4.14
> and am now no longer seeing the oopses.
Thanks a lot for debugging! Do I read the above correctly that the
current code simply returns ZONE_NORMAL and maps an unrelated pfn into
this zone and that leads to later blowups? Could you attach the fresh
boot dmesg output please?
> I'm sure this isn't the correct fix, but I think it does confirm that
> the problem is that the memory should be associated with ZONE_HIGHMEM
> but is not.
Yes, the fix is not quite right. HIGHMEM is not a _kernel_ memory
zone. The kernel cannot access that memory directly. It is essentially a
movable zone from the hotplug API POV. We simply do not have any way to
tell into which zone we want to online this memory range.
Unfortunately both zones _can_ be present. It would require an explicit
configuration (movable_node and NUMA hotpluggable nodes running in 32b,
or movable memory configured explicitly on the kernel command line).
The below patch is not really complete but I would rather start simple.
Maybe we do not even have to care as most 32b users will never use both
zones at the same time. I've placed a warning to learn about those.
Does this pass your testing?
---
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 262bfd26baf9..18fec18bdb60 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -855,12 +855,29 @@ static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn
 	return &pgdat->node_zones[ZONE_NORMAL];
 }
 
+static struct zone *default_movable_zone_for_pfn(int nid)
+{
+	/*
+	 * Please note that 32b HIGHMEM systems might have 2 movable zones
+	 * actually so we have to check for both. This is rather ugly hack
+	 * to enforce using Highmem on those systems but we do not have a
+	 * good user API to tell into which movable zone we should online.
+	 * WARN if we have a movable zone which is not highmem.
+	 */
+#ifdef CONFIG_HIGHMEM
+	WARN_ON_ONCE(!zone_movable_is_highmem());
+	return &NODE_DATA(nid)->node_zones[ZONE_HIGHMEM];
+#else
+	return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
+#endif
+}
+
 static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
 		unsigned long nr_pages)
 {
 	struct zone *kernel_zone = default_kernel_zone_for_pfn(nid, start_pfn,
 			nr_pages);
-	struct zone *movable_zone = &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
+	struct zone *movable_zone = default_movable_zone_for_pfn(nid);
 	bool in_kernel = zone_intersects(kernel_zone, start_pfn, nr_pages);
 	bool in_movable = zone_intersects(movable_zone, start_pfn, nr_pages);
 
@@ -886,7 +903,7 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
 		return default_kernel_zone_for_pfn(nid, start_pfn, nr_pages);
 
 	if (online_type == MMOP_ONLINE_MOVABLE)
-		return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
+		return default_movable_zone_for_pfn(nid);
 
 	return default_zone_for_pfn(nid, start_pfn, nr_pages);
 }
--
Michal Hocko
SUSE Labs
On 12/18/2017 06:53 AM, Michal Hocko wrote:
> On Fri 01-12-17 08:23:27, Seth Forshee wrote:
>> On Mon, Sep 25, 2017 at 02:58:25PM +0200, Michal Hocko wrote:
>>> On Thu 21-09-17 00:40:34, Seth Forshee wrote:
> [...]
>>>> It seems I don't have that kernel anymore, but I've got a 4.14-rc1 build
>>>> and the problem still occurs there. It's pointing to the call to
>>>> __builtin_memcpy in memcpy (include/linux/string.h line 340), which we
>>>> get to via wp_page_copy -> cow_user_page -> copy_user_highpage.
>>>
>>> Hmm, this is interesting. That would mean that we have successfully
>>> mapped the destination page but its memory is still not accessible.
>>>
>>> Right now I do not see how the patch you have bisected to could make any
>>> difference because it only postponed the onlining to be independent but
>>> your config simply onlines automatically so there shouldn't be any
>>> semantic change. Maybe there is some sort of off-by-one or something.
>>>
>>> I will try to investigate some more. Do you think it would be possible
>>> to configure kdump on your system and provide me with the vmcore in some
>>> way?
>>
>> Sorry, I got busy with other stuff and this kind of fell off my radar.
>> It came to my attention again recently though.
>
> Apologies on my side. This has completely fallen off my radar.
>
>> I was looking through the hotplug rework changes, and I noticed that
>> 32-bit x86 previously was using ZONE_HIGHMEM as a default but after the
>> rework it doesn't look like it's possible for memory to be associated
>> with ZONE_HIGHMEM when onlining. So I made the change below against 4.14
>> and am now no longer seeing the oopses.
>
> Thanks a lot for debugging! Do I read the above correctly that the
> current code simply returns ZONE_NORMAL and maps an unrelated pfn into
> this zone and that leads to later blowups? Could you attach the fresh
> boot dmesg output please?
>
>> I'm sure this isn't the correct fix, but I think it does confirm that
>> the problem is that the memory should be associated with ZONE_HIGHMEM
>> but is not.
>
>
> Yes, the fix is not quite right. HIGHMEM is not a _kernel_ memory
> zone. The kernel cannot access that memory directly. It is essentially a
> movable zone from the hotplug API POV. We simply do not have any way to
> tell into which zone we want to online this memory range.
> Unfortunately both zones _can_ be present. It would require an explicit
> configuration (movable_node and NUMA hotpluggable nodes running in 32b,
> or movable memory configured explicitly on the kernel command line).
>
> The below patch is not really complete but I would rather start simple.
> Maybe we do not even have to care as most 32b users will never use both
> zones at the same time. I've placed a warning to learn about those.
>
> Does this pass your testing?
> ---
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 262bfd26baf9..18fec18bdb60 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -855,12 +855,29 @@ static struct zone *default_kernel_zone_for_pfn(int nid, unsigned long start_pfn
> return &pgdat->node_zones[ZONE_NORMAL];
> }
>
> +static struct zone *default_movable_zone_for_pfn(int nid)
> +{
> + /*
> + * Please note that 32b HIGHMEM systems might have 2 movable zones
Please spell out 32-bit. It took me a bit to realize what "32b" was.
ta.
> + * actually so we have to check for both. This is rather ugly hack
> + * to enforce using Highmem on those systems but we do not have a
> + * good user API to tell into which movable zone we should online.
> + * WARN if we have a movable zone which is not highmem.
> + */
> +#ifdef CONFIG_HIGHMEM
> + WARN_ON_ONCE(!zone_movable_is_highmem());
> + return &NODE_DATA(nid)->node_zones[ZONE_HIGHMEM];
> +#else
> + return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
> +#endif
> +}
> +
> static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
> unsigned long nr_pages)
> {
> struct zone *kernel_zone = default_kernel_zone_for_pfn(nid, start_pfn,
> nr_pages);
> - struct zone *movable_zone = &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
> + struct zone *movable_zone = default_movable_zone_for_pfn(nid);
> bool in_kernel = zone_intersects(kernel_zone, start_pfn, nr_pages);
> bool in_movable = zone_intersects(movable_zone, start_pfn, nr_pages);
>
> @@ -886,7 +903,7 @@ struct zone * zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
> return default_kernel_zone_for_pfn(nid, start_pfn, nr_pages);
>
> if (online_type == MMOP_ONLINE_MOVABLE)
> - return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
> + return default_movable_zone_for_pfn(nid);
>
> return default_zone_for_pfn(nid, start_pfn, nr_pages);
> }
>
--
~Randy
On Mon 18-12-17 15:53:20, Michal Hocko wrote:
> On Fri 01-12-17 08:23:27, Seth Forshee wrote:
> > On Mon, Sep 25, 2017 at 02:58:25PM +0200, Michal Hocko wrote:
> > > On Thu 21-09-17 00:40:34, Seth Forshee wrote:
> [...]
> > > > It seems I don't have that kernel anymore, but I've got a 4.14-rc1 build
> > > > and the problem still occurs there. It's pointing to the call to
> > > > __builtin_memcpy in memcpy (include/linux/string.h line 340), which we
> > > > get to via wp_page_copy -> cow_user_page -> copy_user_highpage.
> > >
> > > Hmm, this is interesting. That would mean that we have successfully
> > > mapped the destination page but its memory is still not accessible.
> > >
> > > Right now I do not see how the patch you have bisected to could make any
> > > difference because it only postponed the onlining to be independent but
> > > your config simply onlines automatically so there shouldn't be any
> > > semantic change. Maybe there is some sort of off-by-one or something.
> > >
> > > I will try to investigate some more. Do you think it would be possible
> > > to configure kdump on your system and provide me with the vmcore in some
> > > way?
> >
> > Sorry, I got busy with other stuff and this kind of fell off my radar.
> > It came to my attention again recently though.
>
> Apologies on my side. This has completely fallen off my radar.
>
> > I was looking through the hotplug rework changes, and I noticed that
> > 32-bit x86 previously was using ZONE_HIGHMEM as a default but after the
> > rework it doesn't look like it's possible for memory to be associated
> > with ZONE_HIGHMEM when onlining. So I made the change below against 4.14
> > and am now no longer seeing the oopses.
>
> Thanks a lot for debugging! Do I read the above correctly that the
> current code simply returns ZONE_NORMAL and maps an unrelated pfn into
> this zone and that leads to later blowups? Could you attach the fresh
> boot dmesg output please?
>
> > I'm sure this isn't the correct fix, but I think it does confirm that
> > the problem is that the memory should be associated with ZONE_HIGHMEM
> > but is not.
>
>
> Yes, the fix is not quite right. HIGHMEM is not a _kernel_ memory
> zone. The kernel cannot access that memory directly. It is essentially a
> movable zone from the hotplug API POV. We simply do not have any way to
> tell into which zone we want to online this memory range.
> Unfortunately both zones _can_ be present. It would require an explicit
> configuration (movable_node and NUMA hotpluggable nodes running in 32b,
> or movable memory configured explicitly on the kernel command line).
>
> The below patch is not really complete but I would rather start simple.
> Maybe we do not even have to care as most 32b users will never use both
> zones at the same time. I've placed a warning to learn about those.
>
> Does this pass your testing?
Any chance to test this?
> [...]
--
Michal Hocko
SUSE Labs
On Fri, Dec 22, 2017 at 03:49:25PM +0100, Michal Hocko wrote:
> On Mon 18-12-17 15:53:20, Michal Hocko wrote:
> > On Fri 01-12-17 08:23:27, Seth Forshee wrote:
> > > On Mon, Sep 25, 2017 at 02:58:25PM +0200, Michal Hocko wrote:
> > > > On Thu 21-09-17 00:40:34, Seth Forshee wrote:
> > [...]
> > > > > It seems I don't have that kernel anymore, but I've got a 4.14-rc1 build
> > > > > and the problem still occurs there. It's pointing to the call to
> > > > > __builtin_memcpy in memcpy (include/linux/string.h line 340), which we
> > > > > get to via wp_page_copy -> cow_user_page -> copy_user_highpage.
> > > >
> > > > Hmm, this is interesting. That would mean that we have successfully
> > > > mapped the destination page but its memory is still not accessible.
> > > >
> > > > Right now I do not see how the patch you have bisected to could make any
> > > > difference because it only postponed the onlining to be independent but
> > > > your config simply onlines automatically so there shouldn't be any
> > > > semantic change. Maybe there is some sort of off-by-one or something.
> > > >
> > > > I will try to investigate some more. Do you think it would be possible
> > > > to configure kdump on your system and provide me with the vmcore in some
> > > > way?
> > >
> > > Sorry, I got busy with other stuff and this kind of fell off my radar.
> > > It came to my attention again recently though.
> >
> > Apologies on my side. This has completely fallen off my radar.
> >
> > > I was looking through the hotplug rework changes, and I noticed that
> > > 32-bit x86 previously was using ZONE_HIGHMEM as a default but after the
> > > rework it doesn't look like it's possible for memory to be associated
> > > with ZONE_HIGHMEM when onlining. So I made the change below against 4.14
> > > and am now no longer seeing the oopses.
> >
> > Thanks a lot for debugging! Do I read the above correctly that the
> > current code simply returns ZONE_NORMAL and maps an unrelated pfn into
> > this zone and that leads to later blowups? Could you attach the fresh
> > boot dmesg output please?
> >
> > > I'm sure this isn't the correct fix, but I think it does confirm that
> > > the problem is that the memory should be associated with ZONE_HIGHMEM
> > > but is not.
> >
> >
> > Yes, the fix is not quite right. HIGHMEM is not a _kernel_ memory
> > zone. The kernel cannot access that memory directly. It is essentially a
> > movable zone from the hotplug API POV. We simply do not have any way to
> > tell into which zone we want to online this memory range.
> > Unfortunately both zones _can_ be present. It would require an explicit
> > configuration (movable_node and NUMA hotpluggable nodes running in 32b,
> > or movable memory configured explicitly on the kernel command line).
> >
> > The below patch is not really complete but I would rather start simple.
> > Maybe we do not even have to care as most 32b users will never use both
> > zones at the same time. I've placed a warning to learn about those.
> >
> > Does this pass your testing?
>
> Any chance to test this?
Yes, I should get to testing it soon. I'm working through a backlog of
things I need to get done and this just hasn't quite made it to the top.
> > [...]
On Fri, Dec 22, 2017 at 10:12:40AM -0600, Seth Forshee wrote:
> On Fri, Dec 22, 2017 at 03:49:25PM +0100, Michal Hocko wrote:
> > On Mon 18-12-17 15:53:20, Michal Hocko wrote:
> > > On Fri 01-12-17 08:23:27, Seth Forshee wrote:
> > > > On Mon, Sep 25, 2017 at 02:58:25PM +0200, Michal Hocko wrote:
> > > > > On Thu 21-09-17 00:40:34, Seth Forshee wrote:
> > > [...]
> > > > > > It seems I don't have that kernel anymore, but I've got a 4.14-rc1 build
> > > > > > and the problem still occurs there. It's pointing to the call to
> > > > > > __builtin_memcpy in memcpy (include/linux/string.h line 340), which we
> > > > > > get to via wp_page_copy -> cow_user_page -> copy_user_highpage.
> > > > >
> > > > > Hmm, this is interesting. That would mean that we have successfully
> > > > > mapped the destination page but its memory is still not accessible.
> > > > >
> > > > > Right now I do not see how the patch you have bisected to could make any
> > > > > difference because it only postponed the onlining to be independent but
> > > > > your config simply onlines automatically so there shouldn't be any
> > > > > semantic change. Maybe there is some sort of off-by-one or something.
> > > > >
> > > > > I will try to investigate some more. Do you think it would be possible
> > > > > to configure kdump on your system and provide me with the vmcore in some
> > > > > way?
> > > >
> > > > Sorry, I got busy with other stuff and this kind of fell off my radar.
> > > > It came to my attention again recently though.
> > >
> > > Apologies on my side. This has completely fallen off my radar.
> > >
> > > > I was looking through the hotplug rework changes, and I noticed that
> > > > 32-bit x86 previously was using ZONE_HIGHMEM as a default but after the
> > > > rework it doesn't look like it's possible for memory to be associated
> > > > with ZONE_HIGHMEM when onlining. So I made the change below against 4.14
> > > > and am now no longer seeing the oopses.
> > >
> > > Thanks a lot for debugging! Do I read the above correctly that the
> > > current code simply returns ZONE_NORMAL and maps an unrelated pfn into
> > > this zone and that leads to later blowups? Could you attach the fresh
> > > boot dmesg output please?
> > >
> > > > I'm sure this isn't the correct fix, but I think it does confirm that
> > > > the problem is that the memory should be associated with ZONE_HIGHMEM
> > > > but is not.
> > >
> > >
> > > Yes, the fix is not quite right. HIGHMEM is not a _kernel_ memory
> > > zone. The kernel cannot access that memory directly. It is essentially a
> > > movable zone from the hotplug API POV. We simply do not have any way to
> > > tell into which zone we want to online this memory range.
> > > Unfortunately both zones _can_ be present. It would require an explicit
> > > configuration (movable_node and NUMA hotpluggable nodes running in 32b,
> > > or movable memory configured explicitly on the kernel command line).
> > >
> > > The below patch is not really complete but I would rather start simple.
> > > Maybe we do not even have to care as most 32b users will never use both
> > > zones at the same time. I've placed a warning to learn about those.
> > >
> > > Does this pass your testing?
> >
> > Any chance to test this?
>
> Yes, I should get to testing it soon. I'm working through a backlog of
> things I need to get done and this just hasn't quite made it to the top.
I started by testing vanilla 4.15-rc4 with a vm that has several memory
slots already populated at boot. With that I no longer get an oops;
however, while /sys/devices/system/memory/*/online is 1, it looks like the
memory isn't being used. With your patch the behavior is the same. I'm
attaching dmesg from both kernels.
Thanks,
Seth
> > > [...]
On Fri 22-12-17 12:45:15, Seth Forshee wrote:
> On Fri, Dec 22, 2017 at 10:12:40AM -0600, Seth Forshee wrote:
> > On Fri, Dec 22, 2017 at 03:49:25PM +0100, Michal Hocko wrote:
> > > On Mon 18-12-17 15:53:20, Michal Hocko wrote:
> > > > On Fri 01-12-17 08:23:27, Seth Forshee wrote:
> > > > > On Mon, Sep 25, 2017 at 02:58:25PM +0200, Michal Hocko wrote:
> > > > > > On Thu 21-09-17 00:40:34, Seth Forshee wrote:
> > > > [...]
> > > > > > > It seems I don't have that kernel anymore, but I've got a 4.14-rc1 build
> > > > > > > and the problem still occurs there. It's pointing to the call to
> > > > > > > __builtin_memcpy in memcpy (include/linux/string.h line 340), which we
> > > > > > > get to via wp_page_copy -> cow_user_page -> copy_user_highpage.
> > > > > >
> > > > > > Hmm, this is interesting. That would mean that we have successfully
> > > > > > mapped the destination page but its memory is still not accessible.
> > > > > >
> > > > > > Right now I do not see how the patch you have bisected to could make any
> > > > > > difference because it only postponed the onlining to be independent but
> > > > > > your config simply onlines automatically so there shouldn't be any
> > > > > > semantic change. Maybe there is some sort of off-by-one or something.
> > > > > >
> > > > > > I will try to investigate some more. Do you think it would be possible
> > > > > > to configure kdump on your system and provide me with the vmcore in some
> > > > > > way?
> > > > >
> > > > > Sorry, I got busy with other stuff and this kind of fell off my radar.
> > > > > It came to my attention again recently though.
> > > >
> > > > Apologies on my side. This has completely fallen off my radar.
> > > >
> > > > > I was looking through the hotplug rework changes, and I noticed that
> > > > > 32-bit x86 previously was using ZONE_HIGHMEM as a default but after the
> > > > > rework it doesn't look like it's possible for memory to be associated
> > > > > with ZONE_HIGHMEM when onlining. So I made the change below against 4.14
> > > > > and am now no longer seeing the oopses.
> > > >
> > > > Thanks a lot for debugging! Do I read the above correctly that the
> > > > current code simply returns ZONE_NORMAL and maps an unrelated pfn into
> > > > this zone and that leads to later blowups? Could you attach the fresh
> > > > boot dmesg output please?
> > > >
> > > > > I'm sure this isn't the correct fix, but I think it does confirm that
> > > > > the problem is that the memory should be associated with ZONE_HIGHMEM
> > > > > but is not.
> > > >
> > > >
> > > > Yes, the fix is not quite right. HIGHMEM is not a _kernel_ memory
> > > > zone. The kernel cannot access that memory directly. It is essentially a
> > > > movable zone from the hotplug API POV. We simply do not have any way to
> > > > tell into which zone we want to online this memory range.
> > > > Unfortunately both zones _can_ be present. It would require an explicit
> > > > configuration (movable_node and NUMA hotpluggable nodes running in 32b,
> > > > or movable memory configured explicitly on the kernel command line).
> > > >
> > > > The below patch is not really complete but I would rather start simple.
> > > > Maybe we do not even have to care as most 32b users will never use both
> > > > zones at the same time. I've placed a warning to learn about those.
> > > >
> > > > Does this pass your testing?
> > >
> > > Any chance to test this?
> >
> > Yes, I should get to testing it soon. I'm working through a backlog of
> > things I need to get done and this just hasn't quite made it to the top.
>
> I started by testing vanilla 4.15-rc4 with a vm that has several memory
> slots already populated at boot. With that I no longer get an oops;
> however, while /sys/devices/system/memory/*/online is 1, it looks like the
> memory isn't being used. With your patch the behavior is the same. I'm
> attaching dmesg from both kernels.
What do you mean? The overall available memory doesn't match the size of
all memblocks?
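In other words, does e.g.

grep MemTotal /proc/meminfo

add up to the boot memory plus all of the blocks that show up as online under
/sys/devices/system/memory/?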
--
Michal Hocko
SUSE Labs