2013-06-13 13:39:32

by Tang Chen

[permalink] [raw]
Subject: [Part3 PATCH v2 0/4] Support hot-remove local pagetable pages.

The following patch-set from Yinghai allocates pagetables to local nodes.
v1: https://lkml.org/lkml/2013/3/7/642
v2: https://lkml.org/lkml/2013/3/10/47
v3: https://lkml.org/lkml/2013/4/4/639
v4: https://lkml.org/lkml/2013/4/11/829

Since pagetable pages are used by the kernel, they cannot be offlined.
As a result, they cannot be hot-remove.

This patch fix this problem with the following solution:

1. Introduce a new bootmem type LOCAL_NODE_DATAL, and register local
pagetable pages as LOCAL_NODE_DATAL by setting page->lru.next to
LOCAL_NODE_DATAL, just like we register SECTION_INFO pages.

2. Skip LOCAL_NODE_DATAL pages in offline/online procedures. When the
whole memory block they reside in is offlined, the kernel can
still access the pagetables.
(This changes the semantics of offline/online a little bit.)

3. Do not free LOCAL_NODE_DATAL pages to buddy system because they
were skipped when in offline/online procedures. The memory block
they reside in could have been offlined.

Anyway, this problem should be fixed. Any better idea is welcome.

Change log:
v1 -> v2:
patch2: As suggested by Wu Jianguo, define a macro to check if a page
cantains local node data.
patch4: As suggested by Wu Jianguo, prevent freeing LOCAL_NODE_DATA
pages in free_pagetable() instead of in put_page_bootmem().

Tang Chen (4):
bootmem, mem-hotplug: Register local pagetable pages with
LOCAL_NODE_DATA when freeing bootmem.
mem-hotplug: Skip LOCAL_NODE_DATA pages in memory offline procedure.
mem-hotplug: Skip LOCAL_NODE_DATA pages in memory online procedure.
mem-hotplug: Do not free LOCAL_NODE_DATA pages to buddy system in
hot-remove procedure.

arch/x86/mm/init_64.c | 10 +++++++-
include/linux/memblock.h | 22 +++++++++++++++++
include/linux/memory_hotplug.h | 20 +++++++++++++--
mm/memblock.c | 52 ++++++++++++++++++++++++++++++++++++++++
mm/memory_hotplug.c | 31 +++++++++++++++++++++++
mm/page_alloc.c | 15 ++++++++++-
mm/page_isolation.c | 5 ++++
7 files changed, 149 insertions(+), 6 deletions(-)


2013-06-13 13:27:58

by Tang Chen

[permalink] [raw]
Subject: [Part3 PATCH v2 3/4] mem-hotplug: Skip LOCAL_NODE_DATA pages in memory online procedure.

Pages marked as LOCAL_NODE_DATA are skipped when we do memory offline.
So we have to skip them again when we do memory online.

Signed-off-by: Tang Chen <[email protected]>
---
mm/memory_hotplug.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index c2017eb..3561048 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -843,6 +843,11 @@ static int online_pages_range(unsigned long start_pfn, unsigned long nr_pages,
if (PageReserved(pfn_to_page(start_pfn)))
for (i = 0; i < nr_pages; i++) {
page = pfn_to_page(start_pfn + i);
+
+ /* Skip pages storing local node kernel data. */
+ if (is_local_node_data(page))
+ continue;
+
(*online_page_callback)(page);
onlined_pages++;
}
--
1.7.1

2013-06-13 13:34:57

by Tang Chen

[permalink] [raw]
Subject: [Part3 PATCH v2 1/4] bootmem, mem-hotplug: Register local pagetable pages with LOCAL_NODE_DATA when freeing bootmem.

As Yinghai suggested, even if a node is movable node, which has only
ZONE_MOVABLE, pagetables should be put in the local node.

In memory hot-remove logic, it offlines all pages first, and then
removes pagetables. But the local pagetable pages cannot be offlined
because they are used by kernel.

So we should skip this kind of pages in offline procedure. But first
of all, we need to mark them.

This patch marks local node data pages in the same way as we mark the
SECTION_INFO and MIX_SECTION_INFO data pages. We introduce a new type
of bootmem: LOCAL_NODE_DATA. And use page->lru.next to mark this type
of memory.

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/init_64.c | 2 +
include/linux/memblock.h | 22 +++++++++++++++++
include/linux/memory_hotplug.h | 13 ++++++++-
mm/memblock.c | 52 ++++++++++++++++++++++++++++++++++++++++
mm/memory_hotplug.c | 26 ++++++++++++++++++++
5 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index bb00c46..25de304 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1053,6 +1053,8 @@ static void __init register_page_bootmem_info(void)

for_each_online_node(i)
register_page_bootmem_info_node(NODE_DATA(i));
+
+ register_page_bootmem_local_node();
#endif
}

diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index a85ced9..8a38eef 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -131,6 +131,28 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
i != (u64)ULLONG_MAX; \
__next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid))

+void __next_local_node_mem_range(int *idx, int nid, phys_addr_t *out_start,
+ phys_addr_t *out_end, int *out_nid);
+
+/**
+ * for_each_local_node_mem_range - iterate memblock areas storing local node
+ * data
+ * @i: int used as loop variable
+ * @nid: node selector, %MAX_NUMNODES for all nodes
+ * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ * @p_nid: ptr to int for nid of the range, can be %NULL
+ *
+ * Walks over memblock areas storing local node data. Since all the local node
+ * areas will be reserved by memblock, this iterator will only iterate
+ * memblock.reserve. Available as soon as memblock is initialized.
+ */
+#define for_each_local_node_mem_range(i, nid, p_start, p_end, p_nid) \
+ for (i = -1, \
+ __next_local_node_mem_range(&i, nid, p_start, p_end, p_nid); \
+ i != -1; \
+ __next_local_node_mem_range(&i, nid, p_start, p_end, p_nid))
+
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 0b21e54..c0c4107 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -16,14 +16,19 @@ struct memory_block;

/*
* Types for free bootmem stored in page->lru.next. These have to be in
- * some random range in unsigned long space for debugging purposes.
+ * some random range in unsigned long space for debugging purposes except
+ * LOCAL_NODE_DATA.
+ *
+ * LOCAL_NODE_DATA is used to mark local node pages storing data to
+ * describe the memory of the node, such as local pagetable pages.
*/
enum {
MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE = 12,
SECTION_INFO = MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE,
MIX_SECTION_INFO,
NODE_INFO,
- MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = NODE_INFO,
+ LOCAL_NODE_DATA,
+ MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = LOCAL_NODE_DATA,
};

/* Types for control the zone type of onlined memory */
@@ -179,10 +184,14 @@ static inline void arch_refresh_nodedata(int nid, pg_data_t *pgdat)

#ifdef CONFIG_HAVE_BOOTMEM_INFO_NODE
extern void register_page_bootmem_info_node(struct pglist_data *pgdat);
+extern void register_page_bootmem_local_node(void);
#else
static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
{
}
+static inline void register_page_bootmem_local_node()
+{
+}
#endif
extern void put_page_bootmem(struct page *page);
extern void get_page_bootmem(unsigned long ingo, struct page *page,
diff --git a/mm/memblock.c b/mm/memblock.c
index e672b9e..420e7fb 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -631,6 +631,58 @@ bool __init_memblock memblock_is_hotpluggable(struct memblock_region *region)
return region->flags & MEMBLK_HOTPLUGGABLE;
}

+/*
+ * Common iterator to find next range with the same flags.
+ */
+static void __init_memblock __next_flag_mem_range(int *idx, int nid,
+ unsigned long flags,
+ phys_addr_t *out_start,
+ phys_addr_t *out_end, int *out_nid)
+{
+ struct memblock_type *rsv = &memblock.reserved;
+ struct memblock_region *r;
+
+ while (++*idx < rsv->cnt) {
+ r = &rsv->regions[*idx];
+
+ if (nid != MAX_NUMNODES &&
+ nid != memblock_get_region_node(r))
+ continue;
+
+ if (r->flags & flags)
+ break;
+ }
+
+ if (*idx >= rsv->cnt) {
+ *idx = -1;
+ return;
+ }
+
+ if (out_start)
+ *out_start = r->base;
+ if (out_end)
+ *out_end = r->base + r->size;
+ if (out_nid)
+ *out_nid = memblock_get_region_node(r);
+}
+
+/**
+ * __next_local_node_mem_range - next function for
+ * for_each_local_node_mem_range()
+ * @idx: pointer to int loop variable
+ * @nid: node selector, %MAX_NUMNODES for all nodes
+ * @out_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @out_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ * @out_nid: ptr to int for nid of the range, can be %NULL
+ */
+void __init_memblock __next_local_node_mem_range(int *idx, int nid,
+ phys_addr_t *out_start,
+ phys_addr_t *out_end, int *out_nid)
+{
+ __next_flag_mem_range(idx, nid, MEMBLK_LOCAL_NODE,
+ out_start, out_end, out_nid);
+}
+
/**
* __next_free_mem_range - next function for for_each_free_mem_range()
* @idx: pointer to u64 loop variable
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1ad92b4..c2017eb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -30,6 +30,7 @@
#include <linux/mm_inline.h>
#include <linux/firmware-map.h>
#include <linux/stop_machine.h>
+#include <linux/memblock.h>

#include <asm/tlbflush.h>

@@ -191,6 +192,31 @@ static void register_page_bootmem_info_section(unsigned long start_pfn)
}
#endif /* !CONFIG_SPARSEMEM_VMEMMAP */

+void __ref register_page_bootmem_local_node()
+{
+ int i, nid;
+ phys_addr_t start, end;
+ unsigned long start_pfn, end_pfn;
+ struct page *page;
+
+ for_each_local_node_mem_range(i, MAX_NUMNODES, &start, &end, &nid) {
+ start_pfn = PFN_DOWN(start);
+ end_pfn = PFN_UP(end);
+ page = pfn_to_page(start_pfn);
+
+ for ( ; start_pfn <= end_pfn; start_pfn++, page++) {
+ /*
+ * We need to set the whole page as LOCAL_NODE_DATA,
+ * so we get the upper end_pfn. But this upper end_pfn
+ * may not exist. So we have to check if the page
+ * present before we access its struct page.
+ */
+ if (pfn_present(start_pfn))
+ get_page_bootmem(nid, page, LOCAL_NODE_DATA);
+ }
+ }
+}
+
void register_page_bootmem_info_node(struct pglist_data *pgdat)
{
unsigned long i, pfn, end_pfn, nr_pages;
--
1.7.1

2013-06-13 13:37:47

by Tang Chen

[permalink] [raw]
Subject: [Part3 PATCH v2 2/4] mem-hotplug: Skip LOCAL_NODE_DATA pages in memory offline procedure.

In memory offline procedure, skip pages marked as LOCAL_NODE_DATA.
For now, this kind of pages are used to store local node pagetables.

The minimum unit of memory online/offline is a memory block. In a
block, the movable pages will be offlined as usual (unmapped and
isolated), and the pagetable pages will be skipped. After the iteration
of all page, the block will be set as offline, but the kernel can
still access the pagetable pages. This is user transparent.

v1 -> v2: As suggested by Wu Jianguo <[email protected]>, define a
macro to check if a page contains local node data.

Signed-off-by: Tang Chen <[email protected]>
---
include/linux/memory_hotplug.h | 7 ++++++-
mm/page_alloc.c | 15 +++++++++++++--
mm/page_isolation.c | 5 +++++
3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index c0c4107..05de193 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -5,8 +5,8 @@
#include <linux/spinlock.h>
#include <linux/notifier.h>
#include <linux/bug.h>
+#include <linux/mm_types.h>

-struct page;
struct zone;
struct pglist_data;
struct mem_section;
@@ -31,6 +31,11 @@ enum {
MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE = LOCAL_NODE_DATA,
};

+static inline bool is_local_node_data(struct page *page)
+{
+ return (unsigned long)page->lru.next == LOCAL_NODE_DATA;
+}
+
/* Types for control the zone type of onlined memory */
enum {
ONLINE_KEEP,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a5325b2..7cd8f13 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5772,6 +5772,11 @@ bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
continue;

page = pfn_to_page(check);
+
+ /* Skip pages storing local node kernel data. */
+ if (is_local_node_data(page))
+ continue;
+
/*
* We can't use page_count without pin a page
* because another CPU can free compound page.
@@ -6095,8 +6100,7 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
struct page *page;
struct zone *zone;
int order, i;
- unsigned long pfn;
- unsigned long flags;
+ unsigned long pfn, flags;
/* find the first valid pfn */
for (pfn = start_pfn; pfn < end_pfn; pfn++)
if (pfn_valid(pfn))
@@ -6112,6 +6116,13 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
continue;
}
page = pfn_to_page(pfn);
+
+ /* Skip pages storing local node kernel data. */
+ if (is_local_node_data(page)) {
+ pfn++;
+ continue;
+ }
+
/*
* The HWPoisoned page may be not in buddy system, and
* page_count() is not 0.
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 383bdbb..4cb0ccb 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -6,6 +6,7 @@
#include <linux/page-isolation.h>
#include <linux/pageblock-flags.h>
#include <linux/memory.h>
+#include <linux/memory_hotplug.h>
#include "internal.h"

int set_migratetype_isolate(struct page *page, bool skip_hwpoisoned_pages)
@@ -181,6 +182,7 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
continue;
}
page = pfn_to_page(pfn);
+
if (PageBuddy(page)) {
/*
* If race between isolatation and allocation happens,
@@ -208,6 +210,9 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
*/
pfn++;
continue;
+ } else if (is_local_node_data(page)) {
+ pfn++;
+ continue;
}
else
break;
--
1.7.1

2013-06-13 13:41:48

by Tang Chen

[permalink] [raw]
Subject: [Part3 PATCH v2 4/4] mem-hotplug: Do not free LOCAL_NODE_DATA pages to buddy system in hot-remove procedure.

In memory hot-remove procedure, we free pagetable pages to buddy system.
But for local pagetable pages, do not free them to buddy system because
they were skipped in offline procedure. The memory block they reside in
could have been offlined, and we won't offline it again.

v1 -> v2: Prevent freeing LOCAL_NODE_DATA pages in free_pagetable() instead
of in put_page_bootmem().

Suggested-by: Wu Jianguo <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/mm/init_64.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 25de304..ffaf24a 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -726,7 +726,13 @@ static void __meminit free_pagetable(struct page *page, int order)
if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
while (nr_pages--)
put_page_bootmem(page++);
- } else
+ } else if (!is_local_node_data(page))
+ /*
+ * Do not free pages with local node kernel data
+ * (for now, just local pagetables) to the buddy
+ * system because we skipped these pages when
+ * offlining the corresponding block.
+ */
__free_pages_bootmem(page, order);
} else
free_pages((unsigned long)page_address(page), order);
--
1.7.1

2013-06-13 17:17:08

by Dave Hansen

[permalink] [raw]
Subject: Re: [Part3 PATCH v2 2/4] mem-hotplug: Skip LOCAL_NODE_DATA pages in memory offline procedure.

On 06/13/2013 06:03 AM, Tang Chen wrote:
> +static inline bool is_local_node_data(struct page *page)
> +{
> + return (unsigned long)page->lru.next == LOCAL_NODE_DATA;
> +}

page->lru is already in a union. Could you please just add a new entry
to the union with a nice associated comment instead of reusing it this way?

2013-06-14 05:34:12

by Tang Chen

[permalink] [raw]
Subject: Re: [Part3 PATCH v2 1/4] bootmem, mem-hotplug: Register local pagetable pages with LOCAL_NODE_DATA when freeing bootmem.

Hi Michal,

Please see below.

On 06/13/2013 10:16 PM, Michal Nazarewicz wrote:
......
>> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
>> index a85ced9..8a38eef 100644
>> --- a/include/linux/memblock.h
>> +++ b/include/linux/memblock.h
>> @@ -131,6 +131,28 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
>> i != (u64)ULLONG_MAX; \
>> __next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid))
>>
>> +void __next_local_node_mem_range(int *idx, int nid, phys_addr_t *out_start,
>> + phys_addr_t *out_end, int *out_nid);
>
> Why not make it return int?

The same reason below.

>
>> +
>> +/**
>> + * for_each_local_node_mem_range - iterate memblock areas storing local node
>> + * data
>> + * @i: int used as loop variable
>> + * @nid: node selector, %MAX_NUMNODES for all nodes
>> + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
>> + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
>> + * @p_nid: ptr to int for nid of the range, can be %NULL
>> + *
>> + * Walks over memblock areas storing local node data. Since all the local node
>> + * areas will be reserved by memblock, this iterator will only iterate
>> + * memblock.reserve. Available as soon as memblock is initialized.
>> + */
>> +#define for_each_local_node_mem_range(i, nid, p_start, p_end, p_nid) \
>> + for (i = -1, \
>> + __next_local_node_mem_range(&i, nid, p_start, p_end, p_nid); \
>> + i != -1; \
>> + __next_local_node_mem_range(&i, nid, p_start, p_end, p_nid))
>> +
>
> If __next_local_node_mem_range() returned int, this would be easier:
>
> +#define for_each_local_node_mem_range(i, nid, p_start, p_end, p_nid) \
> + for (i = -1;
> + (i = __next_local_node_mem_range(i, nid, p_start, p_end, p_nid)) != -1; )

Yes, we can do it like this.

But I tried to do something similar to for_each_free_mem_range and
for_each_free_mem_range_reverse to keep the code coincident.

How do you think to change all this similar functions into your way ?

>
......
>> +void __init_memblock __next_local_node_mem_range(int *idx, int nid,
>> + phys_addr_t *out_start,
>> + phys_addr_t *out_end, int *out_nid)
>> +{
>> + __next_flag_mem_range(idx, nid, MEMBLK_LOCAL_NODE,
>> + out_start, out_end, out_nid);
>> +}
>
> static inline in a header file perhaps?

OK, will put it in a header file in the next version.

Thanks. :)

2013-06-14 05:42:18

by Tang Chen

[permalink] [raw]
Subject: Re: [Part3 PATCH v2 2/4] mem-hotplug: Skip LOCAL_NODE_DATA pages in memory offline procedure.

Hi Dave,

On 06/14/2013 01:17 AM, Dave Hansen wrote:
> On 06/13/2013 06:03 AM, Tang Chen wrote:
>> +static inline bool is_local_node_data(struct page *page)
>> +{
>> + return (unsigned long)page->lru.next == LOCAL_NODE_DATA;
>> +}
>
> page->lru is already in a union. Could you please just add a new entry
> to the union with a nice associated comment instead of reusing it this way?
>

You mean add a new entry to the union in page structure ?

Hum, seems a good idea. :)

And as you know, NODE_INFO, SECTION_INFO, ... , they all reuse page->lru.
So I need to modify the associated code too. This is easy to do, and I can
do it in the next version soon.

Thanks. :)

2013-06-17 02:00:52

by Jianguo Wu

[permalink] [raw]
Subject: Re: [Part3 PATCH v2 0/4] Support hot-remove local pagetable pages.

Hi Tang,

On 2013/6/13 21:03, Tang Chen wrote:

> The following patch-set from Yinghai allocates pagetables to local nodes.
> v1: https://lkml.org/lkml/2013/3/7/642
> v2: https://lkml.org/lkml/2013/3/10/47
> v3: https://lkml.org/lkml/2013/4/4/639
> v4: https://lkml.org/lkml/2013/4/11/829
>
> Since pagetable pages are used by the kernel, they cannot be offlined.
> As a result, they cannot be hot-remove.
>
> This patch fix this problem with the following solution:
>
> 1. Introduce a new bootmem type LOCAL_NODE_DATAL, and register local
> pagetable pages as LOCAL_NODE_DATAL by setting page->lru.next to
> LOCAL_NODE_DATAL, just like we register SECTION_INFO pages.
>
> 2. Skip LOCAL_NODE_DATAL pages in offline/online procedures. When the
> whole memory block they reside in is offlined, the kernel can
> still access the pagetables.
> (This changes the semantics of offline/online a little bit.)
>
> 3. Do not free LOCAL_NODE_DATAL pages to buddy system because they
> were skipped when in offline/online procedures. The memory block
> they reside in could have been offlined.
>

s/LOCAL_NODE_DATAL/LOCAL_NODE_DATA

> Anyway, this problem should be fixed. Any better idea is welcome.
>
> Change log:
> v1 -> v2:
> patch2: As suggested by Wu Jianguo, define a macro to check if a page
> cantains local node data.
> patch4: As suggested by Wu Jianguo, prevent freeing LOCAL_NODE_DATA
> pages in free_pagetable() instead of in put_page_bootmem().
>
> Tang Chen (4):
> bootmem, mem-hotplug: Register local pagetable pages with
> LOCAL_NODE_DATA when freeing bootmem.
> mem-hotplug: Skip LOCAL_NODE_DATA pages in memory offline procedure.
> mem-hotplug: Skip LOCAL_NODE_DATA pages in memory online procedure.
> mem-hotplug: Do not free LOCAL_NODE_DATA pages to buddy system in
> hot-remove procedure.
>
> arch/x86/mm/init_64.c | 10 +++++++-
> include/linux/memblock.h | 22 +++++++++++++++++
> include/linux/memory_hotplug.h | 20 +++++++++++++--
> mm/memblock.c | 52 ++++++++++++++++++++++++++++++++++++++++
> mm/memory_hotplug.c | 31 +++++++++++++++++++++++
> mm/page_alloc.c | 15 ++++++++++-
> mm/page_isolation.c | 5 ++++
> 7 files changed, 149 insertions(+), 6 deletions(-)
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>
> .
>


2013-06-17 02:13:48

by Tang Chen

[permalink] [raw]
Subject: Re: [Part3 PATCH v2 0/4] Support hot-remove local pagetable pages.

On 06/17/2013 09:58 AM, Jianguo Wu wrote:
> Hi Tang,
>
> On 2013/6/13 21:03, Tang Chen wrote:
>
>> The following patch-set from Yinghai allocates pagetables to local nodes.
>> v1: https://lkml.org/lkml/2013/3/7/642
>> v2: https://lkml.org/lkml/2013/3/10/47
>> v3: https://lkml.org/lkml/2013/4/4/639
>> v4: https://lkml.org/lkml/2013/4/11/829
>>
>> Since pagetable pages are used by the kernel, they cannot be offlined.
>> As a result, they cannot be hot-remove.
>>
>> This patch fix this problem with the following solution:
>>
>> 1. Introduce a new bootmem type LOCAL_NODE_DATAL, and register local
>> pagetable pages as LOCAL_NODE_DATAL by setting page->lru.next to
>> LOCAL_NODE_DATAL, just like we register SECTION_INFO pages.
>>
>> 2. Skip LOCAL_NODE_DATAL pages in offline/online procedures. When the
>> whole memory block they reside in is offlined, the kernel can
>> still access the pagetables.
>> (This changes the semantics of offline/online a little bit.)
>>
>> 3. Do not free LOCAL_NODE_DATAL pages to buddy system because they
>> were skipped when in offline/online procedures. The memory block
>> they reside in could have been offlined.
>>
>
> s/LOCAL_NODE_DATAL/LOCAL_NODE_DATA

Oh...thank you. Will send a new version soon. :)

>
>> Anyway, this problem should be fixed. Any better idea is welcome.
>>
>> Change log:
>> v1 -> v2:
>> patch2: As suggested by Wu Jianguo, define a macro to check if a page
>> cantains local node data.
>> patch4: As suggested by Wu Jianguo, prevent freeing LOCAL_NODE_DATA
>> pages in free_pagetable() instead of in put_page_bootmem().
>>
>> Tang Chen (4):
>> bootmem, mem-hotplug: Register local pagetable pages with
>> LOCAL_NODE_DATA when freeing bootmem.
>> mem-hotplug: Skip LOCAL_NODE_DATA pages in memory offline procedure.
>> mem-hotplug: Skip LOCAL_NODE_DATA pages in memory online procedure.
>> mem-hotplug: Do not free LOCAL_NODE_DATA pages to buddy system in
>> hot-remove procedure.
>>
>> arch/x86/mm/init_64.c | 10 +++++++-
>> include/linux/memblock.h | 22 +++++++++++++++++
>> include/linux/memory_hotplug.h | 20 +++++++++++++--
>> mm/memblock.c | 52 ++++++++++++++++++++++++++++++++++++++++
>> mm/memory_hotplug.c | 31 +++++++++++++++++++++++
>> mm/page_alloc.c | 15 ++++++++++-
>> mm/page_isolation.c | 5 ++++
>> 7 files changed, 149 insertions(+), 6 deletions(-)
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to [email protected]. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email:<a href=mailto:"[email protected]"> [email protected]</a>
>>
>> .
>>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2013-06-18 17:05:25

by Vasilis Liaskovitis

[permalink] [raw]
Subject: Re: [Part3 PATCH v2 0/4] Support hot-remove local pagetable pages.

Hi,

On Thu, Jun 13, 2013 at 09:03:52PM +0800, Tang Chen wrote:
> The following patch-set from Yinghai allocates pagetables to local nodes.
> v1: https://lkml.org/lkml/2013/3/7/642
> v2: https://lkml.org/lkml/2013/3/10/47
> v3: https://lkml.org/lkml/2013/4/4/639
> v4: https://lkml.org/lkml/2013/4/11/829
>
> Since pagetable pages are used by the kernel, they cannot be offlined.
> As a result, they cannot be hot-remove.
>
> This patch fix this problem with the following solution:
>
> 1. Introduce a new bootmem type LOCAL_NODE_DATAL, and register local
> pagetable pages as LOCAL_NODE_DATAL by setting page->lru.next to
> LOCAL_NODE_DATAL, just like we register SECTION_INFO pages.
>
> 2. Skip LOCAL_NODE_DATAL pages in offline/online procedures. When the
> whole memory block they reside in is offlined, the kernel can
> still access the pagetables.
> (This changes the semantics of offline/online a little bit.)

This could be a design problem of part3: if we allow local pagetable memory
to not be offlined but allow the offlining to return successfully, then
hot-remove is going to succeed. But the direct mapped pagetable pages are still
mapped in the kernel. The hot-removed memblocks will suddenly disappear (think
physical DIMMs getting disabled in real hardware, or in a VM case the
corresponding guest memory getting freed from the emulator e.g. qemu/kvm). The
system can crash as a result.

I think these local pagetables do need to be unmapped from kernel, offlined and
removed somehow - otherwise hot-remove should fail. Could they be migrated
alternatively e.g. to node 0 memory? But Iiuc direct mapped pages cannot be
migrated, correct?

What is the original reason for local node pagetable allocation with regards
to memory hotplug? I assume we want to have hotplugged nodes use only their local
memory, so that there are no inter-node memory dependencies for hot-add/remove.
Are there other reasons that I am missing?


I get a crash with the following:
- boot 2-node VM with 4G on node0 and 0G on node1
- guest kernel linux_next20130607 + part1 + part2 + part3 patchsets
- hotplugged 2 dimms on node1 (previosuly node 1 had no memory)
- reboot with movablecore=acpi, ZONE_MOVABLE is created for node1
- I verified that in this case LOCAL_NODE_DATA pages are allocated on node1
- remove a dimm on node1

[ 260.615677] Offlined Pages 32768
[ 260.619650] Offlined Pages 32768
[ 260.630514] Offlined Pages 32768
[ 260.634342] Offlined Pages 32768
[ 260.637989] Offlined Pages 32768
[ 260.642021] Offlined Pages 32768
[ 260.643852] Offlined Pages 32768
[ 260.645555] Offlined Pages 32768
[ 260.817960] general protection fault: 0000 [#1] SMP
[ 260.818384] Modules linked in: netconsole pci_hotplug lp loop kvm_amd kvm ppdev psmouse parport_pc i2c_piix4 parport i2c_core serio_raw evdev processor microcode button thermal_sys ext3 jbd mbcache sr_mod cdrom ata_generic virtio_blk virtio_net ata_piix libata scsi_mod virtio_pci virtio_ring virtio
[ 260.820449] CPU: 0 PID: 177 Comm: kworker/0:2 Not tainted 3.10.0-rc4-next-20130607-guest #1
[ 260.820449] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 260.820449] Workqueue: kacpi_hotplug acpi_os_execute_deferred
[ 260.820449] task: ffff880115c80080 ti: ffff880115564000 task.ti: ffff880115564000
[ 260.820449] RIP: 0010:[<ffffffff812b92b9>] [<ffffffff812b92b9>] memchr_inv+0x89/0x110
[ 260.820449] RSP: 0018:ffff880115565b90 EFLAGS: 00010206
[ 260.820449] RAX: 0000000008000000 RBX: ffff8801e0000000 RCX: 0000000000000000
[ 260.820449] RDX: 0000000040000000 RSI: 00000000000000fd RDI: 7fff8801c0000000
[ 260.820449] RBP: ffff8801e0000000 R08: 00000000000000fd R09: 0101010101010101
[ 260.820449] R10: 0000000000000000 R11: ffff880116e38e50 R12: ffff880220000000
[ 260.820449] R13: ffff8801c0000000 R14: 00003ffffffff000 R15: ffff880116e38e01
[ 260.820449] FS: 00007fa6d8de37c0(0000) GS:ffff88011f000000(0000) knlGS:0000000000000000
[ 260.820449] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 260.820449] CR2: 00007fa6d8dea000 CR3: 000000000180b000 CR4: 00000000000006f0
[ 260.820449] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 260.820449] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 260.820449] Stack:
[ 260.820449] ffffffff8151e3bc 0000160000000000 ffffc00000000fff ffff880001a5d038
[ 260.820449] ffffffff8150e39e ffff880116e38e50 0000000000000100 0000000000000000
[ 260.820449] ffff8801e0000000 ffff8801181fb060 ffffffff8150e424 ffff88021ffff000
[ 260.820449] Call Trace:
[ 260.820449] [<ffffffff8151e3bc>] ? remove_pagetable+0x20c/0x7f7
[ 260.820449] [<ffffffff8150e39e>] ? klist_put+0x4e/0xc0
[ 260.820449] [<ffffffff8150e424>] ? klist_iter_exit+0x14/0x20
[ 260.820449] [<ffffffff8150f385>] ? arch_remove_memory+0x75/0xd0
[ 260.820449] [<ffffffff81510e18>] ? remove_memory+0x68/0x80
[ 260.820449] [<ffffffff8132baa5>] ? acpi_memory_device_remove+0x90/0xea
[ 260.820449] [<ffffffff81308157>] ? acpi_bus_device_detach+0x37/0x5c
[ 260.820449] [<ffffffff813081b5>] ? acpi_bus_trim+0x39/0x6e
[ 260.820449] [<ffffffff81308c74>] ? acpi_scan_hot_remove+0x136/0x264
[ 260.820449] [<ffffffff81304b15>] ? acpi_evaluate_hotplug_ost+0x86/0x8b
[ 260.820449] [<ffffffff81308e3a>] ? acpi_bus_device_eject+0x98/0xcc
[ 260.820449] [<ffffffff813042ea>] ? acpi_os_execute_deferred+0x1f/0x2b
[ 260.820449] [<ffffffff8105a183>] ? process_one_work+0x153/0x480
[ 260.820449] [<ffffffff8105b2b6>] ? worker_thread+0x116/0x3d0
[ 260.820449] [<ffffffff8105b1a0>] ? manage_workers+0x2b0/0x2b0
[ 260.820449] [<ffffffff8105b1a0>] ? manage_workers+0x2b0/0x2b0
[ 260.820449] [<ffffffff81060b56>] ? kthread+0xc6/0xd0
[ 260.820449] [<ffffffff81060a90>] ? kthread_freezable_should_stop+0x60/0x60
[ 260.820449] [<ffffffff8152c6ec>] ? ret_from_fork+0x7c/0xb0
[ 260.820449] [<ffffffff81060a90>] ? kthread_freezable_should_stop+0x60/0x60
[ 260.820449] Code: 75 f0 45 89 c0 4c 01 c7 4c 29 c2 0f 1f 80 00 00 00 00 48 89 d0 48 c1 e8 03 85 c0 74 2a 44 0f b6 c6 49 b9 01 01 01 01 01 01 01 01 <48> 8b 0f 4d 0f af c1 4c 39 c1 74 08 eb 41 90 48 3b 0f 75 3b 48
[ 260.820449] RIP [<ffffffff812b92b9>] memchr_inv+0x89/0x110
[ 260.820449] RSP <ffff880115565b90>
[ 260.852908] ---[ end trace 5e9b131e0294cfb6 ]---

I haven't analyzed more yet.

Other times instead of a crash on hot-remove I get a "fork: cannot allocate memory"
always after hot-remove.

thanks,

- Vasilis

2013-06-19 00:00:15

by Toshi Kani

[permalink] [raw]
Subject: Re: [Part3 PATCH v2 0/4] Support hot-remove local pagetable pages.

On Tue, 2013-06-18 at 19:05 +0200, Vasilis Liaskovitis wrote:
> Hi,
>
> On Thu, Jun 13, 2013 at 09:03:52PM +0800, Tang Chen wrote:
> > The following patch-set from Yinghai allocates pagetables to local nodes.
> > v1: https://lkml.org/lkml/2013/3/7/642
> > v2: https://lkml.org/lkml/2013/3/10/47
> > v3: https://lkml.org/lkml/2013/4/4/639
> > v4: https://lkml.org/lkml/2013/4/11/829
> >
> > Since pagetable pages are used by the kernel, they cannot be offlined.
> > As a result, they cannot be hot-remove.
> >
> > This patch fix this problem with the following solution:
> >
> > 1. Introduce a new bootmem type LOCAL_NODE_DATAL, and register local
> > pagetable pages as LOCAL_NODE_DATAL by setting page->lru.next to
> > LOCAL_NODE_DATAL, just like we register SECTION_INFO pages.
> >
> > 2. Skip LOCAL_NODE_DATAL pages in offline/online procedures. When the
> > whole memory block they reside in is offlined, the kernel can
> > still access the pagetables.
> > (This changes the semantics of offline/online a little bit.)
>
> This could be a design problem of part3: if we allow local pagetable memory
> to not be offlined but allow the offlining to return successfully, then
> hot-remove is going to succeed. But the direct mapped pagetable pages are still
> mapped in the kernel. The hot-removed memblocks will suddenly disappear (think
> physical DIMMs getting disabled in real hardware, or in a VM case the
> corresponding guest memory getting freed from the emulator e.g. qemu/kvm). The
> system can crash as a result.
>
> I think these local pagetables do need to be unmapped from kernel, offlined and
> removed somehow - otherwise hot-remove should fail. Could they be migrated
> alternatively e.g. to node 0 memory? But Iiuc direct mapped pages cannot be
> migrated, correct?
>
> What is the original reason for local node pagetable allocation with regards
> to memory hotplug? I assume we want to have hotplugged nodes use only their local
> memory, so that there are no inter-node memory dependencies for hot-add/remove.
> Are there other reasons that I am missing?

I second Vasilis. The part1/2/3 series could be much simpler & less
riskier if we focus on the SRAT changes first, and make the local node
pagetable changes as a separate item. Is there particular reason why
they have to be done at a same time?

Thanks,
-Toshi

2013-06-19 02:59:37

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: Re: [Part3 PATCH v2 0/4] Support hot-remove local pagetable pages.

2013/06/19 8:59, Toshi Kani wrote:
> On Tue, 2013-06-18 at 19:05 +0200, Vasilis Liaskovitis wrote:
>> Hi,
>>
>> On Thu, Jun 13, 2013 at 09:03:52PM +0800, Tang Chen wrote:
>>> The following patch-set from Yinghai allocates pagetables to local nodes.
>>> v1: https://lkml.org/lkml/2013/3/7/642
>>> v2: https://lkml.org/lkml/2013/3/10/47
>>> v3: https://lkml.org/lkml/2013/4/4/639
>>> v4: https://lkml.org/lkml/2013/4/11/829
>>>
>>> Since pagetable pages are used by the kernel, they cannot be offlined.
>>> As a result, they cannot be hot-remove.
>>>
>>> This patch fix this problem with the following solution:
>>>
>>> 1. Introduce a new bootmem type LOCAL_NODE_DATAL, and register local
>>> pagetable pages as LOCAL_NODE_DATAL by setting page->lru.next to
>>> LOCAL_NODE_DATAL, just like we register SECTION_INFO pages.
>>>
>>> 2. Skip LOCAL_NODE_DATAL pages in offline/online procedures. When the
>>> whole memory block they reside in is offlined, the kernel can
>>> still access the pagetables.
>>> (This changes the semantics of offline/online a little bit.)
>>
>> This could be a design problem of part3: if we allow local pagetable memory
>> to not be offlined but allow the offlining to return successfully, then
>> hot-remove is going to succeed. But the direct mapped pagetable pages are still
>> mapped in the kernel. The hot-removed memblocks will suddenly disappear (think
>> physical DIMMs getting disabled in real hardware, or in a VM case the
>> corresponding guest memory getting freed from the emulator e.g. qemu/kvm). The
>> system can crash as a result.
>>
>> I think these local pagetables do need to be unmapped from kernel, offlined and
>> removed somehow - otherwise hot-remove should fail. Could they be migrated
>> alternatively e.g. to node 0 memory? But Iiuc direct mapped pages cannot be
>> migrated, correct?
>>
>> What is the original reason for local node pagetable allocation with regards
>> to memory hotplug? I assume we want to have hotplugged nodes use only their local
>> memory, so that there are no inter-node memory dependencies for hot-add/remove.
>> Are there other reasons that I am missing?
>
> I second Vasilis. The part1/2/3 series could be much simpler & less
> riskier if we focus on the SRAT changes first, and make the local node
> pagetable changes as a separate item. Is there particular reason why
> they have to be done at a same time?

If my understanding is correct:
Main purpose of Yinghai's work is to put pagetable on local node ram.
For this, he needs to know SRAT information before setting pagetable.
So part1 does them same time.

Thanks,
Yasuaki Ishimatsu

>
> Thanks,
> -Toshi
>
>

2013-06-19 07:26:14

by Tang Chen

[permalink] [raw]
Subject: Re: [Part3 PATCH v2 0/4] Support hot-remove local pagetable pages.

Hi Vasilis, Yinghai,

On 06/19/2013 01:05 AM, Vasilis Liaskovitis wrote:
......
>
> This could be a design problem of part3: if we allow local pagetable memory
> to not be offlined but allow the offlining to return successfully, then
> hot-remove is going to succeed. But the direct mapped pagetable pages are still
> mapped in the kernel. The hot-removed memblocks will suddenly disappear (think
> physical DIMMs getting disabled in real hardware, or in a VM case the
> corresponding guest memory getting freed from the emulator e.g. qemu/kvm). The
> system can crash as a result.
>

Yes. Since the pagetable pages is only allocated to local node, a node may
have more than one device, hot-remove only one memory device could be
problematic.

But I think it will work if we hot-remove a whole node. I should have
mentioned it. And sorry for the not fully test.

I think allocating pagetable pages to local device will resolve this
problem.
And need to restructure this patch-set.

> I think these local pagetables do need to be unmapped from kernel, offlined and
> removed somehow - otherwise hot-remove should fail. Could they be migrated
> alternatively e.g. to node 0 memory? But Iiuc direct mapped pages cannot be
> migrated, correct?

I think we have unmapped the local pagetables. in functions
free_pud/pmd/pte_table(), we cleared pud, pmd, and pte. We just didn't
free the pagetable pages to buddy.

But when we are not hot-removing the whole node, it is still problematic.
This is true, and it is my design problem.

>
> What is the original reason for local node pagetable allocation with regards
> to memory hotplug? I assume we want to have hotplugged nodes use only their local
> memory, so that there are no inter-node memory dependencies for hot-add/remove.
> Are there other reasons that I am missing?

I think the original reason to do local node pagetable is to improve
performance.
Using local pagetable, vmemmap and so on will be faster.

But actually I think there is no particular reason to implement memory
hot-remove
and local node pagetable at the same time. And before this patch-set, I
also
suggested once that implement memory hot-remove first, and then improve
it to
local pagetable. But Yinghai has done the local pagetable work in has
patches (part1).
And my work is based on his patches. So I just did it.

But obviously it is more complicated than I thought.

And now, it seems tj has some more thinking on part1.

So how about the following plan:
1. Implement arranging hotpluggable memory with SRAT first, without
local pagetable.
(The main work in part2. And of course, need some patches in part1.)
2. Do the local device pagetable work, not local node.
3. Improve memory hotplug to support local device pagetable.

I also want Yinghai's suggestion because local node pagetable is his idea.

>
>
> I get a crash with the following:

Will try to fix it soon.

Thanks. :)

> - boot 2-node VM with 4G on node0 and 0G on node1
> - guest kernel linux_next20130607 + part1 + part2 + part3 patchsets
> - hotplugged 2 dimms on node1 (previosuly node 1 had no memory)
> - reboot with movablecore=acpi, ZONE_MOVABLE is created for node1
> - I verified that in this case LOCAL_NODE_DATA pages are allocated on node1
> - remove a dimm on node1
>
> [ 260.615677] Offlined Pages 32768
> [ 260.619650] Offlined Pages 32768
> [ 260.630514] Offlined Pages 32768
> [ 260.634342] Offlined Pages 32768
> [ 260.637989] Offlined Pages 32768
> [ 260.642021] Offlined Pages 32768
> [ 260.643852] Offlined Pages 32768
> [ 260.645555] Offlined Pages 32768
> [ 260.817960] general protection fault: 0000 [#1] SMP
> [ 260.818384] Modules linked in: netconsole pci_hotplug lp loop kvm_amd kvm ppdev psmouse parport_pc i2c_piix4 parport i2c_core serio_raw evdev processor microcode button thermal_sys ext3 jbd mbcache sr_mod cdrom ata_generic virtio_blk virtio_net ata_piix libata scsi_mod virtio_pci virtio_ring virtio
> [ 260.820449] CPU: 0 PID: 177 Comm: kworker/0:2 Not tainted 3.10.0-rc4-next-20130607-guest #1
> [ 260.820449] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 260.820449] Workqueue: kacpi_hotplug acpi_os_execute_deferred
> [ 260.820449] task: ffff880115c80080 ti: ffff880115564000 task.ti: ffff880115564000
> [ 260.820449] RIP: 0010:[<ffffffff812b92b9>] [<ffffffff812b92b9>] memchr_inv+0x89/0x110
> [ 260.820449] RSP: 0018:ffff880115565b90 EFLAGS: 00010206
> [ 260.820449] RAX: 0000000008000000 RBX: ffff8801e0000000 RCX: 0000000000000000
> [ 260.820449] RDX: 0000000040000000 RSI: 00000000000000fd RDI: 7fff8801c0000000
> [ 260.820449] RBP: ffff8801e0000000 R08: 00000000000000fd R09: 0101010101010101
> [ 260.820449] R10: 0000000000000000 R11: ffff880116e38e50 R12: ffff880220000000
> [ 260.820449] R13: ffff8801c0000000 R14: 00003ffffffff000 R15: ffff880116e38e01
> [ 260.820449] FS: 00007fa6d8de37c0(0000) GS:ffff88011f000000(0000) knlGS:0000000000000000
> [ 260.820449] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 260.820449] CR2: 00007fa6d8dea000 CR3: 000000000180b000 CR4: 00000000000006f0
> [ 260.820449] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 260.820449] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 260.820449] Stack:
> [ 260.820449] ffffffff8151e3bc 0000160000000000 ffffc00000000fff ffff880001a5d038
> [ 260.820449] ffffffff8150e39e ffff880116e38e50 0000000000000100 0000000000000000
> [ 260.820449] ffff8801e0000000 ffff8801181fb060 ffffffff8150e424 ffff88021ffff000
> [ 260.820449] Call Trace:
> [ 260.820449] [<ffffffff8151e3bc>] ? remove_pagetable+0x20c/0x7f7
> [ 260.820449] [<ffffffff8150e39e>] ? klist_put+0x4e/0xc0
> [ 260.820449] [<ffffffff8150e424>] ? klist_iter_exit+0x14/0x20
> [ 260.820449] [<ffffffff8150f385>] ? arch_remove_memory+0x75/0xd0
> [ 260.820449] [<ffffffff81510e18>] ? remove_memory+0x68/0x80
> [ 260.820449] [<ffffffff8132baa5>] ? acpi_memory_device_remove+0x90/0xea
> [ 260.820449] [<ffffffff81308157>] ? acpi_bus_device_detach+0x37/0x5c
> [ 260.820449] [<ffffffff813081b5>] ? acpi_bus_trim+0x39/0x6e
> [ 260.820449] [<ffffffff81308c74>] ? acpi_scan_hot_remove+0x136/0x264
> [ 260.820449] [<ffffffff81304b15>] ? acpi_evaluate_hotplug_ost+0x86/0x8b
> [ 260.820449] [<ffffffff81308e3a>] ? acpi_bus_device_eject+0x98/0xcc
> [ 260.820449] [<ffffffff813042ea>] ? acpi_os_execute_deferred+0x1f/0x2b
> [ 260.820449] [<ffffffff8105a183>] ? process_one_work+0x153/0x480
> [ 260.820449] [<ffffffff8105b2b6>] ? worker_thread+0x116/0x3d0
> [ 260.820449] [<ffffffff8105b1a0>] ? manage_workers+0x2b0/0x2b0
> [ 260.820449] [<ffffffff8105b1a0>] ? manage_workers+0x2b0/0x2b0
> [ 260.820449] [<ffffffff81060b56>] ? kthread+0xc6/0xd0
> [ 260.820449] [<ffffffff81060a90>] ? kthread_freezable_should_stop+0x60/0x60
> [ 260.820449] [<ffffffff8152c6ec>] ? ret_from_fork+0x7c/0xb0
> [ 260.820449] [<ffffffff81060a90>] ? kthread_freezable_should_stop+0x60/0x60
> [ 260.820449] Code: 75 f0 45 89 c0 4c 01 c7 4c 29 c2 0f 1f 80 00 00 00 00 48 89 d0 48 c1 e8 03 85 c0 74 2a 44 0f b6 c6 49 b9 01 01 01 01 01 01 01 01<48> 8b 0f 4d 0f af c1 4c 39 c1 74 08 eb 41 90 48 3b 0f 75 3b 48
> [ 260.820449] RIP [<ffffffff812b92b9>] memchr_inv+0x89/0x110
> [ 260.820449] RSP<ffff880115565b90>
> [ 260.852908] ---[ end trace 5e9b131e0294cfb6 ]---
>
> I haven't analyzed more yet.
>
> Other times instead of a crash on hot-remove I get a "fork: cannot allocate memory"
> always after hot-remove.
>
> thanks,
>
> - Vasilis
>

2013-06-19 10:00:50

by Vasilis Liaskovitis

[permalink] [raw]
Subject: Re: [Part3 PATCH v2 0/4] Support hot-remove local pagetable pages.

Hi Tang,
On Wed, Jun 19, 2013 at 03:29:06PM +0800, Tang Chen wrote:
> Hi Vasilis, Yinghai,
>
> On 06/19/2013 01:05 AM, Vasilis Liaskovitis wrote:
> ......
> >
> >This could be a design problem of part3: if we allow local pagetable memory
> >to not be offlined but allow the offlining to return successfully, then
> >hot-remove is going to succeed. But the direct mapped pagetable pages are still
> >mapped in the kernel. The hot-removed memblocks will suddenly disappear (think
> >physical DIMMs getting disabled in real hardware, or in a VM case the
> >corresponding guest memory getting freed from the emulator e.g. qemu/kvm). The
> >system can crash as a result.
> >
>
> Yes. Since the pagetable pages is only allocated to local node, a node may
> have more than one device, hot-remove only one memory device could be
> problematic.
>
> But I think it will work if we hot-remove a whole node. I should have
> mentioned it. And sorry for the not fully test.

ok, the crash I saw was also for the partial node removal.

> I think allocating pagetable pages to local device will resolve this
> problem.

ok. Yes, you mentioned this approach before I think.

> And need to restructure this patch-set.
>
> >I think these local pagetables do need to be unmapped from kernel, offlined and
> >removed somehow - otherwise hot-remove should fail. Could they be migrated
> >alternatively e.g. to node 0 memory? But Iiuc direct mapped pages cannot be
> >migrated, correct?
>
> I think we have unmapped the local pagetables. in functions
> free_pud/pmd/pte_table(), we cleared pud, pmd, and pte. We just didn't
> free the pagetable pages to buddy.

ok, thanks for explaining.

>
> But when we are not hot-removing the whole node, it is still problematic.
> This is true, and it is my design problem.
>
> >
> >What is the original reason for local node pagetable allocation with regards
> >to memory hotplug? I assume we want to have hotplugged nodes use only their local
> >memory, so that there are no inter-node memory dependencies for hot-add/remove.
> >Are there other reasons that I am missing?
>
> I think the original reason to do local node pagetable is to improve
> performance.
> Using local pagetable, vmemmap and so on will be faster.
>
> But actually I think there is no particular reason to implement
> memory hot-remove
> and local node pagetable at the same time. And before this
> patch-set, I also
> suggested once that implement memory hot-remove first, and then
> improve it to
> local pagetable. But Yinghai has done the local pagetable work in
> has patches (part1).
> And my work is based on his patches. So I just did it.
>
> But obviously it is more complicated than I thought.
>
> And now, it seems tj has some more thinking on part1.
>
> So how about the following plan:
> 1. Implement arranging hotpluggable memory with SRAT first, without
> local pagetable.
> (The main work in part2. And of course, need some patches in part1.)

agreed (and yes, several patches from part1 will be needed to do the early srat
parsing here)

> 2. Do the local device pagetable work, not local node.
> 3. Improve memory hotplug to support local device pagetable.

ok, I 'll think about these as well, and help out.

thanks,

- Vasilis

2013-06-19 15:18:58

by Toshi Kani

[permalink] [raw]
Subject: Re: [Part3 PATCH v2 0/4] Support hot-remove local pagetable pages.

On Wed, 2013-06-19 at 15:29 +0800, Tang Chen wrote:
> Hi Vasilis, Yinghai,
>
> On 06/19/2013 01:05 AM, Vasilis Liaskovitis wrote:
> ......
> >
> > This could be a design problem of part3: if we allow local pagetable memory
> > to not be offlined but allow the offlining to return successfully, then
> > hot-remove is going to succeed. But the direct mapped pagetable pages are still
> > mapped in the kernel. The hot-removed memblocks will suddenly disappear (think
> > physical DIMMs getting disabled in real hardware, or in a VM case the
> > corresponding guest memory getting freed from the emulator e.g. qemu/kvm). The
> > system can crash as a result.
> >
>
> Yes. Since the pagetable pages is only allocated to local node, a node may
> have more than one device, hot-remove only one memory device could be
> problematic.
>
> But I think it will work if we hot-remove a whole node. I should have
> mentioned it. And sorry for the not fully test.
>
> I think allocating pagetable pages to local device will resolve this problem.
> And need to restructure this patch-set.
>
> > I think these local pagetables do need to be unmapped from kernel, offlined and
> > removed somehow - otherwise hot-remove should fail. Could they be migrated
> > alternatively e.g. to node 0 memory? But Iiuc direct mapped pages cannot be
> > migrated, correct?
>
> I think we have unmapped the local pagetables. in functions
> free_pud/pmd/pte_table(), we cleared pud, pmd, and pte. We just didn't
> free the pagetable pages to buddy.
>
> But when we are not hot-removing the whole node, it is still problematic.
> This is true, and it is my design problem.
>
> >
> > What is the original reason for local node pagetable allocation with regards
> > to memory hotplug? I assume we want to have hotplugged nodes use only their local
> > memory, so that there are no inter-node memory dependencies for hot-add/remove.
> > Are there other reasons that I am missing?
>
> I think the original reason to do local node pagetable is to improve
> performance.
> Using local pagetable, vmemmap and so on will be faster.
>
> But actually I think there is no particular reason to implement memory hot-remove
> and local node pagetable at the same time. And before this patch-set, I also
> suggested once that implement memory hot-remove first, and then improve it to
> local pagetable. But Yinghai has done the local pagetable work in has
> patches (part1).
> And my work is based on his patches. So I just did it.
>
> But obviously it is more complicated than I thought.
>
> And now, it seems tj has some more thinking on part1.
>
> So how about the following plan:
> 1. Implement arranging hotpluggable memory with SRAT first, without
> local pagetable.
> (The main work in part2. And of course, need some patches in part1.)
> 2. Do the local device pagetable work, not local node.
> 3. Improve memory hotplug to support local device pagetable.

Sounds like a good plan. Thanks Tang!
-Toshi

2013-06-19 15:33:15

by Toshi Kani

[permalink] [raw]
Subject: Re: [Part3 PATCH v2 0/4] Support hot-remove local pagetable pages.

On Wed, 2013-06-19 at 11:58 +0900, Yasuaki Ishimatsu wrote:
> 2013/06/19 8:59, Toshi Kani wrote:
> > On Tue, 2013-06-18 at 19:05 +0200, Vasilis Liaskovitis wrote:
> >> Hi,
> >>
> >> On Thu, Jun 13, 2013 at 09:03:52PM +0800, Tang Chen wrote:
> >>> The following patch-set from Yinghai allocates pagetables to local nodes.
> >>> v1: https://lkml.org/lkml/2013/3/7/642
> >>> v2: https://lkml.org/lkml/2013/3/10/47
> >>> v3: https://lkml.org/lkml/2013/4/4/639
> >>> v4: https://lkml.org/lkml/2013/4/11/829
> >>>
> >>> Since pagetable pages are used by the kernel, they cannot be offlined.
> >>> As a result, they cannot be hot-remove.
> >>>
> >>> This patch fix this problem with the following solution:
> >>>
> >>> 1. Introduce a new bootmem type LOCAL_NODE_DATAL, and register local
> >>> pagetable pages as LOCAL_NODE_DATAL by setting page->lru.next to
> >>> LOCAL_NODE_DATAL, just like we register SECTION_INFO pages.
> >>>
> >>> 2. Skip LOCAL_NODE_DATAL pages in offline/online procedures. When the
> >>> whole memory block they reside in is offlined, the kernel can
> >>> still access the pagetables.
> >>> (This changes the semantics of offline/online a little bit.)
> >>
> >> This could be a design problem of part3: if we allow local pagetable memory
> >> to not be offlined but allow the offlining to return successfully, then
> >> hot-remove is going to succeed. But the direct mapped pagetable pages are still
> >> mapped in the kernel. The hot-removed memblocks will suddenly disappear (think
> >> physical DIMMs getting disabled in real hardware, or in a VM case the
> >> corresponding guest memory getting freed from the emulator e.g. qemu/kvm). The
> >> system can crash as a result.
> >>
> >> I think these local pagetables do need to be unmapped from kernel, offlined and
> >> removed somehow - otherwise hot-remove should fail. Could they be migrated
> >> alternatively e.g. to node 0 memory? But Iiuc direct mapped pages cannot be
> >> migrated, correct?
> >>
> >> What is the original reason for local node pagetable allocation with regards
> >> to memory hotplug? I assume we want to have hotplugged nodes use only their local
> >> memory, so that there are no inter-node memory dependencies for hot-add/remove.
> >> Are there other reasons that I am missing?
> >
> > I second Vasilis. The part1/2/3 series could be much simpler & less
> > riskier if we focus on the SRAT changes first, and make the local node
> > pagetable changes as a separate item. Is there particular reason why
> > they have to be done at a same time?
>
> If my understanding is correct:
> Main purpose of Yinghai's work is to put pagetable on local node ram.
> For this, he needs to know SRAT information before setting pagetable.
> So part1 does them same time.

Thanks Yasuaki. I like Tang Chen's new plan, and I think it is going to
be easier to proceed in that way.
-Toshi