2012-10-05 02:20:56

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: [PATCH 0/10] memory-hotplug: hot-remove physical memory

The patch-set was divided from following thread's patch-set.

https://lkml.org/lkml/2012/9/5/201

If you want to know the reason, please read following thread.

https://lkml.org/lkml/2012/10/2/83

The patch-set has only the function of kernel core side for physical
memory hot remove. So if you use the patch, please apply following
patches.

- bug fix for memory hot remove
https://lkml.org/lkml/2012/9/27/39
https://lkml.org/lkml/2012/10/2/83
http://www.spinics.net/lists/linux-mm/msg42982.html

- acpi framework
https://lkml.org/lkml/2012/10/3/126
https://lkml.org/lkml/2012/10/3/641

The patches can free/remove the following things:

- /sys/firmware/memmap/X/{end, start, type} : [PATCH 2/10]
- mem_section and related sysfs files : [PATCH 3-4/10]
- memmap of sparse-vmemmap : [PATCH 5-7/10]
- page table of removed memory : [RFC PATCH 8/10]
- node and related sysfs files : [RFC PATCH 9-10/10]

* [PATCH 1/10] checks whether the memory can be removed or not.

If you find lack of function for physical memory hot-remove, please let me
know.

How to test this patchset?
1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
ACPI_HOTPLUG_MEMORY must be selected.
2. load the module acpi_memhotplug
3. hotplug the memory device(it depends on your hardware)
You will see the memory device under the directory /sys/bus/acpi/devices/.
Its name is PNP0C80:XX.
4. online/offline pages provided by this memory device
You can write online/offline to /sys/devices/system/memory/memoryX/state to
online/offline pages provided by this memory device
5. hotremove the memory device
You can hotremove the memory device by the hardware, or writing 1 to
/sys/bus/acpi/devices/PNP0C80:XX/eject.

Note: if the memory provided by the memory device is used by the kernel, it
can't be offlined. It is not a bug.

Known problems:
1. memory can't be offlined when CONFIG_MEMCG is selected.
For example: there is a memory device on node 1. The address range
is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
and memory11 under the directory /sys/devices/system/memory/.
If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
when we online pages. When we online memory8, the memory stored page cgroup
is not provided by this memory device. But when we online memory9, the memory
stored page cgroup may be provided by memory8. So we can't offline memory8
now. We should offline the memory in the reversed order.
When the memory device is hotremoved, we will auto offline memory provided
by this memory device. But we don't know which memory is onlined first, so
offlining memory may fail. In such case, you should offline the memory by
hand before hotremoving the memory device.
2. hotremoving memory device may cause kernel panicked
This bug will be fixed by Liu Jiang's patch:
https://lkml.org/lkml/2012/7/3/1


2012-10-05 02:26:13

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: [PATCH 1/10] memory-hotplug : check whether memory is offline or not when removing memory

When calling remove_memory(), the memory should be offline. If the function
is used to online memory, kernel panic may occur.

So the patch checks whether memory is offline or not.

CC: David Rientjes <[email protected]>
CC: Jiang Liu <[email protected]>
CC: Len Brown <[email protected]>
CC: Christoph Lameter <[email protected]>
Cc: Minchan Kim <[email protected]>
CC: Andrew Morton <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
Signed-off-by: Wen Congyang <[email protected]>
Signed-off-by: Yasuaki Ishimatsu <[email protected]>

---
drivers/base/memory.c | 39 +++++++++++++++++++++++++++++++++++++++
include/linux/memory.h | 5 +++++
mm/memory_hotplug.c | 17 +++++++++++++++--
3 files changed, 59 insertions(+), 2 deletions(-)

Index: linux-3.6/drivers/base/memory.c
===================================================================
--- linux-3.6.orig/drivers/base/memory.c 2012-10-04 14:22:57.000000000 +0900
+++ linux-3.6/drivers/base/memory.c 2012-10-04 14:45:46.653585860 +0900
@@ -70,6 +70,45 @@ void unregister_memory_isolate_notifier(
}
EXPORT_SYMBOL(unregister_memory_isolate_notifier);

+bool is_memblk_offline(unsigned long start, unsigned long size)
+{
+ struct memory_block *mem = NULL;
+ struct mem_section *section;
+ unsigned long start_pfn, end_pfn;
+ unsigned long pfn, section_nr;
+
+ start_pfn = PFN_DOWN(start);
+ end_pfn = PFN_UP(start + size);
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+ section_nr = pfn_to_section_nr(pfn);
+ if (!present_section_nr(section_nr))
+ continue;
+
+ section = __nr_to_section(section_nr);
+ /* same memblock? */
+ if (mem)
+ if ((section_nr >= mem->start_section_nr) &&
+ (section_nr <= mem->end_section_nr))
+ continue;
+
+ mem = find_memory_block_hinted(section, mem);
+ if (!mem)
+ continue;
+ if (mem->state == MEM_OFFLINE)
+ continue;
+
+ kobject_put(&mem->dev.kobj);
+ return false;
+ }
+
+ if (mem)
+ kobject_put(&mem->dev.kobj);
+
+ return true;
+}
+EXPORT_SYMBOL(is_memblk_offline);
+
/*
* register_memory - Setup a sysfs device for a memory block
*/
Index: linux-3.6/include/linux/memory.h
===================================================================
--- linux-3.6.orig/include/linux/memory.h 2012-10-02 18:00:22.000000000 +0900
+++ linux-3.6/include/linux/memory.h 2012-10-04 14:44:40.902581028 +0900
@@ -106,6 +106,10 @@ static inline int memory_isolate_notify(
{
return 0;
}
+static inline bool is_memblk_offline(unsigned long start, unsigned long size)
+{
+ return false;
+}
#else
extern int register_memory_notifier(struct notifier_block *nb);
extern void unregister_memory_notifier(struct notifier_block *nb);
@@ -120,6 +124,7 @@ extern int memory_isolate_notify(unsigne
extern struct memory_block *find_memory_block_hinted(struct mem_section *,
struct memory_block *);
extern struct memory_block *find_memory_block(struct mem_section *);
+extern bool is_memblk_offline(unsigned long start, unsigned long size);
#define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION<<PAGE_SHIFT)
enum mem_add_context { BOOT, HOTPLUG };
#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
Index: linux-3.6/mm/memory_hotplug.c
===================================================================
--- linux-3.6.orig/mm/memory_hotplug.c 2012-10-04 14:31:08.000000000 +0900
+++ linux-3.6/mm/memory_hotplug.c 2012-10-04 14:58:22.449687986 +0900
@@ -1045,8 +1045,21 @@ int offline_memory(u64 start, u64 size)

int remove_memory(int nid, u64 start, u64 size)
{
- /* It is not implemented yet*/
- return 0;
+ int ret = 0;
+ lock_memory_hotplug();
+ /*
+ * The memory might become online by other task, even if you offine it.
+ * So we check whether the memory has been onlined or not.
+ */
+ if (!is_memblk_offline(start, size)) {
+ pr_warn("memory removing [mem %#010llx-%#010llx] failed, "
+ "because the memmory range is online\n",
+ start, start + size);
+ ret = -EAGAIN;
+ }
+
+ unlock_memory_hotplug();
+ return ret;
}
EXPORT_SYMBOL_GPL(remove_memory);
#else

2012-10-05 02:27:36

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: [PATCH 2/10] memory-hotplug : remove /sys/firmware/memmap/X sysfs

When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start, type}
sysfs files are created. But there is no code to remove these files. The patch
implements the function to remove them.

Note : The code does not free firmware_map_entry since there is no way to free
memory which is allocated by bootmem.

CC: David Rientjes <[email protected]>
CC: Jiang Liu <[email protected]>
CC: Len Brown <[email protected]>
CC: Christoph Lameter <[email protected]>
Cc: Minchan Kim <[email protected]>
CC: Andrew Morton <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
Signed-off-by: Wen Congyang <[email protected]>
Signed-off-by: Yasuaki Ishimatsu <[email protected]>

---
drivers/firmware/memmap.c | 98 ++++++++++++++++++++++++++++++++++++++++++-
include/linux/firmware-map.h | 6 ++
mm/memory_hotplug.c | 7 ++-
3 files changed, 108 insertions(+), 3 deletions(-)

Index: linux-3.6/drivers/firmware/memmap.c
===================================================================
--- linux-3.6.orig/drivers/firmware/memmap.c 2012-10-04 18:27:05.195500420 +0900
+++ linux-3.6/drivers/firmware/memmap.c 2012-10-04 18:27:18.901514330 +0900
@@ -21,6 +21,7 @@
#include <linux/types.h>
#include <linux/bootmem.h>
#include <linux/slab.h>
+#include <linux/mm.h>

/*
* Data types ------------------------------------------------------------------
@@ -41,6 +42,7 @@ struct firmware_map_entry {
const char *type; /* type of the memory range */
struct list_head list; /* entry for the linked list */
struct kobject kobj; /* kobject for each entry */
+ unsigned int bootmem:1; /* allocated from bootmem */
};

/*
@@ -79,7 +81,26 @@ static const struct sysfs_ops memmap_att
.show = memmap_attr_show,
};

+
+static inline struct firmware_map_entry *
+to_memmap_entry(struct kobject *kobj)
+{
+ return container_of(kobj, struct firmware_map_entry, kobj);
+}
+
+static void release_firmware_map_entry(struct kobject *kobj)
+{
+ struct firmware_map_entry *entry = to_memmap_entry(kobj);
+
+ if (entry->bootmem)
+ /* There is no way to free memory allocated from bootmem */
+ return;
+
+ kfree(entry);
+}
+
static struct kobj_type memmap_ktype = {
+ .release = release_firmware_map_entry,
.sysfs_ops = &memmap_attr_ops,
.default_attrs = def_attrs,
};
@@ -94,6 +115,7 @@ static struct kobj_type memmap_ktype = {
* in firmware initialisation code in one single thread of execution.
*/
static LIST_HEAD(map_entries);
+static DEFINE_SPINLOCK(map_entries_lock);

/**
* firmware_map_add_entry() - Does the real work to add a firmware memmap entry.
@@ -118,11 +140,25 @@ static int firmware_map_add_entry(u64 st
INIT_LIST_HEAD(&entry->list);
kobject_init(&entry->kobj, &memmap_ktype);

+ spin_lock(&map_entries_lock);
list_add_tail(&entry->list, &map_entries);
+ spin_unlock(&map_entries_lock);

return 0;
}

+/**
+ * firmware_map_remove_entry() - Does the real work to remove a firmware
+ * memmap entry.
+ * @entry: removed entry.
+ **/
+static inline void firmware_map_remove_entry(struct firmware_map_entry *entry)
+{
+ spin_lock(&map_entries_lock);
+ list_del(&entry->list);
+ spin_unlock(&map_entries_lock);
+}
+
/*
* Add memmap entry on sysfs
*/
@@ -144,6 +180,35 @@ static int add_sysfs_fw_map_entry(struct
return 0;
}

+/*
+ * Remove memmap entry on sysfs
+ */
+static inline void remove_sysfs_fw_map_entry(struct firmware_map_entry *entry)
+{
+ kobject_put(&entry->kobj);
+}
+
+/*
+ * Search memmap entry
+ */
+
+static struct firmware_map_entry * __meminit
+firmware_map_find_entry(u64 start, u64 end, const char *type)
+{
+ struct firmware_map_entry *entry;
+
+ spin_lock(&map_entries_lock);
+ list_for_each_entry(entry, &map_entries, list)
+ if ((entry->start == start) && (entry->end == end) &&
+ (!strcmp(entry->type, type))) {
+ spin_unlock(&map_entries_lock);
+ return entry;
+ }
+
+ spin_unlock(&map_entries_lock);
+ return NULL;
+}
+
/**
* firmware_map_add_hotplug() - Adds a firmware mapping entry when we do
* memory hotplug.
@@ -193,9 +258,36 @@ int __init firmware_map_add_early(u64 st
if (WARN_ON(!entry))
return -ENOMEM;

+ entry->bootmem = 1;
return firmware_map_add_entry(start, end, type, entry);
}

+/**
+ * firmware_map_remove() - remove a firmware mapping entry
+ * @start: Start of the memory range.
+ * @end: End of the memory range.
+ * @type: Type of the memory range.
+ *
+ * removes a firmware mapping entry.
+ *
+ * Returns 0 on success, or -EINVAL if no entry.
+ **/
+int __meminit firmware_map_remove(u64 start, u64 end, const char *type)
+{
+ struct firmware_map_entry *entry;
+
+ entry = firmware_map_find_entry(start, end - 1, type);
+ if (!entry)
+ return -EINVAL;
+
+ firmware_map_remove_entry(entry);
+
+ /* remove the memmap entry */
+ remove_sysfs_fw_map_entry(entry);
+
+ return 0;
+}
+
/*
* Sysfs functions -------------------------------------------------------------
*/
@@ -217,8 +309,10 @@ static ssize_t type_show(struct firmware
return snprintf(buf, PAGE_SIZE, "%s\n", entry->type);
}

-#define to_memmap_attr(_attr) container_of(_attr, struct memmap_attribute, attr)
-#define to_memmap_entry(obj) container_of(obj, struct firmware_map_entry, kobj)
+static inline struct memmap_attribute *to_memmap_attr(struct attribute *attr)
+{
+ return container_of(attr, struct memmap_attribute, attr);
+}

static ssize_t memmap_attr_show(struct kobject *kobj,
struct attribute *attr, char *buf)
Index: linux-3.6/include/linux/firmware-map.h
===================================================================
--- linux-3.6.orig/include/linux/firmware-map.h 2012-10-04 18:27:05.197500422 +0900
+++ linux-3.6/include/linux/firmware-map.h 2012-10-04 18:27:18.904514333 +0900
@@ -25,6 +25,7 @@

int firmware_map_add_early(u64 start, u64 end, const char *type);
int firmware_map_add_hotplug(u64 start, u64 end, const char *type);
+int firmware_map_remove(u64 start, u64 end, const char *type);

#else /* CONFIG_FIRMWARE_MEMMAP */

@@ -38,6 +39,11 @@ static inline int firmware_map_add_hotpl
return 0;
}

+static inline int firmware_map_remove(u64 start, u64 end, const char *type)
+{
+ return 0;
+}
+
#endif /* CONFIG_FIRMWARE_MEMMAP */

#endif /* _LINUX_FIRMWARE_MAP_H */
Index: linux-3.6/mm/memory_hotplug.c
===================================================================
--- linux-3.6.orig/mm/memory_hotplug.c 2012-10-04 18:27:03.000000000 +0900
+++ linux-3.6/mm/memory_hotplug.c 2012-10-04 18:28:42.851599524 +0900
@@ -1043,7 +1043,7 @@ int offline_memory(u64 start, u64 size)
return 0;
}

-int remove_memory(int nid, u64 start, u64 size)
+int __ref remove_memory(int nid, u64 start, u64 size)
{
int ret = 0;
lock_memory_hotplug();
@@ -1056,8 +1056,13 @@ int remove_memory(int nid, u64 start, u6
"because the memmory range is online\n",
start, start + size);
ret = -EAGAIN;
+ goto out;
}

+ /* remove memmap entry */
+ firmware_map_remove(start, start + size, "System RAM");
+
+out:
unlock_memory_hotplug();
return ret;
}

2012-10-05 02:30:18

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: [PATCH 3/10] memory-hotplug : introduce new function arch_remove_memory() for removing page table depends on architecture

From: Wen Congyang <[email protected]>

For removing memory, we need to remove page table. But it depends
on architecture. So the patch introduce arch_remove_memory() for
removing page table. Now it only calls __remove_pages().

Note: __remove_pages() for some archtecuture is not implemented
(I don't know how to implement it for s390).

CC: David Rientjes <[email protected]>
CC: Jiang Liu <[email protected]>
CC: Len Brown <[email protected]>
CC: Benjamin Herrenschmidt <[email protected]>
CC: Paul Mackerras <[email protected]>
CC: Christoph Lameter <[email protected]>
Cc: Minchan Kim <[email protected]>
CC: Andrew Morton <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Wen Congyang <[email protected]>
---
arch/ia64/mm/init.c | 18 ++++++++++++++++++
arch/powerpc/mm/mem.c | 12 ++++++++++++
arch/s390/mm/init.c | 12 ++++++++++++
arch/sh/mm/init.c | 17 +++++++++++++++++
arch/tile/mm/init.c | 8 ++++++++
arch/x86/mm/init_32.c | 12 ++++++++++++
arch/x86/mm/init_64.c | 15 +++++++++++++++
include/linux/memory_hotplug.h | 1 +
mm/memory_hotplug.c | 1 +
9 files changed, 96 insertions(+)

Index: linux-3.6/arch/ia64/mm/init.c
===================================================================
--- linux-3.6.orig/arch/ia64/mm/init.c 2012-10-04 18:27:03.082498276 +0900
+++ linux-3.6/arch/ia64/mm/init.c 2012-10-04 18:28:50.087606867 +0900
@@ -688,6 +688,24 @@ int arch_add_memory(int nid, u64 start,

return ret;
}
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+ unsigned long start_pfn = start >> PAGE_SHIFT;
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+ struct zone *zone;
+ int ret;
+
+ zone = page_zone(pfn_to_page(start_pfn));
+ ret = __remove_pages(zone, start_pfn, nr_pages);
+ if (ret)
+ pr_warn("%s: Problem encountered in __remove_pages() as"
+ " ret=%d\n", __func__, ret);
+
+ return ret;
+}
+#endif
#endif

/*
Index: linux-3.6/arch/powerpc/mm/mem.c
===================================================================
--- linux-3.6.orig/arch/powerpc/mm/mem.c 2012-10-04 18:27:03.084498278 +0900
+++ linux-3.6/arch/powerpc/mm/mem.c 2012-10-04 18:28:50.094606874 +0900
@@ -133,6 +133,18 @@ int arch_add_memory(int nid, u64 start,

return __add_pages(nid, zone, start_pfn, nr_pages);
}
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+ unsigned long start_pfn = start >> PAGE_SHIFT;
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+ struct zone *zone;
+
+ zone = page_zone(pfn_to_page(start_pfn));
+ return __remove_pages(zone, start_pfn, nr_pages);
+}
+#endif
#endif /* CONFIG_MEMORY_HOTPLUG */

/*
Index: linux-3.6/arch/s390/mm/init.c
===================================================================
--- linux-3.6.orig/arch/s390/mm/init.c 2012-10-04 18:27:03.080498274 +0900
+++ linux-3.6/arch/s390/mm/init.c 2012-10-04 18:28:50.104606884 +0900
@@ -257,4 +257,16 @@ int arch_add_memory(int nid, u64 start,
vmem_remove_mapping(start, size);
return rc;
}
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+ /*
+ * There is no hardware or firmware interface which could trigger a
+ * hot memory remove on s390. So there is nothing that needs to be
+ * implemented.
+ */
+ return -EBUSY;
+}
+#endif
#endif /* CONFIG_MEMORY_HOTPLUG */
Index: linux-3.6/arch/sh/mm/init.c
===================================================================
--- linux-3.6.orig/arch/sh/mm/init.c 2012-10-04 18:27:03.091498285 +0900
+++ linux-3.6/arch/sh/mm/init.c 2012-10-04 18:28:50.116606897 +0900
@@ -558,4 +558,21 @@ int memory_add_physaddr_to_nid(u64 addr)
EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
#endif

+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+ unsigned long start_pfn = start >> PAGE_SHIFT;
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+ struct zone *zone;
+ int ret;
+
+ zone = page_zone(pfn_to_page(start_pfn));
+ ret = __remove_pages(zone, start_pfn, nr_pages);
+ if (unlikely(ret))
+ pr_warn("%s: Failed, __remove_pages() == %d\n", __func__,
+ ret);
+
+ return ret;
+}
+#endif
#endif /* CONFIG_MEMORY_HOTPLUG */
Index: linux-3.6/arch/tile/mm/init.c
===================================================================
--- linux-3.6.orig/arch/tile/mm/init.c 2012-10-04 18:27:03.078498272 +0900
+++ linux-3.6/arch/tile/mm/init.c 2012-10-04 18:28:50.122606903 +0900
@@ -935,6 +935,14 @@ int remove_memory(u64 start, u64 size)
{
return -EINVAL;
}
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+ /* TODO */
+ return -EBUSY;
+}
+#endif
#endif

struct kmem_cache *pgd_cache;
Index: linux-3.6/arch/x86/mm/init_32.c
===================================================================
--- linux-3.6.orig/arch/x86/mm/init_32.c 2012-10-04 18:27:03.089498283 +0900
+++ linux-3.6/arch/x86/mm/init_32.c 2012-10-04 18:28:50.128606909 +0900
@@ -842,6 +842,18 @@ int arch_add_memory(int nid, u64 start,

return __add_pages(nid, zone, start_pfn, nr_pages);
}
+
+#ifdef CONFIG_MEMORY_HOTREMOVE
+int arch_remove_memory(u64 start, u64 size)
+{
+ unsigned long start_pfn = start >> PAGE_SHIFT;
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+ struct zone *zone;
+
+ zone = page_zone(pfn_to_page(start_pfn));
+ return __remove_pages(zone, start_pfn, nr_pages);
+}
+#endif
#endif

/*
Index: linux-3.6/arch/x86/mm/init_64.c
===================================================================
--- linux-3.6.orig/arch/x86/mm/init_64.c 2012-10-04 18:27:03.086498280 +0900
+++ linux-3.6/arch/x86/mm/init_64.c 2012-10-04 18:28:50.132606913 +0900
@@ -675,6 +675,21 @@ int arch_add_memory(int nid, u64 start,
}
EXPORT_SYMBOL_GPL(arch_add_memory);

+#ifdef CONFIG_MEMORY_HOTREMOVE
+int __ref arch_remove_memory(u64 start, u64 size)
+{
+ unsigned long start_pfn = start >> PAGE_SHIFT;
+ unsigned long nr_pages = size >> PAGE_SHIFT;
+ struct zone *zone;
+ int ret;
+
+ zone = page_zone(pfn_to_page(start_pfn));
+ ret = __remove_pages(zone, start_pfn, nr_pages);
+ WARN_ON_ONCE(ret);
+
+ return ret;
+}
+#endif
#endif /* CONFIG_MEMORY_HOTPLUG */

static struct kcore_list kcore_vsyscall;
Index: linux-3.6/include/linux/memory_hotplug.h
===================================================================
--- linux-3.6.orig/include/linux/memory_hotplug.h 2012-10-04 18:27:03.094498288 +0900
+++ linux-3.6/include/linux/memory_hotplug.h 2012-10-04 18:28:50.137606918 +0900
@@ -85,6 +85,7 @@ extern void __online_page_free(struct pa

#ifdef CONFIG_MEMORY_HOTREMOVE
extern bool is_pageblock_removable_nolock(struct page *page);
+extern int arch_remove_memory(u64 start, u64 size);
#endif /* CONFIG_MEMORY_HOTREMOVE */

/* reasonably generic interface to expand the physical pages in a zone */
Index: linux-3.6/mm/memory_hotplug.c
===================================================================
--- linux-3.6.orig/mm/memory_hotplug.c 2012-10-04 18:28:42.851599524 +0900
+++ linux-3.6/mm/memory_hotplug.c 2012-10-04 18:29:50.577668254 +0900
@@ -1062,6 +1062,7 @@ int __ref remove_memory(int nid, u64 sta
/* remove memmap entry */
firmware_map_remove(start, start + size, "System RAM");

+ arch_remove_memory(start, size);
out:
unlock_memory_hotplug();
return ret;

2012-10-05 02:31:45

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: [PATCH 4/10] memory-hotplug : unregister memory section on SPARSEMEM_VMEMMAP

Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even if
we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.

So the patch add unregister_memory_section() into __remove_section().

CC: David Rientjes <[email protected]>
CC: Jiang Liu <[email protected]>
CC: Len Brown <[email protected]>
CC: Christoph Lameter <[email protected]>
Cc: Minchan Kim <[email protected]>
CC: Andrew Morton <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Wen Congyang <[email protected]>
Signed-off-by: Yasuaki Ishimatsu <[email protected]>
---
mm/memory_hotplug.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)

Index: linux-3.6/mm/memory_hotplug.c
===================================================================
--- linux-3.6.orig/mm/memory_hotplug.c 2012-10-04 18:29:50.577668254 +0900
+++ linux-3.6/mm/memory_hotplug.c 2012-10-04 18:29:58.284676075 +0900
@@ -279,11 +279,14 @@ static int __meminit __add_section(int n
#ifdef CONFIG_SPARSEMEM_VMEMMAP
static int __remove_section(struct zone *zone, struct mem_section *ms)
{
- /*
- * XXX: Freeing memmap with vmemmap is not implement yet.
- * This should be removed later.
- */
- return -EBUSY;
+ int ret = -EINVAL;
+
+ if (!valid_section(ms))
+ return ret;
+
+ ret = unregister_memory_section(ms);
+
+ return ret;
}
#else
static int __remove_section(struct zone *zone, struct mem_section *ms)

2012-10-05 02:32:55

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: [PATCH 5/10] memory-hotplug : memory-hotplug: check page type in get_page_bootmem

The function get_page_bootmem() may be called more than one time to the same
page. There is no need to set page's type, private if the function is not
the first time called to the page.

Note: the patch is just optimization and does not fix any problem.

CC: David Rientjes <[email protected]>
CC: Jiang Liu <[email protected]>
CC: Len Brown <[email protected]>
CC: Christoph Lameter <[email protected]>
Cc: Minchan Kim <[email protected]>
CC: Andrew Morton <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Wen Congyang <[email protected]>
Signed-off-by: Yasuaki Ishimatsu <[email protected]>
---
mm/memory_hotplug.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

Index: linux-3.6/mm/memory_hotplug.c
===================================================================
--- linux-3.6.orig/mm/memory_hotplug.c 2012-10-04 18:29:58.284676075 +0900
+++ linux-3.6/mm/memory_hotplug.c 2012-10-04 18:30:03.454680542 +0900
@@ -95,10 +95,17 @@ static void release_memory_resource(stru
static void get_page_bootmem(unsigned long info, struct page *page,
unsigned long type)
{
- page->lru.next = (struct list_head *) type;
- SetPagePrivate(page);
- set_page_private(page, info);
- atomic_inc(&page->_count);
+ unsigned long page_type;
+
+ page_type = (unsigned long)page->lru.next;
+ if (page_type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
+ page_type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE){
+ page->lru.next = (struct list_head *)type;
+ SetPagePrivate(page);
+ set_page_private(page, info);
+ atomic_inc(&page->_count);
+ } else
+ atomic_inc(&page->_count);
}

/* reference to __meminit __free_pages_bootmem is valid

2012-10-05 02:34:02

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: [PATCH 6/10] memory-hotplug : implement register_page_bootmem_info_section of sparse-vmemmap

For removing memmap region of sparse-vmemmap which is allocated bootmem,
memmap region of sparse-vmemmap needs to be registered by get_page_bootmem().
So the patch searches pages of virtual mapping and registers the pages by
get_page_bootmem().

Note: register_page_bootmem_memmap() is not implemented for ia64, ppc, s390,
and sparc.

CC: David Rientjes <[email protected]>
CC: Jiang Liu <[email protected]>
CC: Len Brown <[email protected]>
CC: Christoph Lameter <[email protected]>
Cc: Minchan Kim <[email protected]>
CC: Andrew Morton <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
Signed-off-by: Wen Congyang <[email protected]>
Signed-off-by: Yasuaki Ishimatsu <[email protected]>
---
arch/ia64/mm/discontig.c | 6 ++++
arch/powerpc/mm/init_64.c | 6 ++++
arch/s390/mm/vmem.c | 6 ++++
arch/sparc/mm/init_64.c | 6 ++++
arch/x86/mm/init_64.c | 52 +++++++++++++++++++++++++++++++++++++++++
include/linux/memory_hotplug.h | 11 +-------
include/linux/mm.h | 3 +-
mm/memory_hotplug.c | 37 ++++++++++++++++++++++++++---
8 files changed, 113 insertions(+), 14 deletions(-)

Index: linux-3.6/include/linux/memory_hotplug.h
===================================================================
--- linux-3.6.orig/include/linux/memory_hotplug.h 2012-10-04 17:15:03.029828127 +0900
+++ linux-3.6/include/linux/memory_hotplug.h 2012-10-04 17:15:59.010833688 +0900
@@ -163,17 +163,10 @@ static inline void arch_refresh_nodedata
#endif /* CONFIG_NUMA */
#endif /* CONFIG_HAVE_ARCH_NODEDATA_EXTENSION */

-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
-{
-}
-static inline void put_page_bootmem(struct page *page)
-{
-}
-#else
extern void register_page_bootmem_info_node(struct pglist_data *pgdat);
extern void put_page_bootmem(struct page *page);
-#endif
+extern void get_page_bootmem(unsigned long ingo, struct page *page,
+ unsigned long type);

/*
* Lock for memory hotplug guarantees 1) all callbacks for memory hotplug
Index: linux-3.6/mm/memory_hotplug.c
===================================================================
--- linux-3.6.orig/mm/memory_hotplug.c 2012-10-04 17:15:27.213831361 +0900
+++ linux-3.6/mm/memory_hotplug.c 2012-10-04 17:37:00.176401540 +0900
@@ -91,9 +91,8 @@ static void release_memory_resource(stru
}

#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
-#ifndef CONFIG_SPARSEMEM_VMEMMAP
-static void get_page_bootmem(unsigned long info, struct page *page,
- unsigned long type)
+void get_page_bootmem(unsigned long info, struct page *page,
+ unsigned long type)
{
unsigned long page_type;

@@ -127,6 +126,7 @@ void __ref put_page_bootmem(struct page

}

+#ifndef CONFIG_SPARSEMEM_VMEMMAP
static void register_page_bootmem_info_section(unsigned long start_pfn)
{
unsigned long *usemap, mapsize, section_nr, i;
@@ -160,6 +160,36 @@ static void register_page_bootmem_info_s
get_page_bootmem(section_nr, page, MIX_SECTION_INFO);

}
+#else
+static void register_page_bootmem_info_section(unsigned long start_pfn)
+{
+ unsigned long *usemap, mapsize, section_nr, i;
+ struct mem_section *ms;
+ struct page *page, *memmap;
+
+ if (!pfn_valid(start_pfn))
+ return;
+
+ section_nr = pfn_to_section_nr(start_pfn);
+ ms = __nr_to_section(section_nr);
+
+ memmap = sparse_decode_mem_map(ms->section_mem_map, section_nr);
+
+ page = virt_to_page(memmap);
+ mapsize = sizeof(struct page) * PAGES_PER_SECTION;
+ mapsize = PAGE_ALIGN(mapsize) >> PAGE_SHIFT;
+
+ register_page_bootmem_memmap(section_nr, memmap, PAGES_PER_SECTION);
+
+ usemap = __nr_to_section(section_nr)->pageblock_flags;
+ page = virt_to_page(usemap);
+
+ mapsize = PAGE_ALIGN(usemap_size()) >> PAGE_SHIFT;
+
+ for (i = 0; i < mapsize; i++, page++)
+ get_page_bootmem(section_nr, page, MIX_SECTION_INFO);
+}
+#endif

void register_page_bootmem_info_node(struct pglist_data *pgdat)
{
@@ -202,7 +232,6 @@ void register_page_bootmem_info_node(str
register_page_bootmem_info_section(pfn);
}
}
-#endif /* !CONFIG_SPARSEMEM_VMEMMAP */

static void grow_zone_span(struct zone *zone, unsigned long start_pfn,
unsigned long end_pfn)
Index: linux-3.6/arch/ia64/mm/discontig.c
===================================================================
--- linux-3.6.orig/arch/ia64/mm/discontig.c 2012-10-01 08:47:46.000000000 +0900
+++ linux-3.6/arch/ia64/mm/discontig.c 2012-10-04 17:15:59.209833459 +0900
@@ -822,4 +822,10 @@ int __meminit vmemmap_populate(struct pa
{
return vmemmap_populate_basepages(start_page, size, node);
}
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+ struct page *start_page, unsigned long size)
+{
+ /* TODO */
+}
#endif
Index: linux-3.6/arch/powerpc/mm/init_64.c
===================================================================
--- linux-3.6.orig/arch/powerpc/mm/init_64.c 2012-10-01 08:47:46.000000000 +0900
+++ linux-3.6/arch/powerpc/mm/init_64.c 2012-10-04 17:15:59.217833663 +0900
@@ -298,5 +298,11 @@ int __meminit vmemmap_populate(struct pa

return 0;
}
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+ struct page *start_page, unsigned long size)
+{
+ /* TODO */
+}
#endif /* CONFIG_SPARSEMEM_VMEMMAP */

Index: linux-3.6/arch/s390/mm/vmem.c
===================================================================
--- linux-3.6.orig/arch/s390/mm/vmem.c 2012-10-01 08:47:46.000000000 +0900
+++ linux-3.6/arch/s390/mm/vmem.c 2012-10-04 17:15:59.227833764 +0900
@@ -227,6 +227,12 @@ out:
return ret;
}

+void register_page_bootmem_memmap(unsigned long section_nr,
+ struct page *start_page, unsigned long size)
+{
+ /* TODO */
+}
+
/*
* Add memory segment to the segment list if it doesn't overlap with
* an already present segment.
Index: linux-3.6/arch/sparc/mm/init_64.c
===================================================================
--- linux-3.6.orig/arch/sparc/mm/init_64.c 2012-10-01 08:47:46.000000000 +0900
+++ linux-3.6/arch/sparc/mm/init_64.c 2012-10-04 17:15:59.232833747 +0900
@@ -2077,6 +2077,12 @@ void __meminit vmemmap_populate_print_la
node_start = 0;
}
}
+
+void register_page_bootmem_memmap(unsigned long section_nr,
+ struct page *start_page, unsigned long size)
+{
+ /* TODO */
+}
#endif /* CONFIG_SPARSEMEM_VMEMMAP */

static void prot_init_common(unsigned long page_none,
Index: linux-3.6/arch/x86/mm/init_64.c
===================================================================
--- linux-3.6.orig/arch/x86/mm/init_64.c 2012-10-04 17:15:03.021828121 +0900
+++ linux-3.6/arch/x86/mm/init_64.c 2012-10-04 17:15:59.240833769 +0900
@@ -993,6 +993,58 @@ vmemmap_populate(struct page *start_page
return 0;
}

+void register_page_bootmem_memmap(unsigned long section_nr,
+ struct page *start_page, unsigned long size)
+{
+ unsigned long addr = (unsigned long)start_page;
+ unsigned long end = (unsigned long)(start_page + size);
+ unsigned long next;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+
+ for (; addr < end; addr = next) {
+ pte_t *pte = NULL;
+
+ pgd = pgd_offset_k(addr);
+ if (pgd_none(*pgd)) {
+ next = (addr + PAGE_SIZE) & PAGE_MASK;
+ continue;
+ }
+ get_page_bootmem(section_nr, pgd_page(*pgd), MIX_SECTION_INFO);
+
+ pud = pud_offset(pgd, addr);
+ if (pud_none(*pud)) {
+ next = (addr + PAGE_SIZE) & PAGE_MASK;
+ continue;
+ }
+ get_page_bootmem(section_nr, pud_page(*pud), MIX_SECTION_INFO);
+
+ if (!cpu_has_pse) {
+ next = (addr + PAGE_SIZE) & PAGE_MASK;
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ continue;
+ get_page_bootmem(section_nr, pmd_page(*pmd),
+ MIX_SECTION_INFO);
+
+ pte = pte_offset_kernel(pmd, addr);
+ if (pte_none(*pte))
+ continue;
+ get_page_bootmem(section_nr, pte_page(*pte),
+ SECTION_INFO);
+ } else {
+ next = pmd_addr_end(addr, end);
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ continue;
+ get_page_bootmem(section_nr, pmd_page(*pmd),
+ SECTION_INFO);
+ }
+ }
+}
+
void __meminit vmemmap_populate_print_last(void)
{
if (p_start) {
Index: linux-3.6/include/linux/mm.h
===================================================================
--- linux-3.6.orig/include/linux/mm.h 2012-10-01 08:47:46.000000000 +0900
+++ linux-3.6/include/linux/mm.h 2012-10-04 17:15:59.246833767 +0900
@@ -1618,7 +1618,8 @@ int vmemmap_populate_basepages(struct pa
unsigned long pages, int node);
int vmemmap_populate(struct page *start_page, unsigned long pages, int node);
void vmemmap_populate_print_last(void);
-
+void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
+ unsigned long size);

enum mf_flags {
MF_COUNT_INCREASED = 1 << 0,

2012-10-05 02:35:25

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: [PATCH 7/10] memory-hotplug : remove memmap of sparse-vmemmap

All pages of virtual mapping in removed memory cannot be freed, since some pages
used as PGD/PUD includes not only removed memory but also other memory. So the
patch checks whether page can be freed or not.

How to check whether page can be freed or not?
1. When removing memory, the page structs of the revmoved memory are filled
with 0FD.
2. All page structs are filled with 0xFD on PT/PMD, PT/PMD can be cleared.
In this case, the page used as PT/PMD can be freed.

Applying patch, __remove_section() of CONFIG_SPARSEMEM_VMEMMAP is integrated
into one. So __remove_section() of CONFIG_SPARSEMEM_VMEMMAP is deleted.

Note: vmemmap_kfree() and vmemmap_free_bootmem() are not implemented for ia64,
ppc, s390, and sparc.

CC: David Rientjes <[email protected]>
CC: Jiang Liu <[email protected]>
CC: Len Brown <[email protected]>
CC: Christoph Lameter <[email protected]>
Cc: Minchan Kim <[email protected]>
CC: Andrew Morton <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Wen Congyang <[email protected]>
Signed-off-by: Yasuaki Ishimatsu <[email protected]>
---
arch/ia64/mm/discontig.c | 8 +++
arch/powerpc/mm/init_64.c | 8 +++
arch/s390/mm/vmem.c | 8 +++
arch/sparc/mm/init_64.c | 8 +++
arch/x86/mm/init_64.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/mm.h | 2
mm/memory_hotplug.c | 17 ------
mm/sparse.c | 5 +
8 files changed, 158 insertions(+), 17 deletions(-)

Index: linux-3.6/arch/ia64/mm/discontig.c
===================================================================
--- linux-3.6.orig/arch/ia64/mm/discontig.c 2012-10-04 18:30:15.475692638 +0900
+++ linux-3.6/arch/ia64/mm/discontig.c 2012-10-04 18:30:21.145698389 +0900
@@ -823,6 +823,14 @@ int __meminit vmemmap_populate(struct pa
return vmemmap_populate_basepages(start_page, size, node);
}

+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
+{
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+}
+
void register_page_bootmem_memmap(unsigned long section_nr,
struct page *start_page, unsigned long size)
{
Index: linux-3.6/arch/powerpc/mm/init_64.c
===================================================================
--- linux-3.6.orig/arch/powerpc/mm/init_64.c 2012-10-04 18:30:15.494692657 +0900
+++ linux-3.6/arch/powerpc/mm/init_64.c 2012-10-04 18:30:21.150698394 +0900
@@ -299,6 +299,14 @@ int __meminit vmemmap_populate(struct pa
return 0;
}

+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
+{
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+}
+
void register_page_bootmem_memmap(unsigned long section_nr,
struct page *start_page, unsigned long size)
{
Index: linux-3.6/arch/s390/mm/vmem.c
===================================================================
--- linux-3.6.orig/arch/s390/mm/vmem.c 2012-10-04 18:30:15.506692670 +0900
+++ linux-3.6/arch/s390/mm/vmem.c 2012-10-04 18:30:21.157698401 +0900
@@ -227,6 +227,14 @@ out:
return ret;
}

+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
+{
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+}
+
void register_page_bootmem_memmap(unsigned long section_nr,
struct page *start_page, unsigned long size)
{
Index: linux-3.6/arch/sparc/mm/init_64.c
===================================================================
--- linux-3.6.orig/arch/sparc/mm/init_64.c 2012-10-04 18:30:15.512692676 +0900
+++ linux-3.6/arch/sparc/mm/init_64.c 2012-10-04 18:30:21.163698408 +0900
@@ -2078,6 +2078,14 @@ void __meminit vmemmap_populate_print_la
}
}

+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
+{
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+}
+
void register_page_bootmem_memmap(unsigned long section_nr,
struct page *start_page, unsigned long size)
{
Index: linux-3.6/arch/x86/mm/init_64.c
===================================================================
--- linux-3.6.orig/arch/x86/mm/init_64.c 2012-10-04 18:30:15.517692681 +0900
+++ linux-3.6/arch/x86/mm/init_64.c 2012-10-04 18:30:21.171698416 +0900
@@ -993,6 +993,125 @@ vmemmap_populate(struct page *start_page
return 0;
}

+#define PAGE_INUSE 0xFD
+
+unsigned long find_and_clear_pte_page(unsigned long addr, unsigned long end,
+ struct page **pp, int *page_size)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;
+ void *page_addr;
+ unsigned long next;
+
+ *pp = NULL;
+
+ pgd = pgd_offset_k(addr);
+ if (pgd_none(*pgd))
+ return pgd_addr_end(addr, end);
+
+ pud = pud_offset(pgd, addr);
+ if (pud_none(*pud))
+ return pud_addr_end(addr, end);
+
+ if (!cpu_has_pse) {
+ next = (addr + PAGE_SIZE) & PAGE_MASK;
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ return next;
+
+ pte = pte_offset_kernel(pmd, addr);
+ if (pte_none(*pte))
+ return next;
+
+ *page_size = PAGE_SIZE;
+ *pp = pte_page(*pte);
+ } else {
+ next = pmd_addr_end(addr, end);
+
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd))
+ return next;
+
+ *page_size = PMD_SIZE;
+ *pp = pmd_page(*pmd);
+ }
+
+ /*
+ * Removed page structs are filled with 0xFD.
+ */
+ memset((void *)addr, PAGE_INUSE, next - addr);
+
+ page_addr = page_address(*pp);
+
+ /*
+ * Check the page is filled with 0xFD or not.
+ * memchr_inv() returns the address. In this case, we cannot
+ * clear PTE/PUD entry, since the page is used by other.
+ * So we cannot also free the page.
+ *
+ * memchr_inv() returns NULL. In this case, we can clear
+ * PTE/PUD entry, since the page is not used by other.
+ * So we can also free the page.
+ */
+ if (memchr_inv(page_addr, PAGE_INUSE, *page_size)) {
+ *pp = NULL;
+ return next;
+ }
+
+ if (!cpu_has_pse)
+ pte_clear(&init_mm, addr, pte);
+ else
+ pmd_clear(pmd);
+
+ return next;
+}
+
+void vmemmap_kfree(struct page *memmap, unsigned long nr_pages)
+{
+ unsigned long addr = (unsigned long)memmap;
+ unsigned long end = (unsigned long)(memmap + nr_pages);
+ unsigned long next;
+ struct page *page;
+ int page_size;
+
+ for (; addr < end; addr = next) {
+ page = NULL;
+ page_size = 0;
+ next = find_and_clear_pte_page(addr, end, &page, &page_size);
+ if (!page)
+ continue;
+
+ free_pages((unsigned long)page_address(page),
+ get_order(page_size));
+ __flush_tlb_one(addr);
+ }
+}
+
+void vmemmap_free_bootmem(struct page *memmap, unsigned long nr_pages)
+{
+ unsigned long addr = (unsigned long)memmap;
+ unsigned long end = (unsigned long)(memmap + nr_pages);
+ unsigned long next;
+ struct page *page;
+ int page_size;
+ unsigned long magic;
+
+ for (; addr < end; addr = next) {
+ page = NULL;
+ page_size = 0;
+ next = find_and_clear_pte_page(addr, end, &page, &page_size);
+ if (!page)
+ continue;
+
+ magic = (unsigned long) page->lru.next;
+ if (magic == SECTION_INFO)
+ put_page_bootmem(page);
+ flush_tlb_kernel_range(addr, end);
+ }
+}
+
void register_page_bootmem_memmap(unsigned long section_nr,
struct page *start_page, unsigned long size)
{
Index: linux-3.6/include/linux/mm.h
===================================================================
--- linux-3.6.orig/include/linux/mm.h 2012-10-04 18:30:15.524692688 +0900
+++ linux-3.6/include/linux/mm.h 2012-10-04 18:30:21.177698422 +0900
@@ -1620,6 +1620,8 @@ int vmemmap_populate(struct page *start_
void vmemmap_populate_print_last(void);
void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
unsigned long size);
+void vmemmap_kfree(struct page *memmpa, unsigned long nr_pages);
+void vmemmap_free_bootmem(struct page *memmpa, unsigned long nr_pages);

enum mf_flags {
MF_COUNT_INCREASED = 1 << 0,
Index: linux-3.6/mm/memory_hotplug.c
===================================================================
--- linux-3.6.orig/mm/memory_hotplug.c 2012-10-04 18:30:15.464692627 +0900
+++ linux-3.6/mm/memory_hotplug.c 2012-10-04 18:30:21.182698427 +0900
@@ -312,19 +312,6 @@ static int __meminit __add_section(int n
return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
}

-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-static int __remove_section(struct zone *zone, struct mem_section *ms)
-{
- int ret = -EINVAL;
-
- if (!valid_section(ms))
- return ret;
-
- ret = unregister_memory_section(ms);
-
- return ret;
-}
-#else
static int __remove_section(struct zone *zone, struct mem_section *ms)
{
unsigned long flags;
@@ -341,9 +328,9 @@ static int __remove_section(struct zone
pgdat_resize_lock(pgdat, &flags);
sparse_remove_one_section(zone, ms);
pgdat_resize_unlock(pgdat, &flags);
- return 0;
+
+ return ret;
}
-#endif

/*
* Reasonably generic function for adding memory. It is
Index: linux-3.6/mm/sparse.c
===================================================================
--- linux-3.6.orig/mm/sparse.c 2012-10-04 18:26:53.908488966 +0900
+++ linux-3.6/mm/sparse.c 2012-10-04 18:30:21.185698430 +0900
@@ -613,12 +613,13 @@ static inline struct page *kmalloc_secti
/* This will make the necessary allocations eventually. */
return sparse_mem_map_populate(pnum, nid);
}
-static void __kfree_section_memmap(struct page *memmap, unsigned long nr_pages)
+static void __kfree_section_memmap(struct page *page, unsigned long nr_pages)
{
- return; /* XXX: Not implemented yet */
+ vmemmap_kfree(page, nr_pages);
}
static void free_map_bootmem(struct page *page, unsigned long nr_pages)
{
+ vmemmap_free_bootmem(page, nr_pages);
}
#else
static struct page *__kmalloc_section_memmap(unsigned long nr_pages)

2012-10-05 02:36:44

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: [PATCH 8/10] memory-hotplug : remove page table of x86_64 architecture

From: Wen Congyang <[email protected]>

For hot removing memory, we sholud remove page table about the memory.
So the patch searches a page table about the removed memory, and clear
page table.

CC: David Rientjes <[email protected]>
CC: Jiang Liu <[email protected]>
CC: Len Brown <[email protected]>
CC: Christoph Lameter <[email protected]>
Cc: Minchan Kim <[email protected]>
CC: Andrew Morton <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Wen Congyang <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 1
arch/x86/mm/init_64.c | 147 +++++++++++++++++++++++++++++++++++
arch/x86/mm/pageattr.c | 47 +++++------
3 files changed, 173 insertions(+), 22 deletions(-)

Index: linux-3.6/arch/x86/mm/init_64.c
===================================================================
--- linux-3.6.orig/arch/x86/mm/init_64.c 2012-10-04 18:30:21.171698416 +0900
+++ linux-3.6/arch/x86/mm/init_64.c 2012-10-04 18:30:27.317704652 +0900
@@ -675,6 +675,151 @@ int arch_add_memory(int nid, u64 start,
}
EXPORT_SYMBOL_GPL(arch_add_memory);

+static void __meminit
+phys_pte_remove(pte_t *pte_page, unsigned long addr, unsigned long end)
+{
+ unsigned pages = 0;
+ int i = pte_index(addr);
+
+ pte_t *pte = pte_page + pte_index(addr);
+
+ for (; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
+
+ if (addr >= end)
+ break;
+
+ if (!pte_present(*pte))
+ continue;
+
+ pages++;
+ set_pte(pte, __pte(0));
+ }
+
+ update_page_count(PG_LEVEL_4K, -pages);
+}
+
+static void __meminit
+phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)
+{
+ unsigned long pages = 0, next;
+ int i = pmd_index(addr);
+
+ for (; i < PTRS_PER_PMD; i++, addr = next) {
+ unsigned long pte_phys;
+ pmd_t *pmd = pmd_page + pmd_index(addr);
+ pte_t *pte;
+
+ if (addr >= end)
+ break;
+
+ next = (addr & PMD_MASK) + PMD_SIZE;
+
+ if (!pmd_present(*pmd))
+ continue;
+
+ if (pmd_large(*pmd)) {
+ if ((addr & ~PMD_MASK) == 0 && next <= end) {
+ set_pmd(pmd, __pmd(0));
+ pages++;
+ continue;
+ }
+
+ /*
+ * We use 2M page, but we need to remove part of them,
+ * so split 2M page to 4K page.
+ */
+ pte = alloc_low_page(&pte_phys);
+ __split_large_page((pte_t *)pmd, addr, pte);
+
+ spin_lock(&init_mm.page_table_lock);
+ pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ spin_lock(&init_mm.page_table_lock);
+ pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
+ phys_pte_remove(pte, addr, end);
+ unmap_low_page(pte);
+ spin_unlock(&init_mm.page_table_lock);
+ }
+ update_page_count(PG_LEVEL_2M, -pages);
+}
+
+static void __meminit
+phys_pud_remove(pud_t *pud_page, unsigned long addr, unsigned long end)
+{
+ unsigned long pages = 0, next;
+ int i = pud_index(addr);
+
+ for (; i < PTRS_PER_PUD; i++, addr = next) {
+ unsigned long pmd_phys;
+ pud_t *pud = pud_page + pud_index(addr);
+ pmd_t *pmd;
+
+ if (addr >= end)
+ break;
+
+ next = (addr & PUD_MASK) + PUD_SIZE;
+
+ if (!pud_present(*pud))
+ continue;
+
+ if (pud_large(*pud)) {
+ if ((addr & ~PUD_MASK) == 0 && next <= end) {
+ set_pud(pud, __pud(0));
+ pages++;
+ continue;
+ }
+
+ /*
+ * We use 1G page, but we need to remove part of them,
+ * so split 1G page to 2M page.
+ */
+ pmd = alloc_low_page(&pmd_phys);
+ __split_large_page((pte_t *)pud, addr, (pte_t *)pmd);
+
+ spin_lock(&init_mm.page_table_lock);
+ pud_populate(&init_mm, pud, __va(pmd_phys));
+ spin_unlock(&init_mm.page_table_lock);
+ }
+
+ pmd = map_low_page(pmd_offset(pud, 0));
+ phys_pmd_remove(pmd, addr, end);
+ unmap_low_page(pmd);
+ __flush_tlb_all();
+ }
+ __flush_tlb_all();
+
+ update_page_count(PG_LEVEL_1G, -pages);
+}
+
+void __meminit
+kernel_physical_mapping_remove(unsigned long start, unsigned long end)
+{
+ unsigned long next;
+
+ start = (unsigned long)__va(start);
+ end = (unsigned long)__va(end);
+
+ for (; start < end; start = next) {
+ pgd_t *pgd = pgd_offset_k(start);
+ pud_t *pud;
+
+ next = (start + PGDIR_SIZE) & PGDIR_MASK;
+ if (next > end)
+ next = end;
+
+ if (!pgd_present(*pgd))
+ continue;
+
+ pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
+ phys_pud_remove(pud, __pa(start), __pa(end));
+ unmap_low_page(pud);
+ }
+
+ __flush_tlb_all();
+}
+
#ifdef CONFIG_MEMORY_HOTREMOVE
int __ref arch_remove_memory(u64 start, u64 size)
{
@@ -687,6 +832,8 @@ int __ref arch_remove_memory(u64 start,
ret = __remove_pages(zone, start_pfn, nr_pages);
WARN_ON_ONCE(ret);

+ kernel_physical_mapping_remove(start, start + size);
+
return ret;
}
#endif
Index: linux-3.6/arch/x86/include/asm/pgtable_types.h
===================================================================
--- linux-3.6.orig/arch/x86/include/asm/pgtable_types.h 2012-10-04 18:26:51.925486954 +0900
+++ linux-3.6/arch/x86/include/asm/pgtable_types.h 2012-10-04 18:30:27.322704656 +0900
@@ -334,6 +334,7 @@ static inline void update_page_count(int
* as a pte too.
*/
extern pte_t *lookup_address(unsigned long address, unsigned int *level);
+extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);

#endif /* !__ASSEMBLY__ */

Index: linux-3.6/arch/x86/mm/pageattr.c
===================================================================
--- linux-3.6.orig/arch/x86/mm/pageattr.c 2012-10-04 18:26:51.923486952 +0900
+++ linux-3.6/arch/x86/mm/pageattr.c 2012-10-04 18:30:27.328704662 +0900
@@ -501,21 +501,13 @@ out_unlock:
return do_split;
}

-static int split_large_page(pte_t *kpte, unsigned long address)
+int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
{
unsigned long pfn, pfninc = 1;
unsigned int i, level;
- pte_t *pbase, *tmp;
+ pte_t *tmp;
pgprot_t ref_prot;
- struct page *base;
-
- if (!debug_pagealloc)
- spin_unlock(&cpa_lock);
- base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
- if (!debug_pagealloc)
- spin_lock(&cpa_lock);
- if (!base)
- return -ENOMEM;
+ struct page *base = virt_to_page(pbase);

spin_lock(&pgd_lock);
/*
@@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte,
* up for us already:
*/
tmp = lookup_address(address, &level);
- if (tmp != kpte)
- goto out_unlock;
+ if (tmp != kpte) {
+ spin_unlock(&pgd_lock);
+ return 1;
+ }

- pbase = (pte_t *)page_address(base);
paravirt_alloc_pte(&init_mm, page_to_pfn(base));
ref_prot = pte_pgprot(pte_clrhuge(*kpte));
/*
@@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte,
* going on.
*/
__flush_tlb_all();
+ spin_unlock(&pgd_lock);

- base = NULL;
+ return 0;
+}

-out_unlock:
- /*
- * If we dropped out via the lookup_address check under
- * pgd_lock then stick the page back into the pool:
- */
- if (base)
+static int split_large_page(pte_t *kpte, unsigned long address)
+{
+ pte_t *pbase;
+ struct page *base;
+
+ if (!debug_pagealloc)
+ spin_unlock(&cpa_lock);
+ base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
+ if (!debug_pagealloc)
+ spin_lock(&cpa_lock);
+ if (!base)
+ return -ENOMEM;
+
+ pbase = (pte_t *)page_address(base);
+ if (__split_large_page(kpte, address, pbase))
__free_page(base);
- spin_unlock(&pgd_lock);

return 0;
}

2012-10-05 02:37:52

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: [PATCH 9/10] memory-hotplug : memory_hotplug: clear zone when removing the memory

When a memory is added, we update zone's and pgdat's start_pfn and
spanned_pages in the function __add_zone(). So we should revert them
when the memory is removed.

The patch adds a new function __remove_zone() to do this.

CC: David Rientjes <[email protected]>
CC: Jiang Liu <[email protected]>
CC: Len Brown <[email protected]>
CC: Christoph Lameter <[email protected]>
Cc: Minchan Kim <[email protected]>
CC: Andrew Morton <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Wen Congyang <[email protected]>
---
mm/memory_hotplug.c | 207 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 207 insertions(+)

Index: linux-3.6/mm/memory_hotplug.c
===================================================================
--- linux-3.6.orig/mm/memory_hotplug.c 2012-10-04 18:30:21.182698427 +0900
+++ linux-3.6/mm/memory_hotplug.c 2012-10-04 18:30:31.767709165 +0900
@@ -312,10 +312,213 @@ static int __meminit __add_section(int n
return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
}

+/* find the smallest valid pfn in the range [start_pfn, end_pfn) */
+static int find_smallest_section_pfn(int nid, struct zone *zone,
+ unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ struct mem_section *ms;
+
+ for (; start_pfn < end_pfn; start_pfn += PAGES_PER_SECTION) {
+ ms = __pfn_to_section(start_pfn);
+
+ if (unlikely(!valid_section(ms)))
+ continue;
+
+ if (unlikely(pfn_to_nid(start_pfn)) != nid)
+ continue;
+
+ if (zone && zone != page_zone(pfn_to_page(start_pfn)))
+ continue;
+
+ return start_pfn;
+ }
+
+ return 0;
+}
+
+/* find the biggest valid pfn in the range [start_pfn, end_pfn). */
+static int find_biggest_section_pfn(int nid, struct zone *zone,
+ unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ struct mem_section *ms;
+ unsigned long pfn;
+
+ /* pfn is the end pfn of a memory section. */
+ pfn = end_pfn - 1;
+ for (; pfn >= start_pfn; pfn -= PAGES_PER_SECTION) {
+ ms = __pfn_to_section(pfn);
+
+ if (unlikely(!valid_section(ms)))
+ continue;
+
+ if (unlikely(pfn_to_nid(pfn)) != nid)
+ continue;
+
+ if (zone && zone != page_zone(pfn_to_page(pfn)))
+ continue;
+
+ return pfn;
+ }
+
+ return 0;
+}
+
+static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
+ unsigned long end_pfn)
+{
+ unsigned long zone_start_pfn = zone->zone_start_pfn;
+ unsigned long zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
+ unsigned long pfn;
+ struct mem_section *ms;
+ int nid = zone_to_nid(zone);
+
+ zone_span_writelock(zone);
+ if (zone_start_pfn == start_pfn) {
+ /*
+ * If the section is smallest section in the zone, it need
+ * shrink zone->zone_start_pfn and zone->zone_spanned_pages.
+ * In this case, we find second smallest valid mem_section
+ * for shrinking zone.
+ */
+ pfn = find_smallest_section_pfn(nid, zone, end_pfn,
+ zone_end_pfn);
+ if (pfn) {
+ zone->zone_start_pfn = pfn;
+ zone->spanned_pages = zone_end_pfn - pfn;
+ }
+ } else if (zone_end_pfn == end_pfn) {
+ /*
+ * If the section is biggest section in the zone, it need
+ * shrink zone->spanned_pages.
+ * In this case, we find second biggest valid mem_section for
+ * shrinking zone.
+ */
+ pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn,
+ start_pfn);
+ if (pfn)
+ zone->spanned_pages = pfn - zone_start_pfn + 1;
+ }
+
+ /*
+ * The section is not biggest or smallest mem_section in the zone, it
+ * only creates a hole in the zone. So in this case, we need not
+ * change the zone. But perhaps, the zone has only hole data. Thus
+ * it check the zone has only hole or not.
+ */
+ pfn = zone_start_pfn;
+ for (; pfn < zone_end_pfn; pfn += PAGES_PER_SECTION) {
+ ms = __pfn_to_section(pfn);
+
+ if (unlikely(!valid_section(ms)))
+ continue;
+
+ if (page_zone(pfn_to_page(pfn)) != zone)
+ continue;
+
+ /* If the section is current section, it continues the loop */
+ if (start_pfn == pfn)
+ continue;
+
+ /* If we find valid section, we have nothing to do */
+ zone_span_writeunlock(zone);
+ return;
+ }
+
+ /* The zone has no valid section */
+ zone->zone_start_pfn = 0;
+ zone->spanned_pages = 0;
+ zone_span_writeunlock(zone);
+}
+
+static void shrink_pgdat_span(struct pglist_data *pgdat,
+ unsigned long start_pfn, unsigned long end_pfn)
+{
+ unsigned long pgdat_start_pfn = pgdat->node_start_pfn;
+ unsigned long pgdat_end_pfn =
+ pgdat->node_start_pfn + pgdat->node_spanned_pages;
+ unsigned long pfn;
+ struct mem_section *ms;
+ int nid = pgdat->node_id;
+
+ if (pgdat_start_pfn == start_pfn) {
+ /*
+ * If the section is smallest section in the pgdat, it need
+ * shrink pgdat->node_start_pfn and pgdat->node_spanned_pages.
+ * In this case, we find second smallest valid mem_section
+ * for shrinking zone.
+ */
+ pfn = find_smallest_section_pfn(nid, NULL, end_pfn,
+ pgdat_end_pfn);
+ if (pfn) {
+ pgdat->node_start_pfn = pfn;
+ pgdat->node_spanned_pages = pgdat_end_pfn - pfn;
+ }
+ } else if (pgdat_end_pfn == end_pfn) {
+ /*
+ * If the section is biggest section in the pgdat, it need
+ * shrink pgdat->node_spanned_pages.
+ * In this case, we find second biggest valid mem_section for
+ * shrinking zone.
+ */
+ pfn = find_biggest_section_pfn(nid, NULL, pgdat_start_pfn,
+ start_pfn);
+ if (pfn)
+ pgdat->node_spanned_pages = pfn - pgdat_start_pfn + 1;
+ }
+
+ /*
+ * If the section is not biggest or smallest mem_section in the pgdat,
+ * it only creates a hole in the pgdat. So in this case, we need not
+ * change the pgdat.
+ * But perhaps, the pgdat has only hole data. Thus it check the pgdat
+ * has only hole or not.
+ */
+ pfn = pgdat_start_pfn;
+ for (; pfn < pgdat_end_pfn; pfn += PAGES_PER_SECTION) {
+ ms = __pfn_to_section(pfn);
+
+ if (unlikely(!valid_section(ms)))
+ continue;
+
+ if (pfn_to_nid(pfn) != nid)
+ continue;
+
+ /* If the section is current section, it continues the loop */
+ if (start_pfn == pfn)
+ continue;
+
+ /* If we find valid section, we have nothing to do */
+ return;
+ }
+
+ /* The pgdat has no valid section */
+ pgdat->node_start_pfn = 0;
+ pgdat->node_spanned_pages = 0;
+}
+
+static void __remove_zone(struct zone *zone, unsigned long start_pfn)
+{
+ struct pglist_data *pgdat = zone->zone_pgdat;
+ int nr_pages = PAGES_PER_SECTION;
+ int zone_type;
+ unsigned long flags;
+
+ zone_type = zone - pgdat->node_zones;
+
+ pgdat_resize_lock(zone->zone_pgdat, &flags);
+ shrink_zone_span(zone, start_pfn, start_pfn + nr_pages);
+ shrink_pgdat_span(pgdat, start_pfn, start_pfn + nr_pages);
+ pgdat_resize_unlock(zone->zone_pgdat, &flags);
+}
+
static int __remove_section(struct zone *zone, struct mem_section *ms)
{
unsigned long flags;
struct pglist_data *pgdat = zone->zone_pgdat;
+ unsigned long start_pfn;
+ int scn_nr;
int ret = -EINVAL;

if (!valid_section(ms))
@@ -325,6 +528,10 @@ static int __remove_section(struct zone
if (ret)
return ret;

+ scn_nr = __section_nr(ms);
+ start_pfn = section_nr_to_pfn(scn_nr);
+ __remove_zone(zone, start_pfn);
+
pgdat_resize_lock(pgdat, &flags);
sparse_remove_one_section(zone, ms);
pgdat_resize_unlock(pgdat, &flags);

2012-10-05 02:38:54

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: [PATCH 10/10] memory-hotplug : remove sysfs file of node

From: Wen Congyang <[email protected]>

This patch introduces a new function try_offline_node() to
remove sysfs file of node when all memory sections of this
node are removed. If some memory sections of this node are
not removed, this function does nothing.

CC: David Rientjes <[email protected]>
CC: Jiang Liu <[email protected]>
CC: Len Brown <[email protected]>
CC: Christoph Lameter <[email protected]>
Cc: Minchan Kim <[email protected]>
CC: Andrew Morton <[email protected]>
CC: KOSAKI Motohiro <[email protected]>
CC: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Wen Congyang <[email protected]>
---
mm/memory_hotplug.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 54 insertions(+)

Index: linux-3.6/mm/memory_hotplug.c
===================================================================
--- linux-3.6.orig/mm/memory_hotplug.c 2012-10-04 18:30:31.767709165 +0900
+++ linux-3.6/mm/memory_hotplug.c 2012-10-04 18:32:46.907842637 +0900
@@ -29,6 +29,7 @@
#include <linux/suspend.h>
#include <linux/mm_inline.h>
#include <linux/firmware-map.h>
+#include <linux/stop_machine.h>

#include <asm/tlbflush.h>

@@ -1276,6 +1277,57 @@ int offline_memory(u64 start, u64 size)
return 0;
}

+static int check_cpu_on_node(void *data)
+{
+ struct pglist_data *pgdat = data;
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ if (cpu_to_node(cpu) == pgdat->node_id)
+ /*
+ * the cpu on this node is onlined, and we can't
+ * offline this node.
+ */
+ return -EBUSY;
+ }
+
+ return 0;
+}
+
+/* offline the node if all memory sections of this node are removed */
+static void try_offline_node(int nid)
+{
+ unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
+ unsigned long end_pfn = start_pfn + NODE_DATA(nid)->node_spanned_pages;
+ unsigned long pfn;
+
+ for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+ unsigned long section_nr = pfn_to_section_nr(pfn);
+
+ if (!present_section_nr(section_nr))
+ continue;
+
+ if (pfn_to_nid(pfn) != nid)
+ continue;
+
+ /*
+ * some memory sections of this node are not removed, and we
+ * can't offline node now.
+ */
+ return;
+ }
+
+ if (stop_machine(check_cpu_on_node, NODE_DATA(nid), NULL))
+ return;
+
+ /*
+ * all memory sections of this node are removed, we can offline this
+ * node now.
+ */
+ node_set_offline(nid);
+ unregister_one_node(nid);
+}
+
int __ref remove_memory(int nid, u64 start, u64 size)
{
int ret = 0;
@@ -1296,6 +1348,8 @@ int __ref remove_memory(int nid, u64 sta
firmware_map_remove(start, start + size, "System RAM");

arch_remove_memory(start, size);
+
+ try_offline_node(nid);
out:
unlock_memory_hotplug();
return ret;

2012-10-05 19:07:05

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 0/10] memory-hotplug: hot-remove physical memory

> Known problems:
> 1. memory can't be offlined when CONFIG_MEMCG is selected.
> For example: there is a memory device on node 1. The address range
> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
> and memory11 under the directory /sys/devices/system/memory/.
> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
> when we online pages. When we online memory8, the memory stored page cgroup
> is not provided by this memory device. But when we online memory9, the memory
> stored page cgroup may be provided by memory8. So we can't offline memory8
> now. We should offline the memory in the reversed order.
> When the memory device is hotremoved, we will auto offline memory provided
> by this memory device. But we don't know which memory is onlined first, so
> offlining memory may fail. In such case, you should offline the memory by
> hand before hotremoving the memory device.

Just iterate twice. 1st iterate: offline every non primary memory
block. 2nd iterate:
offline primary (i.e. first added) memory block. It may work.

2012-10-05 19:33:44

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/10] memory-hotplug : check whether memory is offline or not when removing memory

On Thu, Oct 4, 2012 at 10:25 PM, Yasuaki Ishimatsu
<[email protected]> wrote:
> When calling remove_memory(), the memory should be offline. If the function
> is used to online memory, kernel panic may occur.
>
> So the patch checks whether memory is offline or not.

You don't explain WHY we need the check.


> CC: David Rientjes <[email protected]>
> CC: Jiang Liu <[email protected]>
> CC: Len Brown <[email protected]>
> CC: Christoph Lameter <[email protected]>
> Cc: Minchan Kim <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: KOSAKI Motohiro <[email protected]>
> Signed-off-by: Wen Congyang <[email protected]>
> Signed-off-by: Yasuaki Ishimatsu <[email protected]>
>
> ---
> drivers/base/memory.c | 39 +++++++++++++++++++++++++++++++++++++++
> include/linux/memory.h | 5 +++++
> mm/memory_hotplug.c | 17 +++++++++++++++--
> 3 files changed, 59 insertions(+), 2 deletions(-)
>
> Index: linux-3.6/drivers/base/memory.c
> ===================================================================
> --- linux-3.6.orig/drivers/base/memory.c 2012-10-04 14:22:57.000000000 +0900
> +++ linux-3.6/drivers/base/memory.c 2012-10-04 14:45:46.653585860 +0900
> @@ -70,6 +70,45 @@ void unregister_memory_isolate_notifier(
> }
> EXPORT_SYMBOL(unregister_memory_isolate_notifier);
>
> +bool is_memblk_offline(unsigned long start, unsigned long size)

Don't use memblk. Usually memblk mean struct numa_meminfo for x86/numa.
Maybe memory_range_offlined() is better.

And, this function don't take struct memory_block, then this file may be no good
place.

And you need to write down function comment.


> +{
> + struct memory_block *mem = NULL;
> + struct mem_section *section;
> + unsigned long start_pfn, end_pfn;
> + unsigned long pfn, section_nr;
> +
> + start_pfn = PFN_DOWN(start);
> + end_pfn = PFN_UP(start + size);
> +
> + for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
> + section_nr = pfn_to_section_nr(pfn);
> + if (!present_section_nr(section_nr))
> + continue;
> +
> + section = __nr_to_section(section_nr);
> + /* same memblock? */
> + if (mem)
> + if ((section_nr >= mem->start_section_nr) &&
> + (section_nr <= mem->end_section_nr))
> + continue;
> +
> + mem = find_memory_block_hinted(section, mem);
> + if (!mem)
> + continue;
> + if (mem->state == MEM_OFFLINE)
> + continue;
> +
> + kobject_put(&mem->dev.kobj);
> + return false;
> + }
> +
> + if (mem)
> + kobject_put(&mem->dev.kobj);
> +
> + return true;
> +}
> +EXPORT_SYMBOL(is_memblk_offline);
> +
> /*
> * register_memory - Setup a sysfs device for a memory block
> */
> Index: linux-3.6/include/linux/memory.h
> ===================================================================
> --- linux-3.6.orig/include/linux/memory.h 2012-10-02 18:00:22.000000000 +0900
> +++ linux-3.6/include/linux/memory.h 2012-10-04 14:44:40.902581028 +0900
> @@ -106,6 +106,10 @@ static inline int memory_isolate_notify(
> {
> return 0;
> }
> +static inline bool is_memblk_offline(unsigned long start, unsigned long size)
> +{
> + return false;
> +}
> #else
> extern int register_memory_notifier(struct notifier_block *nb);
> extern void unregister_memory_notifier(struct notifier_block *nb);
> @@ -120,6 +124,7 @@ extern int memory_isolate_notify(unsigne
> extern struct memory_block *find_memory_block_hinted(struct mem_section *,
> struct memory_block *);
> extern struct memory_block *find_memory_block(struct mem_section *);
> +extern bool is_memblk_offline(unsigned long start, unsigned long size);
> #define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION<<PAGE_SHIFT)
> enum mem_add_context { BOOT, HOTPLUG };
> #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
> Index: linux-3.6/mm/memory_hotplug.c
> ===================================================================
> --- linux-3.6.orig/mm/memory_hotplug.c 2012-10-04 14:31:08.000000000 +0900
> +++ linux-3.6/mm/memory_hotplug.c 2012-10-04 14:58:22.449687986 +0900
> @@ -1045,8 +1045,21 @@ int offline_memory(u64 start, u64 size)
>
> int remove_memory(int nid, u64 start, u64 size)
> {

Your remove_memory() don't remove anything. that's strange.


> - /* It is not implemented yet*/
> - return 0;
> + int ret = 0;
> + lock_memory_hotplug();
> + /*
> + * The memory might become online by other task, even if you offine it.
> + * So we check whether the memory has been onlined or not.
> + */
> + if (!is_memblk_offline(start, size)) {
> + pr_warn("memory removing [mem %#010llx-%#010llx] failed, "
> + "because the memmory range is online\n",
> + start, start + size);

No good warning. You should output which memory block can't be
offlined, I think.


> + ret = -EAGAIN;
> + }
> +
> + unlock_memory_hotplug();
> + return ret;
> }
> EXPORT_SYMBOL_GPL(remove_memory);
> #else
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2012-10-05 19:42:14

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 2/10] memory-hotplug : remove /sys/firmware/memmap/X sysfs

On Thu, Oct 4, 2012 at 10:26 PM, Yasuaki Ishimatsu
<[email protected]> wrote:
> When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start, type}
> sysfs files are created. But there is no code to remove these files. The patch
> implements the function to remove them.
>
> Note : The code does not free firmware_map_entry since there is no way to free
> memory which is allocated by bootmem.

You have to explain why this is ok. I guess the unfreed
firmware_map_entry is reused
at next online memory and don't make memory leak, right?



>
> CC: David Rientjes <[email protected]>
> CC: Jiang Liu <[email protected]>
> CC: Len Brown <[email protected]>
> CC: Christoph Lameter <[email protected]>
> Cc: Minchan Kim <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: KOSAKI Motohiro <[email protected]>
> Signed-off-by: Wen Congyang <[email protected]>
> Signed-off-by: Yasuaki Ishimatsu <[email protected]>
>
> ---
> drivers/firmware/memmap.c | 98 ++++++++++++++++++++++++++++++++++++++++++-
> include/linux/firmware-map.h | 6 ++
> mm/memory_hotplug.c | 7 ++-
> 3 files changed, 108 insertions(+), 3 deletions(-)
>
> Index: linux-3.6/drivers/firmware/memmap.c
> ===================================================================
> --- linux-3.6.orig/drivers/firmware/memmap.c 2012-10-04 18:27:05.195500420 +0900
> +++ linux-3.6/drivers/firmware/memmap.c 2012-10-04 18:27:18.901514330 +0900
> @@ -21,6 +21,7 @@
> #include <linux/types.h>
> #include <linux/bootmem.h>
> #include <linux/slab.h>
> +#include <linux/mm.h>
>
> /*
> * Data types ------------------------------------------------------------------
> @@ -41,6 +42,7 @@ struct firmware_map_entry {
> const char *type; /* type of the memory range */
> struct list_head list; /* entry for the linked list */
> struct kobject kobj; /* kobject for each entry */
> + unsigned int bootmem:1; /* allocated from bootmem */

Use bool.


> };
>
> /*
> @@ -79,7 +81,26 @@ static const struct sysfs_ops memmap_att
> .show = memmap_attr_show,
> };
>
> +
> +static inline struct firmware_map_entry *
> +to_memmap_entry(struct kobject *kobj)
> +{
> + return container_of(kobj, struct firmware_map_entry, kobj);
> +}
> +
> +static void release_firmware_map_entry(struct kobject *kobj)
> +{
> + struct firmware_map_entry *entry = to_memmap_entry(kobj);
> +
> + if (entry->bootmem)
> + /* There is no way to free memory allocated from bootmem */
> + return;
> +
> + kfree(entry);
> +}
> +
> static struct kobj_type memmap_ktype = {
> + .release = release_firmware_map_entry,
> .sysfs_ops = &memmap_attr_ops,
> .default_attrs = def_attrs,
> };
> @@ -94,6 +115,7 @@ static struct kobj_type memmap_ktype = {
> * in firmware initialisation code in one single thread of execution.
> */
> static LIST_HEAD(map_entries);
> +static DEFINE_SPINLOCK(map_entries_lock);
>
> /**
> * firmware_map_add_entry() - Does the real work to add a firmware memmap entry.
> @@ -118,11 +140,25 @@ static int firmware_map_add_entry(u64 st
> INIT_LIST_HEAD(&entry->list);
> kobject_init(&entry->kobj, &memmap_ktype);
>
> + spin_lock(&map_entries_lock);
> list_add_tail(&entry->list, &map_entries);
> + spin_unlock(&map_entries_lock);
>
> return 0;
> }
>
> +/**
> + * firmware_map_remove_entry() - Does the real work to remove a firmware
> + * memmap entry.
> + * @entry: removed entry.
> + **/
> +static inline void firmware_map_remove_entry(struct firmware_map_entry *entry)

Don't use inline in *.c file. gcc is wise than you.

> +{
> + spin_lock(&map_entries_lock);
> + list_del(&entry->list);
> + spin_unlock(&map_entries_lock);
> +}
> +
> /*
> * Add memmap entry on sysfs
> */
> @@ -144,6 +180,35 @@ static int add_sysfs_fw_map_entry(struct
> return 0;
> }
>
> +/*
> + * Remove memmap entry on sysfs
> + */
> +static inline void remove_sysfs_fw_map_entry(struct firmware_map_entry *entry)
> +{
> + kobject_put(&entry->kobj);
> +}
> +
> +/*
> + * Search memmap entry
> + */
> +
> +static struct firmware_map_entry * __meminit
> +firmware_map_find_entry(u64 start, u64 end, const char *type)
> +{
> + struct firmware_map_entry *entry;
> +
> + spin_lock(&map_entries_lock);
> + list_for_each_entry(entry, &map_entries, list)
> + if ((entry->start == start) && (entry->end == end) &&
> + (!strcmp(entry->type, type))) {
> + spin_unlock(&map_entries_lock);
> + return entry;
> + }
> +
> + spin_unlock(&map_entries_lock);
> + return NULL;
> +}
> +
> /**
> * firmware_map_add_hotplug() - Adds a firmware mapping entry when we do
> * memory hotplug.
> @@ -193,9 +258,36 @@ int __init firmware_map_add_early(u64 st
> if (WARN_ON(!entry))
> return -ENOMEM;
>
> + entry->bootmem = 1;
> return firmware_map_add_entry(start, end, type, entry);
> }
>
> +/**
> + * firmware_map_remove() - remove a firmware mapping entry
> + * @start: Start of the memory range.
> + * @end: End of the memory range.
> + * @type: Type of the memory range.
> + *
> + * removes a firmware mapping entry.
> + *
> + * Returns 0 on success, or -EINVAL if no entry.
> + **/
> +int __meminit firmware_map_remove(u64 start, u64 end, const char *type)

Remove type argument if this is always passed "System RAM".

> +{
> + struct firmware_map_entry *entry;
> +
> + entry = firmware_map_find_entry(start, end - 1, type);
> + if (!entry)
> + return -EINVAL;
> +
> + firmware_map_remove_entry(entry);
> +
> + /* remove the memmap entry */
> + remove_sysfs_fw_map_entry(entry);
> +
> + return 0;
> +}
> +
> /*
> * Sysfs functions -------------------------------------------------------------
> */
> @@ -217,8 +309,10 @@ static ssize_t type_show(struct firmware
> return snprintf(buf, PAGE_SIZE, "%s\n", entry->type);
> }
>
> -#define to_memmap_attr(_attr) container_of(_attr, struct memmap_attribute, attr)
> -#define to_memmap_entry(obj) container_of(obj, struct firmware_map_entry, kobj)
> +static inline struct memmap_attribute *to_memmap_attr(struct attribute *attr)
> +{
> + return container_of(attr, struct memmap_attribute, attr);
> +}
>
> static ssize_t memmap_attr_show(struct kobject *kobj,
> struct attribute *attr, char *buf)
> Index: linux-3.6/include/linux/firmware-map.h
> ===================================================================
> --- linux-3.6.orig/include/linux/firmware-map.h 2012-10-04 18:27:05.197500422 +0900
> +++ linux-3.6/include/linux/firmware-map.h 2012-10-04 18:27:18.904514333 +0900
> @@ -25,6 +25,7 @@
>
> int firmware_map_add_early(u64 start, u64 end, const char *type);
> int firmware_map_add_hotplug(u64 start, u64 end, const char *type);
> +int firmware_map_remove(u64 start, u64 end, const char *type);
>
> #else /* CONFIG_FIRMWARE_MEMMAP */
>
> @@ -38,6 +39,11 @@ static inline int firmware_map_add_hotpl
> return 0;
> }
>
> +static inline int firmware_map_remove(u64 start, u64 end, const char *type)
> +{
> + return 0;
> +}
> +
> #endif /* CONFIG_FIRMWARE_MEMMAP */
>
> #endif /* _LINUX_FIRMWARE_MAP_H */
> Index: linux-3.6/mm/memory_hotplug.c
> ===================================================================
> --- linux-3.6.orig/mm/memory_hotplug.c 2012-10-04 18:27:03.000000000 +0900
> +++ linux-3.6/mm/memory_hotplug.c 2012-10-04 18:28:42.851599524 +0900
> @@ -1043,7 +1043,7 @@ int offline_memory(u64 start, u64 size)
> return 0;
> }
>
> -int remove_memory(int nid, u64 start, u64 size)
> +int __ref remove_memory(int nid, u64 start, u64 size)
> {
> int ret = 0;
> lock_memory_hotplug();
> @@ -1056,8 +1056,13 @@ int remove_memory(int nid, u64 start, u6
> "because the memmory range is online\n",
> start, start + size);
> ret = -EAGAIN;
> + goto out;
> }
>
> + /* remove memmap entry */
> + firmware_map_remove(start, start + size, "System RAM");
> +
> +out:
> unlock_memory_hotplug();
> return ret;
> }


Other than that, looks ok to me.

2012-10-08 04:37:10

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 8/10] memory-hotplug : remove page table of x86_64 architecture

Yasuaki Ishimatsu <[email protected]> writes:
> + }
> +
> + /*
> + * We use 2M page, but we need to remove part of them,
> + * so split 2M page to 4K page.
> + */
> + pte = alloc_low_page(&pte_phys);

What happens when the allocation fails?

alloc_low_page seems to be buggy there too, it would __pa a NULL
pointer.

> + if (pud_large(*pud)) {
> + if ((addr & ~PUD_MASK) == 0 && next <= end) {
> + set_pud(pud, __pud(0));
> + pages++;
> + continue;
> + }
> +
> + /*
> + * We use 1G page, but we need to remove part of them,
> + * so split 1G page to 2M page.
> + */
> + pmd = alloc_low_page(&pmd_phys);

Same here

> + __split_large_page((pte_t *)pud, addr, (pte_t *)pmd);
> +
> + spin_lock(&init_mm.page_table_lock);
> + pud_populate(&init_mm, pud, __va(pmd_phys));
> + spin_unlock(&init_mm.page_table_lock);
> + }
> +
> + pmd = map_low_page(pmd_offset(pud, 0));
> + phys_pmd_remove(pmd, addr, end);
> + unmap_low_page(pmd);
> + __flush_tlb_all();
> + }
> + __flush_tlb_all();

This doesn't flush the other CPUs doesn't it?

-Andi

--
[email protected] -- Speaking for myself only

2012-10-08 05:18:05

by Wen Congyang

[permalink] [raw]
Subject: Re: [PATCH 8/10] memory-hotplug : remove page table of x86_64 architecture

At 10/08/2012 12:37 PM, Andi Kleen Wrote:
> Yasuaki Ishimatsu <[email protected]> writes:
>> + }
>> +
>> + /*
>> + * We use 2M page, but we need to remove part of them,
>> + * so split 2M page to 4K page.
>> + */
>> + pte = alloc_low_page(&pte_phys);
>
> What happens when the allocation fails?
>
> alloc_low_page seems to be buggy there too, it would __pa a NULL
> pointer.

Yes, it will cause kernek panicked in __pa() if CONFI_DEBUG_VIRTUAL is set.
Otherwise, it will return a NULL pointer. I will update this patch to deal
with NULL pointer.

>
>> + if (pud_large(*pud)) {
>> + if ((addr & ~PUD_MASK) == 0 && next <= end) {
>> + set_pud(pud, __pud(0));
>> + pages++;
>> + continue;
>> + }
>> +
>> + /*
>> + * We use 1G page, but we need to remove part of them,
>> + * so split 1G page to 2M page.
>> + */
>> + pmd = alloc_low_page(&pmd_phys);
>
> Same here
>
>> + __split_large_page((pte_t *)pud, addr, (pte_t *)pmd);
>> +
>> + spin_lock(&init_mm.page_table_lock);
>> + pud_populate(&init_mm, pud, __va(pmd_phys));
>> + spin_unlock(&init_mm.page_table_lock);
>> + }
>> +
>> + pmd = map_low_page(pmd_offset(pud, 0));
>> + phys_pmd_remove(pmd, addr, end);
>> + unmap_low_page(pmd);
>> + __flush_tlb_all();
>> + }
>> + __flush_tlb_all();
>
> This doesn't flush the other CPUs doesn't it?

How to flush the other CPU's tlb? use on_each_cpu() to run __flush_tlb_all()
on each online cpu?

Thanks
Wen Congyang

>
> -Andi
>

2012-10-08 05:20:48

by Wen Congyang

[permalink] [raw]
Subject: Re: [PATCH 0/10] memory-hotplug: hot-remove physical memory

At 10/06/2012 03:06 AM, KOSAKI Motohiro Wrote:
>> Known problems:
>> 1. memory can't be offlined when CONFIG_MEMCG is selected.
>> For example: there is a memory device on node 1. The address range
>> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
>> and memory11 under the directory /sys/devices/system/memory/.
>> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
>> when we online pages. When we online memory8, the memory stored page cgroup
>> is not provided by this memory device. But when we online memory9, the memory
>> stored page cgroup may be provided by memory8. So we can't offline memory8
>> now. We should offline the memory in the reversed order.
>> When the memory device is hotremoved, we will auto offline memory provided
>> by this memory device. But we don't know which memory is onlined first, so
>> offlining memory may fail. In such case, you should offline the memory by
>> hand before hotremoving the memory device.
>
> Just iterate twice. 1st iterate: offline every non primary memory
> block. 2nd iterate:
> offline primary (i.e. first added) memory block. It may work.
>

OK, I will try it.

Thanks
Wen Congyang

2012-10-09 08:27:12

by Jianguo Wu

[permalink] [raw]
Subject: Re: [PATCH 8/10] memory-hotplug : remove page table of x86_64 architecture

Hi Congyang,
I think we should also free pages which are used by page tables after removing
page tables of the memory.

From: Jianguo Wu <[email protected]>

Signed-off-by: Jianguo Wu <[email protected]>
Signed-off-by: Jiang Liu <[email protected]>
---
arch/x86/mm/init_64.c | 110 +++++++++++++++++++++++++++++++++++++++---------
1 files changed, 89 insertions(+), 21 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 5596dfa..81f9c3b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -675,6 +675,74 @@ int arch_add_memory(int nid, u64 start, u64 size)
}
EXPORT_SYMBOL_GPL(arch_add_memory);

+static inline void free_pagetable(struct page *page)
+{
+ struct zone *zone;
+
+ __ClearPageReserved(page);
+ __free_page(page);
+
+ zone = page_zone(page);
+ zone_span_writelock(zone);
+ zone->present_pages++;
+ zone_span_writeunlock(zone);
+ totalram_pages++;
+}
+
+static void free_pte_table(pte_t *pte_start, pmd_t *pmd)
+{
+ pte_t *pte;
+ int i;
+
+ for (i = 0; i < PTRS_PER_PTE; i++) {
+ pte = pte_start + i;
+ if (pte_val(*pte))
+ break;
+ }
+
+ /* free a pte talbe */
+ if (i == PTRS_PER_PTE) {
+ free_pagetable(pmd_page(*pmd));
+ pmd_clear(pmd);
+ }
+}
+
+static void free_pmd_table(pmd_t *pmd_start, pud_t *pud)
+{
+ pmd_t *pmd;
+ int i;
+
+ for (i = 0; i < PTRS_PER_PMD; i++) {
+ pmd = pmd_start + i;
+ if (pmd_val(*pmd))
+ break;
+ }
+
+ /* free a pmd talbe */
+ if (i == PTRS_PER_PMD) {
+ free_pagetable(pud_page(*pud));
+ pud_clear(pud);
+ }
+}
+
+static void free_pud_table(pud_t *pud_start, pgd_t *pgd)
+{
+ pud_t *pud;
+ int i;
+
+ for (i = 0; i < PTRS_PER_PUD; i++) {
+ pud = pud_start + i;
+ if (pud_val(*pud))
+ break;
+ }
+
+ /* free a pud table */
+ if (i == PTRS_PER_PUD) {
+ free_pagetable(pgd_page(*pgd));
+ pgd_clear(pgd);
+ }
+}
+
static void __meminit
phys_pte_remove(pte_t *pte_page, unsigned long addr, unsigned long end)
{
@@ -704,21 +772,19 @@ phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)
unsigned long pages = 0, next;
int i = pmd_index(addr);

- for (; i < PTRS_PER_PMD; i++, addr = next) {
+ for (; i < PTRS_PER_PMD && addr < end; i++, addr = next) {
unsigned long pte_phys;
pmd_t *pmd = pmd_page + pmd_index(addr);
pte_t *pte;

- if (addr >= end)
- break;
-
- next = (addr & PMD_MASK) + PMD_SIZE;
+ next = pmd_addr_end(addr, end);

if (!pmd_present(*pmd))
continue;

if (pmd_large(*pmd)) {
- if ((addr & ~PMD_MASK) == 0 && next <= end) {
+ if (IS_ALIGNED(addr, PMD_SIZE) &&
+ IS_ALIGNED(next, PMD_SIZE)) {
set_pmd(pmd, __pmd(0));
pages++;
continue;
@@ -729,7 +795,8 @@ phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)
* so split 2M page to 4K page.
*/
pte = alloc_low_page(&pte_phys);
- __split_large_page((pte_t *)pmd, addr, pte);
+ __split_large_page((pte_t *)pmd,
+ (unsigned long)__va(addr), pte);

spin_lock(&init_mm.page_table_lock);
pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
@@ -738,7 +805,8 @@ phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)

spin_lock(&init_mm.page_table_lock);
pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
- phys_pte_remove(pte, addr, end);
+ phys_pte_remove(pte, addr, next);
+ free_pte_table(pte, pmd);
unmap_low_page(pte);
spin_unlock(&init_mm.page_table_lock);
}
@@ -751,21 +819,19 @@ phys_pud_remove(pud_t *pud_page, unsigned long addr, unsigned long end)
unsigned long pages = 0, next;
int i = pud_index(addr);

- for (; i < PTRS_PER_PUD; i++, addr = next) {
+ for (; i < PTRS_PER_PUD && addr < end; i++, addr = next) {
unsigned long pmd_phys;
pud_t *pud = pud_page + pud_index(addr);
pmd_t *pmd;

- if (addr >= end)
- break;
-
- next = (addr & PUD_MASK) + PUD_SIZE;
+ next = pud_addr_end(addr, end);

if (!pud_present(*pud))
continue;

if (pud_large(*pud)) {
- if ((addr & ~PUD_MASK) == 0 && next <= end) {
+ if (IS_ALIGNED(addr, PUD_SIZE) &&
+ IS_ALIGNED(next, PUD_SIZE)) {
set_pud(pud, __pud(0));
pages++;
continue;
@@ -776,15 +842,18 @@ phys_pud_remove(pud_t *pud_page, unsigned long addr, unsigned long end)
* so split 1G page to 2M page.
*/
pmd = alloc_low_page(&pmd_phys);
- __split_large_page((pte_t *)pud, addr, (pte_t *)pmd);
+ __split_large_page((pte_t *)pud,
+ (unsigned long)__va(addr),
+ (pte_t *)pmd);

spin_lock(&init_mm.page_table_lock);
pud_populate(&init_mm, pud, __va(pmd_phys));
spin_unlock(&init_mm.page_table_lock);
}

- pmd = map_low_page(pmd_offset(pud, 0));
- phys_pmd_remove(pmd, addr, end);
+ pmd = map_low_page((pmd_t *)pud_page_vaddr(*pud));
+ phys_pmd_remove(pmd, addr, next);
+ free_pmd_table(pmd, pud);
unmap_low_page(pmd);
__flush_tlb_all();
}
@@ -805,15 +874,14 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
pgd_t *pgd = pgd_offset_k(start);
pud_t *pud;

- next = (start + PGDIR_SIZE) & PGDIR_MASK;
- if (next > end)
- next = end;
+ next = pgd_addr_end(start, end);

if (!pgd_present(*pgd))
continue;

pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
- phys_pud_remove(pud, __pa(start), __pa(end));
+ phys_pud_remove(pud, __pa(start), __pa(next));
+ free_pud_table(pud, pgd);
unmap_low_page(pud);
}

-- 1.7.6.1 .


On 2012-10-5 10:36, Yasuaki Ishimatsu wrote:
> From: Wen Congyang <[email protected]>
>
> For hot removing memory, we sholud remove page table about the memory.
> So the patch searches a page table about the removed memory, and clear
> page table.
>
> CC: David Rientjes <[email protected]>
> CC: Jiang Liu <[email protected]>
> CC: Len Brown <[email protected]>
> CC: Christoph Lameter <[email protected]>
> Cc: Minchan Kim <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: KOSAKI Motohiro <[email protected]>
> CC: Yasuaki Ishimatsu <[email protected]>
> Signed-off-by: Wen Congyang <[email protected]>
> ---
> arch/x86/include/asm/pgtable_types.h | 1
> arch/x86/mm/init_64.c | 147 +++++++++++++++++++++++++++++++++++
> arch/x86/mm/pageattr.c | 47 +++++------
> 3 files changed, 173 insertions(+), 22 deletions(-)
>
> Index: linux-3.6/arch/x86/mm/init_64.c
> ===================================================================
> --- linux-3.6.orig/arch/x86/mm/init_64.c 2012-10-04 18:30:21.171698416 +0900
> +++ linux-3.6/arch/x86/mm/init_64.c 2012-10-04 18:30:27.317704652 +0900
> @@ -675,6 +675,151 @@ int arch_add_memory(int nid, u64 start,
> }
> EXPORT_SYMBOL_GPL(arch_add_memory);
>
> +static void __meminit
> +phys_pte_remove(pte_t *pte_page, unsigned long addr, unsigned long end)
> +{
> + unsigned pages = 0;
> + int i = pte_index(addr);
> +
> + pte_t *pte = pte_page + pte_index(addr);
> +
> + for (; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
> +
> + if (addr >= end)
> + break;
> +
> + if (!pte_present(*pte))
> + continue;
> +
> + pages++;
> + set_pte(pte, __pte(0));
> + }
> +
> + update_page_count(PG_LEVEL_4K, -pages);
> +}
> +
> +static void __meminit
> +phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)
> +{
> + unsigned long pages = 0, next;
> + int i = pmd_index(addr);
> +
> + for (; i < PTRS_PER_PMD; i++, addr = next) {
> + unsigned long pte_phys;
> + pmd_t *pmd = pmd_page + pmd_index(addr);
> + pte_t *pte;
> +
> + if (addr >= end)
> + break;
> +
> + next = (addr & PMD_MASK) + PMD_SIZE;
> +
> + if (!pmd_present(*pmd))
> + continue;
> +
> + if (pmd_large(*pmd)) {
> + if ((addr & ~PMD_MASK) == 0 && next <= end) {
> + set_pmd(pmd, __pmd(0));
> + pages++;
> + continue;
> + }
> +
> + /*
> + * We use 2M page, but we need to remove part of them,
> + * so split 2M page to 4K page.
> + */
> + pte = alloc_low_page(&pte_phys);
> + __split_large_page((pte_t *)pmd, addr, pte);
> +
> + spin_lock(&init_mm.page_table_lock);
> + pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
> + spin_unlock(&init_mm.page_table_lock);
> + }
> +
> + spin_lock(&init_mm.page_table_lock);
> + pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
> + phys_pte_remove(pte, addr, end);
> + unmap_low_page(pte);
> + spin_unlock(&init_mm.page_table_lock);
> + }
> + update_page_count(PG_LEVEL_2M, -pages);
> +}
> +
> +static void __meminit
> +phys_pud_remove(pud_t *pud_page, unsigned long addr, unsigned long end)
> +{
> + unsigned long pages = 0, next;
> + int i = pud_index(addr);
> +
> + for (; i < PTRS_PER_PUD; i++, addr = next) {
> + unsigned long pmd_phys;
> + pud_t *pud = pud_page + pud_index(addr);
> + pmd_t *pmd;
> +
> + if (addr >= end)
> + break;
> +
> + next = (addr & PUD_MASK) + PUD_SIZE;
> +
> + if (!pud_present(*pud))
> + continue;
> +
> + if (pud_large(*pud)) {
> + if ((addr & ~PUD_MASK) == 0 && next <= end) {
> + set_pud(pud, __pud(0));
> + pages++;
> + continue;
> + }
> +
> + /*
> + * We use 1G page, but we need to remove part of them,
> + * so split 1G page to 2M page.
> + */
> + pmd = alloc_low_page(&pmd_phys);
> + __split_large_page((pte_t *)pud, addr, (pte_t *)pmd);
> +
> + spin_lock(&init_mm.page_table_lock);
> + pud_populate(&init_mm, pud, __va(pmd_phys));
> + spin_unlock(&init_mm.page_table_lock);
> + }
> +
> + pmd = map_low_page(pmd_offset(pud, 0));
> + phys_pmd_remove(pmd, addr, end);
> + unmap_low_page(pmd);
> + __flush_tlb_all();
> + }
> + __flush_tlb_all();
> +
> + update_page_count(PG_LEVEL_1G, -pages);
> +}
> +
> +void __meminit
> +kernel_physical_mapping_remove(unsigned long start, unsigned long end)
> +{
> + unsigned long next;
> +
> + start = (unsigned long)__va(start);
> + end = (unsigned long)__va(end);
> +
> + for (; start < end; start = next) {
> + pgd_t *pgd = pgd_offset_k(start);
> + pud_t *pud;
> +
> + next = (start + PGDIR_SIZE) & PGDIR_MASK;
> + if (next > end)
> + next = end;
> +
> + if (!pgd_present(*pgd))
> + continue;
> +
> + pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> + phys_pud_remove(pud, __pa(start), __pa(end));
> + unmap_low_page(pud);
> + }
> +
> + __flush_tlb_all();
> +}
> +
> #ifdef CONFIG_MEMORY_HOTREMOVE
> int __ref arch_remove_memory(u64 start, u64 size)
> {
> @@ -687,6 +832,8 @@ int __ref arch_remove_memory(u64 start,
> ret = __remove_pages(zone, start_pfn, nr_pages);
> WARN_ON_ONCE(ret);
>
> + kernel_physical_mapping_remove(start, start + size);
> +
> return ret;
> }
> #endif
> Index: linux-3.6/arch/x86/include/asm/pgtable_types.h
> ===================================================================
> --- linux-3.6.orig/arch/x86/include/asm/pgtable_types.h 2012-10-04 18:26:51.925486954 +0900
> +++ linux-3.6/arch/x86/include/asm/pgtable_types.h 2012-10-04 18:30:27.322704656 +0900
> @@ -334,6 +334,7 @@ static inline void update_page_count(int
> * as a pte too.
> */
> extern pte_t *lookup_address(unsigned long address, unsigned int *level);
> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
>
> #endif /* !__ASSEMBLY__ */
>
> Index: linux-3.6/arch/x86/mm/pageattr.c
> ===================================================================
> --- linux-3.6.orig/arch/x86/mm/pageattr.c 2012-10-04 18:26:51.923486952 +0900
> +++ linux-3.6/arch/x86/mm/pageattr.c 2012-10-04 18:30:27.328704662 +0900
> @@ -501,21 +501,13 @@ out_unlock:
> return do_split;
> }
>
> -static int split_large_page(pte_t *kpte, unsigned long address)
> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
> {
> unsigned long pfn, pfninc = 1;
> unsigned int i, level;
> - pte_t *pbase, *tmp;
> + pte_t *tmp;
> pgprot_t ref_prot;
> - struct page *base;
> -
> - if (!debug_pagealloc)
> - spin_unlock(&cpa_lock);
> - base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> - if (!debug_pagealloc)
> - spin_lock(&cpa_lock);
> - if (!base)
> - return -ENOMEM;
> + struct page *base = virt_to_page(pbase);
>
> spin_lock(&pgd_lock);
> /*
> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte,
> * up for us already:
> */
> tmp = lookup_address(address, &level);
> - if (tmp != kpte)
> - goto out_unlock;
> + if (tmp != kpte) {
> + spin_unlock(&pgd_lock);
> + return 1;
> + }
>
> - pbase = (pte_t *)page_address(base);
> paravirt_alloc_pte(&init_mm, page_to_pfn(base));
> ref_prot = pte_pgprot(pte_clrhuge(*kpte));
> /*
> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte,
> * going on.
> */
> __flush_tlb_all();
> + spin_unlock(&pgd_lock);
>
> - base = NULL;
> + return 0;
> +}
>
> -out_unlock:
> - /*
> - * If we dropped out via the lookup_address check under
> - * pgd_lock then stick the page back into the pool:
> - */
> - if (base)
> +static int split_large_page(pte_t *kpte, unsigned long address)
> +{
> + pte_t *pbase;
> + struct page *base;
> +
> + if (!debug_pagealloc)
> + spin_unlock(&cpa_lock);
> + base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> + if (!debug_pagealloc)
> + spin_lock(&cpa_lock);
> + if (!base)
> + return -ENOMEM;
> +
> + pbase = (pte_t *)page_address(base);
> + if (__split_large_page(kpte, address, pbase))
> __free_page(base);
> - spin_unlock(&pgd_lock);
>
> return 0;
> }
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>

2012-10-11 00:36:06

by Ni zhan Chen

[permalink] [raw]
Subject: Re: [PATCH 8/10] memory-hotplug : remove page table of x86_64 architecture

On 10/08/2012 01:23 PM, Wen Congyang wrote:
> At 10/08/2012 12:37 PM, Andi Kleen Wrote:
>> Yasuaki Ishimatsu <[email protected]> writes:
>>> + }
>>> +
>>> + /*
>>> + * We use 2M page, but we need to remove part of them,
>>> + * so split 2M page to 4K page.
>>> + */
>>> + pte = alloc_low_page(&pte_phys);
>> What happens when the allocation fails?
>>
>> alloc_low_page seems to be buggy there too, it would __pa a NULL
>> pointer.
> Yes, it will cause kernek panicked in __pa() if CONFI_DEBUG_VIRTUAL is set.
> Otherwise, it will return a NULL pointer. I will update this patch to deal
> with NULL pointer.
>
>>> + if (pud_large(*pud)) {
>>> + if ((addr & ~PUD_MASK) == 0 && next <= end) {
>>> + set_pud(pud, __pud(0));
>>> + pages++;
>>> + continue;
>>> + }
>>> +
>>> + /*
>>> + * We use 1G page, but we need to remove part of them,
>>> + * so split 1G page to 2M page.
>>> + */
>>> + pmd = alloc_low_page(&pmd_phys);
>> Same here
>>
>>> + __split_large_page((pte_t *)pud, addr, (pte_t *)pmd);
>>> +
>>> + spin_lock(&init_mm.page_table_lock);
>>> + pud_populate(&init_mm, pud, __va(pmd_phys));
>>> + spin_unlock(&init_mm.page_table_lock);
>>> + }
>>> +
>>> + pmd = map_low_page(pmd_offset(pud, 0));
>>> + phys_pmd_remove(pmd, addr, end);
>>> + unmap_low_page(pmd);
>>> + __flush_tlb_all();
>>> + }
>>> + __flush_tlb_all();

Hi Congyang,

I see you call __flush_tlb_all() every pud entry(all pmd, pte related to
it changed) modified, then how to determine the flush frequency? why not
every pmd entry?

Regards,
Chen

>> This doesn't flush the other CPUs doesn't it?
> How to flush the other CPU's tlb? use on_each_cpu() to run __flush_tlb_all()
> on each online cpu?
>
> Thanks
> Wen Congyang
>
>> -Andi
>>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>

2012-10-11 07:07:43

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: Re: [PATCH 2/10] memory-hotplug : remove /sys/firmware/memmap/X sysfs

2012/10/06 4:36, KOSAKI Motohiro wrote:
> On Thu, Oct 4, 2012 at 10:26 PM, Yasuaki Ishimatsu
> <[email protected]> wrote:
>> When (hot)adding memory into system, /sys/firmware/memmap/X/{end, start, type}
>> sysfs files are created. But there is no code to remove these files. The patch
>> implements the function to remove them.
>>
>> Note : The code does not free firmware_map_entry since there is no way to free
>> memory which is allocated by bootmem.
>
> You have to explain why this is ok. I guess the unfreed
> firmware_map_entry is reused
> at next online memory and don't make memory leak, right?

Unfortunately, it is no. It makes memory leak about firmware_map_entry size.
If we hot add memory, slab allocater prepares a other memory for
firmware_map_entry.

In my understanding, if the memory is allocated by bootmem allocator,
the memory is not managed by slab allocator. So we can not use kfree()
against the memory.
On the other hand, the page of the memory may have various data allocalted
by bootmem allocater with the exception of the firmware_map_entry. Thus we
cannot free the page.

So the patch makes memory leak. But I think the memory leak size is
very samll. And it does not affect the system.

>
>
>
>>
>> CC: David Rientjes <[email protected]>
>> CC: Jiang Liu <[email protected]>
>> CC: Len Brown <[email protected]>
>> CC: Christoph Lameter <[email protected]>
>> Cc: Minchan Kim <[email protected]>
>> CC: Andrew Morton <[email protected]>
>> CC: KOSAKI Motohiro <[email protected]>
>> Signed-off-by: Wen Congyang <[email protected]>
>> Signed-off-by: Yasuaki Ishimatsu <[email protected]>
>>
>> ---
>> drivers/firmware/memmap.c | 98 ++++++++++++++++++++++++++++++++++++++++++-
>> include/linux/firmware-map.h | 6 ++
>> mm/memory_hotplug.c | 7 ++-
>> 3 files changed, 108 insertions(+), 3 deletions(-)
>>
>> Index: linux-3.6/drivers/firmware/memmap.c
>> ===================================================================
>> --- linux-3.6.orig/drivers/firmware/memmap.c 2012-10-04 18:27:05.195500420 +0900
>> +++ linux-3.6/drivers/firmware/memmap.c 2012-10-04 18:27:18.901514330 +0900
>> @@ -21,6 +21,7 @@
>> #include <linux/types.h>
>> #include <linux/bootmem.h>
>> #include <linux/slab.h>
>> +#include <linux/mm.h>
>>
>> /*
>> * Data types ------------------------------------------------------------------
>> @@ -41,6 +42,7 @@ struct firmware_map_entry {
>> const char *type; /* type of the memory range */
>> struct list_head list; /* entry for the linked list */
>> struct kobject kobj; /* kobject for each entry */
>> + unsigned int bootmem:1; /* allocated from bootmem */
>
> Use bool.

We'll update it.

>
>> };
>>
>> /*
>> @@ -79,7 +81,26 @@ static const struct sysfs_ops memmap_att
>> .show = memmap_attr_show,
>> };
>>
>> +
>> +static inline struct firmware_map_entry *
>> +to_memmap_entry(struct kobject *kobj)
>> +{
>> + return container_of(kobj, struct firmware_map_entry, kobj);
>> +}
>> +
>> +static void release_firmware_map_entry(struct kobject *kobj)
>> +{
>> + struct firmware_map_entry *entry = to_memmap_entry(kobj);
>> +
>> + if (entry->bootmem)
>> + /* There is no way to free memory allocated from bootmem */
>> + return;
>> +
>> + kfree(entry);
>> +}
>> +
>> static struct kobj_type memmap_ktype = {
>> + .release = release_firmware_map_entry,
>> .sysfs_ops = &memmap_attr_ops,
>> .default_attrs = def_attrs,
>> };
>> @@ -94,6 +115,7 @@ static struct kobj_type memmap_ktype = {
>> * in firmware initialisation code in one single thread of execution.
>> */
>> static LIST_HEAD(map_entries);
>> +static DEFINE_SPINLOCK(map_entries_lock);
>>
>> /**
>> * firmware_map_add_entry() - Does the real work to add a firmware memmap entry.
>> @@ -118,11 +140,25 @@ static int firmware_map_add_entry(u64 st
>> INIT_LIST_HEAD(&entry->list);
>> kobject_init(&entry->kobj, &memmap_ktype);
>>
>> + spin_lock(&map_entries_lock);
>> list_add_tail(&entry->list, &map_entries);
>> + spin_unlock(&map_entries_lock);
>>
>> return 0;
>> }
>>
>> +/**
>> + * firmware_map_remove_entry() - Does the real work to remove a firmware
>> + * memmap entry.
>> + * @entry: removed entry.
>> + **/
>> +static inline void firmware_map_remove_entry(struct firmware_map_entry *entry)
>
> Don't use inline in *.c file. gcc is wise than you.

We'll update it.

>> +{
>> + spin_lock(&map_entries_lock);
>> + list_del(&entry->list);
>> + spin_unlock(&map_entries_lock);
>> +}
>> +
>> /*
>> * Add memmap entry on sysfs
>> */
>> @@ -144,6 +180,35 @@ static int add_sysfs_fw_map_entry(struct
>> return 0;
>> }
>>
>> +/*
>> + * Remove memmap entry on sysfs
>> + */
>> +static inline void remove_sysfs_fw_map_entry(struct firmware_map_entry *entry)
>> +{
>> + kobject_put(&entry->kobj);
>> +}
>> +
>> +/*
>> + * Search memmap entry
>> + */
>> +
>> +static struct firmware_map_entry * __meminit
>> +firmware_map_find_entry(u64 start, u64 end, const char *type)
>> +{
>> + struct firmware_map_entry *entry;
>> +
>> + spin_lock(&map_entries_lock);
>> + list_for_each_entry(entry, &map_entries, list)
>> + if ((entry->start == start) && (entry->end == end) &&
>> + (!strcmp(entry->type, type))) {
>> + spin_unlock(&map_entries_lock);
>> + return entry;
>> + }
>> +
>> + spin_unlock(&map_entries_lock);
>> + return NULL;
>> +}
>> +
>> /**
>> * firmware_map_add_hotplug() - Adds a firmware mapping entry when we do
>> * memory hotplug.
>> @@ -193,9 +258,36 @@ int __init firmware_map_add_early(u64 st
>> if (WARN_ON(!entry))
>> return -ENOMEM;
>>
>> + entry->bootmem = 1;
>> return firmware_map_add_entry(start, end, type, entry);
>> }
>>
>> +/**
>> + * firmware_map_remove() - remove a firmware mapping entry
>> + * @start: Start of the memory range.
>> + * @end: End of the memory range.
>> + * @type: Type of the memory range.
>> + *
>> + * removes a firmware mapping entry.
>> + *
>> + * Returns 0 on success, or -EINVAL if no entry.
>> + **/
>> +int __meminit firmware_map_remove(u64 start, u64 end, const char *type)
>
> Remove type argument if this is always passed "System RAM".

Probably, the type is always "System RAM". But we need to check whether
that the range of start and end variables are "System RAM" or not.
So I want to keep it.

Thanks,
Yasuaki Ishimatsu

>
>> +{
>> + struct firmware_map_entry *entry;
>> +
>> + entry = firmware_map_find_entry(start, end - 1, type);
>> + if (!entry)
>> + return -EINVAL;
>> +
>> + firmware_map_remove_entry(entry);
>> +
>> + /* remove the memmap entry */
>> + remove_sysfs_fw_map_entry(entry);
>> +
>> + return 0;
>> +}
>> +
>> /*
>> * Sysfs functions -------------------------------------------------------------
>> */
>> @@ -217,8 +309,10 @@ static ssize_t type_show(struct firmware
>> return snprintf(buf, PAGE_SIZE, "%s\n", entry->type);
>> }
>>
>> -#define to_memmap_attr(_attr) container_of(_attr, struct memmap_attribute, attr)
>> -#define to_memmap_entry(obj) container_of(obj, struct firmware_map_entry, kobj)
>> +static inline struct memmap_attribute *to_memmap_attr(struct attribute *attr)
>> +{
>> + return container_of(attr, struct memmap_attribute, attr);
>> +}
>>
>> static ssize_t memmap_attr_show(struct kobject *kobj,
>> struct attribute *attr, char *buf)
>> Index: linux-3.6/include/linux/firmware-map.h
>> ===================================================================
>> --- linux-3.6.orig/include/linux/firmware-map.h 2012-10-04 18:27:05.197500422 +0900
>> +++ linux-3.6/include/linux/firmware-map.h 2012-10-04 18:27:18.904514333 +0900
>> @@ -25,6 +25,7 @@
>>
>> int firmware_map_add_early(u64 start, u64 end, const char *type);
>> int firmware_map_add_hotplug(u64 start, u64 end, const char *type);
>> +int firmware_map_remove(u64 start, u64 end, const char *type);
>>
>> #else /* CONFIG_FIRMWARE_MEMMAP */
>>
>> @@ -38,6 +39,11 @@ static inline int firmware_map_add_hotpl
>> return 0;
>> }
>>
>> +static inline int firmware_map_remove(u64 start, u64 end, const char *type)
>> +{
>> + return 0;
>> +}
>> +
>> #endif /* CONFIG_FIRMWARE_MEMMAP */
>>
>> #endif /* _LINUX_FIRMWARE_MAP_H */
>> Index: linux-3.6/mm/memory_hotplug.c
>> ===================================================================
>> --- linux-3.6.orig/mm/memory_hotplug.c 2012-10-04 18:27:03.000000000 +0900
>> +++ linux-3.6/mm/memory_hotplug.c 2012-10-04 18:28:42.851599524 +0900
>> @@ -1043,7 +1043,7 @@ int offline_memory(u64 start, u64 size)
>> return 0;
>> }
>>
>> -int remove_memory(int nid, u64 start, u64 size)
>> +int __ref remove_memory(int nid, u64 start, u64 size)
>> {
>> int ret = 0;
>> lock_memory_hotplug();
>> @@ -1056,8 +1056,13 @@ int remove_memory(int nid, u64 start, u6
>> "because the memmory range is online\n",
>> start, start + size);
>> ret = -EAGAIN;
>> + goto out;
>> }
>>
>> + /* remove memmap entry */
>> + firmware_map_remove(start, start + size, "System RAM");
>> +
>> +out:
>> unlock_memory_hotplug();
>> return ret;
>> }
>
>
> Other than that, looks ok to me.
>

2012-10-12 19:28:48

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 5/10] memory-hotplug : memory-hotplug: check page type in get_page_bootmem

On Thu, Oct 4, 2012 at 10:32 PM, Yasuaki Ishimatsu
<[email protected]> wrote:
> The function get_page_bootmem() may be called more than one time to the same
> page. There is no need to set page's type, private if the function is not
> the first time called to the page.
>
> Note: the patch is just optimization and does not fix any problem.
>
> CC: David Rientjes <[email protected]>
> CC: Jiang Liu <[email protected]>
> CC: Len Brown <[email protected]>
> CC: Christoph Lameter <[email protected]>
> Cc: Minchan Kim <[email protected]>
> CC: Andrew Morton <[email protected]>
> CC: KOSAKI Motohiro <[email protected]>
> CC: Wen Congyang <[email protected]>
> Signed-off-by: Yasuaki Ishimatsu <[email protected]>
> ---
> mm/memory_hotplug.c | 15 +++++++++++----
> 1 file changed, 11 insertions(+), 4 deletions(-)
>
> Index: linux-3.6/mm/memory_hotplug.c
> ===================================================================
> --- linux-3.6.orig/mm/memory_hotplug.c 2012-10-04 18:29:58.284676075 +0900
> +++ linux-3.6/mm/memory_hotplug.c 2012-10-04 18:30:03.454680542 +0900
> @@ -95,10 +95,17 @@ static void release_memory_resource(stru
> static void get_page_bootmem(unsigned long info, struct page *page,
> unsigned long type)
> {
> - page->lru.next = (struct list_head *) type;
> - SetPagePrivate(page);
> - set_page_private(page, info);
> - atomic_inc(&page->_count);
> + unsigned long page_type;
> +
> + page_type = (unsigned long)page->lru.next;

If I understand correctly, page->lru.next might be uninitialized yet.

Moreover, I have no seen any good effect in this patch. I don't understand
why we need to increase code complexity.



> + if (page_type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
> + page_type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE){
> + page->lru.next = (struct list_head *)type;
> + SetPagePrivate(page);
> + set_page_private(page, info);
> + atomic_inc(&page->_count);
> + } else
> + atomic_inc(&page->_count);
> }
>
> /* reference to __meminit __free_pages_bootmem is valid
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2012-10-19 00:50:33

by Yasuaki Ishimatsu

[permalink] [raw]
Subject: Re: [PATCH 5/10] memory-hotplug : memory-hotplug: check page type in get_page_bootmem

Hi Kosaki,

Sorry for late reply.

2012/10/13 4:28, KOSAKI Motohiro wrote:
> On Thu, Oct 4, 2012 at 10:32 PM, Yasuaki Ishimatsu
> <[email protected]> wrote:
>> The function get_page_bootmem() may be called more than one time to the same
>> page. There is no need to set page's type, private if the function is not
>> the first time called to the page.
>>
>> Note: the patch is just optimization and does not fix any problem.
>>
>> CC: David Rientjes <[email protected]>
>> CC: Jiang Liu <[email protected]>
>> CC: Len Brown <[email protected]>
>> CC: Christoph Lameter <[email protected]>
>> Cc: Minchan Kim <[email protected]>
>> CC: Andrew Morton <[email protected]>
>> CC: KOSAKI Motohiro <[email protected]>
>> CC: Wen Congyang <[email protected]>
>> Signed-off-by: Yasuaki Ishimatsu <[email protected]>
>> ---
>> mm/memory_hotplug.c | 15 +++++++++++----
>> 1 file changed, 11 insertions(+), 4 deletions(-)
>>
>> Index: linux-3.6/mm/memory_hotplug.c
>> ===================================================================
>> --- linux-3.6.orig/mm/memory_hotplug.c 2012-10-04 18:29:58.284676075 +0900
>> +++ linux-3.6/mm/memory_hotplug.c 2012-10-04 18:30:03.454680542 +0900
>> @@ -95,10 +95,17 @@ static void release_memory_resource(stru
>> static void get_page_bootmem(unsigned long info, struct page *page,
>> unsigned long type)
>> {
>> - page->lru.next = (struct list_head *) type;
>> - SetPagePrivate(page);
>> - set_page_private(page, info);
>> - atomic_inc(&page->_count);
>> + unsigned long page_type;
>> +
>> + page_type = (unsigned long)page->lru.next;
>
> If I understand correctly, page->lru.next might be uninitialized yet.

Ah yes. I was misunderstanding...

Hi Wen,

When you update the physical hot remove patch-set, please drop the patch.

Thanks,
Yasuaki Ishimatsu

> Moreover, I have no seen any good effect in this patch. I don't understand
> why we need to increase code complexity.
>
>
>
>> + if (page_type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
>> + page_type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE){
>> + page->lru.next = (struct list_head *)type;
>> + SetPagePrivate(page);
>> + set_page_private(page, info);
>> + atomic_inc(&page->_count);
>> + } else
>> + atomic_inc(&page->_count);
>> }
>>
>> /* reference to __meminit __free_pages_bootmem is valid
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to [email protected]. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2012-10-19 01:49:57

by Wen Congyang

[permalink] [raw]
Subject: Re: [PATCH 5/10] memory-hotplug : memory-hotplug: check page type in get_page_bootmem

At 10/19/2012 08:49 AM, Yasuaki Ishimatsu Wrote:
> Hi Kosaki,
>
> Sorry for late reply.
>
> 2012/10/13 4:28, KOSAKI Motohiro wrote:
>> On Thu, Oct 4, 2012 at 10:32 PM, Yasuaki Ishimatsu
>> <[email protected]> wrote:
>>> The function get_page_bootmem() may be called more than one time to
>>> the same
>>> page. There is no need to set page's type, private if the function is
>>> not
>>> the first time called to the page.
>>>
>>> Note: the patch is just optimization and does not fix any problem.
>>>
>>> CC: David Rientjes <[email protected]>
>>> CC: Jiang Liu <[email protected]>
>>> CC: Len Brown <[email protected]>
>>> CC: Christoph Lameter <[email protected]>
>>> Cc: Minchan Kim <[email protected]>
>>> CC: Andrew Morton <[email protected]>
>>> CC: KOSAKI Motohiro <[email protected]>
>>> CC: Wen Congyang <[email protected]>
>>> Signed-off-by: Yasuaki Ishimatsu <[email protected]>
>>> ---
>>> mm/memory_hotplug.c | 15 +++++++++++----
>>> 1 file changed, 11 insertions(+), 4 deletions(-)
>>>
>>> Index: linux-3.6/mm/memory_hotplug.c
>>> ===================================================================
>>> --- linux-3.6.orig/mm/memory_hotplug.c 2012-10-04 18:29:58.284676075
>>> +0900
>>> +++ linux-3.6/mm/memory_hotplug.c 2012-10-04 18:30:03.454680542
>>> +0900
>>> @@ -95,10 +95,17 @@ static void release_memory_resource(stru
>>> static void get_page_bootmem(unsigned long info, struct page *page,
>>> unsigned long type)
>>> {
>>> - page->lru.next = (struct list_head *) type;
>>> - SetPagePrivate(page);
>>> - set_page_private(page, info);
>>> - atomic_inc(&page->_count);
>>> + unsigned long page_type;
>>> +
>>> + page_type = (unsigned long)page->lru.next;
>>
>> If I understand correctly, page->lru.next might be uninitialized yet.
>
> Ah yes. I was misunderstanding...
>
> Hi Wen,
>
> When you update the physical hot remove patch-set, please drop the patch.

OK

Thanks
Wen Congyang

>
> Thanks,
> Yasuaki Ishimatsu
>> Moreover, I have no seen any good effect in this patch. I don't
>> understand
>> why we need to increase code complexity.
>>
>>
>>
>>> + if (page_type < MEMORY_HOTPLUG_MIN_BOOTMEM_TYPE ||
>>> + page_type > MEMORY_HOTPLUG_MAX_BOOTMEM_TYPE){
>>> + page->lru.next = (struct list_head *)type;
>>> + SetPagePrivate(page);
>>> + set_page_private(page, info);
>>> + atomic_inc(&page->_count);
>>> + } else
>>> + atomic_inc(&page->_count);
>>> }
>>>
>>> /* reference to __meminit __free_pages_bootmem is valid
>>>
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>> the body to [email protected]. For more info on Linux MM,
>>> see: http://www.linux-mm.org/ .
>>> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>> --
>> To unsubscribe from this list: send the line "unsubscribe
>> linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2012-10-19 10:39:08

by Wen Congyang

[permalink] [raw]
Subject: Re: [PATCH 1/10] memory-hotplug : check whether memory is offline or not when removing memory

At 10/06/2012 03:27 AM, KOSAKI Motohiro Wrote:
> On Thu, Oct 4, 2012 at 10:25 PM, Yasuaki Ishimatsu
> <[email protected]> wrote:
>> When calling remove_memory(), the memory should be offline. If the function
>> is used to online memory, kernel panic may occur.
>>
>> So the patch checks whether memory is offline or not.
>
> You don't explain WHY we need the check.

This patch is no necessary now, because the newest kernel has checked
it.

Thanks
Wen Congyang

>
>
>> CC: David Rientjes <[email protected]>
>> CC: Jiang Liu <[email protected]>
>> CC: Len Brown <[email protected]>
>> CC: Christoph Lameter <[email protected]>
>> Cc: Minchan Kim <[email protected]>
>> CC: Andrew Morton <[email protected]>
>> CC: KOSAKI Motohiro <[email protected]>
>> Signed-off-by: Wen Congyang <[email protected]>
>> Signed-off-by: Yasuaki Ishimatsu <[email protected]>
>>
>> ---
>> drivers/base/memory.c | 39 +++++++++++++++++++++++++++++++++++++++
>> include/linux/memory.h | 5 +++++
>> mm/memory_hotplug.c | 17 +++++++++++++++--
>> 3 files changed, 59 insertions(+), 2 deletions(-)
>>
>> Index: linux-3.6/drivers/base/memory.c
>> ===================================================================
>> --- linux-3.6.orig/drivers/base/memory.c 2012-10-04 14:22:57.000000000 +0900
>> +++ linux-3.6/drivers/base/memory.c 2012-10-04 14:45:46.653585860 +0900
>> @@ -70,6 +70,45 @@ void unregister_memory_isolate_notifier(
>> }
>> EXPORT_SYMBOL(unregister_memory_isolate_notifier);
>>
>> +bool is_memblk_offline(unsigned long start, unsigned long size)
>
> Don't use memblk. Usually memblk mean struct numa_meminfo for x86/numa.
> Maybe memory_range_offlined() is better.
>
> And, this function don't take struct memory_block, then this file may be no good
> place.
>
> And you need to write down function comment.
>
>
>> +{
>> + struct memory_block *mem = NULL;
>> + struct mem_section *section;
>> + unsigned long start_pfn, end_pfn;
>> + unsigned long pfn, section_nr;
>> +
>> + start_pfn = PFN_DOWN(start);
>> + end_pfn = PFN_UP(start + size);
>> +
>> + for (pfn = start_pfn; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
>> + section_nr = pfn_to_section_nr(pfn);
>> + if (!present_section_nr(section_nr))
>> + continue;
>> +
>> + section = __nr_to_section(section_nr);
>> + /* same memblock? */
>> + if (mem)
>> + if ((section_nr >= mem->start_section_nr) &&
>> + (section_nr <= mem->end_section_nr))
>> + continue;
>> +
>> + mem = find_memory_block_hinted(section, mem);
>> + if (!mem)
>> + continue;
>> + if (mem->state == MEM_OFFLINE)
>> + continue;
>> +
>> + kobject_put(&mem->dev.kobj);
>> + return false;
>> + }
>> +
>> + if (mem)
>> + kobject_put(&mem->dev.kobj);
>> +
>> + return true;
>> +}
>> +EXPORT_SYMBOL(is_memblk_offline);
>> +
>> /*
>> * register_memory - Setup a sysfs device for a memory block
>> */
>> Index: linux-3.6/include/linux/memory.h
>> ===================================================================
>> --- linux-3.6.orig/include/linux/memory.h 2012-10-02 18:00:22.000000000 +0900
>> +++ linux-3.6/include/linux/memory.h 2012-10-04 14:44:40.902581028 +0900
>> @@ -106,6 +106,10 @@ static inline int memory_isolate_notify(
>> {
>> return 0;
>> }
>> +static inline bool is_memblk_offline(unsigned long start, unsigned long size)
>> +{
>> + return false;
>> +}
>> #else
>> extern int register_memory_notifier(struct notifier_block *nb);
>> extern void unregister_memory_notifier(struct notifier_block *nb);
>> @@ -120,6 +124,7 @@ extern int memory_isolate_notify(unsigne
>> extern struct memory_block *find_memory_block_hinted(struct mem_section *,
>> struct memory_block *);
>> extern struct memory_block *find_memory_block(struct mem_section *);
>> +extern bool is_memblk_offline(unsigned long start, unsigned long size);
>> #define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION<<PAGE_SHIFT)
>> enum mem_add_context { BOOT, HOTPLUG };
>> #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
>> Index: linux-3.6/mm/memory_hotplug.c
>> ===================================================================
>> --- linux-3.6.orig/mm/memory_hotplug.c 2012-10-04 14:31:08.000000000 +0900
>> +++ linux-3.6/mm/memory_hotplug.c 2012-10-04 14:58:22.449687986 +0900
>> @@ -1045,8 +1045,21 @@ int offline_memory(u64 start, u64 size)
>>
>> int remove_memory(int nid, u64 start, u64 size)
>> {
>
> Your remove_memory() don't remove anything. that's strange.
>
>
>> - /* It is not implemented yet*/
>> - return 0;
>> + int ret = 0;
>> + lock_memory_hotplug();
>> + /*
>> + * The memory might become online by other task, even if you offine it.
>> + * So we check whether the memory has been onlined or not.
>> + */
>> + if (!is_memblk_offline(start, size)) {
>> + pr_warn("memory removing [mem %#010llx-%#010llx] failed, "
>> + "because the memmory range is online\n",
>> + start, start + size);
>
> No good warning. You should output which memory block can't be
> offlined, I think.
>
>
>> + ret = -EAGAIN;
>> + }
>> +
>> + unlock_memory_hotplug();
>> + return ret;
>> }
>> EXPORT_SYMBOL_GPL(remove_memory);
>> #else
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to [email protected]. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>

2012-10-19 14:16:04

by Wen Congyang

[permalink] [raw]
Subject: Re: [PATCH 1/10] memory-hotplug : check whether memory is offline or not when removing memory

At 2012/10/19 18:44, Wen Congyang Wrote:
> At 10/06/2012 03:27 AM, KOSAKI Motohiro Wrote:
>> On Thu, Oct 4, 2012 at 10:25 PM, Yasuaki Ishimatsu
>> <[email protected]> wrote:
>>> When calling remove_memory(), the memory should be offline. If the function
>>> is used to online memory, kernel panic may occur.
>>>
>>> So the patch checks whether memory is offline or not.
>>
>> You don't explain WHY we need the check.
>
> This patch is no necessary now, because the newest kernel has checked
> it.

I think it again, and found that this check is necessary. Because we only
lock memory hotplug when offlining pages. Here is the steps to offline and
remove memory:

1. lock memory hotplug
2. offline a memory section
3. unlock memory hotplug
4. repeat 1-3 to offline all memory sections
5. lock memory hotplug
6. remove memory
7. unlock memory hotplug

All memory sections must be offlined before removing memory. But we
don't hold
the lock in the whole operation. So we should check whether all memory
sections
are offlined before step6.

>
> Thanks
> Wen Congyang
>
>>
>>
>>> CC: David Rientjes<[email protected]>
>>> CC: Jiang Liu<[email protected]>
>>> CC: Len Brown<[email protected]>
>>> CC: Christoph Lameter<[email protected]>
>>> Cc: Minchan Kim<[email protected]>
>>> CC: Andrew Morton<[email protected]>
>>> CC: KOSAKI Motohiro<[email protected]>
>>> Signed-off-by: Wen Congyang<[email protected]>
>>> Signed-off-by: Yasuaki Ishimatsu<[email protected]>
>>>
>>> ---
>>> drivers/base/memory.c | 39 +++++++++++++++++++++++++++++++++++++++
>>> include/linux/memory.h | 5 +++++
>>> mm/memory_hotplug.c | 17 +++++++++++++++--
>>> 3 files changed, 59 insertions(+), 2 deletions(-)
>>>
>>> Index: linux-3.6/drivers/base/memory.c
>>> ===================================================================
>>> --- linux-3.6.orig/drivers/base/memory.c 2012-10-04 14:22:57.000000000 +0900
>>> +++ linux-3.6/drivers/base/memory.c 2012-10-04 14:45:46.653585860 +0900
>>> @@ -70,6 +70,45 @@ void unregister_memory_isolate_notifier(
>>> }
>>> EXPORT_SYMBOL(unregister_memory_isolate_notifier);
>>>
>>> +bool is_memblk_offline(unsigned long start, unsigned long size)
>>
>> Don't use memblk. Usually memblk mean struct numa_meminfo for x86/numa.
>> Maybe memory_range_offlined() is better.
>>
>> And, this function don't take struct memory_block, then this file may be no good
>> place.
>>
>> And you need to write down function comment.
>>
>>
>>> +{
>>> + struct memory_block *mem = NULL;
>>> + struct mem_section *section;
>>> + unsigned long start_pfn, end_pfn;
>>> + unsigned long pfn, section_nr;
>>> +
>>> + start_pfn = PFN_DOWN(start);
>>> + end_pfn = PFN_UP(start + size);
>>> +
>>> + for (pfn = start_pfn; pfn< end_pfn; pfn += PAGES_PER_SECTION) {
>>> + section_nr = pfn_to_section_nr(pfn);
>>> + if (!present_section_nr(section_nr))
>>> + continue;
>>> +
>>> + section = __nr_to_section(section_nr);
>>> + /* same memblock? */
>>> + if (mem)
>>> + if ((section_nr>= mem->start_section_nr)&&
>>> + (section_nr<= mem->end_section_nr))
>>> + continue;
>>> +
>>> + mem = find_memory_block_hinted(section, mem);
>>> + if (!mem)
>>> + continue;
>>> + if (mem->state == MEM_OFFLINE)
>>> + continue;
>>> +
>>> + kobject_put(&mem->dev.kobj);
>>> + return false;
>>> + }
>>> +
>>> + if (mem)
>>> + kobject_put(&mem->dev.kobj);
>>> +
>>> + return true;
>>> +}
>>> +EXPORT_SYMBOL(is_memblk_offline);
>>> +
>>> /*
>>> * register_memory - Setup a sysfs device for a memory block
>>> */
>>> Index: linux-3.6/include/linux/memory.h
>>> ===================================================================
>>> --- linux-3.6.orig/include/linux/memory.h 2012-10-02 18:00:22.000000000 +0900
>>> +++ linux-3.6/include/linux/memory.h 2012-10-04 14:44:40.902581028 +0900
>>> @@ -106,6 +106,10 @@ static inline int memory_isolate_notify(
>>> {
>>> return 0;
>>> }
>>> +static inline bool is_memblk_offline(unsigned long start, unsigned long size)
>>> +{
>>> + return false;
>>> +}
>>> #else
>>> extern int register_memory_notifier(struct notifier_block *nb);
>>> extern void unregister_memory_notifier(struct notifier_block *nb);
>>> @@ -120,6 +124,7 @@ extern int memory_isolate_notify(unsigne
>>> extern struct memory_block *find_memory_block_hinted(struct mem_section *,
>>> struct memory_block *);
>>> extern struct memory_block *find_memory_block(struct mem_section *);
>>> +extern bool is_memblk_offline(unsigned long start, unsigned long size);
>>> #define CONFIG_MEM_BLOCK_SIZE (PAGES_PER_SECTION<<PAGE_SHIFT)
>>> enum mem_add_context { BOOT, HOTPLUG };
>>> #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
>>> Index: linux-3.6/mm/memory_hotplug.c
>>> ===================================================================
>>> --- linux-3.6.orig/mm/memory_hotplug.c 2012-10-04 14:31:08.000000000 +0900
>>> +++ linux-3.6/mm/memory_hotplug.c 2012-10-04 14:58:22.449687986 +0900
>>> @@ -1045,8 +1045,21 @@ int offline_memory(u64 start, u64 size)
>>>
>>> int remove_memory(int nid, u64 start, u64 size)
>>> {
>>
>> Your remove_memory() don't remove anything. that's strange.

IIUC, this batch is based on another patchset.

>>
>>
>>> - /* It is not implemented yet*/
>>> - return 0;
>>> + int ret = 0;
>>> + lock_memory_hotplug();
>>> + /*
>>> + * The memory might become online by other task, even if you offine it.
>>> + * So we check whether the memory has been onlined or not.
>>> + */
>>> + if (!is_memblk_offline(start, size)) {
>>> + pr_warn("memory removing [mem %#010llx-%#010llx] failed, "
>>> + "because the memmory range is online\n",
>>> + start, start + size);
>>
>> No good warning. You should output which memory block can't be
>> offlined, I think.

OK. I'll update it.

Thanks
Wen Congyang

>>
>>
>>> + ret = -EAGAIN;
>>> + }
>>> +
>>> + unlock_memory_hotplug();
>>> + return ret;
>>> }
>>> EXPORT_SYMBOL_GPL(remove_memory);
>>> #else
>>>
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>> the body to [email protected]. For more info on Linux MM,
>>> see: http://www.linux-mm.org/ .
>>> Don't email:<a href=mailto:"[email protected]"> [email protected]</a>
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2012-10-19 18:33:44

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 1/10] memory-hotplug : check whether memory is offline or not when removing memory

> I think it again, and found that this check is necessary. Because we only
> lock memory hotplug when offlining pages. Here is the steps to offline and
> remove memory:
>
> 1. lock memory hotplug
> 2. offline a memory section
> 3. unlock memory hotplug
> 4. repeat 1-3 to offline all memory sections
> 5. lock memory hotplug
> 6. remove memory
> 7. unlock memory hotplug
>
> All memory sections must be offlined before removing memory. But we don't
> hold
> the lock in the whole operation. So we should check whether all memory
> sections
> are offlined before step6.

You should describe the race scenario in the patch description. OK?

2012-10-20 01:47:45

by Wen Congyang

[permalink] [raw]
Subject: Re: [PATCH 1/10] memory-hotplug : check whether memory is offline or not when removing memory

At 10/20/2012 02:33 AM, KOSAKI Motohiro Wrote:
>> I think it again, and found that this check is necessary. Because we only
>> lock memory hotplug when offlining pages. Here is the steps to offline and
>> remove memory:
>>
>> 1. lock memory hotplug
>> 2. offline a memory section
>> 3. unlock memory hotplug
>> 4. repeat 1-3 to offline all memory sections
>> 5. lock memory hotplug
>> 6. remove memory
>> 7. unlock memory hotplug
>>
>> All memory sections must be offlined before removing memory. But we don't
>> hold
>> the lock in the whole operation. So we should check whether all memory
>> sections
>> are offlined before step6.
>
> You should describe the race scenario in the patch description. OK?
>

OK

Thanks
Wen Congyang

2012-10-22 07:48:24

by Wen Congyang

[permalink] [raw]
Subject: Re: [PATCH 8/10] memory-hotplug : remove page table of x86_64 architecture

Hi, Wu

Sorry for late reply.

At 10/09/2012 04:26 PM, wujianguo Wrote:
> Hi Congyang,
> I think we should also free pages which are used by page tables after removing
> page tables of the memory.

It is OK to do it.

>
> From: Jianguo Wu <[email protected]>
>
> Signed-off-by: Jianguo Wu <[email protected]>
> Signed-off-by: Jiang Liu <[email protected]>
> ---
> arch/x86/mm/init_64.c | 110 +++++++++++++++++++++++++++++++++++++++---------
> 1 files changed, 89 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 5596dfa..81f9c3b 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -675,6 +675,74 @@ int arch_add_memory(int nid, u64 start, u64 size)
> }
> EXPORT_SYMBOL_GPL(arch_add_memory);
>
> +static inline void free_pagetable(struct page *page)
> +{
> + struct zone *zone;
> +
> + __ClearPageReserved(page);
> + __free_page(page);
> +
> + zone = page_zone(page);
> + zone_span_writelock(zone);
> + zone->present_pages++;
> + zone_span_writeunlock(zone);
> + totalram_pages++;

Why do you update zone and totalram_pages here?

> +}
> +
> +static void free_pte_table(pte_t *pte_start, pmd_t *pmd)
> +{
> + pte_t *pte;
> + int i;
> +
> + for (i = 0; i < PTRS_PER_PTE; i++) {
> + pte = pte_start + i;
> + if (pte_val(*pte))
> + break;
> + }
> +
> + /* free a pte talbe */
> + if (i == PTRS_PER_PTE) {
> + free_pagetable(pmd_page(*pmd));

The memory may be allocated at booting. So it is very dangerous to
free it without any check.

> + pmd_clear(pmd);
> + }
> +}
> +
> +static void free_pmd_table(pmd_t *pmd_start, pud_t *pud)
> +{
> + pmd_t *pmd;
> + int i;
> +
> + for (i = 0; i < PTRS_PER_PMD; i++) {
> + pmd = pmd_start + i;
> + if (pmd_val(*pmd))
> + break;
> + }
> +
> + /* free a pmd talbe */
> + if (i == PTRS_PER_PMD) {
> + free_pagetable(pud_page(*pud));
> + pud_clear(pud);
> + }
> +}
> +
> +static void free_pud_table(pud_t *pud_start, pgd_t *pgd)
> +{
> + pud_t *pud;
> + int i;
> +
> + for (i = 0; i < PTRS_PER_PUD; i++) {
> + pud = pud_start + i;
> + if (pud_val(*pud))
> + break;
> + }
> +
> + /* free a pud table */
> + if (i == PTRS_PER_PUD) {
> + free_pagetable(pgd_page(*pgd));
> + pgd_clear(pgd);
> + }
> +}
> +
> static void __meminit
> phys_pte_remove(pte_t *pte_page, unsigned long addr, unsigned long end)
> {
> @@ -704,21 +772,19 @@ phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)
> unsigned long pages = 0, next;
> int i = pmd_index(addr);
>
> - for (; i < PTRS_PER_PMD; i++, addr = next) {
> + for (; i < PTRS_PER_PMD && addr < end; i++, addr = next) {
> unsigned long pte_phys;
> pmd_t *pmd = pmd_page + pmd_index(addr);
> pte_t *pte;
>
> - if (addr >= end)
> - break;
> -
> - next = (addr & PMD_MASK) + PMD_SIZE;
> + next = pmd_addr_end(addr, end);
>
> if (!pmd_present(*pmd))
> continue;
>
> if (pmd_large(*pmd)) {
> - if ((addr & ~PMD_MASK) == 0 && next <= end) {
> + if (IS_ALIGNED(addr, PMD_SIZE) &&
> + IS_ALIGNED(next, PMD_SIZE)) {
> set_pmd(pmd, __pmd(0));
> pages++;
> continue;
> @@ -729,7 +795,8 @@ phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)
> * so split 2M page to 4K page.
> */
> pte = alloc_low_page(&pte_phys);
> - __split_large_page((pte_t *)pmd, addr, pte);
> + __split_large_page((pte_t *)pmd,
> + (unsigned long)__va(addr), pte);
>
> spin_lock(&init_mm.page_table_lock);
> pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
> @@ -738,7 +805,8 @@ phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)
>
> spin_lock(&init_mm.page_table_lock);
> pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
> - phys_pte_remove(pte, addr, end);
> + phys_pte_remove(pte, addr, next);
> + free_pte_table(pte, pmd);
> unmap_low_page(pte);
> spin_unlock(&init_mm.page_table_lock);
> }
> @@ -751,21 +819,19 @@ phys_pud_remove(pud_t *pud_page, unsigned long addr, unsigned long end)
> unsigned long pages = 0, next;
> int i = pud_index(addr);
>
> - for (; i < PTRS_PER_PUD; i++, addr = next) {
> + for (; i < PTRS_PER_PUD && addr < end; i++, addr = next) {
> unsigned long pmd_phys;
> pud_t *pud = pud_page + pud_index(addr);
> pmd_t *pmd;
>
> - if (addr >= end)
> - break;
> -
> - next = (addr & PUD_MASK) + PUD_SIZE;
> + next = pud_addr_end(addr, end);
>
> if (!pud_present(*pud))
> continue;
>
> if (pud_large(*pud)) {
> - if ((addr & ~PUD_MASK) == 0 && next <= end) {
> + if (IS_ALIGNED(addr, PUD_SIZE) &&
> + IS_ALIGNED(next, PUD_SIZE)) {
> set_pud(pud, __pud(0));
> pages++;
> continue;
> @@ -776,15 +842,18 @@ phys_pud_remove(pud_t *pud_page, unsigned long addr, unsigned long end)
> * so split 1G page to 2M page.
> */
> pmd = alloc_low_page(&pmd_phys);
> - __split_large_page((pte_t *)pud, addr, (pte_t *)pmd);
> + __split_large_page((pte_t *)pud,
> + (unsigned long)__va(addr),
> + (pte_t *)pmd);
>
> spin_lock(&init_mm.page_table_lock);
> pud_populate(&init_mm, pud, __va(pmd_phys));
> spin_unlock(&init_mm.page_table_lock);
> }
>
> - pmd = map_low_page(pmd_offset(pud, 0));
> - phys_pmd_remove(pmd, addr, end);
> + pmd = map_low_page((pmd_t *)pud_page_vaddr(*pud));

Hmm, pmd_offset(pud, 0) is equal to (pmd_t *)pud_page_vaddr(*pud).

Is it OK to merge your patch into my patch?

Thanks
Wen Congyang

> + phys_pmd_remove(pmd, addr, next);
> + free_pmd_table(pmd, pud);
> unmap_low_page(pmd);
> __flush_tlb_all();
> }
> @@ -805,15 +874,14 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
> pgd_t *pgd = pgd_offset_k(start);
> pud_t *pud;
>
> - next = (start + PGDIR_SIZE) & PGDIR_MASK;
> - if (next > end)
> - next = end;
> + next = pgd_addr_end(start, end);
>
> if (!pgd_present(*pgd))
> continue;
>
> pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> - phys_pud_remove(pud, __pa(start), __pa(end));
> + phys_pud_remove(pud, __pa(start), __pa(next));
> + free_pud_table(pud, pgd);
> unmap_low_page(pud);
> }
>
> -- 1.7.6.1 .
>
>
> On 2012-10-5 10:36, Yasuaki Ishimatsu wrote:
>> From: Wen Congyang <[email protected]>
>>
>> For hot removing memory, we sholud remove page table about the memory.
>> So the patch searches a page table about the removed memory, and clear
>> page table.
>>
>> CC: David Rientjes <[email protected]>
>> CC: Jiang Liu <[email protected]>
>> CC: Len Brown <[email protected]>
>> CC: Christoph Lameter <[email protected]>
>> Cc: Minchan Kim <[email protected]>
>> CC: Andrew Morton <[email protected]>
>> CC: KOSAKI Motohiro <[email protected]>
>> CC: Yasuaki Ishimatsu <[email protected]>
>> Signed-off-by: Wen Congyang <[email protected]>
>> ---
>> arch/x86/include/asm/pgtable_types.h | 1
>> arch/x86/mm/init_64.c | 147 +++++++++++++++++++++++++++++++++++
>> arch/x86/mm/pageattr.c | 47 +++++------
>> 3 files changed, 173 insertions(+), 22 deletions(-)
>>
>> Index: linux-3.6/arch/x86/mm/init_64.c
>> ===================================================================
>> --- linux-3.6.orig/arch/x86/mm/init_64.c 2012-10-04 18:30:21.171698416 +0900
>> +++ linux-3.6/arch/x86/mm/init_64.c 2012-10-04 18:30:27.317704652 +0900
>> @@ -675,6 +675,151 @@ int arch_add_memory(int nid, u64 start,
>> }
>> EXPORT_SYMBOL_GPL(arch_add_memory);
>>
>> +static void __meminit
>> +phys_pte_remove(pte_t *pte_page, unsigned long addr, unsigned long end)
>> +{
>> + unsigned pages = 0;
>> + int i = pte_index(addr);
>> +
>> + pte_t *pte = pte_page + pte_index(addr);
>> +
>> + for (; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
>> +
>> + if (addr >= end)
>> + break;
>> +
>> + if (!pte_present(*pte))
>> + continue;
>> +
>> + pages++;
>> + set_pte(pte, __pte(0));
>> + }
>> +
>> + update_page_count(PG_LEVEL_4K, -pages);
>> +}
>> +
>> +static void __meminit
>> +phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)
>> +{
>> + unsigned long pages = 0, next;
>> + int i = pmd_index(addr);
>> +
>> + for (; i < PTRS_PER_PMD; i++, addr = next) {
>> + unsigned long pte_phys;
>> + pmd_t *pmd = pmd_page + pmd_index(addr);
>> + pte_t *pte;
>> +
>> + if (addr >= end)
>> + break;
>> +
>> + next = (addr & PMD_MASK) + PMD_SIZE;
>> +
>> + if (!pmd_present(*pmd))
>> + continue;
>> +
>> + if (pmd_large(*pmd)) {
>> + if ((addr & ~PMD_MASK) == 0 && next <= end) {
>> + set_pmd(pmd, __pmd(0));
>> + pages++;
>> + continue;
>> + }
>> +
>> + /*
>> + * We use 2M page, but we need to remove part of them,
>> + * so split 2M page to 4K page.
>> + */
>> + pte = alloc_low_page(&pte_phys);
>> + __split_large_page((pte_t *)pmd, addr, pte);
>> +
>> + spin_lock(&init_mm.page_table_lock);
>> + pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
>> + spin_unlock(&init_mm.page_table_lock);
>> + }
>> +
>> + spin_lock(&init_mm.page_table_lock);
>> + pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
>> + phys_pte_remove(pte, addr, end);
>> + unmap_low_page(pte);
>> + spin_unlock(&init_mm.page_table_lock);
>> + }
>> + update_page_count(PG_LEVEL_2M, -pages);
>> +}
>> +
>> +static void __meminit
>> +phys_pud_remove(pud_t *pud_page, unsigned long addr, unsigned long end)
>> +{
>> + unsigned long pages = 0, next;
>> + int i = pud_index(addr);
>> +
>> + for (; i < PTRS_PER_PUD; i++, addr = next) {
>> + unsigned long pmd_phys;
>> + pud_t *pud = pud_page + pud_index(addr);
>> + pmd_t *pmd;
>> +
>> + if (addr >= end)
>> + break;
>> +
>> + next = (addr & PUD_MASK) + PUD_SIZE;
>> +
>> + if (!pud_present(*pud))
>> + continue;
>> +
>> + if (pud_large(*pud)) {
>> + if ((addr & ~PUD_MASK) == 0 && next <= end) {
>> + set_pud(pud, __pud(0));
>> + pages++;
>> + continue;
>> + }
>> +
>> + /*
>> + * We use 1G page, but we need to remove part of them,
>> + * so split 1G page to 2M page.
>> + */
>> + pmd = alloc_low_page(&pmd_phys);
>> + __split_large_page((pte_t *)pud, addr, (pte_t *)pmd);
>> +
>> + spin_lock(&init_mm.page_table_lock);
>> + pud_populate(&init_mm, pud, __va(pmd_phys));
>> + spin_unlock(&init_mm.page_table_lock);
>> + }
>> +
>> + pmd = map_low_page(pmd_offset(pud, 0));
>> + phys_pmd_remove(pmd, addr, end);
>> + unmap_low_page(pmd);
>> + __flush_tlb_all();
>> + }
>> + __flush_tlb_all();
>> +
>> + update_page_count(PG_LEVEL_1G, -pages);
>> +}
>> +
>> +void __meminit
>> +kernel_physical_mapping_remove(unsigned long start, unsigned long end)
>> +{
>> + unsigned long next;
>> +
>> + start = (unsigned long)__va(start);
>> + end = (unsigned long)__va(end);
>> +
>> + for (; start < end; start = next) {
>> + pgd_t *pgd = pgd_offset_k(start);
>> + pud_t *pud;
>> +
>> + next = (start + PGDIR_SIZE) & PGDIR_MASK;
>> + if (next > end)
>> + next = end;
>> +
>> + if (!pgd_present(*pgd))
>> + continue;
>> +
>> + pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
>> + phys_pud_remove(pud, __pa(start), __pa(end));
>> + unmap_low_page(pud);
>> + }
>> +
>> + __flush_tlb_all();
>> +}
>> +
>> #ifdef CONFIG_MEMORY_HOTREMOVE
>> int __ref arch_remove_memory(u64 start, u64 size)
>> {
>> @@ -687,6 +832,8 @@ int __ref arch_remove_memory(u64 start,
>> ret = __remove_pages(zone, start_pfn, nr_pages);
>> WARN_ON_ONCE(ret);
>>
>> + kernel_physical_mapping_remove(start, start + size);
>> +
>> return ret;
>> }
>> #endif
>> Index: linux-3.6/arch/x86/include/asm/pgtable_types.h
>> ===================================================================
>> --- linux-3.6.orig/arch/x86/include/asm/pgtable_types.h 2012-10-04 18:26:51.925486954 +0900
>> +++ linux-3.6/arch/x86/include/asm/pgtable_types.h 2012-10-04 18:30:27.322704656 +0900
>> @@ -334,6 +334,7 @@ static inline void update_page_count(int
>> * as a pte too.
>> */
>> extern pte_t *lookup_address(unsigned long address, unsigned int *level);
>> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
>>
>> #endif /* !__ASSEMBLY__ */
>>
>> Index: linux-3.6/arch/x86/mm/pageattr.c
>> ===================================================================
>> --- linux-3.6.orig/arch/x86/mm/pageattr.c 2012-10-04 18:26:51.923486952 +0900
>> +++ linux-3.6/arch/x86/mm/pageattr.c 2012-10-04 18:30:27.328704662 +0900
>> @@ -501,21 +501,13 @@ out_unlock:
>> return do_split;
>> }
>>
>> -static int split_large_page(pte_t *kpte, unsigned long address)
>> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
>> {
>> unsigned long pfn, pfninc = 1;
>> unsigned int i, level;
>> - pte_t *pbase, *tmp;
>> + pte_t *tmp;
>> pgprot_t ref_prot;
>> - struct page *base;
>> -
>> - if (!debug_pagealloc)
>> - spin_unlock(&cpa_lock);
>> - base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
>> - if (!debug_pagealloc)
>> - spin_lock(&cpa_lock);
>> - if (!base)
>> - return -ENOMEM;
>> + struct page *base = virt_to_page(pbase);
>>
>> spin_lock(&pgd_lock);
>> /*
>> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte,
>> * up for us already:
>> */
>> tmp = lookup_address(address, &level);
>> - if (tmp != kpte)
>> - goto out_unlock;
>> + if (tmp != kpte) {
>> + spin_unlock(&pgd_lock);
>> + return 1;
>> + }
>>
>> - pbase = (pte_t *)page_address(base);
>> paravirt_alloc_pte(&init_mm, page_to_pfn(base));
>> ref_prot = pte_pgprot(pte_clrhuge(*kpte));
>> /*
>> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte,
>> * going on.
>> */
>> __flush_tlb_all();
>> + spin_unlock(&pgd_lock);
>>
>> - base = NULL;
>> + return 0;
>> +}
>>
>> -out_unlock:
>> - /*
>> - * If we dropped out via the lookup_address check under
>> - * pgd_lock then stick the page back into the pool:
>> - */
>> - if (base)
>> +static int split_large_page(pte_t *kpte, unsigned long address)
>> +{
>> + pte_t *pbase;
>> + struct page *base;
>> +
>> + if (!debug_pagealloc)
>> + spin_unlock(&cpa_lock);
>> + base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
>> + if (!debug_pagealloc)
>> + spin_lock(&cpa_lock);
>> + if (!base)
>> + return -ENOMEM;
>> +
>> + pbase = (pte_t *)page_address(base);
>> + if (__split_large_page(kpte, address, pbase))
>> __free_page(base);
>> - spin_unlock(&pgd_lock);
>>
>> return 0;
>> }
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to [email protected]. For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>>
>
>

2012-10-23 07:09:48

by Jianguo Wu

[permalink] [raw]
Subject: Re: [PATCH 8/10] memory-hotplug : remove page table of x86_64 architecture

On 2012-10-22 15:11, Wen Congyang wrote:
> Hi, Wu
>
> Sorry for late reply.
>
> At 10/09/2012 04:26 PM, wujianguo Wrote:
>> Hi Congyang,
>> I think we should also free pages which are used by page tables after removing
>> page tables of the memory.
>
> It is OK to do it.
>
>>
>> From: Jianguo Wu <[email protected]>
>>
>> Signed-off-by: Jianguo Wu <[email protected]>
>> Signed-off-by: Jiang Liu <[email protected]>
>> ---
>> arch/x86/mm/init_64.c | 110 +++++++++++++++++++++++++++++++++++++++---------
>> 1 files changed, 89 insertions(+), 21 deletions(-)
>>
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index 5596dfa..81f9c3b 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -675,6 +675,74 @@ int arch_add_memory(int nid, u64 start, u64 size)
>> }
>> EXPORT_SYMBOL_GPL(arch_add_memory);
>>
>> +static inline void free_pagetable(struct page *page)
>> +{
>> + struct zone *zone;
>> +
>> + __ClearPageReserved(page);
>> + __free_page(page);
>> +
>> + zone = page_zone(page);
>> + zone_span_writelock(zone);
>> + zone->present_pages++;
>> + zone_span_writeunlock(zone);
>> + totalram_pages++;
>
> Why do you update zone and totalram_pages here?
Sorry, I made a mistake here. Only if the page was allocated at booting, we should update
zone and totalram_pages(zone->present_pages and totalram_pages mean pages which are
managed by buddy system).

How about:
static inline void free_pagetable(struct page *page)
{
struct zone *zone;
bool bootmem = false;

/* bootmem page has reserved flag */
if (PageReserved(page)) {
__ClearPageReserved(page);
bootmem = true;
}

__free_page(page);

if (bootmem) {
zone = page_zone(page);
zone_span_writelock(zone);
zone->present_pages++;
zone_span_writeunlock(zone);
totalram_pages++;
}
}

>
>> +}
>> +
>> +static void free_pte_table(pte_t *pte_start, pmd_t *pmd)
>> +{
>> + pte_t *pte;
>> + int i;
>> +
>> + for (i = 0; i < PTRS_PER_PTE; i++) {
>> + pte = pte_start + i;
>> + if (pte_val(*pte))
>> + break;
>> + }
>> +
>> + /* free a pte talbe */
>> + if (i == PTRS_PER_PTE) {
>> + free_pagetable(pmd_page(*pmd));
>
> The memory may be allocated at booting. So it is very dangerous to
> free it without any check.
The page is only used by page table, so is safe to free it when all the page table
entries have been cleared, right?

>
>> + pmd_clear(pmd);
>> + }
>> +}
>> +
>> +static void free_pmd_table(pmd_t *pmd_start, pud_t *pud)
>> +{
>> + pmd_t *pmd;
>> + int i;
>> +
>> + for (i = 0; i < PTRS_PER_PMD; i++) {
>> + pmd = pmd_start + i;
>> + if (pmd_val(*pmd))
>> + break;
>> + }
>> +
>> + /* free a pmd talbe */
>> + if (i == PTRS_PER_PMD) {
>> + free_pagetable(pud_page(*pud));
>> + pud_clear(pud);
>> + }
>> +}
>> +
>> +static void free_pud_table(pud_t *pud_start, pgd_t *pgd)
>> +{
>> + pud_t *pud;
>> + int i;
>> +
>> + for (i = 0; i < PTRS_PER_PUD; i++) {
>> + pud = pud_start + i;
>> + if (pud_val(*pud))
>> + break;
>> + }
>> +
>> + /* free a pud table */
>> + if (i == PTRS_PER_PUD) {
>> + free_pagetable(pgd_page(*pgd));
>> + pgd_clear(pgd);
>> + }
>> +}
>> +
>> static void __meminit
>> phys_pte_remove(pte_t *pte_page, unsigned long addr, unsigned long end)
>> {
>> @@ -704,21 +772,19 @@ phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)
>> unsigned long pages = 0, next;
>> int i = pmd_index(addr);
>>
>> - for (; i < PTRS_PER_PMD; i++, addr = next) {
>> + for (; i < PTRS_PER_PMD && addr < end; i++, addr = next) {
>> unsigned long pte_phys;
>> pmd_t *pmd = pmd_page + pmd_index(addr);
>> pte_t *pte;
>>
>> - if (addr >= end)
>> - break;
>> -
>> - next = (addr & PMD_MASK) + PMD_SIZE;
>> + next = pmd_addr_end(addr, end);
>>
>> if (!pmd_present(*pmd))
>> continue;
>>
>> if (pmd_large(*pmd)) {
>> - if ((addr & ~PMD_MASK) == 0 && next <= end) {
>> + if (IS_ALIGNED(addr, PMD_SIZE) &&
>> + IS_ALIGNED(next, PMD_SIZE)) {
>> set_pmd(pmd, __pmd(0));
>> pages++;
>> continue;
>> @@ -729,7 +795,8 @@ phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)
>> * so split 2M page to 4K page.
>> */
>> pte = alloc_low_page(&pte_phys);
>> - __split_large_page((pte_t *)pmd, addr, pte);
>> + __split_large_page((pte_t *)pmd,
>> + (unsigned long)__va(addr), pte);
>>
>> spin_lock(&init_mm.page_table_lock);
>> pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
>> @@ -738,7 +805,8 @@ phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)
>>
>> spin_lock(&init_mm.page_table_lock);
>> pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
>> - phys_pte_remove(pte, addr, end);
>> + phys_pte_remove(pte, addr, next);
>> + free_pte_table(pte, pmd);
>> unmap_low_page(pte);
>> spin_unlock(&init_mm.page_table_lock);
>> }
>> @@ -751,21 +819,19 @@ phys_pud_remove(pud_t *pud_page, unsigned long addr, unsigned long end)
>> unsigned long pages = 0, next;
>> int i = pud_index(addr);
>>
>> - for (; i < PTRS_PER_PUD; i++, addr = next) {
>> + for (; i < PTRS_PER_PUD && addr < end; i++, addr = next) {
>> unsigned long pmd_phys;
>> pud_t *pud = pud_page + pud_index(addr);
>> pmd_t *pmd;
>>
>> - if (addr >= end)
>> - break;
>> -
>> - next = (addr & PUD_MASK) + PUD_SIZE;
>> + next = pud_addr_end(addr, end);
>>
>> if (!pud_present(*pud))
>> continue;
>>
>> if (pud_large(*pud)) {
>> - if ((addr & ~PUD_MASK) == 0 && next <= end) {
>> + if (IS_ALIGNED(addr, PUD_SIZE) &&
>> + IS_ALIGNED(next, PUD_SIZE)) {
>> set_pud(pud, __pud(0));
>> pages++;
>> continue;
>> @@ -776,15 +842,18 @@ phys_pud_remove(pud_t *pud_page, unsigned long addr, unsigned long end)
>> * so split 1G page to 2M page.
>> */
>> pmd = alloc_low_page(&pmd_phys);
>> - __split_large_page((pte_t *)pud, addr, (pte_t *)pmd);
>> + __split_large_page((pte_t *)pud,
>> + (unsigned long)__va(addr),
>> + (pte_t *)pmd);
>>
>> spin_lock(&init_mm.page_table_lock);
>> pud_populate(&init_mm, pud, __va(pmd_phys));
>> spin_unlock(&init_mm.page_table_lock);
>> }
>>
>> - pmd = map_low_page(pmd_offset(pud, 0));
>> - phys_pmd_remove(pmd, addr, end);
>> + pmd = map_low_page((pmd_t *)pud_page_vaddr(*pud));
>
> Hmm, pmd_offset(pud, 0) is equal to (pmd_t *)pud_page_vaddr(*pud).
>
> Is it OK to merge your patch into my patch?
>
Yes, sure.

Thanks,
Jianguo Wu

> Thanks
> Wen Congyang
>
>> + phys_pmd_remove(pmd, addr, next);
>> + free_pmd_table(pmd, pud);
>> unmap_low_page(pmd);
>> __flush_tlb_all();
>> }
>> @@ -805,15 +874,14 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
>> pgd_t *pgd = pgd_offset_k(start);
>> pud_t *pud;
>>
>> - next = (start + PGDIR_SIZE) & PGDIR_MASK;
>> - if (next > end)
>> - next = end;
>> + next = pgd_addr_end(start, end);
>>
>> if (!pgd_present(*pgd))
>> continue;
>>
>> pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
>> - phys_pud_remove(pud, __pa(start), __pa(end));
>> + phys_pud_remove(pud, __pa(start), __pa(next));
>> + free_pud_table(pud, pgd);
>> unmap_low_page(pud);
>> }
>>
>> -- 1.7.6.1 .
>>
>>
>> On 2012-10-5 10:36, Yasuaki Ishimatsu wrote:
>>> From: Wen Congyang <[email protected]>
>>>
>>> For hot removing memory, we sholud remove page table about the memory.
>>> So the patch searches a page table about the removed memory, and clear
>>> page table.
>>>
>>> CC: David Rientjes <[email protected]>
>>> CC: Jiang Liu <[email protected]>
>>> CC: Len Brown <[email protected]>
>>> CC: Christoph Lameter <[email protected]>
>>> Cc: Minchan Kim <[email protected]>
>>> CC: Andrew Morton <[email protected]>
>>> CC: KOSAKI Motohiro <[email protected]>
>>> CC: Yasuaki Ishimatsu <[email protected]>
>>> Signed-off-by: Wen Congyang <[email protected]>
>>> ---
>>> arch/x86/include/asm/pgtable_types.h | 1
>>> arch/x86/mm/init_64.c | 147 +++++++++++++++++++++++++++++++++++
>>> arch/x86/mm/pageattr.c | 47 +++++------
>>> 3 files changed, 173 insertions(+), 22 deletions(-)
>>>
>>> Index: linux-3.6/arch/x86/mm/init_64.c
>>> ===================================================================
>>> --- linux-3.6.orig/arch/x86/mm/init_64.c 2012-10-04 18:30:21.171698416 +0900
>>> +++ linux-3.6/arch/x86/mm/init_64.c 2012-10-04 18:30:27.317704652 +0900
>>> @@ -675,6 +675,151 @@ int arch_add_memory(int nid, u64 start,
>>> }
>>> EXPORT_SYMBOL_GPL(arch_add_memory);
>>>
>>> +static void __meminit
>>> +phys_pte_remove(pte_t *pte_page, unsigned long addr, unsigned long end)
>>> +{
>>> + unsigned pages = 0;
>>> + int i = pte_index(addr);
>>> +
>>> + pte_t *pte = pte_page + pte_index(addr);
>>> +
>>> + for (; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
>>> +
>>> + if (addr >= end)
>>> + break;
>>> +
>>> + if (!pte_present(*pte))
>>> + continue;
>>> +
>>> + pages++;
>>> + set_pte(pte, __pte(0));
>>> + }
>>> +
>>> + update_page_count(PG_LEVEL_4K, -pages);
>>> +}
>>> +
>>> +static void __meminit
>>> +phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)
>>> +{
>>> + unsigned long pages = 0, next;
>>> + int i = pmd_index(addr);
>>> +
>>> + for (; i < PTRS_PER_PMD; i++, addr = next) {
>>> + unsigned long pte_phys;
>>> + pmd_t *pmd = pmd_page + pmd_index(addr);
>>> + pte_t *pte;
>>> +
>>> + if (addr >= end)
>>> + break;
>>> +
>>> + next = (addr & PMD_MASK) + PMD_SIZE;
>>> +
>>> + if (!pmd_present(*pmd))
>>> + continue;
>>> +
>>> + if (pmd_large(*pmd)) {
>>> + if ((addr & ~PMD_MASK) == 0 && next <= end) {
>>> + set_pmd(pmd, __pmd(0));
>>> + pages++;
>>> + continue;
>>> + }
>>> +
>>> + /*
>>> + * We use 2M page, but we need to remove part of them,
>>> + * so split 2M page to 4K page.
>>> + */
>>> + pte = alloc_low_page(&pte_phys);
>>> + __split_large_page((pte_t *)pmd, addr, pte);
>>> +
>>> + spin_lock(&init_mm.page_table_lock);
>>> + pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
>>> + spin_unlock(&init_mm.page_table_lock);
>>> + }
>>> +
>>> + spin_lock(&init_mm.page_table_lock);
>>> + pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
>>> + phys_pte_remove(pte, addr, end);
>>> + unmap_low_page(pte);
>>> + spin_unlock(&init_mm.page_table_lock);
>>> + }
>>> + update_page_count(PG_LEVEL_2M, -pages);
>>> +}
>>> +
>>> +static void __meminit
>>> +phys_pud_remove(pud_t *pud_page, unsigned long addr, unsigned long end)
>>> +{
>>> + unsigned long pages = 0, next;
>>> + int i = pud_index(addr);
>>> +
>>> + for (; i < PTRS_PER_PUD; i++, addr = next) {
>>> + unsigned long pmd_phys;
>>> + pud_t *pud = pud_page + pud_index(addr);
>>> + pmd_t *pmd;
>>> +
>>> + if (addr >= end)
>>> + break;
>>> +
>>> + next = (addr & PUD_MASK) + PUD_SIZE;
>>> +
>>> + if (!pud_present(*pud))
>>> + continue;
>>> +
>>> + if (pud_large(*pud)) {
>>> + if ((addr & ~PUD_MASK) == 0 && next <= end) {
>>> + set_pud(pud, __pud(0));
>>> + pages++;
>>> + continue;
>>> + }
>>> +
>>> + /*
>>> + * We use 1G page, but we need to remove part of them,
>>> + * so split 1G page to 2M page.
>>> + */
>>> + pmd = alloc_low_page(&pmd_phys);
>>> + __split_large_page((pte_t *)pud, addr, (pte_t *)pmd);
>>> +
>>> + spin_lock(&init_mm.page_table_lock);
>>> + pud_populate(&init_mm, pud, __va(pmd_phys));
>>> + spin_unlock(&init_mm.page_table_lock);
>>> + }
>>> +
>>> + pmd = map_low_page(pmd_offset(pud, 0));
>>> + phys_pmd_remove(pmd, addr, end);
>>> + unmap_low_page(pmd);
>>> + __flush_tlb_all();
>>> + }
>>> + __flush_tlb_all();
>>> +
>>> + update_page_count(PG_LEVEL_1G, -pages);
>>> +}
>>> +
>>> +void __meminit
>>> +kernel_physical_mapping_remove(unsigned long start, unsigned long end)
>>> +{
>>> + unsigned long next;
>>> +
>>> + start = (unsigned long)__va(start);
>>> + end = (unsigned long)__va(end);
>>> +
>>> + for (; start < end; start = next) {
>>> + pgd_t *pgd = pgd_offset_k(start);
>>> + pud_t *pud;
>>> +
>>> + next = (start + PGDIR_SIZE) & PGDIR_MASK;
>>> + if (next > end)
>>> + next = end;
>>> +
>>> + if (!pgd_present(*pgd))
>>> + continue;
>>> +
>>> + pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
>>> + phys_pud_remove(pud, __pa(start), __pa(end));
>>> + unmap_low_page(pud);
>>> + }
>>> +
>>> + __flush_tlb_all();
>>> +}
>>> +
>>> #ifdef CONFIG_MEMORY_HOTREMOVE
>>> int __ref arch_remove_memory(u64 start, u64 size)
>>> {
>>> @@ -687,6 +832,8 @@ int __ref arch_remove_memory(u64 start,
>>> ret = __remove_pages(zone, start_pfn, nr_pages);
>>> WARN_ON_ONCE(ret);
>>>
>>> + kernel_physical_mapping_remove(start, start + size);
>>> +
>>> return ret;
>>> }
>>> #endif
>>> Index: linux-3.6/arch/x86/include/asm/pgtable_types.h
>>> ===================================================================
>>> --- linux-3.6.orig/arch/x86/include/asm/pgtable_types.h 2012-10-04 18:26:51.925486954 +0900
>>> +++ linux-3.6/arch/x86/include/asm/pgtable_types.h 2012-10-04 18:30:27.322704656 +0900
>>> @@ -334,6 +334,7 @@ static inline void update_page_count(int
>>> * as a pte too.
>>> */
>>> extern pte_t *lookup_address(unsigned long address, unsigned int *level);
>>> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
>>>
>>> #endif /* !__ASSEMBLY__ */
>>>
>>> Index: linux-3.6/arch/x86/mm/pageattr.c
>>> ===================================================================
>>> --- linux-3.6.orig/arch/x86/mm/pageattr.c 2012-10-04 18:26:51.923486952 +0900
>>> +++ linux-3.6/arch/x86/mm/pageattr.c 2012-10-04 18:30:27.328704662 +0900
>>> @@ -501,21 +501,13 @@ out_unlock:
>>> return do_split;
>>> }
>>>
>>> -static int split_large_page(pte_t *kpte, unsigned long address)
>>> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
>>> {
>>> unsigned long pfn, pfninc = 1;
>>> unsigned int i, level;
>>> - pte_t *pbase, *tmp;
>>> + pte_t *tmp;
>>> pgprot_t ref_prot;
>>> - struct page *base;
>>> -
>>> - if (!debug_pagealloc)
>>> - spin_unlock(&cpa_lock);
>>> - base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
>>> - if (!debug_pagealloc)
>>> - spin_lock(&cpa_lock);
>>> - if (!base)
>>> - return -ENOMEM;
>>> + struct page *base = virt_to_page(pbase);
>>>
>>> spin_lock(&pgd_lock);
>>> /*
>>> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte,
>>> * up for us already:
>>> */
>>> tmp = lookup_address(address, &level);
>>> - if (tmp != kpte)
>>> - goto out_unlock;
>>> + if (tmp != kpte) {
>>> + spin_unlock(&pgd_lock);
>>> + return 1;
>>> + }
>>>
>>> - pbase = (pte_t *)page_address(base);
>>> paravirt_alloc_pte(&init_mm, page_to_pfn(base));
>>> ref_prot = pte_pgprot(pte_clrhuge(*kpte));
>>> /*
>>> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte,
>>> * going on.
>>> */
>>> __flush_tlb_all();
>>> + spin_unlock(&pgd_lock);
>>>
>>> - base = NULL;
>>> + return 0;
>>> +}
>>>
>>> -out_unlock:
>>> - /*
>>> - * If we dropped out via the lookup_address check under
>>> - * pgd_lock then stick the page back into the pool:
>>> - */
>>> - if (base)
>>> +static int split_large_page(pte_t *kpte, unsigned long address)
>>> +{
>>> + pte_t *pbase;
>>> + struct page *base;
>>> +
>>> + if (!debug_pagealloc)
>>> + spin_unlock(&cpa_lock);
>>> + base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
>>> + if (!debug_pagealloc)
>>> + spin_lock(&cpa_lock);
>>> + if (!base)
>>> + return -ENOMEM;
>>> +
>>> + pbase = (pte_t *)page_address(base);
>>> + if (__split_large_page(kpte, address, pbase))
>>> __free_page(base);
>>> - spin_unlock(&pgd_lock);
>>>
>>> return 0;
>>> }
>>>
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>> the body to [email protected]. For more info on Linux MM,
>>> see: http://www.linux-mm.org/ .
>>> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>>>
>>
>>
>
>

2012-10-23 07:36:13

by Wen Congyang

[permalink] [raw]
Subject: Re: [PATCH 8/10] memory-hotplug : remove page table of x86_64 architecture

At 10/23/2012 03:09 PM, wujianguo Wrote:
> On 2012-10-22 15:11, Wen Congyang wrote:
>> Hi, Wu
>>
>> Sorry for late reply.
>>
>> At 10/09/2012 04:26 PM, wujianguo Wrote:
>>> Hi Congyang,
>>> I think we should also free pages which are used by page tables after removing
>>> page tables of the memory.
>>
>> It is OK to do it.
>>
>>>
>>> From: Jianguo Wu <[email protected]>
>>>
>>> Signed-off-by: Jianguo Wu <[email protected]>
>>> Signed-off-by: Jiang Liu <[email protected]>
>>> ---
>>> arch/x86/mm/init_64.c | 110 +++++++++++++++++++++++++++++++++++++++---------
>>> 1 files changed, 89 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>>> index 5596dfa..81f9c3b 100644
>>> --- a/arch/x86/mm/init_64.c
>>> +++ b/arch/x86/mm/init_64.c
>>> @@ -675,6 +675,74 @@ int arch_add_memory(int nid, u64 start, u64 size)
>>> }
>>> EXPORT_SYMBOL_GPL(arch_add_memory);
>>>
>>> +static inline void free_pagetable(struct page *page)
>>> +{
>>> + struct zone *zone;
>>> +
>>> + __ClearPageReserved(page);
>>> + __free_page(page);
>>> +
>>> + zone = page_zone(page);
>>> + zone_span_writelock(zone);
>>> + zone->present_pages++;
>>> + zone_span_writeunlock(zone);
>>> + totalram_pages++;
>>
>> Why do you update zone and totalram_pages here?
> Sorry, I made a mistake here. Only if the page was allocated at booting, we should update
> zone and totalram_pages(zone->present_pages and totalram_pages mean pages which are
> managed by buddy system).
>
> How about:
> static inline void free_pagetable(struct page *page)
> {
> struct zone *zone;
> bool bootmem = false;
>
> /* bootmem page has reserved flag */
> if (PageReserved(page)) {
> __ClearPageReserved(page);
> bootmem = true;
> }
>
> __free_page(page);
>
> if (bootmem) {
> zone = page_zone(page);
> zone_span_writelock(zone);
> zone->present_pages++;
> zone_span_writeunlock(zone);
> totalram_pages++;
> }
> }

This verson looks fine to me.

>
>>
>>> +}
>>> +
>>> +static void free_pte_table(pte_t *pte_start, pmd_t *pmd)
>>> +{
>>> + pte_t *pte;
>>> + int i;
>>> +
>>> + for (i = 0; i < PTRS_PER_PTE; i++) {
>>> + pte = pte_start + i;
>>> + if (pte_val(*pte))
>>> + break;
>>> + }
>>> +
>>> + /* free a pte talbe */
>>> + if (i == PTRS_PER_PTE) {
>>> + free_pagetable(pmd_page(*pmd));
>>
>> The memory may be allocated at booting. So it is very dangerous to
>> free it without any check.
> The page is only used by page table, so is safe to free it when all the page table
> entries have been cleared, right?

Yes, but I guess we need to do more. For example, all boot memory are in memblock.reserved,
and you don't update memblock here.

Thanks
Wen Congyang

>
>>
>>> + pmd_clear(pmd);
>>> + }
>>> +}
>>> +
>>> +static void free_pmd_table(pmd_t *pmd_start, pud_t *pud)
>>> +{
>>> + pmd_t *pmd;
>>> + int i;
>>> +
>>> + for (i = 0; i < PTRS_PER_PMD; i++) {
>>> + pmd = pmd_start + i;
>>> + if (pmd_val(*pmd))
>>> + break;
>>> + }
>>> +
>>> + /* free a pmd talbe */
>>> + if (i == PTRS_PER_PMD) {
>>> + free_pagetable(pud_page(*pud));
>>> + pud_clear(pud);
>>> + }
>>> +}
>>> +
>>> +static void free_pud_table(pud_t *pud_start, pgd_t *pgd)
>>> +{
>>> + pud_t *pud;
>>> + int i;
>>> +
>>> + for (i = 0; i < PTRS_PER_PUD; i++) {
>>> + pud = pud_start + i;
>>> + if (pud_val(*pud))
>>> + break;
>>> + }
>>> +
>>> + /* free a pud table */
>>> + if (i == PTRS_PER_PUD) {
>>> + free_pagetable(pgd_page(*pgd));
>>> + pgd_clear(pgd);
>>> + }
>>> +}
>>> +
>>> static void __meminit
>>> phys_pte_remove(pte_t *pte_page, unsigned long addr, unsigned long end)
>>> {
>>> @@ -704,21 +772,19 @@ phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)
>>> unsigned long pages = 0, next;
>>> int i = pmd_index(addr);
>>>
>>> - for (; i < PTRS_PER_PMD; i++, addr = next) {
>>> + for (; i < PTRS_PER_PMD && addr < end; i++, addr = next) {
>>> unsigned long pte_phys;
>>> pmd_t *pmd = pmd_page + pmd_index(addr);
>>> pte_t *pte;
>>>
>>> - if (addr >= end)
>>> - break;
>>> -
>>> - next = (addr & PMD_MASK) + PMD_SIZE;
>>> + next = pmd_addr_end(addr, end);
>>>
>>> if (!pmd_present(*pmd))
>>> continue;
>>>
>>> if (pmd_large(*pmd)) {
>>> - if ((addr & ~PMD_MASK) == 0 && next <= end) {
>>> + if (IS_ALIGNED(addr, PMD_SIZE) &&
>>> + IS_ALIGNED(next, PMD_SIZE)) {
>>> set_pmd(pmd, __pmd(0));
>>> pages++;
>>> continue;
>>> @@ -729,7 +795,8 @@ phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)
>>> * so split 2M page to 4K page.
>>> */
>>> pte = alloc_low_page(&pte_phys);
>>> - __split_large_page((pte_t *)pmd, addr, pte);
>>> + __split_large_page((pte_t *)pmd,
>>> + (unsigned long)__va(addr), pte);
>>>
>>> spin_lock(&init_mm.page_table_lock);
>>> pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
>>> @@ -738,7 +805,8 @@ phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)
>>>
>>> spin_lock(&init_mm.page_table_lock);
>>> pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
>>> - phys_pte_remove(pte, addr, end);
>>> + phys_pte_remove(pte, addr, next);
>>> + free_pte_table(pte, pmd);
>>> unmap_low_page(pte);
>>> spin_unlock(&init_mm.page_table_lock);
>>> }
>>> @@ -751,21 +819,19 @@ phys_pud_remove(pud_t *pud_page, unsigned long addr, unsigned long end)
>>> unsigned long pages = 0, next;
>>> int i = pud_index(addr);
>>>
>>> - for (; i < PTRS_PER_PUD; i++, addr = next) {
>>> + for (; i < PTRS_PER_PUD && addr < end; i++, addr = next) {
>>> unsigned long pmd_phys;
>>> pud_t *pud = pud_page + pud_index(addr);
>>> pmd_t *pmd;
>>>
>>> - if (addr >= end)
>>> - break;
>>> -
>>> - next = (addr & PUD_MASK) + PUD_SIZE;
>>> + next = pud_addr_end(addr, end);
>>>
>>> if (!pud_present(*pud))
>>> continue;
>>>
>>> if (pud_large(*pud)) {
>>> - if ((addr & ~PUD_MASK) == 0 && next <= end) {
>>> + if (IS_ALIGNED(addr, PUD_SIZE) &&
>>> + IS_ALIGNED(next, PUD_SIZE)) {
>>> set_pud(pud, __pud(0));
>>> pages++;
>>> continue;
>>> @@ -776,15 +842,18 @@ phys_pud_remove(pud_t *pud_page, unsigned long addr, unsigned long end)
>>> * so split 1G page to 2M page.
>>> */
>>> pmd = alloc_low_page(&pmd_phys);
>>> - __split_large_page((pte_t *)pud, addr, (pte_t *)pmd);
>>> + __split_large_page((pte_t *)pud,
>>> + (unsigned long)__va(addr),
>>> + (pte_t *)pmd);
>>>
>>> spin_lock(&init_mm.page_table_lock);
>>> pud_populate(&init_mm, pud, __va(pmd_phys));
>>> spin_unlock(&init_mm.page_table_lock);
>>> }
>>>
>>> - pmd = map_low_page(pmd_offset(pud, 0));
>>> - phys_pmd_remove(pmd, addr, end);
>>> + pmd = map_low_page((pmd_t *)pud_page_vaddr(*pud));
>>
>> Hmm, pmd_offset(pud, 0) is equal to (pmd_t *)pud_page_vaddr(*pud).
>>
>> Is it OK to merge your patch into my patch?
>>
> Yes, sure.
>
> Thanks,
> Jianguo Wu
>
>> Thanks
>> Wen Congyang
>>
>>> + phys_pmd_remove(pmd, addr, next);
>>> + free_pmd_table(pmd, pud);
>>> unmap_low_page(pmd);
>>> __flush_tlb_all();
>>> }
>>> @@ -805,15 +874,14 @@ kernel_physical_mapping_remove(unsigned long start, unsigned long end)
>>> pgd_t *pgd = pgd_offset_k(start);
>>> pud_t *pud;
>>>
>>> - next = (start + PGDIR_SIZE) & PGDIR_MASK;
>>> - if (next > end)
>>> - next = end;
>>> + next = pgd_addr_end(start, end);
>>>
>>> if (!pgd_present(*pgd))
>>> continue;
>>>
>>> pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
>>> - phys_pud_remove(pud, __pa(start), __pa(end));
>>> + phys_pud_remove(pud, __pa(start), __pa(next));
>>> + free_pud_table(pud, pgd);
>>> unmap_low_page(pud);
>>> }
>>>
>>> -- 1.7.6.1 .
>>>
>>>
>>> On 2012-10-5 10:36, Yasuaki Ishimatsu wrote:
>>>> From: Wen Congyang <[email protected]>
>>>>
>>>> For hot removing memory, we sholud remove page table about the memory.
>>>> So the patch searches a page table about the removed memory, and clear
>>>> page table.
>>>>
>>>> CC: David Rientjes <[email protected]>
>>>> CC: Jiang Liu <[email protected]>
>>>> CC: Len Brown <[email protected]>
>>>> CC: Christoph Lameter <[email protected]>
>>>> Cc: Minchan Kim <[email protected]>
>>>> CC: Andrew Morton <[email protected]>
>>>> CC: KOSAKI Motohiro <[email protected]>
>>>> CC: Yasuaki Ishimatsu <[email protected]>
>>>> Signed-off-by: Wen Congyang <[email protected]>
>>>> ---
>>>> arch/x86/include/asm/pgtable_types.h | 1
>>>> arch/x86/mm/init_64.c | 147 +++++++++++++++++++++++++++++++++++
>>>> arch/x86/mm/pageattr.c | 47 +++++------
>>>> 3 files changed, 173 insertions(+), 22 deletions(-)
>>>>
>>>> Index: linux-3.6/arch/x86/mm/init_64.c
>>>> ===================================================================
>>>> --- linux-3.6.orig/arch/x86/mm/init_64.c 2012-10-04 18:30:21.171698416 +0900
>>>> +++ linux-3.6/arch/x86/mm/init_64.c 2012-10-04 18:30:27.317704652 +0900
>>>> @@ -675,6 +675,151 @@ int arch_add_memory(int nid, u64 start,
>>>> }
>>>> EXPORT_SYMBOL_GPL(arch_add_memory);
>>>>
>>>> +static void __meminit
>>>> +phys_pte_remove(pte_t *pte_page, unsigned long addr, unsigned long end)
>>>> +{
>>>> + unsigned pages = 0;
>>>> + int i = pte_index(addr);
>>>> +
>>>> + pte_t *pte = pte_page + pte_index(addr);
>>>> +
>>>> + for (; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
>>>> +
>>>> + if (addr >= end)
>>>> + break;
>>>> +
>>>> + if (!pte_present(*pte))
>>>> + continue;
>>>> +
>>>> + pages++;
>>>> + set_pte(pte, __pte(0));
>>>> + }
>>>> +
>>>> + update_page_count(PG_LEVEL_4K, -pages);
>>>> +}
>>>> +
>>>> +static void __meminit
>>>> +phys_pmd_remove(pmd_t *pmd_page, unsigned long addr, unsigned long end)
>>>> +{
>>>> + unsigned long pages = 0, next;
>>>> + int i = pmd_index(addr);
>>>> +
>>>> + for (; i < PTRS_PER_PMD; i++, addr = next) {
>>>> + unsigned long pte_phys;
>>>> + pmd_t *pmd = pmd_page + pmd_index(addr);
>>>> + pte_t *pte;
>>>> +
>>>> + if (addr >= end)
>>>> + break;
>>>> +
>>>> + next = (addr & PMD_MASK) + PMD_SIZE;
>>>> +
>>>> + if (!pmd_present(*pmd))
>>>> + continue;
>>>> +
>>>> + if (pmd_large(*pmd)) {
>>>> + if ((addr & ~PMD_MASK) == 0 && next <= end) {
>>>> + set_pmd(pmd, __pmd(0));
>>>> + pages++;
>>>> + continue;
>>>> + }
>>>> +
>>>> + /*
>>>> + * We use 2M page, but we need to remove part of them,
>>>> + * so split 2M page to 4K page.
>>>> + */
>>>> + pte = alloc_low_page(&pte_phys);
>>>> + __split_large_page((pte_t *)pmd, addr, pte);
>>>> +
>>>> + spin_lock(&init_mm.page_table_lock);
>>>> + pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
>>>> + spin_unlock(&init_mm.page_table_lock);
>>>> + }
>>>> +
>>>> + spin_lock(&init_mm.page_table_lock);
>>>> + pte = map_low_page((pte_t *)pmd_page_vaddr(*pmd));
>>>> + phys_pte_remove(pte, addr, end);
>>>> + unmap_low_page(pte);
>>>> + spin_unlock(&init_mm.page_table_lock);
>>>> + }
>>>> + update_page_count(PG_LEVEL_2M, -pages);
>>>> +}
>>>> +
>>>> +static void __meminit
>>>> +phys_pud_remove(pud_t *pud_page, unsigned long addr, unsigned long end)
>>>> +{
>>>> + unsigned long pages = 0, next;
>>>> + int i = pud_index(addr);
>>>> +
>>>> + for (; i < PTRS_PER_PUD; i++, addr = next) {
>>>> + unsigned long pmd_phys;
>>>> + pud_t *pud = pud_page + pud_index(addr);
>>>> + pmd_t *pmd;
>>>> +
>>>> + if (addr >= end)
>>>> + break;
>>>> +
>>>> + next = (addr & PUD_MASK) + PUD_SIZE;
>>>> +
>>>> + if (!pud_present(*pud))
>>>> + continue;
>>>> +
>>>> + if (pud_large(*pud)) {
>>>> + if ((addr & ~PUD_MASK) == 0 && next <= end) {
>>>> + set_pud(pud, __pud(0));
>>>> + pages++;
>>>> + continue;
>>>> + }
>>>> +
>>>> + /*
>>>> + * We use 1G page, but we need to remove part of them,
>>>> + * so split 1G page to 2M page.
>>>> + */
>>>> + pmd = alloc_low_page(&pmd_phys);
>>>> + __split_large_page((pte_t *)pud, addr, (pte_t *)pmd);
>>>> +
>>>> + spin_lock(&init_mm.page_table_lock);
>>>> + pud_populate(&init_mm, pud, __va(pmd_phys));
>>>> + spin_unlock(&init_mm.page_table_lock);
>>>> + }
>>>> +
>>>> + pmd = map_low_page(pmd_offset(pud, 0));
>>>> + phys_pmd_remove(pmd, addr, end);
>>>> + unmap_low_page(pmd);
>>>> + __flush_tlb_all();
>>>> + }
>>>> + __flush_tlb_all();
>>>> +
>>>> + update_page_count(PG_LEVEL_1G, -pages);
>>>> +}
>>>> +
>>>> +void __meminit
>>>> +kernel_physical_mapping_remove(unsigned long start, unsigned long end)
>>>> +{
>>>> + unsigned long next;
>>>> +
>>>> + start = (unsigned long)__va(start);
>>>> + end = (unsigned long)__va(end);
>>>> +
>>>> + for (; start < end; start = next) {
>>>> + pgd_t *pgd = pgd_offset_k(start);
>>>> + pud_t *pud;
>>>> +
>>>> + next = (start + PGDIR_SIZE) & PGDIR_MASK;
>>>> + if (next > end)
>>>> + next = end;
>>>> +
>>>> + if (!pgd_present(*pgd))
>>>> + continue;
>>>> +
>>>> + pud = map_low_page((pud_t *)pgd_page_vaddr(*pgd));
>>>> + phys_pud_remove(pud, __pa(start), __pa(end));
>>>> + unmap_low_page(pud);
>>>> + }
>>>> +
>>>> + __flush_tlb_all();
>>>> +}
>>>> +
>>>> #ifdef CONFIG_MEMORY_HOTREMOVE
>>>> int __ref arch_remove_memory(u64 start, u64 size)
>>>> {
>>>> @@ -687,6 +832,8 @@ int __ref arch_remove_memory(u64 start,
>>>> ret = __remove_pages(zone, start_pfn, nr_pages);
>>>> WARN_ON_ONCE(ret);
>>>>
>>>> + kernel_physical_mapping_remove(start, start + size);
>>>> +
>>>> return ret;
>>>> }
>>>> #endif
>>>> Index: linux-3.6/arch/x86/include/asm/pgtable_types.h
>>>> ===================================================================
>>>> --- linux-3.6.orig/arch/x86/include/asm/pgtable_types.h 2012-10-04 18:26:51.925486954 +0900
>>>> +++ linux-3.6/arch/x86/include/asm/pgtable_types.h 2012-10-04 18:30:27.322704656 +0900
>>>> @@ -334,6 +334,7 @@ static inline void update_page_count(int
>>>> * as a pte too.
>>>> */
>>>> extern pte_t *lookup_address(unsigned long address, unsigned int *level);
>>>> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
>>>>
>>>> #endif /* !__ASSEMBLY__ */
>>>>
>>>> Index: linux-3.6/arch/x86/mm/pageattr.c
>>>> ===================================================================
>>>> --- linux-3.6.orig/arch/x86/mm/pageattr.c 2012-10-04 18:26:51.923486952 +0900
>>>> +++ linux-3.6/arch/x86/mm/pageattr.c 2012-10-04 18:30:27.328704662 +0900
>>>> @@ -501,21 +501,13 @@ out_unlock:
>>>> return do_split;
>>>> }
>>>>
>>>> -static int split_large_page(pte_t *kpte, unsigned long address)
>>>> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
>>>> {
>>>> unsigned long pfn, pfninc = 1;
>>>> unsigned int i, level;
>>>> - pte_t *pbase, *tmp;
>>>> + pte_t *tmp;
>>>> pgprot_t ref_prot;
>>>> - struct page *base;
>>>> -
>>>> - if (!debug_pagealloc)
>>>> - spin_unlock(&cpa_lock);
>>>> - base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
>>>> - if (!debug_pagealloc)
>>>> - spin_lock(&cpa_lock);
>>>> - if (!base)
>>>> - return -ENOMEM;
>>>> + struct page *base = virt_to_page(pbase);
>>>>
>>>> spin_lock(&pgd_lock);
>>>> /*
>>>> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte,
>>>> * up for us already:
>>>> */
>>>> tmp = lookup_address(address, &level);
>>>> - if (tmp != kpte)
>>>> - goto out_unlock;
>>>> + if (tmp != kpte) {
>>>> + spin_unlock(&pgd_lock);
>>>> + return 1;
>>>> + }
>>>>
>>>> - pbase = (pte_t *)page_address(base);
>>>> paravirt_alloc_pte(&init_mm, page_to_pfn(base));
>>>> ref_prot = pte_pgprot(pte_clrhuge(*kpte));
>>>> /*
>>>> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte,
>>>> * going on.
>>>> */
>>>> __flush_tlb_all();
>>>> + spin_unlock(&pgd_lock);
>>>>
>>>> - base = NULL;
>>>> + return 0;
>>>> +}
>>>>
>>>> -out_unlock:
>>>> - /*
>>>> - * If we dropped out via the lookup_address check under
>>>> - * pgd_lock then stick the page back into the pool:
>>>> - */
>>>> - if (base)
>>>> +static int split_large_page(pte_t *kpte, unsigned long address)
>>>> +{
>>>> + pte_t *pbase;
>>>> + struct page *base;
>>>> +
>>>> + if (!debug_pagealloc)
>>>> + spin_unlock(&cpa_lock);
>>>> + base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
>>>> + if (!debug_pagealloc)
>>>> + spin_lock(&cpa_lock);
>>>> + if (!base)
>>>> + return -ENOMEM;
>>>> +
>>>> + pbase = (pte_t *)page_address(base);
>>>> + if (__split_large_page(kpte, address, pbase))
>>>> __free_page(base);
>>>> - spin_unlock(&pgd_lock);
>>>>
>>>> return 0;
>>>> }
>>>>
>>>> --
>>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>>> the body to [email protected]. For more info on Linux MM,
>>>> see: http://www.linux-mm.org/ .
>>>> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>>>>
>>>
>>>
>>
>>
>
>