2012-11-06 19:53:30

by Srivatsa S. Bhat

Subject: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

Hi,

This is an alternative design for Memory Power Management, developed based on
some of the suggestions[1] received during the review of the earlier patchset
("Hierarchy" design) on Memory Power Management[2]. This alters the buddy-lists
to keep them region-sorted, and is hence identified as the "Sorted-buddy" design.

One of the key aspects of this design is that it avoids the zone-fragmentation
problem that was present in the earlier design[3].


Quick overview of Memory Power Management and Memory Regions:
------------------------------------------------------------

Today's memory subsystems offer a wide range of capabilities for managing
memory power consumption. As a quick example, if a block of memory is not
referenced for a threshold amount of time, the memory controller can decide to
put that chunk into a low-power content-preserving state. And the next
reference to that memory chunk would bring it back to full power for read/write.
With this capability in place, it becomes important for the OS to understand
the boundaries of such power-manageable chunks of memory and to ensure that
references are consolidated to a minimum number of such memory power management
domains.

ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
the firmware can expose information regarding the boundaries of such memory
power management domains to the OS in a standard way.

How can the Linux VM help with memory power savings?

o Consolidate memory allocations and/or references such that they are
not spread across the entire memory address space. Basically, an area of
memory that is not being referenced can reside in a low-power state.

o Support targeted memory reclaim, where certain areas of memory that can be
easily freed can be offlined, allowing those areas of memory to be put into
lower power states.

Memory Regions:
---------------

"Memory Regions" is a way of capturing the boundaries of power-manageable
chunks of memory, within the MM subsystem.


Short description of the "Sorted-buddy" design:
-----------------------------------------------

In this design, the memory region boundaries are captured in a parallel
data-structure instead of fitting regions between nodes and zones in the
hierarchy. Further, the buddy allocator is altered, such that we maintain the
zones' freelists in region-sorted-order and thus do page allocation in the
order of increasing memory regions. (The freelists need not be fully
address-sorted, they just need to be region-sorted. Patch 6 explains this
in more detail).

The idea is to do page allocation in increasing order of memory regions
(within a zone) and perform page reclaim in the reverse order, as illustrated
below.

---------------------------- Increasing region number---------------------->

Direction of allocation---> <---Direction of reclaim


The sorting logic (to maintain freelist pageblocks in region-sorted-order)
lies in the page-free path and not the page-allocation path and hence the
critical page allocation paths remain fast. Moreover, the heart of the page
allocation algorithm itself remains largely unchanged, and the region-related
data-structures are optimized to avoid unnecessary updates during the
page-allocator's runtime.

Advantages of this design:
--------------------------
1. No zone-fragmentation (IOW, we don't create more zones than necessary) and
hence we avoid its associated problems (like too many zones, extra page
reclaim threads, question of choosing watermarks etc).
[This is an advantage over the "Hierarchy" design]

2. Performance overhead is expected to be low: Since we retain the simplicity
of the algorithm in the page allocation path, page allocation can
potentially remain as fast as it would be without memory regions. The
overhead is pushed to the page-freeing paths which are not that critical.


Results:
=======

Test setup:
-----------
This patchset applies cleanly on top of 3.7-rc3.

x86 dual-socket quad core HT-enabled machine booted with mem=8G
Memory region size = 512 MB

Functional testing:
-------------------

Ran pagetest, a simple C program that allocates and touches a required number
of pages.

Below are the statistics for the regions within ZONE_NORMAL, at various sizes
of allocations from pagetest.

              Present pages |      Free pages at various allocations      |
                            |  start  | 512 MB  | 1024 MB | 2048 MB |
Region 0           16       |       0 |       0 |       0 |       0 |
Region 1       131072       |   87219 |    8066 |    7892 |    7387 |
Region 2       131072       |  131072 |   79036 |       0 |       0 |
Region 3       131072       |  131072 |  131072 |   79061 |       0 |
Region 4       131072       |  131072 |  131072 |  131072 |       0 |
Region 5       131072       |  131072 |  131072 |  131072 |   79051 |
Region 6       131072       |  131072 |  131072 |  131072 |  131072 |
Region 7       131072       |  131072 |  131072 |  131072 |  131072 |
Region 8       131056       |  105475 |  105472 |  105472 |  105472 |

This shows that page allocation occurs in the order of increasing region
numbers, as intended in this design.

Performance impact:
-------------------

Kernbench results didn't show much difference in performance between
vanilla 3.7-rc3 and this patchset.


Todos:
=====

1. Memory-region aware page-reclamation:
----------------------------------------

We would like to do page reclaim in the reverse order of page allocation
within a zone, i.e., in the order of decreasing region numbers.
To achieve that, while scanning lru pages to reclaim, we could potentially
look for pages belonging to higher regions (considering region boundaries)
or perhaps simply prefer pages of higher pfns (and skip lower pfns) as
reclaim candidates.

2. Compile-time exclusion of Memory Power Management, and extending the
support to also work with other features such as Mem cgroups, kexec etc.

References:
----------

[1]. Review comments suggesting modifying the buddy allocator to be aware of
memory regions:
http://article.gmane.org/gmane.linux.power-management.general/24862
http://article.gmane.org/gmane.linux.power-management.general/25061
http://article.gmane.org/gmane.linux.kernel.mm/64689

[2]. Patch series that implemented the node-region-zone hierarchy design:
http://lwn.net/Articles/445045/
http://thread.gmane.org/gmane.linux.kernel.mm/63840

Summary of the discussion on that patchset:
http://article.gmane.org/gmane.linux.power-management.general/25061

Forward-port of that patchset to 3.7-rc3 (minimal x86 config)
http://thread.gmane.org/gmane.linux.kernel.mm/89202

[3]. Disadvantages of having memory regions in the hierarchy between nodes and
zones:
http://article.gmane.org/gmane.linux.kernel.mm/63849

[4]. Estimate of potential power savings on Samsung exynos board
http://article.gmane.org/gmane.linux.kernel.mm/65935

[5]. ACPI 5.0 and MPST support
http://www.acpi.info/spec.htm
Section 5.2.21 Memory Power State Table (MPST)

Srivatsa S. Bhat (8):
mm: Introduce memory regions data-structure to capture region boundaries within node
mm: Initialize node memory regions during boot
mm: Introduce and initialize zone memory regions
mm: Add helpers to retrieve node region and zone region for a given page
mm: Add data-structures to describe memory regions within the zones' freelists
mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
mm: Add an optimized version of del_from_freelist to keep page allocation fast
mm: Print memory region statistics to understand the buddy allocator behavior


include/linux/mm.h | 38 +++++++
include/linux/mmzone.h | 52 +++++++++
mm/compaction.c | 8 +
mm/page_alloc.c | 263 ++++++++++++++++++++++++++++++++++++++++++++----
mm/vmstat.c | 59 ++++++++++-
5 files changed, 390 insertions(+), 30 deletions(-)


Thanks,
Srivatsa S. Bhat
IBM Linux Technology Center


2012-11-06 19:53:45

by Srivatsa S. Bhat

Subject: [RFC PATCH 1/8] mm: Introduce memory regions data-structure to capture region boundaries within node

Within a node, we can have regions of memory that can be power-managed.
That is, chunks of memory can be transitioned (manually or automatically)
to low-power states based on the frequency of references to that region.
For example, if a memory chunk is not referenced for a given threshold
amount of time, the hardware can decide to put that piece of memory into
a content-preserving low-power state. And of course, on the next reference
to that chunk of memory, it will be transitioned to full-power for
read/write operations.

We propose to incorporate this knowledge of power-manageable chunks of
memory into a new data-structure called "Memory Regions". This way of
acknowledging the existence of different classes of memory with different
characteristics is the first step towards managing memory power-efficiently,
for example by performing power-aware memory allocation.

[Also, the concept of memory regions could potentially be extended to work
with different classes of memory like PCM (Phase Change Memory) etc and
hence, it is not limited to just power management alone].

We already sub-divide a node's memory into zones, based on some well-known
constraints. So the question is, where do memory regions fit in this
hierarchy? Instead of artificially trying to fit them into the hierarchy one
way or the other, we choose to simply capture the region boundaries in a
parallel data-structure, since there is no guarantee that the region
boundaries will naturally fit inside zone boundaries or vice-versa.

But of course, memory regions are sub-divisions *within* a node, so it makes
sense to keep the data-structures in the node's struct pglist_data. (Thus
this placement makes memory regions parallel to zones in that node).

Once we capture the region boundaries in the memory regions data-structure,
we can influence MM decisions at various places, such as page allocation,
reclamation etc.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/mmzone.h | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 50aaca8..bb7c3ef 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -80,6 +80,8 @@ static inline int get_pageblock_migratetype(struct page *page)
return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
}

+#define MAX_NR_REGIONS 256
+
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
unsigned long nr_free;
@@ -328,6 +330,15 @@ enum zone_type {
#error ZONES_SHIFT -- too many zones configured adjust calculation
#endif

+struct node_mem_region {
+ unsigned long start_pfn;
+ unsigned long present_pages;
+ unsigned long spanned_pages;
+ int idx;
+ int node;
+ struct pglist_data *pgdat;
+};
+
struct zone {
/* Fields commonly accessed by the page allocator */

@@ -687,6 +698,8 @@ typedef struct pglist_data {
struct zone node_zones[MAX_NR_ZONES];
struct zonelist node_zonelists[MAX_ZONELISTS];
int nr_zones;
+ struct node_mem_region node_regions[MAX_NR_REGIONS];
+ int nr_node_regions;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
#ifdef CONFIG_MEMCG

2012-11-06 19:54:04

by Srivatsa S. Bhat

Subject: [RFC PATCH 2/8] mm: Initialize node memory regions during boot

Initialize the node's memory regions structures with the information about
the region-boundaries, at boot time.

Based-on-patch-by: Ankita Garg <[email protected]>
Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/mm.h | 4 ++++
mm/page_alloc.c | 35 +++++++++++++++++++++++++++++++++++
2 files changed, 39 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa06804..19c4fb0 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -657,6 +657,10 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
#define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1)

+/* Hard-code memory regions size to be 512 MB for now. */
+#define MEM_REGION_SHIFT (29 - PAGE_SHIFT)
+#define MEM_REGION_SIZE (1UL << MEM_REGION_SHIFT)
+
static inline enum zone_type page_zonenum(const struct page *page)
{
return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb90971..709e3c1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4560,6 +4560,40 @@ static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat)
#endif /* CONFIG_FLAT_NODE_MEM_MAP */
}

+void init_node_memory_regions(struct pglist_data *pgdat)
+{
+ int nid = pgdat->node_id;
+ unsigned long start_pfn = pgdat->node_start_pfn;
+ unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
+ unsigned long i, absent;
+ int idx;
+ struct node_mem_region *region;
+
+ for (i = start_pfn, idx = 0; i < end_pfn;
+ i += region->spanned_pages, idx++) {
+
+ region = &pgdat->node_regions[idx];
+
+ if (i + MEM_REGION_SIZE <= end_pfn) {
+ region->start_pfn = i;
+ region->spanned_pages = MEM_REGION_SIZE;
+ } else {
+ region->start_pfn = i;
+ region->spanned_pages = end_pfn - i;
+ }
+
+ absent = __absent_pages_in_range(nid, region->start_pfn,
+ region->start_pfn +
+ region->spanned_pages);
+
+ region->present_pages = region->spanned_pages - absent;
+ region->idx = idx;
+ region->node = nid;
+ region->pgdat = pgdat;
+ pgdat->nr_node_regions++;
+ }
+}
+
void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
unsigned long node_start_pfn, unsigned long *zholes_size)
{
@@ -4581,6 +4615,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
#endif

free_area_init_core(pgdat, zones_size, zholes_size);
+ init_node_memory_regions(pgdat);
}

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP

2012-11-06 19:54:17

by Srivatsa S. Bhat

Subject: [RFC PATCH 3/8] mm: Introduce and initialize zone memory regions

Memory region boundaries don't necessarily fit on zone boundaries. So we need
to maintain a zone-level mapping of the absolute memory region boundaries.

"Node Memory Regions" will be used to capture the absolute region boundaries.
Add "Zone Memory Regions" to track the subsets of the absolute memory regions
that fall within the zone boundaries.

Eg:

|<---------------------Node---------------------->|
_________________________________________________
| Node mem reg 0 | Node mem reg 1 |
|_______________________|_________________________|

_________________________________________________
| ZONE_DMA | ZONE_NORMAL |
|_______________|_________________________________|


In the above figure,

ZONE_DMA has only 1 zone memory region (say, Zone mem reg 0) which is a subset
of Node mem reg 0.

ZONE_NORMAL has 2 zone memory regions (say, Zone mem reg 0 and Zone mem reg 1)
which are subsets of Node mem reg 0 and Node mem reg 1 respectively.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/mmzone.h | 9 +++++++++
mm/page_alloc.c | 42 +++++++++++++++++++++++++++++++++++++++++-
2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bb7c3ef..9f923aa 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -339,6 +339,12 @@ struct node_mem_region {
struct pglist_data *pgdat;
};

+struct zone_mem_region {
+ unsigned long start_pfn;
+ unsigned long spanned_pages;
+ unsigned long present_pages;
+};
+
struct zone {
/* Fields commonly accessed by the page allocator */

@@ -403,6 +409,9 @@ struct zone {
#endif
struct free_area free_area[MAX_ORDER];

+ struct zone_mem_region zone_mem_region[MAX_NR_REGIONS];
+ int nr_zone_regions;
+
#ifndef CONFIG_SPARSEMEM
/*
* Flags for a pageblock_nr_pages block. See pageblock-flags.h.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 709e3c1..c00f72d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4594,6 +4594,46 @@ void init_node_memory_regions(struct pglist_data *pgdat)
}
}

+void init_zone_memory_regions(struct pglist_data *pgdat)
+{
+ unsigned long start_pfn, end_pfn, absent;
+ int i, j, idx, nid = pgdat->node_id;
+ struct node_mem_region *region;
+ struct zone *z;
+
+ for (i = 0; i < pgdat->nr_zones; i++) {
+ z = &pgdat->node_zones[i];
+ idx = 0;
+
+ for (j = 0; j < pgdat->nr_node_regions; j++) {
+ region = &pgdat->node_regions[j];
+ start_pfn = max(z->zone_start_pfn, region->start_pfn);
+ end_pfn = min(z->zone_start_pfn + z->spanned_pages,
+ region->start_pfn + region->spanned_pages);
+
+ if (start_pfn >= end_pfn)
+ continue;
+
+ z->zone_mem_region[idx].start_pfn = start_pfn;
+ z->zone_mem_region[idx].spanned_pages = end_pfn - start_pfn;
+
+ absent = __absent_pages_in_range(nid, start_pfn,
+ end_pfn);
+ z->zone_mem_region[idx].present_pages =
+ end_pfn - start_pfn - absent;
+ idx++;
+ }
+
+ z->nr_zone_regions = idx;
+ }
+}
+
+void init_memory_regions(struct pglist_data *pgdat)
+{
+ init_node_memory_regions(pgdat);
+ init_zone_memory_regions(pgdat);
+}
+
void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
unsigned long node_start_pfn, unsigned long *zholes_size)
{
@@ -4615,7 +4655,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
#endif

free_area_init_core(pgdat, zones_size, zholes_size);
- init_node_memory_regions(pgdat);
+ init_memory_regions(pgdat);
}

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP

2012-11-06 19:54:38

by Srivatsa S. Bhat

Subject: [RFC PATCH 4/8] mm: Add helpers to retrieve node region and zone region for a given page

Given a page, we would like to have an efficient mechanism to find out
the node memory region and the zone memory region to which it belongs.

Since the node is assumed to be divided into equal-sized node memory
regions, the node memory region index can be obtained by simply right-shifting
the page's pfn by 'mem_region_shift'.

But finding the corresponding zone memory region's index in the zone is
not that straightforward. To have an O(1) algorithm to find it out, define a
zone_region_idx[] array to store the zone memory region indices for every
node memory region.

To illustrate, consider the following example:

|<---------------------Node---------------------->|
_________________________________________________
| Node mem reg 0 | Node mem reg 1 |
|_______________________|_________________________|

_________________________________________________
| ZONE_DMA | ZONE_NORMAL |
|_______________|_________________________________|


In the above figure,

Node mem region 0:
------------------
This region corresponds to the first zone mem region in ZONE_DMA and also
the first zone mem region in ZONE_NORMAL. Hence its index array would look
like this:
node_regions[0].zone_region_idx[ZONE_DMA] == 0
node_regions[0].zone_region_idx[ZONE_NORMAL] == 0


Node mem region 1:
------------------
This region corresponds to the second zone mem region in ZONE_NORMAL. Hence
its index array would look like this:
node_regions[1].zone_region_idx[ZONE_NORMAL] == 1


Using this index array, we can quickly obtain the zone memory region to
which a given page belongs.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/mm.h | 23 +++++++++++++++++++++++
include/linux/mmzone.h | 7 +++++++
mm/page_alloc.c | 2 ++
3 files changed, 32 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 19c4fb0..a817b16 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -702,6 +702,29 @@ static inline struct zone *page_zone(const struct page *page)
return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
}

+static inline int page_node_region_id(const struct page *page)
+{
+ return page_to_pfn(page) >> MEM_REGION_SHIFT;
+}
+
+/**
+ * Return the index of the region to which the page belongs, within its zone.
+ *
+ * Given a page, find the absolute (node) region as well as the zone to which
+ * it belongs. Then find the region within the zone that corresponds to that
+ * absolute (node) region, and return its index.
+ */
+static inline int page_zone_region_id(const struct page *page)
+{
+ pg_data_t *pgdat = NODE_DATA(page_to_nid(page));
+ enum zone_type z_num = page_zonenum(page);
+ unsigned long node_region_idx;
+
+ node_region_idx = page_node_region_id(page);
+
+ return pgdat->node_regions[node_region_idx].zone_region_idx[z_num];
+}
+
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
static inline void set_page_section(struct page *page, unsigned long section)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9f923aa..3982354 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -336,6 +336,13 @@ struct node_mem_region {
unsigned long spanned_pages;
int idx;
int node;
+
+ /*
+ * A physical (node) region could be split across multiple zones.
+ * Store the indices of the corresponding regions of each such
+ * zone for this physical (node) region.
+ */
+ int zone_region_idx[MAX_NR_ZONES];
struct pglist_data *pgdat;
};

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c00f72d..7fd89cd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4621,6 +4621,8 @@ void init_zone_memory_regions(struct pglist_data *pgdat)
end_pfn);
z->zone_mem_region[idx].present_pages =
end_pfn - start_pfn - absent;
+
+ region->zone_region_idx[zone_idx(z)] = idx;
idx++;
}

2012-11-06 19:54:47

by Srivatsa S. Bhat

Subject: [RFC PATCH 5/8] mm: Add data-structures to describe memory regions within the zones' freelists

In order to influence page allocation decisions (i.e., to make page-allocation
region-aware), we need to be able to distinguish pageblocks belonging to
different zone memory regions within the zones' freelists.

So, within every freelist in a zone, provide pointers to describe the
boundaries of zone memory regions and counters to track the number of free
pageblocks within each region.

Also, fixup the references to the freelist's list_head inside struct free_area.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/mmzone.h | 17 ++++++++++++++++-
mm/compaction.c | 8 ++++----
mm/page_alloc.c | 21 +++++++++++----------
mm/vmstat.c | 2 +-
4 files changed, 32 insertions(+), 16 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3982354..aba4d68 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -82,8 +82,23 @@ static inline int get_pageblock_migratetype(struct page *page)

#define MAX_NR_REGIONS 256

+struct mem_region_list {
+ struct list_head *page_block;
+ unsigned long nr_free;
+};
+
+struct free_list {
+ struct list_head list;
+
+ /*
+ * Demarcates pageblocks belonging to different regions within
+ * this freelist.
+ */
+ struct mem_region_list mr_list[MAX_NR_REGIONS];
+};
+
struct free_area {
- struct list_head free_list[MIGRATE_TYPES];
+ struct free_list free_list[MIGRATE_TYPES];
unsigned long nr_free;
};

diff --git a/mm/compaction.c b/mm/compaction.c
index 9eef558..95f5c92 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -247,14 +247,14 @@ static void compact_capture_page(struct compact_control *cc)
struct page *page;
struct free_area *area;
area = &(cc->zone->free_area[order]);
- if (list_empty(&area->free_list[mtype]))
+ if (list_empty(&area->free_list[mtype].list))
continue;

/* Take the lock and attempt capture of the page */
if (!compact_trylock_irqsave(&cc->zone->lock, &flags, cc))
return;
- if (!list_empty(&area->free_list[mtype])) {
- page = list_entry(area->free_list[mtype].next,
+ if (!list_empty(&area->free_list[mtype].list)) {
+ page = list_entry(area->free_list[mtype].list.next,
struct page, lru);
if (capture_free_page(page, cc->order, mtype)) {
spin_unlock_irqrestore(&cc->zone->lock,
@@ -866,7 +866,7 @@ static int compact_finished(struct zone *zone,
for (order = cc->order; order < MAX_ORDER; order++) {
struct free_area *area = &zone->free_area[cc->order];
/* Job done if page is free of the right migratetype */
- if (!list_empty(&area->free_list[cc->migratetype]))
+ if (!list_empty(&area->free_list[cc->migratetype].list))
return COMPACT_PARTIAL;

/* Job done if allocation would set block type */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7fd89cd..62d0a9a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -588,12 +588,13 @@ static inline void __free_one_page(struct page *page,
higher_buddy = higher_page + (buddy_idx - combined_idx);
if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
list_add_tail(&page->lru,
- &zone->free_area[order].free_list[migratetype]);
+ &zone->free_area[order].free_list[migratetype].list);
goto out;
}
}

- list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
+ list_add(&page->lru,
+ &zone->free_area[order].free_list[migratetype].list);
out:
zone->free_area[order].nr_free++;
}
@@ -811,7 +812,7 @@ static inline void expand(struct zone *zone, struct page *page,
continue;
}
#endif
- list_add(&page[size].lru, &area->free_list[migratetype]);
+ list_add(&page[size].lru, &area->free_list[migratetype].list);
area->nr_free++;
set_page_order(&page[size], high);
}
@@ -873,10 +874,10 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
/* Find a page of the appropriate size in the preferred list */
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
area = &(zone->free_area[current_order]);
- if (list_empty(&area->free_list[migratetype]))
+ if (list_empty(&area->free_list[migratetype].list))
continue;

- page = list_entry(area->free_list[migratetype].next,
+ page = list_entry(area->free_list[migratetype].list.next,
struct page, lru);
list_del(&page->lru);
rmv_page_order(page);
@@ -946,7 +947,7 @@ int move_freepages(struct zone *zone,

order = page_order(page);
list_move(&page->lru,
- &zone->free_area[order].free_list[migratetype]);
+ &zone->free_area[order].free_list[migratetype].list);
set_freepage_migratetype(page, migratetype);
page += 1 << order;
pages_moved += 1 << order;
@@ -1007,10 +1008,10 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
break;

area = &(zone->free_area[current_order]);
- if (list_empty(&area->free_list[migratetype]))
+ if (list_empty(&area->free_list[migratetype].list))
continue;

- page = list_entry(area->free_list[migratetype].next,
+ page = list_entry(area->free_list[migratetype].list.next,
struct page, lru);
area->nr_free--;

@@ -1274,7 +1275,7 @@ void mark_free_pages(struct zone *zone)
}

for_each_migratetype_order(order, t) {
- list_for_each(curr, &zone->free_area[order].free_list[t]) {
+ list_for_each(curr, &zone->free_area[order].free_list[t].list) {
unsigned long i;

pfn = page_to_pfn(list_entry(curr, struct page, lru));
@@ -3859,7 +3860,7 @@ static void __meminit zone_init_free_lists(struct zone *zone)
{
int order, t;
for_each_migratetype_order(order, t) {
- INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
+ INIT_LIST_HEAD(&zone->free_area[order].free_list[t].list);
zone->free_area[order].nr_free = 0;
}
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..8183331 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -847,7 +847,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,

area = &(zone->free_area[order]);

- list_for_each(curr, &area->free_list[mtype])
+ list_for_each(curr, &area->free_list[mtype].list)
freecount++;
seq_printf(m, "%6lu ", freecount);
}

2012-11-06 19:55:12

by Srivatsa S. Bhat

Subject: [RFC PATCH 6/8] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists

The zones' freelists need to be made region-aware, in order to influence
page allocation and freeing algorithms. So in every free list in the zone, we
would like to demarcate the pageblocks belonging to different memory regions
(we can do this using a set of pointers, and thus avoid splitting up the
freelists).

Also, we would like to keep the pageblocks in the freelists sorted in
region-order. That is, pageblocks belonging to region-0 would come first,
followed by pageblocks belonging to region-1 and so on, within a given
freelist. Of course, a set of pageblocks belonging to the same region need
not be sorted; it is sufficient if we maintain the pageblocks in
region-sorted-order, rather than a full address-sorted-order.

For each freelist within the zone, we maintain a set of pointers to
pageblocks belonging to the various memory regions in that zone.

Eg:

|<---Region0--->| |<---Region1--->| |<-------Region2--------->|
____ ____ ____ ____ ____ ____ ____
--> |____|--> |____|--> |____|--> |____|--> |____|--> |____|--> |____|-->

^ ^ ^
| | |
Reg0 Reg1 Reg2


Page allocation will proceed as usual - pick the first item on the free list.
But we don't want to keep updating these region pointers every time we allocate
a pageblock from the freelist. So, instead of pointing to the *first* pageblock
of that region, we maintain the region pointers such that they point to the
*last* pageblock in that region, as shown in the figure above. That way, as
long as there are > 1 pageblocks in that region in that freelist, that region
pointer doesn't need to be updated.


Page allocation algorithm:
-------------------------

The heart of the page allocation algorithm remains as it is: pick the first
item on the appropriate freelist and return it.


Pageblock order in the zone freelists:
-------------------------------------

This is the main change - we keep the pageblocks in region-sorted order,
where pageblocks belonging to region-0 come first, followed by those belonging
to region-1 and so on. But the pageblocks within a given region need *not* be
sorted, since we need them to be only region-sorted and not fully
address-sorted.

This sorting is performed when adding pages back to the freelists, thus
avoiding any region-related overhead in the critical page allocation
paths.

Page reclaim [Todo]:
--------------------

Page allocation happens in the order of increasing region number. We would
like to do page reclaim in the reverse order, to keep allocated pages within
a minimal number of regions (approximately).

---------------------------- Increasing region number---------------------->

Direction of allocation---> <---Direction of reclaim

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

mm/page_alloc.c | 128 +++++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 113 insertions(+), 15 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 62d0a9a..52ff914 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -502,6 +502,79 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
return 0;
}

+static void add_to_freelist(struct page *page, struct list_head *lru,
+ struct free_list *free_list)
+{
+ struct mem_region_list *region;
+ struct list_head *prev_region_list;
+ int region_id, i;
+
+ region_id = page_zone_region_id(page);
+
+ region = &free_list->mr_list[region_id];
+ region->nr_free++;
+
+ if (region->page_block) {
+ list_add_tail(lru, region->page_block);
+ return;
+ }
+
+ if (!list_empty(&free_list->list)) {
+ for (i = region_id - 1; i >= 0; i--) {
+ if (free_list->mr_list[i].page_block) {
+ prev_region_list =
+ free_list->mr_list[i].page_block;
+ goto out;
+ }
+ }
+ }
+
+ /* This is the first region, so add to the head of the list */
+ prev_region_list = &free_list->list;
+
+out:
+ list_add(lru, prev_region_list);
+
+ /* Save pointer to page block of this region */
+ region->page_block = lru;
+}
+
+static void del_from_freelist(struct page *page, struct list_head *lru,
+ struct free_list *free_list)
+{
+ struct mem_region_list *region;
+ struct list_head *prev_page_lru;
+ int region_id;
+
+ region_id = page_zone_region_id(page);
+ region = &free_list->mr_list[region_id];
+ region->nr_free--;
+
+ if (lru != region->page_block) {
+ list_del(lru);
+ return;
+ }
+
+ prev_page_lru = lru->prev;
+ list_del(lru);
+
+ if (region->nr_free == 0)
+ region->page_block = NULL;
+ else
+ region->page_block = prev_page_lru;
+}
+
+/**
+ * Move pages of a given order from freelist of one migrate-type to another.
+ */
+static void move_pages_freelist(struct page *page, struct list_head *lru,
+ struct free_list *old_list,
+ struct free_list *new_list)
+{
+ del_from_freelist(page, lru, old_list);
+ add_to_freelist(page, lru, new_list);
+}
+
/*
* Freeing function for a buddy system allocator.
*
@@ -534,6 +607,7 @@ static inline void __free_one_page(struct page *page,
unsigned long combined_idx;
unsigned long uninitialized_var(buddy_idx);
struct page *buddy;
+ struct free_area *area;

if (unlikely(PageCompound(page)))
if (unlikely(destroy_compound_page(page, order)))
@@ -561,8 +635,10 @@ static inline void __free_one_page(struct page *page,
__mod_zone_freepage_state(zone, 1 << order,
migratetype);
} else {
- list_del(&buddy->lru);
- zone->free_area[order].nr_free--;
+ area = &zone->free_area[order];
+ del_from_freelist(buddy, &buddy->lru,
+ &area->free_list[migratetype]);
+ area->nr_free--;
rmv_page_order(buddy);
}
combined_idx = buddy_idx & page_idx;
@@ -587,14 +663,23 @@ static inline void __free_one_page(struct page *page,
buddy_idx = __find_buddy_index(combined_idx, order + 1);
higher_buddy = higher_page + (buddy_idx - combined_idx);
if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
- list_add_tail(&page->lru,
- &zone->free_area[order].free_list[migratetype].list);
+
+ /*
+ * Implementing an add_to_freelist_tail() won't be
+ * very useful because both of them (almost) add to
+ * the tail within the region. So we could potentially
+ * switch off this entire "is next-higher buddy free?"
+ * logic when memory regions are used.
+ */
+ area = &zone->free_area[order];
+ add_to_freelist(page, &page->lru,
+ &area->free_list[migratetype]);
goto out;
}
}

- list_add(&page->lru,
- &zone->free_area[order].free_list[migratetype].list);
+ add_to_freelist(page, &page->lru,
+ &zone->free_area[order].free_list[migratetype]);
out:
zone->free_area[order].nr_free++;
}
@@ -812,7 +897,8 @@ static inline void expand(struct zone *zone, struct page *page,
continue;
}
#endif
- list_add(&page[size].lru, &area->free_list[migratetype].list);
+ add_to_freelist(&page[size], &page[size].lru,
+ &area->free_list[migratetype]);
area->nr_free++;
set_page_order(&page[size], high);
}
@@ -879,7 +965,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,

page = list_entry(area->free_list[migratetype].list.next,
struct page, lru);
- list_del(&page->lru);
+ del_from_freelist(page, &page->lru,
+ &area->free_list[migratetype]);
rmv_page_order(page);
area->nr_free--;
expand(zone, page, order, current_order, area, migratetype);
@@ -918,7 +1005,8 @@ int move_freepages(struct zone *zone,
{
struct page *page;
unsigned long order;
- int pages_moved = 0;
+ struct free_area *area;
+ int pages_moved = 0, old_mt;

#ifndef CONFIG_HOLES_IN_ZONE
/*
@@ -946,8 +1034,11 @@ int move_freepages(struct zone *zone,
}

order = page_order(page);
- list_move(&page->lru,
- &zone->free_area[order].free_list[migratetype].list);
+ old_mt = get_freepage_migratetype(page);
+ area = &zone->free_area[order];
+ move_pages_freelist(page, &page->lru,
+ &area->free_list[old_mt],
+ &area->free_list[migratetype]);
set_freepage_migratetype(page, migratetype);
page += 1 << order;
pages_moved += 1 << order;
@@ -1045,7 +1136,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
}

/* Remove the page from the freelists */
- list_del(&page->lru);
+ del_from_freelist(page, &page->lru,
+ &area->free_list[migratetype]);
rmv_page_order(page);

/* Take ownership for orders >= pageblock_order */
@@ -1399,12 +1491,14 @@ int capture_free_page(struct page *page, int alloc_order, int migratetype)
if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
return 0;

+ mt = get_pageblock_migratetype(page);
+
/* Remove page from free list */
- list_del(&page->lru);
+ del_from_freelist(page, &page->lru,
+ &zone->free_area[order].free_list[mt]);
zone->free_area[order].nr_free--;
rmv_page_order(page);

- mt = get_pageblock_migratetype(page);
if (unlikely(mt != MIGRATE_ISOLATE))
__mod_zone_freepage_state(zone, -(1UL << order), mt);

@@ -6040,6 +6134,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
int order, i;
unsigned long pfn;
unsigned long flags;
+ int mt;
+
/* find the first valid pfn */
for (pfn = start_pfn; pfn < end_pfn; pfn++)
if (pfn_valid(pfn))
@@ -6062,7 +6158,9 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
printk(KERN_INFO "remove from free list %lx %d %lx\n",
pfn, 1 << order, end_pfn);
#endif
- list_del(&page->lru);
+ mt = get_freepage_migratetype(page);
+ del_from_freelist(page, &page->lru,
+ &zone->free_area[order].free_list[mt]);
rmv_page_order(page);
zone->free_area[order].nr_free--;
__mod_zone_page_state(zone, NR_FREE_PAGES,

2012-11-06 19:55:32

by Srivatsa S. Bhat

Subject: [RFC PATCH 7/8] mm: Add an optimized version of del_from_freelist to keep page allocation fast

One of the main advantages of this design of memory regions is that page
allocations can potentially be extremely fast, with almost no extra
overhead from memory regions.

To exploit that, introduce an optimized version of del_from_freelist(), which
utilizes the fact that we always delete items from the head of the list
during page allocation.

Basically, we want to keep a note of the region from which we are allocating
in a given freelist, to avoid having to compute the page-to-zone-region
mapping for every page allocation. So introduce a 'next_region' pointer in
every freelist to achieve that, and use it to keep the fastpath of page
allocation almost as fast as it would be without memory regions.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/mm.h | 11 ++++++++++
include/linux/mmzone.h | 6 ++++++
mm/page_alloc.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a817b16..cab8709 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -725,6 +725,17 @@ static inline int page_zone_region_id(const struct page *page)
return pgdat->node_regions[node_region_idx].zone_region_idx[z_num];
}

+static inline void set_next_region_in_freelist(struct free_list *free_list)
+{
+ if (list_empty(&free_list->list))
+ free_list->next_region = NULL;
+ else {
+ do {
+ free_list->next_region++;
+ } while (free_list->next_region->nr_free == 0);
+ }
+}
+
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
static inline void set_page_section(struct page *page, unsigned long section)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index aba4d68..1d20aa1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -91,6 +91,12 @@ struct free_list {
struct list_head list;

/*
+ * Pointer to the region from which the next allocation will be
+ * satisfied. (Same as the freelist's first pageblock's region.)
+ */
+ struct mem_region_list *next_region; /* for fast page allocation */
+
+ /*
* Demarcates pageblocks belonging to different regions within
* this freelist.
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 52ff914..05c1fcf 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -532,6 +532,11 @@ static void add_to_freelist(struct page *page, struct list_head *lru,
/* This is the first region, so add to the head of the list */
prev_region_list = &free_list->list;

+ /*
+ * Set 'next_region' to this region, since this is the first region now
+ */
+ free_list->next_region = region;
+
out:
list_add(lru, prev_region_list);

@@ -539,6 +544,38 @@ out:
region->page_block = lru;
}

+/**
+ * __rmqueue_smallest() *always* deletes elements from the head of the
+ * list. Use this knowledge to keep page allocation fast, despite being
+ * region-aware.
+ *
+ * Do *NOT* call this function if you are deleting from somewhere deep
+ * inside the freelist.
+ */
+static void rmqueue_del_from_freelist(struct list_head *lru,
+ struct free_list *free_list)
+{
+#ifdef CONFIG_DEBUG_PAGEALLOC
+ WARN_ON(free_list->list.next != lru);
+#endif
+
+ list_del(lru);
+
+ /* Fastpath */
+ if (--(free_list->next_region->nr_free))
+ return;
+
+ /*
+ * Slowpath, when this is the last pageblock of this region
+ * in this freelist.
+ */
+ free_list->next_region->page_block = NULL;
+
+ /* Set 'next_region' to the new first region in the freelist. */
+ set_next_region_in_freelist(free_list);
+}
+
+/* Generic delete function for region-aware buddy allocator. */
static void del_from_freelist(struct page *page, struct list_head *lru,
struct free_list *free_list)
{
@@ -546,6 +583,11 @@ static void del_from_freelist(struct page *page, struct list_head *lru,
struct list_head *prev_page_lru;
int region_id;

+
+ /* Try to fastpath, if deleting from the head of the list */
+ if (lru == free_list->list.next)
+ return rmqueue_del_from_freelist(lru, free_list);
+
region_id = page_zone_region_id(page);
region = &free_list->mr_list[region_id];
region->nr_free--;
@@ -558,6 +600,11 @@ static void del_from_freelist(struct page *page, struct list_head *lru,
prev_page_lru = lru->prev;
list_del(lru);

+ /*
+ * Since we are not deleting from the head of the list, the
+ * 'next_region' pointer doesn't have to change.
+ */
+
if (region->nr_free == 0)
region->page_block = NULL;
else
@@ -965,8 +1012,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,

page = list_entry(area->free_list[migratetype].list.next,
struct page, lru);
- del_from_freelist(page, &page->lru,
- &area->free_list[migratetype]);
+ rmqueue_del_from_freelist(&page->lru,
+ &area->free_list[migratetype]);
rmv_page_order(page);
area->nr_free--;
expand(zone, page, order, current_order, area, migratetype);

2012-11-06 19:55:47

by Srivatsa S. Bhat

Subject: [RFC PATCH 8/8] mm: Print memory region statistics to understand the buddy allocator behavior

In order to observe the behavior of the region-aware buddy allocator, modify
vmstat.c to also print memory region related statistics. In particular, enable
memory region-related info in /proc/zoneinfo and /proc/buddyinfo, since they
would help us to at least (roughly) see how the new buddy allocator is
performing.

For now, the region statistics correspond to the zone memory regions and not
the (absolute) node memory regions, and some of the statistics (especially the
no. of present pages) might not be very accurate. But since we account for
and print the free page statistics for every zone memory region accurately, we
should be able to observe the new page allocator behavior to a reasonable
degree of accuracy.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

mm/vmstat.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 53 insertions(+), 4 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8183331..cbcd373 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -812,11 +812,31 @@ const char * const vmstat_text[] = {
static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
struct zone *zone)
{
- int order;
+ int i, order, t;
+ struct free_area *area;

- seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
- for (order = 0; order < MAX_ORDER; ++order)
- seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+ seq_printf(m, "Node %d, zone %8s \n", pgdat->node_id, zone->name);
+
+ for (i = 0; i < zone->nr_zone_regions; i++) {
+
+ seq_printf(m, "\t\t Region %d ", i);
+
+ for (order = 0; order < MAX_ORDER; ++order) {
+ unsigned long nr_free = 0;
+
+ area = &zone->free_area[order];
+
+ for (t = 0; t < MIGRATE_TYPES; t++) {
+ if (t == MIGRATE_ISOLATE ||
+ t == MIGRATE_RESERVE)
+ continue;
+ nr_free +=
+ area->free_list[t].mr_list[i].nr_free;
+ }
+ seq_printf(m, "%6lu ", nr_free);
+ }
+ seq_putc(m, '\n');
+ }
seq_putc(m, '\n');
}

@@ -984,6 +1004,8 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
struct zone *zone)
{
int i;
+ unsigned long zone_nr_free = 0;
+
seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
seq_printf(m,
"\n pages free %lu"
@@ -1001,6 +1023,33 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
zone->spanned_pages,
zone->present_pages);

+ for (i = 0; i < zone->nr_zone_regions; i++) {
+ int order, t;
+ unsigned long nr_free = 0;
+ struct free_area *area = zone->free_area;
+
+ for_each_migratetype_order(order, t) {
+ if (t == MIGRATE_ISOLATE || t == MIGRATE_RESERVE)
+ continue;
+ nr_free +=
+ area[order].free_list[t].mr_list[i].nr_free
+ * (1UL << order);
+ }
+ seq_printf(m, "\n\nZone mem region %d", i);
+ seq_printf(m,
+ "\n pages spanned %lu"
+ "\n present %lu"
+ "\n free %lu",
+ zone->zone_mem_region[i].spanned_pages,
+ zone->zone_mem_region[i].present_pages,
+ nr_free);
+ }
+
+ for (i = 0; i < MAX_ORDER; i++)
+ zone_nr_free += zone->free_area[i].nr_free * (1UL << i);
+
+ seq_printf(m, "\nZone pages nr_free %lu\n", zone_nr_free);
+
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
seq_printf(m, "\n %-12s %lu", vmstat_text[i],
zone_page_state(zone, i));

2012-11-06 21:49:41

by Dave Hansen

Subject: Re: [RFC PATCH 6/8] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists

On 11/06/2012 11:53 AM, Srivatsa S. Bhat wrote:
> This is the main change - we keep the pageblocks in region-sorted order,
> where pageblocks belonging to region-0 come first, followed by those belonging
> to region-1 and so on. But the pageblocks within a given region need *not* be
> sorted, since we need them to be only region-sorted and not fully
> address-sorted.
>
> This sorting is performed when adding pages back to the freelists, thus
> avoiding any region-related overhead in the critical page allocation
> paths.

It's probably _better_ to do it at free time than alloc, but it's still
pretty bad to be doing a linear walk over a potentially 256-entry array
holding the zone lock. The overhead is going to show up somewhere. How
does this do with a kernel compile? Looks like exit() when a process
has a bunch of memory might get painful.

2012-11-06 23:04:29

by Dave Hansen

Subject: Re: [RFC PATCH 1/8] mm: Introduce memory regions data-structure to capture region boundaries within node

On 11/06/2012 11:52 AM, Srivatsa S. Bhat wrote:
> But of course, memory regions are sub-divisions *within* a node, so it makes
> sense to keep the data-structures in the node's struct pglist_data. (Thus
> this placement makes memory regions parallel to zones in that node).

I think it's pretty silly to create *ANOTHER* subdivision of memory
separate from sparsemem. One that doesn't handle large amounts of
memory or scale with memory hotplug. As it stands, you can only support
256*512MB=128GB of address space, which seems pretty puny.

This node_regions[]:

> @@ -687,6 +698,8 @@ typedef struct pglist_data {
> struct zone node_zones[MAX_NR_ZONES];
> struct zonelist node_zonelists[MAX_ZONELISTS];
> int nr_zones;
> + struct node_mem_region node_regions[MAX_NR_REGIONS];
> + int nr_node_regions;
> #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
> struct page *node_mem_map;
> #ifdef CONFIG_MEMCG

looks like it's indexed the same way regardless of which node it is in.
In other words, if there are two nodes, at least half of it is wasted,
and 3/4 if there are four nodes. That seems a bit suboptimal.

Could you remind us of the logic for leaving sparsemem out of the
equation here?

2012-11-07 20:14:10

by Srivatsa S. Bhat

Subject: Re: [RFC PATCH 1/8] mm: Introduce memory regions data-structure to capture region boundaries within node

On 11/07/2012 04:33 AM, Dave Hansen wrote:
> On 11/06/2012 11:52 AM, Srivatsa S. Bhat wrote:
>> But of course, memory regions are sub-divisions *within* a node, so it makes
>> sense to keep the data-structures in the node's struct pglist_data. (Thus
>> this placement makes memory regions parallel to zones in that node).
>
> I think it's pretty silly to create *ANOTHER* subdivision of memory
> separate from sparsemem. One that doesn't handle large amounts of
> memory or scale with memory hotplug. As it stands, you can only support
> 256*512MB=128GB of address space, which seems pretty puny.
>
> This node_regions[]:
>
>> @@ -687,6 +698,8 @@ typedef struct pglist_data {
>> struct zone node_zones[MAX_NR_ZONES];
>> struct zonelist node_zonelists[MAX_ZONELISTS];
>> int nr_zones;
>> + struct node_mem_region node_regions[MAX_NR_REGIONS];
>> + int nr_node_regions;
>> #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
>> struct page *node_mem_map;
>> #ifdef CONFIG_MEMCG
>
> looks like it's indexed the same way regardless of which node it is in.
> In other words, if there are two nodes, at least half of it is wasted,
> and 3/4 if there are four nodes. That seems a bit suboptimal.
>

You're right, I have not addressed that problem in this initial RFC. Thanks
for pointing it out! Going forward, we can surely optimize the way we deal
with memory regions on NUMA systems, using some of the sparsemem techniques.

> Could you remind us of the logic for leaving sparsemem out of the
> equation here?
>

Nothing, it's just that in this first RFC I was more focused on getting
the overall design right, in terms of having an acceptable way of tracking
pages belonging to different regions within the page allocator (freelists)
and using it to influence page allocation decisions. And also to compare
the merits of this approach over the previous "Hierarchy" design, in a broad
("big picture") sense.

I'll add the above point you raised in my todo-list and address it in
subsequent versions of the patchset.

Thank you very much for the quick feedback!

Regards,
Srivatsa S. Bhat

2012-11-07 20:16:48

by Srivatsa S. Bhat

Subject: Re: [RFC PATCH 6/8] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists

On 11/07/2012 03:19 AM, Dave Hansen wrote:
> On 11/06/2012 11:53 AM, Srivatsa S. Bhat wrote:
>> This is the main change - we keep the pageblocks in region-sorted order,
>> where pageblocks belonging to region-0 come first, followed by those belonging
>> to region-1 and so on. But the pageblocks within a given region need *not* be
>> sorted, since we need them to be only region-sorted and not fully
>> address-sorted.
>>
>> This sorting is performed when adding pages back to the freelists, thus
>> avoiding any region-related overhead in the critical page allocation
>> paths.
>
> It's probably _better_ to do it at free time than alloc, but it's still
> pretty bad to be doing a linear walk over a potentially 256-entry array
> holding the zone lock. The overhead is going to show up somewhere. How
> does this do with a kernel compile? Looks like exit() when a process
> has a bunch of memory might get painful.
>

As I mentioned in the cover-letter, kernbench numbers haven't shown any
observable performance degradation. On the contrary (as unbelievable as it
may sound), they actually indicate a slight performance *improvement* with my
patchset! I'm trying to figure out what could be the reason behind that.

Going forward, we could try to optimize the sorting logic in the free()
part, but in any case, IMHO that's the right place to push the overhead to,
since the performance of free() is not expected to be _that_ critical (unlike
alloc()) for overall system performance.

Regards,
Srivatsa S. Bhat

2012-11-08 18:03:09

by Mel Gorman

Subject: Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
> ------------------------------------------------------------
>
> Today memory subsystems are offer a wide range of capabilities for managing
> memory power consumption. As a quick example, if a block of memory is not
> referenced for a threshold amount of time, the memory controller can decide to
> put that chunk into a low-power content-preserving state. And the next
> reference to that memory chunk would bring it back to full power for read/write.
> With this capability in place, it becomes important for the OS to understand
> the boundaries of such power-manageable chunks of memory and to ensure that
> references are consolidated to a minimum number of such memory power management
> domains.
>

How much power is saved?

> ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
> the firmware can expose information regarding the boundaries of such memory
> power management domains to the OS in a standard way.
>

I'm not familiar with the ACPI spec but is there support for parsing of
MPST and interpreting the associated ACPI events? For example, if ACPI
fires an event indicating that a memory power node is to enter a low
state then presumably the OS should actively migrate pages away -- even
if it's going into a state where the contents are still refreshed
as exiting that state could take a long time.

I did not look closely at the patchset at all because it looked like the
actual support to use it and measure the benefit is missing.

> How can Linux VM help memory power savings?
>
> o Consolidate memory allocations and/or references such that they are
> not spread across the entire memory address space. Basically area of memory
> that is not being referenced, can reside in low power state.
>

Which the series does not appear to do.

> o Support targeted memory reclaim, where certain areas of memory that can be
> easily freed can be offlined, allowing those areas of memory to be put into
> lower power states.
>

Which the series does not appear to do, judging from this:

include/linux/mm.h | 38 +++++++
include/linux/mmzone.h | 52 +++++++++
mm/compaction.c | 8 +
mm/page_alloc.c | 263 ++++++++++++++++++++++++++++++++++++++++++++----
mm/vmstat.c | 59 ++++++++++-

This does not appear to be doing anything with reclaim and not enough with
compaction to indicate that the series actively manages memory placement
in response to ACPI events.

Further, in section 5.2.21.4 the spec says that power node regions can
overlap (but are not hierarchical for some reason) but have no gaps, yet the
structure you use to represent them assumes there can be gaps and no
overlaps. Again, this is just glancing at the spec and a quick skim of
the patches, so maybe I missed something that explains why this structure
is suitable.

It seems to me that superficially the VM implementation for the support
would have

a) Involved a tree that managed the overlapping regions (even if it's
not hierarchical, it feels more sensible) and picked the highest-power-state
common denominator in the tree. This would only be allocated if support
for MPST is available.
b) Leave memory allocations and reclaim as they are in the active state.
c) Use a "sticky" migrate list MIGRATE_LOWPOWER for regions that are in lower
power but still usable with a latency penalty. This might be a single
migrate type but could also be a parallel set of free_area called
free_area_lowpower that is only used when free_area is depleted and in
the very slow path of the allocator.
d) Use memory hot-remove for power states where the refresh rates were
not constant

and only did anything expensive in response to an ACPI event -- none of
the fast paths should be touched.

When transitioning to the low power state, memory should be migrated in
a vaguely similar fashion to what CMA does. For low-power, migration
failure is acceptable. If contents are not preserved, ACPI needs to know
if the migration failed because it cannot enter that power state.

For any of this to be worthwhile, low power states would need to be achieved
for long periods of time because that migration is not free.

> Memory Regions:
> ---------------
>
> "Memory Regions" is a way of capturing the boundaries of power-managable
> chunks of memory, within the MM subsystem.
>
> Short description of the "Sorted-buddy" design:
> -----------------------------------------------
>
> In this design, the memory region boundaries are captured in a parallel
> data-structure instead of fitting regions between nodes and zones in the
> hierarchy. Further, the buddy allocator is altered, such that we maintain the
> zones' freelists in region-sorted-order and thus do page allocation in the
> order of increasing memory regions.

Implying that this sorting has to happen in the either the alloc or free
fast path.

> (The freelists need not be fully
> address-sorted, they just need to be region-sorted. Patch 6 explains this
> in more detail).
>
> The idea is to do page allocation in increasing order of memory regions
> (within a zone) and perform page reclaim in the reverse order, as illustrated
> below.
>
> ---------------------------- Increasing region number---------------------->
>
> Direction of allocation---> <---Direction of reclaim
>

Compaction will work against this because it uses a PFN walker to isolate
free pages and will ignore memory regions. If pageblocks were used, it
could take that into account at least.

> The sorting logic (to maintain freelist pageblocks in region-sorted-order)
> lies in the page-free path and not the page-allocation path and hence the
> critical page allocation paths remain fast.

Page free can be a critical path for application performance as well.
Think network buffer heavy alloc and freeing of buffers.

However, migratetype information is already looked up for THP so ideally
power awareness would piggyback on it.

> Moreover, the heart of the page
> allocation algorithm itself remains largely unchanged, and the region-related
> data-structures are optimized to avoid unnecessary updates during the
> page-allocator's runtime.
>
> Advantages of this design:
> --------------------------
> 1. No zone-fragmentation (IOW, we don't create more zones than necessary) and
> hence we avoid its associated problems (like too many zones, extra page
> reclaim threads, question of choosing watermarks etc).
> [This is an advantage over the "Hierarchy" design]
>
> 2. Performance overhead is expected to be low: Since we retain the simplicity
> of the algorithm in the page allocation path, page allocation can
> potentially remain as fast as it would be without memory regions. The
> overhead is pushed to the page-freeing paths which are not that critical.
>
>
> Results:
> =======
>
> Test setup:
> -----------
> This patchset applies cleanly on top of 3.7-rc3.
>
> x86 dual-socket quad core HT-enabled machine booted with mem=8G
> Memory region size = 512 MB
>
> Functional testing:
> -------------------
>
> Ran pagetest, a simple C program that allocates and touches a required number
> of pages.
>
> Below is the statistics from the regions within ZONE_NORMAL, at various sizes
> of allocations from pagetest.
>
> Present pages | Free pages at various allocations |
> | start | 512 MB | 1024 MB | 2048 MB |
> Region 0 16 | 0 | 0 | 0 | 0 |
> Region 1 131072 | 87219 | 8066 | 7892 | 7387 |
> Region 2 131072 | 131072 | 79036 | 0 | 0 |
> Region 3 131072 | 131072 | 131072 | 79061 | 0 |
> Region 4 131072 | 131072 | 131072 | 131072 | 0 |
> Region 5 131072 | 131072 | 131072 | 131072 | 79051 |
> Region 6 131072 | 131072 | 131072 | 131072 | 131072 |
> Region 7 131072 | 131072 | 131072 | 131072 | 131072 |
> Region 8 131056 | 105475 | 105472 | 105472 | 105472 |
>
> This shows that page allocation occurs in the order of increasing region
> numbers, as intended in this design.
>
> Performance impact:
> -------------------
>
> Kernbench results didn't show much of a difference between the performance
> of vanilla 3.7-rc3 and this patchset.
>
>
> Todos:
> =====
>
> 1. Memory-region aware page-reclamation:
> ----------------------------------------
>
> We would like to do page reclaim in the reverse order of page allocation
> within a zone, ie., in the order of decreasing region numbers.
> To achieve that, while scanning lru pages to reclaim, we could potentially
> look for pages belonging to higher regions (considering region boundaries)
> or perhaps simply prefer pages of higher pfns (and skip lower pfns) as
> reclaim candidates.
>

This would disrupt LRU ordering, and if those pages were recently
allocated and you force a situation where swap has to be used, then any
saving in low memory will be lost by having to access the disk instead.

> 2. Compile-time exclusion of Memory Power Management, and extending the
> support to also work with other features such as Mem cgroups, kexec etc.
>

Compile-time exclusion is pointless because it'll always be activated by
distribution configs. Support for MPST should be detected at runtime and

3. ACPI support to actually use this thing and validate the design is
compatible with the spec and actually works in hardware

--
Mel Gorman
SUSE Labs

2012-11-08 19:39:38

by Srivatsa S. Bhat

Subject: Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

On 11/08/2012 11:32 PM, Mel Gorman wrote:
> On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
>> ------------------------------------------------------------
>>
>> Today memory subsystems are offer a wide range of capabilities for managing
>> memory power consumption. As a quick example, if a block of memory is not
>> referenced for a threshold amount of time, the memory controller can decide to
>> put that chunk into a low-power content-preserving state. And the next
>> reference to that memory chunk would bring it back to full power for read/write.
>> With this capability in place, it becomes important for the OS to understand
>> the boundaries of such power-manageable chunks of memory and to ensure that
>> references are consolidated to a minimum number of such memory power management
>> domains.
>>
>
> How much power is saved?

Last year, Amit had evaluated the "Hierarchy" patchset on a Samsung Exynos (ARM)
board and reported that it could save up to 6.3% relative to total system power.
(This was when he allowed only 1 GB out of the total 2 GB RAM to enter low
power states).

Below is the link to his post, as mentioned in the references section in the
cover letter.
http://article.gmane.org/gmane.linux.kernel.mm/65935

Of course, the power savings depends on the characteristics of the particular
hardware memory subsystem used, and the amount of memory present in the system.

>
>> ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
>> the firmware can expose information regarding the boundaries of such memory
>> power management domains to the OS in a standard way.
>>
>
> I'm not familiar with the ACPI spec but is there support for parsing of
> MPST and interpreting the associated ACPI events?

Sorry I should have been clearer when I mentioned ACPI 5.0. I mentioned ACPI 5.0
just to make a point that support for getting the memory power management
boundaries from the firmware is not far away. I didn't mean to say that that's
the only target for memory power management. Like I mentioned above, last year
the power-savings benefit was measured on ARM boards. The aim of this patchset
is to propose and evaluate some of the core VM algorithms that we will need
to efficiently exploit the power management features offered by the memory
subsystems.

IOW, info regarding memory power domain boundaries made available by ACPI 5.0
or even just with some help from the bootloader on some platforms is only the
input to the VM subsystem to understand at what granularity it should manage
things. *How* it manages is the choice of the algorithm/design at the VM level,
which is what this patchset is trying to propose, by exploring several different
designs of doing it and its costs/benefits.

That's the reason I just hard-coded mem region size to 512 MB in this patchset
and focussed on the VM algorithm to explore what we can do, once we have that
size/boundary info.

> For example, if ACPI
> fires an event indicating that a memory power node is to enter a low
> state then presumably the OS should actively migrate pages away -- even
> if it's going into a state where the contents are still refreshed
> as exiting that state could take a long time.
>

We are not really looking at ACPI event notifications here. All we expect from
the firmware (at a first level) is info regarding the boundaries, so that the
VM can be intelligent about how it consolidates references. Many memory
subsystems can do power-management automatically - for example, if a
particular chunk of memory is not referenced for a given threshold time, the
hardware can put it into a low-power (content-preserving) state without the
OS asking for it.

> I did not look closely at the patchset at all because it looked like the
> actual support to use it and measure the benefit is missing.
>

Right, we are focussing on the core VM algorithms for now. The input (ACPI or
other methods) can come later and then we can measure the numbers.

>> How can Linux VM help memory power savings?
>>
>> o Consolidate memory allocations and/or references such that they are
>> not spread across the entire memory address space. Basically area of memory
>> that is not being referenced, can reside in low power state.
>>
>
> Which the series does not appear to do.
>

Well, it influences page-allocation to be memory-region aware. So it does
attempt to consolidate allocations (and thereby references). As I mentioned,
the hardware transition to a low-power state can be automatic. The VM must be
intelligent enough to help with that (or at least smart enough not to disrupt
it!), by not spreading allocations all over the address space.

>> o Support targeted memory reclaim, where certain areas of memory that can be
>> easily freed can be offlined, allowing those areas of memory to be put into
>> lower power states.
>>
>
> Which the series does not appear to do judging from this;
>

Yes, that is one of the items in the TODO list.

> include/linux/mm.h | 38 +++++++
> include/linux/mmzone.h | 52 +++++++++
> mm/compaction.c | 8 +
> mm/page_alloc.c | 263 ++++++++++++++++++++++++++++++++++++++++++++----
> mm/vmstat.c | 59 ++++++++++-
>
> This does not appear to be doing anything with reclaim and not enough with
> compaction to indicate that the series actively manages memory placement
> in response to ACPI events.
>
> Further in section 5.2.21.4 the spec says that power node regions can
> overlap (but are not hierarchal for some reason) but have no gaps yet the
> structure you use to represent is assumes there can be gaps and there are
> no overlaps. Again, this is just glancing at the spec and a quick skim of
> the patches so maybe I missed something that explains why this structure
> is suitable.
>

Right, we might need a better way to handle the various possibilities of
the layout of memory regions in the hardware. But this initial RFC tried to
focus on what we do with that info inside the VM to aid with power
management.

> It seems to me that superficially the VM implementation for the support
> would have
>
> a) Involved a tree that managed the overlapping regions (even if it's
> not hierarchal it feels more sensible) and picked the highest-power-state
> common denominator in the tree. This would only be allocated if support
> for MPST is available.
> b) Leave memory allocations and reclaim as they are in the active state.
> c) Use a "sticky" migrate list MIGRATE_LOWPOWER for regions that are in lower
> power but still usable with a latency penalty. This might be a single
> migrate type but could also be a parallel set of free_area called
> free_area_lowpower that is only used when free_area is depleted and in
> the very slow path of the allocator.
> d) Use memory hot-remove for power states where the refresh rates were
> not constant
>
> and only did anything expensive in response to an ACPI event -- none of
> the fast paths should be touched.
>
> When transitioning to the low power state, memory should be migrated in
> a vaguely similar fashion to what CMA does. For low-power, migration
> failure is acceptable. If contents are not preserved, ACPI needs to know
> if the migration failed because it cannot enter that power state.
>

As I mentioned, we are not really talking about reacting to ACPI events here.
The idea behind this patchset is to have efficient VM algorithms that can
shape memory references depending on power-management boundaries exposed by
the firmware. With that as the goal, I feel we should not even consider
migration as a first step - we should rather consider how to shape allocations
such that we can remain power-efficient right from the beginning and
throughout the runtime, without needing to migrate if possible. And this
patchset implements one of the designs that achieves that.

> For any of this to be worthwhile, low power states would need to be achieved
> for long periods of time because that migration is not free.
>

Best to avoid migration as far as possible in the first place :-)

>> Memory Regions:
>> ---------------
>>
>> "Memory Regions" is a way of capturing the boundaries of power-managable
>> chunks of memory, within the MM subsystem.
>>
>> Short description of the "Sorted-buddy" design:
>> -----------------------------------------------
>>
>> In this design, the memory region boundaries are captured in a parallel
>> data-structure instead of fitting regions between nodes and zones in the
>> hierarchy. Further, the buddy allocator is altered, such that we maintain the
>> zones' freelists in region-sorted-order and thus do page allocation in the
>> order of increasing memory regions.
>
> Implying that this sorting has to happen in the either the alloc or free
> fast path.
>

Yep, I have moved it to the free path. The alloc path remains fast.

>> (The freelists need not be fully
>> address-sorted, they just need to be region-sorted. Patch 6 explains this
>> in more detail).
>>
>> The idea is to do page allocation in increasing order of memory regions
>> (within a zone) and perform page reclaim in the reverse order, as illustrated
>> below.
>>
>> ---------------------------- Increasing region number---------------------->
>>
>> Direction of allocation---> <---Direction of reclaim
>>
>
> Compaction will work against this because it uses a PFN walker to isolate
> free pages and will ignore memory regions. If pageblocks were used, it
> could take that into account at least.
>
>> The sorting logic (to maintain freelist pageblocks in region-sorted-order)
>> lies in the page-free path and not the page-allocation path and hence the
>> critical page allocation paths remain fast.
>
> Page free can be a critical path for application performance as well.
> Think network buffer heavy alloc and freeing of buffers.
>
> However, migratetype information is already looked up for THP so ideally
> power awareness would piggyback on it.
>
>> Moreover, the heart of the page
>> allocation algorithm itself remains largely unchanged, and the region-related
>> data-structures are optimized to avoid unnecessary updates during the
>> page-allocator's runtime.
>>
>> Advantages of this design:
>> --------------------------
>> 1. No zone-fragmentation (IOW, we don't create more zones than necessary) and
>> hence we avoid its associated problems (like too many zones, extra page
>> reclaim threads, question of choosing watermarks etc).
>> [This is an advantage over the "Hierarchy" design]
>>
>> 2. Performance overhead is expected to be low: Since we retain the simplicity
>> of the algorithm in the page allocation path, page allocation can
>> potentially remain as fast as it would be without memory regions. The
>> overhead is pushed to the page-freeing paths which are not that critical.
>>
>>
>> Results:
>> =======
>>
>> Test setup:
>> -----------
>> This patchset applies cleanly on top of 3.7-rc3.
>>
>> x86 dual-socket quad core HT-enabled machine booted with mem=8G
>> Memory region size = 512 MB
>>
>> Functional testing:
>> -------------------
>>
>> Ran pagetest, a simple C program that allocates and touches a required number
>> of pages.
>>
>> Below is the statistics from the regions within ZONE_NORMAL, at various sizes
>> of allocations from pagetest.
>>
>> Present pages | Free pages at various allocations |
>> | start | 512 MB | 1024 MB | 2048 MB |
>> Region 0 16 | 0 | 0 | 0 | 0 |
>> Region 1 131072 | 87219 | 8066 | 7892 | 7387 |
>> Region 2 131072 | 131072 | 79036 | 0 | 0 |
>> Region 3 131072 | 131072 | 131072 | 79061 | 0 |
>> Region 4 131072 | 131072 | 131072 | 131072 | 0 |
>> Region 5 131072 | 131072 | 131072 | 131072 | 79051 |
>> Region 6 131072 | 131072 | 131072 | 131072 | 131072 |
>> Region 7 131072 | 131072 | 131072 | 131072 | 131072 |
>> Region 8 131056 | 105475 | 105472 | 105472 | 105472 |
>>
>> This shows that page allocation occurs in the order of increasing region
>> numbers, as intended in this design.
>>
>> Performance impact:
>> -------------------
>>
>> Kernbench results didn't show much of a difference between the performance
>> of vanilla 3.7-rc3 and this patchset.
>>
>>
>> Todos:
>> =====
>>
>> 1. Memory-region aware page-reclamation:
>> ----------------------------------------
>>
>> We would like to do page reclaim in the reverse order of page allocation
>> within a zone, ie., in the order of decreasing region numbers.
>> To achieve that, while scanning lru pages to reclaim, we could potentially
>> look for pages belonging to higher regions (considering region boundaries)
>> or perhaps simply prefer pages of higher pfns (and skip lower pfns) as
>> reclaim candidates.
>>
>
> This would disrupting LRU ordering and if those pages were recently
> allocated and you force a situation where swap has to be used then any
> saving in low memory will be lost by having to access the disk instead.
>

Right, we need to do it in a way that doesn't hurt performance or power-savings.
I definitely need to think more on this.. Any suggestions?

>> 2. Compile-time exclusion of Memory Power Management, and extending the
>> support to also work with other features such as Mem cgroups, kexec etc.
>>
>
> Compile-time exclusion is pointless because it'll be always activated by
> distribution configs. Support for MPST should be detected at runtime and
>
> 3. ACPI support to actually use this thing and validate the design is
> compatible with the spec and actually works in hardware
>

ACPI is not the only way to exploit this; other platforms (like ARM, for
example) can expose the info today with some help from the bootloader, and as
mentioned, Amit already did a quick evaluation last year. So it's not like we
are totally blocked on ACPI support in order to design the VM algorithms to
manage memory power-efficiently.

Thanks a lot for taking a look and for your invaluable feedback!

Regards,
Srivatsa S. Bhat

2012-11-09 05:15:31

by Vaidyanathan Srinivasan

Subject: Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

* Mel Gorman <[email protected]> [2012-11-08 18:02:57]:

> On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
> > ------------------------------------------------------------

Hi Mel,

Thanks for the detailed review and comments. The goal of this patch
series is to brainstorm on ideas that enable Linux VM to record and
exploit memory region boundaries.

The first approach that we had last year (hierarchy) has more runtime
overhead. This sorted-buddy approach was one of the alternatives discussed
earlier, and we are trying to find out if the simple requirement of biasing
memory allocations can be achieved with this approach.

Smart reclaim based on this approach is a key piece we still need to
design. Ideas from compaction will certainly help.

> > Today memory subsystems are offer a wide range of capabilities for managing
> > memory power consumption. As a quick example, if a block of memory is not
> > referenced for a threshold amount of time, the memory controller can decide to
> > put that chunk into a low-power content-preserving state. And the next
> > reference to that memory chunk would bring it back to full power for read/write.
> > With this capability in place, it becomes important for the OS to understand
> > the boundaries of such power-manageable chunks of memory and to ensure that
> > references are consolidated to a minimum number of such memory power management
> > domains.
> >
>
> How much power is saved?

On embedded platforms the savings could be around 5% as discussed in
the earlier thread: http://article.gmane.org/gmane.linux.kernel.mm/65935

On larger servers with large amounts of memory the savings could be
more. We do not yet have all the pieces together to evaluate.

> > ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
> > the firmware can expose information regarding the boundaries of such memory
> > power management domains to the OS in a standard way.
> >
>
> I'm not familiar with the ACPI spec but is there support for parsing of
> MPST and interpreting the associated ACPI events? For example, if ACPI
> fires an event indicating that a memory power node is to enter a low
> state then presumably the OS should actively migrate pages away -- even
> if it's going into a state where the contents are still refreshed
> as exiting that state could take a long time.
>
> I did not look closely at the patchset at all because it looked like the
> actual support to use it and measure the benefit is missing.

Correct. The platform interface part is not included in this patch
set mainly because there is not much design required there. Each
platform can have code to collect the memory region boundaries from
the BIOS/firmware and load them into the Linux VM. The goal of this
patch set is to brainstorm on how the core VM should use the region
information.

> > How can Linux VM help memory power savings?
> >
> > o Consolidate memory allocations and/or references such that they are
> > not spread across the entire memory address space. Basically area of memory
> > that is not being referenced, can reside in low power state.
> >
>
> Which the series does not appear to do.

Correct. We need to design the correct reclaim strategy for this to
work. However, having the buddy lists sorted by region could get us
one step closer to shaping the allocations.

> > o Support targeted memory reclaim, where certain areas of memory that can be
> > easily freed can be offlined, allowing those areas of memory to be put into
> > lower power states.
> >
>
> Which the series does not appear to do judging from this;
>
> include/linux/mm.h | 38 +++++++
> include/linux/mmzone.h | 52 +++++++++
> mm/compaction.c | 8 +
> mm/page_alloc.c | 263 ++++++++++++++++++++++++++++++++++++++++++++----
> mm/vmstat.c | 59 ++++++++++-
>
> This does not appear to be doing anything with reclaim and not enough with
> compaction to indicate that the series actively manages memory placement
> in response to ACPI events.

Correct. Evaluating different ideas for reclaim will be next step
before getting into the platform interface parts.

> Further in section 5.2.21.4 the spec says that power node regions can
> overlap (but are not hierarchal for some reason) but have no gaps yet the
> structure you use to represent is assumes there can be gaps and there are
> no overlaps. Again, this is just glancing at the spec and a quick skim of
> the patches so maybe I missed something that explains why this structure
> is suitable.

This patch set is roughly based on the idea that ACPI MPST will give us
the memory region boundaries. It is not designed to implement all the
options defined in the spec. We have taken the general case where regions
do not overlap, while the memory addresses themselves can be discontiguous.

> It seems to me that superficially the VM implementation for the support
> would have
>
> a) Involved a tree that managed the overlapping regions (even if it's
> not hierarchal it feels more sensible) and picked the highest-power-state
> common denominator in the tree. This would only be allocated if support
> for MPST is available.
> b) Leave memory allocations and reclaim as they are in the active state.
> c) Use a "sticky" migrate list MIGRATE_LOWPOWER for regions that are in lower
> power but still usable with a latency penalty. This might be a single
> migrate type but could also be a parallel set of free_area called
> free_area_lowpower that is only used when free_area is depleted and in
> the very slow path of the allocator.
> d) Use memory hot-remove for power states where the refresh rates were
> not constant
>
> and only did anything expensive in response to an ACPI event -- none of
> the fast paths should be touched.
>
> When transitioning to the low power state, memory should be migrated in
> a vaguely similar fashion to what CMA does. For low-power, migration
> failure is acceptable. If contents are not preserved, ACPI needs to know
> if the migration failed because it cannot enter that power state.
>
> For any of this to be worthwhile, low power states would need to be achieved
> for long periods of time because that migration is not free.

In this patch series we are assuming the simple case where the hardware
manages the actual power states and the OS facilitates them by keeping
the allocations within a smaller number of memory regions. As we keep
allocations and references confined to fewer regions, it becomes case (c)
above. We are addressing only a small subset of the above list.

> > Memory Regions:
> > ---------------
> >
> > "Memory Regions" is a way of capturing the boundaries of power-managable
> > chunks of memory, within the MM subsystem.
> >
> > Short description of the "Sorted-buddy" design:
> > -----------------------------------------------
> >
> > In this design, the memory region boundaries are captured in a parallel
> > data-structure instead of fitting regions between nodes and zones in the
> > hierarchy. Further, the buddy allocator is altered, such that we maintain the
> > zones' freelists in region-sorted-order and thus do page allocation in the
> > order of increasing memory regions.
>
> Implying that this sorting has to happen in the either the alloc or free
> fast path.

Yes, in the free path. This optimization can actually be delayed in
the free fast path, and completely avoided if memory is full and we
are doing direct reclaim during allocations.

> > (The freelists need not be fully
> > address-sorted, they just need to be region-sorted. Patch 6 explains this
> > in more detail).
> >
> > The idea is to do page allocation in increasing order of memory regions
> > (within a zone) and perform page reclaim in the reverse order, as illustrated
> > below.
> >
> > ---------------------------- Increasing region number---------------------->
> >
> > Direction of allocation---> <---Direction of reclaim
> >
>
> Compaction will work against this because it uses a PFN walker to isolate
> free pages and will ignore memory regions. If pageblocks were used, it
> could take that into account at least.
>
> > The sorting logic (to maintain freelist pageblocks in region-sorted-order)
> > lies in the page-free path and not the page-allocation path and hence the
> > critical page allocation paths remain fast.
>
> Page free can be a critical path for application performance as well.
> Think network buffer heavy alloc and freeing of buffers.
>
> However, migratetype information is already looked up for THP so ideally
> power awareness would piggyback on it.
>
> > Moreover, the heart of the page
> > allocation algorithm itself remains largely unchanged, and the region-related
> > data-structures are optimized to avoid unnecessary updates during the
> > page-allocator's runtime.
> >
> > Advantages of this design:
> > --------------------------
> > 1. No zone-fragmentation (IOW, we don't create more zones than necessary) and
> > hence we avoid its associated problems (like too many zones, extra page
> > reclaim threads, question of choosing watermarks etc).
> > [This is an advantage over the "Hierarchy" design]
> >
> > 2. Performance overhead is expected to be low: Since we retain the simplicity
> > of the algorithm in the page allocation path, page allocation can
> > potentially remain as fast as it would be without memory regions. The
> > overhead is pushed to the page-freeing paths which are not that critical.
> >
> >
> > Results:
> > =======
> >
> > Test setup:
> > -----------
> > This patchset applies cleanly on top of 3.7-rc3.
> >
> > x86 dual-socket quad core HT-enabled machine booted with mem=8G
> > Memory region size = 512 MB
> >
> > Functional testing:
> > -------------------
> >
> > Ran pagetest, a simple C program that allocates and touches a required number
> > of pages.
> >
> > Below is the statistics from the regions within ZONE_NORMAL, at various sizes
> > of allocations from pagetest.
> >
> > Present pages | Free pages at various allocations |
> > | start | 512 MB | 1024 MB | 2048 MB |
> > Region 0 16 | 0 | 0 | 0 | 0 |
> > Region 1 131072 | 87219 | 8066 | 7892 | 7387 |
> > Region 2 131072 | 131072 | 79036 | 0 | 0 |
> > Region 3 131072 | 131072 | 131072 | 79061 | 0 |
> > Region 4 131072 | 131072 | 131072 | 131072 | 0 |
> > Region 5 131072 | 131072 | 131072 | 131072 | 79051 |
> > Region 6 131072 | 131072 | 131072 | 131072 | 131072 |
> > Region 7 131072 | 131072 | 131072 | 131072 | 131072 |
> > Region 8 131056 | 105475 | 105472 | 105472 | 105472 |
> >
> > This shows that page allocation occurs in the order of increasing region
> > numbers, as intended in this design.
> >
> > Performance impact:
> > -------------------
> >
> > Kernbench results didn't show much of a difference between the performance
> > of vanilla 3.7-rc3 and this patchset.
> >
> >
> > Todos:
> > =====
> >
> > 1. Memory-region aware page-reclamation:
> > ----------------------------------------
> >
> > We would like to do page reclaim in the reverse order of page allocation
> > within a zone, ie., in the order of decreasing region numbers.
> > To achieve that, while scanning lru pages to reclaim, we could potentially
> > look for pages belonging to higher regions (considering region boundaries)
> > or perhaps simply prefer pages of higher pfns (and skip lower pfns) as
> > reclaim candidates.
> >
>
> This would disrupting LRU ordering and if those pages were recently
> allocated and you force a situation where swap has to be used then any
> saving in low memory will be lost by having to access the disk instead.
>
> > 2. Compile-time exclusion of Memory Power Management, and extending the
> > support to also work with other features such as Mem cgroups, kexec etc.
> >
>
> Compile-time exclusion is pointless because it'll be always activated by
> distribution configs. Support for MPST should be detected at runtime and
>
> 3. ACPI support to actually use this thing and validate the design is
> compatible with the spec and actually works in hardware

This is required to actually evaluate the power-saving benefit once we
have candidate implementations in the VM.

At this point we want to look at the overheads of having the region
infrastructure in the VM, and how that trades off against the
requirements we can meet.

The first goal is to have memory allocations fill as few regions as
possible when the system's memory usage is significantly below capacity.
Next we would like the VM to actively move pages around to cooperate with
platform memory power-saving features like notifications or policy changes.

--Vaidy

2012-11-09 09:01:07

by Mel Gorman

Subject: Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, Nov 09, 2012 at 10:44:16AM +0530, Vaidyanathan Srinivasan wrote:
> * Mel Gorman <[email protected]> [2012-11-08 18:02:57]:
>
> > On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
> > > ------------------------------------------------------------
>
> Hi Mel,
>
> Thanks for detailed review and comments. The goal of this patch
> series is to brainstorm on ideas that enable Linux VM to record and
> exploit memory region boundaries.
>

I see.

> The first approach that we had last year (hierarchy) has more runtime
> overhead. This approach of sorted-buddy was one of the alternative
> discussed earlier and we are trying to find out if simple requirements
> of biasing memory allocations can be achieved with this approach.
>
> Smart reclaim based on this approach is a key piece we still need to
> design. Ideas from compaction will certainly help.
>
> > > Today memory subsystems are offer a wide range of capabilities for managing
> > > memory power consumption. As a quick example, if a block of memory is not
> > > referenced for a threshold amount of time, the memory controller can decide to
> > > put that chunk into a low-power content-preserving state. And the next
> > > reference to that memory chunk would bring it back to full power for read/write.
> > > With this capability in place, it becomes important for the OS to understand
> > > the boundaries of such power-manageable chunks of memory and to ensure that
> > > references are consolidated to a minimum number of such memory power management
> > > domains.
> > >
> >
> > How much power is saved?
>
> On embedded platform the savings could be around 5% as discussed in
> the earlier thread: http://article.gmane.org/gmane.linux.kernel.mm/65935
>
> On larger servers with large amounts of memory the savings could be
> more. We do not yet have all the pieces together to evaluate.
>

Ok, it's something to keep an eye on, because if memory power savings
require large amounts of CPU (for smart placement or migration) or more
disk accesses (due to reclaim), then the savings will be offset by
increased power usage elsewhere.

> > > ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
> > > the firmware can expose information regarding the boundaries of such memory
> > > power management domains to the OS in a standard way.
> > >
> >
> > I'm not familiar with the ACPI spec but is there support for parsing of
> > MPST and interpreting the associated ACPI events? For example, if ACPI
> > fires an event indicating that a memory power node is to enter a low
> > state then presumably the OS should actively migrate pages away -- even
> > if it's going into a state where the contents are still refreshed
> > as exiting that state could take a long time.
> >
> > I did not look closely at the patchset at all because it looked like the
> > actual support to use it and measure the benefit is missing.
>
> Correct. The platform interface part is not included in this patch
> set mainly because there is not much design required there. Each
> platform can have code to collect the memory region boundaries from
> BIOS/firmware and load it into the Linux VM. The goal of this patch
> is to brainstorm on the idea of hos core VM should used the region
> information.
>

Ok. It does mean that the patches should not be merged until there is
some platform support that can take advantage of them.

> > > How can Linux VM help memory power savings?
> > >
> > > o Consolidate memory allocations and/or references such that they are
> > > not spread across the entire memory address space. Basically area of memory
> > > that is not being referenced, can reside in low power state.
> > >
> >
> > Which the series does not appear to do.
>
> Correct. We need to design the correct reclaim strategy for this to
> work. However having buddy list sorted by region address could get us
> one step closer to shaping the allocations.
>

If you reclaim, it means that the information is going to disk and will
have to be refaulted in sooner rather than later. If you concentrate on
reclaiming low memory regions and memory is almost full, it will lead to
a situation where you almost always reclaim recently allocated pages and
increase faulting. You will save a few milliwatts on memory and lose way
more than that on increased disk traffic and CPU usage.

> > > o Support targeted memory reclaim, where certain areas of memory that can be
> > > easily freed can be offlined, allowing those areas of memory to be put into
> > > lower power states.
> > >
> >
> > Which the series does not appear to do judging from this;
> >
> > include/linux/mm.h | 38 +++++++
> > include/linux/mmzone.h | 52 +++++++++
> > mm/compaction.c | 8 +
> > mm/page_alloc.c | 263 ++++++++++++++++++++++++++++++++++++++++++++----
> > mm/vmstat.c | 59 ++++++++++-
> >
> > This does not appear to be doing anything with reclaim and not enough with
> > compaction to indicate that the series actively manages memory placement
> > in response to ACPI events.
>
> Correct. Evaluating different ideas for reclaim will be next step
> before getting into the platform interface parts.
>
> > Further in section 5.2.21.4 the spec says that power node regions can
> > overlap (but are not hierarchal for some reason) but have no gaps yet the
> > structure you use to represent is assumes there can be gaps and there are
> > no overlaps. Again, this is just glancing at the spec and a quick skim of
> > the patches so maybe I missed something that explains why this structure
> > is suitable.
>
> This patch is roughly based on the idea that ACPI MPST will give us
> memory region boundaries. It is not designed to implement all options
> defined in the spec.

Ok, but as it is the only potential consumer of this interface that you
mentioned, it should at least be able to handle it. The spec talks about
overlapping memory regions where the regions potentially have different
power states. This is pretty damn remarkable and it's hard to see how it
could be interpreted in a sensible way, but it forces your implementation
to take it into account.

> We have taken a general case of regions do not
> overlap while memory addresses itself can be discontinuous.
>

Why is that the general case? You referred to the ACPI spec, where it is
not the case, and gave no other examples.

> > It seems to me that superficially the VM implementation for the support
> > would have
> >
> > a) Involved a tree that managed the overlapping regions (even if it's
> > not hierarchal it feels more sensible) and picked the highest-power-state
> > common denominator in the tree. This would only be allocated if support
> > for MPST is available.
> > b) Leave memory allocations and reclaim as they are in the active state.
> > c) Use a "sticky" migrate list MIGRATE_LOWPOWER for regions that are in lower
> > power but still usable with a latency penalty. This might be a single
> > migrate type but could also be a parallel set of free_area called
> > free_area_lowpower that is only used when free_area is depleted and in
> > the very slow path of the allocator.
> > d) Use memory hot-remove for power states where the refresh rates were
> > not constant
> >
> > and only did anything expensive in response to an ACPI event -- none of
> > the fast paths should be touched.
> >
> > When transitioning to the low power state, memory should be migrated in
> > a vaguely similar fashion to what CMA does. For low-power, migration
> > failure is acceptable. If contents are not preserved, ACPI needs to know
> > if the migration failed because it cannot enter that power state.
> >
> > For any of this to be worthwhile, low power states would need to be achieved
> > for long periods of time because that migration is not free.
>
> In this patch series we are assuming the simple case of hardware
> managing the actual power states and OS facilitates them by keeping
> the allocations in less number of memory regions. As we keep
> allocations and references low to a regions, it becomes case (c)
> above. We are addressing only a small subset of the above list.
>
> > > Memory Regions:
> > > ---------------
> > >
> > > "Memory Regions" is a way of capturing the boundaries of power-managable
> > > chunks of memory, within the MM subsystem.
> > >
> > > Short description of the "Sorted-buddy" design:
> > > -----------------------------------------------
> > >
> > > In this design, the memory region boundaries are captured in a parallel
> > > data-structure instead of fitting regions between nodes and zones in the
> > > hierarchy. Further, the buddy allocator is altered, such that we maintain the
> > > zones' freelists in region-sorted-order and thus do page allocation in the
> > > order of increasing memory regions.
> >
> > Implying that this sorting has to happen in either the alloc or free
> > fast path.
>
> Yes, in the free path. This optimization can actually be delayed in
> the free fast path, and completely avoided if our memory is full and we
> are doing direct reclaim during allocations.
>

Hurting the free fast path is a bad idea as there are workloads that depend
on it (buffer allocation and free), even though many workloads do *not*
notice it because the bulk of the cost is incurred at exit time. Since
low-power memory usage has many caveats (it may be impossible if a page
table is allocated in the region, for example) while CPU usage has fewer
restrictions, it is more important that the CPU usage be kept low.

That means, little or no modification to the fastpath. Sorting or linear
searches should be minimised or avoided.

> > > <SNIPPED where I pointed out that compaction will bust sorting>
> >
> > Compile-time exclusion is pointless because it'll always be activated by
> > distribution configs. Support for MPST should be detected at runtime and
> >
> > 3. ACPI support to actually use this thing and validate the design is
> > compatible with the spec and actually works in hardware
>
> This is required to actually evaluate power saving benefit once we
> have candidate implementations in the VM.
>
> At this point we want to look at the overheads of having region
> infrastructure in the VM, and how that trades off against the
> requirements that we can meet.
>
> The first goal is to have memory allocations fill as few regions as
> possible when system's memory usage is significantly lower.

While it's a reasonable starting objective, the fast path overhead is very
unfortunate, and such a strategy can be easily defeated by running something
metadata-intensive (like find over the entire system) while a large memory
user starts at the same time, spreading kernel and user space allocations
throughout the address space. That spread will persist even after the two
processes exit, due to the page cache usage from the metadata-intensive
workload.

Basically, it'll only work as long as the system is idle or never uses
much memory during the lifetime of the system.

--
Mel Gorman
SUSE Labs

2012-11-09 09:05:00

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [RFC PATCH 6/8] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists

Hi Ankita,

On 11/09/2012 11:31 AM, Ankita Garg wrote:
> Hi Srivatsa,
>
> I understand that you are maintaining the page blocks in region sorted
> order. So that way, when the memory requests come in, you can hand out
> memory from the regions in that order.

Yes, that's right.

> However, do you take this
> scenario into account - in some bucket of the buddy allocator, there
> might not be any pages belonging to, let's say, region 0, while the next
> higher bucket has them. So, instead of handing out memory from whichever
> region that's present there, would you go to the next bucket, split
> the region-0 pageblock there, and allocate from it? (Here, region 0 is
> just an example.) Been a while since I looked at kernel code, so I might
> be missing something!
>

This patchset doesn't attempt to do that because that can hurt the fast
path performance of page allocation (ie., because we could end up trying
to split pageblocks even when we already have pageblocks of the required
order ready at hand... and not to mention the searching involved in finding
out whether any higher order free lists really contain pageblocks belonging
to this region 0). In this patchset, I have consciously tried to keep the
overhead from memory regions as low as possible, and have moved most of
the overhead to the page free path.

But the scenario that you brought out is very relevant, because that would
help achieve more aggressive power-savings. I will try to implement
something to that end with the least overhead in the next version, and
measure whether its cost vs benefit really works out or not. Thank you very much
for pointing it out!

Regards,
Srivatsa S. Bhat

>
>
> On Tue, Nov 6, 2012 at 1:53 PM, Srivatsa S. Bhat
> <[email protected]> wrote:
>
> The zones' freelists need to be made region-aware, in order to influence
> page allocation and freeing algorithms. So in every free list in the zone,
> we would like to demarcate the pageblocks belonging to different memory
> regions (we can do this using a set of pointers, and thus avoid splitting
> up the freelists).
>
> Also, we would like to keep the pageblocks in the freelists sorted in
> region-order. That is, pageblocks belonging to region-0 would come first,
> followed by pageblocks belonging to region-1 and so on, within a given
> freelist. Of course, a set of pageblocks belonging to the same region need
> not be sorted; it is sufficient if we maintain the pageblocks in
> region-sorted-order, rather than a full address-sorted-order.
>
> For each freelist within the zone, we maintain a set of pointers to
> pageblocks belonging to the various memory regions in that zone.
>
> Eg:
>
>   |<---Region0--->||<---Region1--->||<--------Region2------>|
>    ____    ____     ____    ____     ____    ____    ____
> ->|____|->|____|-->|____|->|____|-->|____|->|____|->|____|-->
>             ^                ^                        ^
>             |                |                        |
>            Reg0             Reg1                     Reg2
>
>
> Page allocation will proceed as usual - pick the first item on the free
> list. But we don't want to keep updating these region pointers every time
> we allocate a pageblock from the freelist. So, instead of pointing to the
> *first* pageblock of that region, we maintain the region pointers such
> that they point to the *last* pageblock in that region, as shown in the
> figure above. That way, as long as there is more than one pageblock of
> that region in that freelist, that region pointer doesn't need to be
> updated.
>
>
> Page allocation algorithm:
> --------------------------
>
> The heart of the page allocation algorithm remains as it is - pick the
> first item on the appropriate freelist and return it.
>
>
> Pageblock order in the zone freelists:
> --------------------------------------
>
> This is the main change - we keep the pageblocks in region-sorted order,
> where pageblocks belonging to region-0 come first, followed by those
> belonging to region-1 and so on. But the pageblocks within a given region
> need *not* be sorted, since we need them to be only region-sorted and not
> fully address-sorted.
>
> This sorting is performed when adding pages back to the freelists, thus
> avoiding any region-related overhead in the critical page allocation
> paths.
>
> Page reclaim [Todo]:
> --------------------
>
> Page allocation happens in the order of increasing region number. We
> would like to do page reclaim in the reverse order, to keep allocated
> pages within a minimal number of regions (approximately).
>
> -------------------- Increasing region number -------------------->
>
> Direction of allocation --->                <--- Direction of reclaim
>
> Signed-off-by: Srivatsa S. Bhat <[email protected]
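[Editor's note] As a rough illustration of the last-pageblock-pointer scheme quoted above, here is a minimal user-space C model. All structure and function names here are my own for illustration only; the actual patchset operates on the kernel's struct free_area, which is not reproduced in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal user-space model of one zone freelist whose entries are tagged
 * with a memory-region id.  All names are illustrative. */

#define NR_REGIONS 3

struct pageblock {
	int region;                 /* memory region this pageblock belongs to */
	struct pageblock *next;
};

struct free_list {
	struct pageblock *head;
	/* Points to the *last* pageblock of each region present in the
	 * list, so the pointer survives most allocations from the head. */
	struct pageblock *region_last[NR_REGIONS];
};

/* Free path: insert after the last pageblock of the nearest lower-or-equal
 * region, keeping the list region-sorted (region-0 first, and so on). */
static void free_list_add(struct free_list *fl, struct pageblock *pb)
{
	struct pageblock **link = &fl->head;
	int r;

	for (r = pb->region; r >= 0; r--) {
		if (fl->region_last[r]) {
			link = &fl->region_last[r]->next;
			break;
		}
	}
	pb->next = *link;
	*link = pb;
	fl->region_last[pb->region] = pb;
}

/* Alloc path: unchanged fast path -- pop the first item.  The region
 * pointer needs fixing only when the removed block was its region's last. */
static struct pageblock *free_list_del_first(struct free_list *fl)
{
	struct pageblock *pb = fl->head;

	if (!pb)
		return NULL;
	fl->head = pb->next;
	if (fl->region_last[pb->region] == pb)
		fl->region_last[pb->region] = NULL;
	return pb;
}
```

Note how the sorting cost lives entirely in free_list_add (the free path), while free_list_del_first stays a plain list pop, matching the trade-off described in the patch.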

2012-11-09 14:52:33

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

On 11/09/2012 02:30 PM, Mel Gorman wrote:
> On Fri, Nov 09, 2012 at 10:44:16AM +0530, Vaidyanathan Srinivasan wrote:
>> * Mel Gorman <[email protected]> [2012-11-08 18:02:57]:
>>
[...]
>>> How much power is saved?
>>
>> On embedded platform the savings could be around 5% as discussed in
>> the earlier thread: http://article.gmane.org/gmane.linux.kernel.mm/65935
>>
>> On larger servers with large amounts of memory the savings could be
>> more. We do not yet have all the pieces together to evaluate.
>>
>
> Ok, it's something to keep an eye on because if memory power savings
> require large amounts of CPU (for smart placement or migration) or more
> disk accesses (due to reclaim) then the savings will be offset by
> increased power usage elsewhere.
>

True.

>>>> ACPI 5.0 has introduced MPST tables (Memory Power State Tables) [5] so that
>>>> the firmware can expose information regarding the boundaries of such memory
>>>> power management domains to the OS in a standard way.
>>>>
>>>
>>> I'm not familiar with the ACPI spec but is there support for parsing of
>>> MPST and interpreting the associated ACPI events? For example, if ACPI
>>> fires an event indicating that a memory power node is to enter a low
>>> state then presumably the OS should actively migrate pages away -- even
>>> if it's going into a state where the contents are still refreshed
>>> as exiting that state could take a long time.
>>>
>>> I did not look closely at the patchset at all because it looked like the
>>> actual support to use it and measure the benefit is missing.
>>
>> Correct. The platform interface part is not included in this patch
>> set mainly because there is not much design required there. Each
>> platform can have code to collect the memory region boundaries from
>> BIOS/firmware and load it into the Linux VM. The goal of this patch
>> set is to brainstorm on the idea of how the core VM should use the
>> region information.
>>
>
> Ok. It does mean that the patches should not be merged until there is
> some platform support that can take advantage of them.
>

That's right, but the development of the VM algorithms and the platform
support for different platforms can go on in parallel. And once we have all
the pieces designed, we can fit them together and merge them.

>>>> How can Linux VM help memory power savings?
>>>>
>>>> o Consolidate memory allocations and/or references such that they are
>>>> not spread across the entire memory address space. Basically area of memory
>>>> that is not being referenced, can reside in low power state.
>>>>
>>>
>>> Which the series does not appear to do.
>>
>> Correct. We need to design the correct reclaim strategy for this to
>> work. However having buddy list sorted by region address could get us
>> one step closer to shaping the allocations.
>>
>
> If you reclaim, it means that the information is going to disk and will
> have to be refaulted in sooner rather than later. If you concentrate on
> reclaiming low memory regions and memory is almost full, it will lead to
> a situation where you almost always reclaim newer pages and increase
> faulting. You will save a few milliwatts on memory and lose way more
> than that on increased disk traffic and CPU usage.
>

Yes, we should ensure that our reclaim strategy won't back-fire like that.
We definitely need to depend on LRU ordering for reclaim for the most part,
but try to opportunistically reclaim from within the required region boundaries
while doing that. We definitely need to think more about this...

But the point of making the free lists region-sorted in this patchset
was to enable shaping the page allocations the way we want (i.e.,
constrained to a smaller number of regions).

>>>> o Support targeted memory reclaim, where certain areas of memory that can be
>>>> easily freed can be offlined, allowing those areas of memory to be put into
>>>> lower power states.
>>>>
>>>
>>> Which the series does not appear to do judging from this;
>>>
>>> include/linux/mm.h | 38 +++++++
>>> include/linux/mmzone.h | 52 +++++++++
>>> mm/compaction.c | 8 +
>>> mm/page_alloc.c | 263 ++++++++++++++++++++++++++++++++++++++++++++----
>>> mm/vmstat.c | 59 ++++++++++-
>>>
>>> This does not appear to be doing anything with reclaim and not enough with
>>> compaction to indicate that the series actively manages memory placement
>>> in response to ACPI events.
>>
>> Correct. Evaluating different ideas for reclaim will be the next step
>> before getting into the platform interface parts.
>>
[...]
>>
>> This patch is roughly based on the idea that ACPI MPST will give us
>> memory region boundaries. It is not designed to implement all options
>> defined in the spec.
>
> Ok, but as it is the only potential consumer of this interface that you
> mentioned then it should at least be able to handle it. The spec talks about
> overlapping memory regions where the regions potentially have different
> power states. This is pretty damn remarkable and hard to see how it could
> be interpreted in a sensible way but it forces your implementation to take
> it into account.
>

Well, sorry for not mentioning this in the cover-letter, but the VM algorithms for
memory power management could benefit other platforms too, like ARM, not just
ACPI-based systems. Last year, Amit had evaluated them on Samsung boards with
a simplistic layout for memory regions, based on the Samsung exynos board's
configuration.

http://article.gmane.org/gmane.linux.kernel.mm/65935

>> We have taken the general case where regions do not overlap, while the
>> memory addresses themselves can be discontiguous.
>>
>
> Why is that the general case? You referred to the ACPI spec where it is
> not the case, and gave no other examples.
>

ARM is another example, where we could describe the memory regions in a simple
manner with respect to the Samsung exynos board.

So the idea behind this patchset was to start by assuming a simplistic layout
for memory regions and focussing on the design of the VM algorithms, and
evaluating how this "sorted-buddy" design would perform in comparison to the
previous "hierarchy" design that was explored last year.

But of course, you are absolutely right in pointing out that, to make all this
consumable, we need to revisit this with a focus on the layout of memory
regions themselves, so that all interested platforms can make use of it
effectively.

[...]

>>>> Short description of the "Sorted-buddy" design:
>>>> -----------------------------------------------
>>>>
>>>> In this design, the memory region boundaries are captured in a parallel
>>>> data-structure instead of fitting regions between nodes and zones in the
>>>> hierarchy. Further, the buddy allocator is altered, such that we maintain the
>>>> zones' freelists in region-sorted-order and thus do page allocation in the
>>>> order of increasing memory regions.
>>>
>>> Implying that this sorting has to happen in either the alloc or free
>>> fast path.
>>
>> Yes, in the free path. This optimization can actually be delayed in
>> the free fast path, and completely avoided if our memory is full and we
>> are doing direct reclaim during allocations.
>>
>
> Hurting the free fast path is a bad idea as there are workloads that depend
> on it (buffer allocation and free), even though many workloads do *not*
> notice it because the bulk of the cost is incurred at exit time. Since
> low-power memory usage has many caveats (it may be impossible if a page
> table is allocated in the region, for example) while CPU usage has fewer
> restrictions, it is more important that the CPU usage be kept low.
>
> That means, little or no modification to the fastpath. Sorting or linear
> searches should be minimised or avoided.
>

Right. For example, in the previous "hierarchy" design[1], there was no
overhead in any of the fast paths, because it split up the zones themselves
so that they fit on memory region boundaries. But that design had other
problems, like zone fragmentation (too many zones), which kind of outweighed
the benefit obtained from zero overhead in the fast-paths. So one of the
suggested alternatives during that review[2] was to explore modifying the
buddy allocator to be aware of memory region boundaries, which this
"sorted-buddy" design implements.

[1]. http://lwn.net/Articles/445045/
http://thread.gmane.org/gmane.linux.kernel.mm/63840
http://thread.gmane.org/gmane.linux.kernel.mm/89202

[2]. http://article.gmane.org/gmane.linux.power-management.general/24862
http://article.gmane.org/gmane.linux.power-management.general/25061
http://article.gmane.org/gmane.linux.kernel.mm/64689

In this patchset, I have tried to minimize the overhead on the fastpaths.
For example, I have used a special 'next_region' data-structure to keep the
alloc path fast. Also, in the free path, we don't need to keep the free
lists fully address sorted; having them region-sorted is sufficient. Of course
we could explore more ways of avoiding overhead in the fast paths, or even a
different design that promises to be much better overall. I'm all ears for
any suggestions :-)
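[Editor's note] The actual layout of the 'next_region' structure isn't shown in this thread, so the following is purely a guessed sketch of one way such a hint could keep the alloc path constant-time: cache the lowest region index that still has free pageblocks, and rescan only on the slow transition where a region empties. All names and the layout are assumptions, not the patchset's code.

```c
#include <assert.h>
#include <string.h>

#define NR_REGIONS 4

/* Hypothetical per-freelist bookkeeping: 'next_region' caches the index
 * of the lowest-numbered region that still has free pageblocks, so the
 * allocation path never scans all regions. */
struct region_freelist {
	int nr_free[NR_REGIONS];   /* free pageblocks per region */
	int next_region;           /* lowest region with nr_free > 0 */
};

static void rf_init(struct region_freelist *rf)
{
	memset(rf, 0, sizeof(*rf));
	rf->next_region = NR_REGIONS;   /* "no free pageblocks anywhere" */
}

/* Free path: O(1) hint update. */
static void rf_free_pageblock(struct region_freelist *rf, int region)
{
	rf->nr_free[region]++;
	if (region < rf->next_region)
		rf->next_region = region;
}

/* Alloc path: take from the lowest populated region; rescan only when
 * that region just became empty. Returns the region, or -1 if none. */
static int rf_alloc_pageblock(struct region_freelist *rf)
{
	int r = rf->next_region;

	if (r >= NR_REGIONS)
		return -1;
	if (--rf->nr_free[r] == 0)
		while (rf->next_region < NR_REGIONS &&
		       rf->nr_free[rf->next_region] == 0)
			rf->next_region++;
	return r;
}
```

The design choice mirrors the thread's argument: the common alloc/free operations stay O(1), and the linear scan is pushed to the rare region-emptied transition.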

>> At this point we want to look at the overheads of having region
>> infrastructure in the VM, and how that trades off against the
>> requirements that we can meet.
>>
>> The first goal is to have memory allocations fill as few regions as
>> possible when system's memory usage is significantly lower.
>
> While it's a reasonable starting objective, the fast path overhead is very
> unfortunate, and such a strategy can be easily defeated by running something
> metadata-intensive (like find over the entire system) while a large memory
> user starts at the same time, spreading kernel and user space allocations
> throughout the address space. That spread will persist even after the two
> processes exit, due to the page cache usage from the metadata-intensive
> workload.
>
> Basically, it'll only work as long as the system is idle or never uses
> much memory during the lifetime of the system.
>

Well, page cache usage could definitely get in the way of memory power
management. Probably having a separate driver shrink the page cache
(depending on how aggressive we want to get with respect to power-management)
is the way to go?

Regards,
Srivatsa S. Bhat

2012-11-09 15:25:18

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

On 11/09/2012 08:21 PM, Srivatsa S. Bhat wrote:
> On 11/09/2012 02:30 PM, Mel Gorman wrote:
>> On Fri, Nov 09, 2012 at 10:44:16AM +0530, Vaidyanathan Srinivasan wrote:
>>> * Mel Gorman <[email protected]> [2012-11-08 18:02:57]:
[...]
>>>>> Short description of the "Sorted-buddy" design:
>>>>> -----------------------------------------------
>>>>>
>>>>> In this design, the memory region boundaries are captured in a parallel
>>>>> data-structure instead of fitting regions between nodes and zones in the
>>>>> hierarchy. Further, the buddy allocator is altered, such that we maintain the
>>>>> zones' freelists in region-sorted-order and thus do page allocation in the
>>>>> order of increasing memory regions.
>>>>
>>>> Implying that this sorting has to happen in either the alloc or free
>>>> fast path.
>>>
>>> Yes, in the free path. This optimization can actually be delayed in
>>> the free fast path, and completely avoided if our memory is full and we
>>> are doing direct reclaim during allocations.
>>>
>>
>> Hurting the free fast path is a bad idea as there are workloads that depend
>> on it (buffer allocation and free), even though many workloads do *not*
>> notice it because the bulk of the cost is incurred at exit time. Since
>> low-power memory usage has many caveats (it may be impossible if a page
>> table is allocated in the region, for example) while CPU usage has fewer
>> restrictions, it is more important that the CPU usage be kept low.
>>
>> That means, little or no modification to the fastpath. Sorting or linear
>> searches should be minimised or avoided.
>>
>
> Right. For example, in the previous "hierarchy" design[1], there was no
> overhead in any of the fast paths, because it split up the zones themselves
> so that they fit on memory region boundaries. But that design had other
> problems, like zone fragmentation (too many zones), which kind of outweighed
> the benefit obtained from zero overhead in the fast-paths. So one of the
> suggested alternatives during that review[2] was to explore modifying the
> buddy allocator to be aware of memory region boundaries, which this
> "sorted-buddy" design implements.
>
> [1]. http://lwn.net/Articles/445045/
> http://thread.gmane.org/gmane.linux.kernel.mm/63840
> http://thread.gmane.org/gmane.linux.kernel.mm/89202
>
> [2]. http://article.gmane.org/gmane.linux.power-management.general/24862
> http://article.gmane.org/gmane.linux.power-management.general/25061
> http://article.gmane.org/gmane.linux.kernel.mm/64689
>
> In this patchset, I have tried to minimize the overhead on the fastpaths.
> For example, I have used a special 'next_region' data-structure to keep the
> alloc path fast. Also, in the free path, we don't need to keep the free
> lists fully address sorted; having them region-sorted is sufficient. Of course
> we could explore more ways of avoiding overhead in the fast paths, or even a
> different design that promises to be much better overall. I'm all ears for
> any suggestions :-)
>

FWIW, kernbench is actually (and surprisingly) showing a slight performance
*improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
my other email to Dave.

https://lkml.org/lkml/2012/11/7/428

I don't think I can dismiss it as an experimental error, because I am seeing
those results consistently.. I'm trying to find out what's behind that.

Regards,
Srivatsa S. Bhat

2012-11-09 15:34:07

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

On 11/8/2012 9:14 PM, Vaidyanathan Srinivasan wrote:
> * Mel Gorman <[email protected]> [2012-11-08 18:02:57]:
>
>> On Wed, Nov 07, 2012 at 01:22:13AM +0530, Srivatsa S. Bhat wrote:
>>> ------------------------------------------------------------
>
> Hi Mel,
>
> Thanks for detailed review and comments. The goal of this patch
> series is to brainstorm on ideas that enable Linux VM to record and
> exploit memory region boundaries.
>
> The first approach that we had last year (hierarchy) has more runtime
> overhead. This approach of sorted-buddy was one of the alternatives
> discussed earlier, and we are trying to find out if simple requirements
> of biasing memory allocations can be achieved with this approach.
>
> Smart reclaim based on this approach is a key piece we still need to
> design. Ideas from compaction will certainly help.

reclaim may be needed for the embedded use case,
but at least we are also looking at memory power savings that come from content-preserving power states.
For that, Linux should *statistically* not be actively using (e.g. reading from or writing to it) a percentage of memory...
and statistical clustering is quite sufficient for that.

(for example, if you don't use a DIMM for a certain amount of time,
the link and other pieces can go to a lower power state,
even on today's server systems.
In a many-DIMM system, if each app is, on a per-app basis,
preferring one DIMM for its allocations, the process scheduler will
help us naturally keep the other DIMMs "dark")

If you have to actually free the memory, it is a much much harder problem,
increasingly so if the region you MUST free is quite large.

if one solution can solve both cases, great, but let's not block both
because one of the cases is hard...
(and please let's not use moving or freeing of pages as a solution for at least the
content-preserving case)

2012-11-09 16:15:14

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
> FWIW, kernbench is actually (and surprisingly) showing a slight performance
> *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
> my other email to Dave.
>
> https://lkml.org/lkml/2012/11/7/428
>
> I don't think I can dismiss it as an experimental error, because I am seeing
> those results consistently.. I'm trying to find out what's behind that.

The only numbers in that link are in the date. :) Let's see the
numbers, please.

If you really have performance improvement to the memory allocator (or
something else) here, then surely it can be pared out of your patches
and merged quickly by itself. Those kinds of optimizations are hard to
come by!

2012-11-09 16:35:36

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

On 11/09/2012 09:43 PM, Dave Hansen wrote:
> On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
>> FWIW, kernbench is actually (and surprisingly) showing a slight performance
>> *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
>> my other email to Dave.
>>
>> https://lkml.org/lkml/2012/11/7/428
>>
>> I don't think I can dismiss it as an experimental error, because I am seeing
>> those results consistently.. I'm trying to find out what's behind that.
>
> The only numbers in that link are in the date. :) Let's see the
> numbers, please.
>

Sure :) The reason I didn't post the numbers very eagerly was that I didn't
want it to look ridiculous if it later turned out to be really an error in the
experiment ;) But since I have seen it happening consistently I think I can
post the numbers here with some non-zero confidence.

> If you really have performance improvement to the memory allocator (or
> something else) here, then surely it can be pared out of your patches
> and merged quickly by itself. Those kinds of optimizations are hard to
> come by!
>

:-)

Anyway, here it goes:

Test setup:
----------
x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my
patchset might not handle NUMA properly). Mem region size = 512 MB.

Kernbench log for Vanilla 3.7-rc3
=================================

Kernel: 3.7.0-rc3-vanilla-default
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 650.742 (2.49774)
User Time 8213.08 (17.6347)
System Time 1273.91 (6.00643)
Percent CPU 1457.4 (3.64692)
Context Switches 2250203 (3846.61)
Sleeps 1.8781e+06 (5310.33)

Kernbench log for this sorted-buddy patchset
============================================

Kernel: 3.7.0-rc3-sorted-buddy-default
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 591.696 (0.660969)
User Time 7511.97 (1.08313)
System Time 1062.99 (1.1109)
Percent CPU 1448.6 (1.94936)
Context Switches 2.1496e+06 (3507.12)
Sleeps 1.84305e+06 (3092.67)

Regards,
Srivatsa S. Bhat

2012-11-09 16:45:14

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

On 11/09/2012 10:04 PM, Srivatsa S. Bhat wrote:
> On 11/09/2012 09:43 PM, Dave Hansen wrote:
>> On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
>>> FWIW, kernbench is actually (and surprisingly) showing a slight performance
>>> *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
>>> my other email to Dave.
>>>
>>> https://lkml.org/lkml/2012/11/7/428
>>>
>>> I don't think I can dismiss it as an experimental error, because I am seeing
>>> those results consistently.. I'm trying to find out what's behind that.
>>
>> The only numbers in that link are in the date. :) Let's see the
>> numbers, please.
>>
>
> Sure :) The reason I didn't post the numbers very eagerly was that I didn't
> want it to look ridiculous if it later turned out to be really an error in the
> experiment ;) But since I have seen it happening consistently I think I can
> post the numbers here with some non-zero confidence.
>
>> If you really have performance improvement to the memory allocator (or
>> something else) here, then surely it can be pared out of your patches
>> and merged quickly by itself. Those kinds of optimizations are hard to
>> come by!
>>
>
> :-)
>
> Anyway, here it goes:
>
> Test setup:
> ----------
> x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my
> patchset might not handle NUMA properly). Mem region size = 512 MB.
>

For CONFIG_NUMA=y on the same machine, the difference between the 2 kernels
was much smaller, but nevertheless, this patchset performed better. I wouldn't
vouch that my patchset handles NUMA correctly, but here are the numbers from
that run anyway (at least to show that I really found the results to be
repeatable):

Kernbench log for Vanilla 3.7-rc3
=================================
Kernel: 3.7.0-rc3-vanilla-numa-default
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 589.058 (0.596171)
User Time 7461.26 (1.69702)
System Time 1072.03 (1.54704)
Percent CPU 1448.2 (1.30384)
Context Switches 2.14322e+06 (4042.97)
Sleeps 1847230 (2614.96)

Kernbench log for Vanilla 3.7-rc3
=================================
Kernel: 3.7.0-rc3-sorted-buddy-numa-default
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 577.182 (0.713772)
User Time 7315.43 (3.87226)
System Time 1043 (1.12855)
Percent CPU 1447.6 (2.19089)
Context Switches 2117022 (3810.15)
Sleeps 1.82966e+06 (4149.82)


Regards,
Srivatsa S. Bhat

> Kernbench log for Vanilla 3.7-rc3
> =================================
>
> Kernel: 3.7.0-rc3-vanilla-default
> Average Optimal load -j 32 Run (std deviation):
> Elapsed Time 650.742 (2.49774)
> User Time 8213.08 (17.6347)
> System Time 1273.91 (6.00643)
> Percent CPU 1457.4 (3.64692)
> Context Switches 2250203 (3846.61)
> Sleeps 1.8781e+06 (5310.33)
>
> Kernbench log for this sorted-buddy patchset
> ============================================
>
> Kernel: 3.7.0-rc3-sorted-buddy-default
> Average Optimal load -j 32 Run (std deviation):
> Elapsed Time 591.696 (0.660969)
> User Time 7511.97 (1.08313)
> System Time 1062.99 (1.1109)
> Percent CPU 1448.6 (1.94936)
> Context Switches 2.1496e+06 (3507.12)
> Sleeps 1.84305e+06 (3092.67)
>

2012-11-09 16:54:03

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

On 11/09/2012 10:13 PM, Srivatsa S. Bhat wrote:
> On 11/09/2012 10:04 PM, Srivatsa S. Bhat wrote:
>> On 11/09/2012 09:43 PM, Dave Hansen wrote:
>>> On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
>>>> FWIW, kernbench is actually (and surprisingly) showing a slight performance
>>>> *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
>>>> my other email to Dave.
>>>>
>>>> https://lkml.org/lkml/2012/11/7/428
>>>>
>>>> I don't think I can dismiss it as an experimental error, because I am seeing
>>>> those results consistently.. I'm trying to find out what's behind that.
>>>
>>> The only numbers in that link are in the date. :) Let's see the
>>> numbers, please.
>>>
>>
>> Sure :) The reason I didn't post the numbers very eagerly was that I didn't
>> want it to look ridiculous if it later turned out to be really an error in the
>> experiment ;) But since I have seen it happening consistently I think I can
>> post the numbers here with some non-zero confidence.
>>
>>> If you really have performance improvement to the memory allocator (or
>>> something else) here, then surely it can be pared out of your patches
>>> and merged quickly by itself. Those kinds of optimizations are hard to
>>> come by!
>>>
>>
>> :-)
>>
>> Anyway, here it goes:
>>
>> Test setup:
>> ----------
>> x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my
>> patchset might not handle NUMA properly). Mem region size = 512 MB.
>>
>
> For CONFIG_NUMA=y on the same machine, the difference between the 2 kernels
> was much smaller, but nevertheless, this patchset performed better. I wouldn't
> vouch that my patchset handles NUMA correctly, but here are the numbers from
> that run anyway (at least to show that I really found the results to be
> repeatable):
>
> Kernbench log for Vanilla 3.7-rc3
> =================================
> Kernel: 3.7.0-rc3-vanilla-numa-default
> Average Optimal load -j 32 Run (std deviation):
> Elapsed Time 589.058 (0.596171)
> User Time 7461.26 (1.69702)
> System Time 1072.03 (1.54704)
> Percent CPU 1448.2 (1.30384)
> Context Switches 2.14322e+06 (4042.97)
> Sleeps 1847230 (2614.96)
>
> Kernbench log for Vanilla 3.7-rc3
> =================================

Oops, that title must have been "for sorted-buddy patchset" of course..

> Kernel: 3.7.0-rc3-sorted-buddy-numa-default
> Average Optimal load -j 32 Run (std deviation):
> Elapsed Time 577.182 (0.713772)
> User Time 7315.43 (3.87226)
> System Time 1043 (1.12855)
> Percent CPU 1447.6 (2.19089)
> Context Switches 2117022 (3810.15)
> Sleeps 1.82966e+06 (4149.82)
>
>

Regards,
Srivatsa S. Bhat

2012-11-12 16:15:54

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

Hi Srinivas,

It looks like your email did not get delivered to the mailing
lists (and the people in the CC list) properly. So quoting your
entire mail as-it-is here. And thanks a lot for taking a look
at this patchset!

Regards,
Srivatsa S. Bhat

On 11/09/2012 10:18 PM, Srinivas Pandruvada wrote:
> I do like this implementation and think it is valuable.
> I am experimenting with it on one of our HW platforms. This type of partition
> does help in saving power. We believe we can save up to 1W of power per DIMM
> with the help of some HW/BIOS changes. We are only talking about
> content-preserving memory, so we don't have to be 100% correct.
> In my experiments, I tried two methods:
> - Similar to the approach suggested by Mel Gorman: I have a special sticky
> migrate type like CMA.
> - Buddy buckets: buddies are organized into memory-region-aware buckets.
> During allocation it prefers higher-order buckets. I made sure that my change
> has no effect if there are no power-saving memory DIMMs. The advantage
> of these buckets is that I can keep memory in close proximity for related
> task groups by direct hashing to a bucket. The free list is organized as a
> two-dimensional array indexed by bucket and migrate type for each order.
>
> In both methods, reclaim is currently targeted to be done via a sysfs interface
> similar to memory compaction for a node, allowing user space to initiate reclaim.
>
> Thanks,
> Srinivas Pandruvada
> Open Source Technology Center,
> Intel Corp.
>

2012-11-16 18:34:01

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [RFC PATCH 0/8][Sorted-buddy] mm: Linux VM Infrastructure to support Memory Power Management

On 11/09/2012 10:22 PM, Srivatsa S. Bhat wrote:
> On 11/09/2012 10:13 PM, Srivatsa S. Bhat wrote:
>> On 11/09/2012 10:04 PM, Srivatsa S. Bhat wrote:
>>> On 11/09/2012 09:43 PM, Dave Hansen wrote:
>>>> On 11/09/2012 07:23 AM, Srivatsa S. Bhat wrote:
>>>>> FWIW, kernbench is actually (and surprisingly) showing a slight performance
>>>>> *improvement* with this patchset, over vanilla 3.7-rc3, as I mentioned in
>>>>> my other email to Dave.
>>>>>
>>>>> https://lkml.org/lkml/2012/11/7/428
>>>>>
>>>>> I don't think I can dismiss it as an experimental error, because I am seeing
>>>>> those results consistently.. I'm trying to find out what's behind that.
>>>>
>>>> The only numbers in that link are in the date. :) Let's see the
>>>> numbers, please.
>>>>
>>>
>>> Sure :) The reason I didn't post the numbers very eagerly was that I didn't
>>> want it to look ridiculous if it later turned out to be really an error in the
>>> experiment ;) But since I have seen it happening consistently I think I can
>>> post the numbers here with some non-zero confidence.
>>>
>>>> If you really have performance improvement to the memory allocator (or
>>>> something else) here, then surely it can be pared out of your patches
>>>> and merged quickly by itself. Those kinds of optimizations are hard to
>>>> come by!
>>>>
>>>
>>> :-)
>>>
>>> Anyway, here it goes:
>>>
>>> Test setup:
>>> ----------
>>> x86 2-socket quad-core machine. (CONFIG_NUMA=n because I figured that my
>>> patchset might not handle NUMA properly). Mem region size = 512 MB.
>>>
>>
>> For CONFIG_NUMA=y on the same machine, the difference between the 2 kernels
>> was much smaller, but nevertheless, this patchset performed better. I wouldn't
>> vouch that my patchset handles NUMA correctly, but here are the numbers from
>> that run anyway (at least to show that I really found the results to be
>> repeatable):
>>

I fixed up the NUMA case (I'll post the updated patch for that soon) and
ran a fresh set of kernbench runs. The difference between mainline and this
patchset is quite tiny, so we can't really say that this patchset shows a
performance improvement over mainline. However, I can safely conclude that
this patchset doesn't show any performance _degradation_ w.r.t. mainline
in kernbench.

Results from one of the recent kernbench runs:
---------------------------------------------

Kernbench log for Vanilla 3.7-rc3
=================================
Kernel: 3.7.0-rc3
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 330.39 (0.746257)
User Time 4283.63 (3.39617)
System Time 604.783 (2.72629)
Percent CPU 1479 (3.60555)
Context Switches 845634 (6031.22)
Sleeps 833655 (6652.17)


Kernbench log for Sorted-buddy
==============================
Kernel: 3.7.0-rc3-sorted-buddy
Average Optimal load -j 32 Run (std deviation):
Elapsed Time 329.967 (2.76789)
User Time 4230.02 (2.15324)
System Time 599.793 (1.09988)
Percent CPU 1463.33 (11.3725)
Context Switches 840530 (1646.75)
Sleeps 833732 (2227.68)

Regards,
Srivatsa S. Bhat

2012-11-16 18:40:29

by Srivatsa S. Bhat

[permalink] [raw]
Subject: [RFC PATCH UPDATED 4/8] mm: Add helpers to retrieve node region and zone region for a given page

This version of the patch includes a bug-fix for page_node_region_id()
which used to break the NUMA case.

--------------------------------------------------------------------->

From: Srivatsa S. Bhat <[email protected]>
Subject: mm: Add helpers to retrieve node region and zone region for a given page

Given a page, we would like to have an efficient mechanism to find out
the node memory region and the zone memory region to which it belongs.

Since the node is assumed to be divided into equal-sized node memory
regions, the node memory region index can be obtained by simply right-shifting
the offset of the page's pfn within the node by MEM_REGION_SHIFT.

But finding the corresponding zone memory region's index within the zone is
not as straightforward. To get an O(1) algorithm for it, define a
zone_region_idx[] array that stores, for every node memory region, the
corresponding zone memory region indices.

To illustrate, consider the following example:

|<---------------------Node---------------------->|
_________________________________________________
| Node mem reg 0 | Node mem reg 1 |
|_______________________|_________________________|

_________________________________________________
| ZONE_DMA | ZONE_NORMAL |
|_______________|_________________________________|


In the above figure,

Node mem region 0:
------------------
This region corresponds to the first zone mem region in ZONE_DMA and also
the first zone mem region in ZONE_NORMAL. Hence its index array would look
like this:
node_regions[0].zone_region_idx[ZONE_DMA] == 0
node_regions[0].zone_region_idx[ZONE_NORMAL] == 0


Node mem region 1:
------------------
This region corresponds to the second zone mem region in ZONE_NORMAL. Hence
its index array would look like this:
node_regions[1].zone_region_idx[ZONE_NORMAL] == 1


Using this index array, we can quickly obtain the zone memory region to
which a given page belongs.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/mm.h | 24 ++++++++++++++++++++++++
include/linux/mmzone.h | 7 +++++++
mm/page_alloc.c | 2 ++
3 files changed, 33 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 19c4fb0..32457c7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -702,6 +702,30 @@ static inline struct zone *page_zone(const struct page *page)
return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
}

+static inline int page_node_region_id(const struct page *page,
+ const pg_data_t *pgdat)
+{
+ return (page_to_pfn(page) - pgdat->node_start_pfn) >> MEM_REGION_SHIFT;
+}
+
+/**
+ * page_zone_region_id - return the index, within its zone, of the
+ * memory region to which the given page belongs.
+ *
+ * Finds the page's absolute (node) region and its zone, then maps that
+ * node region to the zone-relative region index via zone_region_idx[].
+ */
+static inline int page_zone_region_id(const struct page *page)
+{
+ pg_data_t *pgdat = NODE_DATA(page_to_nid(page));
+ enum zone_type z_num = page_zonenum(page);
+ unsigned long node_region_idx;
+
+ node_region_idx = page_node_region_id(page, pgdat);
+
+ return pgdat->node_regions[node_region_idx].zone_region_idx[z_num];
+}
+
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
static inline void set_page_section(struct page *page, unsigned long section)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9f923aa..3982354 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -336,6 +336,13 @@ struct node_mem_region {
unsigned long spanned_pages;
int idx;
int node;
+
+ /*
+ * A physical (node) region could be split across multiple zones.
+ * Store the indices of the corresponding regions of each such
+ * zone for this physical (node) region.
+ */
+ int zone_region_idx[MAX_NR_ZONES];
struct pglist_data *pgdat;
};

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c00f72d..7fd89cd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4621,6 +4621,8 @@ void init_zone_memory_regions(struct pglist_data *pgdat)
end_pfn);
z->zone_mem_region[idx].present_pages =
end_pfn - start_pfn - absent;
+
+ region->zone_region_idx[zone_idx(z)] = idx;
idx++;
}