2013-04-09 21:48:47

by Srivatsa S. Bhat

Subject: [RFC PATCH v2 00/15][Sorted-buddy] mm: Memory Power Management


[I know, this cover letter is a little too long, but I wanted to clearly
explain the overall goals and the high-level design of this patchset in
detail. I hope this helps more than it annoys, and makes it easier for
reviewers to relate to the background and the goals of this patchset.]


Overview of Memory Power Management and its implications to the Linux MM
========================================================================

Today, we are increasingly seeing computer systems sporting larger and larger
amounts of RAM in order to meet workload demands. However, memory consumes a
significant amount of power, potentially more than a third of total system
power on server systems. So naturally, memory becomes the next big target for
power management - on embedded systems and smartphones, all the way up to
large server systems.

Power-management capabilities in modern memory hardware:
-------------------------------------------------------

Modern memory hardware such as DDR3 supports a number of power-management
capabilities - for instance, the memory controller can automatically put
memory DIMMs/banks into content-preserving low-power states if it detects
that the *entire* memory DIMM/bank has not been referenced for a threshold
amount of time, thus reducing the energy consumption of the memory hardware.
We term these power-manageable chunks of memory "memory regions".

Exporting memory region info of the platform to the OS:
------------------------------------------------------

The OS needs to know about the granularity at which the hardware can perform
automatic power-management of the memory banks (i.e., the address boundaries
of the memory regions). On ARM platforms, the bootloader can be modified to
pass on this info to the kernel via the device-tree. On x86 platforms, the
new ACPI 5.0 spec has added support for exporting the power-management
capabilities of the memory hardware to the OS in a standard way[5].

Estimate of power-savings from power-aware Linux MM:
---------------------------------------------------

Once the firmware/bootloader exports the required info to the OS, it is up to
the kernel's MM subsystem to make the best use of these capabilities and manage
memory power-efficiently. It has been demonstrated on a Samsung Exynos board
(with 2 GB RAM) that up to 6 percent of total system power can be saved by
making the Linux kernel MM subsystem power-aware[4]. (More savings can be
expected on systems with larger amounts of memory, and perhaps improved further
using better MM designs.)


Role of the Linux MM in enhancing memory power savings:
------------------------------------------------------

Often, this simply translates to having the Linux MM understand the granularity
at which RAM modules can be power-managed, and keeping the memory allocations
and references consolidated to a minimum number of these power-manageable
"memory regions". It is of particular interest to note that most such memory
hardware has the intelligence to automatically save power, by putting memory
banks into (content-preserving) low-power states when not referenced for a
threshold amount of time. All that the kernel has to do is avoid wrecking
the power-savings logic by scattering its allocations and references all over
the system memory. (The kernel/MM doesn't have to perform the actual power-state
transitions; that is mostly done automatically in the hardware, and this is OK
because these are *content-preserving* low-power states.)

So we can summarize the goals for the Linux MM as:

o Consolidate memory allocations and/or references such that they are not
spread across the entire memory address space, so that the unreferenced
areas of memory can reside in low-power states.

o Support light-weight targeted memory compaction/reclaim, to evacuate
lightly-filled memory regions. This helps avoid memory references to
those regions, thereby allowing them to reside in low-power states.


Assumptions and goals of this patchset:
--------------------------------------

In this patchset, we don't handle the part of getting the region boundary info
from the firmware/bootloader and populating it in the kernel data-structures.
The aim of this patchset is to propose and brainstorm on a power-aware design
of the Linux MM which can *use* the region boundary info to influence the MM
at various places such as page allocation, reclamation/compaction etc., thereby
contributing to memory power savings. (This patchset is very much an RFC at
the moment and is not intended for mainline inclusion yet.)

So, in this patchset, we assume a simple model in which each 512 MB chunk of
memory can be independently power-managed, and hard-code this into the patchset.
As mentioned, the focus of this patchset is not so much on how we get this info
from the firmware or how exactly we handle a variety of configurations, but
rather on discussing the power-savings/performance impact of the MM algorithms
that *act* upon this info in order to save memory power.
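
For concreteness, the sketch below (not part of the patchset) shows the
pfn-to-region mapping that this hard-coded model implies, as a standalone
userspace program assuming 4 KB pages; the kernel-side constants
MEM_REGION_SHIFT/MEM_REGION_SIZE appear in patch 2. With 512 MB regions and
4 KB pages, each full region spans 131072 pages, which matches the "Present
pages" column in the results further below.

/* Standalone illustration only -- assumes 4 KB pages (PAGE_SHIFT = 12). */
#include <stdio.h>

#define PAGE_SHIFT        12
#define MEM_REGION_SHIFT  (29 - PAGE_SHIFT)   /* 512 MB worth of pages */
#define MEM_REGION_SIZE   (1UL << MEM_REGION_SHIFT)

/* Region index of a pfn, relative to the start of its node */
static unsigned long pfn_to_region(unsigned long pfn, unsigned long node_start_pfn)
{
        return (pfn - node_start_pfn) >> MEM_REGION_SHIFT;
}

int main(void)
{
        printf("pages per region: %lu\n", MEM_REGION_SIZE);              /* 131072 */
        printf("pfn 131071 -> region %lu\n", pfn_to_region(131071, 0));  /* 0 */
        printf("pfn 131072 -> region %lu\n", pfn_to_region(131072, 0));  /* 1 */
        return 0;
}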

That said, it's not very far-fetched to try this out with actual region
boundary info to get real power-savings numbers. For example, on ARM
platforms, we can make the bootloader export this info to the OS via device-tree
and then run this patchset. (This was the method used to get the power numbers
in [4].) But even without doing that, we can evaluate the effectiveness of
this patchset in contributing to power savings by analyzing the per-memory-region
free page statistics, and we can observe the performance impact by running
benchmarks - this is the approach currently used to evaluate this patchset.


Brief overview of the design/approach used in this patchset:
-----------------------------------------------------------

This patchset implements the 'Sorted-buddy design' for Memory Power Management,
in which the buddy (page) allocator is altered to keep the buddy freelists
region-sorted, which helps influence the page allocation paths to keep the
allocations consolidated to a minimum number of memory regions. This patchset
also includes a light-weight targeted compaction/reclaim algorithm that works
hand-in-hand with the page allocator to evacuate lightly-filled memory regions
when memory gets fragmented, in order to further enhance memory power savings.

This Sorted-buddy design was developed based on some of the suggestions
received[1] during the review of the earlier patchset on Memory Power
Management written by Ankita Garg ('Hierarchy design')[2].
One of the key aspects of this Sorted-buddy design is that it avoids the
zone-fragmentation problem that was present in the earlier design[3].



Design of sorted buddy allocator and light-weight targeted region compaction:
=============================================================================

Sorted buddy allocator:
----------------------

In this design, the memory region boundaries are captured in a data structure
parallel to zones, instead of fitting regions between nodes and zones in the
hierarchy. Further, the buddy allocator is altered, such that we maintain the
zones' freelists in region-sorted-order and thus do page allocation in the
order of increasing memory regions. (The freelists need not be fully
address-sorted; they just need to be region-sorted.)

The idea is to do page allocation in increasing order of memory regions
(within a zone) and perform region-compaction in the reverse order, as
illustrated below.

---------------------------- Increasing region number---------------------->

Direction of allocation---> <---Direction of region-compaction


The sorting logic (to maintain freelist pageblocks in region-sorted-order)
lies in the page-free path and hence the critical page-allocation paths remain
fast. Also, the sorting logic is optimized to be O(log n).
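
To make the insertion rule concrete, here is a simplified userspace model
(not the patch code) of how a freed pageblock is placed so that the list
stays region-sorted. It mirrors the logic of add_to_freelist() from patch 6:
each region remembers the *last* block it owns in the list, and a freed block
is inserted just before that tail, or - if it is the first block of its
region - right after the tail of the nearest lower populated region. The
linear backward scan below is what patch 9 later replaces with the O(log n)
bitmap lookup.

#include <stdio.h>

#define NR_REGIONS      8

struct block {
        int region;
        struct block *prev, *next;
};

static struct block head = { -1, &head, &head };        /* circular list head */
static struct block *region_tail[NR_REGIONS];           /* last block of each region */

static void insert_after(struct block *pos, struct block *blk)
{
        blk->prev = pos;
        blk->next = pos->next;
        pos->next->prev = blk;
        pos->next = blk;
}

static void insert_before(struct block *pos, struct block *blk)
{
        insert_after(pos->prev, blk);
}

static void free_block(struct block *blk)
{
        int i;

        if (region_tail[blk->region]) {
                /* Region already present: keep its tail pointer stable */
                insert_before(region_tail[blk->region], blk);
                return;
        }

        /* First block of this region: place it after the previous region */
        for (i = blk->region - 1; i >= 0; i--) {
                if (region_tail[i]) {
                        insert_after(region_tail[i], blk);
                        goto done;
                }
        }
        insert_after(&head, blk);       /* lowest populated region so far */
done:
        region_tail[blk->region] = blk;
}

int main(void)
{
        struct block blocks[] = { {5}, {2}, {7}, {2}, {0} };
        struct block *b;
        unsigned int i;

        for (i = 0; i < sizeof(blocks) / sizeof(blocks[0]); i++)
                free_block(&blocks[i]);

        for (b = head.next; b != &head; b = b->next)
                printf("%d ", b->region);       /* prints: 0 2 2 5 7 */
        printf("\n");
        return 0;
}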

Advantages of this design:
--------------------------
1. No zone-fragmentation (IOW, we don't create more zones than necessary) and
hence we avoid its associated problems (like too many zones, extra kswapd
activity, question of choosing watermarks etc).
[This is an advantage over the 'Hierarchy' design]

2. Performance overhead is expected to be low: Since we retain the simplicity
of the algorithm in the page allocation path, page allocation can
potentially remain as fast as it would be without memory regions. The
overhead is pushed to the page-freeing paths which are not that critical.


Light-weight targeted region compaction:
----------------------------------------

Over time, due to multiple alloc()s and free()s in random order, memory gets
fragmented, which means the memory allocations will no longer be consolidated
to a minimum number of memory regions. In such cases, we need a light-weight
mechanism to opportunistically compact memory to evacuate lightly-filled
memory regions, thereby enhancing the power-savings.

Noting that CMA (Contiguous Memory Allocator) does targeted compaction to
achieve its goals, this patchset generalizes the targeted compaction code
and reuses it to evacuate memory regions. The region evacuation is triggered
by the page allocator: when it notices the first page allocation in a new
region, it sets up a worker function to perform compaction and evacuate that
region in the future, if possible. There are handshakes between the alloc
and the free paths in the page allocator to help do this smartly, which are
explained in detail in the patches.
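
As a rough illustration of the trigger (this is *not* the code from patches
12-14; everything here except the standard workqueue API and struct zone is
made up purely to convey the idea), the allocator defers the evacuation to a
workqueue item scheduled when a region sees its first allocation:

/*
 * Hypothetical sketch only - not the actual patch code.
 */
#include <linux/mmzone.h>
#include <linux/printk.h>
#include <linux/workqueue.h>

struct region_evac_work {
        struct work_struct work;
        struct zone *zone;
        int region_id;
};

static void region_evacuate_worker(struct work_struct *work)
{
        struct region_evac_work *rw =
                        container_of(work, struct region_evac_work, work);

        pr_debug("trying to evacuate region %d of zone %s\n",
                 rw->region_id, rw->zone->name);

        /*
         * The real patchset would run targeted compaction here (reusing the
         * generalized CMA migration code) to move this region's pages into
         * lower-numbered regions, if that is still worthwhile.
         */
}

/* Hypothetical hook: called when 'region_id' sees its first page allocation */
static void note_first_alloc_in_region(struct zone *zone, int region_id,
                                       struct region_evac_work *rw)
{
        rw->zone = zone;
        rw->region_id = region_id;
        INIT_WORK(&rw->work, region_evacuate_worker);
        schedule_work(&rw->work);
}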


This patchset has been hosted in the below git tree. It applies cleanly on
v3.9-rc5.

git://github.com/srivatsabhat/linux.git mem-power-mgmt-v2


Changes in this v2:
==================

* Fixed a bug in the NUMA case.
* Added a new optimized O(log n) sorting algorithm to speed up region-sorting
of the buddy freelists (patch 9). The efficiency of this new algorithm and
its design allows us to support large amounts of RAM quite easily.
* Added light-weight targeted compaction/reclaim support for memory power
management (patches 10-14).
* Revamped the cover-letter to better explain the idea behind memory power
management and this patchset.


Experimental Results:
====================

Test setup:
----------

x86 dual-socket quad core HT-enabled machine booted with mem=8G
Memory region size = 512 MB

Functional testing:
------------------

Ran pagetest, a simple C program that allocates and touches a required number
of pages.

Below are the free-page statistics for the regions within ZONE_NORMAL, at
various allocation sizes from pagetest.


            Present pages |  Free pages at various allocation sizes  |
                          |   start |  512 MB | 1024 MB | 2048 MB |
Region 0                1 |       0 |       0 |       0 |       0 |
Region 1           131072 |   41537 |   13858 |   13790 |   13334 |
Region 2           131072 |  131072 |   26839 |      82 |     122 |
Region 3           131072 |  131072 |  131072 |   26624 |       0 |
Region 4           131072 |  131072 |  131072 |  131072 |       0 |
Region 5           131072 |  131072 |  131072 |  131072 |   26624 |
Region 6           131072 |  131072 |  131072 |  131072 |  131072 |
Region 7           131072 |  131072 |  131072 |  131072 |  131072 |
Region 8           131071 |   72704 |   72704 |   72704 |   72704 |

This shows that page allocation occurs in the order of increasing region
numbers, as intended in this design.

Performance impact:
-------------------

Kernbench results didn't show any noticeable performance degradation with
this patchset as compared to vanilla 3.9-rc5.


Todos and ideas for enhancing the design further:
================================================

1. Add support for making this work with sparsemem, memcg etc.

2. Mel Gorman pointed out that the regular compaction algorithm would work
against the sorted-buddy allocation strategy, since it creates free space
at lower pfns. For now, I have not handled this because regular compaction
triggers only when the memory pressure is very high, and hence memory
power management is pointless in those situations. Besides, it is
immaterial whether memory allocations are consolidated towards lower or
higher pfns, because it saves power either way, and hence the regular
compaction algorithm doesn't actually work against memory power management.

3. Add more optimizations to the targeted region compaction algorithm in order
to enhance its benefits and reduce the overhead, such as:
a. Migrate only active pages during region evacuation because, strictly
speaking, we only want to avoid _references_ to the region. So inactive
pages can be kept around, thus reducing the page-migration overhead.
b. Reduce the search-space for region evacuation, by having the
page-allocator note down the highest allocated pfn within that region.

4. Have stronger influence over how freepages from different migratetypes
are exchanged, so that unmovable and non-reclaimable allocations are
contained within the fewest possible memory regions.

5. Influence the refill of per-cpu pagesets and perhaps even heavily used
slab caches, such that they all get their memory from the fewest possible
memory regions. This is to avoid frequent fragmentation of memory regions.

6. Don't perform region evacuation in situations of high memory utilization.
Also, never use freepages from MIGRATE_RESERVE for the purpose of
region-evacuation.

7. Add more tracing/debug info to enable better evaluation of the
effectiveness and benefits of this patchset over vanilla kernel.

8. Add a higher level policy to control the aggressiveness of memory power
management.


References:
----------

[1]. Review comments suggesting modifying the buddy allocator to be aware of
memory regions:
http://article.gmane.org/gmane.linux.power-management.general/24862
http://article.gmane.org/gmane.linux.power-management.general/25061
http://article.gmane.org/gmane.linux.kernel.mm/64689

[2]. Patch series that implemented the node-region-zone hierarchy design:
http://lwn.net/Articles/445045/
http://thread.gmane.org/gmane.linux.kernel.mm/63840

Summary of the discussion on that patchset:
http://article.gmane.org/gmane.linux.power-management.general/25061

Forward-port of that patchset to 3.7-rc3 (minimal x86 config)
http://thread.gmane.org/gmane.linux.kernel.mm/89202

[3]. Disadvantages of having memory regions in the hierarchy between nodes and
zones:
http://article.gmane.org/gmane.linux.kernel.mm/63849

[4]. Estimate of potential power savings on Samsung Exynos board
http://article.gmane.org/gmane.linux.kernel.mm/65935

[5]. ACPI 5.0 and MPST support
http://www.acpi.info/spec.htm
Section 5.2.21 Memory Power State Table (MPST)

[6]. v1 of Sorted-buddy memory power management patchset:
http://thread.gmane.org/gmane.linux.power-management.general/28498


Srivatsa S. Bhat (15):
mm: Introduce memory regions data-structure to capture region boundaries within nodes
mm: Initialize node memory regions during boot
mm: Introduce and initialize zone memory regions
mm: Add helpers to retrieve node region and zone region for a given page
mm: Add data-structures to describe memory regions within the zones' freelists
mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
mm: Add an optimized version of del_from_freelist to keep page allocation fast
bitops: Document the difference in indexing between fls() and __fls()
mm: A new optimized O(log n) sorting algo to speed up buddy-sorting
mm: Add support to accurately track per-memory-region allocation
mm: Restructure the compaction part of CMA for wider use
mm: Add infrastructure to evacuate memory regions using compaction
mm: Implement the worker function for memory region compaction
mm: Add alloc-free handshake to trigger memory region compaction
mm: Print memory region statistics to understand the buddy allocator behavior


arch/x86/include/asm/bitops.h | 4
include/asm-generic/bitops/__fls.h | 5
include/linux/compaction.h | 7
include/linux/gfp.h | 2
include/linux/migrate.h | 3
include/linux/mm.h | 62 ++++
include/linux/mmzone.h | 78 ++++-
include/trace/events/migrate.h | 3
mm/compaction.c | 149 +++++++++
mm/internal.h | 40 ++
mm/page_alloc.c | 617 ++++++++++++++++++++++++++++++++----
mm/vmstat.c | 36 ++
12 files changed, 935 insertions(+), 71 deletions(-)


Regards,
Srivatsa S. Bhat
IBM Linux Technology Center


2013-04-09 21:49:04

by Srivatsa S. Bhat

Subject: [RFC PATCH v2 02/15] mm: Initialize node memory regions during boot

Initialize the node's memory-region structures with the information about
the region boundaries, at boot time.

Based-on-patch-by: Ankita Garg <[email protected]>
Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/mm.h | 4 ++++
mm/page_alloc.c | 28 ++++++++++++++++++++++++++++
2 files changed, 32 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e19ff30..b7b368a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -621,6 +621,10 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define LAST_NID_MASK ((1UL << LAST_NID_WIDTH) - 1)
#define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1)

+/* Hard-code memory region size to be 512 MB for now. */
+#define MEM_REGION_SHIFT (29 - PAGE_SHIFT)
+#define MEM_REGION_SIZE (1UL << MEM_REGION_SHIFT)
+
static inline enum zone_type page_zonenum(const struct page *page)
{
return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fcced7..9760e89 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4681,6 +4681,33 @@ static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat)
#endif /* CONFIG_FLAT_NODE_MEM_MAP */
}

+static void __meminit init_node_memory_regions(struct pglist_data *pgdat)
+{
+ int nid = pgdat->node_id;
+ unsigned long start_pfn = pgdat->node_start_pfn;
+ unsigned long end_pfn = start_pfn + pgdat->node_spanned_pages;
+ struct node_mem_region *region;
+ unsigned long i, absent;
+ int idx;
+
+ for (i = start_pfn, idx = 0; i < end_pfn;
+ i += region->spanned_pages, idx++) {
+
+ region = &pgdat->node_regions[idx];
+ region->pgdat = pgdat;
+ region->start_pfn = i;
+ region->spanned_pages = min(MEM_REGION_SIZE, end_pfn - i);
+ region->end_pfn = region->start_pfn + region->spanned_pages;
+
+ absent = __absent_pages_in_range(nid, region->start_pfn,
+ region->end_pfn);
+
+ region->present_pages = region->spanned_pages - absent;
+ }
+
+ pgdat->nr_node_regions = idx;
+}
+
void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
unsigned long node_start_pfn, unsigned long *zholes_size)
{
@@ -4702,6 +4729,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
#endif

free_area_init_core(pgdat, zones_size, zholes_size);
+ init_node_memory_regions(pgdat);
}

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP

2013-04-09 21:49:15

by Srivatsa S. Bhat

Subject: [RFC PATCH v2 03/15] mm: Introduce and initialize zone memory regions

Memory region boundaries don't necessarily fit on zone boundaries. So we need
to maintain a zone-level mapping of the absolute memory region boundaries.

"Node Memory Regions" will be used to capture the absolute region boundaries.
Add "Zone Memory Regions" to track the subsets of the absolute memory regions
that fall within the zone boundaries.

Eg:

|<----------------------Node---------------------->|
__________________________________________________
|     Node mem reg 0     |     Node mem reg 1      | (Absolute region
|________________________|_________________________| boundaries)

__________________________________________________
|   ZONE_DMA    |           ZONE_NORMAL            |
|               |                                  |
|<--- ZMR 0 --->|<-ZMR0->|<-------- ZMR 1 -------->|
|_______________|________|_________________________|


In the above figure,

ZONE_DMA will have only 1 zone memory region (say, Zone mem reg 0) which is a
subset of Node mem reg 0 (i.e., the portion of Node mem reg 0 that intersects
with ZONE_DMA).

ZONE_NORMAL will have 2 zone memory regions (say, Zone mem reg 0 and
Zone mem reg 1) which are subsets of Node mem reg 0 and Node mem reg 1
respectively, that intersect with ZONE_NORMAL's range.

Most of the MM algorithms (like page allocation etc.) work within a zone;
hence such a zone-level mapping of the absolute region boundaries will be
very useful in influencing the MM decisions at those places.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/mmzone.h | 11 +++++++++
mm/page_alloc.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e6df08f..46a6b63 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -36,6 +36,7 @@
#define PAGE_ALLOC_COSTLY_ORDER 3

#define MAX_NR_NODE_REGIONS 256
+#define MAX_NR_ZONE_REGIONS MAX_NR_NODE_REGIONS

enum {
MIGRATE_UNMOVABLE,
@@ -312,6 +313,13 @@ enum zone_type {

#ifndef __GENERATING_BOUNDS_H

+struct zone_mem_region {
+ unsigned long start_pfn;
+ unsigned long end_pfn;
+ unsigned long present_pages;
+ unsigned long spanned_pages;
+};
+
struct zone {
/* Fields commonly accessed by the page allocator */

@@ -369,6 +377,9 @@ struct zone {
#endif
struct free_area free_area[MAX_ORDER];

+ struct zone_mem_region zone_regions[MAX_NR_ZONE_REGIONS];
+ int nr_zone_regions;
+
#ifndef CONFIG_SPARSEMEM
/*
* Flags for a pageblock_nr_pages block. See pageblock-flags.h.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9760e89..d4abba6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4708,6 +4708,66 @@ static void __meminit init_node_memory_regions(struct pglist_data *pgdat)
pgdat->nr_node_regions = idx;
}

+static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)
+{
+ unsigned long start_pfn, end_pfn, absent;
+ unsigned long z_start_pfn, z_end_pfn;
+ int i, j, idx, nid = pgdat->node_id;
+ struct node_mem_region *node_region;
+ struct zone_mem_region *zone_region;
+ struct zone *z;
+
+ for (i = 0, j = 0; i < pgdat->nr_zones; i++) {
+ z = &pgdat->node_zones[i];
+ z_start_pfn = z->zone_start_pfn;
+ z_end_pfn = z->zone_start_pfn + z->spanned_pages;
+ idx = 0;
+
+ for ( ; j < pgdat->nr_node_regions; j++) {
+ node_region = &pgdat->node_regions[j];
+
+ /*
+ * Skip node memory regions that don't intersect with
+ * this zone.
+ */
+ if (node_region->end_pfn <= z_start_pfn)
+ continue; /* Move to next higher node region */
+
+ if (node_region->start_pfn >= z_end_pfn)
+ break; /* Move to next higher zone */
+
+ start_pfn = max(z_start_pfn, node_region->start_pfn);
+ end_pfn = min(z_end_pfn, node_region->end_pfn);
+
+ zone_region = &z->zone_regions[idx];
+ zone_region->start_pfn = start_pfn;
+ zone_region->end_pfn = end_pfn;
+ zone_region->spanned_pages = end_pfn - start_pfn;
+
+ absent = __absent_pages_in_range(nid, start_pfn,
+ end_pfn);
+ zone_region->present_pages =
+ zone_region->spanned_pages - absent;
+
+ idx++;
+ }
+
+ z->nr_zone_regions = idx;
+
+ /*
+ * Revisit the last visited node memory region, in case it
+ * spans multiple zones.
+ */
+ j--;
+ }
+}
+
+static void __meminit init_memory_regions(struct pglist_data *pgdat)
+{
+ init_node_memory_regions(pgdat);
+ init_zone_memory_regions(pgdat);
+}
+
void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
unsigned long node_start_pfn, unsigned long *zholes_size)
{
@@ -4729,7 +4789,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
#endif

free_area_init_core(pgdat, zones_size, zholes_size);
- init_node_memory_regions(pgdat);
+ init_memory_regions(pgdat);
}

#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP

2013-04-09 21:49:33

by Srivatsa S. Bhat

Subject: [RFC PATCH v2 01/15] mm: Introduce memory regions data-structure to capture region boundaries within nodes

The memory within a node can be divided into regions of memory that can be
independently power-managed. That is, chunks of memory can be transitioned
(manually or automatically) to low-power states based on the frequency of
references to that region. For example, if a memory chunk is not referenced
for a given threshold amount of time, the hardware (memory controller) can
decide to put that piece of memory into a content-preserving low-power state.
And of course, on the next reference to that chunk of memory, it will be
transitioned back to full-power for read/write operations.

So, the Linux MM can take advantage of this feature by managing the available
memory with an eye towards power-savings - i.e., by keeping the memory
allocations/references consolidated to a minimum number of such power-manageable
memory regions. In order to do so, the first step is to teach the MM about
the boundaries of these regions - and to capture that info, we introduce a new
data-structure called "Memory Regions".

[Also, the concept of memory regions could potentially be extended to work
with different classes of memory like PCM (Phase Change Memory) etc., and
hence it is not limited to power management alone.]

We already sub-divide a node's memory into zones, based on some well-known
constraints. So the question is: where do memory regions fit in this
hierarchy? Instead of artificially trying to fit them into the hierarchy one
way or the other, we choose to simply capture the region boundaries in a
parallel data-structure, since most likely the region boundaries won't
naturally fit inside the zone boundaries or vice-versa.

But of course, memory regions are sub-divisions *within* a node, so it makes
sense to keep the data-structures in the node's struct pglist_data. (Thus
this placement makes memory regions parallel to zones in that node).

Once we capture the region boundaries in the memory regions data-structure,
we can influence MM decisions at various places, such as page allocation,
reclamation etc, in order to perform power-aware memory management.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/mmzone.h | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c74092e..e6df08f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -35,6 +35,8 @@
*/
#define PAGE_ALLOC_COSTLY_ORDER 3

+#define MAX_NR_NODE_REGIONS 256
+
enum {
MIGRATE_UNMOVABLE,
MIGRATE_RECLAIMABLE,
@@ -685,6 +687,14 @@ struct node_active_region {
extern struct page *mem_map;
#endif

+struct node_mem_region {
+ unsigned long start_pfn;
+ unsigned long end_pfn;
+ unsigned long present_pages;
+ unsigned long spanned_pages;
+ struct pglist_data *pgdat;
+};
+
/*
* The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
* (mostly NUMA machines?) to denote a higher-level memory zone than the
@@ -701,6 +711,8 @@ typedef struct pglist_data {
struct zone node_zones[MAX_NR_ZONES];
struct zonelist node_zonelists[MAX_ZONELISTS];
int nr_zones;
+ struct node_mem_region node_regions[MAX_NR_NODE_REGIONS];
+ int nr_node_regions;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
#ifdef CONFIG_MEMCG

2013-04-09 21:49:48

by Srivatsa S. Bhat

Subject: [RFC PATCH v2 05/15] mm: Add data-structures to describe memory regions within the zones' freelists

In order to influence page allocation decisions (i.e., to make page-allocation
region-aware), we need to be able to distinguish pageblocks belonging to
different zone memory regions within the zones' (buddy) freelists.

So, within every freelist in a zone, provide pointers to describe the
boundaries of zone memory regions and counters to track the number of free
pageblocks within each region.

Also, fix up the references to the freelist's list_head inside struct free_area.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/mmzone.h | 17 ++++++++++++++++-
mm/compaction.c | 2 +-
mm/page_alloc.c | 23 ++++++++++++-----------
mm/vmstat.c | 2 +-
4 files changed, 30 insertions(+), 14 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f772e05..76667bf 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -83,8 +83,23 @@ static inline int get_pageblock_migratetype(struct page *page)
return get_pageblock_flags_group(page, PB_migrate, PB_migrate_end);
}

+struct mem_region_list {
+ struct list_head *page_block;
+ unsigned long nr_free;
+};
+
+struct free_list {
+ struct list_head list;
+
+ /*
+ * Demarcates pageblocks belonging to different regions within
+ * this freelist.
+ */
+ struct mem_region_list mr_list[MAX_NR_ZONE_REGIONS];
+};
+
struct free_area {
- struct list_head free_list[MIGRATE_TYPES];
+ struct free_list free_list[MIGRATE_TYPES];
unsigned long nr_free;
};

diff --git a/mm/compaction.c b/mm/compaction.c
index 05ccb4c..13912f5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -858,7 +858,7 @@ static int compact_finished(struct zone *zone,
struct free_area *area = &zone->free_area[order];

/* Job done if page is free of the right migratetype */
- if (!list_empty(&area->free_list[cc->migratetype]))
+ if (!list_empty(&area->free_list[cc->migratetype].list))
return COMPACT_PARTIAL;

/* Job done if allocation would set block type */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index af87471..963de6c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -593,12 +593,13 @@ static inline void __free_one_page(struct page *page,
higher_buddy = higher_page + (buddy_idx - combined_idx);
if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
list_add_tail(&page->lru,
- &zone->free_area[order].free_list[migratetype]);
+ &zone->free_area[order].free_list[migratetype].list);
goto out;
}
}

- list_add(&page->lru, &zone->free_area[order].free_list[migratetype]);
+ list_add(&page->lru,
+ &zone->free_area[order].free_list[migratetype].list);
out:
zone->free_area[order].nr_free++;
}
@@ -831,7 +832,7 @@ static inline void expand(struct zone *zone, struct page *page,
continue;
}
#endif
- list_add(&page[size].lru, &area->free_list[migratetype]);
+ list_add(&page[size].lru, &area->free_list[migratetype].list);
area->nr_free++;
set_page_order(&page[size], high);
}
@@ -893,10 +894,10 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
/* Find a page of the appropriate size in the preferred list */
for (current_order = order; current_order < MAX_ORDER; ++current_order) {
area = &(zone->free_area[current_order]);
- if (list_empty(&area->free_list[migratetype]))
+ if (list_empty(&area->free_list[migratetype].list))
continue;

- page = list_entry(area->free_list[migratetype].next,
+ page = list_entry(area->free_list[migratetype].list.next,
struct page, lru);
list_del(&page->lru);
rmv_page_order(page);
@@ -968,7 +969,7 @@ int move_freepages(struct zone *zone,

order = page_order(page);
list_move(&page->lru,
- &zone->free_area[order].free_list[migratetype]);
+ &zone->free_area[order].free_list[migratetype].list);
set_freepage_migratetype(page, migratetype);
page += 1 << order;
pages_moved += 1 << order;
@@ -1029,10 +1030,10 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
break;

area = &(zone->free_area[current_order]);
- if (list_empty(&area->free_list[migratetype]))
+ if (list_empty(&area->free_list[migratetype].list))
continue;

- page = list_entry(area->free_list[migratetype].next,
+ page = list_entry(area->free_list[migratetype].list.next,
struct page, lru);
area->nr_free--;

@@ -1296,7 +1297,7 @@ void mark_free_pages(struct zone *zone)
}

for_each_migratetype_order(order, t) {
- list_for_each(curr, &zone->free_area[order].free_list[t]) {
+ list_for_each(curr, &zone->free_area[order].free_list[t].list) {
unsigned long i;

pfn = page_to_pfn(list_entry(curr, struct page, lru));
@@ -3092,7 +3093,7 @@ void show_free_areas(unsigned int filter)

types[order] = 0;
for (type = 0; type < MIGRATE_TYPES; type++) {
- if (!list_empty(&area->free_list[type]))
+ if (!list_empty(&area->free_list[type].list))
types[order] |= 1 << type;
}
}
@@ -3944,7 +3945,7 @@ static void __meminit zone_init_free_lists(struct zone *zone)
{
int order, t;
for_each_migratetype_order(order, t) {
- INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
+ INIT_LIST_HEAD(&zone->free_area[order].free_list[t].list);
zone->free_area[order].nr_free = 0;
}
}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index e1d8ed1..63e12f0 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -860,7 +860,7 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,

area = &(zone->free_area[order]);

- list_for_each(curr, &area->free_list[mtype])
+ list_for_each(curr, &area->free_list[mtype].list)
freecount++;
seq_printf(m, "%6lu ", freecount);
}

2013-04-09 21:49:55

by Srivatsa S. Bhat

Subject: [RFC PATCH v2 06/15] mm: Demarcate and maintain pageblocks in region-order in the zones' freelists

The zones' freelists need to be made region-aware, in order to influence
page allocation and freeing algorithms. So in every free list in the zone, we
would like to demarcate the pageblocks belonging to different memory regions
(we can do this using a set of pointers, and thus avoid splitting up the
freelists).

Also, we would like to keep the pageblocks in the freelists sorted in
region-order. That is, pageblocks belonging to region-0 would come first,
followed by pageblocks belonging to region-1 and so on, within a given
freelist. Of course, a set of pageblocks belonging to the same region need
not be sorted; it is sufficient if we maintain the pageblocks in
region-sorted-order, rather than a full address-sorted-order.

For each freelist within the zone, we maintain a set of pointers to
pageblocks belonging to the various memory regions in that zone.

Eg:

      |<---Region0--->|   |<---Region1--->|   |<-------Region2--------->|
       ____      ____      ____      ____      ____      ____      ____
  --> |____|--> |____|--> |____|--> |____|--> |____|--> |____|--> |____|-->

                  ^                   ^                             ^
                  |                   |                             |
                Reg0                Reg1                          Reg2


Page allocation will proceed as usual - pick the first item on the free list.
But we don't want to keep updating these region pointers every time we allocate
a pageblock from the freelist. So, instead of pointing to the *first* pageblock
of that region, we maintain the region pointers such that they point to the
*last* pageblock in that region, as shown in the figure above. That way, as
long as there is more than one pageblock from that region in that freelist,
the region pointer doesn't need to be updated.


Page allocation algorithm:
-------------------------

The heart of the page allocation algorithm remains as it is - pick the first
item on the appropriate freelist and return it.


Arrangement of pageblocks in the zone freelists:
-----------------------------------------------

This is the main change - we keep the pageblocks in region-sorted order,
where pageblocks belonging to region-0 come first, followed by those belonging
to region-1 and so on. But the pageblocks within a given region need *not* be
sorted, since we need them to be only region-sorted and not fully
address-sorted.

This sorting is performed when adding pages back to the freelists, thus
avoiding any region-related overhead in the critical page allocation
paths.

Strategy to consolidate allocations to a minimum no. of regions:
---------------------------------------------------------------

Page allocation happens in the order of increasing region number. We would
like to do light-weight page reclaim or compaction (for the purpose of memory
power management) in the reverse order, to keep the allocated pages within
a minimum number of regions (approximately). The latter part is implemented
in subsequent patches.

---------------------------- Increasing region number---------------------->

Direction of allocation---> <---Direction of reclaim/compaction

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

mm/page_alloc.c | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 136 insertions(+), 15 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 963de6c..7fb4254 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -505,6 +505,111 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
return 0;
}

+static void add_to_freelist(struct page *page, struct free_list *free_list)
+{
+ struct list_head *prev_region_list, *lru;
+ struct mem_region_list *region;
+ int region_id, i;
+
+ lru = &page->lru;
+ region_id = page_zone_region_id(page);
+
+ region = &free_list->mr_list[region_id];
+ region->nr_free++;
+
+ if (region->page_block) {
+ list_add_tail(lru, region->page_block);
+ return;
+ }
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+ WARN(region->nr_free != 1, "%s: nr_free is not unity\n", __func__);
+#endif
+
+ if (!list_empty(&free_list->list)) {
+ for (i = region_id - 1; i >= 0; i--) {
+ if (free_list->mr_list[i].page_block) {
+ prev_region_list =
+ free_list->mr_list[i].page_block;
+ goto out;
+ }
+ }
+ }
+
+ /* This is the first region, so add to the head of the list */
+ prev_region_list = &free_list->list;
+
+out:
+ list_add(lru, prev_region_list);
+
+ /* Save pointer to page block of this region */
+ region->page_block = lru;
+}
+
+static void del_from_freelist(struct page *page, struct free_list *free_list)
+{
+ struct list_head *prev_page_lru, *lru, *p;
+ struct mem_region_list *region;
+ int region_id;
+
+ lru = &page->lru;
+ region_id = page_zone_region_id(page);
+ region = &free_list->mr_list[region_id];
+ region->nr_free--;
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+ WARN(region->nr_free < 0, "%s: nr_free is negative\n", __func__);
+
+ /* Verify whether this page indeed belongs to this free list! */
+
+ list_for_each(p, &free_list->list) {
+ if (p == lru)
+ goto page_found;
+ }
+
+ WARN(1, "%s: page doesn't belong to the given freelist!\n", __func__);
+
+page_found:
+#endif
+
+ /*
+ * If we are not deleting the last pageblock in this region (i.e.,
+ * farthest from list head, but not necessarily the last numerically),
+ * then we need not update the region->page_block pointer.
+ */
+ if (lru != region->page_block) {
+ list_del(lru);
+#ifdef CONFIG_DEBUG_PAGEALLOC
+ WARN(region->nr_free == 0, "%s: nr_free messed up\n", __func__);
+#endif
+ return;
+ }
+
+ prev_page_lru = lru->prev;
+ list_del(lru);
+
+ if (region->nr_free == 0) {
+ region->page_block = NULL;
+ } else {
+ region->page_block = prev_page_lru;
+#ifdef CONFIG_DEBUG_PAGEALLOC
+ WARN(prev_page_lru == &free_list->list,
+ "%s: region->page_block points to list head\n",
+ __func__);
+#endif
+ }
+}
+
+/**
+ * Move a given page from one freelist to another.
+ */
+static void move_page_freelist(struct page *page, struct free_list *old_list,
+ struct free_list *new_list)
+{
+ del_from_freelist(page, old_list);
+ add_to_freelist(page, new_list);
+}
+
/*
* Freeing function for a buddy system allocator.
*
@@ -537,6 +642,7 @@ static inline void __free_one_page(struct page *page,
unsigned long combined_idx;
unsigned long uninitialized_var(buddy_idx);
struct page *buddy;
+ struct free_area *area;

VM_BUG_ON(!zone_is_initialized(zone));

@@ -566,8 +672,9 @@ static inline void __free_one_page(struct page *page,
__mod_zone_freepage_state(zone, 1 << order,
migratetype);
} else {
- list_del(&buddy->lru);
- zone->free_area[order].nr_free--;
+ area = &zone->free_area[order];
+ del_from_freelist(buddy, &area->free_list[migratetype]);
+ area->nr_free--;
rmv_page_order(buddy);
}
combined_idx = buddy_idx & page_idx;
@@ -576,6 +683,7 @@ static inline void __free_one_page(struct page *page,
order++;
}
set_page_order(page, order);
+ area = &zone->free_area[order];

/*
* If this is not the largest possible page, check if the buddy
@@ -592,16 +700,22 @@ static inline void __free_one_page(struct page *page,
buddy_idx = __find_buddy_index(combined_idx, order + 1);
higher_buddy = higher_page + (buddy_idx - combined_idx);
if (page_is_buddy(higher_page, higher_buddy, order + 1)) {
- list_add_tail(&page->lru,
- &zone->free_area[order].free_list[migratetype].list);
+
+ /*
+ * Implementing an add_to_freelist_tail() won't be
+ * very useful because both of them (almost) add to
+ * the tail within the region. So we could potentially
+ * switch off this entire "is next-higher buddy free?"
+ * logic when memory regions are used.
+ */
+ add_to_freelist(page, &area->free_list[migratetype]);
goto out;
}
}

- list_add(&page->lru,
- &zone->free_area[order].free_list[migratetype].list);
+ add_to_freelist(page, &area->free_list[migratetype]);
out:
- zone->free_area[order].nr_free++;
+ area->nr_free++;
}

static inline int free_pages_check(struct page *page)
@@ -832,7 +946,7 @@ static inline void expand(struct zone *zone, struct page *page,
continue;
}
#endif
- list_add(&page[size].lru, &area->free_list[migratetype].list);
+ add_to_freelist(&page[size], &area->free_list[migratetype]);
area->nr_free++;
set_page_order(&page[size], high);
}
@@ -899,7 +1013,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,

page = list_entry(area->free_list[migratetype].list.next,
struct page, lru);
- list_del(&page->lru);
+ del_from_freelist(page, &area->free_list[migratetype]);
rmv_page_order(page);
area->nr_free--;
expand(zone, page, order, current_order, area, migratetype);
@@ -940,7 +1054,8 @@ int move_freepages(struct zone *zone,
{
struct page *page;
unsigned long order;
- int pages_moved = 0;
+ struct free_area *area;
+ int pages_moved = 0, old_mt;

#ifndef CONFIG_HOLES_IN_ZONE
/*
@@ -968,8 +1083,10 @@ int move_freepages(struct zone *zone,
}

order = page_order(page);
- list_move(&page->lru,
- &zone->free_area[order].free_list[migratetype].list);
+ old_mt = get_freepage_migratetype(page);
+ area = &zone->free_area[order];
+ move_page_freelist(page, &area->free_list[old_mt],
+ &area->free_list[migratetype]);
set_freepage_migratetype(page, migratetype);
page += 1 << order;
pages_moved += 1 << order;
@@ -1067,7 +1184,7 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
}

/* Remove the page from the freelists */
- list_del(&page->lru);
+ del_from_freelist(page, &area->free_list[migratetype]);
rmv_page_order(page);

/* Take ownership for orders >= pageblock_order */
@@ -1420,7 +1537,8 @@ static int __isolate_free_page(struct page *page, unsigned int order)
}

/* Remove page from free list */
- list_del(&page->lru);
+ mt = get_freepage_migratetype(page);
+ del_from_freelist(page, &zone->free_area[order].free_list[mt]);
zone->free_area[order].nr_free--;
rmv_page_order(page);

@@ -6131,6 +6249,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
int order, i;
unsigned long pfn;
unsigned long flags;
+ int mt;
+
/* find the first valid pfn */
for (pfn = start_pfn; pfn < end_pfn; pfn++)
if (pfn_valid(pfn))
@@ -6163,7 +6283,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
printk(KERN_INFO "remove from free list %lx %d %lx\n",
pfn, 1 << order, end_pfn);
#endif
- list_del(&page->lru);
+ mt = get_freepage_migratetype(page);
+ del_from_freelist(page, &zone->free_area[order].free_list[mt]);
rmv_page_order(page);
zone->free_area[order].nr_free--;
for (i = 0; i < (1 << order); i++)

2013-04-09 21:50:13

by Srivatsa S. Bhat

Subject: [RFC PATCH v2 04/15] mm: Add helpers to retrieve node region and zone region for a given page

Given a page, we would like to have an efficient mechanism to find out
the node memory region and the zone memory region to which it belongs.

Since the node is assumed to be divided into equal-sized node memory
regions, the node memory region can be obtained by simply right-shifting
the page's pfn (relative to the node's start pfn) by 'MEM_REGION_SHIFT'.

But finding the corresponding zone memory region's index in the zone is
not that straightforward. To have an O(1) algorithm to find it, define a
zone_region_idx[] array to store the zone memory region indices for every
node memory region.

To illustrate, consider the following example:

|<----------------------Node---------------------->|
__________________________________________________
|     Node mem reg 0     |     Node mem reg 1      | (Absolute region
|________________________|_________________________| boundaries)

__________________________________________________
|   ZONE_DMA    |           ZONE_NORMAL            |
|               |                                  |
|<--- ZMR 0 --->|<-ZMR0->|<-------- ZMR 1 -------->|
|_______________|________|_________________________|


In the above figure,

Node mem region 0:
------------------
This region corresponds to the first zone mem region in ZONE_DMA and also
the first zone mem region in ZONE_NORMAL. Hence its index array would look
like this:
node_regions[0].zone_region_idx[ZONE_DMA] == 0
node_regions[0].zone_region_idx[ZONE_NORMAL] == 0


Node mem region 1:
------------------
This region corresponds to the second zone mem region in ZONE_NORMAL. Hence
its index array would look like this:
node_regions[1].zone_region_idx[ZONE_NORMAL] == 1


Using this index array, we can quickly obtain the zone memory region to
which a given page belongs.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/mm.h | 24 ++++++++++++++++++++++++
include/linux/mmzone.h | 7 +++++++
mm/page_alloc.c | 1 +
3 files changed, 32 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b7b368a..dff478b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -717,6 +717,30 @@ static inline struct zone *page_zone(const struct page *page)
return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
}

+static inline int page_node_region_id(const struct page *page,
+ const pg_data_t *pgdat)
+{
+ return (page_to_pfn(page) - pgdat->node_start_pfn) >> MEM_REGION_SHIFT;
+}
+
+/**
+ * Return the index of the zone memory region to which the page belongs.
+ *
+ * Given a page, find the absolute (node) memory region as well as the zone to
+ * which it belongs. Then find the region within the zone that corresponds to
+ * that node memory region, and return its index.
+ */
+static inline int page_zone_region_id(const struct page *page)
+{
+ pg_data_t *pgdat = NODE_DATA(page_to_nid(page));
+ enum zone_type z_num = page_zonenum(page);
+ unsigned long node_region_idx;
+
+ node_region_idx = page_node_region_id(page, pgdat);
+
+ return pgdat->node_regions[node_region_idx].zone_region_idx[z_num];
+}
+
#ifdef SECTION_IN_PAGE_FLAGS
static inline void set_page_section(struct page *page, unsigned long section)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 46a6b63..f772e05 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -703,6 +703,13 @@ struct node_mem_region {
unsigned long end_pfn;
unsigned long present_pages;
unsigned long spanned_pages;
+
+ /*
+ * A physical (node) region could be split across multiple zones.
+ * Store the indices of the corresponding regions of each such
+ * zone for this physical (node) region.
+ */
+ int zone_region_idx[MAX_NR_ZONES];
struct pglist_data *pgdat;
};

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d4abba6..af87471 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4749,6 +4749,7 @@ static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)
zone_region->present_pages =
zone_region->spanned_pages - absent;

+ node_region->zone_region_idx[zone_idx(z)] = idx;
idx++;
}

2013-04-09 21:50:24

by Srivatsa S. Bhat

Subject: [RFC PATCH v2 08/15] bitops: Document the difference in indexing between fls() and __fls()

fls() indexes the bits starting from 1 (i.e., 1 to BITS_PER_LONG), whereas
__fls() uses a zero-based indexing scheme (0 to BITS_PER_LONG - 1).
Add comments to document this important difference.
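
As a quick illustration of the two conventions (assuming a 64-bit unsigned
long), the userspace model below reproduces the indexing with GCC builtins
rather than the kernel helpers themselves:

/* Userspace model of the indexing conventions; 64-bit unsigned long assumed. */
#include <stdio.h>

static int model_fls(unsigned long x)             /* 1-based; 0 means "no bit set" */
{
        return x ? 64 - __builtin_clzl(x) : 0;
}

static unsigned long model___fls(unsigned long x) /* 0-based; undefined for x == 0 */
{
        return 63 - __builtin_clzl(x);
}

int main(void)
{
        printf("fls(1) = %d, __fls(1) = %lu\n", model_fls(1), model___fls(1)); /* 1, 0 */
        printf("fls(8) = %d, __fls(8) = %lu\n", model_fls(8), model___fls(8)); /* 4, 3 */
        return 0;
}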

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

arch/x86/include/asm/bitops.h | 4 ++++
include/asm-generic/bitops/__fls.h | 5 +++++
2 files changed, 9 insertions(+)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 6dfd019..25e6fdc 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -380,6 +380,10 @@ static inline unsigned long ffz(unsigned long word)
* @word: The word to search
*
* Undefined if no set bit exists, so code should check against 0 first.
+ *
+ * Note: __fls(x) is equivalent to fls(x) - 1. That is, __fls() uses
+ * a zero-based indexing scheme (0 to BITS_PER_LONG - 1), where
+ * __fls(1) = 0, __fls(2) = 1, and so on.
*/
static inline unsigned long __fls(unsigned long word)
{
diff --git a/include/asm-generic/bitops/__fls.h b/include/asm-generic/bitops/__fls.h
index a60a7cc..ae908a5 100644
--- a/include/asm-generic/bitops/__fls.h
+++ b/include/asm-generic/bitops/__fls.h
@@ -8,6 +8,11 @@
* @word: the word to search
*
* Undefined if no set bit exists, so code should check against 0 first.
+ *
+ * Note: __fls(x) is equivalent to fls(x) - 1. That is, __fls() uses
+ * a zero-based indexing scheme (0 to BITS_PER_LONG - 1), where
+ * __fls(1) = 0, __fls(2) = 1, and so on.
+ *
*/
static __always_inline unsigned long __fls(unsigned long word)
{

2013-04-09 21:50:38

by Srivatsa S. Bhat

Subject: [RFC PATCH v2 09/15] mm: A new optimized O(log n) sorting algo to speed up buddy-sorting

The sorted-buddy design for memory power management depends on
keeping the buddy freelists region-sorted. And this sorting operation
has been pushed to the free() logic, keeping the alloc() path fast.

However, we would like to also keep the free() path as fast as possible,
since it holds the zone->lock, which will indirectly affect alloc() also.

So replace the existing O(n) sorting logic used in the free-path, with
a new special-case sorting algorithm of time complexity O(log n), in order
to optimize the free() path further. This algorithm uses a bitmap-based
radix tree to help speed up the sorting.

One of the other main advantages of this O(log n) design is that it can
support large amounts of RAM (up to 2 TB and beyond) quite effortlessly.
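
As a rough illustration of the idea (not the kernel code itself, and assuming
64-bit longs), the userspace model below shows the two-level lookup: a root
word records which leaf words are non-empty, so finding the previous populated
region takes at most two find-last-set operations instead of a linear scan:

#include <stdio.h>

#define BITS_PER_LONG   64
#define NR_REGIONS      256
#define NR_LEAF_WORDS   (NR_REGIONS / BITS_PER_LONG)

static unsigned long leaf[NR_LEAF_WORDS];       /* one bit per region */
static unsigned long root;                      /* one bit per leaf word */

static int fls_ul(unsigned long x)              /* 0-based index of highest set bit */
{
        return 63 - __builtin_clzl(x);
}

static void set_region(int id)
{
        leaf[id / BITS_PER_LONG] |= 1UL << (id % BITS_PER_LONG);
        root |= 1UL << (id / BITS_PER_LONG);
}

static int prev_region(int id)
{
        int word = id / BITS_PER_LONG;
        unsigned long mask;

        /* Bits below 'id' inside the same leaf word */
        mask = leaf[word] & ((1UL << (id % BITS_PER_LONG)) - 1);
        if (mask)
                return word * BITS_PER_LONG + fls_ul(mask);

        /* Otherwise consult the root word for a lower non-empty leaf word */
        mask = root & ((1UL << word) - 1);
        if (!mask)
                return -1;                      /* no lower populated region */
        word = fls_ul(mask);
        return word * BITS_PER_LONG + fls_ul(leaf[word]);
}

int main(void)
{
        set_region(3);
        set_region(70);
        set_region(200);

        printf("prev of 200 -> %d\n", prev_region(200));        /* 70 */
        printf("prev of 70  -> %d\n", prev_region(70));         /* 3  */
        printf("prev of 3   -> %d\n", prev_region(3));          /* -1 */
        return 0;
}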

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/mmzone.h | 2 +
mm/page_alloc.c | 144 ++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 139 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d8d67fc..0258c68 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -102,6 +102,8 @@ struct free_list {
* this freelist.
*/
struct mem_region_list mr_list[MAX_NR_ZONE_REGIONS];
+ DECLARE_BITMAP(region_root_mask, BITS_PER_LONG);
+ DECLARE_BITMAP(region_leaf_mask, MAX_NR_ZONE_REGIONS);
};

struct free_area {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a68174c..52d8a59 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -505,11 +505,131 @@ static inline int page_is_buddy(struct page *page, struct page *buddy,
return 0;
}

+/**
+ *
+ * An example should help illustrate the bitmap representation of memory
+ * regions easily. So consider the following scenario:
+ *
+ * MAX_NR_ZONE_REGIONS = 256
+ * DECLARE_BITMAP(region_leaf_mask, MAX_NR_ZONE_REGIONS);
+ * DECLARE_BITMAP(region_root_mask, BITS_PER_LONG);
+ *
+ * Here region_leaf_mask is an array of unsigned longs. And region_root_mask
+ * is a single unsigned long. The tree notion is constructed like this:
+ * Each bit in the region_root_mask will correspond to an array element of
+ * region_leaf_mask, as shown below. (The elements of the region_leaf_mask
+ * array are shown as being discontiguous, only to help illustrate the
+ * concept easily).
+ *
+ * Region Root Mask
+ * ___________________
+ * |____|____|____|____|
+ * / | \ \
+ * / | \ \
+ * ________ | ________ \
+ * |________| | |________| \
+ * | \
+ * ________ ________
+ * |________| |________| <--- Region Leaf Mask
+ * array elements
+ *
+ * If an array element in the leaf mask is non-zero, the corresponding bit
+ * for that array element will be set in the root mask. Every bit in the
+ * region_leaf_mask will correspond to a memory region; it is set if that
+ * region is present in that free list, cleared otherwise.
+ *
+ * This arrangement helps us find the previous set bit in region_leaf_mask
+ * using at most 2 bitmask-searches (each bitmask of size BITS_PER_LONG),
+ * one at the root-level, and one at the leaf level. Thus, this design of
+ * an optimized access structure reduces the search-complexity when dealing
+ * with large amounts of memory. The worst-case time-complexity of buddy
+ * sorting comes to O(log n) using this algorithm, where 'n' is the no. of
+ * memory regions in the zone.
+ *
+ * For example, with MEM_REGION_SIZE = 512 MB, on 64-bit machines, we can
+ * deal with upto 2TB of RAM (MAX_NR_ZONE_REGIONS = 4096) efficiently (just
+ * 12 ops in the worst case, as opposed to 4096 in an O(n) algo) with such
+ * an arrangement, without even needing to extend this 2-level hierarchy
+ * any further.
+ */
+
+static void set_region_bit(int region_id, struct free_list *free_list)
+{
+ set_bit(region_id, free_list->region_leaf_mask);
+ set_bit(BIT_WORD(region_id), free_list->region_root_mask);
+}
+
+static void clear_region_bit(int region_id, struct free_list *free_list)
+{
+ clear_bit(region_id, free_list->region_leaf_mask);
+
+ if (!(free_list->region_leaf_mask[BIT_WORD(region_id)]))
+ clear_bit(BIT_WORD(region_id), free_list->region_root_mask);
+
+}
+
+/* Note that Region 0 corresponds to bit position 1 (0x1) and so on */
+static int find_prev_region(int region_id, struct free_list *free_list)
+{
+ int leaf_word, prev_region_id;
+ unsigned long *region_root_mask, *region_leaf_mask;
+ unsigned long tmp_root_mask, tmp_leaf_mask;
+
+ if (!region_id)
+ return -1; /* No previous region */
+
+ leaf_word = BIT_WORD(region_id);
+
+ region_root_mask = free_list->region_root_mask;
+ region_leaf_mask = free_list->region_leaf_mask;
+
+
+ /*
+ * Try to get the prev region id without going to the root mask.
+ * Note that region_id itself might not be set yet.
+ */
+ if (region_leaf_mask[leaf_word]) {
+ tmp_leaf_mask = region_leaf_mask[leaf_word] &
+ (BIT_MASK(region_id) - 1);
+
+ if (tmp_leaf_mask) {
+ /* Prev region is in this leaf mask itself. Find it. */
+ prev_region_id = leaf_word * BITS_PER_LONG +
+ __fls(tmp_leaf_mask);
+ goto out;
+ }
+ }
+
+ /* Search the root mask for the leaf mask having prev region */
+ tmp_root_mask = *region_root_mask & (BIT(leaf_word) - 1);
+ if (tmp_root_mask) {
+ leaf_word = __fls(tmp_root_mask);
+
+ /* Get the prev region id from the leaf mask */
+ prev_region_id = leaf_word * BITS_PER_LONG +
+ __fls(region_leaf_mask[leaf_word]);
+ } else {
+ /*
+ * This itself is the first populated region in this
+ * freelist, so previous region doesn't exist.
+ */
+ prev_region_id = -1;
+ }
+
+out:
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+ WARN(prev_region_id >= region_id, "%s: bitmap logic messed up\n",
+ __func__);
+#endif
+ return prev_region_id;
+}
+
static void add_to_freelist(struct page *page, struct free_list *free_list)
{
struct list_head *prev_region_list, *lru;
struct mem_region_list *region;
- int region_id, i;
+ int region_id, prev_region_id;

lru = &page->lru;
region_id = page_zone_region_id(page);
@@ -527,12 +647,17 @@ static void add_to_freelist(struct page *page, struct free_list *free_list)
#endif

if (!list_empty(&free_list->list)) {
- for (i = region_id - 1; i >= 0; i--) {
- if (free_list->mr_list[i].page_block) {
- prev_region_list =
- free_list->mr_list[i].page_block;
- goto out;
- }
+ prev_region_id = find_prev_region(region_id, free_list);
+ if (prev_region_id >= 0) {
+ prev_region_list =
+ free_list->mr_list[prev_region_id].page_block;
+#ifdef CONFIG_DEBUG_PAGEALLOC
+ WARN(prev_region_list == NULL,
+ "%s: prev_region_list is NULL\n"
+ "region_id=%d, prev_region_id=%d\n", __func__,
+ region_id, prev_region_id);
+#endif
+ goto out;
}
}

@@ -553,6 +678,7 @@ out:

/* Save pointer to page block of this region */
region->page_block = lru;
+ set_region_bit(region_id, free_list);
}

/**
@@ -567,6 +693,7 @@ static void rmqueue_del_from_freelist(struct page *page,
struct free_list *free_list)
{
struct list_head *lru = &page->lru;
+ int region_id;

#ifdef CONFIG_DEBUG_PAGEALLOC
WARN((free_list->list.next != lru),
@@ -590,6 +717,8 @@ static void rmqueue_del_from_freelist(struct page *page,
* in this freelist.
*/
free_list->next_region->page_block = NULL;
+ region_id = free_list->next_region - free_list->mr_list;
+ clear_region_bit(region_id, free_list);

/* Set 'next_region' to the new first region in the freelist. */
set_next_region_in_freelist(free_list);
@@ -650,6 +779,7 @@ page_found:

if (region->nr_free == 0) {
region->page_block = NULL;
+ clear_region_bit(region_id, free_list);
} else {
region->page_block = prev_page_lru;
#ifdef CONFIG_DEBUG_PAGEALLOC

2013-04-09 21:50:50

by Srivatsa S. Bhat

Subject: [RFC PATCH v2 10/15] mm: Add support to accurately track per-memory-region allocation

The page allocator needs to be able to detect events such as the first page
allocation in a new region etc., in order to make smart decisions to aid
memory power management. So add the necessary support to accurately track
allocations on a per-region basis.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/mmzone.h | 2 +
mm/page_alloc.c | 66 ++++++++++++++++++++++++++++++++++++------------
2 files changed, 51 insertions(+), 17 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0258c68..6e209e9 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -86,6 +86,7 @@ static inline int get_pageblock_migratetype(struct page *page)
struct mem_region_list {
struct list_head *page_block;
unsigned long nr_free;
+ struct zone_mem_region *zone_region;
};

struct free_list {
@@ -341,6 +342,7 @@ struct zone_mem_region {
unsigned long end_pfn;
unsigned long present_pages;
unsigned long spanned_pages;
+ unsigned long nr_free;
};

struct zone {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 52d8a59..541e4ab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -625,7 +625,8 @@ out:
return prev_region_id;
}

-static void add_to_freelist(struct page *page, struct free_list *free_list)
+static void add_to_freelist(struct page *page, struct free_list *free_list,
+ int order)
{
struct list_head *prev_region_list, *lru;
struct mem_region_list *region;
@@ -636,6 +637,7 @@ static void add_to_freelist(struct page *page, struct free_list *free_list)

region = &free_list->mr_list[region_id];
region->nr_free++;
+ region->zone_region->nr_free += 1 << order;

if (region->page_block) {
list_add_tail(lru, region->page_block);
@@ -690,9 +692,10 @@ out:
* inside the freelist.
*/
static void rmqueue_del_from_freelist(struct page *page,
- struct free_list *free_list)
+ struct free_list *free_list, int order)
{
struct list_head *lru = &page->lru;
+ struct mem_region_list *mr_list;
int region_id;

#ifdef CONFIG_DEBUG_PAGEALLOC
@@ -703,7 +706,10 @@ static void rmqueue_del_from_freelist(struct page *page,
list_del(lru);

/* Fastpath */
- if (--(free_list->next_region->nr_free)) {
+ mr_list = free_list->next_region;
+ mr_list->zone_region->nr_free -= 1 << order;
+
+ if (--(mr_list->nr_free)) {

#ifdef CONFIG_DEBUG_PAGEALLOC
WARN(free_list->next_region->nr_free < 0,
@@ -725,7 +731,8 @@ static void rmqueue_del_from_freelist(struct page *page,
}

/* Generic delete function for region-aware buddy allocator. */
-static void del_from_freelist(struct page *page, struct free_list *free_list)
+static void del_from_freelist(struct page *page, struct free_list *free_list,
+ int order)
{
struct list_head *prev_page_lru, *lru, *p;
struct mem_region_list *region;
@@ -735,11 +742,12 @@ static void del_from_freelist(struct page *page, struct free_list *free_list)

/* Try to fastpath, if deleting from the head of the list */
if (lru == free_list->list.next)
- return rmqueue_del_from_freelist(page, free_list);
+ return rmqueue_del_from_freelist(page, free_list, order);

region_id = page_zone_region_id(page);
region = &free_list->mr_list[region_id];
region->nr_free--;
+ region->zone_region->nr_free -= 1 << order;

#ifdef CONFIG_DEBUG_PAGEALLOC
WARN(region->nr_free < 0, "%s: nr_free is negative\n", __func__);
@@ -794,10 +802,10 @@ page_found:
* Move a given page from one freelist to another.
*/
static void move_page_freelist(struct page *page, struct free_list *old_list,
- struct free_list *new_list)
+ struct free_list *new_list, int order)
{
- del_from_freelist(page, old_list);
- add_to_freelist(page, new_list);
+ del_from_freelist(page, old_list, order);
+ add_to_freelist(page, new_list, order);
}

/*
@@ -863,7 +871,8 @@ static inline void __free_one_page(struct page *page,
migratetype);
} else {
area = &zone->free_area[order];
- del_from_freelist(buddy, &area->free_list[migratetype]);
+ del_from_freelist(buddy, &area->free_list[migratetype],
+ order);
area->nr_free--;
rmv_page_order(buddy);
}
@@ -898,12 +907,13 @@ static inline void __free_one_page(struct page *page,
* switch off this entire "is next-higher buddy free?"
* logic when memory regions are used.
*/
- add_to_freelist(page, &area->free_list[migratetype]);
+ add_to_freelist(page, &area->free_list[migratetype],
+ order);
goto out;
}
}

- add_to_freelist(page, &area->free_list[migratetype]);
+ add_to_freelist(page, &area->free_list[migratetype], order);
out:
area->nr_free++;
}
@@ -1136,7 +1146,8 @@ static inline void expand(struct zone *zone, struct page *page,
continue;
}
#endif
- add_to_freelist(&page[size], &area->free_list[migratetype]);
+ add_to_freelist(&page[size], &area->free_list[migratetype],
+ high);
area->nr_free++;
set_page_order(&page[size], high);
}
@@ -1203,7 +1214,8 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,

page = list_entry(area->free_list[migratetype].list.next,
struct page, lru);
- rmqueue_del_from_freelist(page, &area->free_list[migratetype]);
+ rmqueue_del_from_freelist(page, &area->free_list[migratetype],
+ current_order);
rmv_page_order(page);
area->nr_free--;
expand(zone, page, order, current_order, area, migratetype);
@@ -1276,7 +1288,7 @@ int move_freepages(struct zone *zone,
old_mt = get_freepage_migratetype(page);
area = &zone->free_area[order];
move_page_freelist(page, &area->free_list[old_mt],
- &area->free_list[migratetype]);
+ &area->free_list[migratetype], order);
set_freepage_migratetype(page, migratetype);
page += 1 << order;
pages_moved += 1 << order;
@@ -1374,7 +1386,8 @@ __rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
}

/* Remove the page from the freelists */
- del_from_freelist(page, &area->free_list[migratetype]);
+ del_from_freelist(page, &area->free_list[migratetype],
+ current_order);
rmv_page_order(page);

/* Take ownership for orders >= pageblock_order */
@@ -1728,7 +1741,7 @@ static int __isolate_free_page(struct page *page, unsigned int order)

/* Remove page from free list */
mt = get_freepage_migratetype(page);
- del_from_freelist(page, &zone->free_area[order].free_list[mt]);
+ del_from_freelist(page, &zone->free_area[order].free_list[mt], order);
zone->free_area[order].nr_free--;
rmv_page_order(page);

@@ -5017,6 +5030,22 @@ static void __meminit init_node_memory_regions(struct pglist_data *pgdat)
pgdat->nr_node_regions = idx;
}

+static void __meminit zone_init_free_lists_late(struct zone *zone)
+{
+ struct mem_region_list *mr_list;
+ int order, t, i;
+
+ for_each_migratetype_order(order, t) {
+ for (i = 0; i < zone->nr_zone_regions; i++) {
+ mr_list =
+ &zone->free_area[order].free_list[t].mr_list[i];
+
+ mr_list->nr_free = 0;
+ mr_list->zone_region = &zone->zone_regions[i];
+ }
+ }
+}
+
static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)
{
unsigned long start_pfn, end_pfn, absent;
@@ -5064,6 +5093,8 @@ static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)

z->nr_zone_regions = idx;

+ zone_init_free_lists_late(z);
+
/*
* Revisit the last visited node memory region, in case it
* spans multiple zones.
@@ -6474,7 +6505,8 @@ __offline_isolated_pages(unsigned long start_pfn, unsigned long end_pfn)
pfn, 1 << order, end_pfn);
#endif
mt = get_freepage_migratetype(page);
- del_from_freelist(page, &zone->free_area[order].free_list[mt]);
+ del_from_freelist(page, &zone->free_area[order].free_list[mt],
+ order);
rmv_page_order(page);
zone->free_area[order].nr_free--;
for (i = 0; i < (1 << order); i++)

2013-04-09 21:51:17

by Srivatsa S. Bhat

[permalink] [raw]
Subject: [RFC PATCH v2 12/15] mm: Add infrastructure to evacuate memory regions using compaction

To enhance memory power-savings, we need to be able to completely evacuate
lightly allocated regions by moving their used pages to lower regions,
which helps consolidate all the allocations into a minimum no. of regions.
This can be done using some of the memory compaction and reclaim algorithms.
Develop such an infrastructure to evacuate memory regions completely.

The traditional compaction algorithm uses a pfn walker to get free pages
for compaction, but that would be far too costly for us. Instead, we do a pfn
walk only to isolate the used pages; to get free pages, we simply depend on
the fast buddy allocator itself. However, we are careful to abort when the
buddy allocator starts handing out free pages from this region itself or from
higher regions, because proceeding in that case would defeat the purpose of
the entire effort.
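
The abort condition boils down to a simple region comparison; a minimal
illustrative sketch (the patch's actual check lives in power_mgmt_alloc()
below, and the helper name here is hypothetical):

/*
 * Sketch only: a free page handed out by the buddy allocator is a useful
 * migration target only if it lies in a strictly lower region than the
 * page being evacuated; otherwise the evacuation should be aborted.
 */
static inline bool useful_migration_target(struct page *freepage,
					   struct page *migratepage)
{
	return page_zone_region_id(freepage) < page_zone_region_id(migratepage);
}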

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/compaction.h | 7 ++++
include/linux/gfp.h | 2 +
include/linux/migrate.h | 3 +-
include/trace/events/migrate.h | 3 +-
mm/compaction.c | 72 ++++++++++++++++++++++++++++++++++++++++
mm/page_alloc.c | 5 +--
6 files changed, 87 insertions(+), 5 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 091d72e..6be2b08 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -26,6 +26,7 @@ extern unsigned long try_to_compact_pages(struct zonelist *zonelist,
extern void compact_pgdat(pg_data_t *pgdat, int order);
extern void reset_isolation_suitable(pg_data_t *pgdat);
extern unsigned long compaction_suitable(struct zone *zone, int order);
+extern int evacuate_mem_region(struct zone *z, struct zone_mem_region *zmr);

/* Do not skip compaction more than 64 times */
#define COMPACT_MAX_DEFER_SHIFT 6
@@ -102,6 +103,12 @@ static inline bool compaction_deferred(struct zone *zone, int order)
return true;
}

+static inline int evacuate_mem_region(struct zone *z,
+ struct zone_mem_region *zmr)
+{
+ return 0;
+}
+
#endif /* CONFIG_COMPACTION */

#if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0f615eb..dd5430f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -351,6 +351,8 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
extern unsigned long get_zeroed_page(gfp_t gfp_mask);

+int rmqueue_bulk(struct zone *zone, unsigned int order, unsigned long count,
+ struct list_head *list, int migratetype, int cold);
void *alloc_pages_exact(size_t size, gfp_t gfp_mask);
void free_pages_exact(void *virt, size_t size);
/* This is different from alloc_pages_exact_node !!! */
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3dc..e006be9 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -30,7 +30,8 @@ enum migrate_reason {
MR_SYSCALL, /* also applies to cpusets */
MR_MEMPOLICY_MBIND,
MR_NUMA_MISPLACED,
- MR_CMA
+ MR_CMA,
+ MR_PWR_MGMT
};

#ifdef CONFIG_MIGRATION
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index ec2a6cc..e6892c0 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -15,7 +15,8 @@
{MR_MEMORY_HOTPLUG, "memory_hotplug"}, \
{MR_SYSCALL, "syscall_or_cpuset"}, \
{MR_MEMPOLICY_MBIND, "mempolicy_mbind"}, \
- {MR_CMA, "cma"}
+ {MR_CMA, "cma"}, \
+ {MR_PWR_MGMT, "power_management"}

TRACE_EVENT(mm_migrate_pages,

diff --git a/mm/compaction.c b/mm/compaction.c
index ff9cf23..a76ad90 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1162,6 +1162,78 @@ unsigned long try_to_compact_pages(struct zonelist *zonelist,
return rc;
}

+static struct page *power_mgmt_alloc(struct page *migratepage,
+ unsigned long data, int **result)
+{
+ struct compact_control *cc = (struct compact_control *)data;
+ struct page *freepage;
+
+ /*
+ * Try to allocate pages from lower memory regions. If it fails,
+ * abort.
+ */
+ if (list_empty(&cc->freepages)) {
+ struct zone *z = page_zone(migratepage);
+
+ rmqueue_bulk(z, 0, cc->nr_migratepages, &cc->freepages,
+ MIGRATE_MOVABLE, 1);
+
+ if (list_empty(&cc->freepages))
+ return NULL;
+ }
+
+ freepage = list_entry(cc->freepages.next, struct page, lru);
+
+ if (page_zone_region_id(freepage) >= page_zone_region_id(migratepage))
+ return NULL; /* Freepage is not from lower region, so abort */
+
+ list_del(&freepage->lru);
+ cc->nr_freepages--;
+
+ return freepage;
+}
+
+static unsigned long power_mgmt_release_freepages(unsigned long info)
+{
+ struct compact_control *cc = (struct compact_control *)info;
+
+ return release_freepages(&cc->freepages);
+}
+
+int evacuate_mem_region(struct zone *z, struct zone_mem_region *zmr)
+{
+ unsigned long start_pfn = zmr->start_pfn;
+ unsigned long end_pfn = zmr->end_pfn;
+
+ struct compact_control cc = {
+ .nr_migratepages = 0,
+ .order = -1,
+ .zone = page_zone(pfn_to_page(start_pfn)),
+ .sync = false, /* Async migration */
+ .ignore_skip_hint = true,
+ };
+
+ struct aggression_control ac = {
+ .isolate_unevictable = false,
+ .prep_all = false,
+ .reclaim_clean = true,
+ .max_tries = 1,
+ .reason = MR_PWR_MGMT,
+ };
+
+ struct free_page_control fc = {
+ .free_page_alloc = power_mgmt_alloc,
+ .alloc_data = (unsigned long)&cc,
+ .release_freepages = power_mgmt_release_freepages,
+ .free_data = (unsigned long)&cc,
+ };
+
+ INIT_LIST_HEAD(&cc.migratepages);
+ INIT_LIST_HEAD(&cc.freepages);
+
+ return compact_range(&cc, &ac, &fc, start_pfn, end_pfn);
+}
+

/* Compact all zones within a node */
static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f31ca94..40a3aa6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1445,9 +1445,8 @@ retry_reserve:
* a single hold of the lock, for efficiency. Add them to the supplied list.
* Returns the number of new pages which were placed at *list.
*/
-static int rmqueue_bulk(struct zone *zone, unsigned int order,
- unsigned long count, struct list_head *list,
- int migratetype, int cold)
+int rmqueue_bulk(struct zone *zone, unsigned int order, unsigned long count,
+ struct list_head *list, int migratetype, int cold)
{
int mt = migratetype, i;

2013-04-09 21:51:46

by Srivatsa S. Bhat

[permalink] [raw]
Subject: [RFC PATCH v2 14/15] mm: Add alloc-free handshake to trigger memory region compaction

We need a way to decide when to trigger the worker threads to perform
region evacuation/compaction. So the strategy used is as follows:

Alloc path of page allocator:
----------------------------

This accurately tracks the allocations, detects the first allocation
in a new region, and notes down that region number. Performing compaction
right away is not going to be helpful, because we need free pages in the
lower regions to be able to do that; and the page allocator allocated from
this region precisely because there was no memory available in the lower
regions. So the alloc path just notes down the freshly used region's id.

Free path of page allocator:
---------------------------

When we enter this path, we know that some memory is being freed. Here we
check if the alloc path had noted down any region for compaction. If so,
we trigger the worker function that tries to compact that memory.

Also, we avoid any locking/synchronization overhead for this worker
function in the alloc/free paths, by attaching appropriate semantics to the
available status flags, such that no special locking is needed
around them.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

mm/page_alloc.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 57 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index db7b892..675a435 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -631,6 +631,7 @@ static void add_to_freelist(struct page *page, struct free_list *free_list,
struct list_head *prev_region_list, *lru;
struct mem_region_list *region;
int region_id, prev_region_id;
+ struct mem_power_ctrl *mpc;

lru = &page->lru;
region_id = page_zone_region_id(page);
@@ -639,6 +640,17 @@ static void add_to_freelist(struct page *page, struct free_list *free_list,
region->nr_free++;
region->zone_region->nr_free += 1 << order;

+ /*
+ * If the alloc path detected the usage of a new region, now is
+ * the time to complete the handshake and queue a worker
+ * to try compaction on that region.
+ */
+ mpc = &page_zone(page)->mem_power_ctrl;
+ if (!is_mem_pwr_work_in_progress(mpc) && mpc->region) {
+ set_mem_pwr_work_in_progress(mpc);
+ queue_work(system_unbound_wq, &mpc->work);
+ }
+
if (region->page_block) {
list_add_tail(lru, region->page_block);
return;
@@ -696,7 +708,9 @@ static void rmqueue_del_from_freelist(struct page *page,
{
struct list_head *lru = &page->lru;
struct mem_region_list *mr_list;
- int region_id;
+ struct zone_mem_region *zmr;
+ struct mem_power_ctrl *mpc;
+ int region_id, mt;

#ifdef CONFIG_DEBUG_PAGEALLOC
WARN((free_list->list.next != lru),
@@ -706,8 +720,27 @@ static void rmqueue_del_from_freelist(struct page *page,
list_del(lru);

/* Fastpath */
+ region_id = free_list->next_region - free_list->mr_list;
mr_list = free_list->next_region;
- mr_list->zone_region->nr_free -= 1 << order;
+ zmr = mr_list->zone_region;
+ if (region_id != 0 && (zmr->nr_free == zmr->present_pages)) {
+ /*
+ * This is the first alloc in this memory region. So try
+ * compacting this region in the near future. Don't bother
+ * if this is an unmovable/non-reclaimable allocation.
+ * Also don't try compacting region 0 because its pointless.
+ */
+ mt = get_freepage_migratetype(page);
+ if (is_migrate_cma(mt) || mt == MIGRATE_MOVABLE ||
+ mt == MIGRATE_RECLAIMABLE) {
+
+ mpc = &page_zone(page)->mem_power_ctrl;
+ if (!is_mem_pwr_work_in_progress(mpc))
+ mpc->region = zmr;
+ }
+ }
+
+ zmr->nr_free -= 1 << order;

if (--(mr_list->nr_free)) {

@@ -723,7 +756,6 @@ static void rmqueue_del_from_freelist(struct page *page,
* in this freelist.
*/
free_list->next_region->page_block = NULL;
- region_id = free_list->next_region - free_list->mr_list;
clear_region_bit(region_id, free_list);

/* Set 'next_region' to the new first region in the freelist. */
@@ -736,7 +768,9 @@ static void del_from_freelist(struct page *page, struct free_list *free_list,
{
struct list_head *prev_page_lru, *lru, *p;
struct mem_region_list *region;
- int region_id;
+ struct zone_mem_region *zmr;
+ struct mem_power_ctrl *mpc;
+ int region_id, mt;

lru = &page->lru;

@@ -746,6 +780,25 @@ static void del_from_freelist(struct page *page, struct free_list *free_list,

region_id = page_zone_region_id(page);
region = &free_list->mr_list[region_id];
+
+ zmr = region->zone_region;
+ if (region_id != 0 && (zmr->nr_free == zmr->present_pages)) {
+ /*
+ * This is the first alloc in this memory region. So try
+ * compacting this region in the near future. Don't bother
+ * if this is an unmovable/non-reclaimable allocation.
+ * Also don't try compacting region 0 because its pointless.
+ */
+ mt = get_freepage_migratetype(page);
+ if (is_migrate_cma(mt) || mt == MIGRATE_MOVABLE ||
+ mt == MIGRATE_RECLAIMABLE) {
+
+ mpc = &page_zone(page)->mem_power_ctrl;
+ if (!is_mem_pwr_work_in_progress(mpc))
+ mpc->region = zmr;
+ }
+ }
+
region->nr_free--;
region->zone_region->nr_free -= 1 << order;

2013-04-09 21:51:57

by Srivatsa S. Bhat

[permalink] [raw]
Subject: [RFC PATCH v2 15/15] mm: Print memory region statistics to understand the buddy allocator behavior

In order to observe the behavior of the region-aware buddy allocator, modify
vmstat.c to also print memory-region-related statistics. In particular, enable
memory-region-related info in /proc/zoneinfo and /proc/buddyinfo, since they
would help us to at least (roughly) observe how the new buddy allocator is
behaving.

For now, the region statistics correspond to the zone memory regions and not
the (absolute) node memory regions, and some of the statistics (especially the
no. of present pages) might not be very accurate. But since we account for
and print the free page statistics for every zone memory region accurately, we
should be able to observe the new page allocator behavior to a reasonable
degree of accuracy.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

mm/vmstat.c | 34 ++++++++++++++++++++++++++++++----
1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 63e12f0..1a90475 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -825,11 +825,28 @@ const char * const vmstat_text[] = {
static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
struct zone *zone)
{
- int order;
+ int i, order, t;
+ struct free_area *area;

- seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
- for (order = 0; order < MAX_ORDER; ++order)
- seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+ seq_printf(m, "Node %d, zone %8s \n", pgdat->node_id, zone->name);
+
+ for (i = 0; i < zone->nr_zone_regions; i++) {
+
+ seq_printf(m, "\t\t Region %d ", i);
+
+ for (order = 0; order < MAX_ORDER; ++order) {
+ unsigned long nr_free = 0;
+
+ area = &zone->free_area[order];
+
+ for (t = 0; t < MIGRATE_TYPES; t++) {
+ nr_free +=
+ area->free_list[t].mr_list[i].nr_free;
+ }
+ seq_printf(m, "%6lu ", nr_free);
+ }
+ seq_putc(m, '\n');
+ }
seq_putc(m, '\n');
}

@@ -1016,6 +1033,15 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
zone->present_pages,
zone->managed_pages);

+ seq_printf(m, "\n\nPer-region page stats\t present\t free\n\n");
+ for (i = 0; i < zone->nr_zone_regions; i++) {
+ struct zone_mem_region *region;
+
+ region = &zone->zone_regions[i];
+ seq_printf(m, "\tRegion %d \t %6lu \t %6lu\n", i,
+ region->present_pages, region->nr_free);
+ }
+
for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++)
seq_printf(m, "\n %-12s %lu", vmstat_text[i],
zone_page_state(zone, i));

2013-04-09 21:52:04

by Srivatsa S. Bhat

[permalink] [raw]
Subject: [RFC PATCH v2 13/15] mm: Implement the worker function for memory region compaction

We are going to invoke the memory compaction algorithms for region-evacuation
from worker threads, instead of dedicating a separate kthread to it. So
add the worker infrastructure to perform this.

In the worker, we calculate the cost of migration/compaction for a given
region - if we need to migrate no more than 32 pages, then we go ahead, else
we deem the effort to be too costly and abort the compaction.
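
To put that threshold in perspective (assuming the 512 MB region size and
4 KB base pages used elsewhere in this patchset): a 512 MB region holds
131072 base pages, so a limit of 32 pages means a region is evacuated only
when at most 32 * 4 KB = 128 KB of it is still in use, i.e. when the region
is already more than 99.9% free.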

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/mm.h | 20 ++++++++++++++++++++
include/linux/mmzone.h | 21 +++++++++++++++++++++
mm/page_alloc.c | 33 +++++++++++++++++++++++++++++++++
3 files changed, 74 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cb0d898..e380eeb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -755,6 +755,26 @@ static inline void set_next_region_in_freelist(struct free_list *free_list)
}
}

+static inline int is_mem_pwr_work_in_progress(struct mem_power_ctrl *mpc)
+{
+ if (mpc->work_status == MEM_PWR_WORK_IN_PROGRESS)
+ return 1;
+ return 0;
+}
+
+static inline void set_mem_pwr_work_in_progress(struct mem_power_ctrl *mpc)
+{
+ mpc->work_status = MEM_PWR_WORK_IN_PROGRESS;
+ smp_mb();
+}
+
+static inline void set_mem_pwr_work_complete(struct mem_power_ctrl *mpc)
+{
+ mpc->work_status = MEM_PWR_WORK_COMPLETE;
+ mpc->region = NULL;
+ smp_mb();
+}
+
#ifdef SECTION_IN_PAGE_FLAGS
static inline void set_page_section(struct page *page, unsigned long section)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6e209e9..fdadd2a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -17,6 +17,7 @@
#include <linux/pageblock-flags.h>
#include <linux/page-flags-layout.h>
#include <linux/atomic.h>
+#include <linux/workqueue.h>
#include <asm/page.h>

/* Free memory management - zoned buddy allocator. */
@@ -337,6 +338,24 @@ enum zone_type {

#ifndef __GENERATING_BOUNDS_H

+/*
+ * In order to evacuate a memory region, if the no. of pages to be migrated
+ * via compaction is more than this number, the effort is considered too
+ * costly and should be aborted.
+ */
+#define MAX_NR_MEM_PWR_MIGRATE_PAGES 32
+
+enum {
+ MEM_PWR_WORK_COMPLETE = 0,
+ MEM_PWR_WORK_IN_PROGRESS
+};
+
+struct mem_power_ctrl {
+ struct work_struct work;
+ struct zone_mem_region *region;
+ int work_status;
+};
+
struct zone_mem_region {
unsigned long start_pfn;
unsigned long end_pfn;
@@ -405,6 +424,8 @@ struct zone {
struct zone_mem_region zone_regions[MAX_NR_ZONE_REGIONS];
int nr_zone_regions;

+ struct mem_power_ctrl mem_power_ctrl;
+
#ifndef CONFIG_SPARSEMEM
/*
* Flags for a pageblock_nr_pages block. See pageblock-flags.h.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 40a3aa6..db7b892 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5002,6 +5002,35 @@ static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat)
#endif /* CONFIG_FLAT_NODE_MEM_MAP */
}

+static void mem_power_mgmt_fn(struct work_struct *work)
+{
+ struct mem_power_ctrl *mpc;
+ struct zone_mem_region *region;
+ unsigned long pages_in_use;
+ struct zone *zone;
+
+ mpc = container_of(work, struct mem_power_ctrl, work);
+
+ if (!mpc->region)
+ return; /* No work to do */
+
+ zone = container_of(mpc, struct zone, mem_power_ctrl);
+ region = mpc->region;
+
+ if (region == zone->zone_regions)
+ return; /* No point compacting region 0. */
+
+ pages_in_use = region->present_pages - region->nr_free;
+
+ if (pages_in_use > 0 &&
+ (pages_in_use <= MAX_NR_MEM_PWR_MIGRATE_PAGES)) {
+
+ evacuate_mem_region(zone, region);
+ }
+
+ set_mem_pwr_work_complete(mpc);
+}
+
static void __meminit init_node_memory_regions(struct pglist_data *pgdat)
{
int nid = pgdat->node_id;
@@ -5094,6 +5123,10 @@ static void __meminit init_zone_memory_regions(struct pglist_data *pgdat)

zone_init_free_lists_late(z);

+ INIT_WORK(&z->mem_power_ctrl.work, mem_power_mgmt_fn);
+ z->mem_power_ctrl.region = NULL;
+ set_mem_pwr_work_complete(&z->mem_power_ctrl);
+
/*
* Revisit the last visited node memory region, in case it
* spans multiple zones.

2013-04-09 21:52:31

by Srivatsa S. Bhat

[permalink] [raw]
Subject: [RFC PATCH v2 11/15] mm: Restructure the compaction part of CMA for wider use

CMA uses bits and pieces of the memory compaction algorithms to perform
large contiguous allocations. Those algorithms would be useful for
memory power management too, to evacuate entire regions of memory.
So restructure the code in a way that allows it to be easily reused for
both use-cases.

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

mm/compaction.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
mm/internal.h | 40 +++++++++++++++++++++++++++++
mm/page_alloc.c | 51 ++++++++++---------------------------
3 files changed, 128 insertions(+), 38 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 13912f5..ff9cf23 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -816,6 +816,81 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
return ISOLATE_SUCCESS;
}

+/*
+ * Make free pages available within the given range, using compaction to
+ * migrate used pages elsewhere.
+ *
+ * [start, end) must belong to a single zone.
+ *
+ * This function is roughly based on the logic inside compact_zone().
+ */
+int compact_range(struct compact_control *cc, struct aggression_control *ac,
+ struct free_page_control *fc, unsigned long start,
+ unsigned long end)
+{
+ unsigned long pfn = start;
+ int ret = 0, tries, migrate_mode;
+
+ if (ac->prep_all)
+ migrate_prep();
+ else
+ migrate_prep_local();
+
+ while (pfn < end || !list_empty(&cc->migratepages)) {
+ if (list_empty(&cc->migratepages)) {
+ cc->nr_migratepages = 0;
+ pfn = isolate_migratepages_range(cc->zone, cc,
+ pfn, end, ac->isolate_unevictable);
+
+ if (!pfn) {
+ ret = -EINTR;
+ break;
+ }
+ }
+
+ for (tries = 0; tries < ac->max_tries; tries++) {
+ if (fatal_signal_pending(current)){
+ ret = -EINTR;
+ goto out;
+ }
+
+ if (ac->reclaim_clean) {
+ int nr_reclaimed;
+
+ nr_reclaimed =
+ reclaim_clean_pages_from_list(cc->zone,
+ &cc->migratepages);
+
+ cc->nr_migratepages -= nr_reclaimed;
+ }
+
+ migrate_mode = cc->sync ? MIGRATE_SYNC : MIGRATE_ASYNC;
+ ret = migrate_pages(&cc->migratepages,
+ fc->free_page_alloc, fc->alloc_data,
+ migrate_mode, ac->reason);
+
+ update_nr_listpages(cc);
+ }
+
+ if (tries == ac->max_tries) {
+ ret = ret < 0 ? ret : -EBUSY;
+ break;
+ }
+ }
+
+out:
+ if (ret < 0)
+ putback_movable_pages(&cc->migratepages);
+
+ /* Release free pages and check accounting */
+ if (fc->release_freepages)
+ cc->nr_freepages -= fc->release_freepages(fc->free_data);
+
+ VM_BUG_ON(cc->nr_freepages != 0);
+
+ return ret;
+}
+
static int compact_finished(struct zone *zone,
struct compact_control *cc)
{
diff --git a/mm/internal.h b/mm/internal.h
index 8562de0..398fe73 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -110,6 +110,42 @@ extern bool is_free_buddy_page(struct page *page);
/*
* in mm/compaction.c
*/
+
+struct free_page_control {
+
+ /* Function used to allocate free pages as target of migration. */
+ struct page * (*free_page_alloc)(struct page *migratepage,
+ unsigned long data,
+ int **result);
+
+ unsigned long alloc_data; /* Private data for free_page_alloc() */
+
+ /*
+ * Function to release the accumulated free pages after the compaction
+ * run.
+ */
+ unsigned long (*release_freepages)(unsigned long info);
+ unsigned long free_data; /* Private data for release_freepages() */
+};
+
+/*
+ * aggression_control gives us fine-grained control to specify how aggressively
+ * we want to compact memory.
+ */
+struct aggression_control {
+ bool isolate_unevictable; /* Isolate unevictable pages too */
+ bool prep_all; /* Use migrate_prep() instead of
+ * migrate_prep_local().
+ */
+ bool reclaim_clean; /* Reclaim clean page cache pages */
+ int max_tries; /* No. of tries to migrate the
+ * isolated pages before giving up.
+ */
+ int reason; /* Reason for compaction, passed on
+ * as reason for migrate_pages().
+ */
+};
+
/*
* compact_control is used to track pages being migrated and the free pages
* they are being migrated to during memory compaction. The free_pfn starts
@@ -144,6 +180,10 @@ unsigned long
isolate_migratepages_range(struct zone *zone, struct compact_control *cc,
unsigned long low_pfn, unsigned long end_pfn, bool unevictable);

+int compact_range(struct compact_control *cc, struct aggression_control *ac,
+ struct free_page_control *fc, unsigned long start,
+ unsigned long end);
+
#endif

/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 541e4ab..f31ca94 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6226,46 +6226,21 @@ static unsigned long pfn_max_align_up(unsigned long pfn)
static int __alloc_contig_migrate_range(struct compact_control *cc,
unsigned long start, unsigned long end)
{
- /* This function is based on compact_zone() from compaction.c. */
- unsigned long nr_reclaimed;
- unsigned long pfn = start;
- unsigned int tries = 0;
- int ret = 0;
-
- migrate_prep();
-
- while (pfn < end || !list_empty(&cc->migratepages)) {
- if (fatal_signal_pending(current)) {
- ret = -EINTR;
- break;
- }
-
- if (list_empty(&cc->migratepages)) {
- cc->nr_migratepages = 0;
- pfn = isolate_migratepages_range(cc->zone, cc,
- pfn, end, true);
- if (!pfn) {
- ret = -EINTR;
- break;
- }
- tries = 0;
- } else if (++tries == 5) {
- ret = ret < 0 ? ret : -EBUSY;
- break;
- }
+ struct aggression_control ac = {
+ .isolate_unevictable = true,
+ .prep_all = true,
+ .reclaim_clean = true,
+ .max_tries = 5,
+ .reason = MR_CMA,
+ };

- nr_reclaimed = reclaim_clean_pages_from_list(cc->zone,
- &cc->migratepages);
- cc->nr_migratepages -= nr_reclaimed;
+ struct free_page_control fc = {
+ .free_page_alloc = alloc_migrate_target,
+ .alloc_data = 0,
+ .release_freepages = NULL,
+ };

- ret = migrate_pages(&cc->migratepages, alloc_migrate_target,
- 0, MIGRATE_SYNC, MR_CMA);
- }
- if (ret < 0) {
- putback_movable_pages(&cc->migratepages);
- return ret;
- }
- return 0;
+ return compact_range(cc, &ac, &fc, start, end);
}

/**

2013-04-09 21:50:11

by Srivatsa S. Bhat

[permalink] [raw]
Subject: [RFC PATCH v2 07/15] mm: Add an optimized version of del_from_freelist to keep page allocation fast

One of the main advantages of this design of memory regions is that page
allocations can potentially be extremely fast - with almost no extra
overhead from memory regions.

To exploit that, introduce an optimized version of del_from_freelist(), which
utilizes the fact that we always delete items from the head of the list
during page allocation.

Basically, we want to keep a note of the region from which we are allocating
in a given freelist, to avoid having to compute the page-to-zone-region mapping
for every page allocation. So introduce a 'next_region' pointer in every freelist
to achieve that, and use it to keep the fastpath of page allocation almost as
fast as it would have been without memory regions.
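
In other words, the invariant is that 'next_region' always describes the
region of the pageblock at the head of the freelist. A minimal sketch of
that invariant, written as a debug check (illustrative only, not part of
this patch):

/*
 * Sketch only: whenever the freelist is non-empty, next_region must match
 * the region of the pageblock at the head of the list, which is what lets
 * head-deletions skip the page-to-region lookup entirely.
 */
static inline void check_next_region_invariant(struct free_list *free_list)
{
	struct page *head;

	if (list_empty(&free_list->list)) {
		WARN_ON(free_list->next_region != NULL);
		return;
	}

	head = list_entry(free_list->list.next, struct page, lru);
	WARN_ON(free_list->next_region !=
		&free_list->mr_list[page_zone_region_id(head)]);
}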

Signed-off-by: Srivatsa S. Bhat <[email protected]>
---

include/linux/mm.h | 14 +++++++++++
include/linux/mmzone.h | 6 +++++
mm/page_alloc.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dff478b..cb0d898 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -741,6 +741,20 @@ static inline int page_zone_region_id(const struct page *page)
return pgdat->node_regions[node_region_idx].zone_region_idx[z_num];
}

+static inline void set_next_region_in_freelist(struct free_list *free_list)
+{
+ struct page *page;
+ int region_id;
+
+ if (unlikely(list_empty(&free_list->list))) {
+ free_list->next_region = NULL;
+ } else {
+ page = list_entry(free_list->list.next, struct page, lru);
+ region_id = page_zone_region_id(page);
+ free_list->next_region = &free_list->mr_list[region_id];
+ }
+}
+
#ifdef SECTION_IN_PAGE_FLAGS
static inline void set_page_section(struct page *page, unsigned long section)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 76667bf..d8d67fc 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -92,6 +92,12 @@ struct free_list {
struct list_head list;

/*
+ * Pointer to the region from which the next allocation will be
+ * satisfied. (Same as the freelist's first pageblock's region.)
+ */
+ struct mem_region_list *next_region; /* for fast page allocation */
+
+ /*
* Demarcates pageblocks belonging to different regions within
* this freelist.
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7fb4254..a68174c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -539,6 +539,15 @@ static void add_to_freelist(struct page *page, struct free_list *free_list)
/* This is the first region, so add to the head of the list */
prev_region_list = &free_list->list;

+#ifdef CONFIG_DEBUG_PAGEALLOC
+ WARN((list_empty(&free_list->list) && free_list->next_region != NULL),
+ "%s: next_region not NULL\n", __func__);
+#endif
+ /*
+ * Set 'next_region' to this region, since this is the first region now
+ */
+ free_list->next_region = region;
+
out:
list_add(lru, prev_region_list);

@@ -546,6 +555,47 @@ out:
region->page_block = lru;
}

+/**
+ * __rmqueue_smallest() *always* deletes elements from the head of the
+ * list. Use this knowledge to keep page allocation fast, despite being
+ * region-aware.
+ *
+ * Do *NOT* call this function if you are deleting from somewhere deep
+ * inside the freelist.
+ */
+static void rmqueue_del_from_freelist(struct page *page,
+ struct free_list *free_list)
+{
+ struct list_head *lru = &page->lru;
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+ WARN((free_list->list.next != lru),
+ "%s: page not at head of list", __func__);
+#endif
+
+ list_del(lru);
+
+ /* Fastpath */
+ if (--(free_list->next_region->nr_free)) {
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+ WARN(free_list->next_region->nr_free < 0,
+ "%s: nr_free is negative\n", __func__);
+#endif
+ return;
+ }
+
+ /*
+ * Slowpath, when this is the last pageblock of this region
+ * in this freelist.
+ */
+ free_list->next_region->page_block = NULL;
+
+ /* Set 'next_region' to the new first region in the freelist. */
+ set_next_region_in_freelist(free_list);
+}
+
+/* Generic delete function for region-aware buddy allocator. */
static void del_from_freelist(struct page *page, struct free_list *free_list)
{
struct list_head *prev_page_lru, *lru, *p;
@@ -553,6 +603,11 @@ static void del_from_freelist(struct page *page, struct free_list *free_list)
int region_id;

lru = &page->lru;
+
+ /* Try to fastpath, if deleting from the head of the list */
+ if (lru == free_list->list.next)
+ return rmqueue_del_from_freelist(page, free_list);
+
region_id = page_zone_region_id(page);
region = &free_list->mr_list[region_id];
region->nr_free--;
@@ -588,6 +643,11 @@ page_found:
prev_page_lru = lru->prev;
list_del(lru);

+ /*
+ * Since we are not deleting from the head of the freelist, the
+ * 'next_region' pointer doesn't have to change.
+ */
+
if (region->nr_free == 0) {
region->page_block = NULL;
} else {
@@ -1013,7 +1073,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,

page = list_entry(area->free_list[migratetype].list.next,
struct page, lru);
- del_from_freelist(page, &area->free_list[migratetype]);
+ rmqueue_del_from_freelist(page, &area->free_list[migratetype]);
rmv_page_order(page);
area->nr_free--;
expand(zone, page, order, current_order, area, migratetype);

2013-04-10 23:26:14

by Cody P Schafer

[permalink] [raw]
Subject: Re: [RFC PATCH v2 14/15] mm: Add alloc-free handshake to trigger memory region compaction

On 04/09/2013 02:48 PM, Srivatsa S. Bhat wrote:
> We need a way to decide when to trigger the worker threads to perform
> region evacuation/compaction. So the strategy used is as follows:
>
> Alloc path of page allocator:
> ----------------------------
>
> This accurately tracks the allocations and detects the first allocation
> in a new region and notes down that region number. Performing compaction
> rightaway is not going to be helpful because we need free pages in the
> lower regions to be able to do that. And the page allocator allocated in
> this region precisely because there was no memory available in lower regions.
> So the alloc path just notes down the freshly used region's id.
>
> Free path of page allocator:
> ---------------------------
>
> When we enter this path, we know that some memory is being freed. Here we
> check if the alloc path had noted down any region for compaction. If so,
> we trigger the worker function that tries to compact that memory.
>
> Also, we avoid any locking/synchronization overhead over this worker
> function in the alloc/free path, by attaching appropriate semantics to the
> available status flags etc, such that we won't need any special locking
> around them.
>

Can you explain why avoiding locking works in this case?

It appears the lack of locking is only on the worker side, and the
mem_power_ctrl is implicitly protected by zone->lock on the alloc & free
side.

In the previous patch I see smp_mb(), but no explanation is provided for
why they are needed. Are they related to/necessary for this lack of locking?

What happens when a region is passed over for compaction because the
worker is already compacting another region? Can this occur? Will the
compaction re-trigger appropriately?

I recommend combining this patch and the previous patch to make the
interface clearer, or adding functions that explicitly handle the
interface for accessing mem_power_ctrl.

2013-04-16 13:52:10

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [RFC PATCH v2 14/15] mm: Add alloc-free handshake to trigger memory region compaction


Hi Cody,

Thank you for your review comments and sorry for the delay in replying!

On 04/11/2013 04:56 AM, Cody P Schafer wrote:
> On 04/09/2013 02:48 PM, Srivatsa S. Bhat wrote:
>> We need a way to decide when to trigger the worker threads to perform
>> region evacuation/compaction. So the strategy used is as follows:
>>
>> Alloc path of page allocator:
>> ----------------------------
>>
>> This accurately tracks the allocations and detects the first allocation
>> in a new region and notes down that region number. Performing compaction
>> rightaway is not going to be helpful because we need free pages in the
>> lower regions to be able to do that. And the page allocator allocated in
>> this region precisely because there was no memory available in lower
>> regions.
>> So the alloc path just notes down the freshly used region's id.
>>
>> Free path of page allocator:
>> ---------------------------
>>
>> When we enter this path, we know that some memory is being freed. Here we
>> check if the alloc path had noted down any region for compaction. If so,
>> we trigger the worker function that tries to compact that memory.
>>
>> Also, we avoid any locking/synchronization overhead over this worker
>> function in the alloc/free path, by attaching appropriate semantics to
>> the
>> available status flags etc, such that we won't need any special locking
>> around them.
>>
>
> Can you explain why avoiding locking works in this case?
>

Sure, see below. BTW, the whole idea behind doing this is to avoid additional
overhead as much as possible, since these are quite hot paths in the kernel.

> It appears the lack of locking is only on the worker side, and the
> mem_power_ctrl is implicitly protected by zone->lock on the alloc & free
> side.
>

That's right. What I meant to say is that I don't introduce any *extra*
locking overhead in the alloc/free path, just to synchronize the updates to
mem_power_ctrl. On the alloc/free side, as you rightly noted, I piggyback
on the zone->lock to get the synchronization right.

On the worker side, I don't need any locking, due to the following reasons:

a. Only 1 worker (at max) is active at any given time.

The free path of the page allocator (which queues the worker) never
queues more than 1 worker at a time. If a worker is still busy doing a
previously queued work, the free path just ignores new hints from the
alloc path about region evacuation. So essentially no 2 workers run at
the same time. So the worker need not worry about being re-entrant.

b. The ->work_status field in the mem_power_ctrl structure is never written
to by 2 different tasks at the same time.

The free path always avoids updates to the ->work_status field in presence
of an active worker. That is, until the ->work_status is set to
MEM_PWR_WORK_COMPLETE by the worker, the free path won't write to it.

So the ->work_status field can be written to by the worker and read by
the free path at the same time - which is fine, because in that case,
if the free path reads the old value, it will just assume that the worker
is still active and ignore the alloc path's hint, which is harmless.
Similar is the case about why the alloc path can read the ->work_status
without locking out the worker : if it reads the old value, it doesn't
set any new hints in ->region, which is again fine.

c. The ->region field in the mem_power_ctrl structure is also never written
to by 2 different tasks at the same time. This goes by extending the logic
in 'b'.

Yes, this scheme could mean that sometimes we might lose a few opportunities to
perform region evacuation, but that is OK, because that's the price we pay
in order to avoid hurting performance too much. Besides, there's a more
important reason why its actually critical that we aren't too aggressive
and jump at every opportunity to do compaction; see below.

> In the previous patch I see smp_mb(), but no explanation is provided for
> why they are needed. Are they related to/necessary for this lack of
> locking?
>

Hmm, looking at that again, I don't think it is needed. I'll remove it in
the next version.

> What happens when a region is passed over for compaction because the
> worker is already compacting another region? Can this occur?

Yes it can occur. But since we try to allocate pages in increasing order of
regions, if this situation does occur, there is a very good chance that we
won't benefit from compacting both regions, see below.

> Will the
> compaction re-trigger appropriately?
>

No, there is no re-trigger and that's by design. A particular region being
suitable for compaction is only a transient/temporary condition; it might
not persist forever. So it might not be worth trying over and over.
So if the worker was busy compacting some other region when the alloc path
hinted a new region for compaction, we simply ignore it, because there is
no guarantee that that situation (the new region being suitable for compaction)
would still hold by the time the worker finishes its current job.

Part of it is actually true even for *any* work that the worker performs:
by the time it gets into action, the region might not be suitable for
compaction any more, perhaps because more pages have been allocated from that
region in the meantime, making evacuation costlier. So, that part is handled
by re-evaluating the situation by looking at the region statistics in the
worker, before actually performing the compaction.

The harder problem to solve is: how to avoid having workers clash or otherwise
undo the efforts of each other. That is, say we tried to compact 2 regions
say A and B, one soon after the other. Then it is a little hard to guarantee
that we didn't do the stupid mistake of first moving pages of A to B via
compaction and then again compacting B and moving the pages elsewhere.
I still need to think of ways to explicitly avoid this from happening. But on
a first approximation, as mentioned above, if the alloc path saw fresh
allocations on 2 different regions within a short period of time, its probably
best to avoid taking *both* hints into consideration and instead act on only
one of them. That's why this patch doesn't bother re-triggering compaction
at a later time, if the worker was already busy working on another region.

> I recommend combining this patch and the previous patch to make the
> interface more clear, or make functions that explicitly handle the
> interface for accessing mem_power_ctrl.
>

Sure, I'll think more on how to make it clearer.

Thanks a lot!

Regards,
Srivatsa S. Bhat

2013-04-17 16:48:21

by srinivas pandruvada

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/15][Sorted-buddy] mm: Memory Power Management

On 04/09/2013 02:45 PM, Srivatsa S. Bhat wrote:
> [I know, this cover letter is a little too long, but I wanted to clearly
> explain the overall goals and the high-level design of this patchset in
> detail. I hope this helps more than it annoys, and makes it easier for
> reviewers to relate to the background and the goals of this patchset.]
>
>
> Overview of Memory Power Management and its implications to the Linux MM
> ========================================================================
>
> Today, we are increasingly seeing computer systems sporting larger and larger
> amounts of RAM, in order to meet workload demands. However, memory consumes a
> significant amount of power, potentially upto more than a third of total system
> power on server systems. So naturally, memory becomes the next big target for
> power management - on embedded systems and smartphones, and all the way upto
> large server systems.
>
> Power-management capabilities in modern memory hardware:
> -------------------------------------------------------
>
> Modern memory hardware such as DDR3 support a number of power management
> capabilities - for instance, the memory controller can automatically put
> memory DIMMs/banks into content-preserving low-power states, if it detects
> that that *entire* memory DIMM/bank has not been referenced for a threshold
> amount of time, thus reducing the energy consumption of the memory hardware.
> We term these power-manageable chunks of memory as "Memory Regions".
>
> Exporting memory region info of the platform to the OS:
> ------------------------------------------------------
>
> The OS needs to know about the granularity at which the hardware can perform
> automatic power-management of the memory banks (i.e., the address boundaries
> of the memory regions). On ARM platforms, the bootloader can be modified to
> pass on this info to the kernel via the device-tree. On x86 platforms, the
> new ACPI 5.0 spec has added support for exporting the power-management
> capabilities of the memory hardware to the OS in a standard way[5].
>
> Estimate of power-savings from power-aware Linux MM:
> ---------------------------------------------------
>
> Once the firmware/bootloader exports the required info to the OS, it is upto
> the kernel's MM subsystem to make the best use of these capabilities and manage
> memory power-efficiently. It had been demonstrated on a Samsung Exynos board
> (with 2 GB RAM) that upto 6 percent of total system power can be saved by
> making the Linux kernel MM subsystem power-aware[4]. (More savings can be
> expected on systems with larger amounts of memory, and perhaps improved further
> using better MM designs).
>
>
> Role of the Linux MM in enhancing memory power savings:
> ------------------------------------------------------
>
> Often, this simply translates to having the Linux MM understand the granularity
> at which RAM modules can be power-managed, and keeping the memory allocations
> and references consolidated to a minimum no. of these power-manageable
> "memory regions". It is of particular interest to note that most of these memory
> hardware have the intelligence to automatically save power, by putting memory
> banks into (content-preserving) low-power states when not referenced for a
> threshold amount of time. All that the kernel has to do, is avoid wrecking
> the power-savings logic by scattering its allocations and references all over
> the system memory. (The kernel/MM doesn't have to perform the actual power-state
> transitions; its mostly done in the hardware automatically, and this is OK
> because these are *content-preserving* low-power states).
>
> So we can summarize the goals for the Linux MM as:
>
> o Consolidate memory allocations and/or references such that they are not
> spread across the entire memory address space. Basically the area of memory
> that is not being referenced can reside in low power state.
>
> o Support light-weight targetted memory compaction/reclaim, to evacuate
> lightly-filled memory regions. This helps avoid memory references to
> those regions, thereby allowing them to reside in low power states.
>
>
> Assumptions and goals of this patchset:
> --------------------------------------
>
> In this patchset, we don't handle the part of getting the region boundary info
> from the firmware/bootloader and populating it in the kernel data-structures.
> The aim of this patchset is to propose and brainstorm on a power-aware design
> of the Linux MM which can *use* the region boundary info to influence the MM
> at various places such as page allocation, reclamation/compaction etc, thereby
> contributing to memory power savings. (This patchset is very much an RFC at
> the moment and is not intended for mainline-inclusion yet).
>
> So, in this patchset, we assume a simple model in which each 512MB chunk of
> memory can be independently power-managed, and hard-code this into the patchset.
> As mentioned, the focus of this patchset is not so much on how we get this info
> from the firmware or how exactly we handle a variety of configurations, but
> rather on discussing the power-savings/performance impact of the MM algorithms
> that *act* upon this info in order to save memory power.
>
> That said, its not very far-fetched to try this out with actual region
> boundary info to get the actual power savings numbers. For example, on ARM
> platforms, we can make the bootloader export this info to the OS via device-tree
> and then run this patchset. (This was the method used to get the power-numbers
> in [4]). But even without doing that, we can very well evaluate the
> effectiveness of this patchset in contributing to power-savings, by analyzing
> the free page statistics per-memory-region; and we can observe the performance
> impact by running benchmarks - this is the approach currently used to evaluate
> this patchset.
>
>
> Brief overview of the design/approach used in this patchset:
> -----------------------------------------------------------
>
> This patchset implements the 'Sorted-buddy design' for Memory Power Management,
> in which the buddy (page) allocator is altered to keep the buddy freelists
> region-sorted, which helps influence the page allocation paths to keep the
> allocations consolidated to a minimum no. of memory regions. This patchset also
> includes a light-weight targetted compaction/reclaim algorithm that works
> hand-in-hand with the page-allocator, to evacuate lightly-filled memory regions
> when memory gets fragmented, in order to further enhance memory power savings.
>
> This Sorted-buddy design was developed based on some of the suggestions
> received[1] during the review of the earlier patchset on Memory Power
> Management written by Ankita Garg ('Hierarchy design')[2].
> One of the key aspects of this Sorted-buddy design is that it avoids the
> zone-fragmentation problem that was present in the earlier design[3].
>
>
>
> Design of sorted buddy allocator and light-weight targetted region compaction:
> =============================================================================
>
> Sorted buddy allocator:
> ----------------------
>
> In this design, the memory region boundaries are captured in a data structure
> parallel to zones, instead of fitting regions between nodes and zones in the
> hierarchy. Further, the buddy allocator is altered, such that we maintain the
> zones' freelists in region-sorted-order and thus do page allocation in the
> order of increasing memory regions. (The freelists need not be fully
> address-sorted, they just need to be region-sorted).
>
> The idea is to do page allocation in increasing order of memory regions
> (within a zone) and perform region-compaction in the reverse order, as
> illustrated below.
>
> ---------------------------- Increasing region number---------------------->
>
> Direction of allocation---> <---Direction of region-compaction
>
>
> The sorting logic (to maintain freelist pageblocks in region-sorted-order)
> lies in the page-free path and hence the critical page-allocation paths remain
> fast. Also, the sorting logic is optimized to be O(log n).
>
> Advantages of this design:
> --------------------------
> 1. No zone-fragmentation (IOW, we don't create more zones than necessary) and
> hence we avoid its associated problems (like too many zones, extra kswapd
> activity, question of choosing watermarks etc).
> [This is an advantage over the 'Hierarchy' design]
>
> 2. Performance overhead is expected to be low: Since we retain the simplicity
> of the algorithm in the page allocation path, page allocation can
> potentially remain as fast as it would be without memory regions. The
> overhead is pushed to the page-freeing paths which are not that critical.
>
>
> Light-weight targetted region compaction:
> ----------------------------------------
>
> Over time, due to multiple alloc()s and free()s in random order, memory gets
> fragmented, which means the memory allocations will no longer be consolidated
> to a minimum no. of memory regions. In such cases we need a light-weight
> mechanism to opportunistically compact memory to evacuate lightly-filled
> memory regions, thereby enhancing the power-savings.
>
> Noting that CMA (Contiguous Memory Allocator) does targetted compaction to
> achieve its goals, this patchset generalizes the targetted compaction code
> and reuses it to evacuate memory regions. The region evacuation is triggered
> by the page allocator : when it notices the first page allocation in a new
> region, it sets up a worker function to perform compaction and evacuate that
> region in the future, if possible. There are handshakes between the alloc
> and the free paths in the page allocator to help do this smartly, which are
> explained in detail in the patches.
>
>
> This patchset has been hosted in the below git tree. It applies cleanly on
> v3.9-rc5.
>
> git://github.com/srivatsabhat/linux.git mem-power-mgmt-v2
>
>
> Changes in this v2:
> ==================
>
> * Fixed a bug in the NUMA case.
> * Added a new optimized O(log n) sorting algorithm to speed up region-sorting
> of the buddy freelists (patch 9). The efficiency of this new algorithm and
> its design allows us to support large amounts of RAM quite easily.
> * Added light-weight targetted compaction/reclaim support for memory power
> management (patches 10-14).
> * Revamped the cover-letter to better explain the idea behind memory power
> management and this patchset.
>
>
> Experimental Results:
> ====================
>
> Test setup:
> ----------
>
> x86 dual-socket quad core HT-enabled machine booted with mem=8G
> Memory region size = 512 MB
>
> Functional testing:
> ------------------
>
> Ran pagetest, a simple C program that allocates and touches a required number
> of pages.
>
> Below is the statistics from the regions within ZONE_NORMAL, at various sizes
> of allocations from pagetest.
>
>
> Present pages | Free pages at various allocation sizes |
> | start | 512 MB | 1024 MB | 2048 MB |
> Region 0 1 | 0 | 0 | 0 | 0 |
> Region 1 131072 | 41537 | 13858 | 13790 | 13334 |
> Region 2 131072 | 131072 | 26839 | 82 | 122 |
> Region 3 131072 | 131072 | 131072 | 26624 | 0 |
> Region 4 131072 | 131072 | 131072 | 131072 | 0 |
> Region 5 131072 | 131072 | 131072 | 131072 | 26624 |
> Region 6 131072 | 131072 | 131072 | 131072 | 131072 |
> Region 7 131072 | 131072 | 131072 | 131072 | 131072 |
> Region 8 131071 | 72704 | 72704 | 72704 | 72704 |
>
> This shows that page allocation occurs in the order of increasing region
> numbers, as intended in this design.
>
> Performance impact:
> -------------------
>
> Kernbench results didn't show any noticeable performance degradation with
> this patchset as compared to vanilla 3.9-rc5.
>
>
> Todos and ideas for enhancing the design further:
> ================================================
>
> 1. Add support for making this work with sparsemem, memcg etc.
>
> 2. Mel Gorman pointed out that the regular compaction algorithm would work
> against the sorted-buddy allocation strategy, since it creates free space
> at lower pfns. For now, I have not handled this because regular compaction
> triggers only when the memory pressure is very high, and hence memory
> power management is pointless in those situations. Besides, it is
> immaterial whether memory allocations are consolidated towards lower or
> higher pfns, because it saves power either way, and hence the regular
> compaction algorithm doesn't actually work against memory power management.
>
> 3. Add more optimizations to the targeted region compaction algorithm in order
> to enhance its benefits and reduce the overhead, such as:
> a. Migrate only active pages during region evacuation, because, strictly
> speaking we only want to avoid _references_ to the region. So inactive
> pages can be kept around, thus reducing the page-migration overhead.
> b. Reduce the search-space for region evacuation, by having the
> page-allocator note down the highest allocated pfn within that region.
>
> 4. Have stronger influence over how freepages from different migratetypes
> are exchanged, so that unmovable and non-reclaimable allocations are
> contained within least no. of memory regions.
>
> 5. Influence the refill of per-cpu pagesets and perhaps even heavily used
> slab caches, such that they all get their memory from least no. of memory
> regions. This is to avoid frequent fragmentation of memory regions.
>
> 6. Don't perform region evacuation at situations of high memory utilization.
> Also, never use freepages from MIGRATE_RESERVE for the purpose of
> region-evacuation.
>
> 7. Add more tracing/debug info to enable better evaluation of the
> effectiveness and benefits of this patchset over vanilla kernel.
>
> 8. Add a higher level policy to control the aggressiveness of memory power
> management.
>
>
> References:
> ----------
>
> [1]. Review comments suggesting modifying the buddy allocator to be aware of
> memory regions:
> http://article.gmane.org/gmane.linux.power-management.general/24862
> http://article.gmane.org/gmane.linux.power-management.general/25061
> http://article.gmane.org/gmane.linux.kernel.mm/64689
>
> [2]. Patch series that implemented the node-region-zone hierarchy design:
> http://lwn.net/Articles/445045/
> http://thread.gmane.org/gmane.linux.kernel.mm/63840
>
> Summary of the discussion on that patchset:
> http://article.gmane.org/gmane.linux.power-management.general/25061
>
> Forward-port of that patchset to 3.7-rc3 (minimal x86 config)
> http://thread.gmane.org/gmane.linux.kernel.mm/89202
>
> [3]. Disadvantages of having memory regions in the hierarchy between nodes and
> zones:
> http://article.gmane.org/gmane.linux.kernel.mm/63849
>
> [4]. Estimate of potential power savings on Samsung exynos board
> http://article.gmane.org/gmane.linux.kernel.mm/65935
>
> [5]. ACPI 5.0 and MPST support
> http://www.acpi.info/spec.htm
> Section 5.2.21 Memory Power State Table (MPST)
>
> [6]. v1 of Sorted-buddy memory power management patchset:
> http://thread.gmane.org/gmane.linux.power-management.general/28498
>
>
> Srivatsa S. Bhat (15):
> mm: Introduce memory regions data-structure to capture region boundaries within nodes
> mm: Initialize node memory regions during boot
> mm: Introduce and initialize zone memory regions
> mm: Add helpers to retrieve node region and zone region for a given page
> mm: Add data-structures to describe memory regions within the zones' freelists
> mm: Demarcate and maintain pageblocks in region-order in the zones' freelists
> mm: Add an optimized version of del_from_freelist to keep page allocation fast
> bitops: Document the difference in indexing between fls() and __fls()
> mm: A new optimized O(log n) sorting algo to speed up buddy-sorting
> mm: Add support to accurately track per-memory-region allocation
> mm: Restructure the compaction part of CMA for wider use
> mm: Add infrastructure to evacuate memory regions using compaction
> mm: Implement the worker function for memory region compaction
> mm: Add alloc-free handshake to trigger memory region compaction
> mm: Print memory region statistics to understand the buddy allocator behavior
>
>
> arch/x86/include/asm/bitops.h | 4
> include/asm-generic/bitops/__fls.h | 5
> include/linux/compaction.h | 7
> include/linux/gfp.h | 2
> include/linux/migrate.h | 3
> include/linux/mm.h | 62 ++++
> include/linux/mmzone.h | 78 ++++-
> include/trace/events/migrate.h | 3
> mm/compaction.c | 149 +++++++++
> mm/internal.h | 40 ++
> mm/page_alloc.c | 617 ++++++++++++++++++++++++++++++++----
> mm/vmstat.c | 36 ++
> 12 files changed, 935 insertions(+), 71 deletions(-)
>
>
> Regards,
> Srivatsa S. Bhat
> IBM Linux Technology Center
>
>
One thing you need to prevent is boot time allocation. You have to make
sure that frequently accessed per node data stored at the end of memory
will keep all ranks of memory active.

Thanks,
Srinivas


2013-04-18 09:57:29

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/15][Sorted-buddy] mm: Memory Power Management

On 04/17/2013 10:23 PM, Srinivas Pandruvada wrote:
> On 04/09/2013 02:45 PM, Srivatsa S. Bhat wrote:
>> [I know, this cover letter is a little too long, but I wanted to clearly
>> explain the overall goals and the high-level design of this patchset in
>> detail. I hope this helps more than it annoys, and makes it easier for
>> reviewers to relate to the background and the goals of this patchset.]
>>
>>
>> Overview of Memory Power Management and its implications to the Linux MM
>> ========================================================================
>>
[...]
>>
> One thing you need to prevent is boot time allocation. You have to make
> sure that frequently accessed per node data stored at the end of memory
> will keep all ranks of memory active.
>

I think you meant to say "... stored at the end of memory will NOT keep all
ranks of memory active".

Yep, that's a good point! I'll think about how to achieve that. Thanks!

Regards,
Srivatsa S. Bhat

2013-04-18 15:09:00

by srinivas pandruvada

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/15][Sorted-buddy] mm: Memory Power Management

On 04/18/2013 02:54 AM, Srivatsa S. Bhat wrote:
> On 04/17/2013 10:23 PM, Srinivas Pandruvada wrote:
>> On 04/09/2013 02:45 PM, Srivatsa S. Bhat wrote:
>>> [I know, this cover letter is a little too long, but I wanted to clearly
>>> explain the overall goals and the high-level design of this patchset in
>>> detail. I hope this helps more than it annoys, and makes it easier for
>>> reviewers to relate to the background and the goals of this patchset.]
>>>
>>>
>>> Overview of Memory Power Management and its implications to the Linux MM
>>> ========================================================================
>>>
> [...]
>> One thing you need to prevent is boot time allocation. You have to make
>> sure that frequently accessed per node data stored at the end of memory
>> will keep all ranks of memory active.
>>
When I was experimenting I did something like this.
/////////////////////////////////


+/*
+ * Experimental MPST implementation
+ * Copyright (c) 2012, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public
License for
+ * more details.
+ *
+ * You should have received a copy of the GNU General Public License
along with
+ * this program; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
+ *
+ */
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/init.h>
+#include <linux/kthread.h>
+#include <linux/acpi.h>
+#include <linux/export.h>
+#include <linux/bootmem.h>
+#include <linux/delay.h>
+#include <linux/pfn.h>
+#include <linux/suspend.h>
+#include <linux/acpi.h>
+#include <linux/memblock.h>
+#include <linux/mm.h>
+#include <linux/mmzone.h>
+#include <linux/migrate.h>
+#include <linux/mm_inline.h>
+#include <linux/page-isolation.h>
+#include <linux/vmalloc.h>
+#include <linux/compaction.h>
+#include "internal.h"
+
+#define phys_to_pfn(p) ((p) >> PAGE_SHIFT)
+#define pfn_to_phys(p) ((p) << PAGE_SHIFT)
+#define MAX_MPST_ZONES 16
+/* At least 4G of non-MPST memory. */
+#define MINIMAL_NON_MPST_MEMORY_PFN (0x100000000 >> PAGE_SHIFT)
+
+struct mpst_mem_zone {
+ phys_addr_t start_addr;
+ phys_addr_t end_addr;
+};
+
+static struct mpst_mem_zone mpst_zones[MAX_MPST_ZONES];
+static int mpst_zone_cnt;
+static unsigned long mpst_start_pfn;
+static unsigned long mpst_end_pfn;
+static bool mpst_enabled;
+
+/* Minimal parsing for just getting node ranges */
+static int __init acpi_parse_mpst_table(struct acpi_table_header *table)
+{
+ struct acpi_table_mpst *mpst;
+ struct acpi_mpst_power_node *node;
+ u16 node_count;
+ int i;
+
+ mpst = (struct acpi_table_mpst *)table;
+ if (!mpst) {
+ pr_warn("Unable to map MPST\n");
+ return -ENODEV;
+ }
+ node_count = mpst->power_node_count;
+ node = (struct acpi_mpst_power_node *)((u8 *)mpst + sizeof(*mpst));
+
+ for (i = mpst_zone_cnt; (i < node_count) && (i < MAX_MPST_ZONES);
+ ++i) {
+ if ((node->flags & ACPI_MPST_ENABLED) &&
+ (node->flags & ACPI_MPST_POWER_MANAGED)) {
+ mpst_zones[mpst_zone_cnt].start_addr =
+ node->range_address;
+ mpst_zones[mpst_zone_cnt].end_addr =
+ node->range_address + node->range_length;
+ ++mpst_zone_cnt;
+ }
+ ++node;
+ }
+
+ return 0;
+}
+
+static unsigned long local_ahex_to_long(const char *name)
+{
+ unsigned long val = 0;
+
+ for (;; name++) {
+ switch (*name) {
+ case '0' ... '9':
+ val = 16*val+(*name-'0');
+ break;
+ case 'A' ... 'F':
+ val = 16*val+(*name-'A'+10);
+ break;
+ case 'a' ... 'f':
+ val = 16*val+(*name-'a'+10);
+ break;
+ default:
+ return val;
+ }
+ }
+
+ return val;
+}
+
+/* Specify MPST range by command line for test till ACPI - MPST is available */
+static int __init parse_mpst_opt(char *str)
+{
+ char *ptr;
+ phys_addr_t start_at = 0, end_at = 0;
+ u64 mem_size = 0;
+
+ if (!str)
+ return -EINVAL;
+ ptr = str;
+ while (1) {
+ if (*str == '-') {
+ *str = '\0';
+ start_at = local_ahex_to_long(ptr);
+ ++str;
+ ptr = str;
+ }
+ if (start_at && (*str == '\0' || *str == ',' || *str == ' ')) {
+ *str = '\0';
+ end_at = local_ahex_to_long(ptr);
+ mem_size = end_at-start_at;
+ ++str;
+ ptr = str;
+ pr_info("-mpst[%#018Lx-%#018Lx size: %#018Lx]\n",
+ start_at, end_at, mem_size);
+ if (IS_ALIGNED(phys_to_pfn(start_at),
+ pageblock_nr_pages) &&
+ IS_ALIGNED(phys_to_pfn(end_at),
+ pageblock_nr_pages)) {
+ mpst_zones[mpst_zone_cnt].start_addr =
+ start_at;
+ mpst_zones[mpst_zone_cnt].end_addr =
+ end_at;
+ } else {
+ pr_err("mpst invalid range\n");
+ return -EINVAL;
+ }
+ mpst_zone_cnt++;
+ start_at = mem_size = end_at = 0;
+ }
+ if (*str == '\0')
+ break;
+ else
+ ++str;
+ }
+
+ return 0;
+}
+early_param("mpst_range", parse_mpst_opt);
+
+/* Specify MPST range by command line for test till ACPI - MPST is available */
+static int __init parse_mpst_enable_opt(char *str)
+{
+ long value;
+ if (kstrtol(str, 10, &value))
+ return -EINVAL;
+ mpst_enabled = value ? true : false;
+
+ return 0;
+}
+early_param("mpst_enable", parse_mpst_enable_opt);
+
+/* Set the minimum and maximum PFN */
+static void mpst_set_min_max_pfn(void)
+{
+ int i;
+
+ if (!mpst_zone_cnt)
+ return;
+
+ mpst_start_pfn = phys_to_pfn(mpst_zones[0].start_addr);
+ mpst_end_pfn = phys_to_pfn(mpst_zones[0].end_addr);
+
+ for (i = 1; i < mpst_zone_cnt; ++i) {
+ if (mpst_start_pfn > phys_to_pfn(mpst_zones[i].start_addr))
+ mpst_start_pfn = phys_to_pfn(mpst_zones[i].start_addr);
+ if (mpst_end_pfn < phys_to_pfn(mpst_zones[i].end_addr))
+ mpst_end_pfn = phys_to_pfn(mpst_zones[i].end_addr);
+ }
+}
+
+/* Change migrate type for the MPST ranges */
+int mpst_set_migrate_type(void)
+{
+ int i;
+ struct page *page;
+ unsigned long start_pfn, end_pfn;
+
+ if (!mpst_start_pfn || !mpst_end_pfn)
+ return -EINVAL;
+ if (!IS_ALIGNED(mpst_start_pfn, pageblock_nr_pages))
+ return -EINVAL;
+ if (!IS_ALIGNED(mpst_end_pfn, pageblock_nr_pages))
+ return -EINVAL;
+ memblock_free(pfn_to_phys(mpst_start_pfn),
+ pfn_to_phys(mpst_end_pfn) - pfn_to_phys(mpst_start_pfn));
+ for (i = 0; i < mpst_zone_cnt; ++i) {
+ start_pfn = phys_to_pfn(mpst_zones[i].start_addr);
+ end_pfn = phys_to_pfn(mpst_zones[i].end_addr);
+ for (; start_pfn < end_pfn; ++start_pfn) {
+ page = pfn_to_page(start_pfn);
+ if (page)
+ set_pageblock_migratetype(page,
+ MIGRATE_LP_MEMORY);
+ }
+ }
+
+ return 0;
+}
+
+/* Parse ACPI table and find start and end of MPST zone.
+Assuming zones are contiguous */
+int mpst_init(void)
+{
+ if (!mpst_enabled) {
+ pr_info("mpst not enabled in command line\n");
+ return 0;
+ }
+
+ acpi_table_parse(ACPI_SIG_MPST, acpi_parse_mpst_table);
+ mpst_set_min_max_pfn();
+ if (mpst_zone_cnt) {
+
+ if (mpst_start_pfn < MINIMAL_NON_MPST_MEMORY_PFN) {
+ pr_err("Not enough memory: Ignore MPST\n");
+ mpst_start_pfn = mpst_end_pfn = 0;
+ return -EINVAL;
+ }
+ memblock_reserve(pfn_to_phys(mpst_start_pfn),
+ pfn_to_phys(mpst_end_pfn) -
+ pfn_to_phys(mpst_start_pfn));
+ pr_info("mpst_init memblock limit set to pfn %lu
0x%#018lx\n",
+ mpst_start_pfn, pfn_to_phys(mpst_start_pfn));
+ }
+
+ return 0;
+}





/////////////////////////////
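
Going by the parsing code above, the test range is passed as raw hex digits
without a 0x prefix (local_ahex_to_long() stops at the first non-hex
character), so a 1 GB power-managed range starting at the 4 GB boundary would
presumably be specified on the kernel command line as something like:

    mpst_enable=1 mpst_range=100000000-140000000

(The range has to be pageblock-aligned and start at or above 4 GB, per the
checks in parse_mpst_opt() and mpst_init(); multiple ranges can apparently be
chained with ',' or ' '.)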
> I think you meant to say "... stored at the end of memory will NOT keep all
> ranks of memory active".
>
> Yep, that's a good point! I'll think about how to achieve that. Thanks!
>
> Regards,
> Srivatsa S. Bhat
>

2013-04-18 17:10:18

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/15][Sorted-buddy] mm: Memory Power Management

On 04/09/2013 02:45 PM, Srivatsa S. Bhat wrote:
> 2. Performance overhead is expected to be low: Since we retain the simplicity
> of the algorithm in the page allocation path, page allocation can
> potentially remain as fast as it would be without memory regions. The
> overhead is pushed to the page-freeing paths which are not that critical.

Numbers, please. The problem with pushing the overhead to frees is that
they, believe it or not, actually average out to the same as the number
of allocs. Think kernel compile, or a large dd. Both of those churn
through a lot of memory, and both do an awful lot of allocs _and_ frees.
We need to know both the overhead on a system that does *no* memory
power management, and the overhead on a system which is carved and
actually using this code.

> Kernbench results didn't show any noticeable performance degradation with
> this patchset as compared to vanilla 3.9-rc5.

Surely this code isn't magical and there's overhead _somewhere_, and
such overhead can be quantified _somehow_. Have you made an effort to
find those cases, even with microbenchmarks?

I still also want to see some hard numbers on:
> However, memory consumes a significant amount of power, potentially upto
> more than a third of total system power on server systems.
and
> It had been demonstrated on a Samsung Exynos board
> (with 2 GB RAM) that upto 6 percent of total system power can be saved by
> making the Linux kernel MM subsystem power-aware[4].

That was *NOT* with this code, and it's nearing being two years old.
What can *this* *patch* do?

I think there are three scenarios to look at. Let's say you have an 8GB
system with 1GB regions:
1. Normal unpatched kernel, booted with mem=1G...8G (in 1GB increments
perhaps) running some benchmark which sees performance scale with
the amount of memory present in the system.
2. Kernel patched with this set, running the same test, but with single
memory regions.
3. Kernel patched with this set. But, instead of using mem=, you run
it trying to evacuate equivalent amount of memory to the amounts you
removed using mem=.

That will tell us both what the overhead is, and how effective it is.
I'd much rather see actual numbers and a description of the test than
some hand waving that it "didn't show any noticeable performance
degradation".

The amount of code here isn't huge. But, it sucks that it's bloating
the already quite plump page_alloc.c.

2013-04-19 05:35:13

by Simon Jeons

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/15][Sorted-buddy] mm: Memory Power Management

Hi Srivatsa,
On 04/10/2013 05:45 AM, Srivatsa S. Bhat wrote:
> [I know, this cover letter is a little too long, but I wanted to clearly
> explain the overall goals and the high-level design of this patchset in
> detail. I hope this helps more than it annoys, and makes it easier for
> reviewers to relate to the background and the goals of this patchset.]
>
>
> Overview of Memory Power Management and its implications to the Linux MM
> ========================================================================
>
> Today, we are increasingly seeing computer systems sporting larger and larger
> amounts of RAM, in order to meet workload demands. However, memory consumes a
> significant amount of power, potentially upto more than a third of total system
> power on server systems. So naturally, memory becomes the next big target for
> power management - on embedded systems and smartphones, and all the way upto
> large server systems.
>
> Power-management capabilities in modern memory hardware:
> -------------------------------------------------------
>
> Modern memory hardware such as DDR3 support a number of power management
> capabilities - for instance, the memory controller can automatically put

The memory controller is integrated in the CPU on NUMA systems and mounted on
PCI-E on UMA systems, correct? How can the memory controller know which memory
DIMMs/banks it will control?

> memory DIMMs/banks into content-preserving low-power states, if it detects
> that that *entire* memory DIMM/bank has not been referenced for a threshold
> amount of time, thus reducing the energy consumption of the memory hardware.
> We term these power-manageable chunks of memory as "Memory Regions".
>
> Exporting memory region info of the platform to the OS:
> ------------------------------------------------------
>
> The OS needs to know about the granularity at which the hardware can perform
> automatic power-management of the memory banks (i.e., the address boundaries
> of the memory regions). On ARM platforms, the bootloader can be modified to
> pass on this info to the kernel via the device-tree. On x86 platforms, the
> new ACPI 5.0 spec has added support for exporting the power-management
> capabilities of the memory hardware to the OS in a standard way[5].
>
> Estimate of power-savings from power-aware Linux MM:
> ---------------------------------------------------
>
> Once the firmware/bootloader exports the required info to the OS, it is upto
> the kernel's MM subsystem to make the best use of these capabilities and manage
> memory power-efficiently. It had been demonstrated on a Samsung Exynos board
> (with 2 GB RAM) that upto 6 percent of total system power can be saved by
> making the Linux kernel MM subsystem power-aware[4]. (More savings can be
> expected on systems with larger amounts of memory, and perhaps improved further
> using better MM designs).

How do you know that 6 percent of total system power can be saved by making
the Linux kernel MM subsystem power-aware?

>
>
> Role of the Linux MM in enhancing memory power savings:
> ------------------------------------------------------
>
> Often, this simply translates to having the Linux MM understand the granularity
> at which RAM modules can be power-managed, and keeping the memory allocations
> and references consolidated to a minimum no. of these power-manageable
> "memory regions". It is of particular interest to note that most of these memory
> hardware have the intelligence to automatically save power, by putting memory
> banks into (content-preserving) low-power states when not referenced for a

How does the hardware know that a DIMM/bank is not being referenced?

> threshold amount of time. All that the kernel has to do, is avoid wrecking
> the power-savings logic by scattering its allocations and references all over
> the system memory. (The kernel/MM doesn't have to perform the actual power-state
> transitions; its mostly done in the hardware automatically, and this is OK
> because these are *content-preserving* low-power states).
>
> So we can summarize the goals for the Linux MM as:
>
> o Consolidate memory allocations and/or references such that they are not
> spread across the entire memory address space. Basically the area of memory
> that is not being referenced can reside in low power state.
>
> o Support light-weight targetted memory compaction/reclaim, to evacuate
> lightly-filled memory regions. This helps avoid memory references to
> those regions, thereby allowing them to reside in low power states.
>
>
> Assumptions and goals of this patchset:
> --------------------------------------
>
> In this patchset, we don't handle the part of getting the region boundary info
> from the firmware/bootloader and populating it in the kernel data-structures.
> The aim of this patchset is to propose and brainstorm on a power-aware design
> of the Linux MM which can *use* the region boundary info to influence the MM
> at various places such as page allocation, reclamation/compaction etc, thereby
> contributing to memory power savings. (This patchset is very much an RFC at
> the moment and is not intended for mainline-inclusion yet).
>
> So, in this patchset, we assume a simple model in which each 512MB chunk of
> memory can be independently power-managed, and hard-code this into the patchset.
> As mentioned, the focus of this patchset is not so much on how we get this info
> from the firmware or how exactly we handle a variety of configurations, but
> rather on discussing the power-savings/performance impact of the MM algorithms
> that *act* upon this info in order to save memory power.
>
> That said, its not very far-fetched to try this out with actual region
> boundary info to get the actual power savings numbers. For example, on ARM
> platforms, we can make the bootloader export this info to the OS via device-tree
> and then run this patchset. (This was the method used to get the power-numbers
> in [4]). But even without doing that, we can very well evaluate the
> effectiveness of this patchset in contributing to power-savings, by analyzing
> the free page statistics per-memory-region; and we can observe the performance
> impact by running benchmarks - this is the approach currently used to evaluate
> this patchset.
>
>
> Brief overview of the design/approach used in this patchset:
> -----------------------------------------------------------
>
> This patchset implements the 'Sorted-buddy design' for Memory Power Management,
> in which the buddy (page) allocator is altered to keep the buddy freelists
> region-sorted, which helps influence the page allocation paths to keep the

Will this impact the normal zone-based buddy freelists?

> allocations consolidated to a minimum no. of memory regions. This patchset also
> includes a light-weight targetted compaction/reclaim algorithm that works
> hand-in-hand with the page-allocator, to evacuate lightly-filled memory regions
> when memory gets fragmented, in order to further enhance memory power savings.
>
> This Sorted-buddy design was developed based on some of the suggestions
> received[1] during the review of the earlier patchset on Memory Power
> Management written by Ankita Garg ('Hierarchy design')[2].
> One of the key aspects of this Sorted-buddy design is that it avoids the
> zone-fragmentation problem that was present in the earlier design[3].
>
>
>
> Design of sorted buddy allocator and light-weight targetted region compaction:
> =============================================================================
>
> Sorted buddy allocator:
> ----------------------
>
> In this design, the memory region boundaries are captured in a data structure
> parallel to zones, instead of fitting regions between nodes and zones in the
> hierarchy. Further, the buddy allocator is altered, such that we maintain the
> zones' freelists in region-sorted-order and thus do page allocation in the
> order of increasing memory regions. (The freelists need not be fully
> address-sorted, they just need to be region-sorted).
>
> The idea is to do page allocation in increasing order of memory regions
> (within a zone) and perform region-compaction in the reverse order, as
> illustrated below.
>
> ---------------------------- Increasing region number---------------------->
>
> Direction of allocation---> <---Direction of region-compaction
>
>
> [...]

2013-04-19 06:53:56

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/15][Sorted-buddy] mm: Memory Power Management

On 04/18/2013 10:40 PM, Dave Hansen wrote:
> On 04/09/2013 02:45 PM, Srivatsa S. Bhat wrote:
>> 2. Performance overhead is expected to be low: Since we retain the simplicity
>> of the algorithm in the page allocation path, page allocation can
>> potentially remain as fast as it would be without memory regions. The
>> overhead is pushed to the page-freeing paths which are not that critical.
>
> Numbers, please. The problem with pushing the overhead to frees is that
> they, believe it or not, actually average out to the same as the number
> of allocs. Think kernel compile, or a large dd. Both of those churn
> through a lot of memory, and both do an awful lot of allocs _and_ frees.
> We need to know both the overhead on a system that does *no* memory
> power management, and the overhead on a system which is carved and
> actually using this code.
>
>> Kernbench results didn't show any noticeable performance degradation with
>> this patchset as compared to vanilla 3.9-rc5.
>
> Surely this code isn't magical and there's overhead _somewhere_, and
> such overhead can be quantified _somehow_. Have you made an effort to
> find those cases, even with microbenchmarks?
>

Sorry for not posting the numbers explicitly. It really shows no difference in
kernbench, see below.

[For the following run, I reverted patch 14, since it seems to be intermittently
causing kernel-instability at high loads. So the numbers below show the effect
of only the sorted-buddy part of the patchset, and not the compaction part.]

Kernbench was run on a 2 socket 8 core machine (HT disabled) with 128 GB RAM,
with allyesconfig on 3.9-rc5 kernel source.

Vanilla 3.9-rc5:
---------------
Fri Apr 19 08:30:12 IST 2013
3.9.0-rc5
Average Optimal load -j 16 Run (std deviation):
Elapsed Time 574.66 (2.31846)
User Time 3919.12 (3.71256)
System Time 339.296 (0.73694)
Percent CPU 740.4 (2.50998)
Context Switches 1.2183e+06 (4019.47)
Sleeps 1.61239e+06 (2657.33)

This patchset (minus patch 14): [Region size = 512 MB]
------------------------------
Fri Apr 19 09:42:38 IST 2013
3.9.0-rc5-mpmv2-nowq
Average Optimal load -j 16 Run (std deviation):
Elapsed Time 575.668 (2.01583)
User Time 3916.77 (3.48345)
System Time 337.406 (0.701591)
Percent CPU 738.4 (3.36155)
Context Switches 1.21683e+06 (6980.13)
Sleeps 1.61474e+06 (4906.23)


So, that shows almost no degradation due to the sorted-buddy logic (considering
the elapsed time).
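
(For reference, that is a difference of (575.668 - 574.66) / 574.66, i.e. about
0.18% on elapsed time, which is well within the ~2 second standard deviation
of either run.)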

> I still also want to see some hard numbers on:
>> However, memory consumes a significant amount of power, potentially upto
>> more than a third of total system power on server systems.
> and
>> It had been demonstrated on a Samsung Exynos board
>> (with 2 GB RAM) that upto 6 percent of total system power can be saved by
>> making the Linux kernel MM subsystem power-aware[4].
>
> That was *NOT* with this code, and it's nearing being two years old.
> What can *this* *patch* do?
>

Please let me clarify that. My intention behind quoting that 6% power-savings
number was _not_ to trick reviewers into believing that _this_ patchset provides
that much power-savings. It was only to show that this whole effort of doing memory
power management is not worthless, and we do have some valuable/tangible
benefits to gain from it. IOW, it was only meant to show an _estimate_ of how
much we can potentially save and thus justify the effort behind managing memory
power-efficiently.

As I had mentioned in the cover-letter, I don't have the exact power-savings
number for this particular patchset yet. I'll definitely work towards getting
those numbers soon.

> I think there are three scenarios to look at. Let's say you have an 8GB
> system with 1GB regions:
> 1. Normal unpatched kernel, booted with mem=1G...8G (in 1GB increments
> perhaps) running some benchmark which sees performance scale with
> the amount of memory present in the system.
> 2. Kernel patched with this set, running the same test, but with single
> memory regions.
> 3. Kernel patched with this set. But, instead of using mem=, you run
> it trying to evacuate equivalent amount of memory to the amounts you
> removed using mem=.
>
> That will tell us both what the overhead is, and how effective it is.
> I'd much rather see actual numbers and a description of the test than
> some hand waving that it "didn't show any noticeable performance
> degradation".
>

Sure, I'll perform more extensive tests to evaluate the performance overhead
more thoroughly. I'll first fix the compaction logic that seems to be buggy
and run benchmarks again.

Thanks a lot for all your invaluable inputs, Dave!

Regards,
Srivatsa S. Bhat

2013-04-19 07:15:20

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/15][Sorted-buddy] mm: Memory Power Management

On 04/19/2013 11:04 AM, Simon Jeons wrote:
> Hi Srivatsa,
> On 04/10/2013 05:45 AM, Srivatsa S. Bhat wrote:
>> [I know, this cover letter is a little too long, but I wanted to clearly
>> explain the overall goals and the high-level design of this patchset in
>> detail. I hope this helps more than it annoys, and makes it easier for
>> reviewers to relate to the background and the goals of this patchset.]
>>
>>
>> Overview of Memory Power Management and its implications to the Linux MM
>> ========================================================================
>>
>> Today, we are increasingly seeing computer systems sporting larger and
>> larger
>> amounts of RAM, in order to meet workload demands. However, memory
>> consumes a
>> significant amount of power, potentially upto more than a third of
>> total system
>> power on server systems. So naturally, memory becomes the next big
>> target for
>> power management - on embedded systems and smartphones, and all the
>> way upto
>> large server systems.
>>
>> Power-management capabilities in modern memory hardware:
>> -------------------------------------------------------
>>
>> Modern memory hardware such as DDR3 support a number of power management
>> capabilities - for instance, the memory controller can automatically put
>
> The memory controller is integrated in the CPU on NUMA systems and mounted on
> PCI-E on UMA systems, correct? How can the memory controller know which memory
> DIMMs/banks it will control?
>

Um? That sounds like a strange question to me. If the memory controller
itself doesn't know what it is controlling, then who will??

>> memory DIMMs/banks into content-preserving low-power states, if it
>> detects
>> that that *entire* memory DIMM/bank has not been referenced for a
>> threshold
>> amount of time, thus reducing the energy consumption of the memory
>> hardware.
>> We term these power-manageable chunks of memory as "Memory Regions".
>>
>> Exporting memory region info of the platform to the OS:
>> ------------------------------------------------------
>>
>> The OS needs to know about the granularity at which the hardware can
>> perform
>> automatic power-management of the memory banks (i.e., the address
>> boundaries
>> of the memory regions). On ARM platforms, the bootloader can be
>> modified to
>> pass on this info to the kernel via the device-tree. On x86 platforms,
>> the
>> new ACPI 5.0 spec has added support for exporting the power-management
>> capabilities of the memory hardware to the OS in a standard way[5].
>>
>> Estimate of power-savings from power-aware Linux MM:
>> ---------------------------------------------------
>>
>> Once the firmware/bootloader exports the required info to the OS, it
>> is upto
>> the kernel's MM subsystem to make the best use of these capabilities
>> and manage
>> memory power-efficiently. It had been demonstrated on a Samsung Exynos
>> board
>> (with 2 GB RAM) that upto 6 percent of total system power can be saved by
>> making the Linux kernel MM subsystem power-aware[4]. (More savings can be
>> expected on systems with larger amounts of memory, and perhaps
>> improved further
>> using better MM designs).
>
> How do you know that 6 percent of total system power can be saved by making
> the Linux kernel MM subsystem power-aware?
>

By looking at the link I gave, I suppose? :-) Let me put it here again:

[4]. Estimate of potential power savings on Samsung exynos board
http://article.gmane.org/gmane.linux.kernel.mm/65935

That was measured by running the earlier patchset which implemented the
"Hierarchy" design[2], with aggressive memory savings policies. But in any
case, it gives an idea of the amount of power savings we can get by doing
memory power management.

>>
>>
>> Role of the Linux MM in enhancing memory power savings:
>> ------------------------------------------------------
>>
>> Often, this simply translates to having the Linux MM understand the
>> granularity
>> at which RAM modules can be power-managed, and keeping the memory
>> allocations
>> and references consolidated to a minimum no. of these power-manageable
>> "memory regions". It is of particular interest to note that most of
>> these memory
>> hardware have the intelligence to automatically save power, by putting
>> memory
>> banks into (content-preserving) low-power states when not referenced
>> for a
>
> How does the hardware know that a DIMM/bank is not being referenced?
>

That's upto the hardware to figure out. It would be engraved in the
hardware logic. The kernel need not worry about it. The kernel has to
simply understand the PFN ranges corresponding to independently
power-manageable chunks of memory and try to keep the memory allocations
consolidated to a minimum no. of such memory regions. That's because we
never reference (access) unallocated memory. So keeping the allocations
consolidated also indirectly keeps the references consolidated.

But going further, as I had mentioned in my TODO list, we can be smarter
than this while doing compaction to evacuate memory regions - we can
choose to migrate only the active pages, and leave the inactive pages
alone. Because, the goal is to actually consolidate the *references* and
not necessarily the *allocations* themselves.

>> threshold amount of time. All that the kernel has to do, is avoid
>> wrecking
>> the power-savings logic by scattering its allocations and references
>> all over
>> the system memory. (The kernel/MM doesn't have to perform the actual
>> power-state
>> transitions; its mostly done in the hardware automatically, and this
>> is OK
>> because these are *content-preserving* low-power states).
>>
>>
>> Brief overview of the design/approach used in this patchset:
>> -----------------------------------------------------------
>>
>> This patchset implements the 'Sorted-buddy design' for Memory Power
>> Management,
>> in which the buddy (page) allocator is altered to keep the buddy
>> freelists
>> region-sorted, which helps influence the page allocation paths to keep
>> the
>
> Will this impact the normal zone-based buddy freelists?
>

The freelists continue to remain zone-based. No change in that. We are
not fragmenting them further to be per-memory-region. Instead, we simply
maintain pointers within the freelists to differentiate pageblocks belonging
to different memory regions.
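
To picture that, here is a purely illustrative sketch (these are not the names
or the exact layout used in the patches, which extend the existing freelist
structures in their own way): one physical list per order/migratetype as
before, plus a small array of per-region markers recording where each region's
pageblocks begin inside that single list.

#define MAX_NR_REGIONS 16

struct block_node {               /* stand-in for the kernel's struct list_head */
    struct block_node *next, *prev;
};

struct region_marker {
    struct block_node *first;     /* first free pageblock of this region in the
                                   * shared list, or NULL if it has none here  */
    unsigned long nr_free;        /* free pageblocks this region contributes   */
};

struct sorted_free_list {
    struct block_node head;                       /* the one freelist, as before  */
    struct region_marker region[MAX_NR_REGIONS];  /* demarcation pointers into it */
    int lowest_nonempty;                          /* where allocation starts      */
};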

>> allocations consolidated to a minimum no. of memory regions. This
>> patchset also
>> includes a light-weight targetted compaction/reclaim algorithm that works
>> hand-in-hand with the page-allocator, to evacuate lightly-filled
>> memory regions
>> when memory gets fragmented, in order to further enhance memory power
>> savings.
>>
>> This Sorted-buddy design was developed based on some of the suggestions
>> received[1] during the review of the earlier patchset on Memory Power
>> Management written by Ankita Garg ('Hierarchy design')[2].
>> One of the key aspects of this Sorted-buddy design is that it avoids the
>> zone-fragmentation problem that was present in the earlier design[3].
>>
>>

Regards,
Srivatsa S. Bhat

2013-04-19 08:15:13

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/15][Sorted-buddy] mm: Memory Power Management

On 04/18/2013 08:43 PM, Srinivas Pandruvada wrote:
> On 04/18/2013 02:54 AM, Srivatsa S. Bhat wrote:
>> On 04/17/2013 10:23 PM, Srinivas Pandruvada wrote:
>>> On 04/09/2013 02:45 PM, Srivatsa S. Bhat wrote:
>>>> [I know, this cover letter is a little too long, but I wanted to
>>>> clearly
>>>> explain the overall goals and the high-level design of this patchset in
>>>> detail. I hope this helps more than it annoys, and makes it easier for
>>>> reviewers to relate to the background and the goals of this patchset.]
>>>>
>>>>
>>>> Overview of Memory Power Management and its implications to the
>>>> Linux MM
>>>> ========================================================================
>>>>
>>>>
>> [...]
>>> One thing you need to prevent is boot time allocation. You have to make
>>> sure that frequently accessed per node data stored at the end of memory
>>> will keep all ranks of memory active.
>>>
> When I was experimenting I did something like this.

Thanks a lot for sharing this, Srinivas!

Regards,
Srivatsa S. Bhat

> [...]

2013-04-19 15:21:33

by srinivas pandruvada

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/15][Sorted-buddy] mm: Memory Power Management

On 04/19/2013 12:12 AM, Srivatsa S. Bhat wrote:
> On 04/19/2013 11:04 AM, Simon Jeons wrote:
>> Hi Srivatsa,
>> On 04/10/2013 05:45 AM, Srivatsa S. Bhat wrote:
>>> [I know, this cover letter is a little too long, but I wanted to clearly
>>> explain the overall goals and the high-level design of this patchset in
>>> detail. I hope this helps more than it annoys, and makes it easier for
>>> reviewers to relate to the background and the goals of this patchset.]
>>>
>>>
>>> Overview of Memory Power Management and its implications to the Linux MM
>>> ========================================================================
>>>
>>> Today, we are increasingly seeing computer systems sporting larger and
>>> larger
>>> amounts of RAM, in order to meet workload demands. However, memory
>>> consumes a
>>> significant amount of power, potentially upto more than a third of
>>> total system
>>> power on server systems. So naturally, memory becomes the next big
>>> target for
>>> power management - on embedded systems and smartphones, and all the
>>> way upto
>>> large server systems.
>>>
>>> Power-management capabilities in modern memory hardware:
>>> -------------------------------------------------------
>>>
>>> Modern memory hardware such as DDR3 support a number of power management
>>> capabilities - for instance, the memory controller can automatically put
>> The memory controller is integrated in the CPU on NUMA systems and mounted on
>> PCI-E on UMA systems, correct? How can the memory controller know which memory
>> DIMMs/banks it will control?
>>
> Um? That sounds like a strange question to me. If the memory controller
> itself doesn't know what it is controlling, then who will??
<Modern memory controllers are smart enough to put a rank into a low-power,
content-preserving state if you don't touch it. So if you don't access a
rank, it will go to a low-power state.>
>>> memory DIMMs/banks into content-preserving low-power states, if it
>>> detects
>>> that that *entire* memory DIMM/bank has not been referenced for a
>>> threshold
>>> amount of time, thus reducing the energy consumption of the memory
>>> hardware.
>>> We term these power-manageable chunks of memory as "Memory Regions".
>>>
>>> Exporting memory region info of the platform to the OS:
>>> ------------------------------------------------------
>>>
>>> The OS needs to know about the granularity at which the hardware can
>>> perform
>>> automatic power-management of the memory banks (i.e., the address
>>> boundaries
>>> of the memory regions). On ARM platforms, the bootloader can be
>>> modified to
>>> pass on this info to the kernel via the device-tree. On x86 platforms,
>>> the
>>> new ACPI 5.0 spec has added support for exporting the power-management
>>> capabilities of the memory hardware to the OS in a standard way[5].
>>>
>>> Estimate of power-savings from power-aware Linux MM:
>>> ---------------------------------------------------
>>>
>>> Once the firmware/bootloader exports the required info to the OS, it
>>> is upto
>>> the kernel's MM subsystem to make the best use of these capabilities
>>> and manage
>>> memory power-efficiently. It had been demonstrated on a Samsung Exynos
>>> board
>>> (with 2 GB RAM) that upto 6 percent of total system power can be saved by
>>> making the Linux kernel MM subsystem power-aware[4]. (More savings can be
>>> expected on systems with larger amounts of memory, and perhaps
>>> improved further
>>> using better MM designs).
>> How do we know that 6 percent of total system power can be saved by
>> making the Linux kernel MM subsystem power-aware?
>>
> By looking at the link I gave, I suppose? :-) Let me put it here again:
>
> [4]. Estimate of potential power savings on Samsung exynos board
> http://article.gmane.org/gmane.linux.kernel.mm/65935
>
> That was measured by running the earlier patchset which implemented the
> "Hierarchy" design[2], with aggressive memory savings policies. But in any
> case, it gives an idea of the amount of power savings we can get by doing
> memory power management.
>
>>>
>>> Role of the Linux MM in enhancing memory power savings:
>>> ------------------------------------------------------
>>>
>>> Often, this simply translates to having the Linux MM understand the
>>> granularity
>>> at which RAM modules can be power-managed, and keeping the memory
>>> allocations
>>> and references consolidated to a minimum no. of these power-manageable
>>> "memory regions". It is of particular interest to note that most of
>>> these memory
>>> hardware have the intelligence to automatically save power, by putting
>>> memory
>>> banks into (content-preserving) low-power states when not referenced
>>> for a
>> How does the hardware know that a DIMM/bank is not being referenced?
>>
> That's upto the hardware to figure out. It would be engraved in the
> hardware logic. The kernel need not worry about it. The kernel has to
> simply understand the PFN ranges corresponding to independently
> power-manageable chunks of memory and try to keep the memory allocations
> consolidated to a minimum no. of such memory regions. That's because we
> never reference (access) unallocated memory. So keeping the allocations
> consolidated also indirectly keeps the references consolidated.
>
> But going further, as I had mentioned in my TODO list, we can be smarter
> than this while doing compaction to evacuate memory regions - we can
> choose to migrate only the active pages, and leave the inactive pages
> alone. Because, the goal is to actually consolidate the *references* and
> not necessarily the *allocations* themselves.
>
>>> threshold amount of time. All that the kernel has to do, is avoid
>>> wrecking
>>> the power-savings logic by scattering its allocations and references
>>> all over
>>> the system memory. (The kernel/MM doesn't have to perform the actual
>>> power-state
>>> transitions; its mostly done in the hardware automatically, and this
>>> is OK
>>> because these are *content-preserving* low-power states).
>>>
>>>
>>> Brief overview of the design/approach used in this patchset:
>>> -----------------------------------------------------------
>>>
>>> This patchset implements the 'Sorted-buddy design' for Memory Power
>>> Management,
>>> in which the buddy (page) allocator is altered to keep the buddy
>>> freelists
>>> region-sorted, which helps influence the page allocation paths to keep
>>> the
>> Will this impact the normal zone-based buddy freelists?
>>
> The freelists continue to remain zone-based. No change in that. We are
> not fragmenting them further to be per-memory-region. Instead, we simply
> maintain pointers within the freelists to differentiate pageblocks belonging
> to different memory regions.
>
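A rough sketch of the data structure implied by the explanation just above:
the freelists stay per-zone, but each freelist additionally carries per-region
markers so that pageblocks belonging to different memory regions can be told
apart and allocations served region-by-region. The structure and field names
here are illustrative, not the patchset's actual definitions:

    /* Sketch only; requires <linux/list.h>. */
    #define MAX_NR_MEM_REGIONS      16

    /* Per-region marker into a buddy freelist. */
    struct mem_region_list {
            struct list_head *page_block;   /* first free pageblock of this region */
            unsigned long nr_free;          /* free pages of this region on the list */
    };

    /* A buddy freelist that remains zone-based but is kept region-sorted. */
    struct region_sorted_free_list {
            struct list_head list;                          /* the usual freelist */
            struct mem_region_list mr[MAX_NR_MEM_REGIONS];  /* region markers */
    };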
>>> allocations consolidated to a minimum no. of memory regions. This
>>> patchset also
>>> includes a light-weight targeted compaction/reclaim algorithm that works
>>> hand-in-hand with the page-allocator, to evacuate lightly-filled
>>> memory regions
>>> when memory gets fragmented, in order to further enhance memory power
>>> savings.
>>>
>>> This Sorted-buddy design was developed based on some of the suggestions
>>> received[1] during the review of the earlier patchset on Memory Power
>>> Management written by Ankita Garg ('Hierarchy design')[2].
>>> One of the key aspects of this Sorted-buddy design is that it avoids the
>>> zone-fragmentation problem that was present in the earlier design[3].
>>>
>>>
> Regards,
> Srivatsa S. Bhat
>

2013-04-25 18:00:22

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/15][Sorted-buddy] mm: Memory Power Management

On 04/18/2013 10:40 PM, Dave Hansen wrote:
> On 04/09/2013 02:45 PM, Srivatsa S. Bhat wrote:
>> 2. Performance overhead is expected to be low: Since we retain the simplicity
>> of the algorithm in the page allocation path, page allocation can
>> potentially remain as fast as it would be without memory regions. The
>> overhead is pushed to the page-freeing paths which are not that critical.
>

[...]

> I still also want to see some hard numbers on:
>> However, memory consumes a significant amount of power, potentially upto
>> more than a third of total system power on server systems.

Please find below the reference to the publicly available paper I had in
mind when I made that statement:

C. Lefurgy, K. Rajamani, F. Rawson, W. Felter, M. Kistler, and Tom Keller.
Energy management for commercial servers. In IEEE Computer, pages 39–48,
Dec 2003.

Here is a quick link to the paper:
researcher.ibm.com/files/us-lefurgy/computer2003.pdf

On page 40, the paper shows the power-consumption breakdown for an IBM p670
machine, which shows that as much as 40% of the system energy is consumed by
the memory sub-system in a mid-range server.

I admit that the paper is a little old (I'll see if I can find anything more
recent that is publicly available, or perhaps you can verify the same if you
have data-sheets for other platforms handy), but given the trend of increasing
memory speeds and increasing memory density/capacity in computer systems, the
power-consumption of memory is certainly not going to become insignificant all
of a sudden.

IOW, the above data supports the point I was trying to make - Memory hardware
contributes to a significant portion of the power consumption of a system. And
since the hardware is now exposing ways to reduce the power consumption, it
would be worthwhile to try and exploit it by doing memory power management.

Regards,
Srivatsa S. Bhat

2013-05-28 20:08:43

by Phillip Susi

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/15][Sorted-buddy] mm: Memory Power Management


On 4/19/2013 3:12 AM, Srivatsa S. Bhat wrote:
> But going further, as I had mentioned in my TODO list, we can be
> smarter than this while doing compaction to evacuate memory regions
> - we can choose to migrate only the active pages, and leave the
> inactive pages alone. Because, the goal is to actually consolidate
> the *references* and not necessarily the *allocations* themselves.

That would help with keeping references compact to allow use of the
low power states, but it would also be nice to keep allocations
compact, and completely power off a bank of ram with no allocations.



2013-05-29 05:40:12

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/15][Sorted-buddy] mm: Memory Power Management

On 05/29/2013 01:38 AM, Phillip Susi wrote:
>
> On 4/19/2013 3:12 AM, Srivatsa S. Bhat wrote:
>> But going further, as I had mentioned in my TODO list, we can be
>> smarter than this while doing compaction to evacuate memory regions
>> - we can choose to migrate only the active pages, and leave the
>> inactive pages alone. Because, the goal is to actually consolidate
>> the *references* and not necessarily the *allocations* themselves.
>
> That would help with keeping references compact to allow use of the
> low power states, but it would also be nice to keep allocations
> compact, and completely power off a bank of ram with no allocations.
>

That is a very good point, thanks! But one of the differences we have to
keep in mind is that powering off a bank requires intervention from the
OS (i.e., the OS should initiate the power-off, because we lose the contents
on power-off) whereas going to lower power states can be mostly done
automatically by the hardware (because it is content-preserving).

But powering-off unused banks of RAM (using techniques such as PASR -
Partial Array Self Refresh) can give us more power-savings than just
entering lower power states. So yes, keeping allocations consolidated
has that additional advantage. And the sorted-buddy design of the page
allocator helps us achieve that.

Thanks a lot for your inputs, Phillip!

Regards,
Srivatsa S. Bhat
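
As a footnote to the evacuation idea discussed in this thread (migrate only
the pages that are actually in use out of a lightly-filled region, so that the
region can then drop into a low-power state or be powered off via PASR), here
is a simplified sketch of what such a routine could look like. It is
illustrative only: it glosses over error handling and page-type checks, the
evacuate_mem_region() helper and its callers are hypothetical, and it leans on
the (circa 3.9) kernel primitives isolate_lru_page(), migrate_pages(),
alloc_migrate_target() and putback_lru_pages():

    static int evacuate_mem_region(unsigned long start_pfn, unsigned long end_pfn)
    {
            unsigned long pfn;
            int ret;
            LIST_HEAD(pagelist);

            for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                    struct page *page;

                    if (!pfn_valid(pfn))
                            continue;

                    page = pfn_to_page(pfn);

                    /* Free pages are never referenced; leave them alone. */
                    if (!page_count(page))
                            continue;

                    /* Collect the in-use LRU pages for migration. */
                    if (!isolate_lru_page(page))
                            list_add_tail(&page->lru, &pagelist);
            }

            /* Move the collected pages out of this memory region. */
            ret = migrate_pages(&pagelist, alloc_migrate_target, 0,
                                MIGRATE_SYNC, MR_MEMORY_HOTPLUG);
            if (ret)
                    putback_lru_pages(&pagelist);

            return ret;
    }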