2011-05-27 12:33:24

by Ankita Garg

Subject: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

Hi,

Modern systems offer higher CPU performance and larger amounts of memory with
each generation in order to support application demands. The memory subsystem
has begun to offer a wide range of capabilities for managing power consumption,
which is driving the need to re-examine the way memory is managed by the
operating system. The Linux VM subsystem has sophisticated algorithms to
optimally manage scarce resources for the best overall system performance.
Apart from the capacity and location of memory areas, the VM subsystem tracks
special addressability restrictions in zones and, if necessary, relative
distance from the CPU as NUMA nodes. Power management capabilities in the
memory subsystem and the inclusion of different classes of main memory, such as
PCM or non-volatile RAM, bring in new boundaries and attributes that need to be
tagged within the Linux VM subsystem for exploitation by the kernel and
applications.

This patchset proposes a generic memory regions infrastructure that can be
used to tag the boundaries of memory blocks which belong to a specific memory
power management domain, and thereby enable exploitation of platform memory
power management capabilities.

How can Linux VM help memory power savings?

o Consolidate memory allocations and/or references such that they are
not spread across the entire memory address space. Areas of memory that are
not being referenced can then reside in a low power state.

o Support targeted memory reclaim, where areas of memory that can be easily
freed are offlined, allowing those areas of memory to be put into lower power
states.

What is a Memory Region ?
-------------------------

Memory regions is a generic memory management framework that enables the
virtual memory manager to consider memory characteristics when making memory
allocation and deallocation decisions. It is a layer of abstraction under the
real NUMA nodes that encapsulates knowledge of the underlying memory hardware.
This layer is created at boot time, with information from firmware regarding
the granularity at which memory power can be managed on the platform. For
example, on platforms with support for Partial Array Self-Refresh (PASR) [1],
regions could be aligned to the memory units that can be independently put into
self-refresh or turned off (content-destructive power off). On platforms with
multiple memory controllers that control the power states of memory, one memory
region could be created for all the memory under a single memory controller.

The aim of the alignment is to ensure that memory allocations, deallocations
and reclaim are performed within a defined hardware boundary. By creating
zones under regions, the buddy allocator would operate at the level of
regions. The proposed data structure is as shown in the Figure below:


          -----------------------------
          |N0 |N1 |N2 |N3 |.. |.. |Nn |
          -----------------------------
            /            |           \
           /             |            \
          /              |             \
   ------------    ------------    ------------
   | Mem Rgn0 |    | Mem Rgn1 |    | Mem Rgn3 |
   ------------    ------------    ------------
        |               |               |
        v               v               v
    ---------       ---------       ---------
    | zones |       | zones |       | zones |
    ---------       ---------       ---------
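
To make the hierarchy concrete, here is a minimal C sketch of the linkage
(field names follow patch 01 of this series; only the region-related fields
are shown):

/* Sketch of the proposed node -> region -> zone linkage; see patch 01
 * for the actual definitions.
 */
typedef struct mem_region_list_data {
	struct zone zones[MAX_NR_ZONES];	/* zones now live per region   */
	int nr_zones;
	int node;				/* owning NUMA node            */
	int region;				/* index of region in the node */
	unsigned long start_pfn;
	unsigned long spanned_pages;
} mem_region_t;

/* pg_data_t drops node_zones[] and instead carries:
 *	mem_region_t mem_regions[MAX_NR_REGIONS];
 *	int nr_mem_regions;
 */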

Memory regions enable the following:

o Sequential allocation of memory in the order of memory regions, thus
ensuring that a greater number of memory regions are devoid of allocations to
begin with.
o With time, however, the memory allocations will tend to be spread across
different regions. But the notion of a region boundary and region-level
memory statistics will enable specific regions to be evacuated using
targeted allocation and reclaim.

Lumpy reclaim and the memory compaction work by Mel Gorman [4] would further
aid in the consolidation of memory.

Memory regions is just a base infrastructure that would enable the Linux VM to
be aware of the physical memory hardware characteristics, a prerequisite to
implementing other sophisticated algorithms and techniques to actually
conserve power.

Advantages
-----------

The memory regions framework works with existing memory management data
structures and only adds one more layer of abstraction, which is required to
capture special boundaries and properties. Most VM code paths work similarly
to the current implementation, with an additional traversal of zone data
structures in a pre-defined order.

Alternative Approach:

There are other ways in which memory belonging to the same power domain could
be grouped together. Fake NUMA nodes under a real NUMA node could encapsulate
information about the memory hardware units that can be independently power
managed. With minimal code changes, the same functionality as memory regions
can be achieved. However, fake NUMA nodes are a non-intuitive solution that
breaks the NUMA semantics and is not generic in nature. It would present an
incorrect view of the system to the administrator, showing a greater number of
NUMA nodes than are actually present.

Challenges
----------

o Memory interleaving is typically used on all platforms to increase the
memory bandwidth and hence memory performance. However, in the presence of
interleaving, the amount of idle memory within the hardware domain reduces,
impacting power savings. For a given platform, it is important to select an
interleaving scheme that gives good performance with optimum power savings.

This is an RFC patchset with minimal functionality to demonstrate the
requirement and proposed implementation options. It has been tested on a TI
OMAP4 Panda board with 1GB RAM and the Samsung Exynos 4210 board. The patch
applies on kernel version 2.6.39-rc5, compiled with the default config files
for the two platforms. I have turned off cgroup, memory hotplug and kexec to
begin with; support for these frameworks can be easily added. The u-boot
bootloader does not yet export information regarding the physical memory bank
boundaries, so the regions are not correctly aligned to the hardware and are
hard coded for test/demo purposes. Also, the code assumes that at least one
region is present in the node. Compile-time exclusion of memory regions is a
todo.

Results
-------
Ran pagetest, a simple C program that allocates and touches a required number
of pages, on a Samsung Exynos 4210 board with ~2GB RAM, booted with 4 memory
regions of ~512MB each. The allocation size used was 512MB. Below are the
free page statistics while running the benchmark:

---------------------------------------
|          |  start | ~480MB |  512MB |
---------------------------------------
| Region 0 | 124013 |   1129 |    484 |
| Region 1 | 131072 | 131072 | 130824 |
| Region 2 | 131072 | 131072 | 131072 |
| Region 3 |  57332 |  57332 |  57332 |
---------------------------------------

(The total number of pages in Region 3 is 57332, as it contains all the
remaining pages and hence the region size is not 512MB).

Column 1 indicates the number of free pages in each region at the start of the
benchmark, column 2 at about 480MB of allocation and column 3 at 512MB of
allocation. The memory in regions 1, 2 and 3 is free and only region 0 is
utilized. So if the regions are aligned to the hardware memory units, free
regions could potentially be put into a low power state or turned off. It may
be possible to allocate from lower addresses even without regions, but once
page reclaim comes into play, page allocations will tend to get spread around.
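
For reference, pagetest is referred to only by name above; a minimal
allocate-and-touch program along these lines (an assumed sketch, not the
actual benchmark source) would be:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	long nr_pages = (argc > 1) ? atol(argv[1]) : 0;
	char *buf;
	long i;

	if (nr_pages <= 0) {
		fprintf(stderr, "usage: %s <nr_pages>\n", argv[0]);
		return 1;
	}

	buf = malloc(nr_pages * pagesize);
	if (!buf)
		return 1;

	/* Touch every page so the kernel actually allocates it */
	for (i = 0; i < nr_pages; i++)
		buf[i * pagesize] = 1;

	/* Keep the pages resident while per-region statistics are sampled */
	pause();
	return 0;
}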

References
----------

[1] Partial Array Self Refresh
http://focus.ti.com/general/docs/wtbu/wtbudocumentcenter.tsp?templateId=6123&navigationId=12037
[2] TI OMAP4 Panda board
http://pandaboard.org/node/224/#manual
[3] Memory Regions discussion at Ubuntu Development Summit, May 2011
https://wiki.linaro.org/Specs/KernelGeneralPerformanceO?action=AttachFile&do=view&target=k81-memregions.odp
[4] Memory compaction
http://lwn.net/Articles/368869/

Ankita Garg (10):
mm: Introduce the memory regions data structure
mm: Helper routines
mm: Init zones inside memory regions
mm: Refer to zones from memory regions
mm: Create zonelists
mm: Verify zonelists
mm: Modify vmstat
mm: Modify vmscan
mm: Reflect memory region changes in zoneinfo
mm: Create memory regions at boot-up

include/linux/mm.h | 25 +++-
include/linux/mmzone.h | 58 +++++++--
include/linux/vmstat.h | 22 ++-
mm/mm_init.c | 51 ++++---
mm/mmzone.c | 36 ++++-
mm/page_alloc.c | 368 +++++++++++++++++++++++++++++++-----------------
mm/vmscan.c | 284 ++++++++++++++++++++-----------------
mm/vmstat.c | 77 ++++++----
8 files changed, 581 insertions(+), 340 deletions(-)

--
1.7.4


2011-05-27 12:32:05

by Ankita Garg

Subject: [PATCH 01/10] mm: Introduce the memory regions data structure

The memory region data structure is created under a NUMA node. Each NUMA node
can have multiple memory regions, depending upon the platform configuration for
power management. Each memory region contains zones, which are the entities
from which memory is allocated by the buddy allocator.

                -------------
                | pg_data_t |
                -------------
                 |         |
          -------           -------
         v                         v
  ----------------           ----------------
  | mem_region_t |  .......  | mem_region_t |
  ----------------           ----------------
         |                          |
         v                          v
  -----------------------------   ----------------
  | zone0 | zone1 | zone3 | ..|   | zone0 | .... |
  -----------------------------   ----------------

Each memory region contains a zone array for the zones belonging to that
region, in addition to other fields like the node id, the index of the region
in the node, the start pfn of the pages in that region and the number of pages
spanned by the region. The zone array inside the regions is statically
allocated at this point.

ToDo:
Since the number of regions actually present on the system might be much
smaller than the maximum allowed, dynamic bootmem allocation could be used to
save memory.
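
As a rough illustration of that ToDo (not part of this patch, and it assumes
mem_regions becomes a pointer instead of a fixed array), the region array
could be sized at boot once the firmware-reported region count is known:

/* Hypothetical sketch only: size the per-node region array at boot.
 * 'nr_regions' would come from firmware; alloc_bootmem_node() is the
 * boot-time allocator available at this stage.
 */
static void __init alloc_node_mem_regions(pg_data_t *pgdat, int nr_regions)
{
	pgdat->mem_regions = alloc_bootmem_node(pgdat,
					nr_regions * sizeof(mem_region_t));
	pgdat->nr_mem_regions = nr_regions;
}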

Signed-off-by: Ankita Garg <[email protected]>
---
include/linux/mmzone.h | 25 ++++++++++++++++++++++++-
1 files changed, 24 insertions(+), 1 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e56f835..997a474 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -60,6 +60,7 @@ struct free_area {
};

struct pglist_data;
+struct mem_region_list_data;

/*
* zone->lock and zone->lru_lock are two of the hottest locks in the kernel.
@@ -311,6 +312,7 @@ struct zone {
unsigned long min_unmapped_pages;
unsigned long min_slab_pages;
#endif
+ int region;
struct per_cpu_pageset __percpu *pageset;
/*
* free areas of different sizes
@@ -399,6 +401,8 @@ struct zone {
* Discontig memory support fields.
*/
struct pglist_data *zone_pgdat;
+ struct mem_region_list_data *zone_mem_region;
+
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long zone_start_pfn;

@@ -597,6 +601,19 @@ struct node_active_region {
extern struct page *mem_map;
#endif

+typedef struct mem_region_list_data {
+ struct zone zones[MAX_NR_ZONES];
+ int nr_zones;
+
+ int node;
+ int region;
+
+ unsigned long start_pfn;
+ unsigned long spanned_pages;
+} mem_region_t;
+
+#define MAX_NR_REGIONS 16
+
/*
* The pg_data_t structure is used in machines with CONFIG_DISCONTIGMEM
* (mostly NUMA machines?) to denote a higher-level memory zone than the
@@ -610,7 +627,10 @@ extern struct page *mem_map;
*/
struct bootmem_data;
typedef struct pglist_data {
- struct zone node_zones[MAX_NR_ZONES];
+/* The linkage to node_zones is now removed. The new hierarchy introduced
+ * is pg_data_t -> mem_region -> zones
+ * struct zone node_zones[MAX_NR_ZONES];
+ */
struct zonelist node_zonelists[MAX_ZONELISTS];
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
@@ -632,6 +652,9 @@ typedef struct pglist_data {
*/
spinlock_t node_size_lock;
#endif
+ mem_region_t mem_regions[MAX_NR_REGIONS];
+ int nr_mem_regions;
+
unsigned long node_start_pfn;
unsigned long node_present_pages; /* total number of physical pages */
unsigned long node_spanned_pages; /* total size of physical page
--
1.7.4

2011-05-27 12:34:17

by Ankita Garg

Subject: [PATCH 02/10] mm: Helper routines

With the introduction of regions, helper routines are needed to walk through
all the regions and zones inside a node. This patch adds these helper
routines.
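
As an illustration of how the new iterators are meant to be used (this helper
is not part of the patch), counting the pages spanned by every region of a
node could look like:

/* Example use of for_each_mem_region_in_nid(); illustrative only. */
static unsigned long node_spanned_pages_by_region(int nid)
{
	pg_data_t *pgdat = NODE_DATA(nid);
	unsigned long total = 0;
	int i;

	for_each_mem_region_in_nid(i, nid) {
		mem_region_t *mem_region = &pgdat->mem_regions[i];

		total += mem_region->spanned_pages;
	}
	return total;
}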

Signed-off-by: Ankita Garg <[email protected]>
---
include/linux/mm.h | 21 ++++++++++++++++-----
include/linux/mmzone.h | 25 ++++++++++++++++++++++---
mm/mmzone.c | 36 +++++++++++++++++++++++++++++++-----
mm/page_alloc.c | 10 ++++++++++
4 files changed, 79 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6507dde..25299a3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -675,11 +675,6 @@ static inline int page_to_nid(struct page *page)
}
#endif

-static inline struct zone *page_zone(struct page *page)
-{
- return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
-}
-
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
static inline unsigned long page_to_section(struct page *page)
{
@@ -687,6 +682,22 @@ static inline unsigned long page_to_section(struct page *page)
}
#endif

+static inline struct zone *page_zone(struct page *page)
+{
+ pg_data_t *pg = NODE_DATA(page_to_nid(page));
+ int i;
+ unsigned long pfn = page_to_pfn(page);
+
+ for (i = 0; i < pg->nr_mem_regions; i++) {
+ mem_region_t *mem_region = &(pg->mem_regions[i]);
+ unsigned long end_pfn = mem_region->start_pfn + mem_region->spanned_pages - 1;
+ if ((pfn >= mem_region->start_pfn) && (pfn <= end_pfn))
+ return &mem_region->zones[page_zonenum(page)];
+ }
+
+ return NULL;
+}
+
static inline void set_page_zone(struct page *page, enum zone_type zone)
{
page->flags &= ~(ZONES_MASK << ZONES_PGSHIFT);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 997a474..3f13dc8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -711,7 +711,7 @@ unsigned long __init node_memmap_size_bytes(int, unsigned long, unsigned long);
/*
* zone_idx() returns 0 for the ZONE_DMA zone, 1 for the ZONE_NORMAL zone, etc.
*/
-#define zone_idx(zone) ((zone) - (zone)->zone_pgdat->node_zones)
+#define zone_idx(zone) ((zone) - (zone)->zone_mem_region->zones)

static inline int populated_zone(struct zone *zone)
{
@@ -818,6 +818,7 @@ extern struct pglist_data contig_page_data;

extern struct pglist_data *first_online_pgdat(void);
extern struct pglist_data *next_online_pgdat(struct pglist_data *pgdat);
+extern struct zone* first_zone(void);
extern struct zone *next_zone(struct zone *zone);

/**
@@ -828,6 +829,24 @@ extern struct zone *next_zone(struct zone *zone);
for (pgdat = first_online_pgdat(); \
pgdat; \
pgdat = next_online_pgdat(pgdat))
+
+extern struct mem_region_list_data *first_mem_region(struct pglist_data *pgdat);
+/**
+ * for_each_mem_region - helper macro to iterate over all the memory regions in a
+ * node
+ * @mem_region - pointer to a mem_region_t variable
+ */
+#define for_each_online_mem_region(mem_region) \
+ for (mem_region = first_node_mem_region(); \
+ mem_region; \
+ mem_region = next_mem_region(mem_region))
+
+inline int first_mem_region_in_nid(int nid);
+int next_mem_region_in_nid(int idx, int nid);
+/* Basic iterator support to walk all the memory regions in a node */
+#define for_each_mem_region_in_nid(i, nid) \
+ for (i = 0; i != -1; i = next_mem_region_in_nid(i, nid))
+
/**
* for_each_zone - helper macro to iterate over all memory zones
* @zone - pointer to struct zone variable
@@ -836,12 +855,12 @@ extern struct zone *next_zone(struct zone *zone);
* fills it in.
*/
#define for_each_zone(zone) \
- for (zone = (first_online_pgdat())->node_zones; \
+ for (zone = (first_zone()); \
zone; \
zone = next_zone(zone))

#define for_each_populated_zone(zone) \
- for (zone = (first_online_pgdat())->node_zones; \
+ for (zone = (first_zone()); \
zone; \
zone = next_zone(zone)) \
if (!populated_zone(zone)) \
diff --git a/mm/mmzone.c b/mm/mmzone.c
index f5b7d17..0e2bec3 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -24,21 +24,47 @@ struct pglist_data *next_online_pgdat(struct pglist_data *pgdat)
return NODE_DATA(nid);
}

+inline struct mem_region_list_data *first_mem_region(struct pglist_data *pgdat)
+{
+ return &pgdat->mem_regions[0];
+}
+
+struct mem_region_list_data *next_mem_region(struct mem_region_list_data *mem_region)
+{
+ int next_region = mem_region->region + 1;
+ pg_data_t *pgdat = NODE_DATA(mem_region->node);
+
+ if (next_region == pgdat->nr_mem_regions)
+ return NULL;
+ return &(pgdat->mem_regions[next_region]);
+}
+
+inline struct zone *first_zone(void)
+{
+ return (first_online_pgdat())->mem_regions[0].zones;
+}
+
/*
* next_zone - helper magic for for_each_zone()
*/
struct zone *next_zone(struct zone *zone)
{
pg_data_t *pgdat = zone->zone_pgdat;
+ mem_region_t *mem_region = zone->zone_mem_region;

- if (zone < pgdat->node_zones + MAX_NR_ZONES - 1)
+ if (zone < mem_region->zones + MAX_NR_ZONES - 1)
zone++;
else {
- pgdat = next_online_pgdat(pgdat);
- if (pgdat)
- zone = pgdat->node_zones;
+ mem_region = next_mem_region(mem_region);
+ if (!mem_region) {
+ pgdat = next_online_pgdat(pgdat);
+ if (pgdat)
+ zone = pgdat->mem_regions[0].zones;
+ else
+ zone = NULL;
+ }
else
- zone = NULL;
+ zone = mem_region->zones;
}
return zone;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f8bce2..af2529d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2492,6 +2492,16 @@ out:

#define K(x) ((x) << (PAGE_SHIFT-10))

+int next_mem_region_in_nid(int idx, int nid)
+{
+ pg_data_t *pgdat = NODE_DATA(nid);
+ int region = idx + 1;
+
+ if (region == pgdat->nr_mem_regions)
+ return -1;
+ return region;
+}
+
/*
* Show free area list (used inside shift_scroll-lock stuff)
* We also calculate the percentage fragmentation. We do this by counting the
--
1.7.4

2011-05-27 12:33:58

by Ankita Garg

Subject: [PATCH 03/10] mm: Init zones inside memory regions

This patch initializes zones inside memory regions. Each memory region is
scanned for the pfns present in it. The intersection of that range with the
range of a zone is set up as the amount of memory present in the zone in that
region. Most of the other setup-related steps remain unmodified.
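
The intersection mentioned above is a simple clamp of the zone pfn range
against the region pfn range; a rough standalone form of the calculation done
by zone_spanned_pages_in_node_region() in the diff below:

/* Rough illustration of the zone/region pfn range intersection. */
static unsigned long pages_of_zone_in_region(unsigned long zone_start_pfn,
					     unsigned long zone_end_pfn,
					     unsigned long region_start_pfn,
					     unsigned long region_end_pfn)
{
	unsigned long start = max(zone_start_pfn, region_start_pfn);
	unsigned long end = min(zone_end_pfn, region_end_pfn);

	if (end < start)
		return 0;		/* no overlap */
	return end - start + 1;		/* both ends are inclusive */
}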

Signed-off-by: Ankita Garg <[email protected]>
---
include/linux/mm.h | 2 +
mm/page_alloc.c | 182 +++++++++++++++++++++++++++++++++-------------------
2 files changed, 118 insertions(+), 66 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 25299a3..e4e7869 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1360,6 +1360,8 @@ extern unsigned long absent_pages_in_range(unsigned long start_pfn,
unsigned long end_pfn);
extern void get_pfn_range_for_nid(unsigned int nid,
unsigned long *start_pfn, unsigned long *end_pfn);
+extern void get_pfn_range_for_region(int nid, int region,
+ unsigned long *start_pfn, unsigned long *end_pfn);
extern unsigned long find_min_pfn_with_active_regions(void);
extern void free_bootmem_with_active_regions(int nid,
unsigned long max_low_pfn);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index af2529d..a21e067 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4000,6 +4000,11 @@ static void __meminit adjust_zone_range_for_zone_movable(int nid,
* Return the number of pages a zone spans in a node, including holes
* present_pages = zone_spanned_pages_in_node() - zone_absent_pages_in_node()
*/
+
+/* This routine needs modifications
+ * Presently have made changes only to routines specific to the default config options
+ * of the panda and exynos boards
+ */
static unsigned long __meminit zone_spanned_pages_in_node(int nid,
unsigned long zone_type,
unsigned long *ignored)
@@ -4111,6 +4116,37 @@ static unsigned long __meminit zone_absent_pages_in_node(int nid,
}

#else
+
+void __meminit get_pfn_range_for_region(int nid, int region,
+ unsigned long *start_pfn, unsigned long *end_pfn)
+{
+ mem_region_t *mem_region;
+
+ mem_region = &NODE_DATA(nid)->mem_regions[region];
+ *start_pfn = mem_region->start_pfn;
+ *end_pfn = *start_pfn + mem_region->spanned_pages - 1;
+}
+
+static inline unsigned long __meminit zone_spanned_pages_in_node_region(int nid,
+ int region,
+ unsigned long zone_start_pfn,
+ unsigned long zone_type,
+ unsigned long *zones_size)
+{
+ unsigned long start_pfn, end_pfn;
+ unsigned long zone_end_pfn = zone_start_pfn + zones_size[zone_type] - 1;
+
+ if (!zones_size[zone_type])
+ return 0;
+
+ get_pfn_range_for_region(nid, region, &start_pfn, &end_pfn);
+
+ zone_end_pfn = min(zone_end_pfn, end_pfn);
+ zone_start_pfn = max(start_pfn, zone_start_pfn);
+
+ return zone_end_pfn - zone_start_pfn + 1;
+}
+
static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
unsigned long zone_type,
unsigned long *zones_size)
@@ -4118,14 +4154,22 @@ static inline unsigned long __meminit zone_spanned_pages_in_node(int nid,
return zones_size[zone_type];
}

+/* Returning 0 at this point. It only affects the zone watermarks as the number
+ * of present pages in the zones will be stored incorrectly.
+ * To Do: Compute the pfn ranges of holes in memory and incorporate that info
+ * when finding holes inside each zone
+ */
static inline unsigned long __meminit zone_absent_pages_in_node(int nid,
unsigned long zone_type,
unsigned long *zholes_size)
{
+#if 0
if (!zholes_size)
return 0;

return zholes_size[zone_type];
+#endif
+ return 0;
}

#endif
@@ -4237,7 +4281,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
enum zone_type j;
int nid = pgdat->node_id;
unsigned long zone_start_pfn = pgdat->node_start_pfn;
- int ret;
+ int ret, i;

pgdat_resize_init(pgdat);
pgdat->nr_zones = 0;
@@ -4246,78 +4290,84 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
pgdat_page_cgroup_init(pgdat);

for (j = 0; j < MAX_NR_ZONES; j++) {
- struct zone *zone = pgdat->node_zones + j;
- unsigned long size, realsize, memmap_pages;
- enum lru_list l;
-
- size = zone_spanned_pages_in_node(nid, j, zones_size);
- realsize = size - zone_absent_pages_in_node(nid, j,
+ for_each_mem_region_in_nid(i, nid) {
+ mem_region_t *mem_region = &pgdat->mem_regions[i];
+ struct zone *zone = mem_region->zones + j;
+ unsigned long size, realsize, memmap_pages;
+ enum lru_list l;
+
+ size = zone_spanned_pages_in_node_region(nid, i, zone_start_pfn,
+ j, zones_size);
+ realsize = size - zone_absent_pages_in_node(nid, j,
zholes_size);

- /*
- * Adjust realsize so that it accounts for how much memory
- * is used by this zone for memmap. This affects the watermark
- * and per-cpu initialisations
- */
- memmap_pages =
- PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
- if (realsize >= memmap_pages) {
- realsize -= memmap_pages;
- if (memmap_pages)
- printk(KERN_DEBUG
- " %s zone: %lu pages used for memmap\n",
- zone_names[j], memmap_pages);
- } else
- printk(KERN_WARNING
- " %s zone: %lu pages exceeds realsize %lu\n",
- zone_names[j], memmap_pages, realsize);
-
- /* Account for reserved pages */
- if (j == 0 && realsize > dma_reserve) {
- realsize -= dma_reserve;
- printk(KERN_DEBUG " %s zone: %lu pages reserved\n",
- zone_names[0], dma_reserve);
- }
+ /*
+ * Adjust realsize so that it accounts for how much memory
+ * is used by this zone for memmap. This affects the watermark
+ * and per-cpu initialisations
+ */
+ memmap_pages =
+ PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT;
+ if (realsize >= memmap_pages) {
+ realsize -= memmap_pages;
+ if (memmap_pages)
+ printk(KERN_DEBUG
+ " %s zone: %lu pages used for memmap\n",
+ zone_names[j], memmap_pages);
+ } else
+ printk(KERN_WARNING
+ " %s zone: %lu pages exceeds realsize %lu\n",
+ zone_names[j], memmap_pages, realsize);
+
+ /* Account for reserved pages */
+ if (j == 0 && realsize > dma_reserve) {
+ realsize -= dma_reserve;
+ printk(KERN_DEBUG " %s zone: %lu pages reserved\n",
+ zone_names[0], dma_reserve);
+ }

- if (!is_highmem_idx(j))
- nr_kernel_pages += realsize;
- nr_all_pages += realsize;
+ if (!is_highmem_idx(j))
+ nr_kernel_pages += realsize;
+ nr_all_pages += realsize;

- zone->spanned_pages = size;
- zone->present_pages = realsize;
+ zone->spanned_pages = size;
+ zone->present_pages = realsize;
#ifdef CONFIG_NUMA
- zone->node = nid;
- zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
- / 100;
- zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
+ zone->node = nid;
+ zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
+ / 100;
+ zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
#endif
- zone->name = zone_names[j];
- spin_lock_init(&zone->lock);
- spin_lock_init(&zone->lru_lock);
- zone_seqlock_init(zone);
- zone->zone_pgdat = pgdat;
-
- zone_pcp_init(zone);
- for_each_lru(l) {
- INIT_LIST_HEAD(&zone->lru[l].list);
- zone->reclaim_stat.nr_saved_scan[l] = 0;
- }
- zone->reclaim_stat.recent_rotated[0] = 0;
- zone->reclaim_stat.recent_rotated[1] = 0;
- zone->reclaim_stat.recent_scanned[0] = 0;
- zone->reclaim_stat.recent_scanned[1] = 0;
- zap_zone_vm_stats(zone);
- zone->flags = 0;
- if (!size)
- continue;
+ zone->region = i;
+ zone->name = zone_names[j];
+ spin_lock_init(&zone->lock);
+ spin_lock_init(&zone->lru_lock);
+ zone_seqlock_init(zone);
+ zone->zone_pgdat = pgdat;
+ zone->zone_mem_region = mem_region;
+
+ zone_pcp_init(zone);
+ for_each_lru(l) {
+ INIT_LIST_HEAD(&zone->lru[l].list);
+ zone->reclaim_stat.nr_saved_scan[l] = 0;
+ }
+ zone->reclaim_stat.recent_rotated[0] = 0;
+ zone->reclaim_stat.recent_rotated[1] = 0;
+ zone->reclaim_stat.recent_scanned[0] = 0;
+ zone->reclaim_stat.recent_scanned[1] = 0;
+ zap_zone_vm_stats(zone);
+ zone->flags = 0;
+ if (!size)
+ continue;

- set_pageblock_order(pageblock_default_order());
- setup_usemap(pgdat, zone, size);
- ret = init_currently_empty_zone(zone, zone_start_pfn,
- size, MEMMAP_EARLY);
- BUG_ON(ret);
- memmap_init(size, nid, j, zone_start_pfn);
- zone_start_pfn += size;
+ set_pageblock_order(pageblock_default_order());
+ setup_usemap(pgdat, zone, size);
+ ret = init_currently_empty_zone(zone, zone_start_pfn,
+ size, MEMMAP_EARLY);
+ BUG_ON(ret);
+ memmap_init(size, nid, j, zone_start_pfn);
+ zone_start_pfn += size;
+ }
}
}

--
1.7.4

2011-05-27 12:33:02

by Ankita Garg

Subject: [PATCH 04/10] mm: Refer to zones from memory regions

With the introduction of memory regions, the node_zones array inside
the node structure is removed. Hence, this patch modifies the VM
code to refer to zones from within memory regions instead of from nodes.
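
In practice the change is mechanical: a zone is located through its region's
zone array rather than through pgdat->node_zones, and the relative helpers
follow suit. A hedged before/after sketch (not a hunk from this patch):

/* Illustration only:
 *
 *   before: zone = pgdat->node_zones + ZONE_NORMAL;
 *   after : zone = pgdat->mem_regions[region].zones + ZONE_NORMAL;
 *
 * and offsets are computed within the region's array, as in the
 * modified zone_idx():
 */
static inline int zone_idx_in_region(struct zone *zone)
{
	return zone - zone->zone_mem_region->zones;
}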

Signed-off-by: Ankita Garg <[email protected]>
---
include/linux/mm.h | 2 +-
include/linux/mmzone.h | 8 ++--
mm/page_alloc.c | 87 +++++++++++++++++++++++++++--------------------
3 files changed, 55 insertions(+), 42 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index e4e7869..1b8839d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1390,7 +1390,7 @@ extern int __meminit __early_pfn_to_nid(unsigned long pfn);
#endif

extern void set_dma_reserve(unsigned long new_dma_reserve);
-extern void memmap_init_zone(unsigned long, int, unsigned long,
+extern void memmap_init_zone(unsigned long, int, int, unsigned long,
unsigned long, enum memmap_context);
extern void setup_per_zone_wmarks(void);
extern void calculate_zone_inactive_ratio(struct zone *zone);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3f13dc8..bc3e3fd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -753,7 +753,7 @@ static inline int is_normal_idx(enum zone_type idx)
static inline int is_highmem(struct zone *zone)
{
#ifdef CONFIG_HIGHMEM
- int zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones;
+ int zone_off = (char *)zone - (char *)zone->zone_mem_region->zones;
return zone_off == ZONE_HIGHMEM * sizeof(*zone) ||
(zone_off == ZONE_MOVABLE * sizeof(*zone) &&
zone_movable_is_highmem());
@@ -764,13 +764,13 @@ static inline int is_highmem(struct zone *zone)

static inline int is_normal(struct zone *zone)
{
- return zone == zone->zone_pgdat->node_zones + ZONE_NORMAL;
+ return zone == zone->zone_mem_region->zones + ZONE_NORMAL;
}

static inline int is_dma32(struct zone *zone)
{
#ifdef CONFIG_ZONE_DMA32
- return zone == zone->zone_pgdat->node_zones + ZONE_DMA32;
+ return zone == zone->zone_mem_region->zones + ZONE_DMA32;
#else
return 0;
#endif
@@ -779,7 +779,7 @@ static inline int is_dma32(struct zone *zone)
static inline int is_dma(struct zone *zone)
{
#ifdef CONFIG_ZONE_DMA
- return zone == zone->zone_pgdat->node_zones + ZONE_DMA;
+ return zone == zone->zone_mem_region->zones + ZONE_DMA;
#else
return 0;
#endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a21e067..3c48635 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -892,7 +892,7 @@ static int move_freepages_block(struct zone *zone, struct page *page,
end_pfn = start_pfn + pageblock_nr_pages - 1;

/* Do not cross zone boundaries */
- if (start_pfn < zone->zone_start_pfn)
+ if (start_pfn <= zone->zone_start_pfn)
start_page = page;
if (end_pfn >= zone->zone_start_pfn + zone->spanned_pages)
return 0;
@@ -2462,7 +2462,7 @@ void si_meminfo_node(struct sysinfo *val, int nid)
#ifdef CONFIG_HIGHMEM
val->totalhigh = pgdat->node_zones[ZONE_HIGHMEM].present_pages;
val->freehigh = zone_page_state(&pgdat->node_zones[ZONE_HIGHMEM],
- NR_FREE_PAGES);
+ NR_FREE_PAGES);
#else
val->totalhigh = 0;
val->freehigh = 0;
@@ -3396,8 +3396,8 @@ static void setup_zone_migrate_reserve(struct zone *zone)
* up by free_all_bootmem() once the early boot process is
* done. Non-atomic initialization, single-pass.
*/
-void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
- unsigned long start_pfn, enum memmap_context context)
+void __meminit memmap_init_zone(unsigned long size, int nid, int region,
+ unsigned long zone, unsigned long start_pfn, enum memmap_context context)
{
struct page *page;
unsigned long end_pfn = start_pfn + size;
@@ -3407,7 +3407,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
if (highest_memmap_pfn < end_pfn - 1)
highest_memmap_pfn = end_pfn - 1;

- z = &NODE_DATA(nid)->node_zones[zone];
+ z = &NODE_DATA(nid)->mem_regions[region].zones[zone];
for (pfn = start_pfn; pfn < end_pfn; pfn++) {
/*
* There can be holes in boot-time mem_map[]s
@@ -3464,8 +3464,8 @@ static void __meminit zone_init_free_lists(struct zone *zone)
}

#ifndef __HAVE_ARCH_MEMMAP_INIT
-#define memmap_init(size, nid, zone, start_pfn) \
- memmap_init_zone((size), (nid), (zone), (start_pfn), MEMMAP_EARLY)
+#define memmap_init(size, nid, region, zone, start_pfn) \
+ memmap_init_zone((size), (nid), (region), (zone), (start_pfn), MEMMAP_EARLY)
#endif

static int zone_batchsize(struct zone *zone)
@@ -3584,7 +3584,7 @@ static noinline __init_refok
int zone_wait_table_init(struct zone *zone, unsigned long zone_size_pages)
{
int i;
- struct pglist_data *pgdat = zone->zone_pgdat;
+ struct pglist_data *pgdat = NODE_DATA(zone->node);
size_t alloc_size;

/*
@@ -3670,20 +3670,22 @@ __meminit int init_currently_empty_zone(struct zone *zone,
enum memmap_context context)
{
struct pglist_data *pgdat = zone->zone_pgdat;
+ struct mem_region_list_data *mem_region = zone->zone_mem_region;
int ret;
ret = zone_wait_table_init(zone, size);
if (ret)
return ret;
pgdat->nr_zones = zone_idx(zone) + 1;
+ mem_region->nr_zones = zone_idx(zone) + 1;

zone->zone_start_pfn = zone_start_pfn;
-
+/*
mminit_dprintk(MMINIT_TRACE, "memmap_init",
"Initialising map node %d zone %lu pfns %lu -> %lu\n",
pgdat->node_id,
(unsigned long)zone_idx(zone),
zone_start_pfn, (zone_start_pfn + size));
-
+*/
zone_init_free_lists(zone);

return 0;
@@ -4365,7 +4367,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
ret = init_currently_empty_zone(zone, zone_start_pfn,
size, MEMMAP_EARLY);
BUG_ON(ret);
- memmap_init(size, nid, j, zone_start_pfn);
+ memmap_init(size, nid, i, j, zone_start_pfn);
zone_start_pfn += size;
}
}
@@ -4412,6 +4414,9 @@ static void __init_refok alloc_node_mem_map(struct pglist_data *pgdat)
#endif /* CONFIG_FLAT_NODE_MEM_MAP */
}

+/* TO DO: This routine needs modification, but not required for panda board
+ * to start with
+ */
void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
unsigned long node_start_pfn, unsigned long *zholes_size)
{
@@ -5014,24 +5019,28 @@ static void calculate_totalreserve_pages(void)
struct pglist_data *pgdat;
unsigned long reserve_pages = 0;
enum zone_type i, j;
+ int p;

for_each_online_pgdat(pgdat) {
for (i = 0; i < MAX_NR_ZONES; i++) {
- struct zone *zone = pgdat->node_zones + i;
- unsigned long max = 0;
-
- /* Find valid and maximum lowmem_reserve in the zone */
- for (j = i; j < MAX_NR_ZONES; j++) {
- if (zone->lowmem_reserve[j] > max)
- max = zone->lowmem_reserve[j];
- }
+ for_each_mem_region_in_nid(p, pgdat->node_id) {
+ mem_region_t *mem_region = &pgdat->mem_regions[p];
+ struct zone *zone = mem_region->zones + i;
+ unsigned long max = 0;
+
+ /* Find valid and maximum lowmem_reserve in the zone */
+ for (j = i; j < MAX_NR_ZONES; j++) {
+ if (zone->lowmem_reserve[j] > max)
+ max = zone->lowmem_reserve[j];
+ }

- /* we treat the high watermark as reserved pages. */
- max += high_wmark_pages(zone);
+ /* we treat the high watermark as reserved pages. */
+ max += high_wmark_pages(zone);

- if (max > zone->present_pages)
- max = zone->present_pages;
- reserve_pages += max;
+ if (max > zone->present_pages)
+ max = zone->present_pages;
+ reserve_pages += max;
+ }
}
}
totalreserve_pages = reserve_pages;
@@ -5047,27 +5056,31 @@ static void setup_per_zone_lowmem_reserve(void)
{
struct pglist_data *pgdat;
enum zone_type j, idx;
+ int p;

for_each_online_pgdat(pgdat) {
for (j = 0; j < MAX_NR_ZONES; j++) {
- struct zone *zone = pgdat->node_zones + j;
- unsigned long present_pages = zone->present_pages;
+ for_each_mem_region_in_nid(p, pgdat->node_id) {
+ mem_region_t *mem_region = &pgdat->mem_regions[p];
+ struct zone *zone = mem_region->zones + j;
+ unsigned long present_pages = zone->present_pages;

- zone->lowmem_reserve[j] = 0;
+ zone->lowmem_reserve[j] = 0;

- idx = j;
- while (idx) {
- struct zone *lower_zone;
+ idx = j;
+ while (idx) {
+ struct zone *lower_zone;

- idx--;
+ idx--;

- if (sysctl_lowmem_reserve_ratio[idx] < 1)
- sysctl_lowmem_reserve_ratio[idx] = 1;
+ if (sysctl_lowmem_reserve_ratio[idx] < 1)
+ sysctl_lowmem_reserve_ratio[idx] = 1;

- lower_zone = pgdat->node_zones + idx;
- lower_zone->lowmem_reserve[j] = present_pages /
- sysctl_lowmem_reserve_ratio[idx];
- present_pages += lower_zone->present_pages;
+ lower_zone = mem_region->zones + idx;
+ lower_zone->lowmem_reserve[j] = present_pages /
+ sysctl_lowmem_reserve_ratio[idx];
+ present_pages += lower_zone->present_pages;
+ }
}
}
}
--
1.7.4

2011-05-27 12:32:12

by Ankita Garg

Subject: [PATCH 05/10] mm: Create zonelists

The default zonelist that is node ordered contains all zones from within a
node and then all zones from the next node and so on. By introducing memory
regions, the primary aim is to group memory allocations to a given area of
memory together. The modified zonelists thus contain all zones from one
region, followed by all zones from the next region and so on. This ensures
that all the memory in one region is allocated before going over to the next
region, unless targeted memory allocations are performed.
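
As an illustrative check (not part of the patch), walking the general zonelist
with the existing iterator would show all zones of region 0 ahead of those of
region 1, and so on:

/* Illustrative dump of the region-ordered zonelist; uses only existing
 * iterators plus the zone->zone_mem_region link from patch 01.
 */
static void dump_region_ordered_zonelist(pg_data_t *pgdat)
{
	struct zonelist *zonelist = &pgdat->node_zonelists[0];
	struct zoneref *z;
	struct zone *zone;

	for_each_zone_zonelist(zone, z, zonelist, MAX_NR_ZONES - 1)
		printk(KERN_DEBUG "region %d zone %s\n",
		       zone->zone_mem_region->region, zone->name);
}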

Signed-off-by: Ankita Garg <[email protected]>
---
mm/page_alloc.c | 62 +++++++++++++++++++++++++++++++++---------------------
1 files changed, 38 insertions(+), 24 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c48635..da8b045 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2668,20 +2668,26 @@ static int build_zonelists_node(pg_data_t *pgdat, struct zonelist *zonelist,
int nr_zones, enum zone_type zone_type)
{
struct zone *zone;
+ int nid = pgdat->node_id;
+ int i;
+ enum zone_type z_type = zone_type;

BUG_ON(zone_type >= MAX_NR_ZONES);
zone_type++;

- do {
- zone_type--;
- zone = pgdat->node_zones + zone_type;
- if (populated_zone(zone)) {
- zoneref_set_zone(zone,
- &zonelist->_zonerefs[nr_zones++]);
- check_highest_zone(zone_type);
- }
-
- } while (zone_type);
+ for_each_mem_region_in_nid(i, nid) {
+ mem_region_t *mem_region = &pgdat->mem_regions[i];
+ do {
+ zone_type--;
+ zone = mem_region->zones + zone_type;
+ if (populated_zone(zone)) {
+ zoneref_set_zone(zone,
+ &zonelist->_zonerefs[nr_zones++]);
+ check_highest_zone(zone_type);
+ }
+ } while (zone_type);
+ zone_type = z_type + 1;
+ }
return nr_zones;
}

@@ -2898,7 +2904,7 @@ static int node_order[MAX_NUMNODES];

static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)
{
- int pos, j, node;
+ int pos, j, node, p;
int zone_type; /* needs to be signed */
struct zone *z;
struct zonelist *zonelist;
@@ -2922,7 +2928,7 @@ static void build_zonelists_in_zone_order(pg_data_t *pgdat, int nr_nodes)

static int default_zonelist_order(void)
{
- int nid, zone_type;
+ int nid, zone_type, i;
unsigned long low_kmem_size,total_size;
struct zone *z;
int average_size;
@@ -2937,12 +2943,16 @@ static int default_zonelist_order(void)
total_size = 0;
for_each_online_node(nid) {
for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
- z = &NODE_DATA(nid)->node_zones[zone_type];
- if (populated_zone(z)) {
- if (zone_type < ZONE_NORMAL)
- low_kmem_size += z->present_pages;
- total_size += z->present_pages;
- } else if (zone_type == ZONE_NORMAL) {
+ for_each_mem_region_in_nid(i, nid) {
+ mem_region_t *mem_region = &(NODE_DATA(nid)->mem_regions[i]);
+ z = &mem_region->zones[zone_type];
+ if (populated_zone(z)) {
+ if (zone_type < ZONE_NORMAL)
+ low_kmem_size +=
+ z->present_pages;
+
+ total_size += z->present_pages;
+ } else if (zone_type == ZONE_NORMAL) {
/*
* If any node has only lowmem, then node order
* is preferred to allow kernel allocations
@@ -2950,7 +2960,8 @@ static int default_zonelist_order(void)
* on other nodes when there is an abundance of
* lowmem available to allocate from.
*/
- return ZONELIST_ORDER_NODE;
+ return ZONELIST_ORDER_NODE;
+ }
}
}
}
@@ -2968,11 +2979,14 @@ static int default_zonelist_order(void)
low_kmem_size = 0;
total_size = 0;
for (zone_type = 0; zone_type < MAX_NR_ZONES; zone_type++) {
- z = &NODE_DATA(nid)->node_zones[zone_type];
- if (populated_zone(z)) {
- if (zone_type < ZONE_NORMAL)
- low_kmem_size += z->present_pages;
- total_size += z->present_pages;
+ for_each_mem_region_in_nid(i, nid) {
+ mem_region_t *mem_region = &(NODE_DATA(nid)->mem_regions[i]);
+ z = &mem_region->zones[zone_type];
+ if (populated_zone(z)) {
+ if (zone_type < ZONE_NORMAL)
+ low_kmem_size += z->present_pages;
+ total_size += z->present_pages;
+ }
}
}
if (low_kmem_size &&
--
1.7.4

2011-05-27 12:32:13

by Ankita Garg

Subject: [PATCH 06/10] mm: Verify zonelists

Verify that the zonelists were created appropriately. Below is the dmesg
output verifying the creation of the zonelists. 4 regions, each of size 512MB,
were created on the Samsung Orion/Exynos board (the board has 2GB RAM).

The regions were created as follows:

created region 0 in nid 0 start pfn 262144 spanned pages 131072
created region 1 in nid 0 start pfn 393216 spanned pages 131072
created region 2 in nid 0 start pfn 524288 spanned pages 131072
created region 3 in nid 0 start pfn 655360 spanned pages 57344

mminit::zonelist general 0:Normal = 0:Normal 0:Normal 0:Normal 0:Normal
mminit::zonelist general 0:Normal = 0:Normal 0:Normal 0:Normal 0:Normal
mminit::zonelist general 0:Normal = 0:Normal 0:Normal 0:Normal 0:Normal
mminit::zonelist general 0:Normal = 0:Normal 0:Normal 0:Normal 0:Normal

Since 4 zones are now present inside a node, the above shows 4 zonelists
being created.

Signed-off-by: Ankita Garg <[email protected]>
---
mm/mm_init.c | 51 +++++++++++++++++++++++++++------------------------
1 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/mm/mm_init.c b/mm/mm_init.c
index 4e0e265..77468f8 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -21,44 +21,47 @@ int mminit_loglevel;
/* The zonelists are simply reported, validation is manual. */
void mminit_verify_zonelist(void)
{
- int nid;
+ int nid, p;

if (mminit_loglevel < MMINIT_VERIFY)
return;

for_each_online_node(nid) {
pg_data_t *pgdat = NODE_DATA(nid);
- struct zone *zone;
- struct zoneref *z;
- struct zonelist *zonelist;
- int i, listid, zoneid;
-
- BUG_ON(MAX_ZONELISTS > 2);
- for (i = 0; i < MAX_ZONELISTS * MAX_NR_ZONES; i++) {
-
- /* Identify the zone and nodelist */
- zoneid = i % MAX_NR_ZONES;
- listid = i / MAX_NR_ZONES;
- zonelist = &pgdat->node_zonelists[listid];
- zone = &pgdat->node_zones[zoneid];
- if (!populated_zone(zone))
- continue;
-
- /* Print information about the zonelist */
- printk(KERN_DEBUG "mminit::zonelist %s %d:%s = ",
- listid > 0 ? "thisnode" : "general", nid,
- zone->name);
-
- /* Iterate the zonelist */
- for_each_zone_zonelist(zone, z, zonelist, zoneid) {
+ for_each_mem_region_in_nid(p, nid) {
+ mem_region_t *mem_region = &(NODE_DATA(nid)->mem_regions[p]);
+ struct zone *zone;
+ struct zoneref *z;
+ struct zonelist *zonelist;
+ int i, listid, zoneid;
+
+ BUG_ON(MAX_ZONELISTS > 2);
+ for (i = 0; i < MAX_ZONELISTS * MAX_NR_ZONES; i++) {
+
+ /* Identify the zone and nodelist */
+ zoneid = i % MAX_NR_ZONES;
+ listid = i / MAX_NR_ZONES;
+ zonelist = &pgdat->node_zonelists[listid];
+ zone = &mem_region->zones[zoneid];
+ if (!populated_zone(zone))
+ continue;
+
+ /* Print information about the zonelist */
+ printk(KERN_DEBUG "mminit::zonelist %s %d:%s = ",
+ listid > 0 ? "thisnode" : "general", nid,
+ zone->name);
+
+ /* Iterate the zonelist */
+ for_each_zone_zonelist(zone, z, zonelist, zoneid) {
#ifdef CONFIG_NUMA
- printk(KERN_CONT "%d:%s ",
- zone->node, zone->name);
+ printk(KERN_CONT "%d:%s ",
+ zone->node, zone->name);
#else
- printk(KERN_CONT "0:%s ", zone->name);
+ printk(KERN_CONT "0:%s ", zone->name);
#endif /* CONFIG_NUMA */
+ }
+ printk(KERN_CONT "\n");
}
- printk(KERN_CONT "\n");
}
}
}
--
1.7.4

2011-05-27 12:32:10

by Ankita Garg

Subject: [PATCH 07/10] mm: Modify vmstat

Change the way vmstats are collected. Since the zones are now present inside
regions, scan through all the regions to obtain zone-specific statistics.
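
In essence, a node-wide statistic becomes the per-zone statistic summed over
every region of the node; a simplified sketch of the summation (the
config-dependent DMA/DMA32/HIGHMEM terms are omitted here):

/* Simplified sketch of the per-node summation across regions. */
static unsigned long node_page_state_sketch(int node, enum zone_stat_item item)
{
	unsigned long total = 0;
	int i;

	for_each_mem_region_in_nid(i, node) {
		struct zone *zones = NODE_DATA(node)->mem_regions[i].zones;

		total += zone_page_state(&zones[ZONE_NORMAL], item) +
			 zone_page_state(&zones[ZONE_MOVABLE], item);
	}
	return total;
}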

Signed-off-by: Ankita Garg <[email protected]>
---
include/linux/vmstat.h | 22 +++++++++++++++-------
mm/vmstat.c | 48 ++++++++++++++++++++++++++++--------------------
2 files changed, 43 insertions(+), 27 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 2b3831b..296b9ad 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -211,20 +211,28 @@ extern unsigned long zone_reclaimable_pages(struct zone *zone);
static inline unsigned long node_page_state(int node,
enum zone_stat_item item)
{
- struct zone *zones = NODE_DATA(node)->node_zones;
+ int i;
+ unsigned long page_state = 0;
+
+ for_each_mem_region_in_nid(i, node) {
+ mr_data_t *mrdat = &(NODE_DATA(node)->mem_regions[i]);
+ struct zone *zones = mrdat->zones;
+
+ page_state =

- return
#ifdef CONFIG_ZONE_DMA
- zone_page_state(&zones[ZONE_DMA], item) +
+ zone_page_state(&zones[ZONE_DMA], item) +
#endif
#ifdef CONFIG_ZONE_DMA32
- zone_page_state(&zones[ZONE_DMA32], item) +
+ zone_page_state(&zones[ZONE_DMA32], item) +
#endif
#ifdef CONFIG_HIGHMEM
- zone_page_state(&zones[ZONE_HIGHMEM], item) +
+ zone_page_state(&zones[ZONE_HIGHMEM], item) +
#endif
- zone_page_state(&zones[ZONE_NORMAL], item) +
- zone_page_state(&zones[ZONE_MOVABLE], item);
+ zone_page_state(&zones[ZONE_NORMAL], item) +
+ zone_page_state(&zones[ZONE_MOVABLE], item);
+ }
+ return page_state;
}

extern void zone_statistics(struct zone *, struct zone *, gfp_t gfp);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 897ea9e..542f8b6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -191,17 +191,21 @@ void set_pgdat_percpu_threshold(pg_data_t *pgdat,
struct zone *zone;
int cpu;
int threshold;
- int i;
+ int i, p;

for (i = 0; i < pgdat->nr_zones; i++) {
- zone = &pgdat->node_zones[i];
- if (!zone->percpu_drift_mark)
- continue;
-
- threshold = (*calculate_pressure)(zone);
- for_each_possible_cpu(cpu)
- per_cpu_ptr(zone->pageset, cpu)->stat_threshold
- = threshold;
+ for_each_mem_region_in_nid(p, pgdat->node_id) {
+ mem_region_t *mem_region = &pgdat->mem_regions[p];
+ struct zone *zone = mem_region->zones + i;
+
+ if (!zone->percpu_drift_mark)
+ continue;
+
+ threshold = (*calculate_pressure)(zone);
+ for_each_possible_cpu(cpu)
+ per_cpu_ptr(zone->pageset, cpu)->stat_threshold
+ = threshold;
+ }
}
}

@@ -642,19 +646,23 @@ static void frag_stop(struct seq_file *m, void *arg)

/* Walk all the zones in a node and print using a callback */
static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
- void (*print)(struct seq_file *m, pg_data_t *, struct zone *))
+ void (*print)(struct seq_file *m, pg_data_t *,
+ mem_region_t *, struct zone *))
{
- struct zone *zone;
- struct zone *node_zones = pgdat->node_zones;
unsigned long flags;
-
- for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
- if (!populated_zone(zone))
- continue;
-
- spin_lock_irqsave(&zone->lock, flags);
- print(m, pgdat, zone);
- spin_unlock_irqrestore(&zone->lock, flags);
+ int i, j;
+
+ for (i = 0; i < MAX_NR_ZONES; ++i) {
+ for_each_mem_region_in_nid(j, pgdat->node_id) {
+ mem_region_t *mem_region = &pgdat->mem_regions[j];
+ struct zone *zone = mem_region->zones + i;
+ if (!populated_zone(zone))
+ continue;
+
+ spin_lock_irqsave(&zone->lock, flags);
+ print(m, pgdat, mem_region, zone);
+ spin_unlock_irqrestore(&zone->lock, flags);
+ }
}
}
#endif
--
1.7.4

2011-05-27 12:32:58

by Ankita Garg

Subject: [PATCH 08/10] mm: Modify vmscan

Modify vmscan to take into account the changed node-zone hierarchy.
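
The transformation applied throughout is the same: loops that indexed
pgdat->node_zones now visit that zone index in every region of the node. A
minimal sketch of the pattern (mirroring the pgdat_balanced() hunk below):

/* Sketch of the per-zone-index, per-region iteration pattern used in
 * this patch; compare with the pgdat_balanced() hunk below.
 */
static unsigned long present_pages_upto(pg_data_t *pgdat, int classzone_idx)
{
	unsigned long present_pages = 0;
	int i, p;

	for (i = 0; i <= classzone_idx; i++) {
		for_each_mem_region_in_nid(p, pgdat->node_id) {
			mem_region_t *mem_region = &pgdat->mem_regions[p];

			present_pages += mem_region->zones[i].present_pages;
		}
	}
	return present_pages;
}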

Signed-off-by: Ankita Garg <[email protected]>
---
mm/vmscan.c | 284 ++++++++++++++++++++++++++++++++---------------------------
1 files changed, 153 insertions(+), 131 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8bfd450..2e11974 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2235,10 +2235,16 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
int classzone_idx)
{
unsigned long present_pages = 0;
- int i;
-
- for (i = 0; i <= classzone_idx; i++)
- present_pages += pgdat->node_zones[i].present_pages;
+ int i, p;
+
+ for (i = 0; i <= classzone_idx; i++) {
+ for_each_mem_region_in_nid(p, pgdat->node_id) {
+ mem_region_t *mem_region = &pgdat->mem_regions[p];
+ struct zone *zone = mem_region->zones + i;
+
+ present_pages += zone->present_pages;
+ }
+ }

return balanced_pages > (present_pages >> 2);
}
@@ -2247,7 +2253,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, unsigned long balanced_pages,
static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
int classzone_idx)
{
- int i;
+ int i, j;
unsigned long balanced = 0;
bool all_zones_ok = true;

@@ -2257,29 +2263,31 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,

/* Check the watermark levels */
for (i = 0; i < pgdat->nr_zones; i++) {
- struct zone *zone = pgdat->node_zones + i;
+ for_each_mem_region_in_nid(j, pgdat->node_id) {
+ mem_region_t *mem_region = &pgdat->mem_regions[j];
+ struct zone *zone = mem_region->zones + i;

- if (!populated_zone(zone))
- continue;
+ if (!populated_zone(zone))
+ continue;

- /*
- * balance_pgdat() skips over all_unreclaimable after
- * DEF_PRIORITY. Effectively, it considers them balanced so
- * they must be considered balanced here as well if kswapd
- * is to sleep
- */
- if (zone->all_unreclaimable) {
- balanced += zone->present_pages;
- continue;
- }
+ /*
+ * balance_pgdat() skips over all_unreclaimable after
+ * DEF_PRIORITY. Effectively, it considers them balanced so
+ * they must be considered balanced here as well if kswapd
+ * is to sleep
+ */
+ if (zone->all_unreclaimable) {
+ balanced += zone->present_pages;
+ continue;
+ }

- if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
- classzone_idx, 0))
- all_zones_ok = false;
- else
- balanced += zone->present_pages;
+ if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone),
+ classzone_idx, 0))
+ all_zones_ok = false;
+ else
+ balanced += zone->present_pages;
+ }
}
-
/*
* For high-order requests, the balanced zones must contain at least
* 25% of the nodes pages for kswapd to sleep. For order-0, all zones
@@ -2318,7 +2326,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
int all_zones_ok;
unsigned long balanced;
int priority;
- int i;
+ int i, p;
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long total_scanned;
struct reclaim_state *reclaim_state = current->reclaim_state;
@@ -2357,36 +2365,42 @@ loop_again:
* zone which needs scanning
*/
for (i = pgdat->nr_zones - 1; i >= 0; i--) {
- struct zone *zone = pgdat->node_zones + i;
+ for_each_mem_region_in_nid(p, pgdat->node_id) {
+ mem_region_t *mem_region = &pgdat->mem_regions[p];
+ struct zone *zone = mem_region->zones + i;

- if (!populated_zone(zone))
- continue;
+ if (!populated_zone(zone))
+ continue;

- if (zone->all_unreclaimable && priority != DEF_PRIORITY)
- continue;
+ if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+ continue;

- /*
- * Do some background aging of the anon list, to give
- * pages a chance to be referenced before reclaiming.
- */
- if (inactive_anon_is_low(zone, &sc))
- shrink_active_list(SWAP_CLUSTER_MAX, zone,
- &sc, priority, 0);
-
- if (!zone_watermark_ok_safe(zone, order,
- high_wmark_pages(zone), 0, 0)) {
- end_zone = i;
- *classzone_idx = i;
- break;
+ /*
+ * Do some background aging of the anon list, to give
+ * pages a chance to be referenced before reclaiming.
+ */
+ if (inactive_anon_is_low(zone, &sc))
+ shrink_active_list(SWAP_CLUSTER_MAX, zone,
+ &sc, priority, 0);
+
+ if (!zone_watermark_ok_safe(zone, order,
+ high_wmark_pages(zone), 0, 0)) {
+ end_zone = i;
+ *classzone_idx = i;
+ break;
+ }
}
}
if (i < 0)
goto out;

for (i = 0; i <= end_zone; i++) {
- struct zone *zone = pgdat->node_zones + i;
+ for_each_mem_region_in_nid(p, pgdat->node_id) {
+ mem_region_t *mem_region = &pgdat->mem_regions[p];
+ struct zone *zone = mem_region->zones + i;

- lru_pages += zone_reclaimable_pages(zone);
+ lru_pages += zone_reclaimable_pages(zone);
+ }
}

/*
@@ -2399,84 +2413,86 @@ loop_again:
* cause too much scanning of the lower zones.
*/
for (i = 0; i <= end_zone; i++) {
- struct zone *zone = pgdat->node_zones + i;
- int nr_slab;
- unsigned long balance_gap;
-
- if (!populated_zone(zone))
- continue;
+ for_each_mem_region_in_nid(p, pgdat->node_id) {
+ mem_region_t *mem_region = &pgdat->mem_regions[p];
+ struct zone *zone = mem_region->zones + i;
+ int nr_slab;
+ unsigned long balance_gap;

- if (zone->all_unreclaimable && priority != DEF_PRIORITY)
- continue;
+ if (!populated_zone(zone))
+ continue;

- sc.nr_scanned = 0;
+ if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+ continue;

- /*
- * Call soft limit reclaim before calling shrink_zone.
- * For now we ignore the return value
- */
- mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);
+ sc.nr_scanned = 0;

- /*
- * We put equal pressure on every zone, unless
- * one zone has way too many pages free
- * already. The "too many pages" is defined
- * as the high wmark plus a "gap" where the
- * gap is either the low watermark or 1%
- * of the zone, whichever is smaller.
- */
- balance_gap = min(low_wmark_pages(zone),
- (zone->present_pages +
- KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
- KSWAPD_ZONE_BALANCE_GAP_RATIO);
- if (!zone_watermark_ok_safe(zone, order,
- high_wmark_pages(zone) + balance_gap,
- end_zone, 0))
- shrink_zone(priority, zone, &sc);
- reclaim_state->reclaimed_slab = 0;
- nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
- lru_pages);
- sc.nr_reclaimed += reclaim_state->reclaimed_slab;
- total_scanned += sc.nr_scanned;
-
- if (zone->all_unreclaimable)
- continue;
- if (nr_slab == 0 &&
- !zone_reclaimable(zone))
- zone->all_unreclaimable = 1;
- /*
- * If we've done a decent amount of scanning and
- * the reclaim ratio is low, start doing writepage
- * even in laptop mode
- */
- if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
- total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
- sc.may_writepage = 1;
+ /*
+ * Call soft limit reclaim before calling shrink_zone.
+ * For now we ignore the return value
+ */
+ mem_cgroup_soft_limit_reclaim(zone, order, sc.gfp_mask);

- if (!zone_watermark_ok_safe(zone, order,
- high_wmark_pages(zone), end_zone, 0)) {
- all_zones_ok = 0;
/*
- * We are still under min water mark. This
- * means that we have a GFP_ATOMIC allocation
- * failure risk. Hurry up!
+ * We put equal pressure on every zone, unless
+ * one zone has way too many pages free
+ * already. The "too many pages" is defined
+ * as the high wmark plus a "gap" where the
+ * gap is either the low watermark or 1%
+ * of the zone, whichever is smaller.
*/
+ balance_gap = min(low_wmark_pages(zone),
+ (zone->present_pages +
+ KSWAPD_ZONE_BALANCE_GAP_RATIO-1) /
+ KSWAPD_ZONE_BALANCE_GAP_RATIO);
if (!zone_watermark_ok_safe(zone, order,
- min_wmark_pages(zone), end_zone, 0))
- has_under_min_watermark_zone = 1;
- } else {
+ high_wmark_pages(zone) + balance_gap,
+ end_zone, 0))
+ shrink_zone(priority, zone, &sc);
+ reclaim_state->reclaimed_slab = 0;
+ nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
+ lru_pages);
+ sc.nr_reclaimed += reclaim_state->reclaimed_slab;
+ total_scanned += sc.nr_scanned;
+
+ if (zone->all_unreclaimable)
+ continue;
+ if (nr_slab == 0 &&
+ !zone_reclaimable(zone))
+ zone->all_unreclaimable = 1;
/*
- * If a zone reaches its high watermark,
- * consider it to be no longer congested. It's
- * possible there are dirty pages backed by
- * congested BDIs but as pressure is relieved,
- * spectulatively avoid congestion waits
+ * If we've done a decent amount of scanning and
+ * the reclaim ratio is low, start doing writepage
+ * even in laptop mode
*/
- zone_clear_flag(zone, ZONE_CONGESTED);
- if (i <= *classzone_idx)
- balanced += zone->present_pages;
- }
+ if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
+ total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2)
+ sc.may_writepage = 1;

+ if (!zone_watermark_ok_safe(zone, order,
+ high_wmark_pages(zone), end_zone, 0)) {
+ all_zones_ok = 0;
+ /*
+ * We are still under min water mark. This
+ * means that we have a GFP_ATOMIC allocation
+ * failure risk. Hurry up!
+ */
+ if (!zone_watermark_ok_safe(zone, order,
+ min_wmark_pages(zone), end_zone, 0))
+ has_under_min_watermark_zone = 1;
+ } else {
+ /*
+ * If a zone reaches its high watermark,
+ * consider it to be no longer congested. It's
+ * possible there are dirty pages backed by
+ * congested BDIs but as pressure is relieved,
+ * spectulatively avoid congestion waits
+ */
+ zone_clear_flag(zone, ZONE_CONGESTED);
+ if (i <= *classzone_idx)
+ balanced += zone->present_pages;
+ }
+ }
}
if (all_zones_ok || (order && pgdat_balanced(pgdat, balanced, *classzone_idx)))
break; /* kswapd: all done */
@@ -2542,23 +2558,26 @@ out:
*/
if (order) {
for (i = 0; i <= end_zone; i++) {
- struct zone *zone = pgdat->node_zones + i;
+ for_each_mem_region_in_nid(p, pgdat->node_id) {
+ mem_region_t *mem_region = &pgdat->mem_regions[p];
+ struct zone *zone = mem_region->zones + i;

- if (!populated_zone(zone))
- continue;
+ if (!populated_zone(zone))
+ continue;

- if (zone->all_unreclaimable && priority != DEF_PRIORITY)
- continue;
+ if (zone->all_unreclaimable && priority != DEF_PRIORITY)
+ continue;

- /* Confirm the zone is balanced for order-0 */
- if (!zone_watermark_ok(zone, 0,
- high_wmark_pages(zone), 0, 0)) {
- order = sc.order = 0;
- goto loop_again;
- }
+ /* Confirm the zone is balanced for order-0 */
+ if (!zone_watermark_ok(zone, 0,
+ high_wmark_pages(zone), 0, 0)) {
+ order = sc.order = 0;
+ goto loop_again;
+ }

- /* If balanced, clear the congested flag */
- zone_clear_flag(zone, ZONE_CONGESTED);
+ /* If balanced, clear the congested flag */
+ zone_clear_flag(zone, ZONE_CONGESTED);
+ }
}
}

@@ -3304,18 +3323,21 @@ static ssize_t write_scan_unevictable_node(struct sys_device *dev,
struct sysdev_attribute *attr,
const char *buf, size_t count)
{
- struct zone *node_zones = NODE_DATA(dev->id)->node_zones;
- struct zone *zone;
unsigned long res;
+ int i,j;
unsigned long req = strict_strtoul(buf, 10, &res);

if (!req)
return 1; /* zero is no-op */

- for (zone = node_zones; zone - node_zones < MAX_NR_ZONES; ++zone) {
- if (!populated_zone(zone))
- continue;
- scan_zone_unevictable_pages(zone);
+ for (j = 0; j < MAX_NR_ZONES; ++j) {
+ for_each_mem_region_in_nid(i, dev->id) {
+ mem_region_t *mem_region = &(NODE_DATA(dev->id)->mem_regions[i]);
+ struct zone *zone = mem_region->zones;
+ if (!populated_zone(zone))
+ continue;
+ scan_zone_unevictable_pages(zone);
+ }
}
return 1;
}
--
1.7.4

2011-05-27 12:32:06

by Ankita Garg

Subject: [PATCH 09/10] mm: Reflect memory region changes in zoneinfo

This patch modifies the output of /proc/zoneinfo to take the memory regions
into account. Below is the output on the Samsung board booted with 4
regions, each of size 512MB.

# cat /proc/zoneinfo
Node 0, Region 0, zone Normal
pages free 124570
min 388
low 485
high 582
scanned 0
spanned 131072
present 130048
nr_free_pages 124570
nr_inactive_anon 0
nr_active_anon 92
nr_inactive_file 454
nr_active_file 190
nr_unevictable 0
nr_mlock 0
nr_anon_pages 95
nr_mapped 290
nr_file_pages 647
nr_dirty 1
nr_writeback 0
nr_slab_reclaimable 33
nr_slab_unreclaimable 428
nr_page_table_pages 4
nr_kernel_stack 20
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
nr_dirtied 12
nr_written 11
nr_anon_transparent_hugepages 0
protection: (0, 0)
pagesets
cpu: 0
count: 48
high: 186
batch: 31
vm stats threshold: 6
all_unreclaimable: 0
start_pfn: 262144
inactive_ratio: 1
Node 0, Region 1, zone Normal
pages free 131072
min 388
low 485
high 582
scanned 0
spanned 131072
present 130048
nr_free_pages 131072
.....
Node 0, Region 2, zone Normal
pages free 131072
min 388
low 485
high 582
scanned 0
spanned 131072
present 130048
nr_free_pages 131072
.....
Node 0, Region 3, zone Normal
pages free 57332
min 170
low 212
high 255
scanned 0
spanned 57344
present 56896
nr_free_pages 57332
.....
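
For orientation (this is a sketch, not part of the hunks below, which only
show the print-callback changes): a region-aware walk_zones_in_node() along
the following lines would drive the per-region output, assuming the
pgdat->mem_regions[] array and the for_each_mem_region_in_nid() helper
introduced earlier in this series:

static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
                void (*print)(struct seq_file *m, pg_data_t *, mem_region_t *,
                              struct zone *))
{
        int i, p;
        unsigned long flags;

        /* Sketch only: visit every populated zone of every region in the node. */
        for_each_mem_region_in_nid(p, pgdat->node_id) {
                mem_region_t *mem_region = &pgdat->mem_regions[p];

                for (i = 0; i < MAX_NR_ZONES; i++) {
                        struct zone *zone = &mem_region->zones[i];

                        if (!populated_zone(zone))
                                continue;
                        spin_lock_irqsave(&zone->lock, flags);
                        print(m, pgdat, mem_region, zone);
                        spin_unlock_irqrestore(&zone->lock, flags);
                }
        }
}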

Signed-off-by: Ankita Garg <[email protected]>
---
mm/vmstat.c | 29 +++++++++++++++++------------
1 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 542f8b6..153e25b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -179,16 +179,18 @@ static void refresh_zone_stat_thresholds(void)
*/
tolerate_drift = low_wmark_pages(zone) - min_wmark_pages(zone);
max_drift = num_online_cpus() * threshold;
- if (max_drift > tolerate_drift)
+ if (max_drift > tolerate_drift) {
zone->percpu_drift_mark = high_wmark_pages(zone) +
max_drift;
+ printk("zone %s drift mark %lu \n", zone->name,
+ zone->percpu_drift_mark);
+ }
}
}

void set_pgdat_percpu_threshold(pg_data_t *pgdat,
int (*calculate_pressure)(struct zone *))
{
- struct zone *zone;
int cpu;
int threshold;
int i, p;
@@ -669,11 +671,12 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,

#ifdef CONFIG_PROC_FS
static void frag_show_print(struct seq_file *m, pg_data_t *pgdat,
- struct zone *zone)
+ mem_region_t *mem_region, struct zone *zone)
{
int order;

- seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
+ seq_printf(m, "Node %d, REG %d, zone %8s ", pgdat->node_id,
+ mem_region->region, zone->name);
for (order = 0; order < MAX_ORDER; ++order)
seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
seq_putc(m, '\n');
@@ -689,14 +692,15 @@ static int frag_show(struct seq_file *m, void *arg)
return 0;
}

-static void pagetypeinfo_showfree_print(struct seq_file *m,
- pg_data_t *pgdat, struct zone *zone)
+static void pagetypeinfo_showfree_print(struct seq_file *m, pg_data_t *pgdat,
+ mem_region_t *mem_region, struct zone *zone)
{
int order, mtype;

for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
- seq_printf(m, "Node %4d, zone %8s, type %12s ",
+ seq_printf(m, "Node %4d, Region %d, zone %8s, type %12s ",
pgdat->node_id,
+ mem_region->region,
zone->name,
migratetype_names[mtype]);
for (order = 0; order < MAX_ORDER; ++order) {
@@ -731,8 +735,8 @@ static int pagetypeinfo_showfree(struct seq_file *m, void *arg)
return 0;
}

-static void pagetypeinfo_showblockcount_print(struct seq_file *m,
- pg_data_t *pgdat, struct zone *zone)
+static void pagetypeinfo_showblockcount_print(struct seq_file *m, pg_data_t *pgdat,
+ mem_region_t *mem_region, struct zone *zone)
{
int mtype;
unsigned long pfn;
@@ -759,7 +763,7 @@ static void pagetypeinfo_showblockcount_print(struct seq_file *m,
}

/* Print counts */
- seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
+ seq_printf(m, "Node %d, Region %d, zone %8s ", pgdat->node_id, mem_region->region, zone->name);
for (mtype = 0; mtype < MIGRATE_TYPES; mtype++)
seq_printf(m, "%12lu ", count[mtype]);
seq_putc(m, '\n');
@@ -969,10 +973,11 @@ static const char * const vmstat_text[] = {
};

static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
- struct zone *zone)
+ mem_region_t *mem_region, struct zone *zone)
{
int i;
- seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
+ seq_printf(m, "Node %d, Region %d, zone %8s", pgdat->node_id,
+ mem_region->region, zone->name);
seq_printf(m,
"\n pages free %lu"
"\n min %lu"
--
1.7.4

2011-05-27 12:32:08

by Ankita Garg

[permalink] [raw]
Subject: [PATCH 10/10] mm: Create memory regions at boot-up

Memory regions are created at boot-up time, from the information obtained
from the firmware. This patchset was developed on an ARM platform, on which
at present the u-boot bootloader does not export information about memory
units that can be independently power managed. For the purpose of
demonstration, two hard-coded memory regions are created, of 256MB each, on
the Panda board with 512MB RAM.

Signed-off-by: Ankita Garg <[email protected]>
---
include/linux/mmzone.h | 8 +++-----
mm/page_alloc.c | 29 +++++++++++++++++++++++++++++
2 files changed, 32 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index bc3e3fd..5dbe1e1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -627,14 +627,12 @@ typedef struct mem_region_list_data {
*/
struct bootmem_data;
typedef struct pglist_data {
-/* The linkage to node_zones is now removed. The new hierarchy introduced
- * is pg_data_t -> mem_region -> zones
- * struct zone node_zones[MAX_NR_ZONES];
- */
struct zonelist node_zonelists[MAX_ZONELISTS];
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
- struct page *node_mem_map;
+ strs pg_data_t -> mem_region -> zones
+ * struct zone node_zones[MAX_NR_ZONES];
+ */uct page *node_mem_map;
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
struct page_cgroup *node_page_cgroup;
#endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index da8b045..3d994e8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4285,6 +4285,34 @@ static inline int pageblock_default_order(unsigned int order)

#endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */

+#define REGIONS_SIZE (512 << 20) >> PAGE_SHIFT
+
+static void init_node_memory_regions(struct pglist_data *pgdat)
+{
+ int cnt = 0;
+ unsigned long i;
+ unsigned long start_pfn = pgdat->node_start_pfn;
+ unsigned long spanned_pages = pgdat->node_spanned_pages;
+ unsigned long total = 0;
+
+ for (i = start_pfn; i < start_pfn + spanned_pages; i+= REGIONS_SIZE) {
+ mem_region_t *mem_region = &pgdat->mem_regions[cnt];
+
+ mem_region->start_pfn = i;
+ if ((spanned_pages - total) <= REGIONS_SIZE) {
+ mem_region->spanned_pages = spanned_pages - total;
+ }
+ else
+ mem_region->spanned_pages = REGIONS_SIZE;
+
+ mem_region->node = pgdat->node_id;
+ mem_region->region = cnt;
+ pgdat->nr_mem_regions++;
+ total += mem_region->spanned_pages;
+ cnt++;
+ }
+}
+
/*
* Set up the zone data structures:
* - mark all pages reserved
@@ -4447,6 +4475,7 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
(unsigned long)pgdat->node_mem_map);
#endif

+ init_node_memory_regions(pgdat);
free_area_init_core(pgdat, zones_size, zholes_size);
}

--
1.7.4

2011-05-27 15:31:15

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: Introduce the memory regions data structure

On Fri, 2011-05-27 at 18:01 +0530, Ankita Garg wrote:
> +typedef struct mem_region_list_data {
> + struct zone zones[MAX_NR_ZONES];
> + int nr_zones;
> +
> + int node;
> + int region;
> +
> + unsigned long start_pfn;
> + unsigned long spanned_pages;
> +} mem_region_t;
> +
> +#define MAX_NR_REGIONS 16

Don't do the foo_t thing. It's out of style and the pg_data_t is a
dinosaur.

I'm a bit surprised how little discussion of this there is in the patch
descriptions. Why did you choose this structure? What are the
downsides of doing it this way? This effectively breaks up the zone's
LRU into MAX_NR_REGIONS LRUs. What effects does that have?

How big _is_ a 'struct zone' these days? This patch will increase their
effective size by 16x.

Since one distro kernel basically gets run on *EVERYTHING*, what will
MAX_NR_REGIONS be in practice? How many regions are there on the
largest systems that will need this? We're going to be doing many
linear searches and iterations over it, so it's pretty darn important to
know. What does this do to lmbench numbers sensitive to page
allocations?

-- Dave

2011-05-27 18:20:53

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: Introduce the memory regions data structure

* Dave Hansen <[email protected]> [2011-05-27 08:30:03]:

> On Fri, 2011-05-27 at 18:01 +0530, Ankita Garg wrote:
> > +typedef struct mem_region_list_data {
> > + struct zone zones[MAX_NR_ZONES];
> > + int nr_zones;
> > +
> > + int node;
> > + int region;
> > +
> > + unsigned long start_pfn;
> > + unsigned long spanned_pages;
> > +} mem_region_t;
> > +
> > +#define MAX_NR_REGIONS 16
>
> Don't do the foo_t thing. It's out of style and the pg_data_t is a
> dinosaur.
>
> I'm a bit surprised how little discussion of this there is in the patch
> descriptions. Why did you choose this structure? What are the
> downsides of doing it this way? This effectively breaks up the zone's
> LRU in to MAX_NR_REGIONS LRUs. What effects does that have?

This data structure is one of the options, but it definitely has
overheads. One alternative was to use fake-NUMA nodes, which has even
more overhead and user-visible quirks.

The overhead depends on the number of regions actually defined on the
platform. It may be 2-4 on smaller systems. This split is what makes
allocations and reclaim work within these boundaries, using the zone's
active and inactive lists on a per-memory-region basis.

An external structure that just captures the boundaries would have less
overhead, but it does not provide enough hooks to influence the zone
level allocator and reclaim operations.

> How big _is_ a 'struct zone' these days? This patch will increase their
> effective size by 16x.

Yes, this is not good; we should do a runtime allocation for the exact
number of regions that we need. This can be optimized later, once we
settle on the data structure hierarchy with the least overhead for the
purpose.

> Since one distro kernel basically gets run on *EVERYTHING*, what will
> MAX_NR_REGIONS be in practice? How many regions are there on the
> largest systems that will need this? We're going to be doing many
> linear searches and iterations over it, so it's pretty darn important to
> know. What does this do to lmbench numbers sensitive to page
> allocations?

Yep, agreed, we are generally looking at 2-4 regions per node for most
purposes. Also, regions need not be of equal size; they can be large or
small based on platform characteristics, so that we need not fragment
the zones below the level required.

The overall idea is to have a VM data structure that can capture
various boundaries of memory, and enable the allocation and reclaim
logic to target certain areas based on the boundaries and properties
required. NUMA nodes and pgdat are the existing example of capturing
memory distances. The proposed memory regions should capture other,
orthogonal properties and boundaries of memory addresses, similar to
the zone type.
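
As a minimal sketch of that idea (not code from the patchset; it assumes the
mem_region_t quoted above plus the mem_regions[]/nr_mem_regions fields and a
for_each_mem_region_in_nid() helper like the one used elsewhere in the
series), an allocator-side walk in region order would look roughly like:

#define for_each_mem_region_in_nid(p, nid)                              \
        for ((p) = 0; (p) < NODE_DATA(nid)->nr_mem_regions; (p)++)

static struct zone *first_populated_zone_with_free(int nid, int zone_idx)
{
        int p;

        /* Scan regions in index order so the higher regions stay empty longer. */
        for_each_mem_region_in_nid(p, nid) {
                mem_region_t *region = &NODE_DATA(nid)->mem_regions[p];
                struct zone *zone = &region->zones[zone_idx];

                if (populated_zone(zone) &&
                    zone_page_state(zone, NR_FREE_PAGES))
                        return zone;
        }
        return NULL;
}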

Thanks for the quick feedback.

--Vaidy

2011-05-27 21:32:24

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: Introduce the memory regions data structure

On Fri, 2011-05-27 at 23:50 +0530, Vaidyanathan Srinivasan wrote:
> The overall idea is to have a VM data structure that can capture
> various boundaries of memory, and enable the allocations and reclaim
> logic to target certain areas based on the boundaries and properties
> required.

It's worth noting that we already do targeted reclaim on boundaries
other than zones. The lumpy reclaim and memory compaction logically do
the same thing. So, it's at least possible to do this without having
the global LRU designed around the way you want to reclaim.

Also, if you get _too_ dependent on the global LRU, what are you going
to do if our cgroup buddies manage to get cgroup'd pages off the global
LRU?

-- Dave

2011-05-28 07:55:14

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, 27 May 2011 18:01:28 +0530 Ankita Garg <[email protected]> wrote:

> This patchset proposes a generic memory regions infrastructure that can be
> used to tag boundaries of memory blocks which belongs to a specific memory
> power management domain and further enable exploitation of platform memory
> power management capabilities.

A couple of quick thoughts...

I'm seeing no estimate of how much energy we might save when this work
is completed. But saving energy is the entire point of the entire
patchset! So please spend some time thinking about that and update and
maintain the [patch 0/n] description so others can get some idea of the
benefit we might get from all of this. That estimate should include an
estimate of what proportion of machines are likely to have hardware
which can use this feature and in what timeframe.

IOW, if it saves one microwatt on 0.001% of machines, not interested ;)


Also, all this code appears to be enabled on all machines? So machines
which don't have the requisite hardware still carry any additional
overhead which is added here. I can see that ifdeffing a feature like
this would be ghastly but please also have a think about the
implications of this and add that discussion also.

If possible, it would be good to think up some microbenchmarks which
probe the worst-case performance impact and describe those and present
the results. So others can gain an understanding of the runtime costs.

2011-05-28 13:16:18

by Ankita Garg

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

Hi Andrew,

On Sat, May 28, 2011 at 12:56:40AM -0700, Andrew Morton wrote:
> On Fri, 27 May 2011 18:01:28 +0530 Ankita Garg <[email protected]> wrote:
>
> > This patchset proposes a generic memory regions infrastructure that can be
> > used to tag boundaries of memory blocks which belongs to a specific memory
> > power management domain and further enable exploitation of platform memory
> > power management capabilities.
>
> A couple of quick thoughts...
>
> I'm seeing no estimate of how much energy we might save when this work
> is completed. But saving energy is the entire point of the entire
> patchset! So please spend some time thinking about that and update and
> maintain the [patch 0/n] description so others can get some idea of the
> benefit we might get from all of this. That estimate should include an
> estimate of what proportion of machines are likely to have hardware
> which can use this feature and in what timeframe.
>

This patchset is definitely not for inclusion. The intention of this RFC
series is to convey the idea and demonstrate the intricacies of the VM
design. Partial Array Self-Refresh (PASR) is an upcoming technology that
is supported on some platforms today, but will be an important feature
in future platforms to conserve idle power consumed by memory subsystem.
Mobile devices that are predominantly in the standby state can exploit
the PASR feature to partially turn off areas of memory that are free.

Unfortunately, at this point we are unable to provide an estimate of the
power savings, as the hardware platforms do not yet export information
about the underlying memory hardware topology. We are working on this
and hope to have some estimates in a month or two. However, we will
evaluate the performance impact of the changes and share the results.

> IOW, if it saves one microwatt on 0.001% of machines, not interested ;)
>
>
> Also, all this code appears to be enabled on all machines? So machines
> which don't have the requisite hardware still carry any additional
> overhead which is added here. I can see that ifdeffing a feature like
> this would be ghastly but please also have a think about the
> implications of this and add that discussion also.
>
> If possible, it would be good to think up some microbenchmarks which
> probe the worst-case performance impact and describe those and present
> the results. So others can gain an understanding of the runtime costs.
>

--
Regards,
Ankita Garg ([email protected])
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India

Subject: Re: [PATCH 10/10] mm: Create memory regions at boot-up

On 18:01 Fri 27 May , Ankita Garg wrote:
> Memory regions are created at boot up time, from the information obtained
> from the firmware. This patchset was developed on ARM platform, on which at
> present u-boot bootloader does not export information about memory units that
> can be independently power managed. For the purpose of demonstration, 2 hard
> coded memory regions are created, of 256MB each on the Panda board with 512MB
> RAM.
>
> Signed-off-by: Ankita Garg <[email protected]>
> ---
> include/linux/mmzone.h | 8 +++-----
> mm/page_alloc.c | 29 +++++++++++++++++++++++++++++
> 2 files changed, 32 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index bc3e3fd..5dbe1e1 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -627,14 +627,12 @@ typedef struct mem_region_list_data {
> */
> struct bootmem_data;
> typedef struct pglist_data {
> -/* The linkage to node_zones is now removed. The new hierarchy introduced
> - * is pg_data_t -> mem_region -> zones
> - * struct zone node_zones[MAX_NR_ZONES];
> - */
> struct zonelist node_zonelists[MAX_ZONELISTS];
> int nr_zones;
> #ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
> - struct page *node_mem_map;
> + strs pg_data_t -> mem_region -> zones
> + * struct zone node_zones[MAX_NR_ZONES];
> + */uct page *node_mem_map;
what is time?
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR
> struct page_cgroup *node_page_cgroup;
> #endif
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index da8b045..3d994e8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4285,6 +4285,34 @@ static inline int pageblock_default_order(unsigned int order)
>
> #endif /* CONFIG_HUGETLB_PAGE_SIZE_VARIABLE */
>
> +#define REGIONS_SIZE (512 << 20) >> PAGE_SHIFT
why fix the region size here?
> +
> +static void init_node_memory_regions(struct pglist_data *pgdat)
> +{
Best Regards,
J.

2011-05-29 08:16:29

by Ankita Garg

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: Introduce the memory regions data structure

Hi Dave,

On Fri, May 27, 2011 at 02:31:52PM -0700, Dave Hansen wrote:
> On Fri, 2011-05-27 at 23:50 +0530, Vaidyanathan Srinivasan wrote:
> > The overall idea is to have a VM data structure that can capture
> > various boundaries of memory, and enable the allocations and reclaim
> > logic to target certain areas based on the boundaries and properties
> > required.
>
> It's worth noting that we already do targeted reclaim on boundaries
> other than zones. The lumpy reclaim and memory compaction logically do
> the same thing. So, it's at least possible to do this without having
> the global LRU designed around the way you want to reclaim.
>

My understanding may be incorrect, but don't both lumpy reclaim and
memory compaction still work within zone boundaries? While trying to
free up higher-order pages, lumpy reclaim checks to ensure that the
pages that are selected do not cross a zone boundary. Further, compaction
walks through the pages in a zone and tries to re-arrange them.
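
(For reference, a simplified sketch of the kind of boundary check meant
here, loosely modeled on the lumpy path of isolate_lru_pages(); it is
illustrative, not the exact kernel code:)

/* Can the page at 'pfn' be taken along with 'target' by lumpy reclaim? */
static bool lumpy_same_zone(struct page *target, unsigned long pfn)
{
        struct page *cursor_page;

        /* Avoid holes within a MAX_ORDER-aligned block. */
        if (!pfn_valid_within(pfn))
                return false;

        cursor_page = pfn_to_page(pfn);

        /* Refuse pages that would cross a zone boundary. */
        return page_zone_id(cursor_page) == page_zone_id(target);
}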

> Also, if you get _too_ dependent on the global LRU, what are you going
> to do if our cgroup buddies manage to get cgroup'd pages off the global
> LRU?
>

--
Regards,
Ankita Garg ([email protected])
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India

2011-05-31 17:34:32

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: Introduce the memory regions data structure

On Sun, 2011-05-29 at 13:46 +0530, Ankita Garg wrote:
> > It's worth noting that we already do targeted reclaim on boundaries
> > other than zones. The lumpy reclaim and memory compaction logically do
> > the same thing. So, it's at least possible to do this without having
> > the global LRU designed around the way you want to reclaim.
> >
> My understanding maybe incorrect, but doesn't both lumpy reclaim and
> memory compaction still work under zone boundary ? While trying to free
> up higher order pages, lumpy reclaim checks to ensure that pages that
> are selected do not cross zone boundary. Further, compaction walks
> through the pages in a zone and tries to re-arrange them.

I'm asserting that we don't need memory regions in the

pgdat->regions[]->zones[]

layout to do what you're asking for.

Lumpy reclaim is limited to a zone because it's trying to satisfy an
allocation request that came in for *THAT* *ZONE*. It's useless to go
clear out other zones. In your case, you don't care about zone
boundaries: you want to reclaim things regardless.

There was a "cma: Contiguous Memory Allocator added" patch posted a bit
ago to linux-mm@. You might want to take a look at it for some
inspiration.

I think you also need to clearly establish here why any memory that
you're going to want to power off can't use (or shouldn't use)
ZONE_MOVABLE. It seems a bit silly to have it there, and ignore it for
such a similar use case. Memory hot-remove and power-down are not
horrifically different beasts.

BTW, that's probably something else to add to your list: make sure
mem_map[]s for memory in a region get allocated *in* that region.

-- Dave

2011-06-02 08:54:21

by Ankita Garg

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: Introduce the memory regions data structure

Hi Dave,

On Tue, May 31, 2011 at 10:34:20AM -0700, Dave Hansen wrote:
> On Sun, 2011-05-29 at 13:46 +0530, Ankita Garg wrote:
> > > It's worth noting that we already do targeted reclaim on boundaries
> > > other than zones. The lumpy reclaim and memory compaction logically do
> > > the same thing. So, it's at least possible to do this without having
> > > the global LRU designed around the way you want to reclaim.
> > >
> > My understanding maybe incorrect, but doesn't both lumpy reclaim and
> > memory compaction still work under zone boundary ? While trying to free
> > up higher order pages, lumpy reclaim checks to ensure that pages that
> > are selected do not cross zone boundary. Further, compaction walks
> > through the pages in a zone and tries to re-arrange them.
>
> I'm asserting that we don't need memory regions in the
>
> pgdat->regions[]->zones[]
>
> layout to do what you're asking for.
>
> Lumpy reclaim is limited to a zone because it's trying to satisfy and
> allocation request that came in for *THAT* *ZONE*. It's useless to go
> clear out other zones. In your case, you don't care about zone
> boundaries: you want to reclaim things regardless.
>

Ok true. So I guess lumpy reclaim could be extended to just free up
pages spanning the entire region and not just a particular zone.

> There was a "cma: Contiguous Memory Allocator added" patch posted a bit
> ago to linux-mm@. You might want to take a look at it for some
> inspiration.
>

We did take a look at CMA, but the use case seems to be slightly
different. In order to allocate large contiguous areas, CMA creates a
new migratetype called MIGRATE_CMA, which effectively isolates those
pages from the buddy allocator.

> I think you also need to clearly establish here why any memory that
> you're going to want to power off can't use (or shouldn't use)
> ZONE_MOVABLE. It seems a bit silly to have it there, and ignore it for
> such a similar use case. Memory hot-remove and power-down are not
> horrifically different beasts.
>

Memory hot-add and remove are definite use cases for conserving memory
power. In this first version of the RFC patch, I have not yet added
support for ZONE_MOVABLE. I am currently testing a patch that creates
movable zones under regions, thus ensuring that they can be easily
evacuated using page migration.

> BTW, that's probably something else to add to your list: make sure
> mem_map[]s for memory in a region get allocated *in* that region.
>

There are a few reasons why we decided that we must have all the kernel
non-movable data structures co-located in a single region as much as
possible:

- Having a region devoid of non-movable memory will enable the complete
memory region to even be hot-removed
- If the memory is evacuated and later turned off (loss of content),
then the mem_map[]s will be lost. So when the memory comes back on,
the mem_map[]s will need to be reinitialized. While the hotplug
approach will work for exploiting PASR, it may not be the most
efficient one
- When the memory is put into a lower power state, having the
mem_map[]s in a single region would ensure that any references to
just the struct pages will not lead to references to the actual memory

However, it might be worth taking a look at it again.

--
Regards,
Ankita Garg ([email protected])
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India

2011-06-09 18:53:07

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Sat, May 28, 2011 at 12:56:40AM -0700, Andrew Morton wrote:
> On Fri, 27 May 2011 18:01:28 +0530 Ankita Garg <[email protected]> wrote:
>
> > This patchset proposes a generic memory regions infrastructure that can be
> > used to tag boundaries of memory blocks which belongs to a specific memory
> > power management domain and further enable exploitation of platform memory
> > power management capabilities.
>
> A couple of quick thoughts...
>
> I'm seeing no estimate of how much energy we might save when this work
> is completed. But saving energy is the entire point of the entire
> patchset! So please spend some time thinking about that and update and
> maintain the [patch 0/n] description so others can get some idea of the
> benefit we might get from all of this. That estimate should include an
> estimate of what proportion of machines are likely to have hardware
> which can use this feature and in what timeframe.
>
> IOW, if it saves one microwatt on 0.001% of machines, not interested ;)

FWIW, I have seen estimates on the order of a 5% reduction in power
consumption for some common types of embedded devices.

Thanx, Paul

> Also, all this code appears to be enabled on all machines? So machines
> which don't have the requisite hardware still carry any additional
> overhead which is added here. I can see that ifdeffing a feature like
> this would be ghastly but please also have a think about the
> implications of this and add that discussion also.
>
> If possible, it would be good to think up some microbenchmarks which
> probe the worst-case performance impact and describe those and present
> the results. So others can gain an understanding of the runtime costs.
>
>

2011-06-10 00:51:56

by Kyungmin Park

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, Jun 10, 2011 at 3:52 AM, Paul E. McKenney
<[email protected]> wrote:
> On Sat, May 28, 2011 at 12:56:40AM -0700, Andrew Morton wrote:
>> On Fri, 27 May 2011 18:01:28 +0530 Ankita Garg <[email protected]> wrote:
>>
>> > This patchset proposes a generic memory regions infrastructure that can be
>> > used to tag boundaries of memory blocks which belongs to a specific memory
>> > power management domain and further enable exploitation of platform memory
>> > power management capabilities.
>>
>> A couple of quick thoughts...
>>
>> I'm seeing no estimate of how much energy we might save when this work
>> is completed. But saving energy is the entire point of the entire
>> patchset! So please spend some time thinking about that and update and
>> maintain the [patch 0/n] description so others can get some idea of the
>> benefit we might get from all of this. That estimate should include an
>> estimate of what proportion of machines are likely to have hardware
>> which can use this feature and in what timeframe.
>>
>> IOW, if it saves one microwatt on 0.001% of machines, not interested ;)
>
> FWIW, I have seen estimates on the order of a 5% reduction in power
> consumption for some common types of embedded devices.

Wow, interesting. I did not expect it could give a 5% power reduction.
If a device uses 1GiB of LPDDR2 memory, each of the two memory ports has
4Gib (512MiB), so one bank is 64MiB (512MiB / 8). So I don't expect it
to be difficult to keep more than 64MiB of free or inactive memory at
runtime.

Anyway, can you describe the exact test environment, especially the
memory type? As you know, there are a great many embedded devices, used
in widely varying configurations.

Thank you,
Kyungmin Park
>
> Thanx, Paul
>
>> Also, all this code appears to be enabled on all machines? So machines
>> which don't have the requisite hardware still carry any additional
>> overhead which is added here. I can see that ifdeffing a feature like
>> this would be ghastly but please also have a think about the
>> implications of this and add that discussion also.
>>
>> If possible, it would be good to think up some microbenchmarks which
>> probe the worst-case performance impact and describe those and present
>> the results. So others can gain an understanding of the runtime costs.
>>
>>

2011-06-10 15:11:30

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, Jun 10, 2011 at 09:51:53AM +0900, Kyungmin Park wrote:
> On Fri, Jun 10, 2011 at 3:52 AM, Paul E. McKenney
> <[email protected]> wrote:
> > On Sat, May 28, 2011 at 12:56:40AM -0700, Andrew Morton wrote:
> >> On Fri, 27 May 2011 18:01:28 +0530 Ankita Garg <[email protected]> wrote:
> >>
> >> > This patchset proposes a generic memory regions infrastructure that can be
> >> > used to tag boundaries of memory blocks which belongs to a specific memory
> >> > power management domain and further enable exploitation of platform memory
> >> > power management capabilities.
> >>
> >> A couple of quick thoughts...
> >>
> >> I'm seeing no estimate of how much energy we might save when this work
> >> is completed. But saving energy is the entire point of the entire
> >> patchset! So please spend some time thinking about that and update and
> >> maintain the [patch 0/n] description so others can get some idea of the
> >> benefit we might get from all of this. That estimate should include an
> >> estimate of what proportion of machines are likely to have hardware
> >> which can use this feature and in what timeframe.
> >>
> >> IOW, if it saves one microwatt on 0.001% of machines, not interested ;)
> >
> > FWIW, I have seen estimates on the order of a 5% reduction in power
> > consumption for some common types of embedded devices.
>
> Wow interesting. I can't expect it can reduce 5% power reduction.
> If it uses the 1GiBytes LPDDR2 memory. each memory port has 4Gib,
> another has 4Gib. so one bank size is 64MiB (512MiB / 8).
> So I don't expect it's difficult to contain the free or inactive
> memory more than 64MiB during runtime.
>
> Anyway can you describe the exact test environment? esp., memory type?
> As you know there are too much embedded devices which use the various
> environment.

Indeed, your mileage may vary. It involved a very low-power CPU,
and the change enabled not just powering off memory, but reducing
the amount of physical memory provided.

Of course, on a server, you could get similar results by having a very
large amount of memory (say 256GB) and a workload that needed all the
memory only occasionally for short periods, but could get by with much
less (say 8GB) the rest of the time. I have no idea whether or not
anyone actually has such a system.

Thanx, Paul

> Thank you,
> Kyungmin Park
> >
> > Thanx, Paul
> >
> >> Also, all this code appears to be enabled on all machines? So machines
> >> which don't have the requisite hardware still carry any additional
> >> overhead which is added here. I can see that ifdeffing a feature like
> >> this would be ghastly but please also have a think about the
> >> implications of this and add that discussion also.
> >>
> >> If possible, it would be good to think up some microbenchmarks which
> >> probe the worst-case performance impact and describe those and present
> >> the results. So others can gain an understanding of the runtime costs.
> >>
> >>

2011-06-10 16:01:01

by Matthew Garrett

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, Jun 10, 2011 at 08:11:21AM -0700, Paul E. McKenney wrote:

> Of course, on a server, you could get similar results by having a very
> large amount of memory (say 256GB) and a workload that needed all the
> memory only occasionally for short periods, but could get by with much
> less (say 8GB) the rest of the time. I have no idea whether or not
> anyone actually has such a system.

For the server case, the low hanging fruit would seem to be
finer-grained self-refresh. At best we seem to be able to do that on a
per-CPU socket basis right now. The difference between active and
self-refresh would seem to be much larger than the difference between
self-refresh and powered down.

--
Matthew Garrett | [email protected]

2011-06-10 16:55:35

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, Jun 10, 2011 at 04:59:54PM +0100, Matthew Garrett wrote:
> On Fri, Jun 10, 2011 at 08:11:21AM -0700, Paul E. McKenney wrote:
>
> > Of course, on a server, you could get similar results by having a very
> > large amount of memory (say 256GB) and a workload that needed all the
> > memory only occasionally for short periods, but could get by with much
> > less (say 8GB) the rest of the time. I have no idea whether or not
> > anyone actually has such a system.
>
> For the server case, the low hanging fruit would seem to be
> finer-grained self-refresh. At best we seem to be able to do that on a
> per-CPU socket basis right now. The difference between active and
> self-refresh would seem to be much larger than the difference between
> self-refresh and powered down.

By "finer-grained self-refresh" you mean turning off refresh for banks
of memory that are not being used, right? If so, this is supported by
the memory-regions support provided, at least assuming that the regions
can be aligned with the self-refresh boundaries.

Or am I missing your point?

Thanx, Paul

2011-06-10 17:06:14

by Matthew Garrett

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, Jun 10, 2011 at 09:55:29AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 10, 2011 at 04:59:54PM +0100, Matthew Garrett wrote:
> > For the server case, the low hanging fruit would seem to be
> > finer-grained self-refresh. At best we seem to be able to do that on a
> > per-CPU socket basis right now. The difference between active and
> > self-refresh would seem to be much larger than the difference between
> > self-refresh and powered down.
>
> By "finer-grained self-refresh" you mean turning off refresh for banks
> of memory that are not being used, right? If so, this is supported by
> the memory-regions support provided, at least assuming that the regions
> can be aligned with the self-refresh boundaries.

I mean at the hardware level. As far as I know, the best we can do at
the moment is to put an entire node into self refresh when the CPU hits
package C6.

--
Matthew Garrett | [email protected]

2011-06-10 17:20:20

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, Jun 10, 2011 at 06:05:35PM +0100, Matthew Garrett wrote:
> On Fri, Jun 10, 2011 at 09:55:29AM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 10, 2011 at 04:59:54PM +0100, Matthew Garrett wrote:
> > > For the server case, the low hanging fruit would seem to be
> > > finer-grained self-refresh. At best we seem to be able to do that on a
> > > per-CPU socket basis right now. The difference between active and
> > > self-refresh would seem to be much larger than the difference between
> > > self-refresh and powered down.
> >
> > By "finer-grained self-refresh" you mean turning off refresh for banks
> > of memory that are not being used, right? If so, this is supported by
> > the memory-regions support provided, at least assuming that the regions
> > can be aligned with the self-refresh boundaries.
>
> I mean at the hardware level. As far as I know, the best we can do at
> the moment is to put an entire node into self refresh when the CPU hits
> package C6.

But this depends on the type of system and CPU family, right? If you
can say, which hardware are you thinking of? (I am thinking of ARM.)

Thanx, Paul

2011-06-10 17:24:01

by Matthew Garrett

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, Jun 10, 2011 at 10:19:39AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 10, 2011 at 06:05:35PM +0100, Matthew Garrett wrote:
> > I mean at the hardware level. As far as I know, the best we can do at
> > the moment is to put an entire node into self refresh when the CPU hits
> > package C6.
>
> But this depends on the type of system and CPU family, right? If you
> can say, which hardware are you thinking of? (I am thinking of ARM.)

I haven't seen too many ARM servers with 256GB of RAM :) I'm mostly
looking at this from an x86 perspective.

--
Matthew Garrett | [email protected]

2011-06-10 17:33:57

by Ankita Garg

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, Jun 10, 2011 at 10:19:39AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 10, 2011 at 06:05:35PM +0100, Matthew Garrett wrote:
> > On Fri, Jun 10, 2011 at 09:55:29AM -0700, Paul E. McKenney wrote:
> > > On Fri, Jun 10, 2011 at 04:59:54PM +0100, Matthew Garrett wrote:
> > > > For the server case, the low hanging fruit would seem to be
> > > > finer-grained self-refresh. At best we seem to be able to do that on a
> > > > per-CPU socket basis right now. The difference between active and
> > > > self-refresh would seem to be much larger than the difference between
> > > > self-refresh and powered down.
> > >
> > > By "finer-grained self-refresh" you mean turning off refresh for banks
> > > of memory that are not being used, right? If so, this is supported by
> > > the memory-regions support provided, at least assuming that the regions
> > > can be aligned with the self-refresh boundaries.
> >
> > I mean at the hardware level. As far as I know, the best we can do at
> > the moment is to put an entire node into self refresh when the CPU hits
> > package C6.
>
> But this depends on the type of system and CPU family, right? If you
> can say, which hardware are you thinking of? (I am thinking of ARM.)
>

And also whether the memory controller is on-chip or off-chip? A
package could be in C6, but other packages could be referring to memory
connected to that socket, right? And as Paul mentioned, at this point
the ARM SoCs that have support for memory power management have only a
single node.

--
Regards,
Ankita Garg ([email protected])
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India

2011-06-10 17:52:54

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, Jun 10, 2011 at 06:23:07PM +0100, Matthew Garrett wrote:
> On Fri, Jun 10, 2011 at 10:19:39AM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 10, 2011 at 06:05:35PM +0100, Matthew Garrett wrote:
> > > I mean at the hardware level. As far as I know, the best we can do at
> > > the moment is to put an entire node into self refresh when the CPU hits
> > > package C6.
> >
> > But this depends on the type of system and CPU family, right? If you
> > can say, which hardware are you thinking of? (I am thinking of ARM.)
>
> I haven't seen too many ARM servers with 256GB of RAM :) I'm mostly
> looking at this from an x86 perspective.

But I have seen ARM embedded systems with CPU power consumption in
the milliwatt range, which greatly reduces the amount of RAM required
to get significant power savings from this approach. Three orders
of magnitude less CPU power consumption translates (roughly) to three
orders of magnitude less memory required -- and embedded devices with
more than 256MB of memory are quite common.

Thanx, Paul

2011-06-10 18:08:41

by Matthew Garrett

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, Jun 10, 2011 at 10:52:48AM -0700, Paul E. McKenney wrote:
> On Fri, Jun 10, 2011 at 06:23:07PM +0100, Matthew Garrett wrote:
> > I haven't seen too many ARM servers with 256GB of RAM :) I'm mostly
> > looking at this from an x86 perspective.
>
> But I have seen ARM embedded systems with CPU power consumption in
> the milliwatt range, which greatly reduces the amount of RAM required
> to get significant power savings from this approach. Three orders
> of magnitude less CPU power consumption translates (roughly) to three
> orders of magnitude less memory required -- and embedded devices with
> more than 256MB of memory are quite common.

I'm not saying that powering down memory isn't a win, just that in the
server market we're not even getting unused memory into self refresh at
the moment. If we can gain that hardware capability then sub-node zoning
means that we can look at allocating (and migrating?) RAM in such a way
as to get a lot of the win that we'd gain from actually cutting the
power, without the added overhead of actually shrinking our working set.

--
Matthew Garrett | [email protected]

2011-06-10 18:48:21

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, Jun 10, 2011 at 07:08:07PM +0100, Matthew Garrett wrote:
> On Fri, Jun 10, 2011 at 10:52:48AM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 10, 2011 at 06:23:07PM +0100, Matthew Garrett wrote:
> > > I haven't seen too many ARM servers with 256GB of RAM :) I'm mostly
> > > looking at this from an x86 perspective.
> >
> > But I have seen ARM embedded systems with CPU power consumption in
> > the milliwatt range, which greatly reduces the amount of RAM required
> > to get significant power savings from this approach. Three orders
> > of magnitude less CPU power consumption translates (roughly) to three
> > orders of magnitude less memory required -- and embedded devices with
> > more than 256MB of memory are quite common.
>
> I'm not saying that powering down memory isn't a win, just that in the
> server market we're not even getting unused memory into self refresh at
> the moment. If we can gain that hardware capability then sub-node zoning
> means that we can look at allocating (and migrating?) RAM in such a way
> as to get a lot of the win that we'd gain from actually cutting the
> power, without the added overhead of actually shrinking our working set.

Agreed.

And if I understand you correctly, then the patches that Ankita posted
should help your self-refresh case, along with the originally intended
power-down case and the special-purpose use of memory case.

Thanx, Paul

2011-06-10 19:24:06

by Matthew Garrett

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, Jun 10, 2011 at 11:47:38AM -0700, Paul E. McKenney wrote:

> And if I understand you correctly, then the patches that Ankita posted
> should help your self-refresh case, along with the originally intended
> the power-down case and special-purpose use of memory case.

Yeah, I'd hope so once we actually have capable hardware.

--
Matthew Garrett | [email protected]

2011-06-10 19:37:21

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, Jun 10, 2011 at 08:23:29PM +0100, Matthew Garrett wrote:
> On Fri, Jun 10, 2011 at 11:47:38AM -0700, Paul E. McKenney wrote:
>
> > And if I understand you correctly, then the patches that Ankita posted
> > should help your self-refresh case, along with the originally intended
> > the power-down case and special-purpose use of memory case.
>
> Yeah, I'd hope so once we actually have capable hardware.

Cool!!!

So Ankita's patchset might be useful to you at some point, then.

Does it look like a reasonable implementation?

Thanx, Paul

2011-06-10 20:13:17

by Matthew Garrett

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, Jun 10, 2011 at 12:37:13PM -0700, Paul E. McKenney wrote:
> On Fri, Jun 10, 2011 at 08:23:29PM +0100, Matthew Garrett wrote:
> > On Fri, Jun 10, 2011 at 11:47:38AM -0700, Paul E. McKenney wrote:
> >
> > > And if I understand you correctly, then the patches that Ankita posted
> > > should help your self-refresh case, along with the originally intended
> > > the power-down case and special-purpose use of memory case.
> >
> > Yeah, I'd hope so once we actually have capable hardware.
>
> Cool!!!
>
> So Ankita's patchset might be useful to you at some point, then.
>
> Does it look like a reasonable implementation?

I should read it before saying anything, I suspect...

--
Matthew Garrett | [email protected]

2011-06-11 02:58:58

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, 10 Jun 2011 12:37:13 -0700
"Paul E. McKenney" <[email protected]> wrote:

> On Fri, Jun 10, 2011 at 08:23:29PM +0100, Matthew Garrett wrote:
> > On Fri, Jun 10, 2011 at 11:47:38AM -0700, Paul E. McKenney wrote:
> >
> > > And if I understand you correctly, then the patches that Ankita
> > > posted should help your self-refresh case, along with the
> > > originally intended the power-down case and special-purpose use
> > > of memory case.
> >
> > Yeah, I'd hope so once we actually have capable hardware.
>
> Cool!!!
>
> So Ankita's patchset might be useful to you at some point, then.
>
> Does it look like a reasonable implementation?

as someone who is working on hardware that is PASR capable right now,
I have to admit that our plan was to just hook into the buddy allocator,
and use PASR on the top level of buddy (e.g. PASR off blocks that are
free there, and PASR them back on once an allocation requires the block
to be broken up)... that looked the simplest to me.

Maybe something much more elaborate is needed, but I didn't see why so
far.
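
(A rough sketch of that buddy-level hook, for illustration only;
pasr_segment_off()/pasr_segment_on() are hypothetical driver entry points,
not an existing API:)

static inline void pasr_track_free(struct page *page, unsigned int order)
{
        /* A whole top-order block just became free: it may self-refresh off. */
        if (order == MAX_ORDER - 1)
                pasr_segment_off(page_to_pfn(page), 1UL << order);
}

static inline void pasr_track_alloc(struct page *page, unsigned int order)
{
        /* The block is about to be handed out or split: power it back on. */
        if (order == MAX_ORDER - 1)
                pasr_segment_on(page_to_pfn(page), 1UL << order);
}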


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2011-06-11 17:06:16

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, Jun 10, 2011 at 08:02:33PM -0700, Arjan van de Ven wrote:
> On Fri, 10 Jun 2011 12:37:13 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > On Fri, Jun 10, 2011 at 08:23:29PM +0100, Matthew Garrett wrote:
> > > On Fri, Jun 10, 2011 at 11:47:38AM -0700, Paul E. McKenney wrote:
> > >
> > > > And if I understand you correctly, then the patches that Ankita
> > > > posted should help your self-refresh case, along with the
> > > > originally intended the power-down case and special-purpose use
> > > > of memory case.
> > >
> > > Yeah, I'd hope so once we actually have capable hardware.
> >
> > Cool!!!
> >
> > So Ankita's patchset might be useful to you at some point, then.
> >
> > Does it look like a reasonable implementation?
>
> as someone who is working on hardware that is PASR capable right now,
> I have to admit that our plan was to just hook into the buddy allocator,
> and use PASR on the top level of buddy (eg PASR off blocks that are
> free there, and PASR them back on once an allocation required the block
> to be broken up)..... that looked the very most simple to me.
>
> Maybe something much more elaborate is needed, but I didn't see why so
> far.

If I understand correctly, you face the same issue that affects
transparent huge pages, but on a much larger scale. If you have even
one non-moveable allocation in a given top-level buddy block, you won't
be able to PASR that block.

In addition, one of the things that Ankita's patchset is looking to do
is to control allocations in a given region, so that the region can be
easily evacuated. One use for this is to power off regions of memory,
another is to PASR off regions of memory, and a third is to ensure that
large regions of memory are available for when needed by media codecs
(e.g., cameras), but can be used for other purposes when the media codecs
don't need them (e.g., when viewing photos rather than taking them).

From what I can see, the same mechanism covers all three use cases.

Thanx, Paul

2011-06-11 17:08:34

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, Jun 10, 2011 at 11:03:45PM +0530, Ankita Garg wrote:
> On Fri, Jun 10, 2011 at 10:19:39AM -0700, Paul E. McKenney wrote:
> > On Fri, Jun 10, 2011 at 06:05:35PM +0100, Matthew Garrett wrote:
> > > On Fri, Jun 10, 2011 at 09:55:29AM -0700, Paul E. McKenney wrote:
> > > > On Fri, Jun 10, 2011 at 04:59:54PM +0100, Matthew Garrett wrote:
> > > > > For the server case, the low hanging fruit would seem to be
> > > > > finer-grained self-refresh. At best we seem to be able to do that on a
> > > > > per-CPU socket basis right now. The difference between active and
> > > > > self-refresh would seem to be much larger than the difference between
> > > > > self-refresh and powered down.
> > > >
> > > > By "finer-grained self-refresh" you mean turning off refresh for banks
> > > > of memory that are not being used, right? If so, this is supported by
> > > > the memory-regions support provided, at least assuming that the regions
> > > > can be aligned with the self-refresh boundaries.
> > >
> > > I mean at the hardware level. As far as I know, the best we can do at
> > > the moment is to put an entire node into self refresh when the CPU hits
> > > package C6.
> >
> > But this depends on the type of system and CPU family, right? If you
> > can say, which hardware are you thinking of? (I am thinking of ARM.)
> >
>
> And also whether the memory controller is on-chip or off-chip ? As
> package could be in C6, but other packages could be refering memory
> connected to this socket right ? And as Paul mentioned, at this point
> the ARM SoCs that have support for memory power management, have only a
> single node.

I suspect that this will be changing shortly, and that finer-grained
control might be available.

However, there are also use cases where contiguous memory is required from
time to time by media codecs, and being able to use that memory for other
purposes when the media codec is not in use reduced the total amount of
memory required, which reduces power consumption all the time, right? ;-)

Thanx, Paul

2011-06-11 17:23:01

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Sat, 11 Jun 2011 10:06:10 -0700
"Paul E. McKenney" <[email protected]> wrote:

> On Fri, Jun 10, 2011 at 08:02:33PM -0700, Arjan van de Ven wrote:
> > On Fri, 10 Jun 2011 12:37:13 -0700
> > "Paul E. McKenney" <[email protected]> wrote:
> >
> > > On Fri, Jun 10, 2011 at 08:23:29PM +0100, Matthew Garrett wrote:
> > > > On Fri, Jun 10, 2011 at 11:47:38AM -0700, Paul E. McKenney
> > > > wrote:
> > > >
> > > > > And if I understand you correctly, then the patches that
> > > > > Ankita posted should help your self-refresh case, along with
> > > > > the originally intended the power-down case and
> > > > > special-purpose use of memory case.
> > > >
> > > > Yeah, I'd hope so once we actually have capable hardware.
> > >
> > > Cool!!!
> > >
> > > So Ankita's patchset might be useful to you at some point, then.
> > >
> > > Does it look like a reasonable implementation?
> >
> > as someone who is working on hardware that is PASR capable right
> > now, I have to admit that our plan was to just hook into the buddy
> > allocator, and use PASR on the top level of buddy (eg PASR off
> > blocks that are free there, and PASR them back on once an
> > allocation required the block to be broken up)..... that looked the
> > very most simple to me.
> >
> > Maybe something much more elaborate is needed, but I didn't see why
> > so far.
>
> If I understand correctly, you face the same issue that affects
> transparent huge pages, but on a much larger scale. If you have even
> one non-moveable allocation in a given top-level buddy block, you
> won't be able to PASR that block.

yep, we'd use a non-kernel zone for that; not too big a deal.
(The "much larger scale" is debatable: if your memory controller is
configured correctly, the PASR regions are not all that much bigger than
hugepages.)
>
> In addition, one of the things that Ankita's patchset is looking to do
> is to control allocations in a given region, so that the region can be
> easily evacuated. One use for this is to power off regions of memory,
> another is to PASR off regions of memory, and a third is to ensure
> that large regions of memory are available for when needed by media
> codecs (e.g., cameras), but can be used for other purposes when the
> media codecs don't need them (e.g., when viewing photos rather than
> taking them).

the codec issue seems to be solving itself over time; previous-generation
silicon on our (Intel) side had ARM ecosystem blocks that did not do
scatter-gather, but the current-generation ARM ecosystem blocks all
seem to have added S/G to them...
(in part this is coming from the strong desire to get camera/etc. blocks
to all use "GPU texture" class memory, so that the camera can directly
deposit its data into a GPU texture, and similarly for media
encode/decode blocks... this avoids copies as well as duplicate memory).



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2011-06-12 23:07:23

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Sat, Jun 11, 2011 at 10:26:54AM -0700, Arjan van de Ven wrote:
> On Sat, 11 Jun 2011 10:06:10 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > On Fri, Jun 10, 2011 at 08:02:33PM -0700, Arjan van de Ven wrote:
> > > On Fri, 10 Jun 2011 12:37:13 -0700
> > > "Paul E. McKenney" <[email protected]> wrote:
> > >
> > > > On Fri, Jun 10, 2011 at 08:23:29PM +0100, Matthew Garrett wrote:
> > > > > On Fri, Jun 10, 2011 at 11:47:38AM -0700, Paul E. McKenney
> > > > > wrote:
> > > > >
> > > > > > And if I understand you correctly, then the patches that
> > > > > > Ankita posted should help your self-refresh case, along with
> > > > > > the originally intended the power-down case and
> > > > > > special-purpose use of memory case.
> > > > >
> > > > > Yeah, I'd hope so once we actually have capable hardware.
> > > >
> > > > Cool!!!
> > > >
> > > > So Ankita's patchset might be useful to you at some point, then.
> > > >
> > > > Does it look like a reasonable implementation?
> > >
> > > as someone who is working on hardware that is PASR capable right
> > > now, I have to admit that our plan was to just hook into the buddy
> > > allocator, and use PASR on the top level of buddy (eg PASR off
> > > blocks that are free there, and PASR them back on once an
> > > allocation required the block to be broken up)..... that looked the
> > > very most simple to me.
> > >
> > > Maybe something much more elaborate is needed, but I didn't see why
> > > so far.
> >
> > If I understand correctly, you face the same issue that affects
> > transparent huge pages, but on a much larger scale. If you have even
> > one non-moveable allocation in a given top-level buddy block, you
> > won't be able to PASR that block.
>
> yep we'd use a non-kernel zone for that; not too big a deal.
> (the much larger scale you can debate, if your memory controller is
> configured correctly the PASR regions are not all that much bigger than
> hugepages)

Ah, OK, so you have a very large number of top-level buddy allocations,
then. Either that or very large hugepages.

> > In addition, one of the things that Ankita's patchset is looking to do
> > is to control allocations in a given region, so that the region can be
> > easily evacuated. One use for this is to power off regions of memory,
> > another is to PASR off regions of memory, and a third is to ensure
> > that large regions of memory are available for when needed by media
> > codecs (e.g., cameras), but can be used for other purposes when the
> > media codecs don't need them (e.g., when viewing photos rather than
> > taking them).
>
> the codec issue seems to be solved in time; a previous generation
> silicon on our (Intel) side had ARM ecosystem blocks that did not do
> scatter gather, however the current generation ARM ecosystem blocks all
> seem to have added S/G to them....
> (in part this is coming from the strong desire to get camera/etc blocks
> to all use "GPU texture" class memory, so that the camera can directly
> deposit its information into a gpu texture, and similar for media
> encode/decode blocks... this avoids copies as well as duplicate memory).

That is indeed a clever approach!

Of course, if the GPU textures are in main memory, there will still
be memory consumption gains to be had as the image size varies (e.g.,
displaying image on one hand vs. menus and UI on the other). In addition,
I would expect that for quite some time there will continue to be a lot
of systems with display hardware a bit too simple to qualify as "GPU".

Thanx, Paul

2011-06-13 04:54:23

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, 27 May 2011 18:01:28 +0530
Ankita Garg <[email protected]> wrote:

> Hi,
>

I'm sorry if you've answered already.

Is memory hotplug is too bad and cannot be enhanced for this purpose ?

I wonder
- make section-size smaller (IIUC, IBM's system has 16MB section size)

- add per section statistics

- add a kind of balloon driver which does software memory offline
(which means making a contiguous chunk of free pages of section_size
by page migration) in background with regard to memory usage statistics.
If system says "need more memory!", balloon driver can online pages.

can work for your purpose. It can allow you page isolatation and
controls in 16MB unit. If you need whole rework of memory hotplug, I think
it's better to rewrite memory hotplug, too.

Thanks,
-Kame





> Modern systems offer higher CPU performance and large amount of memory in
> each generation in order to support application demands. Memory subsystem has
> began to offer wide range of capabilities for managing power consumption,
> which is driving the need to relook at the way memory is managed by the
> operating system. Linux VM subsystem has sophisticated algorithms to
> optimally manage the scarce resources for best overall system performance.
> Apart from the capacity and location of memory areas, the VM subsystem tracks
> special addressability restrictions in zones and relative distance from CPU as
> NUMA nodes if necessary. Power management capabilities in the memory subsystem
> and inclusion of different class of main memory like PCM, or non-volatile RAM,
> brings in new boundaries and attributes that needs to be tagged within the
> Linux VM subsystem for exploitation by the kernel and applications.
>
> This patchset proposes a generic memory regions infrastructure that can be
> used to tag boundaries of memory blocks which belongs to a specific memory
> power management domain and further enable exploitation of platform memory
> power management capabilities.
>
> How can Linux VM help memory power savings?
>
> o Consolidate memory allocations and/or references such that they are
> not spread across the entire memory address space. Basically area of memory
> that is not being referenced, can reside in low power state.
>
> o Support targeted memory reclaim, where certain areas of memory that can be
> easily freed can be offlined, allowing those areas of memory to be put into
> lower power states.
>
> What is a Memory Region ?
> -------------------------
>
> Memory regions is a generic memory management framework that enables the
> virtual memory manager to consider memory characteristics when making memory
> allocation and deallocation decisions. It is a layer of abstraction under the
> real NUMA nodes, that encapsulate knowledge of the underlying memory hardware.
> This layer is created at boot time, with information from firmware regarding
> the granularity at which memory power can be managed on the platform. For
> example, on platforms with support for Partial Array Self-Refresh (PASR) [1],
> regions could be aligned to memory unit that can be independently put into
> self-refresh or turned off (content destructive power off). On the other hand,
> platforms with support for multiple memory controllers that control the power
> states of memory, one memory region could be created for all the memory under
> a single memory controller.
>
> The aim of the alignment is to ensure that memory allocations, deallocations
> and reclaim are performed within a defined hardware boundary. By creating
> zones under regions, the buddy allocator would operate at the level of
> regions. The proposed data structure is as shown in the Figure below:
>
>
> -----------------------------
> |N0 |N1 |N2 |N3 |.. |.. |Nn |
> -----------------------------
> / \ \
> / \ \
> / \ \
> ------------ | ------------
> | Mem Rgn0 | | | Mem Rgn3 |
> ------------ | ------------
> | | |
> | ------------ | ---------
> | | Mem Rgn1 | ->| zones |
> | ------------ ---------
> | | ---------
> | ----->| zones |
> | --------- ---------
> ->| zones |
> ---------
>
> Memory regions enable the following :
>
> o Sequential allocation of memory in the order of memory regions, thus
> ensuring that greater number of memory regions are devoid of allocations to
> begin with
> o With time however, the memory allocations will tend to be spread across
> different regions. But the notion of a region boundary and region level
> memory statistics will enable specific regions to be evacuated using
> targetted allocation and reclaim.
>
> Lumpy reclaim and other memory compaction work by Mel Gorman, would further
> aid in consolidation of memory [4].
>
> Memory regions is just a base infrastructure that would enable the Linux VM to
> be aware of the physical memory hardware characterisitics, a pre-requisite to
> implementing other sophisticated algorithms and techniques to actually
> conserve power.
>
> Advantages
> -----------
>
> Memory regions framework works with existing memory management data
> structures and only adds one more layer of abstraction that is required to
> capture special boundaries and properties. Most VM code paths work similar
> to current implementation with additional traversal of zone data structures
> in pre-defined order.
>
> Alternative Approach:
>
> There are other ways in which memory belonging to the same power domain could
> be grouped together. Fake NUMA nodes under a real NUMA node could encapsulate
> information about the memory hardware units that can be independently power
> managed. With minimal code changes, the same functionality as memory regions
> can be achieved. However, the fake NUMA nodes is a non-intuitive solution,
> that breaks the NUMA semantics and is not generic in nature. It would present
> an incorrect view of the system to the administrator, by showing that it has a
> greater number of NUMA nodes than actually present.
>
> Challenges
> ----------
>
> o Memory interleaving is typically used on all platforms to increase the
> memory bandwidth and hence memory performance. However, in the presence of
> interleaving, the amount of idle memory within the hardware domain reduces,
> impacting power savings. For a given platform, it is important to select an
> interleaving scheme that gives good performance with optimum power savings.
>
> This is a RFC patchset with minimal functionality to demonstrate the
> requirement and proposed implementation options. It has been tested on TI
> OMAP4 Panda board with 1Gb RAM and the Samsung Exynos 4210 board. The patch
> applies on kernel version 2.6.39-rc5, compiled with the default config files
> for the two platforms. I have turned off cgroup, memory hotplug and kexec to
> begin. Support to these framework can be easily extended. The u-boot
> bootloader does not yet export information regarding the physical memory bank
> boundaries and hence the regions are not correctly aligned to hardware and
> hence hard coded for test/demo purposes. Also, the code assumes that atleast
> one region is present in the node. Compile time exclusion of memory regions is
> a todo.
>
> Results
> -------
> Ran pagetest, a simple C program that allocates and touches a required number
> of pages, on a Samsung Exynos 4210 board with ~2GB RAM, booted with 4 memory
> regions, each with ~512MB. The allocation size used was 512MB. Below is the
> free page statistics while running the benchmark:
>
> ---------------------------------------
> | | start | ~480MB | 512MB |
> ---------------------------------------
> | Region 0 | 124013 | 1129 | 484 |
> | Region 1 | 131072 | 131072 | 130824 |
> | Region 2 | 131072 | 131072 | 131072 |
> | Region 3 | 57332 | 57332 | 57332 |
> ---------------------------------------
>
> (The total number of pages in Region 3 is 57332, as it contains all the
> remaining pages and hence the region size is not 512MB).
>
> Column 1 indicates the number of free pages in each region at the start of the
> benchmark, column 2 at about 480MB allocation and column 3 at 512MB
> allocation. The memory in regions 1,2 & 3 is free and only region0 is
> utilized. So if the regions are aligned to the hardware memory units, free
> regions could potentially be put either into low power state or turned off. It
> may be possible to allocate from lower address without regions, but once the
> page reclaim comes into play, the page allocations will tend to get spread
> around.
>
> References
> ----------
>
> [1] Partial Array Self Refresh
> http://focus.ti.com/general/docs/wtbu/wtbudocumentcenter.tsp?templateId=6123&navigationId=12037
> [2] TI OMAP$ Panda board
> http://pandaboard.org/node/224/#manual
> [3] Memory Regions discussion at Ubuntu Development Summit, May 2011
> https://wiki.linaro.org/Specs/KernelGeneralPerformanceO?action=AttachFile&do=view&target=k81-memregions.odp
> [4] Memory compaction
> http://lwn.net/Articles/368869/
>
> Ankita Garg (10):
> mm: Introduce the memory regions data structure
> mm: Helper routines
> mm: Init zones inside memory regions
> mm: Refer to zones from memory regions
> mm: Create zonelists
> mm: Verify zonelists
> mm: Modify vmstat
> mm: Modify vmscan
> mm: Reflect memory region changes in zoneinfo
> mm: Create memory regions at boot-up
>
> include/linux/mm.h | 25 +++-
> include/linux/mmzone.h | 58 +++++++--
> include/linux/vmstat.h | 22 ++-
> mm/mm_init.c | 51 ++++---
> mm/mmzone.c | 36 ++++-
> mm/page_alloc.c | 368 +++++++++++++++++++++++++++++++-----------------
> mm/vmscan.c | 284 ++++++++++++++++++++-----------------
> mm/vmstat.c | 77 ++++++----
> 8 files changed, 581 insertions(+), 340 deletions(-)
>
> --
> 1.7.4
>

2011-06-13 14:25:00

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Sun, 12 Jun 2011 16:07:07 -0700
"Paul E. McKenney" <[email protected]> wrote:
> >
> > the codec issue seems to be solved in time; a previous generation
> > silicon on our (Intel) side had ARM ecosystem blocks that did not do
> > scatter gather, however the current generation ARM ecosystem blocks
> > all seem to have added S/G to them....
> > (in part this is coming from the strong desire to get camera/etc
> > blocks to all use "GPU texture" class memory, so that the camera
> > can directly deposit its information into a gpu texture, and
> > similar for media encode/decode blocks... this avoids copies as
> > well as duplicate memory).
>
> That is indeed a clever approach!
>
> Of course, if the GPU textures are in main memory, there will still
> be memory consumption gains to be had as the image size varies (e.g.,
> displaying image on one hand vs. menus and UI on the other).

graphics drivers and the whole graphics stack is set up to deal with
that... textures aren't per se "screen size", the texture for a button
is only as large as the button (with some rounding up to multiples of
some small power of two)




--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2011-06-13 23:04:22

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Mon, Jun 13, 2011 at 07:28:50AM -0700, Arjan van de Ven wrote:
> On Sun, 12 Jun 2011 16:07:07 -0700
> "Paul E. McKenney" <[email protected]> wrote:
> > >
> > > the codec issue seems to be solved in time; a previous generation
> > > silicon on our (Intel) side had ARM ecosystem blocks that did not do
> > > scatter gather, however the current generation ARM ecosystem blocks
> > > all seem to have added S/G to them....
> > > (in part this is coming from the strong desire to get camera/etc
> > > blocks to all use "GPU texture" class memory, so that the camera
> > > can directly deposit its information into a gpu texture, and
> > > similar for media encode/decode blocks... this avoids copies as
> > > well as duplicate memory).
> >
> > That is indeed a clever approach!
> >
> > Of course, if the GPU textures are in main memory, there will still
> > be memory consumption gains to be had as the image size varies (e.g.,
> > displaying image on one hand vs. menus and UI on the other).
>
> graphics drivers and the whole graphics stack is set up to deal with
> that... textures aren't per se "screen size", the texture for a button
> is only as large as the button (with some rounding up to multiples of
> some small power of two)

In addition, I would expect that for quite some time there will continue
to be a lot of systems with display hardware a bit too simple to qualify
as "GPU".

Thanx, Paul

2011-06-14 08:52:07

by Ankita Garg

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

Hi,

On Fri, Jun 10, 2011 at 08:02:33PM -0700, Arjan van de Ven wrote:
> On Fri, 10 Jun 2011 12:37:13 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > On Fri, Jun 10, 2011 at 08:23:29PM +0100, Matthew Garrett wrote:
> > > On Fri, Jun 10, 2011 at 11:47:38AM -0700, Paul E. McKenney wrote:
> > >
> > > > And if I understand you correctly, then the patches that Ankita
> > > > posted should help your self-refresh case, along with the
> > > > originally intended the power-down case and special-purpose use
> > > > of memory case.
> > >
> > > Yeah, I'd hope so once we actually have capable hardware.
> >
> > Cool!!!
> >
> > So Ankita's patchset might be useful to you at some point, then.
> >
> > Does it look like a reasonable implementation?
>
> as someone who is working on hardware that is PASR capable right now,
> I have to admit that our plan was to just hook into the buddy allocator,
> and use PASR on the top level of buddy (eg PASR off blocks that are
> free there, and PASR them back on once an allocation required the block
> to be broken up)..... that looked the very most simple to me.
>

We were looking at a generic approach to exploit the different types of
memory power management features present in the hardware, like PASR,
power off and automatic transition into lower power states. To actively
create opportunities to exploit these features, appropriately guiding
memory allocations/deallocations, reclaim and statistics gathering is
essential. A few usecases are -

- allocating all memory from one region before moving over to the next
region would theoretically ensure that the second region could be kept
into lower power state for a longer period or turned off
- regions could be created consisting of only movable memory, thus
making it possible to evacuate an entire region
- when memory utilization is low, selected memory regions could be
removed as allocation targets altogether. Such regions need not be
turned off to save power; if the hardware supports it, they could be
put into a content-preserving lower power state as long as they are not
being referenced
- depending upon the memory utilization of the different regions,
targeted reclaim and memory migration could be triggered in a few
regions with the aim of freeing memory

Grouping memory at the level of the buddy allocator for this purpose
might be tougher.

> Maybe something much more elaborate is needed, but I didn't see why so
> far.
>
>

--
Regards,
Ankita Garg ([email protected])
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India

2011-06-15 16:53:43

by Ankita Garg

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

Hi,

On Fri, Jun 10, 2011 at 08:02:33PM -0700, Arjan van de Ven wrote:
> On Fri, 10 Jun 2011 12:37:13 -0700
> "Paul E. McKenney" <[email protected]> wrote:
>
> > On Fri, Jun 10, 2011 at 08:23:29PM +0100, Matthew Garrett wrote:
> > > On Fri, Jun 10, 2011 at 11:47:38AM -0700, Paul E. McKenney wrote:
> > >
> > > > And if I understand you correctly, then the patches that Ankita
> > > > posted should help your self-refresh case, along with the
> > > > originally intended the power-down case and special-purpose use
> > > > of memory case.
> > >
> > > Yeah, I'd hope so once we actually have capable hardware.
> >
> > Cool!!!
> >
> > So Ankita's patchset might be useful to you at some point, then.
> >
> > Does it look like a reasonable implementation?
>
> as someone who is working on hardware that is PASR capable right now,
> I have to admit that our plan was to just hook into the buddy allocator,
> and use PASR on the top level of buddy (eg PASR off blocks that are
> free there, and PASR them back on once an allocation required the block
> to be broken up)..... that looked the very most simple to me.
>

The maximum order in buddy allocator is by default 1k pages. Isn't this
too small a granularity to track blocks that might comprise a PASR unit?

> Maybe something much more elaborate is needed, but I didn't see why so
> far.
>
>

--
Regards,
Ankita Garg ([email protected])
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India

2011-06-16 04:20:58

by Ankita Garg

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

Hi,

On Mon, Jun 13, 2011 at 01:47:01PM +0900, KAMEZAWA Hiroyuki wrote:
> On Fri, 27 May 2011 18:01:28 +0530
> Ankita Garg <[email protected]> wrote:
>
> > Hi,
> >
>
> I'm sorry if you've answered already.
>
> Is memory hotplug is too bad and cannot be enhanced for this purpose ?
>
> I wonder
> - make section-size smaller (IIUC, IBM's system has 16MB section size)
>
> - add per section statistics
>
> - add a kind of balloon driver which does software memory offline
> (which means making a contiguous chunk of free pages of section_size
> by page migration) in background with regard to memory usage statistics.
> If system says "need more memory!", balloon driver can online pages.
>
> can work for your purpose. It can allow you page isolatation and
> controls in 16MB unit. If you need whole rework of memory hotplug, I think
> it's better to rewrite memory hotplug, too.
>

Interesting idea, but a few issues -

- Correctly predicting memory pressure is difficult and thereby being
able to online the required pages at the right time could be a
challenge
- Memory hotplug is a heavy operation, so the overhead involved may be
high
- Powering off memory is just one of the ways in which memory power could
be saved. The platform can also dynamically transition areas of memory
into a content-preserving lower power state if it is not referenced
for a pre-defined threshold of time. In such a case, we would need a
mechanism to soft offline the pages - i.e, no new allocations to be
directed to that memory

--
Regards,
Ankita Garg ([email protected])
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India

2011-06-16 09:19:51

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Thu, 16 Jun 2011 09:50:44 +0530
Ankita Garg <[email protected]> wrote:

> Hi,
>
> On Mon, Jun 13, 2011 at 01:47:01PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Fri, 27 May 2011 18:01:28 +0530
> > Ankita Garg <[email protected]> wrote:
> >
> > > Hi,
> > >
> >
> > I'm sorry if you've answered already.
> >
> > Is memory hotplug is too bad and cannot be enhanced for this purpose ?
> >
> > I wonder
> > - make section-size smaller (IIUC, IBM's system has 16MB section size)
> >
> > - add per section statistics
> >
> > - add a kind of balloon driver which does software memory offline
> > (which means making a contiguous chunk of free pages of section_size
> > by page migration) in background with regard to memory usage statistics.
> > If system says "need more memory!", balloon driver can online pages.
> >
> > can work for your purpose. It can allow you page isolatation and
> > controls in 16MB unit. If you need whole rework of memory hotplug, I think
> > it's better to rewrite memory hotplug, too.
> >
>
> Interesting idea, but a few issues -
>
> - Correctly predicting memory pressure is difficult and thereby being
> able to online the required pages at the right time could be a
> challenge

But it will be required for your purpose, anyway. Isn't it ?

> - Memory hotplug is a heavy operation, so the overhead involved may be
> high

soft-offline of small amount of pages will not very heavy.
compaction and cma patches use the same kind of logic.


> - Powering off memory is just one of the ways in which memory power could
> be saved. The platform can also dynamically transition areas of memory
> into a content-preserving lower power state if it is not referenced
> for a pre-defined threshold of time. In such a case, we would need a
> mechanism to soft offline the pages - i.e, no new allocations to be
> directed to that memory
>

Hmm, sounds like a similar idea of CleanCache ?

Reusing section is much easier than adding new one.., I think.

Thanks,
-Kame



2011-06-16 16:04:16

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Thu, 2011-06-16 at 09:50 +0530, Ankita Garg wrote:
> - Correctly predicting memory pressure is difficult and thereby being
> able to online the required pages at the right time could be a
> challenge

For the sake of this discussion, let's forget about this. There are
certainly a lot of scenarios where turning memory on/off is necessary
and useful _without_ knowing what kind of load the system is under. "We
just shut down our huge database, and now have 99% of RAM free" is a
fine, dumb, metric. We don't have to have magical pony memory pressure
detection as a prerequisite.

> - Memory hotplug is a heavy operation, so the overhead involved may be
> high

I'm curious. Why do you say this? Could you elaborate a bit on _how_
memory hotplug is different from what you're doing here? On powerpc, at
least, we can do memory hotplug in areas as small as 16MB. That's _way_
smaller than what you're talking about here, and I would assume that
smaller here means less overhead.

> - Powering off memory is just one of the ways in which memory power could
> be saved. The platform can also dynamically transition areas of memory
> into a content-preserving lower power state if it is not referenced
> for a pre-defined threshold of time. In such a case, we would need a
> mechanism to soft offline the pages - i.e, no new allocations to be
> directed to that memory

OK... That's fine, but I think you're avoiding the question a bit. You
need to demonstrate that this 'regions' thing is necessary to do this,
and that we can not get by just modifying what we have now. For
instance:

1. Have something like khugepaged try to find region-sized chunks of
memory to free.
2. Modify the buddy allocator to be "picky" about when it lets you get
access to these regions.
3. Try to bunch up 'like allocations' like ZONE_MOVABLE does.

(2) could easily mean that we take the MAX_ORDER-1 buddy pages and treat
them differently. If the page being freed is going (or trying to go) in
to a low power state, insert freed pages on to the tail, or on a special
list. When satisfying allocations, we'd make some bit of effort to
return pages which are powered on.
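
To make that concrete, here is a tiny user-space model of the idea in
(2); all structure and function names are made up for illustration and
are not existing kernel interfaces. Pages freed back from a region that
is in (or heading for) a low power state are queued at the tail of the
free list, and allocations are satisfied from the head, so powered-on
memory is handed out first:

#include <stdio.h>

struct region {
	int low_power;		/* region is idle / being powered down */
};

struct page {
	unsigned long pfn;
	struct region *rgn;
	struct page *next;
};

struct free_list {
	struct page *head;
	struct page *tail;
};

/* Freed pages from low-power regions go to the tail, so they are only
   handed out again once everything "hotter" has been used up. */
static void free_one_page(struct free_list *fl, struct page *p)
{
	p->next = NULL;
	if (!fl->head) {
		fl->head = fl->tail = p;
	} else if (p->rgn->low_power) {
		fl->tail->next = p;
		fl->tail = p;
	} else {
		p->next = fl->head;
		fl->head = p;
	}
}

/* Allocations always come from the head, i.e. from powered-on regions
   whenever such pages are available. */
static struct page *alloc_one_page(struct free_list *fl)
{
	struct page *p = fl->head;

	if (!p)
		return NULL;
	fl->head = p->next;
	if (!fl->head)
		fl->tail = NULL;
	return p;
}

int main(void)
{
	struct region powered_on = { 0 };
	struct region powered_down = { 1 };
	struct page cold = { 100, &powered_down, NULL };
	struct page hot = { 200, &powered_on, NULL };
	struct free_list fl = { NULL, NULL };

	free_one_page(&fl, &cold);
	free_one_page(&fl, &hot);
	printf("first allocation gets pfn %lu\n", alloc_one_page(&fl)->pfn);
	return 0;
}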

-- Dave

2011-06-17 10:03:15

by Ankita Garg

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

Hi,

On Thu, Jun 16, 2011 at 09:04:00AM -0700, Dave Hansen wrote:
> On Thu, 2011-06-16 at 09:50 +0530, Ankita Garg wrote:
> > - Correctly predicting memory pressure is difficult and thereby being
> > able to online the required pages at the right time could be a
> > challenge
>
> For the sake of this discussion, let's forget about this. There are
> certainly a lot of scenarios where turning memory on/off is necessary
> and useful _without_ knowing what kind of load the system is under. "We
> just shut down our huge database, and now have 99% of RAM free" is a
> fine, dumb, metric. We don't have to have magical pony memory pressure
> detection as a prerequisite.
>

I agree, but when using memory hotplug for managing memory power, it
would be important to correctly predict pressure so that performance is
not affected too much, especially since memory would be offlined only
when it is mostly idle. In the case of CPUs, a user space daemon can
automatically online/offline them based on load; in the case of memory,
I guess a kernel thread that maintains global statistics might have to
be used instead.

> > - Memory hotplug is a heavy operation, so the overhead involved may be
> > high
>
> I'm curious. Why do you say this? Could you elaborate a bit on _how_
> memory hotplug is different from what you're doing here? On powerpc, at
> least, we can do memory hotplug in areas as small as 16MB. That's _way_
> smaller than what you're talking about here, and I would assume that
> smaller here means less overhead.
>

To save any power, the entire memory unit (like a bank for PASR) will
have to be turned off (and hence offlined). The overhead in memory
hotplug is to migrate/free pages belonging to the sections and
creating/deleting the various memory management structures. Instead, if
we could have a framework like you mentioned below that could target
allocations away from certain areas of memory, the migration step would
not be needed. Further, the hardware would just turn off the memory and
the OS would retain all the memory management structures.

We intend to use memory regions to group the memory together into units
that can be independently power managed. We propose to achieve this by
re-ordering zones within the zonelist, such that zones from regions that
are the target for evacuation would be at the tail of the zonelist and
thus will not be preferred for allocations.
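
To illustrate the intended ordering, below is a small user-space sketch
(the names are illustrative and this is not the actual zonelist code in
the patches): zones whose region is a target for evacuation are simply
appended after all other zones, so the allocator only reaches them
under real memory pressure.

#include <stdio.h>

#define MAX_ZONES 8

struct zone {
	const char *name;
	int evacuate;	/* region containing this zone is being emptied */
};

/* Build the fallback order: non-evacuating zones first, evacuating last. */
static int build_zonelist(struct zone *zones, int nr, struct zone **list)
{
	int n = 0;

	for (int i = 0; i < nr; i++)
		if (!zones[i].evacuate)
			list[n++] = &zones[i];
	for (int i = 0; i < nr; i++)
		if (zones[i].evacuate)
			list[n++] = &zones[i];
	return n;
}

int main(void)
{
	struct zone zones[] = {
		{ "region0-normal", 0 },
		{ "region1-normal", 1 },	/* target for evacuation */
		{ "region2-normal", 0 },
	};
	struct zone *list[MAX_ZONES];
	int n = build_zonelist(zones, 3, list);

	for (int i = 0; i < n; i++)
		printf("fallback %d: %s\n", i, list[i]->name);
	return 0;
}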

> > - Powering off memory is just one of the ways in which memory power could
> > be saved. The platform can also dynamically transition areas of memory
> > into a content-preserving lower power state if it is not referenced
> > for a pre-defined threshold of time. In such a case, we would need a
> > mechanism to soft offline the pages - i.e, no new allocations to be
> > directed to that memory
>
> OK... That's fine, but I think you're avoiding the question a bit. You
> need to demonstrate that this 'regions' thing is necessary to do this,
> and that we can not get by just modifying what we have now. For
> instance:
>
> 1. Have something like khugepaged try to find region-sized chunks of
> memory to free.
> 2. Modify the buddy allocator to be "picky" about when it lets you get
> access to these regions.

Managing pages belonging to multiple regions on the same buddy list
might make the buddy allocator more complex. But thanks for suggesting
the different approaches; I will look into these and get back to you.

> 3. Try to bunch up 'like allocations' like ZONE_MOVABLE does.
>
> (2) could easily mean that we take the MAX_ORDER-1 buddy pages and treat
> them differently. If the page being freed is going (or trying to go) in
> to a low power state, insert freed pages on to the tail, or on a special
> list. When satisfying allocations, we'd make some bit of effort to
> return pages which are powered on.
>

--
Regards,
Ankita Garg ([email protected])
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India

2011-06-17 15:29:18

by Ankita Garg

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

Hi,

On Thu, Jun 16, 2011 at 06:12:51PM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 16 Jun 2011 09:50:44 +0530
> Ankita Garg <[email protected]> wrote:
>
> > Hi,
> >
> > On Mon, Jun 13, 2011 at 01:47:01PM +0900, KAMEZAWA Hiroyuki wrote:
> > > On Fri, 27 May 2011 18:01:28 +0530
> > > Ankita Garg <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > >
> > > I'm sorry if you've answered already.
> > >
> > > Is memory hotplug is too bad and cannot be enhanced for this purpose ?
> > >
> > > I wonder
> > > - make section-size smaller (IIUC, IBM's system has 16MB section size)
> > >
> > > - add per section statistics
> > >
> > > - add a kind of balloon driver which does software memory offline
> > > (which means making a contiguous chunk of free pages of section_size
> > > by page migration) in background with regard to memory usage statistics.
> > > If system says "need more memory!", balloon driver can online pages.
> > >
> > > can work for your purpose. It can allow you page isolatation and
> > > controls in 16MB unit. If you need whole rework of memory hotplug, I think
> > > it's better to rewrite memory hotplug, too.
> > >
> >
> > Interesting idea, but a few issues -
> >
> > - Correctly predicting memory pressure is difficult and thereby being
> > able to online the required pages at the right time could be a
> > challenge
>
> But it will be required for your purpose, anyway. Isn't it ?
>
> > - Memory hotplug is a heavy operation, so the overhead involved may be
> > high
>
> soft-offline of small amount of pages will not very heavy.
> compaction and cma patches use the same kind of logic.
>
>
> > - Powering off memory is just one of the ways in which memory power could
> > be saved. The platform can also dynamically transition areas of memory
> > into a content-preserving lower power state if it is not referenced
> > for a pre-defined threshold of time. In such a case, we would need a
> > mechanism to soft offline the pages - i.e, no new allocations to be
> > directed to that memory
> >
>
> Hmm, sounds like a similar idea of CleanCache ?
>
> Reusing section is much easier than adding new one.., I think.
>

But sections do not define the granularity at which memory operations
are done, right? i.e., allocations/deallocations or reclaim cannot be
directed to a section or a group of sections?

--
Regards,
Ankita Garg ([email protected])
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India

2011-06-18 04:04:59

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Wed, 15 Jun 2011 22:23:21 +0530
Ankita Garg <[email protected]> wrote:

> The maximum order in buddy allocator is by default 1k pages. Isn't
> this too small a granularity to track blocks that might comprise a
> PASR unit?

we had to bump the default up a little, but not all that much

(like 4x)
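
(For reference, assuming 4KB pages: the default top-level buddy block is
2^10 = 1024 pages, i.e. 4MB; a 4x bump gives 4096 pages, i.e. 16MB per
block.)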

2011-06-20 00:00:11

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Fri, 17 Jun 2011 20:58:45 +0530
Ankita Garg <[email protected]> wrote:

> Hi,
>
> On Thu, Jun 16, 2011 at 06:12:51PM +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 16 Jun 2011 09:50:44 +0530
> > Ankita Garg <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > On Mon, Jun 13, 2011 at 01:47:01PM +0900, KAMEZAWA Hiroyuki wrote:
> > > > On Fri, 27 May 2011 18:01:28 +0530
> > > > Ankita Garg <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > >
> > > > I'm sorry if you've answered already.
> > > >
> > > > Is memory hotplug is too bad and cannot be enhanced for this purpose ?
> > > >
> > > > I wonder
> > > > - make section-size smaller (IIUC, IBM's system has 16MB section size)
> > > >
> > > > - add per section statistics
> > > >
> > > > - add a kind of balloon driver which does software memory offline
> > > > (which means making a contiguous chunk of free pages of section_size
> > > > by page migration) in background with regard to memory usage statistics.
> > > > If system says "need more memory!", balloon driver can online pages.
> > > >
> > > > can work for your purpose. It can allow you page isolatation and
> > > > controls in 16MB unit. If you need whole rework of memory hotplug, I think
> > > > it's better to rewrite memory hotplug, too.
> > > >
> > >
> > > Interesting idea, but a few issues -
> > >
> > > - Correctly predicting memory pressure is difficult and thereby being
> > > able to online the required pages at the right time could be a
> > > challenge
> >
> > But it will be required for your purpose, anyway. Isn't it ?
> >
> > > - Memory hotplug is a heavy operation, so the overhead involved may be
> > > high
> >
> > soft-offline of small amount of pages will not very heavy.
> > compaction and cma patches use the same kind of logic.
> >
> >
> > > - Powering off memory is just one of the ways in which memory power could
> > > be saved. The platform can also dynamically transition areas of memory
> > > into a content-preserving lower power state if it is not referenced
> > > for a pre-defined threshold of time. In such a case, we would need a
> > > mechanism to soft offline the pages - i.e, no new allocations to be
> > > directed to that memory
> > >
> >
> > Hmm, sounds like a similar idea of CleanCache ?
> >
> > Reusing section is much easier than adding new one.., I think.
> >
>
> But sections do not define the granularity at which memory operations
> are done, right? i.e., allocations/deallocations or reclaim cannot be
> directed to a section or a group of sections?


That is just because there are no users yet. If you become the first
user, you can add that code, I think.

Thanks,
-Kame



2011-06-29 13:01:06

by Ankita Garg

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

Hi,

On Fri, May 27, 2011 at 06:01:28PM +0530, Ankita Garg wrote:
> Hi,
>
> Modern systems offer higher CPU performance and large amount of memory in
> each generation in order to support application demands. Memory subsystem has
> began to offer wide range of capabilities for managing power consumption,
> which is driving the need to relook at the way memory is managed by the
> operating system. Linux VM subsystem has sophisticated algorithms to
> optimally manage the scarce resources for best overall system performance.
> Apart from the capacity and location of memory areas, the VM subsystem tracks
> special addressability restrictions in zones and relative distance from CPU as
> NUMA nodes if necessary. Power management capabilities in the memory subsystem
> and inclusion of different class of main memory like PCM, or non-volatile RAM,
> brings in new boundaries and attributes that needs to be tagged within the
> Linux VM subsystem for exploitation by the kernel and applications.
>

Below is the summary of the discussion we have had on this thread so
far, along with details of hardware capabilities and the VM requirements
to support memory power management.

Details of the hardware capabilities -

1) Dynamic Power Transition: The memory controller can have the ability
to automatically transition regions of memory into lower power states
when they are devoid of references for a pre-defined threshold amount of
time. Memory contents are preserved in the low power states and accessing
memory that is at a low power state takes a latency hit.

2) Dynamic Power Off: If a region is free/unallocated, the software can
indicate to the controller to completely turn off power to a certain
region. Memory contents are lost and hence the software has to be
absolutely sure about the usage statistics of the particular region. This
is a runtime capability, where the required amount of memory can be
powered 'ON' to match the workload demands.

3) Partial Array Self-Refresh (PASR): If a certain regions of memory is
free/unallocated, the software can indicate to the controller to not
refresh that region when the system goes to suspend-to-ram state and
thereby save standby power consumption.

Many embedded devices support one or more of the above capabilities.

Given the above capabilities, different levels of support are needed in
the OS to exploit the hardware features. In general we need an artificial
threshold that guards against crossing over to the next region, along with
accurate statistics on how much memory is allocated in which region, so
that reclaim can target the regions with fewer pages and possibly evacuate
them. Below are some details on potential approaches for each feature:

I) Dynamic Power Transition

The goal here is to ensure that as much as possible, on an idle system,
the memory references do not get spread across the entire RAM, a problem
similar to memory fragmentation. The proposed approach is as below:

1) One of the first things is to ensure that the memory allocations do
not spill over to more number of regions. Thus the allocator needs to
be aware of the address boundary of the different regions.

2) At the time of allocation, before spilling over allocations to the
next logical region, the allocator needs to make a best attempt to
reclaim some memory from within the existing region itself first. The
reclaim here needs to be in LRU order within the region. However, if
it is ascertained that the reclaim would take a lot of time, like there
are quite a few write-backs needed, then we can spill over to the next
memory region (just like our NUMA node allocation policy now).
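
A minimal user-space sketch of that allocation/spill decision is below;
the structures, counters and the "cheap reclaim" notion are illustrative
only, not the patchset's actual code:

#include <stdio.h>

struct region {
	const char *name;
	int nr_free;
	int nr_clean_reclaimable;	/* droppable without writeback */
	int nr_dirty_reclaimable;	/* would need writeback to free */
};

static int region_alloc(struct region *r)
{
	if (r->nr_free > 0) {
		r->nr_free--;
		return 1;
	}
	/* Cheap reclaim: drop a clean page rather than leaving the region. */
	if (r->nr_clean_reclaimable > 0) {
		r->nr_clean_reclaimable--;
		printf("reclaimed a clean page in %s\n", r->name);
		return 1;
	}
	/* Only writeback-heavy reclaim remains: spill to the next region. */
	return 0;
}

static struct region *alloc_page(struct region *regions, int nr)
{
	for (int i = 0; i < nr; i++)
		if (region_alloc(&regions[i]))
			return &regions[i];
	return NULL;
}

int main(void)
{
	struct region regions[] = {
		{ "region0", 0, 1, 5 },		/* full, but one clean page left */
		{ "region1", 100, 0, 0 },	/* untouched so far */
	};
	struct region *r = alloc_page(regions, 2);

	printf("allocation served from %s\n", r ? r->name : "nowhere");
	return 0;
}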

II) Dynamic Power Off & PASR

The goal here is to ensure that as much as possible, on an idle system,
memory allocations are consolidated and most of the regions are kept
devoid of allocations. The OS can then indicate to the controller to
turn off power to the specific regions. The requirements and proposed
approach is as below:

1) As mentioned above, one of the first things is to ensure that memory
is allocated sequentially across the regions and best effort is made to
allocate memory within a region, before going over to the next one.

2) Design OS callbacks to hardware that track the first page allocation
and the last page deallocation in a region, to better communicate to the
hardware when to power the region on and off respectively (a sketch of
such hooks follows after 3) below). Alternatively, even in the absence
of such notifications, heuristics could be used to decide on the
threshold at which to trigger the power related operation.

3) On some platforms like the Samsung Exynos 4210, while dynamic
power transition takes place at one granularity, dynamic power off is
performed at a different and a higher granularity. So, the OS needs
to be able to associate these regions into groups, to aid in making
allocation/deallocation/reclaim decisions.
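
As an illustration of the callbacks mentioned in 2) above, here is a
small user-space sketch; region_power_on()/region_power_off() are just
placeholders for whatever the real platform interface would be:

#include <stdio.h>

struct region {
	const char *name;
	unsigned long nr_allocated;
};

static void region_power_on(struct region *r)  { printf("%s: power on\n",  r->name); }
static void region_power_off(struct region *r) { printf("%s: power off\n", r->name); }

static void region_account_alloc(struct region *r)
{
	if (r->nr_allocated++ == 0)	/* first page in this region */
		region_power_on(r);
}

static void region_account_free(struct region *r)
{
	if (--r->nr_allocated == 0)	/* last page just went away */
		region_power_off(r);
}

int main(void)
{
	struct region r = { "region2", 0 };

	region_account_alloc(&r);	/* -> power on */
	region_account_alloc(&r);
	region_account_free(&r);
	region_account_free(&r);	/* -> power off */
	return 0;
}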

Approaches -

Memory Regions

>
> -----------------------------
> |N0 |N1 |N2 |N3 |.. |.. |Nn |
> -----------------------------
> / \ \
> / \ \
> / \ \
> ------------ | ------------
> | Mem Rgn0 | | | Mem Rgn3 |
> ------------ | ------------
> | | |
> | ------------ | ---------
> | | Mem Rgn1 | ->| zones |
> | ------------ ---------
> | | ---------
> | ----->| zones |
> | --------- ---------
> ->| zones |
> ---------
>

(a) A data structure to capture the hardware memory region boundaries
and also enable grouping of certain regions together for the purpose of
higher order power saving operations.

(b) Enable gathering of accurate page allocation statistics on a
per-memory region granularity

(c) Allow memory to be allocated from within a hardware region first,
target easily reclaimable pages within the current region and only then
spill over to the other regions if memory pressure is high. In an empty
region, allocation will happen sequentially anyway, but a mechanism is
needed to do targeted reclaim in LRU order within a region, to keep
allocations from spreading easily to other regions.

(d) Targeted reclaims of memory from within a memory region when its
utilization (allocation) is very low. Once the utilization of a region
falls below a certain threshold, move the remaining pages to other active
(fairly utilized) regions and evacuate the underutilized ones. This
would basically consolidate all allocated memory into a smaller number
of memory regions.

The proposed memory regions approach has the advantages of catering to
all of the above requirements, but has the disadvantage of fragmenting
zones in the system.
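
To make (a), (b) and the grouping requirement a bit more concrete, a
rough sketch of what a per-region descriptor could carry is shown below;
the field names are illustrative and are not the structures used in the
posted patches:

#include <stdio.h>

struct mem_region_stats {
	unsigned long nr_free;
	unsigned long nr_movable;
	unsigned long nr_unmovable;
};

struct mem_region_desc {
	unsigned long start_pfn;	/* hardware unit boundary */
	unsigned long end_pfn;
	int power_group;		/* regions powered off together */
	struct mem_region_stats stats;	/* feeds targeted reclaim decisions */
};

/* A region is a candidate for evacuation when almost everything left in
   it is free or movable. */
static int region_evacuation_candidate(const struct mem_region_desc *r)
{
	unsigned long present = r->end_pfn - r->start_pfn;

	return r->stats.nr_unmovable == 0 &&
	       r->stats.nr_free + r->stats.nr_movable > (present * 9) / 10;
}

int main(void)
{
	struct mem_region_desc r = {
		.start_pfn = 0x20000, .end_pfn = 0x40000,	/* 512MB of 4K pages */
		.power_group = 0,
		.stats = { .nr_free = 0x1f000, .nr_movable = 0x800, .nr_unmovable = 0 },
	};

	printf("evacuation candidate: %s\n",
	       region_evacuation_candidate(&r) ? "yes" : "no");
	return 0;
}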

Alternative suggestions that came up:

- Hacking the buddy allocator to free up chunks of pages to exploit PASR

The buddy allocator does not take any address boundary into account.
However, one approach would be to keep the boundary information in a
parallel data structure, and at the time of allocation, look hard for
the pages belonging to a particular region.

- Using lumpy reclaim as a mechanism to free up region-sized and aligned
pages from the buddy allocator; this, however, will not help in shaping
allocations

- The LRU reclaimer presently operates within a zone and does not take
into account the physical addresses. One approach could be to extend it
to reclaim the LRU pages within a given address range

- A balloon driver to offline contiguous chunks of pages. However, we
would still need a mechanism to group sections that belong to the same
region and also bias the allocations

- Modify the buddy allocator to be "picky" about when it lets you get
access to the regions

--
Regards,
Ankita Garg ([email protected])
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India

2011-06-29 17:07:48

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

I was kinda hoping for something a bit simpler than that. I'd boil down
what you were saying to this:

1. The kernel must be aware of how the pieces of hardware are
mapped in to the system's physical address space
2. The kernel must have a mechanism in place to minimize access to
specific pieces of hardware
3. For destructive power-down operations, the kernel should have a
mechanism in place to ensure that no valuable data is contained
in the memory to be powered down.

Is that complete?

On Wed, 2011-06-29 at 18:30 +0530, Ankita Garg wrote:
> 1) Dynamic Power Transition: The memory controller can have the ability
> to automatically transition regions of memory into lower power states
> when they are devoid of references for a pre-defined threshold amount of
> time. Memory contents are preserved in the low power states and accessing
> memory that is at a low power state takes a latency hit.
>
> 2) Dynamic Power Off: If a region is free/unallocated, the software can
> indicate to the controller to completely turn off power to a certain
> region. Memory contents are lost and hence the software has to be
> absolutely sure about the usage statistics of the particular region. This
> is a runtime capability, where the required amount of memory can be
> powered 'ON' to match the workload demands.
>
> 3) Partial Array Self-Refresh (PASR): If a certain regions of memory is
> free/unallocated, the software can indicate to the controller to not
> refresh that region when the system goes to suspend-to-ram state and
> thereby save standby power consumption.

(3) is simply a subset of (2), but with the additional restriction that
the power off can only occur during a suspend operation.

Let's say we fully implemented support for (2). What would be missing
to support PASR?

-- Dave

2011-06-29 17:43:29

by Ankita Garg

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

Hi,

On Wed, Jun 29, 2011 at 10:06:24AM -0700, Dave Hansen wrote:
> I was kinda hoping for something a bit simpler than that. I'd boil down
> what you were saying to this:
>
> 1. The kernel must be aware of how the pieces of hardware are
> mapped in to the system's physical address space
> 2. The kernel must have a mechanism in place to minimize access to
> specific pieces of hardware
> 3. For destructive power-down operations, the kernel should have a
> mechanism in place to ensure that no valuable data is contained
> in the memory to be powered down.
>

4. The kernel must have a mechanism to maintain utilization
statistics pertaining to a piece of hardware, so that it can
trigger the hardware to power it off
5. Being able to group these pieces of hardware for purpose of
higher savings.

> Is that complete?
>
> On Wed, 2011-06-29 at 18:30 +0530, Ankita Garg wrote:
> > 1) Dynamic Power Transition: The memory controller can have the ability
> > to automatically transition regions of memory into lower power states
> > when they are devoid of references for a pre-defined threshold amount of
> > time. Memory contents are preserved in the low power states and accessing
> > memory that is at a low power state takes a latency hit.
> >
> > 2) Dynamic Power Off: If a region is free/unallocated, the software can
> > indicate to the controller to completely turn off power to a certain
> > region. Memory contents are lost and hence the software has to be
> > absolutely sure about the usage statistics of the particular region. This
> > is a runtime capability, where the required amount of memory can be
> > powered 'ON' to match the workload demands.
> >
> > 3) Partial Array Self-Refresh (PASR): If a certain regions of memory is
> > free/unallocated, the software can indicate to the controller to not
> > refresh that region when the system goes to suspend-to-ram state and
> > thereby save standby power consumption.
>
> (3) is simply a subset of (2), but with the additional restriction that
> the power off can only occur during a suspend operation.
>
> Let's say we fully implemented support for (2). What would be missing
> to support PASR?
>

Yes, PASR is a subset of (2) from implementation perspective.

--
Regards,
Ankita Garg ([email protected])
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India

2011-06-29 17:59:40

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Wed, 2011-06-29 at 23:12 +0530, Ankita Garg wrote:
> 4. The kernel must have a mechanism to maintain utilization
> statistics pertaining to a piece of hardware, so that it can
> trigger the hardware to power it off

Having statistics like this would certainly be nice, but how important
_is_ it? Is it really a show-stopper? There's some stuff today, like
the NPT/EPT support in KVM where we don't even have visibility in to
when a given page is referenced.

It's also going to be a pain to track kernel references. On x86, our
kernel linear mapping uses 1GB pages when it can, and those are greater
than the 512MB granularity that we've been talking about here. It's
even larger on powerpc. I'm also pretty sure we don't even _look_ at
the referenced bits in the kernel page tables. We'll definitely need
some infrastructure to do that.

> 5. Being able to group these pieces of hardware for purpose of
> higher savings.

Do you really mean group, or do you mean "turn as many off as possible"?

-- Dave

2011-06-29 18:07:56

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

* Dave Hansen <[email protected]> [2011-06-29 10:06:24]:

> I was kinda hoping for something a bit simpler than that. I'd boil down
> what you were saying to this:
>
> 1. The kernel must be aware of how the pieces of hardware are
> mapped in to the system's physical address space
> 2. The kernel must have a mechanism in place to minimize access to
> specific pieces of hardware
(mainly by controlling allocations and reclaim)


> 3. For destructive power-down operations, the kernel should have a
> mechanism in place to ensure that no valuable data is contained
> in the memory to be powered down.
>
> Is that complete?

At a high level these are the main requirements, except that different
operations/features can happen at different/higher granularities. The
infrastructure should be able to relate groups of regions and act on
them for a specific optimization. For example, the granularity for (2)
may be 512MB, while (3) could be a pair of 512MB blocks. This is a
relatively minor issue to solve.

> On Wed, 2011-06-29 at 18:30 +0530, Ankita Garg wrote:
> > 1) Dynamic Power Transition: The memory controller can have the ability
> > to automatically transition regions of memory into lower power states
> > when they are devoid of references for a pre-defined threshold amount of
> > time. Memory contents are preserved in the low power states and accessing
> > memory that is at a low power state takes a latency hit.
> >
> > 2) Dynamic Power Off: If a region is free/unallocated, the software can
> > indicate to the controller to completely turn off power to a certain
> > region. Memory contents are lost and hence the software has to be
> > absolutely sure about the usage statistics of the particular region. This
> > is a runtime capability, where the required amount of memory can be
> > powered 'ON' to match the workload demands.
> >
> > 3) Partial Array Self-Refresh (PASR): If a certain regions of memory is
> > free/unallocated, the software can indicate to the controller to not
> > refresh that region when the system goes to suspend-to-ram state and
> > thereby save standby power consumption.
>
> (3) is simply a subset of (2), but with the additional restriction that
> the power off can only occur during a suspend operation.
>
> Let's say we fully implemented support for (2). What would be missing
> to support PASR?

The similarity between (2) and (3) here is the need for accurate
statistics to know allocation status. The difference is the
actuation/trigger part... in case of (2) the trigger would happen
during allocation/free while in case of (3) it happens only at suspend
time. Also the granularity could be different; generally PASR is much
finer grained than power-off at the controller level.

We can combine them and look at just how to track allocations at
different (or multiple) physical boundaries.

--Vaidy

2011-06-29 18:18:08

by Vaidyanathan Srinivasan

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

* Dave Hansen <[email protected]> [2011-06-29 10:59:02]:

> On Wed, 2011-06-29 at 23:12 +0530, Ankita Garg wrote:
> > 4. The kernel must have a mechanism to maintain utilization
> > statistics pertaining to a piece of hardware, so that it can
> > trigger the hardware to power it off
>
> Having statistics like this would certainly be nice, but how important
> _is_ it? Is it really a show-stopper? There's some stuff today, like
> the NPT/EPT support in KVM where we don't even have visibility in to
> when a given page is referenced.
>
> It's also going to be a pain to track kernel references. On x86, our
> kernel linear mapping uses 1GB pages when it can, and those are greater
> than the 512MB granularity that we've been talking about here. It's
> even larger on powerpc. I'm also pretty sure we don't even _look_ at
> the referenced bits in the kernel page tables. We'll definitely need
> some infrastructure to do that.

Utilization is all about allocated vs free and at most 'type of
allocation'. We are not looking at actual reference rates from page
tables. A free or unallocated page is not going to be referenced.


> > 5. Being able to group these pieces of hardware for purpose of
> > higher savings.
>
> Do you really mean group, or do you mean "turn as many off as possible"?

Grouping based on hardware topology could help save more power at
higher granularity. In most cases just turning as many off as
possible will work. But the design should allow grouping based on
certain rules or hierarchies.

--Vaidy

2011-06-29 20:12:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

Dave Hansen <[email protected]> writes:
>
> It's also going to be a pain to track kernel references. On x86, our

Even if you tracked them what would you do with them?

It's quite hard to stop using arbitrary kernel memory (see all the dancing
memory-failure does)

You need to track the direct accesses to user data which happens
to be accessed through the direct mapping.

Also it will be always unreliable because this all won't track DMA.
For that you would also need to track in the dma_* infrastructure,
which will likely get seriously expensive.

-Andi

--
[email protected] -- Speaking for myself only

2011-06-30 04:38:35

by Ankita Garg

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

On Wed, Jun 29, 2011 at 11:47:55PM +0530, Vaidyanathan Srinivasan wrote:
> * Dave Hansen <[email protected]> [2011-06-29 10:59:02]:
>
> > On Wed, 2011-06-29 at 23:12 +0530, Ankita Garg wrote:
>
> > > 5. Being able to group these pieces of hardware for purpose of
> > > higher savings.
> >
> > Do you really mean group, or do you mean "turn as many off as possible"?
>
> Grouping based on hardware topology could help save more power at
> higher granularity. In most cases just turning as many off as
> possible will work. But the design should allow grouping based on
> certain rules or hierarchies.
>

For instance, on the Samsung Exynos 4210 board, the controller
dynamically transitions 512MB of memory into a lower power-down state
depending on whether it is being actively referenced or not.
Additionally, if two such 512MB devices are free (as hinted by
software), the controller can cut the clock going into the memory
channel to which the two devices are connected, further reducing
power consumption.

--
Regards,
Ankita Garg ([email protected])
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India

2011-06-30 05:11:41

by Ankita Garg

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: Linux VM Infrastructure to support Memory Power Management

Hi,

On Wed, Jun 29, 2011 at 01:11:00PM -0700, Andi Kleen wrote:
> Dave Hansen <[email protected]> writes:
> >
> > It's also going to be a pain to track kernel references. On x86, our
>

As Vaidy mentioned, we are only looking at memory being either allocated
or free, as a way to evacuate it. Tracking memory references, no doubt,
is a difficult proposition and might involve a lot of overhead.

> Even if you tracked them what would you do with them?
>
> It's quite hard to stop using arbitrary kernel memory (see all the dancing
> memory-failure does)
>
> You need to track the direct accesses to user data which happens
> to be accessed through the direct mapping.
>
> Also it will be always unreliable because this all won't track DMA.
> For that you would also need to track in the dma_* infrastructure,
> which will likely get seriously expensive.
>

--
Regards,
Ankita Garg ([email protected])
Linux Technology Center
IBM India Systems & Technology Labs,
Bangalore, India