Changelog since v1
o topology comment updates
When it was introduced, zone_reclaim_mode made sense as NUMA distances
punished and workloads were generally partitioned to fit into a NUMA
node. NUMA machines are now common but few of the workloads are NUMA-aware
and it's routine to see major performance due to zone_reclaim_mode being
enabled but relatively few can identify the problem.
Those that require zone_reclaim_mode are likely to be able to detect when
it needs to be enabled and tune appropriately so lets have a sensible
default for the bulk of users.
Documentation/sysctl/vm.txt | 17 +++++++++--------
arch/ia64/include/asm/topology.h | 3 ++-
arch/powerpc/include/asm/topology.h | 8 ++------
include/linux/mmzone.h | 1 -
include/linux/topology.h | 3 ++-
mm/page_alloc.c | 17 +----------------
6 files changed, 16 insertions(+), 33 deletions(-)
--
1.8.4.5
zone_reclaim_mode causes processes to prefer reclaiming memory from local
node instead of spilling over to other nodes. This made sense initially when
NUMA machines were almost exclusively HPC and the workload was partitioned
into nodes. The NUMA penalties were sufficiently high to justify reclaiming
the memory. On current machines and workloads it is often the case that
zone_reclaim_mode destroys performance but not all users know how to detect
this. Favour the common case and disable it by default. Users that are
sophisticated enough to know they need zone_reclaim_mode will detect it.
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
Documentation/sysctl/vm.txt | 17 +++++++++--------
arch/ia64/include/asm/topology.h | 3 ++-
arch/powerpc/include/asm/topology.h | 8 ++------
include/linux/topology.h | 3 ++-
mm/page_alloc.c | 2 --
5 files changed, 15 insertions(+), 18 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index d614a9b..ff5da70 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -751,16 +751,17 @@ This is value ORed together of
2 = Zone reclaim writes dirty pages out
4 = Zone reclaim swaps pages
-zone_reclaim_mode is set during bootup to 1 if it is determined that pages
-from remote zones will cause a measurable performance reduction. The
-page allocator will then reclaim easily reusable pages (those page
-cache pages that are currently not used) before allocating off node pages.
-
-It may be beneficial to switch off zone reclaim if the system is
-used for a file server and all of memory should be used for caching files
-from disk. In that case the caching effect is more important than
+zone_reclaim_mode is disabled by default. For file servers or workloads
+that benefit from having their data cached, zone_reclaim_mode should be
+left disabled as the caching effect is likely to be more important than
data locality.
+zone_reclaim may be enabled if it's known that the workload is partitioned
+such that each partition fits within a NUMA node and that accessing remote
+memory would cause a measurable performance reduction. The page allocator
+will then reclaim easily reusable pages (those page cache pages that are
+currently not used) before allocating off node pages.
+
Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
diff --git a/arch/ia64/include/asm/topology.h b/arch/ia64/include/asm/topology.h
index a2496e4..e1d484e 100644
--- a/arch/ia64/include/asm/topology.h
+++ b/arch/ia64/include/asm/topology.h
@@ -21,7 +21,8 @@
#define PENALTY_FOR_NODE_WITH_CPUS 255
/*
- * Distance above which we begin to use zone reclaim
+ * Nodes within this distance are eligible for reclaim by zone_reclaim() when
+ * zone_reclaim_mode is enabled.
*/
#define RECLAIM_DISTANCE 15
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index d0b5fca..c778ca0 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -9,12 +9,8 @@ struct device_node;
#ifdef CONFIG_NUMA
/*
- * Before going off node we want the VM to try and reclaim from the local
- * node. It does this if the remote distance is larger than RECLAIM_DISTANCE.
- * With the default REMOTE_DISTANCE of 20 and the default RECLAIM_DISTANCE of
- * 20, we never reclaim and go off node straight away.
- *
- * To fix this we choose a smaller value of RECLAIM_DISTANCE.
+ * If zone_reclaim_mode is enabled, a RECLAIM_DISTANCE of 10 will mean that
+ * all zones on all nodes will be eligible for zone_reclaim().
*/
#define RECLAIM_DISTANCE 10
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 12ae6ce..a1e9186 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -58,7 +58,8 @@ int arch_update_cpu_topology(void);
/*
* If the distance between nodes in a system is larger than RECLAIM_DISTANCE
* (in whatever arch specific measurement units returned by node_distance())
- * then switch on zone reclaim on boot.
+ * and zone_reclaim_mode is enabled then the VM will only call zone_reclaim()
+ * on nodes within this distance.
*/
#define RECLAIM_DISTANCE 30
#endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3bac76a..a256f85 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1873,8 +1873,6 @@ static void __paginginit init_zone_allows_reclaim(int nid)
for_each_online_node(i)
if (node_distance(nid, i) <= RECLAIM_DISTANCE)
node_set(i, NODE_DATA(nid)->reclaim_nodes);
- else
- zone_reclaim_mode = 1;
}
#else /* CONFIG_NUMA */
--
1.8.4.5
pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by
zone_reclaim due to its distance. As it is expected that zone_reclaim_mode
will be rarely enabled it is unreasonable for all machines to take a penalty.
Fortunately, the zone_reclaim_mode() path is already slow and it is the path
that takes the hit.
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
include/linux/mmzone.h | 1 -
mm/page_alloc.c | 15 +--------------
2 files changed, 1 insertion(+), 15 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9b61b9b..564b169 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -757,7 +757,6 @@ typedef struct pglist_data {
unsigned long node_spanned_pages; /* total size of physical page
range, including holes */
int node_id;
- nodemask_t reclaim_nodes; /* Nodes allowed to reclaim from */
wait_queue_head_t kswapd_wait;
wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd; /* Protected by lock_memory_hotplug() */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a256f85..574928e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1863,16 +1863,7 @@ static bool zone_local(struct zone *local_zone, struct zone *zone)
static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
{
- return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
-}
-
-static void __paginginit init_zone_allows_reclaim(int nid)
-{
- int i;
-
- for_each_online_node(i)
- if (node_distance(nid, i) <= RECLAIM_DISTANCE)
- node_set(i, NODE_DATA(nid)->reclaim_nodes);
+ return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) < RECLAIM_DISTANCE;
}
#else /* CONFIG_NUMA */
@@ -1906,9 +1897,6 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
return true;
}
-static inline void init_zone_allows_reclaim(int nid)
-{
-}
#endif /* CONFIG_NUMA */
/*
@@ -4917,7 +4905,6 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
pgdat->node_id = nid;
pgdat->node_start_pfn = node_start_pfn;
- init_zone_allows_reclaim(nid);
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
#endif
--
1.8.4.5
On Tue, 8 Apr 2014 09:22:58 +0100 Mel Gorman <[email protected]> wrote:
> Changelog since v1
> o topology comment updates
>
> When it was introduced, zone_reclaim_mode made sense as NUMA distances
> punished and workloads were generally partitioned to fit into a NUMA
> node. NUMA machines are now common but few of the workloads are NUMA-aware
> and it's routine to see major performance due to zone_reclaim_mode being
> enabled but relatively few can identify the problem.
>
> Those that require zone_reclaim_mode are likely to be able to detect when
> it needs to be enabled and tune appropriately so lets have a sensible
> default for the bulk of users.
>
This patchset conflicts with
commit 70ef57e6c22c3323dce179b7d0d433c479266612
Author: Michal Hocko <[email protected]>
AuthorDate: Mon Apr 7 15:37:01 2014 -0700
Commit: Linus Torvalds <[email protected]>
CommitDate: Mon Apr 7 16:35:50 2014 -0700
mm: exclude memoryless nodes from zone_reclaim
It was pretty simple to resolve, but please check that I didn't miss
anything.
>>>>> "Andrew" == Andrew Morton <[email protected]> writes:
Andrew> On Tue, 8 Apr 2014 09:22:58 +0100 Mel Gorman <[email protected]> wrote:
>> Changelog since v1
>> o topology comment updates
>>
>> When it was introduced, zone_reclaim_mode made sense as NUMA distances
>> punished and workloads were generally partitioned to fit into a NUMA
>> node. NUMA machines are now common but few of the workloads are NUMA-aware
>> and it's routine to see major performance due to zone_reclaim_mode being
>> enabled but relatively few can identify the problem.
This is unclear here. "see major performance <what> due" doesn't make
sense to me.
On Fri, Apr 18, 2014 at 04:48:25PM -0400, John Stoffel wrote:
> >>>>> "Andrew" == Andrew Morton <[email protected]> writes:
>
> Andrew> On Tue, 8 Apr 2014 09:22:58 +0100 Mel Gorman <[email protected]> wrote:
> >> Changelog since v1
> >> o topology comment updates
> >>
> >> When it was introduced, zone_reclaim_mode made sense as NUMA distances
> >> punished and workloads were generally partitioned to fit into a NUMA
> >> node. NUMA machines are now common but few of the workloads are NUMA-aware
> >> and it's routine to see major performance due to zone_reclaim_mode being
> >> enabled but relatively few can identify the problem.
>
>
> This is unclear here. "see major performance <what> due" doesn't make
> sense to me.
>
Degradation
--
Mel Gorman
SUSE Labs