When it was introduced, zone_reclaim_mode made sense: NUMA distances were
large enough to punish remote accesses and workloads were generally
partitioned to fit into a NUMA node. NUMA machines are now common but few
of the workloads are NUMA-aware, and it's routine to see major performance
regressions due to zone_reclaim_mode being enabled while relatively few
users can identify the problem.
Those that require zone_reclaim_mode are likely to be able to detect when
it needs to be enabled and tune appropriately, so let's have a sensible
default for the bulk of users.
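As a quick illustration (not part of the series), the information the old
boot heuristic keyed off can be inspected from userspace. The sketch below
is only an example and assumes a NUMA kernel exposing the usual
/proc/sys/vm/zone_reclaim_mode and sysfs node distance files:

/*
 * Print the current zone_reclaim_mode and node 0's SLIT distances --
 * the distances are what the removed boot heuristic compared against
 * RECLAIM_DISTANCE (30 by default) to decide whether to auto-enable.
 */
#include <stdio.h>

static void dump(const char *path)
{
	char buf[256];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("%s: %s", path, buf);
	fclose(f);
}

int main(void)
{
	dump("/proc/sys/vm/zone_reclaim_mode");
	dump("/sys/devices/system/node/node0/distance");
	return 0;
}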
Documentation/sysctl/vm.txt | 17 +++++++++--------
include/linux/mmzone.h | 1 -
mm/page_alloc.c | 17 +----------------
3 files changed, 10 insertions(+), 25 deletions(-)
--
1.8.4.5
pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by
zone_reclaim due to its distance. As it is expected that zone_reclaim_mode
will be rarely enabled it is unreasonable for all machines to take a penalty.
Fortunately, the zone_reclaim_mode() path is already slow and it is the path
that takes the hit.
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 1 -
mm/page_alloc.c | 15 +--------------
2 files changed, 1 insertion(+), 15 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9b61b9b..564b169 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -757,7 +757,6 @@ typedef struct pglist_data {
unsigned long node_spanned_pages; /* total size of physical page
range, including holes */
int node_id;
- nodemask_t reclaim_nodes; /* Nodes allowed to reclaim from */
wait_queue_head_t kswapd_wait;
wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd; /* Protected by lock_memory_hotplug() */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a256f85..574928e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1863,16 +1863,7 @@ static bool zone_local(struct zone *local_zone, struct zone *zone)
static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
{
- return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
-}
-
-static void __paginginit init_zone_allows_reclaim(int nid)
-{
- int i;
-
- for_each_online_node(i)
- if (node_distance(nid, i) <= RECLAIM_DISTANCE)
- node_set(i, NODE_DATA(nid)->reclaim_nodes);
+ return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) < RECLAIM_DISTANCE;
}
#else /* CONFIG_NUMA */
@@ -1906,9 +1897,6 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
return true;
}
-static inline void init_zone_allows_reclaim(int nid)
-{
-}
#endif /* CONFIG_NUMA */
/*
@@ -4917,7 +4905,6 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
pgdat->node_id = nid;
pgdat->node_start_pfn = node_start_pfn;
- init_zone_allows_reclaim(nid);
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
#endif
--
1.8.4.5
zone_reclaim_mode causes processes to prefer reclaiming memory from local
node instead of spilling over to other nodes. This made sense initially when
NUMA machines were almost exclusively HPC and the workload was partitioned
into nodes. The NUMA penalties were sufficiently high to justify reclaiming
the memory. On current machines and workloads it is often the case that
zone_reclaim_mode destroys performance but not all users know how to detect
this. Favour the common case and disable it by default. Users that are
sophisticated enough to know they need zone_reclaim_mode will detect it.
Signed-off-by: Mel Gorman <[email protected]>
---
Documentation/sysctl/vm.txt | 17 +++++++++--------
mm/page_alloc.c | 2 --
2 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index d614a9b..ff5da70 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -751,16 +751,17 @@ This is value ORed together of
2 = Zone reclaim writes dirty pages out
4 = Zone reclaim swaps pages
-zone_reclaim_mode is set during bootup to 1 if it is determined that pages
-from remote zones will cause a measurable performance reduction. The
-page allocator will then reclaim easily reusable pages (those page
-cache pages that are currently not used) before allocating off node pages.
-
-It may be beneficial to switch off zone reclaim if the system is
-used for a file server and all of memory should be used for caching files
-from disk. In that case the caching effect is more important than
+zone_reclaim_mode is disabled by default. For file servers or workloads
+that benefit from having their data cached, zone_reclaim_mode should be
+left disabled as the caching effect is likely to be more important than
data locality.
+zone_reclaim may be enabled if it's known that the workload is partitioned
+such that each partition fits within a NUMA node and that accessing remote
+memory would cause a measurable performance reduction. The page allocator
+will then reclaim easily reusable pages (those page cache pages that are
+currently not used) before allocating off node pages.
+
Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3bac76a..a256f85 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1873,8 +1873,6 @@ static void __paginginit init_zone_allows_reclaim(int nid)
for_each_online_node(i)
if (node_distance(nid, i) <= RECLAIM_DISTANCE)
node_set(i, NODE_DATA(nid)->reclaim_nodes);
- else
- zone_reclaim_mode = 1;
}
#else /* CONFIG_NUMA */
--
1.8.4.5
On Mon, Apr 07, 2014 at 11:34:27PM +0100, Mel Gorman wrote:
> zone_reclaim_mode causes processes to prefer reclaiming memory from local
> node instead of spilling over to other nodes. This made sense initially when
> NUMA machines were almost exclusively HPC and the workload was partitioned
> into nodes. The NUMA penalties were sufficiently high to justify reclaiming
> the memory. On current machines and workloads it is often the case that
> zone_reclaim_mode destroys performance but not all users know how to detect
> this. Favour the common case and disable it by default. Users that are
> sophisticated enough to know they need zone_reclaim_mode will detect it.
>
> Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On Mon, Apr 07, 2014 at 11:34:28PM +0100, Mel Gorman wrote:
> pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by
> zone_reclaim due to its distance. As it is expected that zone_reclaim_mode
> will be rarely enabled it is unreasonable for all machines to take a penalty.
> Fortunately, the zone_reclaim_mode() path is already slow and it is the path
> that takes the hit.
>
> Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On 04/08/2014 06:34 AM, Mel Gorman wrote:
> zone_reclaim_mode causes processes to prefer reclaiming memory from local
> node instead of spilling over to other nodes. This made sense initially when
> NUMA machines were almost exclusively HPC and the workload was partitioned
> into nodes. The NUMA penalties were sufficiently high to justify reclaiming
> the memory. On current machines and workloads it is often the case that
> zone_reclaim_mode destroys performance but not all users know how to detect
> this. Favour the common case and disable it by default. Users that are
> sophisticated enough to know they need zone_reclaim_mode will detect it.
>
> Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
> ---
> Documentation/sysctl/vm.txt | 17 +++++++++--------
> mm/page_alloc.c | 2 --
> 2 files changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index d614a9b..ff5da70 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -751,16 +751,17 @@ This is value ORed together of
> 2 = Zone reclaim writes dirty pages out
> 4 = Zone reclaim swaps pages
>
> -zone_reclaim_mode is set during bootup to 1 if it is determined that pages
> -from remote zones will cause a measurable performance reduction. The
> -page allocator will then reclaim easily reusable pages (those page
> -cache pages that are currently not used) before allocating off node pages.
> -
> -It may be beneficial to switch off zone reclaim if the system is
> -used for a file server and all of memory should be used for caching files
> -from disk. In that case the caching effect is more important than
> +zone_reclaim_mode is disabled by default. For file servers or workloads
> +that benefit from having their data cached, zone_reclaim_mode should be
> +left disabled as the caching effect is likely to be more important than
> data locality.
>
> +zone_reclaim may be enabled if it's known that the workload is partitioned
> +such that each partition fits within a NUMA node and that accessing remote
> +memory would cause a measurable performance reduction. The page allocator
> +will then reclaim easily reusable pages (those page cache pages that are
> +currently not used) before allocating off node pages.
> +
> Allowing zone reclaim to write out pages stops processes that are
> writing large amounts of data from dirtying pages on other nodes. Zone
> reclaim will write out dirty pages if a zone fills up and so effectively
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3bac76a..a256f85 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1873,8 +1873,6 @@ static void __paginginit init_zone_allows_reclaim(int nid)
> for_each_online_node(i)
> if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> node_set(i, NODE_DATA(nid)->reclaim_nodes);
> - else
> - zone_reclaim_mode = 1;
> }
>
> #else /* CONFIG_NUMA */
>
--
Thanks.
Zhang Yanfei
On 04/08/2014 06:34 AM, Mel Gorman wrote:
> pgdat->reclaim_nodes tracks if a remote node is allowed to be reclaimed by
> zone_reclaim due to its distance. As it is expected that zone_reclaim_mode
> will be rarely enabled it is unreasonable for all machines to take a penalty.
> Fortunately, the zone_reclaim_mode() path is already slow and it is the path
> that takes the hit.
>
> Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
> ---
> include/linux/mmzone.h | 1 -
> mm/page_alloc.c | 15 +--------------
> 2 files changed, 1 insertion(+), 15 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 9b61b9b..564b169 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -757,7 +757,6 @@ typedef struct pglist_data {
> unsigned long node_spanned_pages; /* total size of physical page
> range, including holes */
> int node_id;
> - nodemask_t reclaim_nodes; /* Nodes allowed to reclaim from */
> wait_queue_head_t kswapd_wait;
> wait_queue_head_t pfmemalloc_wait;
> struct task_struct *kswapd; /* Protected by lock_memory_hotplug() */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a256f85..574928e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1863,16 +1863,7 @@ static bool zone_local(struct zone *local_zone, struct zone *zone)
>
> static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> {
> - return node_isset(local_zone->node, zone->zone_pgdat->reclaim_nodes);
> -}
> -
> -static void __paginginit init_zone_allows_reclaim(int nid)
> -{
> - int i;
> -
> - for_each_online_node(i)
> - if (node_distance(nid, i) <= RECLAIM_DISTANCE)
> - node_set(i, NODE_DATA(nid)->reclaim_nodes);
> + return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) < RECLAIM_DISTANCE;
> }
>
> #else /* CONFIG_NUMA */
> @@ -1906,9 +1897,6 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
> return true;
> }
>
> -static inline void init_zone_allows_reclaim(int nid)
> -{
> -}
> #endif /* CONFIG_NUMA */
>
> /*
> @@ -4917,7 +4905,6 @@ void __paginginit free_area_init_node(int nid, unsigned long *zones_size,
>
> pgdat->node_id = nid;
> pgdat->node_start_pfn = node_start_pfn;
> - init_zone_allows_reclaim(nid);
> #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
> get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
> #endif
>
--
Thanks.
Zhang Yanfei
Hi,
On 2014-04-07 23:34:27 +0100, Mel Gorman wrote:
> zone_reclaim_mode causes processes to prefer reclaiming memory from local
> node instead of spilling over to other nodes. This made sense initially when
> NUMA machines were almost exclusively HPC and the workload was partitioned
> into nodes. The NUMA penalties were sufficiently high to justify reclaiming
> the memory. On current machines and workloads it is often the case that
> zone_reclaim_mode destroys performance but not all users know how to detect
> this. Favour the common case and disable it by default. Users that are
> sophisticated enough to know they need zone_reclaim_mode will detect it.
Unsurprisingly I am in favor of this.
> Documentation/sysctl/vm.txt | 17 +++++++++--------
> mm/page_alloc.c | 2 --
> 2 files changed, 9 insertions(+), 10 deletions(-)
But I think linux/topology.h's comment about RECLAIM_DISTANCE should be
adapted as well.
Thanks,
Andres
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 04/08/2014 12:34 AM, Mel Gorman wrote:
> When it was introduced, zone_reclaim_mode made sense as NUMA distances
> punished and workloads were generally partitioned to fit into a NUMA
> node. NUMA machines are now common but few of the workloads are NUMA-aware
> and it's routine to see major performance due to zone_reclaim_mode being
> disabled but relatively few can identify the problem.
^ I think you meant "enabled" here?
Just in case the cover letter goes to the changelog...
Vlastimil
> Those that require zone_reclaim_mode are likely to be able to detect when
> it needs to be enabled and tune appropriately so lets have a sensible
> default for the bulk of users.
>
> Documentation/sysctl/vm.txt | 17 +++++++++--------
> include/linux/mmzone.h | 1 -
> mm/page_alloc.c | 17 +----------------
> 3 files changed, 10 insertions(+), 25 deletions(-)
>
On Mon, 7 Apr 2014, Mel Gorman wrote:
> zone_reclaim_mode causes processes to prefer reclaiming memory from local
> node instead of spilling over to other nodes. This made sense initially when
> NUMA machines were almost exclusively HPC and the workload was partitioned
> into nodes. The NUMA penalties were sufficiently high to justify reclaiming
> the memory. On current machines and workloads it is often the case that
> zone_reclaim_mode destroys performance but not all users know how to detect
> this. Favour the common case and disable it by default. Users that are
> sophisticated enough to know they need zone_reclaim_mode will detect it.
Ok that is going to require SGI machines to deal with zone_reclaim
configurations on bootup. Dimitri? Any comments?
On Tue, 8 Apr 2014, Vlastimil Babka wrote:
> On 04/08/2014 12:34 AM, Mel Gorman wrote:
> > When it was introduced, zone_reclaim_mode made sense as NUMA distances
> > punished and workloads were generally partitioned to fit into a NUMA
> > node. NUMA machines are now common but few of the workloads are NUMA-aware
> > and it's routine to see major performance due to zone_reclaim_mode being
> > disabled but relatively few can identify the problem.
> ^ I think you meant "enabled" here?
>
> Just in case the cover letter goes to the changelog...
Correct.
Another solution here would be to increase the threshold so that
4 socket machines do not enable zone reclaim by default. The larger the
NUMA system is, the more memory is off-node from the perspective of a
given processor, and the larger the hit from remote memory.
On the other hand: The more expensive we make reclaim the less it
makes sense to allow zone reclaim to occur.
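For illustration only, that alternative would amount to keeping the
existing init_zone_allows_reclaim() but raising the cutoff above typical
one-hop SLIT distances (RECLAIM_DISTANCE is 30 by default in
include/linux/topology.h; the 40 below is a placeholder, not a proposal):

/*
 * Hypothetical sketch of a raised auto-enable threshold.  Tightly
 * coupled 2-8 socket systems usually report inter-node distances of
 * around 20-21 and would stay below the cutoff; very large systems
 * would still flip zone_reclaim_mode on at boot.
 */
static void __paginginit init_zone_allows_reclaim(int nid)
{
	int i;

	for_each_online_node(i)
		if (node_distance(nid, i) <= 40 /* was RECLAIM_DISTANCE */)
			node_set(i, NODE_DATA(nid)->reclaim_nodes);
		else
			zone_reclaim_mode = 1;
}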
On 2014-04-08 09:17:04 -0500, Christoph Lameter wrote:
> On Tue, 8 Apr 2014, Vlastimil Babka wrote:
>
> > On 04/08/2014 12:34 AM, Mel Gorman wrote:
> > > When it was introduced, zone_reclaim_mode made sense as NUMA distances
> > > punished and workloads were generally partitioned to fit into a NUMA
> > > node. NUMA machines are now common but few of the workloads are NUMA-aware
> > > and it's routine to see major performance due to zone_reclaim_mode being
> > > disabled but relatively few can identify the problem.
> > ^ I think you meant "enabled" here?
> >
> > Just in case the cover letter goes to the changelog...
>
> Correct.
>
> Another solution here would be to increase the threshhold so that
> 4 socket machines do not enable zone reclaim by default. The larger the
> NUMA system is the more memory is off node from the perspective of a
> processor and the larger the hit from remote memory.
FWIW, I've seen the problem hit hardest on 8 socket machines. Those are the
largest I have seen so far in postgres scenarios. Everything larger is
far less likely to be used as a single-node database server, so that's
possibly a sensible cutoff.
But then, I'd think that such many-socket machines are set up by
specialists who'd know to enable it if it makes sense...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Apr 08, 2014 at 09:14:05AM -0500, Christoph Lameter wrote:
> On Mon, 7 Apr 2014, Mel Gorman wrote:
>
> > zone_reclaim_mode causes processes to prefer reclaiming memory from local
> > node instead of spilling over to other nodes. This made sense initially when
> > NUMA machines were almost exclusively HPC and the workload was partitioned
> > into nodes. The NUMA penalties were sufficiently high to justify reclaiming
> > the memory. On current machines and workloads it is often the case that
> > zone_reclaim_mode destroys performance but not all users know how to detect
> > this. Favour the common case and disable it by default. Users that are
> > sophisticated enough to know they need zone_reclaim_mode will detect it.
>
> Ok that is going to require SGI machines to deal with zone_reclaim
> configurations on bootup. Dimitri? Any comments?
>
The SGI machines are also likely to be managed by system administrators
who are both aware of zone_reclaim_mode and know how to evaluate if it
should be enabled or not. The pair of patches is really aimed at the
common case of 2-8 socket machines running workloads that are not NUMA
aware.
--
Mel Gorman
SUSE Labs
On 04/08/2014 10:17 AM, Christoph Lameter wrote:
> Another solution here would be to increase the threshhold so that
> 4 socket machines do not enable zone reclaim by default. The larger the
> NUMA system is the more memory is off node from the perspective of a
> processor and the larger the hit from remote memory.
8 and 16 socket machines aren't common for nonspecialist workloads
*now*, but by the time these changes make it into supported distribution
kernels, they may very well be. So having zone_reclaim_mode
automatically turn itself on if you have more than 8 sockets would still
be a booby-trap ("Boss, I dunno. I installed the additional processors
and memory performance went to hell!")
For zone_reclaim_mode=1 to be useful on standard servers, both of the
following need to be true:
1. the user has to have set CPU affinity for their applications;
2. the applications can't need more than one memory bank worth of cache.
The thing is, there is *no way* for Linux to know if the above is true.
Now, I can certainly imagine non-HPC workloads for which both of the
above would be true; for example, I've set up VMware ESX servers where
each VM has one socket and one memory bank. However, if the user knows
enough to set up socket affinity, they know enough to set
zone_reclaim_mode = 1. The default should cover the know-nothing case,
not the experienced specialist case.
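For illustration, the specialist case described here boils down to two
steps that could be scripted or, as in this rough sketch, done directly.
It pins to CPU 0 only as a stand-in (a real setup would use the node's
full cpulist, e.g. from /sys/devices/system/node/node0/cpulist) and needs
root to write the sysctl:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;
	FILE *f;

	/* Pin ourselves; stand-in for "all CPUs of the target node". */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");

	/* Opt in to zone reclaim, the step such users already know about. */
	f = fopen("/proc/sys/vm/zone_reclaim_mode", "w");
	if (!f) {
		perror("zone_reclaim_mode");
		return 1;
	}
	fprintf(f, "1\n");
	fclose(f);
	return 0;
}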
I'd also argue that there's a fundamental false assumption in the entire
algorithm of zone_reclaim_mode, because there is no memory bank which is
as distant as disk is, ever. However, if it's off by default, then I
don't care.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, Apr 8, 2014 at 10:17 AM, Christoph Lameter <[email protected]> wrote:
> Another solution here would be to increase the threshhold so that
> 4 socket machines do not enable zone reclaim by default. The larger the
> NUMA system is the more memory is off node from the perspective of a
> processor and the larger the hit from remote memory.
Well, as Josh quite rightly said, the hit from accessing remote memory
is never going to be as large as the hit from disk. If and when there
is a machine where remote memory is more expensive to access than
disk, that's a good argument for zone_reclaim_mode. But I don't
believe that's anywhere close to being true today, even on an 8-socket
machine with an SSD.
Now, perhaps the fear is that if we access that remote memory
*repeatedly* the aggregate cost will exceed what it would have cost to
fault that page into the local node just once. But it takes a lot of
accesses for that to be true, and most of the time you won't get them.
Even if you do, I bet many workloads will prefer even performance
across all the accesses over a very slow first access followed by
slightly faster subsequent accesses.
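To put rough numbers on "a lot of accesses" (every figure here is an
assumption for illustration, not a measurement): if a remote access costs
roughly an extra 50ns over a local one, and reclaiming local page cache to
make room later costs a refault from disk at around 5ms, the page has to
be touched about 100,000 times before eager local placement wins:

/* Back-of-envelope break-even: extra remote accesses per reclaim-induced
 * refault.  Both latencies are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
	double remote_penalty_ns = 50.0;	/* assumed extra cost per remote access */
	double refault_cost_ns = 5e6;		/* assumed cost of re-reading an evicted page */

	printf("break-even accesses per page: %.0f\n",
	       refault_cost_ns / remote_penalty_ns);
	return 0;
}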
In an ideal world, the kernel would put the hottest pages on the local
node and the less-hot pages on remote nodes, moving pages around as
the workload shifts. In practice, that's probably pretty hard.
Fortunately, it's not nearly as important as making sure we don't
unnecessarily hit the disk, which is infinitely slower than any memory
bank.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 04/08/2014 03:53 PM, Robert Haas wrote:
> In an ideal world, the kernel would put the hottest pages on the local
> node and the less-hot pages on remote nodes, moving pages around as
> the workload shifts. In practice, that's probably pretty hard.
> Fortunately, it's not nearly as important as making sure we don't
> unnecessarily hit the disk, which is infinitely slower than any memory
> bank.
Even if the kernel could do this, we would *still* have to disable it
for PostgreSQL, since our double-buffering makes our pages look "cold"
to the kernel ... as discussed.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Tue, 8 Apr 2014, Robert Haas wrote:
> Well, as Josh quite rightly said, the hit from accessing remote memory
> is never going to be as large as the hit from disk. If and when there
> is a machine where remote memory is more expensive to access than
> disk, that's a good argument for zone_reclaim_mode. But I don't
> believe that's anywhere close to being true today, even on an 8-socket
> machine with an SSD.
I am not sure how disk figures into this?
The tradeoff is zone reclaim vs. the aggregate performance
degradation of the remote memory accesses. That depends on the
cacheability of the app and the scale of memory accesses.
The reason that zone reclaim is on by default is that off node accesses
are a big performance hit on large scale NUMA systems (like ScaleMP and
SGI). Zone reclaim was written *because* those systems experienced severe
performance degradation.
On the tightly coupled 4 and 8 node systems there does not seem to
be a benefit from what I hear.
> Now, perhaps the fear is that if we access that remote memory
> *repeatedly* the aggregate cost will exceed what it would have cost to
> fault that page into the local node just once. But it takes a lot of
> accesses for that to be true, and most of the time you won't get them.
> Even if you do, I bet many workloads will prefer even performance
> across all the accesses over a very slow first access followed by
> slightly faster subsequent accesses.
Many HPC workloads prefer the opposite.
> In an ideal world, the kernel would put the hottest pages on the local
> node and the less-hot pages on remote nodes, moving pages around as
> the workload shifts. In practice, that's probably pretty hard.
> Fortunately, it's not nearly as important as making sure we don't
> unnecessarily hit the disk, which is infinitely slower than any memory
> bank.
Shifting pages involves similar tradeoffs as zone reclaim vs. remote
allocations.
On Tue, Apr 08, 2014 at 05:58:21PM -0500, Christoph Lameter wrote:
> On Tue, 8 Apr 2014, Robert Haas wrote:
>
> > Well, as Josh quite rightly said, the hit from accessing remote memory
> > is never going to be as large as the hit from disk. If and when there
> > is a machine where remote memory is more expensive to access than
> > disk, that's a good argument for zone_reclaim_mode. But I don't
> > believe that's anywhere close to being true today, even on an 8-socket
> > machine with an SSD.
>
> I am nost sure how disk figures into this?
>
It's a matter of perspective. Those running file servers, databases and
the like don't see the remote accesses; they see their page cache getting
reclaimed, but not all of those users understand why because they are not
NUMA-aware. This is why they perceive the cost of zone_reclaim_mode as
IO-related.
I think pretty much 100% of the bug reports I've seen related to
zone_reclaim_mode were due to IO-intensive workloads and the user not
recognising why page cache was getting reclaimed aggressively.
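As a practical aside, the counters in /proc/vmstat are usually the first
hint: page cache shrinking while allocations stay node-local, or a
nonzero zone_reclaim_failed, can point at zone reclaim rather than the IO
stack. A minimal sketch that just dumps the relevant lines (counter names
as found on NUMA kernels of this era):

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f) {
		perror("/proc/vmstat");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "numa_", 5) ||
		    !strncmp(line, "zone_reclaim_failed", 19))
			fputs(line, stdout);
	fclose(f);
	return 0;
}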
> The tradeoff is zone reclaim vs. the aggregate performance
> degradation of the remote memory accesses. That depends on the
> cacheability of the app and the scale of memory accesses.
>
For HPC, yes.
> The reason that zone reclaim is on by default is that off node accesses
> are a big performance hit on large scale NUMA systems (like ScaleMP and
> SGI). Zone reclaim was written *because* those system experienced severe
> performance degradation.
>
Yes, this is understood. However, those same people already know how to use
cpusets, NUMA bindings and how to tune their workload to partition it into
the nodes. From a NUMA perspective they are relatively sophisticated and
know how and when to set zone_reclaim_mode. At least on any bug report I've
seen related to these really large machines, they were already using cpusets.
This is why I think the default for zone_reclaim should now be off
because it helps the common case.
> On the tightly coupled 4 and 8 node systems there does not seem to
> be a benefit from what I hear.
>
> > Now, perhaps the fear is that if we access that remote memory
> > *repeatedly* the aggregate cost will exceed what it would have cost to
> > fault that page into the local node just once. But it takes a lot of
> > accesses for that to be true, and most of the time you won't get them.
> > Even if you do, I bet many workloads will prefer even performance
> > across all the accesses over a very slow first access followed by
> > slightly faster subsequent accesses.
>
> Many HPC workloads prefer the opposite.
>
And they know how to tune accordingly.
> > In an ideal world, the kernel would put the hottest pages on the local
> > node and the less-hot pages on remote nodes, moving pages around as
> > the workload shifts. In practice, that's probably pretty hard.
> > Fortunately, it's not nearly as important as making sure we don't
> > unnecessarily hit the disk, which is infinitely slower than any memory
> > bank.
>
> Shifting pages involves similar tradeoffs as zone reclaim vs. remote
> allocations.
In practice it really is hard for the kernel to do this
automatically. Automatic NUMA balancing will help if the data is mapped but
not if it's buffered read/writes because there is no hinting information
available right now. At some point we may need to tackle IO locality but
it'll take time for users to get experience with automatic balancing as
it is before taking further steps. That's an aside to the current discussion.
--
Mel Gorman
SUSE Labs
On Tue, Apr 08, 2014 at 03:56:49PM -0400, Josh Berkus wrote:
> On 04/08/2014 03:53 PM, Robert Haas wrote:
> > In an ideal world, the kernel would put the hottest pages on the local
> > node and the less-hot pages on remote nodes, moving pages around as
> > the workload shifts. In practice, that's probably pretty hard.
> > Fortunately, it's not nearly as important as making sure we don't
> > unnecessarily hit the disk, which is infinitely slower than any memory
> > bank.
>
> Even if the kernel could do this, we would *still* have to disable it
> for PostgreSQL, since our double-buffering makes our pages look "cold"
> to the kernel ... as discussed.
>
If it's the shared mapping that is being used then automatic NUMA
balancing should migrate those pages to a node local to the CPU
accessing them, but how well that works will partially depend on how much
those accesses move around. It's independent of the zone_reclaim_mode
issue.
--
Mel Gorman
SUSE Labs
On 08/04/14 23:58, Christoph Lameter wrote:
> The reason that zone reclaim is on by default is that off node accesses
> are a big performance hit on large scale NUMA systems (like ScaleMP and
> SGI). Zone reclaim was written *because* those system experienced severe
> performance degradation.
>
> On the tightly coupled 4 and 8 node systems there does not seem to
> be a benefit from what I hear.
In principle, is this difference in distance something the kernel
could measure?
--
Cheers,
Jeremy
On Mon 07-04-14 23:34:26, Mel Gorman wrote:
> When it was introduced, zone_reclaim_mode made sense as NUMA distances
> punished and workloads were generally partitioned to fit into a NUMA
> node. NUMA machines are now common but few of the workloads are NUMA-aware
> and it's routine to see major performance due to zone_reclaim_mode being
> disabled but relatively few can identify the problem.
>
> Those that require zone_reclaim_mode are likely to be able to detect when
> it needs to be enabled and tune appropriately so lets have a sensible
> default for the bulk of users.
>
> Documentation/sysctl/vm.txt | 17 +++++++++--------
> include/linux/mmzone.h | 1 -
> mm/page_alloc.c | 17 +----------------
> 3 files changed, 10 insertions(+), 25 deletions(-)
Auto-enabling caused so many reports in the past that it is definitely
much better to not be clever and let admins enable zone_reclaim where it
is appropriate instead.
For both patches.
Acked-by: Michal Hocko <[email protected]>
--
Michal Hocko
SUSE Labs
On Fri, 18 Apr 2014, Michal Hocko wrote:
> Auto-enabling caused so many reports in the past that it is definitely
> much better to not be clever and let admins enable zone_reclaim where it
> is appropriate instead.
>
> For both patches.
> Acked-by: Michal Hocko <[email protected]>
I did not get any objections from SGI either.
Reviewed-by: Christoph Lameter <[email protected]>