This patch is an incremental improvement on the existing
memory.{low,min} relative reclaim work to base its scan pressure
calculations on how much protection is available compared to the current
usage, rather than how much the current usage is over some protection
threshold.
Previously the way that memory.low protection works is that if you are
50% over a certain baseline, you get 50% of your normal scan pressure.
This is certainly better than the previous cliff-edge behaviour, but it
can be improved even further by always considering memory under the
currently enforced protection threshold to be out of bounds. This means
that we can set relatively low memory.low thresholds for variable or
bursty workloads while still getting a reasonable level of protection,
whereas with the previous version we may still trivially hit the 100%
clamp. The previous 100% clamp is also somewhat arbitrary, whereas this
one is more concretely based on the currently enforced protection
threshold, which is likely easier to reason about.
There is also a subtle issue with the way that proportional reclaim
worked previously -- it promotes having no memory.low, since it makes
pressure higher during low reclaim. This happens because we base our
scan pressure modulation on how far memory.current is between memory.min
and memory.low, but if memory.low is unset, we only use the overage
method. In most cromulent configurations, this then means that we end up
with *more* pressure than with no memory.low at all when we're in low
reclaim, which is not really very usable or expected.
With this patch, memory.low and memory.min affect reclaim pressure in a
more understandable and composable way. For example, from a user
standpoint, "protected" memory now remains untouchable from a reclaim
aggression standpoint, and users can also have more confidence that
bursty workloads will still receive some amount of guaranteed
protection.
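For illustration only -- this isn't part of the patch, and all the numbers are
made up -- here is a small standalone program comparing the two scan target
calculations using the formulas from the diff below (only the overage branch of
the old code is shown; SWAP_CLUSTER_MAX matches the kernel's value of 32,
everything else is hypothetical):

	#include <stdio.h>

	#define SWAP_CLUSTER_MAX 32UL

	static unsigned long max_ul(unsigned long a, unsigned long b)
	{
		return a > b ? a : b;
	}

	static unsigned long clamp_ul(unsigned long v, unsigned long lo,
				      unsigned long hi)
	{
		return v < lo ? lo : v > hi ? hi : v;
	}

	int main(void)
	{
		unsigned long lruvec_size = 10000;	/* pages on one LRU list */
		unsigned long protection = 5000;	/* pages, e.g. memory.low */
		unsigned long usage;

		for (usage = 6000; usage <= 20000; usage += 2000) {
			/* Old behaviour: scale by overage over protection, then clamp. */
			unsigned long old_scan = lruvec_size * usage / protection -
						 lruvec_size;
			old_scan = clamp_ul(old_scan, SWAP_CLUSTER_MAX, lruvec_size);

			/* New behaviour: the protected share of usage is off limits. */
			unsigned long size = max_ul(usage, protection);
			unsigned long new_scan = lruvec_size -
						 lruvec_size * protection / size;
			new_scan = max_ul(new_scan, SWAP_CLUSTER_MAX);

			printf("usage %6lu: old scan %6lu, new scan %6lu\n",
			       usage, old_scan, new_scan);
		}
		return 0;
	}

Note how the old calculation hits the 100% clamp as soon as usage reaches twice
the protection, while the new one keeps a gentle slope no matter how far usage
grows.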
Signed-off-by: Chris Down <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/memcontrol.h | 25 ++++++++--------
mm/vmscan.c | 61 +++++++++++++-------------------------
2 files changed, 32 insertions(+), 54 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 534267947664..2799008c1f88 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -333,17 +333,17 @@ static inline bool mem_cgroup_disabled(void)
return !cgroup_subsys_enabled(memory_cgrp_subsys);
}
-static inline void mem_cgroup_protection(struct mem_cgroup *memcg,
- unsigned long *min, unsigned long *low)
+static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
+ bool in_low_reclaim)
{
- if (mem_cgroup_disabled()) {
- *min = 0;
- *low = 0;
- return;
- }
+ if (mem_cgroup_disabled())
+ return 0;
+
+ if (in_low_reclaim)
+ return READ_ONCE(memcg->memory.emin);
- *min = READ_ONCE(memcg->memory.emin);
- *low = READ_ONCE(memcg->memory.elow);
+ return max(READ_ONCE(memcg->memory.emin),
+ READ_ONCE(memcg->memory.elow));
}
enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
@@ -845,11 +845,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
{
}
-static inline void mem_cgroup_protection(struct mem_cgroup *memcg,
- unsigned long *min, unsigned long *low)
+static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
+ bool in_low_reclaim)
{
- *min = 0;
- *low = 0;
+ return 0;
}
static inline enum mem_cgroup_protection mem_cgroup_protected(
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ac4806f0f332..920a9c3ee792 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2414,12 +2414,13 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
int file = is_file_lru(lru);
unsigned long lruvec_size;
unsigned long scan;
- unsigned long min, low;
+ unsigned long protection;
lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
- mem_cgroup_protection(memcg, &min, &low);
+ protection = mem_cgroup_protection(memcg,
+ sc->memcg_low_reclaim);
- if (min || low) {
+ if (protection) {
/*
* Scale a cgroup's reclaim pressure by proportioning
* its current usage to its memory.low or memory.min
@@ -2432,13 +2433,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
* setting extremely liberal protection thresholds. It
* also means we simply get no protection at all if we
* set it too low, which is not ideal.
- */
- unsigned long cgroup_size = mem_cgroup_size(memcg);
-
- /*
- * If there is any protection in place, we adjust scan
- * pressure in proportion to how much a group's current
- * usage exceeds that, in percent.
+ *
+ * If there is any protection in place, we reduce scan
+ * pressure by how much of the total memory used is
+ * within protection thresholds.
*
* There is one special case: in the first reclaim pass,
* we skip over all groups that are within their low
@@ -2448,43 +2446,24 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
* ideally want to honor how well-behaved groups are in
* that case instead of simply punishing them all
* equally. As such, we reclaim them based on how much
- * of their best-effort protection they are using. Usage
- * below memory.min is excluded from consideration when
- * calculating utilisation, as it isn't ever
- * reclaimable, so it might as well not exist for our
- * purposes.
+ * memory they are using, reducing the scan pressure
+ * again by how much of the total memory used is under
+ * hard protection.
*/
- if (sc->memcg_low_reclaim && low > min) {
- /*
- * Reclaim according to utilisation between min
- * and low
- */
- scan = lruvec_size * (cgroup_size - min) /
- (low - min);
- } else {
- /* Reclaim according to protection overage */
- scan = lruvec_size * cgroup_size /
- max(min, low) - lruvec_size;
- }
+ unsigned long cgroup_size = mem_cgroup_size(memcg);
+
+ /* Avoid TOCTOU with earlier protection check */
+ cgroup_size = max(cgroup_size, protection);
+
+ scan = lruvec_size - lruvec_size * protection /
+ cgroup_size;
/*
- * Don't allow the scan target to exceed the lruvec
- * size, which otherwise could happen if we have >200%
- * overage in the normal case, or >100% overage when
- * sc->memcg_low_reclaim is set.
- *
- * This is important because other cgroups without
- * memory.low have their scan target initially set to
- * their lruvec size, so allowing values >100% of the
- * lruvec size here could result in penalising cgroups
- * with memory.low set even *more* than their peers in
- * some cases in the case of large overages.
- *
- * Also, minimally target SWAP_CLUSTER_MAX pages to keep
+ * Minimally target SWAP_CLUSTER_MAX pages to keep
* reclaim moving forwards, avoiding decremeting
* sc->priority further than desirable.
*/
- scan = clamp(scan, SWAP_CLUSTER_MAX, lruvec_size);
+ scan = max(scan, SWAP_CLUSTER_MAX);
} else {
scan = lruvec_size;
}
--
2.21.0
On Thu, Feb 28, 2019 at 09:30:50PM +0000, Chris Down wrote:
> This patch is an incremental improvement on the existing
> memory.{low,min} relative reclaim work to base its scan pressure
> calculations on how much protection is available compared to the current
> usage, rather than how much the current usage is over some protection
> threshold.
>
> Previously the way that memory.low protection works is that if you are
> 50% over a certain baseline, you get 50% of your normal scan pressure.
> This is certainly better than the previous cliff-edge behaviour, but it
> can be improved even further by always considering memory under the
> currently enforced protection threshold to be out of bounds. This means
> that we can set relatively low memory.low thresholds for variable or
> bursty workloads while still getting a reasonable level of protection,
> whereas with the previous version we may still trivially hit the 100%
> clamp. The previous 100% clamp is also somewhat arbitrary, whereas this
> one is more concretely based on the currently enforced protection
> threshold, which is likely easier to reason about.
>
> There is also a subtle issue with the way that proportional reclaim
> worked previously -- it promotes having no memory.low, since it makes
> pressure higher during low reclaim. This happens because we base our
> scan pressure modulation on how far memory.current is between memory.min
> and memory.low, but if memory.low is unset, we only use the overage
> method. In most cromulent configurations, this then means that we end up
> with *more* pressure than with no memory.low at all when we're in low
> reclaim, which is not really very usable or expected.
>
> With this patch, memory.low and memory.min affect reclaim pressure in a
> more understandable and composable way. For example, from a user
> standpoint, "protected" memory now remains untouchable from a reclaim
> aggression standpoint, and users can also have more confidence that
> bursty workloads will still receive some amount of guaranteed
> protection.
Looks good to me: the overall logic is fine, and codewise it's so much
cleaner than the previous version.
Reviewed-by: Roman Gushchin <[email protected]>
Thanks, Chris!
This patch is an incremental improvement on the existing
memory.{low,min} relative reclaim work to base its scan pressure
calculations on how much protection is available compared to the current
usage, rather than how much the current usage is over some protection
threshold.
Previously the way that memory.low protection works is that if you are
50% over a certain baseline, you get 50% of your normal scan pressure.
This is certainly better than the previous cliff-edge behaviour, but it
can be improved even further by always considering memory under the
currently enforced protection threshold to be out of bounds. This means
that we can set relatively low memory.low thresholds for variable or
bursty workloads while still getting a reasonable level of protection,
whereas with the previous version we may still trivially hit the 100%
clamp. The previous 100% clamp is also somewhat arbitrary, whereas this
one is more concretely based on the currently enforced protection
threshold, which is likely easier to reason about.
There is also a subtle issue with the way that proportional reclaim
worked previously -- it promotes having no memory.low, since it makes
pressure higher during low reclaim. This happens because we base our
scan pressure modulation on how far memory.current is between memory.min
and memory.low, but if memory.low is unset, we only use the overage
method. In most cromulent configurations, this then means that we end up
with *more* pressure than with no memory.low at all when we're in low
reclaim, which is not really very usable or expected.
With this patch, memory.low and memory.min affect reclaim pressure in a
more understandable and composable way. For example, from a user
standpoint, "protected" memory now remains untouchable from a reclaim
aggression standpoint, and users can also have more confidence that
bursty workloads will still receive some amount of guaranteed
protection.
Signed-off-by: Chris Down <[email protected]>
Reviewed-by: Roman Gushchin <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Dennis Zhou <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
include/linux/memcontrol.h | 25 ++++++++--------
mm/vmscan.c | 61 +++++++++++++-------------------------
2 files changed, 32 insertions(+), 54 deletions(-)
No functional changes, just rebased.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b226c4bafc93..799de23edfb7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -333,17 +333,17 @@ static inline bool mem_cgroup_disabled(void)
return !cgroup_subsys_enabled(memory_cgrp_subsys);
}
-static inline void mem_cgroup_protection(struct mem_cgroup *memcg,
- unsigned long *min, unsigned long *low)
+static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
+ bool in_low_reclaim)
{
- if (mem_cgroup_disabled()) {
- *min = 0;
- *low = 0;
- return;
- }
+ if (mem_cgroup_disabled())
+ return 0;
+
+ if (in_low_reclaim)
+ return READ_ONCE(memcg->memory.emin);
- *min = READ_ONCE(memcg->memory.emin);
- *low = READ_ONCE(memcg->memory.elow);
+ return max(READ_ONCE(memcg->memory.emin),
+ READ_ONCE(memcg->memory.elow));
}
enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
@@ -845,11 +845,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
{
}
-static inline void mem_cgroup_protection(struct mem_cgroup *memcg,
- unsigned long *min, unsigned long *low)
+static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
+ bool in_low_reclaim)
{
- *min = 0;
- *low = 0;
+ return 0;
}
static inline enum mem_cgroup_protection mem_cgroup_protected(
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f6b9b45f731d..d5daa224364d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2374,12 +2374,13 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
int file = is_file_lru(lru);
unsigned long lruvec_size;
unsigned long scan;
- unsigned long min, low;
+ unsigned long protection;
lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
- mem_cgroup_protection(memcg, &min, &low);
+ protection = mem_cgroup_protection(memcg,
+ sc->memcg_low_reclaim);
- if (min || low) {
+ if (protection) {
/*
* Scale a cgroup's reclaim pressure by proportioning
* its current usage to its memory.low or memory.min
@@ -2392,13 +2393,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
* setting extremely liberal protection thresholds. It
* also means we simply get no protection at all if we
* set it too low, which is not ideal.
- */
- unsigned long cgroup_size = mem_cgroup_size(memcg);
-
- /*
- * If there is any protection in place, we adjust scan
- * pressure in proportion to how much a group's current
- * usage exceeds that, in percent.
+ *
+ * If there is any protection in place, we reduce scan
+ * pressure by how much of the total memory used is
+ * within protection thresholds.
*
* There is one special case: in the first reclaim pass,
* we skip over all groups that are within their low
@@ -2408,43 +2406,24 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
* ideally want to honor how well-behaved groups are in
* that case instead of simply punishing them all
* equally. As such, we reclaim them based on how much
- * of their best-effort protection they are using. Usage
- * below memory.min is excluded from consideration when
- * calculating utilisation, as it isn't ever
- * reclaimable, so it might as well not exist for our
- * purposes.
+ * memory they are using, reducing the scan pressure
+ * again by how much of the total memory used is under
+ * hard protection.
*/
- if (sc->memcg_low_reclaim && low > min) {
- /*
- * Reclaim according to utilisation between min
- * and low
- */
- scan = lruvec_size * (cgroup_size - min) /
- (low - min);
- } else {
- /* Reclaim according to protection overage */
- scan = lruvec_size * cgroup_size /
- max(min, low) - lruvec_size;
- }
+ unsigned long cgroup_size = mem_cgroup_size(memcg);
+
+ /* Avoid TOCTOU with earlier protection check */
+ cgroup_size = max(cgroup_size, protection);
+
+ scan = lruvec_size - lruvec_size * protection /
+ cgroup_size;
/*
- * Don't allow the scan target to exceed the lruvec
- * size, which otherwise could happen if we have >200%
- * overage in the normal case, or >100% overage when
- * sc->memcg_low_reclaim is set.
- *
- * This is important because other cgroups without
- * memory.low have their scan target initially set to
- * their lruvec size, so allowing values >100% of the
- * lruvec size here could result in penalising cgroups
- * with memory.low set even *more* than their peers in
- * some cases in the case of large overages.
- *
- * Also, minimally target SWAP_CLUSTER_MAX pages to keep
+ * Minimally target SWAP_CLUSTER_MAX pages to keep
* reclaim moving forwards, avoiding decremeting
* sc->priority further than desirable.
*/
- scan = clamp(scan, SWAP_CLUSTER_MAX, lruvec_size);
+ scan = max(scan, SWAP_CLUSTER_MAX);
} else {
scan = lruvec_size;
}
--
2.21.0
On Fri, 22 Mar 2019 16:03:07 +0000 Chris Down <[email protected]> wrote:
> This patch is an incremental improvement on the existing
> memory.{low,min} relative reclaim work to base its scan pressure
> calculations on how much protection is available compared to the current
> usage, rather than how much the current usage is over some protection
> threshold.
>
> Previously the way that memory.low protection works is that if you are
> 50% over a certain baseline, you get 50% of your normal scan pressure.
> This is certainly better than the previous cliff-edge behaviour, but it
> can be improved even further by always considering memory under the
> currently enforced protection threshold to be out of bounds. This means
> that we can set relatively low memory.low thresholds for variable or
> bursty workloads while still getting a reasonable level of protection,
> whereas with the previous version we may still trivially hit the 100%
> clamp. The previous 100% clamp is also somewhat arbitrary, whereas this
> one is more concretely based on the currently enforced protection
> threshold, which is likely easier to reason about.
>
> There is also a subtle issue with the way that proportional reclaim
> worked previously -- it promotes having no memory.low, since it makes
> pressure higher during low reclaim. This happens because we base our
> scan pressure modulation on how far memory.current is between memory.min
> and memory.low, but if memory.low is unset, we only use the overage
> method. In most cromulent configurations, this then means that we end up
> with *more* pressure than with no memory.low at all when we're in low
> reclaim, which is not really very usable or expected.
>
> With this patch, memory.low and memory.min affect reclaim pressure in a
> more understandable and composable way. For example, from a user
> standpoint, "protected" memory now remains untouchable from a reclaim
> aggression standpoint, and users can also have more confidence that
> bursty workloads will still receive some amount of guaranteed
> protection.
Could you please provide more description of the effect this has upon
userspace? Preferably in real-world cases. What problems were being
observed and how does this improve things?
Andrew Morton writes:
>Could you please provide more description of the effect this has upon
>userspace? Preferably in real-world cases. What problems were being
>observed and how does this improve things?
Sure! The previous patch's behaviour isn't so much problematic as it is just
not as featureful as it could be.
In the normal case, this change doesn't alter the user experience much. One
benefit is that it replaces the (somewhat arbitrary) 100% cutoff with an
indefinite slope, which makes it easier to ballpark a memory.low value.
As well as this, the old methodology doesn't quite apply generically to
machines with varying amounts of physical memory. Let's say we have a top level
cgroup, workload.slice, and another top level cgroup, system-management.slice.
We want to roughly give 12G to system-management.slice, so on a 32GB machine we
set memory.low to 20GB in workload.slice, and on a 64GB machine we set
memory.low to 52GB. However, these thresholds are relative to the total machine
size, while the amount of memory we are generally willing to yield to
system-management.slice is absolute (12G). As a result, we end up putting more
pressure on system-management.slice just because we have a larger machine and a
larger workload to fill it, which seems fairly unintuitive. With this new
behaviour, we don't end up with this unintended side effect.
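As a rough back-of-the-envelope illustration of that (treating workload.slice's
LRU size as roughly equal to its usage, and ignoring clamping and per-LRU
splits): if workload.slice has grown to fill most of the machine, the old
calculation scans on the order of usage * (usage - low) / low pages, which for
the same absolute 12G overage is about 32 * 12/20 = 19G worth of candidates on
the 32GB machine but only about 64 * 12/52 = 15G on the 64GB one -- so the
bigger the machine, the more gently workload.slice is treated and the more of
the reclaim burden lands on system-management.slice. The new calculation scans
on the order of usage - protection, i.e. roughly the same 12G of unprotected
memory in both cases, so the split no longer depends on machine size.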
On Fri, Mar 22, 2019 at 04:03:07PM +0000, Chris Down wrote:
> This patch is an incremental improvement on the existing
> memory.{low,min} relative reclaim work to base its scan pressure
> calculations on how much protection is available compared to the current
> usage, rather than how much the current usage is over some protection
> threshold.
>
> Previously the way that memory.low protection works is that if you are
> 50% over a certain baseline, you get 50% of your normal scan pressure.
> This is certainly better than the previous cliff-edge behaviour, but it
> can be improved even further by always considering memory under the
> currently enforced protection threshold to be out of bounds. This means
> that we can set relatively low memory.low thresholds for variable or
> bursty workloads while still getting a reasonable level of protection,
> whereas with the previous version we may still trivially hit the 100%
> clamp. The previous 100% clamp is also somewhat arbitrary, whereas this
> one is more concretely based on the currently enforced protection
> threshold, which is likely easier to reason about.
>
> There is also a subtle issue with the way that proportional reclaim
> worked previously -- it promotes having no memory.low, since it makes
> pressure higher during low reclaim. This happens because we base our
> scan pressure modulation on how far memory.current is between memory.min
> and memory.low, but if memory.low is unset, we only use the overage
> method. In most cromulent configurations, this then means that we end up
> with *more* pressure than with no memory.low at all when we're in low
> reclaim, which is not really very usable or expected.
>
> With this patch, memory.low and memory.min affect reclaim pressure in a
> more understandable and composable way. For example, from a user
> standpoint, "protected" memory now remains untouchable from a reclaim
> aggression standpoint, and users can also have more confidence that
> bursty workloads will still receive some amount of guaranteed
> protection.
>
> Signed-off-by: Chris Down <[email protected]>
> Reviewed-by: Roman Gushchin <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Roman Gushchin <[email protected]>
> Cc: Dennis Zhou <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> ---
> include/linux/memcontrol.h | 25 ++++++++--------
> mm/vmscan.c | 61 +++++++++++++-------------------------
> 2 files changed, 32 insertions(+), 54 deletions(-)
>
> No functional changes, just rebased.
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index b226c4bafc93..799de23edfb7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -333,17 +333,17 @@ static inline bool mem_cgroup_disabled(void)
> return !cgroup_subsys_enabled(memory_cgrp_subsys);
> }
>
> -static inline void mem_cgroup_protection(struct mem_cgroup *memcg,
> - unsigned long *min, unsigned long *low)
> +static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
> + bool in_low_reclaim)
> {
> - if (mem_cgroup_disabled()) {
> - *min = 0;
> - *low = 0;
> - return;
> - }
> + if (mem_cgroup_disabled())
> + return 0;
> +
> + if (in_low_reclaim)
> + return READ_ONCE(memcg->memory.emin);
>
> - *min = READ_ONCE(memcg->memory.emin);
> - *low = READ_ONCE(memcg->memory.elow);
> + return max(READ_ONCE(memcg->memory.emin),
> + READ_ONCE(memcg->memory.elow));
> }
>
> enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> @@ -845,11 +845,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> {
> }
>
> -static inline void mem_cgroup_protection(struct mem_cgroup *memcg,
> - unsigned long *min, unsigned long *low)
> +static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
> + bool in_low_reclaim)
> {
> - *min = 0;
> - *low = 0;
> + return 0;
> }
>
> static inline enum mem_cgroup_protection mem_cgroup_protected(
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f6b9b45f731d..d5daa224364d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2374,12 +2374,13 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> int file = is_file_lru(lru);
> unsigned long lruvec_size;
> unsigned long scan;
> - unsigned long min, low;
> + unsigned long protection;
>
> lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
> - mem_cgroup_protection(memcg, &min, &low);
> + protection = mem_cgroup_protection(memcg,
> + sc->memcg_low_reclaim);
>
> - if (min || low) {
> + if (protection) {
> /*
> * Scale a cgroup's reclaim pressure by proportioning
> * its current usage to its memory.low or memory.min
> @@ -2392,13 +2393,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> * setting extremely liberal protection thresholds. It
> * also means we simply get no protection at all if we
> * set it too low, which is not ideal.
> - */
> - unsigned long cgroup_size = mem_cgroup_size(memcg);
> -
> - /*
> - * If there is any protection in place, we adjust scan
> - * pressure in proportion to how much a group's current
> - * usage exceeds that, in percent.
> + *
> + * If there is any protection in place, we reduce scan
> + * pressure by how much of the total memory used is
> + * within protection thresholds.
> *
> * There is one special case: in the first reclaim pass,
> * we skip over all groups that are within their low
> @@ -2408,43 +2406,24 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> * ideally want to honor how well-behaved groups are in
> * that case instead of simply punishing them all
> * equally. As such, we reclaim them based on how much
> - * of their best-effort protection they are using. Usage
> - * below memory.min is excluded from consideration when
> - * calculating utilisation, as it isn't ever
> - * reclaimable, so it might as well not exist for our
> - * purposes.
> + * memory they are using, reducing the scan pressure
> + * again by how much of the total memory used is under
> + * hard protection.
> */
> - if (sc->memcg_low_reclaim && low > min) {
> - /*
> - * Reclaim according to utilisation between min
> - * and low
> - */
> - scan = lruvec_size * (cgroup_size - min) /
> - (low - min);
> - } else {
> - /* Reclaim according to protection overage */
> - scan = lruvec_size * cgroup_size /
> - max(min, low) - lruvec_size;
I've noticed that the old version is just wrong: if cgroup_size is way smaller
than max(min, low), scan will be set to -lruvec_size.
Given that it's unsigned long, we'll end up with scanning the whole list
(due to clamp() below).
So the new commit should be probably squashed into the previous and
generally treated as a fix.
Thanks!
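A tiny userspace demonstration of that, with toy numbers rather than kernel
code: if cgroup_size drops below max(min, low) between the protection check and
this calculation, the old formula wraps around, and the clamp() then turns the
result into the whole lruvec.

	#include <stdio.h>

	#define SWAP_CLUSTER_MAX 32UL

	static unsigned long clamp_ul(unsigned long v, unsigned long lo,
				      unsigned long hi)
	{
		return v < lo ? lo : v > hi ? hi : v;
	}

	int main(void)
	{
		unsigned long lruvec_size = 10000;
		unsigned long protection = 8000;	/* max(emin, elow) */
		unsigned long cgroup_size = 4000;	/* raced below the protection */

		/* Old formula: lruvec_size * cgroup_size / protection - lruvec_size */
		unsigned long scan = lruvec_size * cgroup_size / protection -
				     lruvec_size;

		printf("raw scan:     %lu\n", scan);	/* wraps to a huge value */
		printf("clamped scan: %lu\n",		/* == lruvec_size */
		       clamp_ul(scan, SWAP_CLUSTER_MAX, lruvec_size));
		return 0;
	}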
On Fri, Mar 22, 2019 at 03:29:10PM -0700, Roman Gushchin wrote:
> On Fri, Mar 22, 2019 at 04:03:07PM +0000, Chris Down wrote:
> > This patch is an incremental improvement on the existing
> > memory.{low,min} relative reclaim work to base its scan pressure
> > calculations on how much protection is available compared to the current
> > usage, rather than how much the current usage is over some protection
> > threshold.
> >
> > Previously the way that memory.low protection works is that if you are
> > 50% over a certain baseline, you get 50% of your normal scan pressure.
> > This is certainly better than the previous cliff-edge behaviour, but it
> > can be improved even further by always considering memory under the
> > currently enforced protection threshold to be out of bounds. This means
> > that we can set relatively low memory.low thresholds for variable or
> > bursty workloads while still getting a reasonable level of protection,
> > whereas with the previous version we may still trivially hit the 100%
> > clamp. The previous 100% clamp is also somewhat arbitrary, whereas this
> > one is more concretely based on the currently enforced protection
> > threshold, which is likely easier to reason about.
> >
> > There is also a subtle issue with the way that proportional reclaim
> > worked previously -- it promotes having no memory.low, since it makes
> > pressure higher during low reclaim. This happens because we base our
> > scan pressure modulation on how far memory.current is between memory.min
> > and memory.low, but if memory.low is unset, we only use the overage
> > method. In most cromulent configurations, this then means that we end up
> > with *more* pressure than with no memory.low at all when we're in low
> > reclaim, which is not really very usable or expected.
> >
> > With this patch, memory.low and memory.min affect reclaim pressure in a
> > more understandable and composable way. For example, from a user
> > standpoint, "protected" memory now remains untouchable from a reclaim
> > aggression standpoint, and users can also have more confidence that
> > bursty workloads will still receive some amount of guaranteed
> > protection.
> >
> > Signed-off-by: Chris Down <[email protected]>
> > Reviewed-by: Roman Gushchin <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > Cc: Tejun Heo <[email protected]>
> > Cc: Roman Gushchin <[email protected]>
> > Cc: Dennis Zhou <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > Cc: [email protected]
> > ---
> > include/linux/memcontrol.h | 25 ++++++++--------
> > mm/vmscan.c | 61 +++++++++++++-------------------------
> > 2 files changed, 32 insertions(+), 54 deletions(-)
> >
> > No functional changes, just rebased.
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index b226c4bafc93..799de23edfb7 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -333,17 +333,17 @@ static inline bool mem_cgroup_disabled(void)
> > return !cgroup_subsys_enabled(memory_cgrp_subsys);
> > }
> >
> > -static inline void mem_cgroup_protection(struct mem_cgroup *memcg,
> > - unsigned long *min, unsigned long *low)
> > +static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
> > + bool in_low_reclaim)
> > {
> > - if (mem_cgroup_disabled()) {
> > - *min = 0;
> > - *low = 0;
> > - return;
> > - }
> > + if (mem_cgroup_disabled())
> > + return 0;
> > +
> > + if (in_low_reclaim)
> > + return READ_ONCE(memcg->memory.emin);
> >
> > - *min = READ_ONCE(memcg->memory.emin);
> > - *low = READ_ONCE(memcg->memory.elow);
> > + return max(READ_ONCE(memcg->memory.emin),
> > + READ_ONCE(memcg->memory.elow));
> > }
> >
> > enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
> > @@ -845,11 +845,10 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
> > {
> > }
> >
> > -static inline void mem_cgroup_protection(struct mem_cgroup *memcg,
> > - unsigned long *min, unsigned long *low)
> > +static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
> > + bool in_low_reclaim)
> > {
> > - *min = 0;
> > - *low = 0;
> > + return 0;
> > }
> >
> > static inline enum mem_cgroup_protection mem_cgroup_protected(
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index f6b9b45f731d..d5daa224364d 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2374,12 +2374,13 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> > int file = is_file_lru(lru);
> > unsigned long lruvec_size;
> > unsigned long scan;
> > - unsigned long min, low;
> > + unsigned long protection;
> >
> > lruvec_size = lruvec_lru_size(lruvec, lru, sc->reclaim_idx);
> > - mem_cgroup_protection(memcg, &min, &low);
> > + protection = mem_cgroup_protection(memcg,
> > + sc->memcg_low_reclaim);
> >
> > - if (min || low) {
> > + if (protection) {
> > /*
> > * Scale a cgroup's reclaim pressure by proportioning
> > * its current usage to its memory.low or memory.min
> > @@ -2392,13 +2393,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> > * setting extremely liberal protection thresholds. It
> > * also means we simply get no protection at all if we
> > * set it too low, which is not ideal.
> > - */
> > - unsigned long cgroup_size = mem_cgroup_size(memcg);
> > -
> > - /*
> > - * If there is any protection in place, we adjust scan
> > - * pressure in proportion to how much a group's current
> > - * usage exceeds that, in percent.
> > + *
> > + * If there is any protection in place, we reduce scan
> > + * pressure by how much of the total memory used is
> > + * within protection thresholds.
> > *
> > * There is one special case: in the first reclaim pass,
> > * we skip over all groups that are within their low
> > @@ -2408,43 +2406,24 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
> > * ideally want to honor how well-behaved groups are in
> > * that case instead of simply punishing them all
> > * equally. As such, we reclaim them based on how much
> > - * of their best-effort protection they are using. Usage
> > - * below memory.min is excluded from consideration when
> > - * calculating utilisation, as it isn't ever
> > - * reclaimable, so it might as well not exist for our
> > - * purposes.
> > + * memory they are using, reducing the scan pressure
> > + * again by how much of the total memory used is under
> > + * hard protection.
> > */
> > - if (sc->memcg_low_reclaim && low > min) {
> > - /*
> > - * Reclaim according to utilisation between min
> > - * and low
> > - */
> > - scan = lruvec_size * (cgroup_size - min) /
> > - (low - min);
> > - } else {
> > - /* Reclaim according to protection overage */
> > - scan = lruvec_size * cgroup_size /
> > - max(min, low) - lruvec_size;
>
> I've noticed that the old version is just wrong: if cgroup_size is way smaller
> than max(min, low), scan will be set to -lruvec_size.
> Given that it's unsigned long, we'll end up with scanning the whole list
> (due to clamp() below).
Just to clarify: in most cases it works fine because we skip cgroups with
cgroup_size < max(min, low). So we just don't call the code above.
However, we can race with the emin/elow update and end up with negative scan,
especially if cgroup_size is about the effective protection size
The new version looks much more secure.
Thanks!
Roman Gushchin writes:
>I've noticed that the old version is just wrong: if cgroup_size is way smaller
>than max(min, low), scan will be set to -lruvec_size.
>Given that it's unsigned long, we'll end up with scanning the whole list
>(due to clamp() below).
Are you certain? If so, I don't see what you mean. This is how the code looks
in Linus' tree after the fixups:
unsigned long cgroup_size = mem_cgroup_size(memcg);
unsigned long baseline = 0;
if (!sc->memcg_low_reclaim)
baseline = lruvec_size;
scan = lruvec_size * cgroup_size / protection - baseline;
This works correctly as far as I can tell:
low reclaim case:
In [1]: cgroup_size=50; lruvec_size=10; protection=2000; baseline=0; lruvec_size * cgroup_size // protection - baseline
Out[1]: 0
normal case:
In [2]: cgroup_size=3000; lruvec_size=10; protection=2000; baseline=lruvec_size; lruvec_size * cgroup_size // protection - baseline
Out[2]: 5
Roman Gushchin writes:
>However, we can race with the emin/elow update and end up with negative scan,
>especially if cgroup_size is about the effective protection size
Yeah, it's possible but unlikely, hence the TOCTOU check. :-)
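To spell the guard out with toy numbers (lruvec_size = 10000, protection =
8000, a raced cgroup_size of 4000): without the max(), lruvec_size * protection
/ cgroup_size would be twice the lruvec size, and subtracting that from
lruvec_size wraps around just like the old formula did. With cgroup_size =
max(cgroup_size, protection), that quotient can never exceed lruvec_size, so
the worst case is scan = 0, which the SWAP_CLUSTER_MAX floor then turns into a
token scan rather than the whole list.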
On Fri, Mar 22, 2019 at 10:49:46PM +0000, Chris Down wrote:
> Roman Gushchin writes:
> > I've noticed that the old version is just wrong: if cgroup_size is way smaller
> > than max(min, low), scan will be set to -lruvec_size.
> > Given that it's unsigned long, we'll end up with scanning the whole list
> > (due to clamp() below).
>
> Are you certain? If so, I don't see what you mean. This is how the code
> looks in Linus' tree after the fixups:
>
> unsigned long cgroup_size = mem_cgroup_size(memcg);
> unsigned long baseline = 0;
>
> if (!sc->memcg_low_reclaim)
> baseline = lruvec_size;
> scan = lruvec_size * cgroup_size / protection - baseline;
>
> This works correctly as far as I can tell:
I'm blaming the old version, not the new one.
The new one is perfectly fine, thanks to these lines:
+ /* Avoid TOCTOU with earlier protection check */
+ cgroup_size = max(cgroup_size, protection);
The old one was racy.
Thanks!
Chris Down writes:
>Are you certain? If so, I don't see what you mean. This is how the
>code looks in Linus' tree after the fixups:
Hmm, apparently this actually didn't go into Linus' tree yet, so yeah, it does
seem worth having as a fixup.
[Sorry for a late reply]
On Fri 22-03-19 16:03:07, Chris Down wrote:
[...]
> With this patch, memory.low and memory.min affect reclaim pressure in a
> more understandable and composable way. For example, from a user
> standpoint, "protected" memory now remains untouchable from a reclaim
> aggression standpoint, and users can also have more confidence that
> bursty workloads will still receive some amount of guaranteed
> protection.
Maybe I am missing something so correct me if I am wrong but the new
calculation actually means that we always allow to scan even min
protected memcgs right?
Because ...
[...]
> +static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
> + bool in_low_reclaim)
> {
> - if (mem_cgroup_disabled()) {
> - *min = 0;
> - *low = 0;
> - return;
> - }
> + if (mem_cgroup_disabled())
> + return 0;
> +
> + if (in_low_reclaim)
> + return READ_ONCE(memcg->memory.emin);
>
> - *min = READ_ONCE(memcg->memory.emin);
> - *low = READ_ONCE(memcg->memory.elow);
> + return max(READ_ONCE(memcg->memory.emin),
> + READ_ONCE(memcg->memory.elow));
> }
[...]
> + unsigned long cgroup_size = mem_cgroup_size(memcg);
> +
> + /* Avoid TOCTOU with earlier protection check */
> + cgroup_size = max(cgroup_size, protection);
> +
> + scan = lruvec_size - lruvec_size * protection /
> + cgroup_size;
>
[...]
> - scan = clamp(scan, SWAP_CLUSTER_MAX, lruvec_size);
> + scan = max(scan, SWAP_CLUSTER_MAX);
here the zero or sub SWAP_CLUSTER_MAX scan target gets extended to
SWAP_CLUSTER_MAX. Unless I am missing something this is not correct
because min protection should be a guarantee even in in_low_reclaim
mode.
> } else {
> scan = lruvec_size;
> }
> --
> 2.21.0
--
Michal Hocko
SUSE Labs
Michal Hocko writes:
>Maybe I am missing something so correct me if I am wrong but the new
>calculation actually means that we always allow to scan even min
>protected memcgs right?
We check if the memcg is min protected as a precondition for coming into this
function at all, so this generally isn't possible. See the mem_cgroup_protected
MEMCG_PROT_MIN check in shrink_node.
(Of course, it's possible we race with going within protection thresholds
again, but this patch doesn't make that any better or worse than the previous
situation.)
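For anyone following along, this is roughly the precondition being referred to
-- paraphrased from the shrink_node() memcg iteration of that era, so the exact
details may differ between kernel versions:

	switch (mem_cgroup_protected(root, memcg)) {
	case MEMCG_PROT_MIN:
		/* Hard protection: never reclaim from this memcg. */
		continue;
	case MEMCG_PROT_LOW:
		/*
		 * Soft protection: skip on the first pass, and only
		 * reclaim with sc->memcg_low_reclaim set if nothing
		 * else turned out to be reclaimable.
		 */
		if (!sc->memcg_low_reclaim) {
			sc->memcg_low_skipped = 1;
			continue;
		}
		memcg_memory_event(memcg, MEMCG_LOW);
		break;
	case MEMCG_PROT_NONE:
		break;
	}

So by the time get_scan_count() runs, a memcg still under its memory.min has
already been skipped; the protection value only modulates pressure for memcgs
that are above it (or are being low-reclaimed).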
On Wed 29-05-19 23:44:53, Chris Down wrote:
> Michal Hocko writes:
> > Maybe I am missing something so correct me if I am wrong but the new
> > calculation actually means that we always allow to scan even min
> > protected memcgs right?
>
> We check if the memcg is min protected as a precondition for coming into
> this function at all, so this generally isn't possible. See the
> mem_cgroup_protected MEMCG_PROT_MIN check in shrink_node.
OK, that is the part I was missing, I got confused by checking the min
limit as well here. Thanks for the clarification. A comment would be
handy or do we really need to consider min at all?
> (Of course, it's possible we race with going within protection thresholds
> again, but this patch doesn't make that any better or worse than the
> previous situation.)
Yeah.
With the above clarified, the resulting code is much easier to follow and the
overall logic makes sense to me.
Acked-by: Michal Hocko <[email protected]>
--
Michal Hocko
SUSE Labs
Michal Hocko writes:
>On Wed 29-05-19 23:44:53, Chris Down wrote:
>> Michal Hocko writes:
>> > Maybe I am missing something so correct me if I am wrong but the new
>> > calculation actually means that we always allow to scan even min
>> > protected memcgs right?
>>
>> We check if the memcg is min protected as a precondition for coming into
>> this function at all, so this generally isn't possible. See the
>> mem_cgroup_protected MEMCG_PROT_MIN check in shrink_node.
>
>OK, that is the part I was missing, I got confused by checking the min
>limit as well here. Thanks for the clarification. A comment would be
>handy or do we really need to consider min at all?
You mean as part of the reclaim pressure calculation? Yeah, we still need it,
because we might only set memory.min, but not set memory.low.
>> (Of course, it's possible we race with going within protection thresholds
>> again, but this patch doesn't make that any better or worse than the
>> previous situation.)
>
>Yeah.
>
>With the above clarified, the resulting code is much easier to follow and
>the overall logic makes sense to me.
>
>Acked-by: Michal Hocko <[email protected]>
Thanks for your thorough review! :-)
On Thu 30-05-19 13:52:10, Chris Down wrote:
> Michal Hocko writes:
> > On Wed 29-05-19 23:44:53, Chris Down wrote:
> > > Michal Hocko writes:
> > > > Maybe I am missing something so correct me if I am wrong but the new
> > > > calculation actually means that we always allow to scan even min
> > > > protected memcgs right?
> > >
> > > We check if the memcg is min protected as a precondition for coming into
> > > this function at all, so this generally isn't possible. See the
> > > mem_cgroup_protected MEMCG_PROT_MIN check in shrink_node.
> >
> > OK, that is the part I was missing, I got confused by checking the min
> > limit as well here. Thanks for the clarification. A comment would be
> > handy or do we really need to consider min at all?
>
> You mean as part of the reclaim pressure calculation? Yeah, we still need
> it, because we might only set memory.min, but not set memory.low.
But then the memcg will get excluded as well right?
--
Michal Hocko
SUSE Labs
Michal Hocko writes:
>On Thu 30-05-19 13:52:10, Chris Down wrote:
>> Michal Hocko writes:
>> > On Wed 29-05-19 23:44:53, Chris Down wrote:
>> > > Michal Hocko writes:
>> > > > Maybe I am missing something so correct me if I am wrong but the new
>> > > > calculation actually means that we always allow to scan even min
>> > > > protected memcgs right?
>> > >
>> > > We check if the memcg is min protected as a precondition for coming into
>> > > this function at all, so this generally isn't possible. See the
>> > > mem_cgroup_protected MEMCG_PROT_MIN check in shrink_node.
>> >
>> > OK, that is the part I was missing, I got confused by checking the min
>> > limit as well here. Thanks for the clarification. A comment would be
>> > handy or do we really need to consider min at all?
>>
>> You mean as part of the reclaim pressure calculation? Yeah, we still need
>> it, because we might only set memory.min, but not set memory.low.
>
>But then the memcg will get excluded as well right?
I'm not sure what you mean, could you clarify? :-)
The only thing we use memory.min for in this patch is potentially as the
protection size, which we then use to determine reclaim pressure. We don't use
this information if the cgroup is below memory.min, because you'll never come
in here. This applies when you *do* have memory.min or memory.low set and you
are *exceeding* it (or we are in low reclaim), in which case we want memory.min
(or memory.low, if that's higher) considered as the protection size.
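Concretely, with made-up numbers: memory.min = 1G, memory.low unset, 4G of
usage. The cgroup isn't skipped (it's over its min), but protection = max(emin,
elow) = 1G still feeds into the calculation, so scan becomes lruvec_size -
lruvec_size * 1/4, i.e. roughly 75% of the normal pressure rather than 100%.
That's why min still matters here even when low is never set.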
On Fri, Mar 22, 2019 at 04:03:07PM +0000, Chris Down wrote:
> This patch is an incremental improvement on the existing
> memory.{low,min} relative reclaim work to base its scan pressure
> calculations on how much protection is available compared to the current
> usage, rather than how much the current usage is over some protection
> threshold.
>
> Previously the way that memory.low protection works is that if you are
> 50% over a certain baseline, you get 50% of your normal scan pressure.
> This is certainly better than the previous cliff-edge behaviour, but it
> can be improved even further by always considering memory under the
> currently enforced protection threshold to be out of bounds. This means
> that we can set relatively low memory.low thresholds for variable or
> bursty workloads while still getting a reasonable level of protection,
> whereas with the previous version we may still trivially hit the 100%
> clamp. The previous 100% clamp is also somewhat arbitrary, whereas this
> one is more concretely based on the currently enforced protection
> threshold, which is likely easier to reason about.
>
> There is also a subtle issue with the way that proportional reclaim
> worked previously -- it promotes having no memory.low, since it makes
> pressure higher during low reclaim. This happens because we base our
> scan pressure modulation on how far memory.current is between memory.min
> and memory.low, but if memory.low is unset, we only use the overage
> method. In most cromulent configurations, this then means that we end up
> with *more* pressure than with no memory.low at all when we're in low
> reclaim, which is not really very usable or expected.
>
> With this patch, memory.low and memory.min affect reclaim pressure in a
> more understandable and composable way. For example, from a user
> standpoint, "protected" memory now remains untouchable from a reclaim
> aggression standpoint, and users can also have more confidence that
> bursty workloads will still receive some amount of guaranteed
> protection.
>
> Signed-off-by: Chris Down <[email protected]>
> Reviewed-by: Roman Gushchin <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Roman Gushchin <[email protected]>
> Cc: Dennis Zhou <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
Acked-by: Johannes Weiner <[email protected]>