Muchun Song found out there could be a race between list_lru_add()
and memcg_reparent_list_lru_node() causing the later function to miss
reparenting of a lru entry as shown below:
CPU0: CPU1:
list_lru_add()
spin_lock(&nlru->lock)
l = list_lru_from_kmem(memcg)
memcg_reparent_objcgs(memcg)
memcg_reparent_list_lrus(memcg)
memcg_reparent_list_lru()
memcg_reparent_list_lru_node()
if (!READ_ONCE(nlru->nr_items))
// Miss reparenting
return
// Assume 0->1
l->nr_items++
// Assume 0->1
nlru->nr_items++
Though it is not likely that a list_lru_node that has 0 item suddenly
has a newly added lru entry at the end of its life. The race is still
theoretically possible.
With the lock/unlock pair used within the percpu_ref_kill() which is
the last function call of memcg_reparent_objcgs(), any read issued
in memcg_reparent_list_lru_node() will not be reordered before the
reparenting of objcgs.
Adding a !spin_is_locked()/smp_rmb()/!READ_ONCE(nlru->nr_items) check
to ensure that either the reading of nr_items is valid or the racing
list_lru_add() will see the reparented objcg.
Fixes: 405cc51fc104 ("mm/list_lru: optimize memcg_reparent_list_lru_node()")
Reported-by: Muchun Song <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
mm/list_lru.c | 31 +++++++++++++++++++++++++++----
1 file changed, 27 insertions(+), 4 deletions(-)
diff --git a/mm/list_lru.c b/mm/list_lru.c
index c669d87001a6..08ff54ffabd6 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -395,10 +395,33 @@ static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
struct list_lru_one *src, *dst;
/*
- * If there is no lru entry in this nlru, we can skip it immediately.
+ * With the lock/unlock pair used within the percpu_ref_kill()
+ * which is the last function call of memcg_reparent_objcgs(), any
+ * read issued here will not be reordered before the reparenting
+ * of objcgs.
+ *
+ * Assuming a racing list_lru_add():
+ * list_lru_add()
+ * <- memcg_reparent_list_lru_node()
+ * spin_lock(&nlru->lock)
+ * l = list_lru_from_kmem(memcg)
+ * nlru->nr_items++
+ * spin_unlock(&nlru->lock)
+ * <- memcg_reparent_list_lru_node()
+ *
+ * The !spin_is_locked(&nlru->lock) check is true means it is
+ * either before the spin_lock() or after the spin_unlock(). In the
+ * former case, list_lru_add() will see the reparented objcg and so
+ * won't touch the lru to be reparented. In the later case, it will
+ * see the updated nr_items. So we can use the optimization that if
+ * there is no lru entry in this nlru, skip it immediately.
*/
- if (!READ_ONCE(nlru->nr_items))
- return;
+ if (!spin_is_locked(&nlru->lock)) {
+ /* nr_items read must be ordered after nlru->lock */
+ smp_rmb();
+ if (!READ_ONCE(nlru->nr_items))
+ return;
+ }
/*
* Since list_lru_{add,del} may be called under an IRQ-safe lock,
@@ -407,7 +430,7 @@ static void memcg_reparent_list_lru_node(struct list_lru *lru, int nid,
spin_lock_irq(&nlru->lock);
src = list_lru_from_memcg_idx(lru, nid, src_idx);
- if (!src)
+ if (!src || !src->nr_items)
goto out;
dst = list_lru_from_memcg_idx(lru, nid, dst_idx);
--
2.27.0
On Thu 31-03-22 06:39:56, Shakeel Butt wrote:
> On Wed, Mar 30, 2022 at 07:48:45PM -0700, Roman Gushchin wrote:
> >
> >
> [...]
> >
> >
> > But honestly, I’d drop the original optimization together with
> > the fix, if only there is no _real world_ data on the problem and
> > the improvement. It seems like it has started as a nice simple
> > improvement, but the race makes it complex and probably not worth
> > the added complexity and fragility.
>
> I agree with dropping the original optimization as it is not really
> fixing an observed issue which may justify adding some complexity.
Completely agreed. The patch as it is proposed is not really acceptable
IMHO and I have to say I am worried that this is not the first time we
are in a situation when a follow up fixes or unrelated patches are
growing in complexity to fit on top of a performance optimizations which
do not refer to any actual numbers.
--
Michal Hocko
SUSE Labs
On Fri, Apr 1, 2022 at 9:11 AM Andrew Morton <[email protected]> wrote:
>
> On Thu, 31 Mar 2022 09:46:52 +0200 Michal Hocko <[email protected]> wrote:
>
> > On Thu 31-03-22 06:39:56, Shakeel Butt wrote:
> > > On Wed, Mar 30, 2022 at 07:48:45PM -0700, Roman Gushchin wrote:
> > > >
> > > >
> > > [...]
> > > >
> > > >
> > > > But honestly, I’d drop the original optimization together with
> > > > the fix, if only there is no _real world_ data on the problem and
> > > > the improvement. It seems like it has started as a nice simple
> > > > improvement, but the race makes it complex and probably not worth
> > > > the added complexity and fragility.
> > >
> > > I agree with dropping the original optimization as it is not really
> > > fixing an observed issue which may justify adding some complexity.
> >
> > Completely agreed. The patch as it is proposed is not really acceptable
> > IMHO and I have to say I am worried that this is not the first time we
> > are in a situation when a follow up fixes or unrelated patches are
> > growing in complexity to fit on top of a performance optimizations which
> > do not refer to any actual numbers.
>
> Yup. I did this:
>
> From: Andrew Morton <[email protected]>
> Subject: mm/list_lru.c: revert "mm/list_lru: optimize memcg_reparent_list_lru_node()"
>
> 405cc51fc1049c73 ("mm/list_lru: optimize memcg_reparent_list_lru_node()")
> has subtle races which are proving ugly to fix. Revert the original
> optimization. If quantitative testing indicates that we have a
> significant problem here then other implementations can be looked at.
>
> Fixes: 405cc51fc1049c73 ("mm/list_lru: optimize memcg_reparent_list_lru_node()")
> Cc: Waiman Long <[email protected]>
> Cc: Roman Gushchin <[email protected]>
> Cc: Muchun Song <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Shakeel Butt <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
Reviewed-by: Muchun Song <[email protected]>
On Thu, Mar 31, 2022 at 6:11 PM Andrew Morton <[email protected]> wrote:
>
> On Thu, 31 Mar 2022 09:46:52 +0200 Michal Hocko <[email protected]> wrote:
>
> > On Thu 31-03-22 06:39:56, Shakeel Butt wrote:
> > > On Wed, Mar 30, 2022 at 07:48:45PM -0700, Roman Gushchin wrote:
> > > >
> > > >
> > > [...]
> > > >
> > > >
> > > > But honestly, I’d drop the original optimization together with
> > > > the fix, if only there is no _real world_ data on the problem and
> > > > the improvement. It seems like it has started as a nice simple
> > > > improvement, but the race makes it complex and probably not worth
> > > > the added complexity and fragility.
> > >
> > > I agree with dropping the original optimization as it is not really
> > > fixing an observed issue which may justify adding some complexity.
> >
> > Completely agreed. The patch as it is proposed is not really acceptable
> > IMHO and I have to say I am worried that this is not the first time we
> > are in a situation when a follow up fixes or unrelated patches are
> > growing in complexity to fit on top of a performance optimizations which
> > do not refer to any actual numbers.
>
> Yup. I did this:
>
> From: Andrew Morton <[email protected]>
> Subject: mm/list_lru.c: revert "mm/list_lru: optimize memcg_reparent_list_lru_node()"
>
> 405cc51fc1049c73 ("mm/list_lru: optimize memcg_reparent_list_lru_node()")
> has subtle races which are proving ugly to fix. Revert the original
> optimization. If quantitative testing indicates that we have a
> significant problem here then other implementations can be looked at.
>
> Fixes: 405cc51fc1049c73 ("mm/list_lru: optimize memcg_reparent_list_lru_node()")
> Cc: Waiman Long <[email protected]>
> Cc: Roman Gushchin <[email protected]>
> Cc: Muchun Song <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Shakeel Butt <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
On Thu, 31 Mar 2022 09:46:52 +0200 Michal Hocko <[email protected]> wrote:
> On Thu 31-03-22 06:39:56, Shakeel Butt wrote:
> > On Wed, Mar 30, 2022 at 07:48:45PM -0700, Roman Gushchin wrote:
> > >
> > >
> > [...]
> > >
> > >
> > > But honestly, I’d drop the original optimization together with
> > > the fix, if only there is no _real world_ data on the problem and
> > > the improvement. It seems like it has started as a nice simple
> > > improvement, but the race makes it complex and probably not worth
> > > the added complexity and fragility.
> >
> > I agree with dropping the original optimization as it is not really
> > fixing an observed issue which may justify adding some complexity.
>
> Completely agreed. The patch as it is proposed is not really acceptable
> IMHO and I have to say I am worried that this is not the first time we
> are in a situation when a follow up fixes or unrelated patches are
> growing in complexity to fit on top of a performance optimizations which
> do not refer to any actual numbers.
Yup. I did this:
From: Andrew Morton <[email protected]>
Subject: mm/list_lru.c: revert "mm/list_lru: optimize memcg_reparent_list_lru_node()"
405cc51fc1049c73 ("mm/list_lru: optimize memcg_reparent_list_lru_node()")
has subtle races which are proving ugly to fix. Revert the original
optimization. If quantitative testing indicates that we have a
significant problem here then other implementations can be looked at.
Fixes: 405cc51fc1049c73 ("mm/list_lru: optimize memcg_reparent_list_lru_node()")
Cc: Waiman Long <[email protected]>
Cc: Roman Gushchin <[email protected]>
Cc: Muchun Song <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Shakeel Butt <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
mm/list_lru.c | 6 ------
1 file changed, 6 deletions(-)
--- a/mm/list_lru.c~revert-1
+++ a/mm/list_lru.c
@@ -395,12 +395,6 @@ static void memcg_reparent_list_lru_node
struct list_lru_one *src, *dst;
/*
- * If there is no lru entry in this nlru, we can skip it immediately.
- */
- if (!READ_ONCE(nlru->nr_items))
- return;
-
- /*
* Since list_lru_{add,del} may be called under an IRQ-safe lock,
* we have to use IRQ-safe primitives here to avoid deadlock.
*/
_
On Thu 31-03-22 18:11:26, Andrew Morton wrote:
[...]
> Yup. I did this:
>
> From: Andrew Morton <[email protected]>
> Subject: mm/list_lru.c: revert "mm/list_lru: optimize memcg_reparent_list_lru_node()"
>
> 405cc51fc1049c73 ("mm/list_lru: optimize memcg_reparent_list_lru_node()")
> has subtle races which are proving ugly to fix. Revert the original
> optimization. If quantitative testing indicates that we have a
> significant problem here then other implementations can be looked at.
>
> Fixes: 405cc51fc1049c73 ("mm/list_lru: optimize memcg_reparent_list_lru_node()")
> Cc: Waiman Long <[email protected]>
> Cc: Roman Gushchin <[email protected]>
> Cc: Muchun Song <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Shakeel Butt <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Thanks!
> ---
>
> mm/list_lru.c | 6 ------
> 1 file changed, 6 deletions(-)
>
> --- a/mm/list_lru.c~revert-1
> +++ a/mm/list_lru.c
> @@ -395,12 +395,6 @@ static void memcg_reparent_list_lru_node
> struct list_lru_one *src, *dst;
>
> /*
> - * If there is no lru entry in this nlru, we can skip it immediately.
> - */
> - if (!READ_ONCE(nlru->nr_items))
> - return;
> -
> - /*
> * Since list_lru_{add,del} may be called under an IRQ-safe lock,
> * we have to use IRQ-safe primitives here to avoid deadlock.
> */
> _
--
Michal Hocko
SUSE Labs