2013-09-07 05:59:44

by Hugh Dickins

Subject: [PATCH] vmpressure: fix divide-by-0 in vmpressure_work_fn

Hit divide-by-0 in vmpressure_work_fn(): checking vmpr->scanned before
taking the lock is not enough, we must check scanned afterwards too.

Signed-off-by: Hugh Dickins <[email protected]>
Cc: [email protected]
---

mm/vmpressure.c | 3 +++
1 file changed, 3 insertions(+)

--- 3.11/mm/vmpressure.c 2013-09-02 13:46:10.000000000 -0700
+++ linux/mm/vmpressure.c 2013-09-06 22:43:03.596003080 -0700
@@ -187,6 +187,9 @@ static void vmpressure_work_fn(struct wo
 	vmpr->reclaimed = 0;
 	spin_unlock(&vmpr->sr_lock);

+	if (!scanned)
+		return;
+
 	do {
 		if (vmpressure_event(vmpr, scanned, reclaimed))
 			break;


2013-09-08 01:43:36

by David Rientjes

Subject: Re: [PATCH] vmpressure: fix divide-by-0 in vmpressure_work_fn

On Fri, 6 Sep 2013, Hugh Dickins wrote:

> Hit divide-by-0 in vmpressure_work_fn(): checking vmpr->scanned before
> taking the lock is not enough, we must check scanned afterwards too.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: [email protected]

Acked-by: David Rientjes <[email protected]>

2013-09-09 11:08:50

by Michal Hocko

Subject: Re: [PATCH] vmpressure: fix divide-by-0 in vmpressure_work_fn

On Fri 06-09-13 22:59:16, Hugh Dickins wrote:
> Hit divide-by-0 in vmpressure_work_fn(): checking vmpr->scanned before
> taking the lock is not enough, we must check scanned afterwards too.

As vmpressure_work_fn seems to be the only place where we set scanned
to 0 (except for the rare occasion when scanned overflows which
would be really surprising) then the only possible way would be two
vmpressure_work_fn racing over the same work item. system_wq is
!WQ_NON_REENTRANT so one work item might be processed by multiple
workers on different CPUs. This means that the vmpr->scanned check in
the beginning of vmpressure_work_fn is inherently racy.

Hugh's patch fixes the issue obviously but doesn't it make more sense to
move the initial vmpr->scanned check under the lock instead?

Anton, what was the initial motivation for the out of the lock
check? Does it really optimize anything?

> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: [email protected]
> ---
>
> mm/vmpressure.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> --- 3.11/mm/vmpressure.c 2013-09-02 13:46:10.000000000 -0700
> +++ linux/mm/vmpressure.c 2013-09-06 22:43:03.596003080 -0700
> @@ -187,6 +187,9 @@ static void vmpressure_work_fn(struct wo
> vmpr->reclaimed = 0;
> spin_unlock(&vmpr->sr_lock);
>
> + if (!scanned)
> + return;
> +
> do {
> if (vmpressure_event(vmpr, scanned, reclaimed))
> break;

--
Michal Hocko
SUSE Labs

2013-09-11 05:37:38

by Anton Vorontsov

Subject: Re: [PATCH] vmpressure: fix divide-by-0 in vmpressure_work_fn

On Fri, Sep 06, 2013 at 10:59:16PM -0700, Hugh Dickins wrote:
> Hit divide-by-0 in vmpressure_work_fn(): checking vmpr->scanned before
> taking the lock is not enough, we must check scanned afterwards too.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: [email protected]

Hm... Just trying to understand this one. I don't see how this can happen,
considering that only one instance of vmpressure_work_fn() is supposed to be
running (unlike vmpressure()), and the only place where we zero
vmpr->scanned is vmpressure_work_fn() itself?

> ---
>
> mm/vmpressure.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> --- 3.11/mm/vmpressure.c 2013-09-02 13:46:10.000000000 -0700
> +++ linux/mm/vmpressure.c 2013-09-06 22:43:03.596003080 -0700
> @@ -187,6 +187,9 @@ static void vmpressure_work_fn(struct wo
> vmpr->reclaimed = 0;
> spin_unlock(&vmpr->sr_lock);
>
> + if (!scanned)
> + return;
> +
> do {
> if (vmpressure_event(vmpr, scanned, reclaimed))
> break;

2013-09-11 15:41:04

by Anton Vorontsov

Subject: Re: [PATCH] vmpressure: fix divide-by-0 in vmpressure_work_fn

On Mon, Sep 09, 2013 at 01:08:47PM +0200, Michal Hocko wrote:
> On Fri 06-09-13 22:59:16, Hugh Dickins wrote:
> > Hit divide-by-0 in vmpressure_work_fn(): checking vmpr->scanned before
> > taking the lock is not enough, we must check scanned afterwards too.
>
> As vmpressure_work_fn seems to be the only place where we set scanned
> to 0 (except for the rare occasion when scanned overflows which
> would be really surprising) then the only possible way would be two
> vmpressure_work_fn racing over the same work item. system_wq is
> !WQ_NON_REENTRANT so one work item might be processed by multiple
> workers on different CPUs. This means that the vmpr->scanned check in
> the beginning of vmpressure_work_fn is inherently racy.
>
> Hugh's patch fixes the issue obviously but doesn't it make more sense to
> move the initial vmpr->scanned check under the lock instead?
>
> Anton, what was the initial motivation for the out of the lock
> check? Does it really optimize anything?

Thanks a lot for the explanation.

Answering your question: the idea was to minimize the lock section, but the
section is quite small anyway so I doubt that it makes any difference (during
development I could not measure any effect of vmpressure() calls in my system,
though the system itself was quite small).

I am happy with moving the check under the lock or moving the work into
its own WQ_NON_REENTRANT queue.
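For the second option, a hypothetical sketch of what a dedicated non-reentrant queue might look like (illustration only, untested; names and the init hook are made up):

```c
/* Give vmpressure its own non-reentrant workqueue instead of
 * relying on system_wq. */
static struct workqueue_struct *vmpressure_wq;

static int __init vmpressure_init(void)
{
	vmpressure_wq = alloc_workqueue("vmpressure", WQ_NON_REENTRANT, 0);
	if (!vmpressure_wq)
		return -ENOMEM;
	return 0;
}

/* callers would then use
 *	queue_work(vmpressure_wq, &vmpr->work);
 * instead of scheduling on system_wq */
```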

Anton

2013-09-11 16:04:00

by Michal Hocko

Subject: Re: [PATCH] vmpressure: fix divide-by-0 in vmpressure_work_fn

On Wed 11-09-13 08:40:57, Anton Vorontsov wrote:
> On Mon, Sep 09, 2013 at 01:08:47PM +0200, Michal Hocko wrote:
> > On Fri 06-09-13 22:59:16, Hugh Dickins wrote:
> > > Hit divide-by-0 in vmpressure_work_fn(): checking vmpr->scanned before
> > > taking the lock is not enough, we must check scanned afterwards too.
> >
> > As vmpressure_work_fn seems to be the only place where we set scanned
> > to 0 (except for the rare occasion when scanned overflows which
> > would be really surprising) then the only possible way would be two
> > vmpressure_work_fn racing over the same work item. system_wq is
> > !WQ_NON_REENTRANT so one work item might be processed by multiple
> > workers on different CPUs. This means that the vmpr->scanned check in
> > the beginning of vmpressure_work_fn is inherently racy.
> >
> > Hugh's patch fixes the issue obviously but doesn't it make more sense to
> > move the initial vmpr->scanned check under the lock instead?
> >
> > Anton, what was the initial motivation for the out of the lock
> > check? Does it really optimize anything?
>
> Thanks a lot for the explanation.
>
> Answering your question: the idea was to minimize the lock section, but the
> section is quite small anyway so I doubt that it makes any difference (during
> development I could not measure any effect of vmpressure() calls in my system,
> though the system itself was quite small).
>
> I am happy with moving the check under the lock

The patch below. I find it a little bit nicer than Hugh's original one
because having the two checks seems more confusing.
What do you think Hugh, Anton?

> or moving the work into its own WQ_NON_REENTRANT queue.

That sounds like an overkill.

---
From 888745909da34f8aee8a208a82d467236b828d0d Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 11 Sep 2013 17:48:10 +0200
Subject: [PATCH] vmpressure: fix divide-by-0 in vmpressure_work_fn

Hugh Dickins has reported a division by 0 when a vmpressure event is
processed. The reason for the exception is that a single vmpressure
work item (which is per memcg) might be processed by multiple CPUs
because it is enqueued on system_wq which is !WQ_NON_REENTRANT.
This means that the out of lock vmpr->scanned check in
vmpressure_work_fn is inherently racy and the racing workers will see
already zeroed scanned value after they manage to take the spin lock.

The patch simply moves the vmpr->scanned check inside the sr_lock to fix
the race.

The issue was there since the very beginning but "vmpressure: change
vmpressure::sr_lock to spinlock" might have made it more visible as the
racing workers would sleep on the mutex and give it more time to see
updated value. The issue was still there, though.

Reported-by: Hugh Dickins <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
Cc: [email protected]
---
mm/vmpressure.c | 17 +++++++++--------
1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/mm/vmpressure.c b/mm/vmpressure.c
index e0f6283..ad679a0 100644
--- a/mm/vmpressure.c
+++ b/mm/vmpressure.c
@@ -164,18 +164,19 @@ static void vmpressure_work_fn(struct work_struct *work)
 	unsigned long scanned;
 	unsigned long reclaimed;

+	spin_lock(&vmpr->sr_lock);
+
 	/*
-	 * Several contexts might be calling vmpressure(), so it is
-	 * possible that the work was rescheduled again before the old
-	 * work context cleared the counters. In that case we will run
-	 * just after the old work returns, but then scanned might be zero
-	 * here. No need for any locks here since we don't care if
-	 * vmpr->reclaimed is in sync.
+	 * Several contexts might be calling vmpressure() and the work
+	 * item is sitting on !WQ_NON_REENTRANT workqueue so different
+	 * CPUs might execute it concurrently. Bail out if the scanned
+	 * counter is already 0 because all the work has been done already.
 	 */
-	if (!vmpr->scanned)
+	if (!vmpr->scanned) {
+		spin_unlock(&vmpr->sr_lock);
 		return;
+	}

-	spin_lock(&vmpr->sr_lock);
 	scanned = vmpr->scanned;
 	reclaimed = vmpr->reclaimed;
 	vmpr->scanned = 0;
--
1.7.10.4


--
Michal Hocko
SUSE Labs

2013-09-11 16:13:00

by Anton Vorontsov

Subject: Re: [PATCH] vmpressure: fix divide-by-0 in vmpressure_work_fn

On Wed, Sep 11, 2013 at 06:03:57PM +0200, Michal Hocko wrote:
> The patch below. I find it a little bit nicer than Hugh's original one
> because having the two checks seems more confusing.
> What do you think Hugh, Anton?

Acked-by: Anton Vorontsov <[email protected]>

Thanks!

> ---
> From 888745909da34f8aee8a208a82d467236b828d0d Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Wed, 11 Sep 2013 17:48:10 +0200
> Subject: [PATCH] vmpressure: fix divide-by-0 in vmpressure_work_fn
>
> Hugh Dickins has reported a division by 0 when a vmpressure event is
> processed. The reason for the exception is that a single vmpressure
> work item (which is per memcg) might be processed by multiple CPUs
> because it is enqueued on system_wq which is !WQ_NON_REENTRANT.
> This means that the out of lock vmpr->scanned check in
> vmpressure_work_fn is inherently racy and the racing workers will see
> already zeroed scanned value after they manage to take the spin lock.
>
> The patch simply moves the vmpr->scanned check inside the sr_lock to fix
> the race.
>
> The issue was there since the very beginning but "vmpressure: change
> vmpressure::sr_lock to spinlock" might have made it more visible as the
> racing workers would sleep on the mutex and give it more time to see
> updated value. The issue was still there, though.
>
> Reported-by: Hugh Dickins <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> Cc: [email protected]
> ---
> mm/vmpressure.c | 17 +++++++++--------
> 1 file changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index e0f6283..ad679a0 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -164,18 +164,19 @@ static void vmpressure_work_fn(struct work_struct *work)
> unsigned long scanned;
> unsigned long reclaimed;
>
> + spin_lock(&vmpr->sr_lock);
> +
> /*
> - * Several contexts might be calling vmpressure(), so it is
> - * possible that the work was rescheduled again before the old
> - * work context cleared the counters. In that case we will run
> - * just after the old work returns, but then scanned might be zero
> - * here. No need for any locks here since we don't care if
> - * vmpr->reclaimed is in sync.
> + * Several contexts might be calling vmpressure() and the work
> + * item is sitting on !WQ_NON_REENTRANT workqueue so different
> + * CPUs might execute it concurrently. Bail out if the scanned
> + * counter is already 0 because all the work has been done already.
> */
> - if (!vmpr->scanned)
> + if (!vmpr->scanned) {
> + spin_unlock(&vmpr->sr_lock);
> return;
> + }
>
> - spin_lock(&vmpr->sr_lock);
> scanned = vmpr->scanned;
> reclaimed = vmpr->reclaimed;
> vmpr->scanned = 0;
> --
> 1.7.10.4

2013-09-11 20:04:54

by Hugh Dickins

Subject: Re: [PATCH] vmpressure: fix divide-by-0 in vmpressure_work_fn

On Wed, 11 Sep 2013, Michal Hocko wrote:
> On Wed 11-09-13 08:40:57, Anton Vorontsov wrote:
> > On Mon, Sep 09, 2013 at 01:08:47PM +0200, Michal Hocko wrote:
> > > On Fri 06-09-13 22:59:16, Hugh Dickins wrote:
> > > > Hit divide-by-0 in vmpressure_work_fn(): checking vmpr->scanned before
> > > > taking the lock is not enough, we must check scanned afterwards too.
> > >
> > > As vmpressure_work_fn seems to be the only place where we set scanned
> > > to 0 (except for the rare occasion when scanned overflows which
> > > would be really surprising) then the only possible way would be two
> > > vmpressure_work_fn racing over the same work item. system_wq is
> > > !WQ_NON_REENTRANT so one work item might be processed by multiple
> > > workers on different CPUs. This means that the vmpr->scanned check in
> > > the beginning of vmpressure_work_fn is inherently racy.
> > >
> > > Hugh's patch fixes the issue obviously but doesn't it make more sense to
> > > move the initial vmpr->scanned check under the lock instead?
> > >
> > > Anton, what was the initial motivation for the out of the lock
> > > check? Does it really optimize anything?
> >
> > Thanks a lot for the explanation.
> >
> > Answering your question: the idea was to minimize the lock section, but the
> > section is quite small anyway so I doubt that it makes any difference (during
> > development I could not measure any effect of vmpressure() calls in my system,
> > though the system itself was quite small).
> >
> > I am happy with moving the check under the lock
>
> The patch below. I find it a little bit nicer than Hugh's original one
> because having the two checks seems more confusing.
> What do you think Hugh, Anton?
>
> > or moving the work into its own WQ_NON_REENTRANT queue.
>
> That sounds like an overkill.
>
> ---
> From 888745909da34f8aee8a208a82d467236b828d0d Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Wed, 11 Sep 2013 17:48:10 +0200
> Subject: [PATCH] vmpressure: fix divide-by-0 in vmpressure_work_fn
>
> Hugh Dickins has reported a division by 0 when a vmpressure event is
> processed. The reason for the exception is that a single vmpressure
> work item (which is per memcg) might be processed by multiple CPUs
> because it is enqueued on system_wq which is !WQ_NON_REENTRANT.
> This means that the out of lock vmpr->scanned check in
> vmpressure_work_fn is inherently racy and the racing workers will see
> already zeroed scanned value after they manage to take the spin lock.
>
> The patch simply moves the vmpr->scanned check inside the sr_lock to fix
> the race.
>
> The issue was there since the very beginning but "vmpressure: change
> vmpressure::sr_lock to spinlock" might have made it more visible as the
> racing workers would sleep on the mutex and give it more time to see
> updated value. The issue was still there, though.
>
> Reported-by: Hugh Dickins <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> Cc: [email protected]

Nack! But equally Nack to my original.

Many thanks for looking into how this might have happened, Michal,
and for mentioning the WQ_NON_REENTRANT flag: which I knew nothing
about, but have now followed up.

I owe you all an abject apology: what I didn't mention in my patch
was that actually I hit the problem on a v3.3-based kernel to which
vmpressure had been backported.

I have not yet seen the problem on v3.11 or v3.10, and now believe
that it cannot happen there - which would explain why I was the
first to hit it.

When I looked up WQ_NON_REENTRANT in the latest tree, I found
WQ_NON_REENTRANT = 1 << 0, /* DEPRECATED */
and git blame on that line leads to Tejun explaining

dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
WQ_NON_REENTRANT no-op but the following patches didn't remove the
flag or update the documentation. Let's mark the flag deprecated and
update the documentation accordingly.

dbf2576e37 went into v3.7, so I now believe this divide-by-0 could
only happen on a backport of vmpressure to an earlier kernel than that.

Tejun made that change precisely to guard against this kind of subtle
unsafe issue; but it does provide a good illustration of the danger of
backporting something to a kernel where primitives behave less safely.

Sorry for wasting all your time.

As to your code change itself, Michal: I don't really mind one way or
the other - it now seems unnecessary. On the one hand I liked Anton's
minor optimization, on the other hand your way is more proof against
future change.

My Nack is really to your comment (and the Cc stable): we cannot
explain in terms of WQ_NON_REENTRANT when that is a no-op!

Hugh

> ---
> mm/vmpressure.c | 17 +++++++++--------
> 1 file changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/mm/vmpressure.c b/mm/vmpressure.c
> index e0f6283..ad679a0 100644
> --- a/mm/vmpressure.c
> +++ b/mm/vmpressure.c
> @@ -164,18 +164,19 @@ static void vmpressure_work_fn(struct work_struct *work)
> unsigned long scanned;
> unsigned long reclaimed;
>
> + spin_lock(&vmpr->sr_lock);
> +
> /*
> - * Several contexts might be calling vmpressure(), so it is
> - * possible that the work was rescheduled again before the old
> - * work context cleared the counters. In that case we will run
> - * just after the old work returns, but then scanned might be zero
> - * here. No need for any locks here since we don't care if
> - * vmpr->reclaimed is in sync.
> + * Several contexts might be calling vmpressure() and the work
> + * item is sitting on !WQ_NON_REENTRANT workqueue so different
> + * CPUs might execute it concurrently. Bail out if the scanned
> + * counter is already 0 because all the work has been done already.
> */
> - if (!vmpr->scanned)
> + if (!vmpr->scanned) {
> + spin_unlock(&vmpr->sr_lock);
> return;
> + }
>
> - spin_lock(&vmpr->sr_lock);
> scanned = vmpr->scanned;
> reclaimed = vmpr->reclaimed;
> vmpr->scanned = 0;
> --
> 1.7.10.4

2013-09-12 11:46:05

by Michal Hocko

Subject: Re: [PATCH] vmpressure: fix divide-by-0 in vmpressure_work_fn

On Wed 11-09-13 13:04:33, Hugh Dickins wrote:
> On Wed, 11 Sep 2013, Michal Hocko wrote:
[...]
> > From 888745909da34f8aee8a208a82d467236b828d0d Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Wed, 11 Sep 2013 17:48:10 +0200
> > Subject: [PATCH] vmpressure: fix divide-by-0 in vmpressure_work_fn
> >
> > Hugh Dickins has reported a division by 0 when a vmpressure event is
> > processed. The reason for the exception is that a single vmpressure
> > work item (which is per memcg) might be processed by multiple CPUs
> > because it is enqueued on system_wq which is !WQ_NON_REENTRANT.
> > This means that the out of lock vmpr->scanned check in
> > vmpressure_work_fn is inherently racy and the racing workers will see
> > already zeroed scanned value after they manage to take the spin lock.
> >
> > The patch simply moves the vmpr->scanned check inside the sr_lock to fix
> > the race.
> >
> > The issue was there since the very beginning but "vmpressure: change
> > vmpressure::sr_lock to spinlock" might have made it more visible as the
> > racing workers would sleep on the mutex and give it more time to see
> > updated value. The issue was still there, though.
> >
> > Reported-by: Hugh Dickins <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
> > Cc: [email protected]
>
> Nack! But equally Nack to my original.
>
> Many thanks for looking into how this might have happened, Michal,
> and for mentioning the WQ_NON_REENTRANT flag: which I knew nothing
> about, but have now followed up.
> I owe you all an abject apology: what I didn't mention in my patch
> was that actually I hit the problem on a v3.3-based kernel to which
> vmpressure had been backported.
>
> I have not yet seen the problem on v3.11 or v3.10, and now believe
> that it cannot happen there - which would explain why I was the
> first to hit it.
>
> When I looked up WQ_NON_REENTRANT in the latest tree, I found
> WQ_NON_REENTRANT = 1 << 0, /* DEPRECATED */
> and git blame on that line leads to Tejun explaining
>
> dbf2576e37 ("workqueue: make all workqueues non-reentrant") made
> WQ_NON_REENTRANT no-op but the following patches didn't remove the
> flag or update the documentation. Let's mark the flag deprecated and
> update the documentation accordingly.

Good point. I didn't check the code and relied on the documentation.
Thanks for pointing this out.

> dbf2576e37 went into v3.7, so I now believe this divide-by-0 could
> only happen on a backport of vmpressure to an earlier kernel than that.

git grep WQ_NON_REENTRANT on kernel/workqueue.c really shows nothing, so
I guess you are right.

Andrew, please drop the patch.
--
Michal Hocko
SUSE Labs