2005-11-18 17:12:53

by Andrea Arcangeli

[permalink] [raw]
Subject: shrinker->nr = LONG_MAX means deadlock for icache

Hello,

Greg Edwards found a deadlock in the icache shrinker.

I believe the major bug is that the VM is currently potentially setting
nr = LONG_MAX before shrinking the icache (and the icache shrinker never
returns -1, which means the api doesn't currently require shrinkers to
return -1 when they're finished).

The below is the most obviously safe way I could address this problem
(still untested).

This is not necessarily the way we want to fix it in mainline, but it at
least shows what I believe to be the major culprit in the code (i.e. nr
growing insane ;).

Comments welcome as usual.

Signed-off-by: Andrea Arcangeli <[email protected]>

diff -r 5111ab3d0d8a mm/vmscan.c
--- a/mm/vmscan.c Fri Nov 18 09:26:56 2005 +0800
+++ b/mm/vmscan.c Fri Nov 18 19:01:55 2005 +0200
@@ -201,13 +201,21 @@
list_for_each_entry(shrinker, &shrinker_list, list) {
unsigned long long delta;
unsigned long total_scan;
+ unsigned long max_pass = (*shrinker->shrinker)(0, gfp_mask);

delta = (4 * scanned) / shrinker->seeks;
- delta *= (*shrinker->shrinker)(0, gfp_mask);
+ delta *= max_pass;
do_div(delta, lru_pages + 1);
shrinker->nr += delta;
if (shrinker->nr < 0)
shrinker->nr = LONG_MAX; /* It wrapped! */
+ /*
+ * Avoid risking looping forever due to too large nr value:
+ * never try to free more than twice the estimated number of
+ * freeable entries.
+ */
+ if (shrinker->nr > max_pass * 2)
+ shrinker->nr = max_pass * 2;

total_scan = shrinker->nr;
shrinker->nr = 0;


2005-11-19 07:32:25

by Andrew Morton

[permalink] [raw]
Subject: Re: shrinker->nr = LONG_MAX means deadlock for icache

Andrea Arcangeli <[email protected]> wrote:
>
> Greg Edwards found a deadlock in the icache shrinker.

Do we know what sort of machine it was? Amount of memory? Were there
zillions of inodes in icache at the time? edward@sgi means 64 bits?

> I believe the major bug is that the VM is currently potentially setting
> nr = LONG_MAX before shrinking the icache (and the icache shrinker never
> returns -1, which means the api doesn't currently require shrinkers to
> return -1 when they're finished).

I'm having trouble seeing how ->nr could go negative.

I'm also having trouble remembering how the code works, so bear with me ;)

The idea in shrink_slab is that we'll scan each slab object at a rate which
is proportional to that at which we scan each page.

So shrink_slab() is passed the number of LRU pages which were just scanned,
and the total number of LRU pages (which are pertinent to this allocation
attempt).

On this call to shrink_slab(), we need to work out how many objects are to
be scanned for each cache. This is calculated as:

(number of objects in the cache) *
(number of scanned LRU pages / number of LRU pages)

and then we apply a basically-constant multiplier of 4/shrinker->seeks,
which is close to 1.

The RHS of this multiplication cannot exceed 1. (Checks. Yup, we zero out
nr_scanned each time we increase the scanning priority).

The LHS cannot exceed LONG_MAX on 32-bit or 64-bit.
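
If it helps, the arithmetic above can be sketched in plain C. This is a
standalone illustration only, not the kernel code; the names scanned,
lru_pages, max_pass and seeks mirror the discussion in this thread:

```c
/*
 * Sketch of the per-cache scan-target arithmetic described above.
 * max_pass plays the role of (*shrinker->shrinker)(0, gfp_mask):
 * the number of freeable objects currently in the cache.
 */
unsigned long scan_target(unsigned long scanned,   /* LRU pages just scanned */
			  unsigned long lru_pages, /* total LRU pages */
			  unsigned long max_pass,  /* objects in the cache */
			  unsigned int seeks)      /* shrinker->seeks */
{
	/* delta = (4 / seeks) * max_pass * (scanned / lru_pages),
	 * kept in 64 bits so the intermediate product can't wrap */
	unsigned long long delta = (4ULL * scanned) / seeks;

	delta *= max_pass;
	delta /= lru_pages + 1;	/* the +1 avoids a division by zero */
	return (unsigned long)delta;
}
```

With seeks = 2 (DEFAULT_SEEKS) the leading multiplier is 2, so scanning
the whole LRU once (scanned == lru_pages) asks for roughly twice
max_pass objects.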


Now if the shrinker returns -1 during the actual slab scan, the scanner
will bale out but will remember that it still needs to scan those entries
for next time.


All I can surmise is that either

a) a shrinker keeps on returning -1 and ->nr overflowed. That means we
did a huge number of !__GFP_FS memory allocations. That seems like
quite a problem, so perhaps:

diff -puN mm/vmscan.c~a mm/vmscan.c
--- devel/mm/vmscan.c~a 2005-11-18 23:22:51.000000000 -0800
+++ devel-akpm/mm/vmscan.c 2005-11-18 23:24:08.000000000 -0800
@@ -227,8 +227,15 @@ static int shrink_slab(unsigned long sca

nr_before = (*shrinker->shrinker)(0, gfp_mask);
shrink_ret = (*shrinker->shrinker)(this_scan, gfp_mask);
- if (shrink_ret == -1)
+ if (shrink_ret == -1) {
+ /*
+ * Don't allow ->nr to overflow. This can
+ * happen if someone is doing a great number
+ * of !__GFP_FS calls.
+ */
+ total_scan = 0;
break;
+ }
if (shrink_ret < nr_before)
ret += nr_before - shrink_ret;
mod_page_state(slabs_scanned, this_scan);
_



or

b) your inodes_stat.nr_unused went nuts. ISTR that we've had a few
problems with .nr_unused (was it the dentry cache or the inode cache
though)?

Which kernel was this based upon?

Is it reproducible? If so, a few printks triggered when shrinker->nr
starts to get ridiculously large would be useful.

2005-11-19 10:37:32

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: shrinker->nr = LONG_MAX means deadlock for icache

On Fri, Nov 18, 2005 at 11:29:04PM -0800, Andrew Morton wrote:
> Do we know what sort of machine it was? Amount of memory? Were there
> zillions of inodes in icache at the time? edward@sgi means 64 bits?

I don't know, but perhaps Greg can answer that.

All I know is that they deadlocked in shrink_slab with
shrink_icache_memory returning 3. I don't see how it can return 3, given
the value is multiplied by 100 before shrink_icache_memory returns, but
I don't care much about this detail; I guess they debugged it wrong, or
they tweaked the sysctl or similar.

So one of Greg's suggestions was to modify the code to break the loop
if shrink_icache_memory returns the same value twice in a row (like
twice 3), but I thought that was the wrong fix, given that the division
by 100 means shrink_icache_memory can return the same value twice
despite having made some progress in the LRU walk.

If breaking the loop despite having made progress didn't risk sending
us out of memory (which I think it theoretically could), then we should
remove the loop completely. But I believe the loop is needed if we
shrink in small SHRINK_BATCH passes, so breaking it early when
shrink_icache_memory returns the same value twice in a row was unsafe.

And if we don't try to keep nr under control, then even if the deadlock
is avoided, a huge nr value would over-shrink the caches. That would
leave a performance bug and a potential early-OOM.

> I'm having trouble seeing how ->nr could go negative.

Me too; I doubt they hit that path, and I assume it's there just in
case. But the interesting thing is that if that code exists, it means
nr can grow to insane values. Hitting that code or reaching only half
of LONG_MAX isn't going to make any difference w.r.t. deadlocking in
shrink_slab.

> I'm also having trouble remembering how the code works, so bear with me ;)

Well, I'm trying to understand it too; I didn't write it ;). I posted
here exactly to get feedback from whoever wrote the code, to verify
that my proposed fix (or band-aid) was OK.

The code does something like this before returning:

shrinker->nr += total_scan

total_scan is never decreased if the gfp_mask has no __GFP_FS.

So I guess it tries to account for the fact that it couldn't shrink the
caches even though it should have.

Regardless of what the code does and why it does it, my point is that it
can't be right to have something like this:

if (shrinker->nr < 0)
shrinker->nr = LONG_MAX; /* It wrapped! */

right before looping shrinker->nr/SHRINK_BATCH times without any chance
of breaking the loop (and even if we break the loop when there are no
more freeable entries, such a huge "nr" setting means destroying the
icache/dcache every time).
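
To make the failure mode concrete, here's a minimal sketch (my own
illustration, not the kernel code) of that batching loop; SHRINK_BATCH
is 128 in vmscan.c, so an insane nr translates directly into nr/128
calls into the shrinker:

```c
#define SHRINK_BATCH 128	/* as in mm/vmscan.c */

/*
 * Illustration of the batching loop: total_scan (i.e. shrinker->nr)
 * is consumed SHRINK_BATCH entries at a time, and nothing bounds the
 * number of passes unless the shrinker itself returns -1.  This just
 * counts how many passes a given total_scan costs.
 */
unsigned long shrink_passes(unsigned long total_scan)
{
	unsigned long passes = 0;

	while (total_scan >= SHRINK_BATCH) {
		/* kernel: this_scan = min(total_scan, SHRINK_BATCH);
		 *         (*shrinker->shrinker)(this_scan, gfp_mask); */
		total_scan -= SHRINK_BATCH;
		passes++;
	}
	return passes;
}
```

With nr = LONG_MAX on 64-bit that's about 7.2e16 passes, which is
effectively a hang even when every single pass succeeds.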

Clearly, when I heard about this deadlock and read the code, my
preferred point of attack was the above "nr = LONG_MAX" assignment.

> The idea in shrink_slab is that we'll scan each slab object at a rate which
> is proportional to that at which we scan each page.
>
> So shrink_slab() is passed the number of LRU pages which were just scanned,
> and the total number of LRU pages (which are pertinent to this allocation
> attempt).
>
> On this call to shrink_slab(), we need to work out how many objects are to
> be scanned for each cache. This is calculated as:
>
> (number of objects in the cache) *
> (number of scanned LRU pages / number of LRU pages)
> and then we apply a basically-constant multiplier of 4/shrinker->seeks,
> which is close to 1.

So it tries to simulate having those slab entries in the page LRU (and
the speed becomes tunable via the per-shrinker seeks param, to give
higher/lower priority to the caches).

> The RHS of this multiplication cannot exceed 1. (Checks. Yup, we zero out
> nr_scanned each time we increase the scanning priority).
>
> The LHS cannot exceed LONG_MAX on 32-bit or 64-bit.
>
>
> Now if the shrinker returns -1 during the actual slab scan, the scanner
> will bale out but will remember that it still needs to scan those entries
> for next time.
>
>
> All I can surmise is that either
>
> a) a shrinker keeps on returning -1 and ->nr overflowed. That means we
> did a huge number of !__GFP_FS memory allocations. That seems like
> quite a problem, so perhaps:

Agreed.

> diff -puN mm/vmscan.c~a mm/vmscan.c
> --- devel/mm/vmscan.c~a 2005-11-18 23:22:51.000000000 -0800
> +++ devel-akpm/mm/vmscan.c 2005-11-18 23:24:08.000000000 -0800
> @@ -227,8 +227,15 @@ static int shrink_slab(unsigned long sca
>
> nr_before = (*shrinker->shrinker)(0, gfp_mask);
> shrink_ret = (*shrinker->shrinker)(this_scan, gfp_mask);
> - if (shrink_ret == -1)
> + if (shrink_ret == -1) {
> + /*
> + * Don't allow ->nr to overflow. This can
> + * happen if someone is doing a great number
> + * of !__GFP_FS calls.
> + */
> + total_scan = 0;
> break;
> + }
> if (shrink_ret < nr_before)
> ret += nr_before - shrink_ret;
> mod_page_state(slabs_scanned, this_scan);
> _

It made some sense to me to account for the "failed attempt" in future
passes; my fix tried to retain this feature.

However, I certainly won't object to the above fix; I guess there's no
real difference in practice between my fix and the above one.

But then you also have to add a BUG_ON(shrinker->nr < 0). There is no
chance that the current code doing shrinker->nr = LONG_MAX can be
correct, so whatever we change, that branch must go away too (and be
replaced by a BUG_ON).

> b) your inodes_stat.nr_unused went nuts. ISTR that we've had a few
> problems with .nr_unused (was it the dentry cache or the inode cache
> though)?
>
> Which kernel was this based upon?

Good point. However, they reported that the return value of
shrink_icache_memory was a constant 3, so neither negative nor huge (as
said, I don't see why it's not a multiple of 100, but at least it
doesn't sound like a nr_unused screwup if what they saw
shrink_icache_memory return was "3").

If that was the bug, our fixes wouldn't make a difference. The current
status of the kernel they used has all the nr_unused and other
icache-wakeup race-condition fixes, but it's not certain whether they
were using a previous version that still had those problems.

> Is it reproducible? If so, a few printks triggered when shrinker->nr
> starts to get ridiculously large would be useful.

Dunno. Also, they're using the toss-pagecache feature to invoke the
shrinking; the system was not short of free RAM. I don't know how they
debugged it, but my plan was to fix the obvious problem with "nr"
potentially growing too much (the part certainly good for mainline too),
then try again and, if they can reproduce it, debug it further by
checking the parameters passed to shrink_slab. However, I clearly hope
that all they need is the above fix ;).

Thanks!

2005-11-19 11:06:27

by Andrew Morton

[permalink] [raw]
Subject: Re: shrinker->nr = LONG_MAX means deadlock for icache

Andrea Arcangeli <[email protected]> wrote:
>
>
> ...
> The code does something like this before returning:
>
> shrinker->nr += total_scan
>
> total_scan is never decreased if the gfp_mask has no __GFP_FS.
>
> So I guess it tries to account for the fact it couldn't shrink the
> caches despite it should have.

Yup. It's problematic: if someone is doing a huge number of (say) GFP_IO
allocations, ->nr might get really big. But we do expect that next time
kswapd gets in there with GFP_KERNEL it will sit in a tight loop doing huge
amounts of work, and will catch up.

It would be nice to understand exactly what's gone wrong.

> Regardless of what the code does and why it does it, my point is that it
> can't be right to have something like this:
>
> if (shrinker->nr < 0)
> shrinker->nr = LONG_MAX; /* It wrapped! */
>
> right before looping shrinker->nr/SHRINK_BATCH times without any chance
> of breaking the loop (and even if we break the loop when there are no
> more freeable entries, such a huge "nr" setting means destroying the
> icache/dcache every time).

Yes, I think that code's crap. Perhaps it's left over from when the
counters in there were `long'. I ended up making them `long long' because
I couldn't work out how to prevent the intermediate results from
overflowing in all scenarios.

We'd be better off with a printk and a reset-it-to-something-sane.

>
> So it tries to simulate having those slab entries in the page LRU (and
> the speed becomes tunable via the per-shrinker seeks param, to give
> higher/lower priority to the caches).

Yup.

<OT>It's actually quite inaccurate because of slab internal fragmentation -
freeing a page always frees some memory, but freeing a slab object doesn't
necessarily do so. Although on average I guess it gets it right, unless
there are pathological allocation patterns in slab.

It's fairly easy to set up such patterns with artificial test apps: create
a deep+broad directory tree with the same number of entries in each subdir,
for example. </OT>

>
> It made some sense to me to account for the "failed attempt" in future
> passes; my fix tried to retain this feature.
>

I guess so, although I worry that this way we'll obscure the real bug,
whatever it is.

> However, I certainly won't object to the above fix; I guess there's no
> real difference in practice between my fix and the above one.

Sure. You've limited the number of scanned objects in one pass to twice
the number of objects - there's no point in doing more work than that.

> But then you also have to add a BUG_ON(shrinker->nr < 0). There is no
> chance that the current code doing shrinker->nr = LONG_MAX can be
> correct, so whatever we change, that branch must go away too (and be
> replaced by a BUG_ON).

True.

> > b) your inodes_stat.nr_unused went nuts. ISTR that we've had a few
> > problems with .nr_unused (was it the dentry cache or the inode cache
> > though)?
> >
> > Which kernel was this based upon?
>
> Good point. However, they reported that the return value of
> shrink_icache_memory was a constant 3, so neither negative nor huge (as
> said, I don't see why it's not a multiple of 100, but at least it
> doesn't sound like a nr_unused screwup if what they saw
> shrink_icache_memory return was "3").

A return value of 3 is very odd. I'd be suspecting a mismeasurement.
Unless someone had altered vfs_cache_pressure.

> If that was the bug, our fixes wouldn't make a difference. The current
> status of the kernel they used has all the nr_unused and other
> icache-wakeup race-condition fixes, but it's not certain whether they
> were using a previous version that still had those problems.

OK. Well, if Edwards & co could do a bit more investigation it'd be great -
meanwhile I'll hang onto this (and might add some mm-only debugging,
depending on how Edwards gets on):


diff -puN mm/vmscan.c~shrinker-nr-=-long_max-means-deadlock-for-icache mm/vmscan.c
--- devel/mm/vmscan.c~shrinker-nr-=-long_max-means-deadlock-for-icache 2005-11-19 02:53:18.000000000 -0800
+++ devel-akpm/mm/vmscan.c 2005-11-19 03:00:25.000000000 -0800
@@ -201,13 +201,25 @@ static int shrink_slab(unsigned long sca
list_for_each_entry(shrinker, &shrinker_list, list) {
unsigned long long delta;
unsigned long total_scan;
+ unsigned long max_pass = (*shrinker->shrinker)(0, gfp_mask);

delta = (4 * scanned) / shrinker->seeks;
- delta *= (*shrinker->shrinker)(0, gfp_mask);
+ delta *= max_pass;
do_div(delta, lru_pages + 1);
shrinker->nr += delta;
- if (shrinker->nr < 0)
- shrinker->nr = LONG_MAX; /* It wrapped! */
+ if (shrinker->nr < 0) {
+ printk(KERN_ERR "%s: nr=%ld\n",
+ __FUNCTION__, shrinker->nr);
+ shrinker->nr = max_pass;
+ }
+
+ /*
+ * Avoid risking looping forever due to too large nr value:
+ * never try to free more than twice the estimated number of
+ * freeable entries.
+ */
+ if (shrinker->nr > max_pass * 2)
+ shrinker->nr = max_pass * 2;

total_scan = shrinker->nr;
shrinker->nr = 0;
_

2005-11-19 11:38:42

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: shrinker->nr = LONG_MAX means deadlock for icache

On Sat, Nov 19, 2005 at 03:03:06AM -0800, Andrew Morton wrote:
> It would be nice to understand exactly what's gone wrong.

I found something more, see below.

> I guess so, although I worry that this way we'll obscure the real bug,
> whatever it is.

Now that I understand the math around scanned and lru_pages better, I
believe the reason they get this huge number in "nr" is their caller:
they pass 0 to shrink all slab entries. As said in the previous email,
they lock up when invoking the slab shrinking with the toss-cache
feature. They should have passed "tossed" as the third parameter too,
not 0.

int tossed = atomic_read(&npgs_tossed);
shrink_slab(tossed, GFP_KERNEL, 0 /* shrink max */);
atomic_set(&npgs_tossed, 0);

The zero as third parameter means nr will be "max_pass * scanned", so
if both the page LRU and the icache are huge, that can lead to a huge
value.

They should also add a WARN_ON to make sure that "tossed" is never
negative, just in case: if "tossed" were negative, the sign extension
during the int-to-unsigned-long conversion would generate a huge
number.
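
The conversion hazard is easy to demonstrate in isolation (a standalone
sketch, not their code):

```c
#include <limits.h>

/*
 * Standalone demonstration of the hazard above: passing a negative
 * int where shrink_slab() takes an unsigned long "scanned" yields a
 * huge value, because the int is sign-extended and the resulting bit
 * pattern is then reinterpreted as unsigned.
 */
unsigned long as_scanned(int tossed)
{
	return (unsigned long)tossed;
}
```

as_scanned(-1) comes out as ULONG_MAX, which then feeds the delta math
with an enormous scanned count; that is exactly the kind of "nr"
explosion a WARN_ON on a negative "tossed" would catch.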

So the caller has to be fixed too, even if now it would be ok to pass 0
without risking huge nr values (after fixing the unrelated __GFP_IO bug).

So hopefully the "0" as third parameter is good enough to explain the
(other) real bug and we won't be hiding more bugs with this fix.

> Sure. You've limited the number of scanned objects in one pass to twice
> the number of objects - there's no point in doing more work than that.

Agreed.

> A return value of 3 is very odd. I'd be suspecting a mismeasurement.
> Unless someone had altered vfs_cache_pressure.

Exactly.

> OK. Well, if Edwards & co could do a bit more investigation it'd be great -
> meanwhile I'll hang onto this (and might add some mm-only debugging,
> depending on how Edwards gets on):

Looks good to me, thanks!

2005-11-22 23:02:10

by Greg Edwards

[permalink] [raw]
Subject: Re: shrinker->nr = LONG_MAX means deadlock for icache

On Sat, Nov 19, 2005 at 12:38:34PM +0100, Andrea Arcangeli wrote:
| On Sat, Nov 19, 2005 at 03:03:06AM -0800, Andrew Morton wrote:
| > It would be nice to understand exactly what's gone wrong.
|
| I found something more, see below.

Looks like Andrea found the real culprit.


| > I guess so, although I worry that this way we'll obscure the real bug,
| > whatever it is.
|
| Now that I understand the math around scanned and lru_pages better, I
| believe the reason they get this huge number in "nr" is their caller:
| they pass 0 to shrink all slab entries. As said in the previous email,
| they lock up when invoking the slab shrinking with the toss-cache
| feature. They should have passed "tossed" as the third parameter too,
| not 0.
|
| int tossed = atomic_read(&npgs_tossed);
| shrink_slab(tossed, GFP_KERNEL, 0 /* shrink max */);
| atomic_set(&npgs_tossed, 0);
|
| The zero as third parameter means nr will be "max_pass * scanned", so
| if both the page LRU and the icache are huge, that can lead to a huge
| value.
|
| They should also add a WARN_ON to make sure that "tossed" is never
| negative, just in case: if "tossed" were negative, the sign extension
| during the int-to-unsigned-long conversion would generate a huge
| number.
|
| So the caller has to be fixed too, even if now it would be ok to pass 0
| without risking huge nr values (after fixing the unrelated __GFP_IO bug).
|
| So hopefully the "0" as third parameter is good enough to explain the
| (other) real bug and we won't be hiding more bugs with this fix.
|
| > Sure. You've limited the number of scanned objects in one pass to twice
| > the number of objects - there's no point in doing more work than that.
|
| Agreed.
|
| > A return value of 3 is very odd. I'd be suspecting a mismeasurement.
| > Unless someone had altered vfs_cache_pressure.
|
| Exactly.
|
| > OK. Well, if Edwards & co could do a bit more investigation it'd be great -
| > meanwhile I'll hang onto this (and might add some mm-only debugging,
| > depending on how Edwards gets on):
|
| Looks good to me, thanks!

CC'ed some of the folks who debugged this, in case they have anything to
add.

Greg