2015-11-25 15:36:57

by Vladimir Davydov

Subject: [PATCH] vmscan: do not throttle kthreads due to too_many_isolated

Block device drivers often hand off io request processing to kernel
threads (example: device mapper). If such a thread calls kmalloc, it can
dive into direct reclaim path and end up waiting for too_many_isolated
to return false, blocking writeback. This can lead to a dead lock if the
pages were isolated by processes performing memcg reclaim, because they
call wait_on_page_writeback upon encountering a page under writeback,
which will never finish if bio_endio is to be called by the kernel
thread stuck in the reclaimer, waiting for the isolated pages to be put
back.
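
For reference, the throttling in question is the busy-wait loop at the top
of shrink_inactive_list(), which looks roughly like this (paraphrased from
mm/vmscan.c, so details may differ):

	while (unlikely(too_many_isolated(zone, file, sc))) {
		/* Wait for other reclaimers to put isolated pages back. */
		congestion_wait(BLK_RW_ASYNC, HZ/10);

		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}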

I have never encountered such a deadlock on a vanilla kernel, nor have I
tried to reproduce it. However, I did hit it with an out-of-tree block
device driver, which uses a kernel thread for completing bios: the
kernel thread got stuck busy-checking too_many_isolated on the DMA zone,
which had only 3 inactive and 68 isolated file pages (2163 pages were
free); the pages had been isolated by memcg processes waiting for
writeback to finish. I don't see anything that would prevent this with
e.g. device mapper.

Let's fix this problem by making too_many_isolated always return false
for kernel threads. Apart from fixing the possible deadlock with the
legacy cgroup hierarchy, this makes sense even if the unified hierarchy
is used, where processes performing memcg reclaim never call
wait_on_page_writeback, because kernel threads might be responsible for
cleaning pages necessary for reclaim - BTW, throttle_direct_reclaim
never throttles kernel threads for the same reason.
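
That check looks roughly like this (paraphrased from mm/vmscan.c; the
comment wording is approximate):

	/*
	 * Kernel threads should not be throttled as they may be indirectly
	 * responsible for cleaning pages necessary for reclaim to make
	 * forward progress (e.g. kjournald committing a transaction).
	 */
	if (current->flags & PF_KTHREAD)
		goto out;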

Signed-off-by: Vladimir Davydov <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9d553b07bb86..0f1318a52b23 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1457,7 +1457,7 @@ static int too_many_isolated(struct zone *zone, int file,
 {
 	unsigned long inactive, isolated;
 
-	if (current_is_kswapd())
+	if (current->flags & PF_KTHREAD)
 		return 0;
 
 	if (!sane_reclaim(sc))
--
2.1.4


2015-11-25 15:45:18

by Vlastimil Babka

Subject: Re: [PATCH] vmscan: do not throttle kthreads due to too_many_isolated

On 11/25/2015 04:36 PM, Vladimir Davydov wrote:
> Block device drivers often hand off io request processing to kernel
> threads (example: device mapper). If such a thread calls kmalloc, it can
> dive into direct reclaim path and end up waiting for too_many_isolated
> to return false, blocking writeback. This can lead to a dead lock if the

Shouldn't such allocation lack __GFP_IO to prevent this and other kinds of
deadlocks? And/or have mempools? PF_KTHREAD looks like a big hammer to me that
will solve only one potential problem...

2015-11-25 16:28:15

by Vladimir Davydov

Subject: Re: [PATCH] vmscan: do not throttle kthreads due to too_many_isolated

On Wed, Nov 25, 2015 at 04:45:13PM +0100, Vlastimil Babka wrote:
> On 11/25/2015 04:36 PM, Vladimir Davydov wrote:
> > Block device drivers often hand off io request processing to kernel
> > threads (example: device mapper). If such a thread calls kmalloc, it can
> > dive into direct reclaim path and end up waiting for too_many_isolated
> > to return false, blocking writeback. This can lead to a dead lock if the
>
> Shouldn't such allocation lack __GFP_IO to prevent this and other kinds of
> deadlocks? And/or have mempools?

Not necessarily. loopback is an example: it can call
grab_cache_page_write_begin -> add_to_page_cache_lru with GFP_KERNEL.

> PF_KTHREAD looks like a big hammer to me that will solve only one
> potential problem...

This problem can result in processes hanging forever. Any ideas how this
could be fixed in a better way?

Thanks,
Vladimir

2015-11-26 08:16:44

by Vladimir Davydov

Subject: Re: [PATCH] vmscan: do not throttle kthreads due to too_many_isolated

On Wed, Nov 25, 2015 at 07:27:57PM +0300, Vladimir Davydov wrote:
> On Wed, Nov 25, 2015 at 04:45:13PM +0100, Vlastimil Babka wrote:
> > On 11/25/2015 04:36 PM, Vladimir Davydov wrote:
> > > Block device drivers often hand off io request processing to kernel
> > > threads (example: device mapper). If such a thread calls kmalloc, it can
> > > dive into direct reclaim path and end up waiting for too_many_isolated
> > > to return false, blocking writeback. This can lead to a dead lock if the
> >
> > Shouldn't such allocation lack __GFP_IO to prevent this and other kinds of
> > deadlocks? And/or have mempools?
>
> Not necessarily. loopback is an example: it can call
> grab_cache_page_write_begin -> add_to_page_cache_lru with GFP_KERNEL.

Anyway, kthreads that use GFP_NOIO and/or mempool aren't safe either,
because it isn't an allocation context problem: the reclaimer locks up
not because it tries to take an fs/io lock the caller holds, but because
it waits for isolated pages to be put back, which will never happen,
since processes that isolated them depend on the kthread making
progress. This is purely a reclaimer heuristic, which kmalloc users are
not aware of.

My point is that, in contrast to userspace processes, it is dangerous to
throttle kthreads in the reclaimer, because they might be responsible
for reclaimer progress (e.g. performing writeback).

Regarding side effects of this patch: there aren't that many kthreads
out there, so I don't believe it can put the system at risk of thrashing
by letting too many reclaimable pages be isolated.

Thanks,
Vladimir

>
> > PF_KTHREAD looks like a big hammer to me that will solve only one
> > potential problem...
>
> This problem can result in processes hanging forever. Any ideas how this
> could be fixed in a better way?

2015-11-27 12:50:11

by Michal Hocko

Subject: Re: [PATCH] vmscan: do not throttle kthreads due to too_many_isolated

On Thu 26-11-15 11:16:24, Vladimir Davydov wrote:
> On Wed, Nov 25, 2015 at 07:27:57PM +0300, Vladimir Davydov wrote:
> > On Wed, Nov 25, 2015 at 04:45:13PM +0100, Vlastimil Babka wrote:
> > > On 11/25/2015 04:36 PM, Vladimir Davydov wrote:
> > > > Block device drivers often hand off io request processing to kernel
> > > > threads (example: device mapper). If such a thread calls kmalloc, it can
> > > > dive into direct reclaim path and end up waiting for too_many_isolated
> > > > to return false, blocking writeback. This can lead to a dead lock if the
> > >
> > > Shouldn't such allocation lack __GFP_IO to prevent this and other kinds of
> > > deadlocks? And/or have mempools?
> >
> > Not necessarily. loopback is an example: it can call
> > grab_cache_page_write_begin -> add_to_page_cache_lru with GFP_KERNEL.

AFAIR the loop driver reduces the gfp_mask via the inode mapping.
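
Specifically, IIRC loop_set_fd() does something like this
(drivers/block/loop.c, quoted from memory):

	lo->old_gfp_mask = mapping_gfp_mask(mapping);
	mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));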

> Anyway, kthreads that use GFP_NOIO and/or mempool aren't safe either,
> because it isn't an allocation context problem: the reclaimer locks up
> not because it tries to take an fs/io lock the caller holds, but because
> it waits for isolated pages to be put back, which will never happen,
> since processes that isolated them depend on the kthread making
> progress. This is purely a reclaimer heuristic, which kmalloc users are
> not aware of.
>
> My point is that, in contrast to userspace processes, it is dangerous to
> throttle kthreads in the reclaimer, because they might be responsible
> for reclaimer progress (e.g. performing writeback).

Wouldn't it be better if your writeback kthread used PF_MEMALLOC/__GFP_MEMALLOC
instead? It is in fact a reclaimer, so it wouldn't even get into the reclaim
path.
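
I mean something along these lines - a minimal, hypothetical sketch only
(my_writeback_kthread() and process_pending_bios() are made-up names, not
your driver's actual code):

static int my_writeback_kthread(void *data)
{
	/*
	 * Mark the thread as a memory reclaimer: allocations it makes may
	 * dip into reserves and will not enter direct reclaim, so they
	 * cannot be throttled by too_many_isolated().
	 */
	current->flags |= PF_MEMALLOC;

	while (!kthread_should_stop())
		process_pending_bios(data);	/* hypothetical work loop */

	current->flags &= ~PF_MEMALLOC;
	return 0;
}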

There are way too many allocations done from kernel thread context to
leave all of them unthrottled (just look at worker threads).

--
Michal Hocko
SUSE Labs

2015-11-27 13:40:22

by Vladimir Davydov

Subject: Re: [PATCH] vmscan: do not throttle kthreads due to too_many_isolated

On Fri, Nov 27, 2015 at 01:50:05PM +0100, Michal Hocko wrote:
> On Thu 26-11-15 11:16:24, Vladimir Davydov wrote:
> > On Wed, Nov 25, 2015 at 07:27:57PM +0300, Vladimir Davydov wrote:
> > > On Wed, Nov 25, 2015 at 04:45:13PM +0100, Vlastimil Babka wrote:
> > > > On 11/25/2015 04:36 PM, Vladimir Davydov wrote:
> > > > > Block device drivers often hand off io request processing to kernel
> > > > > threads (example: device mapper). If such a thread calls kmalloc, it can
> > > > > dive into direct reclaim path and end up waiting for too_many_isolated
> > > > > to return false, blocking writeback. This can lead to a dead lock if the
> > > >
> > > > Shouldn't such allocation lack __GFP_IO to prevent this and other kinds of
> > > > deadlocks? And/or have mempools?
> > >
> > > Not necessarily. loopback is an example: it can call
> > > grab_cache_page_write_begin -> add_to_page_cache_lru with GFP_KERNEL.
>
> AFAIR the loop driver reduces the gfp_mask via the inode mapping.

Yeah, it does - I missed that, thanks for pointing it out. But it doesn't
really make much difference, because the kthread can still get stuck in
too_many_isolated, although it does reduce the chance of that happening.
When I hit it, the DMA zone had only 3 inactive file pages and 68 isolated
file pages, as I mentioned in the patch description, so even the >> 3
relaxation wouldn't save us.
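
The >> 3 I'm referring to is the relaxation that too_many_isolated()
already applies for __GFP_IO/__GFP_FS callers, roughly (quoting
mm/vmscan.c from memory, so details may differ):

	/*
	 * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages,
	 * so that they won't get blocked by normal direct-reclaimers,
	 * forming a circular deadlock.
	 */
	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
		inactive >>= 3;

	return isolated > inactive;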

>
> > Anyway, kthreads that use GFP_NOIO and/or mempool aren't safe either,
> > because it isn't an allocation context problem: the reclaimer locks up
> > not because it tries to take an fs/io lock the caller holds, but because
> > it waits for isolated pages to be put back, which will never happen,
> > since processes that isolated them depend on the kthread making
> > progress. This is purely a reclaimer heuristic, which kmalloc users are
> > not aware of.
> >
> > My point is that, in contrast to userspace processes, it is dangerous to
> > throttle kthreads in the reclaimer, because they might be responsible
> > for reclaimer progress (e.g. performing writeback).
>
> Wouldn't it be better if your writeback kthread used PF_MEMALLOC/__GFP_MEMALLOC
> instead? It is in fact a reclaimer, so it wouldn't even get into the reclaim
> path.

The driver we use is similar to loop. It works as a proxy to fs it works
on top of. Allowing it to access emergency reserves would deplete them
quickly, just like in case of plain loop.

The problem is not about our driver, in fact. I'm pretty sure one can
hit it when using memcg along with loop or dm-crypt for instance.

>
> There are way too many allocations done from kernel thread context to
> leave all of them unthrottled (just look at worker threads).

What about throttling them only once then?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 97ba9e1cde09..9253f4531b9c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1578,6 +1578,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
 			return SWAP_CLUSTER_MAX;
+
+		if (current->flags & PF_KTHREAD)
+			break;
 	}
 
 	lru_add_drain();

2015-11-27 15:02:20

by Michal Hocko

Subject: Re: [PATCH] vmscan: do not throttle kthreads due to too_many_isolated

On Fri 27-11-15 16:40:03, Vladimir Davydov wrote:
> On Fri, Nov 27, 2015 at 01:50:05PM +0100, Michal Hocko wrote:
> > On Thu 26-11-15 11:16:24, Vladimir Davydov wrote:
[...]
> > > Anyway, kthreads that use GFP_NOIO and/or mempool aren't safe either,
> > > because it isn't an allocation context problem: the reclaimer locks up
> > > not because it tries to take an fs/io lock the caller holds, but because
> > > it waits for isolated pages to be put back, which will never happen,
> > > since processes that isolated them depend on the kthread making
> > > progress. This is purely a reclaimer heuristic, which kmalloc users are
> > > not aware of.
> > >
> > > My point is that, in contrast to userspace processes, it is dangerous to
> > > throttle kthreads in the reclaimer, because they might be responsible
> > > for reclaimer progress (e.g. performing writeback).
> >
> > Wouldn't it be better if your writeback kthread used PF_MEMALLOC/__GFP_MEMALLOC
> > instead? It is in fact a reclaimer, so it wouldn't even get into the reclaim
> > path.
>
> The driver we use is similar to loop. It works as a proxy to fs it works
> on top of. Allowing it to access emergency reserves would deplete them
> quickly, just like in case of plain loop.

OK, I see. I thought it would be using only a very limited amount of
memory for the writeback.

> The problem is not about our driver, in fact. I'm pretty sure one can
> hit it when using memcg along with loop or dm-crypt for instance.

I am not very familiar with either, but from a quick look the loop
driver doesn't use mempools at all: it simply relays the data to the
underlying file and relies on the underlying fs to write all the pages,
preventing recursion only by clearing GFP_FS and GFP_IO. So I am not
really sure how we can guarantee forward progress. A GFP_NOFS
allocation might loop inside the allocator endlessly, and then the
writeback wouldn't make any progress. This doesn't seem to be a
memcg-only problem - the global case would just replace the deadlock
with a livelock. I certainly must be missing something here.

> > There are way too many allocations done from kernel thread context to
> > leave all of them unthrottled (just look at worker threads).
>
> What about throttling them only once then?

This still sounds way too broad to me and I am even not sure it solves
the problem. If anything I think we really should make it specific
only to those callers who are really required to make a forward
progress. What about PF_LESS_THROTTLE? NFS is already using this flag for a
similar purpose and we indeed do not throttle at few places during the
reclaim. So I would expect current_may_throttle(current) check there
although I must confess I have no idea about the whole condition right
now.
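
For reference, current_may_throttle() is currently something like this
(mm/vmscan.c, quoted from memory, so details may differ):

static bool current_may_throttle(void)
{
	return !(current->flags & PF_LESS_THROTTLE) ||
		current->backing_dev_info == NULL ||
		bdi_write_congested(current->backing_dev_info);
}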

>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 97ba9e1cde09..9253f4531b9c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1578,6 +1578,9 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> /* We are about to die and free our memory. Return now. */
> if (fatal_signal_pending(current))
> return SWAP_CLUSTER_MAX;
> +
> + if (current->flags & PF_KTHREAD)
> + break;
> }
>
> lru_add_drain();
>

--
Michal Hocko
SUSE Labs

2015-12-01 11:26:05

by Vladimir Davydov

Subject: Re: [PATCH] vmscan: do not throttle kthreads due to too_many_isolated

On Fri, Nov 27, 2015 at 04:01:33PM +0100, Michal Hocko wrote:
> On Fri 27-11-15 16:40:03, Vladimir Davydov wrote:
> > On Fri, Nov 27, 2015 at 01:50:05PM +0100, Michal Hocko wrote:
> > > On Thu 26-11-15 11:16:24, Vladimir Davydov wrote:
> [...]
> > > > Anyway, kthreads that use GFP_NOIO and/or mempool aren't safe either,
> > > > because it isn't an allocation context problem: the reclaimer locks up
> > > > not because it tries to take an fs/io lock the caller holds, but because
> > > > it waits for isolated pages to be put back, which will never happen,
> > > > since processes that isolated them depend on the kthread making
> > > > progress. This is purely a reclaimer heuristic, which kmalloc users are
> > > > not aware of.
> > > >
> > > > My point is that, in contrast to userspace processes, it is dangerous to
> > > > throttle kthreads in the reclaimer, because they might be responsible
> > > > for reclaimer progress (e.g. performing writeback).
> > >
> > > Wouldn't it be better if your writeback kthread used PF_MEMALLOC/__GFP_MEMALLOC
> > > instead? It is in fact a reclaimer, so it wouldn't even get into the reclaim
> > > path.
> >
> > The driver we use is similar to loop. It works as a proxy to fs it works
> > on top of. Allowing it to access emergency reserves would deplete them
> > quickly, just like in case of plain loop.
>
> OK, I see. I thought it would be using only a very limited amount of
> memory for the writeback.

Even if it does, there's still a chance of a deadlock: even if the kthread
uses a mempool + NOIO, it can go into direct reclaim and start looping in
too_many_isolated AFAIU. The probability of this happening is much lower,
though.

>
> > The problem is not about our driver, in fact. I'm pretty sure one can
> > hit it when using memcg along with loop or dm-crypt for instance.
>
> I am not very familiar with either, but from a quick look the loop
> driver doesn't use mempools at all: it simply relays the data to the
> underlying file and relies on the underlying fs to write all the pages,
> preventing recursion only by clearing GFP_FS and GFP_IO. So I am not
> really sure how we can guarantee forward progress. A GFP_NOFS
> allocation might loop inside the allocator endlessly, and then the
> writeback wouldn't make any progress. This doesn't seem to be a
> memcg-only problem - the global case would just replace the deadlock
> with a livelock. I certainly must be missing something here.

Yeah, I think you're right. If the loop kthread gets stuck in the
reclaimer, it might occur that other processes will isolate, reclaim and
then dirty reclaimed pages, preventing the kthread from running and
cleaning memory, so that we might end up with all memory being under
writeback and no reclaimable memory left for kthread to run and clean
it. Due to dirty limit, this is unlikely to happen though, but I'm not
sure.

OTOH, with legacy memcg, there is no dirty limit and we can isolate a
lot of pages (SWAP_CLUSTER_MAX = 512 now) per process and wait on page
writeback to complete before releasing them, which sounds bad. And we
can't just remove this wait_on_page_writeback from shrink_page_list,
otherwise OOM might be triggered prematurely. Maybe we should put back
rotated pages and release all reclaimed pages before initiating the wait?
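
The wait I mean is the last branch of the PageWriteback() handling in
shrink_page_list(), which is roughly the following (simplified from
mm/vmscan.c, not exact):

	if (PageWriteback(page)) {
		/* Case 1: kswapd already saw this page under writeback */
		if (current_is_kswapd() && PageReclaim(page) &&
		    test_bit(ZONE_WRITEBACK, &zone->flags)) {
			nr_immediate++;
			goto keep_locked;
		/* Case 2: global or "sane" reclaim - just skip the page */
		} else if (sane_reclaim(sc) || !PageReclaim(page) ||
			   !may_enter_fs) {
			SetPageReclaim(page);
			nr_writeback++;
			goto keep_locked;
		/* Case 3: legacy memcg reclaim - stall on writeback */
		} else {
			unlock_page(page);
			wait_on_page_writeback(page);
			/* then go back and try the same page again */
			list_add_tail(&page->lru, page_list);
			continue;
		}
	}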

>
> > > There are way too many allocations done from kernel thread context to
> > > leave all of them unthrottled (just look at worker threads).
> >
> > What about throttling them only once then?
>
> This still sounds way too broad to me and I am even not sure it solves
> the problem. If anything I think we really should make it specific
> only to those callers who are really required to make a forward
> progress. What about PF_LESS_THROTTLE? NFS is already using this flag for a
> similar purpose and we indeed do not throttle at few places during the
> reclaim. So I would expect current_may_throttle(current) check there
> although I must confess I have no idea about the whole condition right
> now.

Yeah, thanks for the tip. I'll take a look.

Thanks,
Vladimir

2015-12-02 13:52:29

by Michal Hocko

Subject: Re: [PATCH] vmscan: do not throttle kthreads due to too_many_isolated

On Tue 01-12-15 14:25:45, Vladimir Davydov wrote:
> On Fri, Nov 27, 2015 at 04:01:33PM +0100, Michal Hocko wrote:
> > On Fri 27-11-15 16:40:03, Vladimir Davydov wrote:
[...]
> > > The problem is not about our driver, in fact. I'm pretty sure one can
> > > hit it when using memcg along with loop or dm-crypt for instance.
> >
> > I am not very familiar with either, but from a quick look the loop
> > driver doesn't use mempools at all: it simply relays the data to the
> > underlying file and relies on the underlying fs to write all the pages,
> > preventing recursion only by clearing GFP_FS and GFP_IO. So I am not
> > really sure how we can guarantee forward progress. A GFP_NOFS
> > allocation might loop inside the allocator endlessly, and then the
> > writeback wouldn't make any progress. This doesn't seem to be a
> > memcg-only problem - the global case would just replace the deadlock
> > with a livelock. I certainly must be missing something here.
>
> Yeah, I think you're right. If the loop kthread gets stuck in the
> reclaimer, it might occur that other processes will isolate, reclaim and
> then dirty reclaimed pages, preventing the kthread from running and
> cleaning memory, so that we might end up with all memory being under
> writeback and no reclaimable memory left for kthread to run and clean
> it. Due to dirty limit, this is unlikely to happen though, but I'm not
> sure.
>
> OTOH, with legacy memcg, there is no dirty limit and we can isolate a
> lot of pages (SWAP_CLUSTER_MAX = 512 now) per process and wait on page
> writeback to complete before releasing them, which sounds bad. And we
> can't just remove this wait_on_page_writeback from shrink_page_list,
> otherwise OOM might be triggered prematurely. Maybe we should put back
> rotated pages and release all reclaimed pages before initiating the wait?

So you want to wait for the page while it is on the LRU? I think it
would be better to simply not throttle the loopback kthread endlessly in
the first place.

I was tempted to add an exception and not wait for writeback on
loopback (and similar) devices, but this is a dirty hack, and I think it
is reasonable to rely on the global case to work properly and guarantee
forward progress. And this seems broken currently:

FS1 (marks page Writeback) -> request -> BLK -> LOOP -> vfs_iter_write -> FS2

So the writeback bit will be set until vfs_iter_write finishes and that
doesn't require data to be written down to the FS2 backing store AFAICS.
But what happens if vfs_iter_write gets throttled on the dirty limit
or FS2 performs a GFP_NOFS allocation? We will get stuck there for ever
without any progress. If loopback FS load is dominant then we are
screwed. So I think the global case needs a fix. If it works properly
our waiting in the memcg reclaim will be finite as well.

On the other hand, if all the reclaimable memory is marked as writeback
or dirty, then we should trigger the OOM killer sooner or later, so I am
not sure this is such a big deal. I have to play with this though.

Anyway, I guess we at least want to wait for the writeback bit killably. I
will cook up a patch.
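
Roughly something like the following for the legacy-memcg branch above,
although this is only a sketch of the idea using the existing
wait_on_page_bit_killable() helper, not the actual patch:

			} else {
				unlock_page(page);
				/*
				 * Wait killably so that a fatal signal can
				 * break the stall instead of the task
				 * hanging on the writeback bit forever.
				 */
				if (wait_on_page_bit_killable(page, PG_writeback))
					goto keep;
				list_add_tail(&page->lru, page_list);
				continue;
			}
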
--
Michal Hocko
SUSE Labs