2016-04-26 11:56:23

by Michal Hocko

Subject: [PATCH 0/2] scope GFP_NOFS api

Hi,
we have discussed this topic at LSF/MM this year. There was a general
interest in the scope GFP_NOFS allocation context among some FS
developers. For those who are not aware of the discussion or the issue
I am trying to sort out (or at least start in that direction) please
have a look at patch 1 which adds memalloc_nofs_{save,restore} api
which basically copies what we have for the scope GFP_NOIO allocation
context. I haven't converted any of the FS myself because that is way
beyond my area of expertise but I would be happy to help with further
changes on the MM front as well as in some more generic code paths.

Dave had an idea on how to further improve the reclaim context to be
less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
and FS specific cookie set in the FS allocation context and consumed
by the FS reclaim context to allow doing some provably safe actions
that would normally be skipped due to GFP_NOFS. I like this idea and
I believe we can go that direction regardless of the approach taken here.
Many filesystems simply need to clean up their NOFS usage first before
diving into more complex changes.

Patch 2 is a debugging aid which warns about explicit allocation
requests from within the scope context. This should help to reduce the
direct usage of the NOFS flags to the bare minimum in favor of the scope
API. It is not aimed to be merged upstream. I would hope Andrew could
take it into the mmotm tree to give it linux-next exposure and allow
developers to do further cleanups. There is a new kernel command line
parameter which has to be used for the debugging to be enabled.

I think GFP_NOIO should see the same cleanup.

Any feedback is highly appreciated.



2016-04-26 11:56:25

by Michal Hocko

Subject: [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context

From: Michal Hocko <[email protected]>

THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE

It is desirable to reduce direct GFP_NO{FS,IO} usage to a minimum and
prefer the scope usage defined by the memalloc_no{fs,io}_{save,restore} API.

Let's help this process and add a debugging tool to catch when an
explicit allocation request for GFP_NO{FS,IO} is done from the scope
context. The printed stacktrace should help to identify the caller
and evaluate whether it can be changed to use a wider context or whether
it is called from another potentially dangerous context which needs
a scope protection as well.

The checks have to be enabled explicitly by the debug_scope_gfp kernel
command line parameter.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 56 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 86bb5d6ddd7d..085d00280496 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3750,6 +3750,61 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
return page;
}

+static bool debug_scope_gfp;
+
+static int __init enable_debug_scope_gfp(char *unused)
+{
+ debug_scope_gfp = true;
+ return 0;
+}
+
+/*
+ * Spit out the stack trace if the given gfp_mask clears flags which are
+ * already cleared context-wide. Such a caller can remove the special flags
+ * clearing and rely on the context-wide mask.
+ */
+static inline void debug_scope_gfp_context(gfp_t gfp_mask)
+{
+ gfp_t restrict_mask;
+
+ if (likely(!debug_scope_gfp))
+ return;
+
+ /* both NOFS, NOIO are irrelevant when direct reclaim is disabled */
+ if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
+ return;
+
+ if (current->flags & PF_MEMALLOC_NOIO)
+ restrict_mask = __GFP_IO;
+ else if ((current->flags & PF_MEMALLOC_NOFS) && (gfp_mask & __GFP_IO))
+ restrict_mask = __GFP_FS;
+ else
+ return;
+
+ if ((gfp_mask & restrict_mask) != restrict_mask) {
+ /*
+ * If you see this warning then the code does:
+ * memalloc_no{fs,io}_save()
+ * ...
+ * foo()
+ * alloc_page(GFP_NO{FS,IO})
+ * ...
+ * memalloc_no{fs,io}_restore()
+ *
+ * allocation which is unnecessary because the scope gfp
+ * context will do that for all allocation requests already.
+ * If foo() is called from multiple contexts then make sure other
+ * contexts are safe wrt. GFP_NO{FS,IO} semantic and either add
+ * scope protection into particular paths or change the gfp mask
+ * to GFP_KERNEL.
+ */
+ pr_info("Unnecessarily specific gfp mask:%#x(%pGg) for the %s task wide context\n", gfp_mask, &gfp_mask,
+ (current->flags & PF_MEMALLOC_NOIO)?"NOIO":"NOFS");
+ dump_stack();
+ }
+}
+early_param("debug_scope_gfp", enable_debug_scope_gfp);
+
/*
* This is the 'heart' of the zoned buddy allocator.
*/
@@ -3796,6 +3851,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
ac.nodemask);

/* First allocation attempt */
+ debug_scope_gfp_context(gfp_mask);
page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
if (likely(page))
goto out;
--
2.8.0.rc3


2016-04-26 11:56:24

by Michal Hocko

Subject: [PATCH 1/2] mm: add PF_MEMALLOC_NOFS

From: Michal Hocko <[email protected]>

GFP_NOFS context is currently used for the following 4 reasons
- to prevent deadlocks when the lock held by the allocation
context would be needed during the memory reclaim
- to prevent stack overflows during the reclaim because
the allocation is performed from a deep context already
- to prevent lockups when the allocation context depends on
other reclaimers to make forward progress indirectly
- just in case because this would be safe from the fs POV

Unfortunately, overuse of this allocation context brings some problems
to the MM: memory reclaim is much weaker (especially during heavy FS
metadata workloads), and the OOM killer cannot be invoked because the
MM layer doesn't have enough information about how much memory is
freeable by the FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those cases as much as possible. One way to do that is to use the flag
in scopes rather than in isolated cases. Such a scope is declared when
really necessary, tracked per task, and all the allocation requests from
within the context will simply inherit the GFP_NOFS semantic.

Not only is this easier to understand and maintain because there are
far fewer problematic contexts than specific allocation requests, it
also helps code paths where the FS layer interacts with other layers
(e.g. crypto, security modules, MM etc...) and there is no easy way to
convey the allocation context between the layers.

Introduce a PF_MEMALLOC_NOFS task-specific flag and a
memalloc_nofs_{save,restore} API to control the scope. This basically
copies the memalloc_noio_{save,restore} API we have for the other
restricted allocation context, GFP_NOIO.

XFS has already had similar functionality in PF_FSTRANS, so let's just
give it a more generic name, make it usable for others as well, and
move the GFP_NOFS context tracking to the page allocator. XFS has its
own accessor functions but let's keep them for now to keep this patch
minimal.

This patch shouldn't introduce any functional changes. XFS code paths
preserve their semantic. kmem_flags_convert() doesn't need to evaluate
the flag anymore because it is now the page allocator that cares about
it. memalloc_noio_flags() is renamed to current_gfp_context() because
it now cares about both the PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO
contexts.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/xfs/kmem.c | 4 ++--
fs/xfs/kmem.h | 2 +-
fs/xfs/libxfs/xfs_btree.c | 2 +-
fs/xfs/xfs_aops.c | 6 +++---
fs/xfs/xfs_trans.c | 12 ++++++------
include/linux/gfp.h | 8 ++++++++
include/linux/sched.h | 32 ++++++++++++++++++++++++++------
mm/page_alloc.c | 8 +++++---
mm/vmscan.c | 4 ++--
9 files changed, 54 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 686ba6fb20dd..73f6ab59c664 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
* context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
* the filesystem here and potentially deadlocking.
*/
- if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+ if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
noio_flag = memalloc_noio_save();

lflags = kmem_flags_convert(flags);
ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);

- if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+ if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
memalloc_noio_restore(noio_flag);

return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d1c66e465ca5..b35688a54c9a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
lflags = GFP_ATOMIC | __GFP_NOWARN;
} else {
lflags = GFP_KERNEL | __GFP_NOWARN;
- if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+ if (flags & KM_NOFS)
lflags &= ~__GFP_FS;
}

diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index a0eb18ce3ad3..326566f4a131 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2540,7 +2540,7 @@ xfs_btree_split_worker(
struct xfs_btree_split_args *args = container_of(work,
struct xfs_btree_split_args, work);
unsigned long pflags;
- unsigned long new_pflags = PF_FSTRANS;
+ unsigned long new_pflags = PF_MEMALLOC_NOFS;

/*
* we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d12dfcfd0cc8..6d816ff0b763 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -124,7 +124,7 @@ xfs_setfilesize_trans_alloc(
* We hand off the transaction to the completion thread now, so
* clear the flag here.
*/
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
return 0;
}

@@ -169,7 +169,7 @@ xfs_setfilesize_ioend(
* thus we need to mark ourselves as being in a transaction manually.
* Similarly for freeze protection.
*/
- current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);

/* we abort the update if there was an IO error */
@@ -979,7 +979,7 @@ xfs_vm_writepage(
* Given that we do not allow direct reclaim to call us, we should
* never be called while in a filesystem transaction.
*/
- if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+ if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
goto redirty;

/* Is this page beyond the end of the file? */
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 748b16aff45a..1d247366c733 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -176,7 +176,7 @@ xfs_trans_reserve(
bool rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;

/* Mark this thread as being in a transaction */
- current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);

/*
* Attempt to reserve the needed disk blocks by decrementing
@@ -186,7 +186,7 @@ xfs_trans_reserve(
if (blocks > 0) {
error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
if (error != 0) {
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
return -ENOSPC;
}
tp->t_blk_res += blocks;
@@ -263,7 +263,7 @@ xfs_trans_reserve(
tp->t_blk_res = 0;
}

- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);

return error;
}
@@ -921,7 +921,7 @@ __xfs_trans_commit(

xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);

- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
xfs_trans_free(tp);

/*
@@ -951,7 +951,7 @@ __xfs_trans_commit(
if (commit_lsn == -1 && !error)
error = -EIO;
}
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
xfs_trans_free(tp);

@@ -1005,7 +1005,7 @@ xfs_trans_cancel(
xfs_log_done(mp, tp->t_ticket, NULL, false);

/* mark this thread as no longer being in a transaction */
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);

xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
xfs_trans_free(tp);
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..3ebdbdff44b4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
*
* GFP_NOIO will use direct reclaim to discard clean pages or slab pages
* that do not require the starting of any physical IO.
+ * Please try to avoid using this flag directly and instead use
+ * memalloc_noio_{save,restore} to mark the whole scope which cannot
+ * perform any IO with a short explanation why. All allocation requests
+ * will inherit GFP_NOIO implicitly.
*
* GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ * Please try to avoid using this flag directly and instead use
+ * memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ * recurse into the FS layer with a short explanation why. All allocation
+ * requests will inherit GFP_NOFS implicitly.
*
* GFP_USER is for userspace allocations that also need to be directly
* accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index acfc32b30704..e9521dc0475f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2102,9 +2102,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
#define PF_USED_ASYNC 0x00004000 /* used async_schedule*(), used by module init */
#define PF_NOFREEZE 0x00008000 /* this thread should not be frozen */
#define PF_FROZEN 0x00010000 /* frozen for system suspend */
-#define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */
-#define PF_KSWAPD 0x00040000 /* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */
+#define PF_KSWAPD 0x00020000 /* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */
#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */
@@ -2140,13 +2140,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
#define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
#define used_math() tsk_used_math(current)

-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
*/
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
{
+ /*
+ * NOIO implies both NOIO and NOFS and it is the more restrictive
+ * context so always make sure it takes precedence
+ */
if (unlikely(current->flags & PF_MEMALLOC_NOIO))
flags &= ~(__GFP_IO | __GFP_FS);
+ else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+ flags &= ~__GFP_FS;
return flags;
}

@@ -2162,6 +2170,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
}

+static inline unsigned int memalloc_nofs_save(void)
+{
+ unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+ current->flags |= PF_MEMALLOC_NOFS;
+ return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+ current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
/* Per-process atomic flags. */
#define PFA_NO_NEW_PRIVS 0 /* May not gain new privileges. */
#define PFA_SPREAD_PAGE 1 /* Spread page cache over cpuset */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e2bb8f4ec0b..86bb5d6ddd7d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3801,10 +3801,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
goto out;

/*
- * Runtime PM, block IO and its error handling path can deadlock
- * because I/O on the device might not complete.
+ * Apply scoped allocation constraints. This is mainly about
+ * GFP_NOFS resp. GFP_NOIO which have to be inherited by all
+ * allocation requests from a particular context which has
+ * been marked by memalloc_no{fs,io}_{save,restore}.
*/
- alloc_mask = memalloc_noio_flags(gfp_mask);
+ alloc_mask = current_gfp_context(gfp_mask);
ac.spread_dirty_pages = false;

page = __alloc_pages_slowpath(alloc_mask, order, &ac);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4a2f4512fca..cfb74de1efa3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2808,7 +2808,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
unsigned long nr_reclaimed;
struct scan_control sc = {
.nr_to_reclaim = SWAP_CLUSTER_MAX,
- .gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+ .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
.order = order,
.nodemask = nodemask,
.priority = DEF_PRIORITY,
@@ -3656,7 +3656,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
struct reclaim_state reclaim_state;
struct scan_control sc = {
.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
- .gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+ .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
.order = order,
.priority = ZONE_RECLAIM_PRIORITY,
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
--
2.8.0.rc3


2016-04-26 22:59:42

by Dave Chinner

Subject: Re: [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context

On Tue, Apr 26, 2016 at 01:56:12PM +0200, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE
>
> It is desirable to reduce the direct GFP_NO{FS,IO} usage at minimum and
> prefer scope usage defined by memalloc_no{fs,io}_{save,restore} API.
>
> Let's help this process and add a debugging tool to catch when an
> explicit allocation request for GFP_NO{FS,IO} is done from the scope
> context. The printed stacktrace should help to identify the caller
> and evaluate whether it can be changed to use a wider context or whether
> it is called from another potentially dangerous context which needs
> a scope protection as well.

You're going to get a large number of these from XFS. There are call
paths in XFS that get called both inside and outside transaction
context, and many of them are marked with GFP_NOFS to prevent issues
that have cropped up in the past.

Often these are to silence lockdep warnings (e.g. commit b17cb36
("xfs: fix missing KM_NOFS tags to keep lockdep happy")) because
lockdep gets very unhappy about the same functions being called with
different reclaim contexts. e.g. directory block mapping might
occur from readdir (no transaction context) or within transactions
(create/unlink). Hence paths like this are tagged with GFP_NOFS to
stop lockdep emitting false positive warnings....

Removing the GFP_NOFS flags in situations like this is simply going
to restart the flood of false positive lockdep warnings we've
silenced over the years, so perhaps lockdep needs to be made smarter
as well...

Cheers,

Dave.

--
Dave Chinner
[email protected]

2016-04-26 23:07:19

by Dave Chinner

Subject: Re: [PATCH 1/2] mm: add PF_MEMALLOC_NOFS

On Tue, Apr 26, 2016 at 01:56:11PM +0200, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> GFP_NOFS context is used for the following 4 reasons currently
> - to prevent from deadlocks when the lock held by the allocation
> context would be needed during the memory reclaim
> - to prevent from stack overflows during the reclaim because
> the allocation is performed from a deep context already
> - to prevent lockups when the allocation context depends on
> other reclaimers to make a forward progress indirectly
> - just in case because this would be safe from the fs POV

- silencing lockdep false positives

> Introduce PF_MEMALLOC_NOFS task specific flag and memalloc_nofs_{save,restore}
> API to control the scope. This is basically copying
> memalloc_noio_{save,restore} API we have for other restricted allocation
> context GFP_NOIO.
>
> Xfs has already had a similar functionality as PF_FSTRANS so let's just
> give it a more generic name and make it usable for others as well and
> move the GFP_NOFS context tracking to the page allocator. Xfs has its
> own accessor functions but let's keep them for now to reduce this patch
> as minimum.

Can you split this into two patches? The first simply does this:

#define PF_MEMALLOC_NOFS PF_FSTRANS

and changes only the XFS code to use PF_MEMALLOC_NOFS.

The second patch can then do the rest of the mm API changes that we
don't actually care about in XFS at all. That way I can carry all
the XFS changes in the XFS tree and not have to worry about when
this stuff gets merged or conflicts with the rest of the work that
is being done to the mm/ code and whatever tree that eventually
lands in...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-04-27 07:51:07

by Michal Hocko

Subject: Re: [PATCH 1/2] mm: add PF_MEMALLOC_NOFS

On Wed 27-04-16 09:07:02, Dave Chinner wrote:
> On Tue, Apr 26, 2016 at 01:56:11PM +0200, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > GFP_NOFS context is used for the following 4 reasons currently
> > - to prevent from deadlocks when the lock held by the allocation
> > context would be needed during the memory reclaim
> > - to prevent from stack overflows during the reclaim because
> > the allocation is performed from a deep context already
> > - to prevent lockups when the allocation context depends on
> > other reclaimers to make a forward progress indirectly
> > - just in case because this would be safe from the fs POV
>
> - silencing lockdep false positives
>
> > Introduce PF_MEMALLOC_NOFS task specific flag and memalloc_nofs_{save,restore}
> > API to control the scope. This is basically copying
> > memalloc_noio_{save,restore} API we have for other restricted allocation
> > context GFP_NOIO.
> >
> > Xfs has already had a similar functionality as PF_FSTRANS so let's just
> > give it a more generic name and make it usable for others as well and
> > move the GFP_NOFS context tracking to the page allocator. Xfs has its
> > own accessor functions but let's keep them for now to reduce this patch
> > as minimum.
>
> Can you split this into two patches? The first simply does this:
>
> #define PF_MEMALLOC_NOFS PF_FSTRANS
>
> and changes only the XFS code to use PF_MEMALLOC_NOFS.
>
> The second patch can then do the rest of the mm API changes that we
> don't actually care about in XFS at all. That way I can carry all
> the XFS changes in the XFS tree and not have to worry about when
> this stuff gets merged or conflicts with the rest of the work that
> is being done to the mm/ code and whatever tree that eventually
> lands in...

Sure, I will do that.

--
Michal Hocko
SUSE Labs

2016-04-27 08:03:15

by Michal Hocko

Subject: Re: [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context

On Wed 27-04-16 08:58:45, Dave Chinner wrote:
> On Tue, Apr 26, 2016 at 01:56:12PM +0200, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE
> >
> > It is desirable to reduce the direct GFP_NO{FS,IO} usage at minimum and
> > prefer scope usage defined by memalloc_no{fs,io}_{save,restore} API.
> >
> > Let's help this process and add a debugging tool to catch when an
> > explicit allocation request for GFP_NO{FS,IO} is done from the scope
> > context. The printed stacktrace should help to identify the caller
> > and evaluate whether it can be changed to use a wider context or whether
> > it is called from another potentially dangerous context which needs
> > a scope protection as well.
>
> You're going to get a large number of these from XFS. There are call
> paths in XFs that get called both inside and outside transaction
> context, and many of them are marked with GFP_NOFS to prevent issues
> that have cropped up in the past.
>
> Often these are to silence lockdep warnings (e.g. commit b17cb36
> ("xfs: fix missing KM_NOFS tags to keep lockdep happy")) because
> lockdep gets very unhappy about the same functions being called with
> different reclaim contexts. e.g. directory block mapping might
> occur from readdir (no transaction context) or within transactions
> (create/unlink). hence paths like this are tagged with GFP_NOFS to
> stop lockdep emitting false positive warnings....

I would much rather see lockdep being fixed than abusing GFP_NOFS to
work around its limitations. GFP_NOFS has real consequences for the
memory reclaim. I will go and check the commit you mentioned and try
to understand why that is a problem. From what you described above
I would like to get rid of exactly this kind of usage which is not
really needed for the recursion protection.

Thanks!
--
Michal Hocko
SUSE Labs

2016-04-27 17:42:04

by Andreas Dilger

Subject: Re: [PATCH 1.1/2] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

On Apr 27, 2016, at 5:54 AM, Michal Hocko <[email protected]> wrote:
>
> From: Michal Hocko <[email protected]>
>
> xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
> some time ago. We would like to make this concept more generic and use
> it for other filesystems as well. Let's start by giving the flag a
> more generic name, PF_MEMALLOC_NOFS, which is in line with the existing
> PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
> contexts. Replace all PF_FSTRANS usage from the xfs code in the first
> step before we introduce a full API for it as xfs uses the flag directly
> anyway.
>
> This patch doesn't introduce any functional change.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> Hi,
> as suggested by Dave, I have split up [1] into two parts. The first one
> adds a new PF flag which is just an alias to the existing PF_FSTRANS
> and does all the renaming, and the second one introduces the generic
> API which only changes the bare minimum in the xfs proper.
>
> fs/xfs/kmem.c | 4 ++--
> fs/xfs/kmem.h | 2 +-
> fs/xfs/libxfs/xfs_btree.c | 2 +-
> fs/xfs/xfs_aops.c | 6 +++---
> fs/xfs/xfs_trans.c | 12 ++++++------
> include/linux/sched.h | 2 ++
> 6 files changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 686ba6fb20dd..73f6ab59c664 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
> * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
> * the filesystem here and potentially deadlocking.
> */
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> noio_flag = memalloc_noio_save();
>
> lflags = kmem_flags_convert(flags);
> ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
>
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> memalloc_noio_restore(noio_flag);

Not really the fault of this patch, but it brings this nasty bit of code into
the light. Is all of this machinery still needed given that __vmalloc() can
accept GFP flags? If yes, wouldn't it be better to fix __vmalloc() to honor
the GFP flags instead of working around it in the filesystem code?

Cheers, Andreas

> return ptr;
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index d1c66e465ca5..0d83f332e5c2 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
> lflags = GFP_ATOMIC | __GFP_NOWARN;
> } else {
> lflags = GFP_KERNEL | __GFP_NOWARN;
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> lflags &= ~__GFP_FS;
> }
>
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index a0eb18ce3ad3..326566f4a131 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -2540,7 +2540,7 @@ xfs_btree_split_worker(
> struct xfs_btree_split_args *args = container_of(work,
> struct xfs_btree_split_args, work);
> unsigned long pflags;
> - unsigned long new_pflags = PF_FSTRANS;
> + unsigned long new_pflags = PF_MEMALLOC_NOFS;
>
> /*
> * we are in a transaction context here, but may also be doing work
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index d12dfcfd0cc8..6d816ff0b763 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -124,7 +124,7 @@ xfs_setfilesize_trans_alloc(
> * We hand off the transaction to the completion thread now, so
> * clear the flag here.
> */
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> return 0;
> }
>
> @@ -169,7 +169,7 @@ xfs_setfilesize_ioend(
> * thus we need to mark ourselves as being in a transaction manually.
> * Similarly for freeze protection.
> */
> - current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> __sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
>
> /* we abort the update if there was an IO error */
> @@ -979,7 +979,7 @@ xfs_vm_writepage(
> * Given that we do not allow direct reclaim to call us, we should
> * never be called while in a filesystem transaction.
> */
> - if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
> + if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> goto redirty;
>
> /* Is this page beyond the end of the file? */
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 748b16aff45a..1d247366c733 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -176,7 +176,7 @@ xfs_trans_reserve(
> bool rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
>
> /* Mark this thread as being in a transaction */
> - current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>
> /*
> * Attempt to reserve the needed disk blocks by decrementing
> @@ -186,7 +186,7 @@ xfs_trans_reserve(
> if (blocks > 0) {
> error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
> if (error != 0) {
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> return -ENOSPC;
> }
> tp->t_blk_res += blocks;
> @@ -263,7 +263,7 @@ xfs_trans_reserve(
> tp->t_blk_res = 0;
> }
>
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>
> return error;
> }
> @@ -921,7 +921,7 @@ __xfs_trans_commit(
>
> xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
>
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> xfs_trans_free(tp);
>
> /*
> @@ -951,7 +951,7 @@ __xfs_trans_commit(
> if (commit_lsn == -1 && !error)
> error = -EIO;
> }
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
> xfs_trans_free(tp);
>
> @@ -1005,7 +1005,7 @@ xfs_trans_cancel(
> xfs_log_done(mp, tp->t_ticket, NULL, false);
>
> /* mark this thread as no longer being in a transaction */
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>
> xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
> xfs_trans_free(tp);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index acfc32b30704..820db8f98bfc 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2115,6 +2115,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
> #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
> #define PF_SUSPEND_TASK 0x80000000 /* this thread called freeze_processes and should not be frozen */
>
> +#define PF_MEMALLOC_NOFS PF_FSTRANS /* Transition to a more generic GFP_NOFS scope semantic */
> +
> /*
> * Only the _current_ task can read/write to tsk->flags, but other
> * tasks can access tsk->flags in readonly mode for example
> --
> 2.8.0.rc3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


Cheers, Andreas






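The PF_FSTRANS to PF_MEMALLOC_NOFS rename quoted above prepares for the scope API from patch 1. Its save/restore pair has simple nesting semantics, which the following userspace model illustrates (a sketch only: the real API manipulates current->flags inside the kernel, and the flag value here is made up):

```c
#include <assert.h>

#define PF_MEMALLOC_NOFS 0x00000040u	/* illustrative value, not the kernel's */

static unsigned int task_flags;		/* stands in for current->flags */

/* Enter a NOFS scope; the returned value lets scopes nest safely. */
static unsigned int memalloc_nofs_save(void)
{
	unsigned int old = task_flags & PF_MEMALLOC_NOFS;

	task_flags |= PF_MEMALLOC_NOFS;
	return old;
}

/* Leave the scope, restoring whatever the outer context had set. */
static void memalloc_nofs_restore(unsigned int old)
{
	task_flags = (task_flags & ~PF_MEMALLOC_NOFS) | old;
}
```

Because restore puts back the previous state rather than unconditionally clearing the flag, an inner scope cannot accidentally end an enclosing transaction scope, which is why the nested current_set_flags_nested()/current_restore_flags_nested() calls in the XFS hunks work.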

2016-04-27 19:43:51

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1.1/2] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

On Wed 27-04-16 11:41:51, Andreas Dilger wrote:
> On Apr 27, 2016, at 5:54 AM, Michal Hocko <[email protected]> wrote:
[...]
> > --- a/fs/xfs/kmem.c
> > +++ b/fs/xfs/kmem.c
> > @@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
> > * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
> > * the filesystem here and potentially deadlocking.
> > */
> > - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> > + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> > noio_flag = memalloc_noio_save();
> >
> > lflags = kmem_flags_convert(flags);
> > ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
> >
> > - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> > + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> > memalloc_noio_restore(noio_flag);
>
> Not really the fault of this patch, but it brings this nasty bit of code into
> the light. Is all of this machinery still needed given that __vmalloc() can
> accept GFP flags? If yes, wouldn't it be better to fix __vmalloc() to honor
> the GFP flags instead of working around it in the filesystem code?

This is not that easy. __vmalloc can accept gfp flags but it doesn't
honor __GFP_IO 100%. IIRC some paths like page table allocations are
hardcoded to GFP_KERNEL. Besides that I would like to have GFP_NOIO used
via memalloc_noio_{save,restore} API as well for the similar reasons as
GFP_NOFS - it is just easier to explain scope than particular code paths
which might be shared.
--
Michal Hocko
SUSE Labs

2016-04-27 20:09:32

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1.2/2] mm: introduce memalloc_nofs_{save,restore} API

Hi Dave,

On Wed 27-04-16 13:54:35, Michal Hocko wrote:
[...]
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index 0d83f332e5c2..b35688a54c9a 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
> lflags = GFP_ATOMIC | __GFP_NOWARN;
> } else {
> lflags = GFP_KERNEL | __GFP_NOWARN;
> - if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> + if (flags & KM_NOFS)
> lflags &= ~__GFP_FS;
> }
>

I was trying to reproduce the false positives you mentioned in another
email by reverting b17cb364dbbb ("xfs: fix missing KM_NOFS tags to keep
lockdep happy"). I could really hit the one below but then it turned out
that it is this hunk above which causes the lockdep warning. Now I am
trying to understand how is this possible as ~__GFP_FS happens at the
page allocation level so we should never leak the __GFP_FS context when
the PF flag is set.

One possible explanation would be that some code path hands over to
a kworker which then loses the PF flag but it would get a gfp_mask which
would have the FS flag cleared with the original code.
xfs_btree_split_worker is using PF_MEMALLOC_NOFS directly so it should
be OK and xfs_reclaim_worker resp. xfs_eofblocks_worker use struct
xfs_mount as a context which doesn't contain gfp_mask. The stack trace
also doesn't indicate any kworker involvement. Do you have an idea what
else might get wrong or what am I missing?
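The masking referred to above ("~__GFP_FS happens at the page allocation level") can be modeled as follows; this is a userspace sketch of what the task-flag check is meant to do at the allocator entry, with made-up flag values rather than the kernel's:

```c
#include <assert.h>

#define __GFP_IO	0x40u	/* illustrative values, not the kernel's */
#define __GFP_FS	0x80u
#define GFP_KERNEL	(__GFP_IO | __GFP_FS)

#define PF_MEMALLOC_NOIO	0x1u
#define PF_MEMALLOC_NOFS	0x2u

static unsigned int task_flags;	/* stands in for current->flags */

/*
 * Called on entry to the allocator: the effective gfp mask inherits the
 * scope restrictions no matter what the caller passed, so __GFP_FS can
 * never leak through while PF_MEMALLOC_NOFS is set.
 */
static unsigned int current_gfp_context(unsigned int gfp)
{
	if (task_flags & PF_MEMALLOC_NOIO)
		gfp &= ~(__GFP_IO | __GFP_FS);
	else if (task_flags & PF_MEMALLOC_NOFS)
		gfp &= ~__GFP_FS;
	return gfp;
}
```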

---
[ 53.990821] =================================
[ 53.991672] [ INFO: inconsistent lock state ]
[ 53.992419] 4.5.0-nofs5-00004-g35ad69a8eb83-dirty #902 Not tainted
[ 53.993458] ---------------------------------
[ 53.993480] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
[ 53.993480] kswapd0/467 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 53.993480] (&xfs_nondir_ilock_class){+++++-}, at: [<ffffffffa0066897>] xfs_ilock+0x18a/0x205 [xfs]
[ 53.993480] {RECLAIM_FS-ON-W} state was registered at:
[ 53.993480] [<ffffffff810945e3>] mark_held_locks+0x5e/0x74
[ 53.993480] [<ffffffff8109722c>] lockdep_trace_alloc+0xb2/0xb5
[ 53.993480] [<ffffffff81174e56>] kmem_cache_alloc+0x36/0x2b0
[ 53.993480] [<ffffffffa0073422>] kmem_zone_alloc+0x65/0xc1 [xfs]
[ 54.003925] [<ffffffffa007a701>] xfs_buf_item_init+0x40/0x147 [xfs]
[ 54.003925] [<ffffffffa0084451>] _xfs_trans_bjoin+0x23/0x53 [xfs]
[ 54.003925] [<ffffffffa0084ea5>] xfs_trans_read_buf_map+0x2e9/0x5b3 [xfs]
[ 54.003925] [<ffffffffa001427c>] xfs_read_agf+0x141/0x1d4 [xfs]
[ 54.003925] [<ffffffffa0014413>] xfs_alloc_read_agf+0x104/0x223 [xfs]
[ 54.003925] [<ffffffffa001482f>] xfs_alloc_pagf_init+0x1a/0x3a [xfs]
[ 54.003925] [<ffffffffa001e1f2>] xfs_bmap_longest_free_extent+0x4c/0x9c [xfs]
[ 54.003925] [<ffffffffa0026225>] xfs_bmap_btalloc_nullfb+0x7a/0xc6 [xfs]
[ 54.003925] [<ffffffffa00289fc>] xfs_bmap_btalloc+0x21a/0x59d [xfs]
[ 54.003925] [<ffffffffa0028d8d>] xfs_bmap_alloc+0xe/0x10 [xfs]
[ 54.003925] [<ffffffffa0029612>] xfs_bmapi_write+0x401/0x80f [xfs]
[ 54.003925] [<ffffffffa0064209>] xfs_iomap_write_allocate+0x1bf/0x2b3 [xfs]
[ 54.003925] [<ffffffffa004be66>] xfs_map_blocks+0x141/0x3c9 [xfs]
[ 54.003925] [<ffffffffa004d8f1>] xfs_vm_writepage+0x3f6/0x612 [xfs]
[ 54.003925] [<ffffffff8112e6b4>] __writepage+0x16/0x34
[ 54.003925] [<ffffffff8112ecdb>] write_cache_pages+0x35d/0x4b4
[ 54.003925] [<ffffffff8112ee82>] generic_writepages+0x50/0x6f
[ 54.003925] [<ffffffffa004bbc2>] xfs_vm_writepages+0x44/0x4c [xfs]
[ 54.003925] [<ffffffff81130b7c>] do_writepages+0x23/0x2c
[ 54.003925] [<ffffffff81124395>] __filemap_fdatawrite_range+0x84/0x8b
[ 54.003925] [<ffffffff8112445f>] filemap_write_and_wait_range+0x2d/0x5b
[ 54.003925] [<ffffffffa0059512>] xfs_file_fsync+0x113/0x29a [xfs]
[ 54.003925] [<ffffffff811bd269>] vfs_fsync_range+0x8c/0x9e
[ 54.003925] [<ffffffff811bd297>] vfs_fsync+0x1c/0x1e
[ 54.003925] [<ffffffff811bd2ca>] do_fsync+0x31/0x4a
[ 54.003925] [<ffffffff811bd50c>] SyS_fsync+0x10/0x14
[ 54.003925] [<ffffffff81618197>] entry_SYSCALL_64_fastpath+0x12/0x6b
[ 54.003925] irq event stamp: 2998549
[ 54.003925] hardirqs last enabled at (2998549): [<ffffffff81617997>] _raw_spin_unlock_irq+0x2c/0x4a
[ 54.003925] hardirqs last disabled at (2998548): [<ffffffff81617805>] _raw_spin_lock_irq+0x13/0x47
[ 54.003925] softirqs last enabled at (2994888): [<ffffffff8161afbf>] __do_softirq+0x38f/0x4d5
[ 54.003925] softirqs last disabled at (2994867): [<ffffffff810542ea>] irq_exit+0x6f/0xd1
[ 54.003925]
[ 54.003925] other info that might help us debug this:
[ 54.003925] Possible unsafe locking scenario:
[ 54.003925]
[ 54.003925] CPU0
[ 54.003925] ----
[ 54.003925] lock(&xfs_nondir_ilock_class);
[ 54.003925] <Interrupt>
[ 54.003925] lock(&xfs_nondir_ilock_class);
[ 54.003925]
[ 54.003925] *** DEADLOCK ***
[ 54.003925]
[ 54.003925] 2 locks held by kswapd0/467:
[ 54.003925] #0: (shrinker_rwsem){++++..}, at: [<ffffffff81136350>] shrink_slab+0x7a/0x518
[ 54.003925] #1: (&type->s_umount_key#25){.+.+..}, at: [<ffffffff81192b74>] trylock_super+0x1b/0x4b
[ 54.003925]
[ 54.003925] stack backtrace:
[ 54.003925] CPU: 0 PID: 467 Comm: kswapd0 Not tainted 4.5.0-nofs5-00004-g35ad69a8eb83-dirty #902
[ 54.003925] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
[ 54.003925] 0000000000000000 ffff8800065678b8 ffffffff81308ded ffffffff825caa70
[ 54.003925] ffff880007328000 ffff8800065678f0 ffffffff81121489 0000000000000009
[ 54.003925] ffff8800073288a0 ffff880007328000 ffffffff810936a7 0000000000000009
[ 54.003925] Call Trace:
[ 54.003925] [<ffffffff81308ded>] dump_stack+0x67/0x90
[ 54.003925] [<ffffffff81121489>] print_usage_bug.part.24+0x259/0x268
[ 54.003925] [<ffffffff810936a7>] ? print_shortest_lock_dependencies+0x180/0x180
[ 54.003925] [<ffffffff8109439f>] mark_lock+0x381/0x567
[ 54.003925] [<ffffffff810953ff>] __lock_acquire+0x9f7/0x190c
[ 54.003925] [<ffffffffa0066897>] ? xfs_ilock+0x18a/0x205 [xfs]
[ 54.003925] [<ffffffff81093f2c>] ? check_irq_usage+0x99/0xaa
[ 54.003925] [<ffffffff81092c77>] ? add_lock_to_list.isra.8.constprop.25+0x82/0x8d
[ 54.003925] [<ffffffff810961b1>] ? __lock_acquire+0x17a9/0x190c
[ 54.003925] [<ffffffff81096ae2>] lock_acquire+0x139/0x1e1
[ 54.003925] [<ffffffff81096ae2>] ? lock_acquire+0x139/0x1e1
[ 54.003925] [<ffffffffa0066897>] ? xfs_ilock+0x18a/0x205 [xfs]
[ 54.003925] [<ffffffffa0050df2>] ? xfs_free_eofblocks+0x84/0x1ce [xfs]
[ 54.003925] [<ffffffff81090e34>] down_read_nested+0x29/0x3e
[ 54.003925] [<ffffffffa0066897>] ? xfs_ilock+0x18a/0x205 [xfs]
[ 54.003925] [<ffffffffa0066897>] xfs_ilock+0x18a/0x205 [xfs]
[ 54.003925] [<ffffffffa0050df2>] xfs_free_eofblocks+0x84/0x1ce [xfs]
[ 54.003925] [<ffffffff81094765>] ? trace_hardirqs_on_caller+0x16c/0x188
[ 54.003925] [<ffffffffa0069ef7>] xfs_inactive+0x55/0xc6 [xfs]
[ 54.003925] [<ffffffffa006ef1b>] xfs_fs_evict_inode+0x14b/0x1bd [xfs]
[ 54.003925] [<ffffffff811a7db9>] evict+0xb0/0x165
[ 54.003925] [<ffffffff811a7eaa>] dispose_list+0x3c/0x4a
[ 54.003925] [<ffffffff811a9241>] prune_icache_sb+0x4a/0x55
[ 54.003925] [<ffffffff81192cd3>] super_cache_scan+0x12f/0x179
[ 54.003925] [<ffffffff811365a1>] shrink_slab+0x2cb/0x518
[ 54.003925] [<ffffffff81139b97>] shrink_zone+0x175/0x263
[ 54.003925] [<ffffffff8113b175>] kswapd+0x7dc/0x948
[ 54.003925] [<ffffffff8113a999>] ? mem_cgroup_shrink_node_zone+0x305/0x305
[ 54.003925] [<ffffffff8106c911>] kthread+0xed/0xf5
[ 54.003925] [<ffffffff81617997>] ? _raw_spin_unlock_irq+0x2c/0x4a
[ 54.003925] [<ffffffff8106c824>] ? kthread_create_on_node+0x1bd/0x1bd
[ 54.003925] [<ffffffff816184ef>] ret_from_fork+0x3f/0x70
[ 54.003925] [<ffffffff8106c824>] ? kthread_create_on_node+0x1bd/0x1bd
--
Michal Hocko
SUSE Labs

2016-04-27 20:30:15

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1.2/2] mm: introduce memalloc_nofs_{save,restore} API

On Wed 27-04-16 22:09:27, Michal Hocko wrote:
[...]
> [ 53.993480] [<ffffffff810945e3>] mark_held_locks+0x5e/0x74
> [ 53.993480] [<ffffffff8109722c>] lockdep_trace_alloc+0xb2/0xb5
> [ 53.993480] [<ffffffff81174e56>] kmem_cache_alloc+0x36/0x2b0

Scratch that. I got it. It is the lockdep annotation which I got wrong
with my patch. I thought this was done much later in the slow path.
My head is burnt so I will get back to it tomorrow. The patch 1.1 should
be OK to go for XFS though because it doesn't really introduce anything
new.

Sorry about the noise!
--
Michal Hocko
SUSE Labs

2016-04-27 21:14:18

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1.2/2] mm: introduce memalloc_nofs_{save,restore} API

OK, so the lockdep splats I was seeing [1] were much easier to fix than
I originally thought. So the following should be folded into the
original patch. I will send the full patch later on.

[1] http://lkml.kernel.org/r/[email protected]
---

2016-04-27 22:55:20

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 2/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context

On Wed, Apr 27, 2016 at 10:03:11AM +0200, Michal Hocko wrote:
> On Wed 27-04-16 08:58:45, Dave Chinner wrote:
> > On Tue, Apr 26, 2016 at 01:56:12PM +0200, Michal Hocko wrote:
> > > From: Michal Hocko <[email protected]>
> > >
> > > THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE
> > >
> > > It is desirable to reduce the direct GFP_NO{FS,IO} usage at minimum and
> > > prefer scope usage defined by memalloc_no{fs,io}_{save,restore} API.
> > >
> > > Let's help this process and add a debugging tool to catch when an
> > > explicit allocation request for GFP_NO{FS,IO} is done from the scope
> > > context. The printed stacktrace should help to identify the caller
> > > and evaluate whether it can be changed to use a wider context or whether
> > > it is called from another potentially dangerous context which needs
> > > a scope protection as well.
> >
> > You're going to get a large number of these from XFS. There are call
> > paths in XFs that get called both inside and outside transaction
> > context, and many of them are marked with GFP_NOFS to prevent issues
> > that have cropped up in the past.
> >
> > Often these are to silence lockdep warnings (e.g. commit b17cb36
> > ("xfs: fix missing KM_NOFS tags to keep lockdep happy")) because
> > lockdep gets very unhappy about the same functions being called with
> > different reclaim contexts. e.g. directory block mapping might
> > occur from readdir (no transaction context) or within transactions
> > (create/unlink). hence paths like this are tagged with GFP_NOFS to
> > stop lockdep emitting false positive warnings....
>
> I would much rather see lockdep being fixed than abusing GFP_NOFS to
> workaround its limitations. GFP_NOFS has a real consequences to the
> memory reclaim. I will go and check the commit you mentioned and try
> to understand why that is a problem. From what you described above
> I would like to get rid of exactly this kind of usage which is not
> really needed for the recursion protection.

The problem is that every time we come across this, the answer is
"use lockdep annotations". Our lockdep annotations are freakin'
complex because of this, and more often than not lockdep false
positives occur due to bugs in the annotations. e.g. see
fs/xfs/xfs_inode.h for all the inode locking annotations we have to
use and the hoops we have to jump through because we are limited to
8 subclasses and we have to be able to annotate nested inode locks
5 deep in places (RENAME_WHITEOUT, thanks).

At one point, we had to reset lockdep classes for inodes in reclaim
so that they didn't throw lockdep false positives the moment an
inode was locked in a memory reclaim context. We had to change
locking to remove that problem (commit 4f59af7 ("xfs: remove iolock
lock classes"). Then there were all the problems with reclaim
triggering lockdep warnings on directory inodes - we had to add a
separate directory inode class for them, and even then we still need
GFP_NOFS in places to minimise reclaim noise (as per the above
commit).

Put simply: we've had to resort to designing locking and allocation
strategies around the limitations of lockdep annotations, as opposed
to what is actually possible or even optimal. i.e. when the choice
is a 2 minute fix to add GFP_NOFS in cases like this, versus another
week long effort to rewrite the inode annotations (again) like this
one last year:

commit 0952c8183c1575a78dc416b5e168987ff98728bb
Author: Dave Chinner <[email protected]>
Date: Wed Aug 19 10:32:49 2015 +1000

xfs: clean up inode lockdep annotations

Lockdep annotations are a maintenance nightmare. Locking has to be
modified to suit the limitations of the annotations, and we're
always having to fix the annotations because they are unable to
express the complexity of locking heirarchies correctly.
.....

It's a no-brainer to see why GFP_NOFS will be added to the
allocation in question. I've been saying for years that I consider
lockdep harmful - if you want to get rid of GFP_NOFS, then you're
going to need to sort out the lockdep reclaim annotation mess at the
same time...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-04-29 06:12:57

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 0/2] scop GFP_NOFS api

On Tue, Apr 26 2016, Michal Hocko wrote:

> Hi,
> we have discussed this topic at LSF/MM this year. There was a general
> interest in the scope GFP_NOFS allocation context among some FS
> developers. For those who are not aware of the discussion or the issue
> I am trying to sort out (or at least start in that direction) please
> have a look at patch 1 which adds memalloc_nofs_{save,restore} api
> which basically copies what we have for the scope GFP_NOIO allocation
> context. I haven't converted any of the FS myself because that is way
> beyond my area of expertise but I would be happy to help with further
> changes on the MM front as well as in some more generic code paths.
>
> Dave had an idea on how to further improve the reclaim context to be
> less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
> and FS specific cookie set in the FS allocation context and consumed
> by the FS reclaim context to allow doing some provably save actions
> that would be skipped due to GFP_NOFS normally. I like this idea and
> I believe we can go that direction regardless of the approach taken here.
> Many filesystems simply need to cleanup their NOFS usage first before
> diving into a more complex changes.

This strikes me as over-engineering to work around an unnecessarily
burdensome interface.... but without details it is hard to be certain.

Exactly what things happen in "FS reclaim context" which may, or may
not, be safe depending on the specific FS allocation context? Do they
need to happen at all?

My research suggests that for most filesystems the only thing that
happens in reclaim context that is at all troublesome is the final
'evict()' on an inode. This needs to flush out dirty pages and sync the
inode to storage. Some time ago we moved most dirty-page writeout out
of the reclaim context and into kswapd. I think this was an excellent
advance in simplicity.
If we could similarly move evict() into kswapd (and I believe we can)
then most file systems would do nothing in reclaim context that
interferes with allocation context.

The exceptions include:
- nfs and any filesystem using fscache can block for up to 1 second
in ->releasepage(). They used to block waiting for some IO, but that
caused deadlocks and wasn't really needed. I left the timeout because
it seemed likely that some throttling would help. I suspect that a
careful analysis will show that there is sufficient throttling
elsewhere.

- xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
for IO so it can free some quotainfo things. If it could be changed
to just schedule the IO without waiting for it then I think this
would be safe to be called in any FS allocation context. It already
uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
if the lock is held.

I think you/we would end up with a much simpler system if instead of
focussing on the places where GFP_NOFS is used, we focus on places where
__GFP_FS is tested, and try to remove them. If we get rid of enough of
them the remainder could just use __GFP_IO.
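The __GFP_FS tests Neil suggests auditing typically look like the superblock shrinker's early bail-out (cf. super_cache_scan() in fs/super.c). A simplified userspace model, with illustrative flag and structure definitions rather than the kernel's:

```c
#include <assert.h>

#define __GFP_FS	0x80u	/* illustrative value, not the kernel's */
#define SHRINK_STOP	(~0UL)

struct shrink_control {
	unsigned int gfp_mask;		/* context of the allocation driving reclaim */
	unsigned long nr_to_scan;
};

/*
 * Modeled after the superblock shrinker: refuse to run for !__GFP_FS
 * contexts instead of risking recursion into filesystem locks. Removing
 * such checks (or pushing them deeper) is the audit being proposed.
 */
static unsigned long super_cache_scan(const struct shrink_control *sc)
{
	if (!(sc->gfp_mask & __GFP_FS))
		return SHRINK_STOP;
	return sc->nr_to_scan;	/* pretend everything asked for was freed */
}
```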

> The patch 2 is a debugging aid which warns about explicit allocation
> requests from the scope context. This should help to reduce the
> direct usage of the NOFS flags to bare minimum in favor of the scope
> API. It is not aimed to be merged upstream. I would hope Andrew took it
> into mmotm tree to give it linux-next exposure and allow developers to
> do further cleanups. There is a new kernel command line parameter which
> has to be used for the debugging to be enabled.
>
> I think the GFP_NOIO should be seeing the same clean up.

I think you are suggesting that use of GFP_NOIO should (largely) be
deprecated in favour of memalloc_noio_save(). I think I agree.
Could we go a step further and deprecate GFP_ATOMIC in favour of some
in_atomic() test? Maybe that is going too far.

Thanks,
NeilBrown

>
> Any feedback is highly appreciated.
>



2016-04-29 10:20:41

by Steven Whitehouse

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api

Hi,

On 29/04/16 06:35, NeilBrown wrote:
> On Tue, Apr 26 2016, Michal Hocko wrote:
>
>> Hi,
>> we have discussed this topic at LSF/MM this year. There was a general
>> interest in the scope GFP_NOFS allocation context among some FS
>> developers. For those who are not aware of the discussion or the issue
>> I am trying to sort out (or at least start in that direction) please
>> have a look at patch 1 which adds memalloc_nofs_{save,restore} api
>> which basically copies what we have for the scope GFP_NOIO allocation
>> context. I haven't converted any of the FS myself because that is way
>> beyond my area of expertise but I would be happy to help with further
>> changes on the MM front as well as in some more generic code paths.
>>
>> Dave had an idea on how to further improve the reclaim context to be
>> less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
>> and FS specific cookie set in the FS allocation context and consumed
>> by the FS reclaim context to allow doing some provably save actions
>> that would be skipped due to GFP_NOFS normally. I like this idea and
>> I believe we can go that direction regardless of the approach taken here.
>> Many filesystems simply need to cleanup their NOFS usage first before
>> diving into a more complex changes.
> This strikes me as over-engineering to work around an unnecessarily
> burdensome interface.... but without details it is hard to be certain.
>
> Exactly what things happen in "FS reclaim context" which may, or may
> not, be safe depending on the specific FS allocation context? Do they
> need to happen at all?
>
> My research suggests that for most filesystems the only thing that
> happens in reclaim context that is at all troublesome is the final
> 'evict()' on an inode. This needs to flush out dirty pages and sync the
> inode to storage. Some time ago we moved most dirty-page writeout out
> of the reclaim context and into kswapd. I think this was an excellent
> advance in simplicity.
> If we could similarly move evict() into kswapd (and I believe we can)
> then most file systems would do nothing in reclaim context that
> interferes with allocation context.
evict() is an issue, but moving it into kswapd would be a potential
problem for GFS2. We already have a memory allocation issue when
recovery is taking place and memory is short. The code path is as follows:

1. Inode is scheduled for eviction (which requires deallocation)
2. The glock is required in order to perform the deallocation, which
implies getting a DLM lock
3. Another node in the cluster fails, so needs recovery
4. When the DLM lock is requested, it gets blocked until recovery is
complete (for the failed node)
5. Recovery is performed using a userland fencing utility
6. Fencing requires memory and then blocks on the eviction
7. Deadlock (Fencing waiting on memory alloc, memory alloc waiting on
DLM lock, DLM lock waiting on fencing)

It doesn't happen often, but we've been looking at the best place to
break that cycle, and one of the things we've been wondering is whether
we could avoid deallocation evictions from memory-related contexts, or
at least make them async somehow.

The issue is that it is not possible to know in advance whether an
eviction will result in merely writing things back to disk (because the
inode is being dropped from cache, but still resides on disk), which is
easy to do, or whether it requires a full deallocation (n_link==0), which
may require significant resources and time.

Steve.


2016-04-29 12:04:22

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/2] scop GFP_NOFS api

On Fri 29-04-16 15:35:42, NeilBrown wrote:
> On Tue, Apr 26 2016, Michal Hocko wrote:
>
> > Hi,
> > we have discussed this topic at LSF/MM this year. There was a general
> > interest in the scope GFP_NOFS allocation context among some FS
> > developers. For those who are not aware of the discussion or the issue
> > I am trying to sort out (or at least start in that direction) please
> > have a look at patch 1 which adds memalloc_nofs_{save,restore} api
> > which basically copies what we have for the scope GFP_NOIO allocation
> > context. I haven't converted any of the FS myself because that is way
> > beyond my area of expertise but I would be happy to help with further
> > changes on the MM front as well as in some more generic code paths.
> >
> > Dave had an idea on how to further improve the reclaim context to be
> > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
> > and FS specific cookie set in the FS allocation context and consumed
> > by the FS reclaim context to allow doing some provably save actions
> > that would be skipped due to GFP_NOFS normally. I like this idea and
> > I believe we can go that direction regardless of the approach taken here.
> > Many filesystems simply need to cleanup their NOFS usage first before
> > diving into a more complex changes.
>
> This strikes me as over-engineering to work around an unnecessarily
> burdensome interface.... but without details it is hard to be certain.
>
> Exactly what things happen in "FS reclaim context" which may, or may
> not, be safe depending on the specific FS allocation context? Do they
> need to happen at all?

Let me quote Dave Chinner from one of the emails discussed at LSFMM
mailing list:
: IMO, making GFP_NOFS "better" cannot be done with context-less flags
: being passed through reclaim. If we want to prevent the recursive
: self-deadlock case in an optimal manner, then we need to be able to
: pass state down to reclaim so that page writeback and the shrinkers
: can determine if they are likely to deadlock.
:
: IOWs, I think we should stop thinking of GFP_NOFS as a *global*
: directive to avoid recursion under any circumstance and instead
: start thinking about it as a mechanism to avoid recursion in
: specific reclaim contexts.
:
: Something as simple as adding an opaque cookie (e.g. can hold a
: superblock or inode pointer) to check against in writeback and
: subsystem shrinkers would result in the vast majority of GFP_NOFS
: contexts being able to reclaim from everything but the one context
: that we *might* deadlock against.
:
: e.g, if we then also check the PF_FSTRANS flag in XFS, we'll
: still be able to reclaim clean inodes, buffers and write back
: dirty pages that don't require transactions to complete under "don't
: recurse" situations because we know it's transactions that we could
: deadlock on in the direct reclaim context.
:
: Note that this information could be added to the writeback_control
: for page writeback, and it could be passed directly to shrinkers
: in the shrink_control structures. The allocation paths might be a
: little harder, but I suspect using the task struct for passing this
: information into direct reclaim might be the easiest approach...
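Dave's cookie idea can be sketched as code (entirely hypothetical: no such field exists in the kernel today). The allocation context records which filesystem instance it could deadlock against, and only the matching shrinker backs off, so every other filesystem remains reclaimable:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reclaim context carrying the opaque cookie Dave describes. */
struct shrink_control {
	const void *nofs_cookie;	/* e.g. the superblock the caller holds locks in */
};

/*
 * A per-superblock shrinker under this scheme: skip only the one
 * filesystem instance we might recurse into, instead of treating
 * GFP_NOFS as a global "reclaim nothing FS-related" directive.
 */
static unsigned long sb_shrink(const struct shrink_control *sc,
			       const void *sb, unsigned long nr_cached)
{
	if (sc->nofs_cookie == sb)
		return 0;	/* could deadlock on our own locks: skip */
	return nr_cached;	/* provably safe: reclaim from other filesystems */
}
```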

> My research suggests that for most filesystems the only thing that
> happens in reclaim context that is at all troublesome is the final
> 'evict()' on an inode. This needs to flush out dirty pages and sync the
> inode to storage. Some time ago we moved most dirty-page writeout out
> of the reclaim context and into kswapd. I think this was an excellent
> advance in simplicity.
> If we could similarly move evict() into kswapd (and I believe we can)
> then most file systems would do nothing in reclaim context that
> interferes with allocation context.
>
> The exceptions include:
> - nfs and any filesystem using fscache can block for up to 1 second
> in ->releasepage(). They used to block waiting for some IO, but that
> caused deadlocks and wasn't really needed. I left the timeout because
> it seemed likely that some throttling would help. I suspect that a
> careful analysis will show that there is sufficient throttling
> elsewhere.
>
> - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
> for IO so it can free some quotainfo things. If it could be changed
> to just schedule the IO without waiting for it then I think this
> would be safe to be called in any FS allocation context. It already
> uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
> if the lock is held.
>
> I think you/we would end up with a much simpler system if instead of
> focussing on the places where GFP_NOFS is used, we focus on places where
> __GFP_FS is tested, and try to remove them.

One thing I have learned is that shrinkers can be really complex and
getting rid of GFP_NOFS will be really hard so I would really like to
start the easiest way possible and remove the direct usage and replace
it by scope one which would at least _explain_ why it is needed. I think
this is a reasonable _first_ step and a large step ahead because we have
a good chance to get rid of a large number of those which were used
"just because I wasn't sure and this should be safe, right?". I wouldn't
be surprised if we end up with a very small number of both scope and
direct usage in the end.

I would also like to revisit generic inode/dentry shrinker and see
whether it could be more __GFP_FS friendly. As you say, many FS might
not even depend on some FS-internal locks, so pushing the __GFP_FS check
down the layers might make a lot of sense and allow cleaning some
[id]cache even for __GFP_FS context.

> If we get rid of enough of them the remainder could just use __GFP_IO.
>
> > The patch 2 is a debugging aid which warns about explicit allocation
> > requests from the scope context. This should help to reduce the
> > direct usage of the NOFS flags to bare minimum in favor of the scope
> > API. It is not aimed to be merged upstream. I would hope Andrew took it
> > into mmotm tree to give it linux-next exposure and allow developers to
> > do further cleanups. There is a new kernel command line parameter which
> > has to be used for the debugging to be enabled.
> >
> > I think the GFP_NOIO should be seeing the same clean up.
>
> I think you are suggesting that use of GFP_NOIO should (largely) be
> deprecated in favour of memalloc_noio_save(). I think I agree.

Yes that was the idea.

> Could we go a step further and deprecate GFP_ATOMIC in favour of some
> in_atomic() test? Maybe that is going too far.

I am not really sure we need that and some GFP_NOWAIT usage is deliberate
to perform an optimistic allocation with another fallback (e.g. higher order
for performance reasons with single page fallback). So I think that nowait
is a slightly different thing.

Thanks!
--
Michal Hocko
SUSE Labs

2016-04-27 13:07:16

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1.2/2] mm: introduce memalloc_nofs_{save,restore} API

On Wed 27-04-16 13:54:35, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>

Oops, I missed Dave's note about:

> GFP_NOFS context is used for the following 4 reasons currently
> - to prevent from deadlocks when the lock held by the allocation
> context would be needed during the memory reclaim
> - to prevent from stack overflows during the reclaim because
> the allocation is performed from a deep context already
> - to prevent lockups when the allocation context depends on
> other reclaimers to make a forward progress indirectly
> - just in case because this would be safe from the fs POV

- silence lockdep false positives
--
Michal Hocko
SUSE Labs

2016-04-27 11:54:50

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 1.2/2] mm: introduce memalloc_nofs_{save,restore} API

From: Michal Hocko <[email protected]>

GFP_NOFS context is currently used for the following 4 reasons
- to prevent deadlocks when a lock held by the allocation
context would be needed during memory reclaim
- to prevent stack overflows during reclaim because
the allocation is already performed from a deep context
- to prevent lockups when the allocation context indirectly
depends on other reclaimers to make forward progress
- just in case, because this would be safe from the fs POV

Unfortunately, overuse of this allocation context brings some problems
to the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), and the OOM killer cannot be invoked because the
MM layer doesn't have enough information about how much memory is
freeable by the FS layer.

In many cases it is far from clear why the weaker context is even used,
and so it might be used unnecessarily. We would like to get rid of
those uses as much as possible. One way to do that is to use the flag
in scopes rather than in isolated cases. Such a scope is declared when
really necessary, is tracked per task, and all allocation requests from
within the context simply inherit the GFP_NOFS semantic.

Not only is this easier to understand and maintain, because there are
far fewer problematic contexts than specific allocation requests, it
also helps code paths where the FS layer interacts with other layers
(e.g. crypto, security modules, MM etc.) and there is no easy way to
convey the allocation context between the layers.

Introduce the memalloc_nofs_{save,restore} API to control the scope
of the GFP_NOFS allocation context. This basically copies the
memalloc_noio_{save,restore} API we have for the other restricted
allocation context, GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists
and is just an alias for PF_FSTRANS, which was xfs specific until
recently. There are no remaining PF_FSTRANS users, so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and implicitly drops
__GFP_FS the same way PF_MEMALLOC_NOIO drops __GFP_IO.
memalloc_noio_flags is renamed to current_gfp_context because it now
cares about both the PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.
Xfs code paths preserve their semantics. kmem_flags_convert() doesn't
need to evaluate the flag anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.
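The intended usage and flag handling can be sketched in a small userspace model (the PF_* values mirror this patch; the __GFP_* values and the bare `task_flags` word are illustrative placeholders for `current->flags`):

```c
#include <assert.h>

/* illustrative placeholders for the gfp bits */
#define __GFP_IO	0x40u
#define __GFP_FS	0x80u
/* the PF_* values match the patch */
#define PF_MEMALLOC_NOFS	0x00040000u
#define PF_MEMALLOC_NOIO	0x00080000u

/* stand-in for current->flags */
static unsigned int task_flags;

static unsigned int memalloc_nofs_save(void)
{
	unsigned int flags = task_flags & PF_MEMALLOC_NOFS;

	task_flags |= PF_MEMALLOC_NOFS;
	return flags;	/* previous state is returned so scopes can nest */
}

static void memalloc_nofs_restore(unsigned int flags)
{
	task_flags = (task_flags & ~PF_MEMALLOC_NOFS) | flags;
}

/* mirrors current_gfp_context(): NOIO takes precedence over NOFS */
static unsigned int current_gfp_context(unsigned int gfp)
{
	if (task_flags & PF_MEMALLOC_NOIO)
		gfp &= ~(__GFP_IO | __GFP_FS);
	else if (task_flags & PF_MEMALLOC_NOFS)
		gfp &= ~__GFP_FS;
	return gfp;
}
```

A filesystem would wrap a lock-holding section with memalloc_nofs_save()/memalloc_nofs_restore(); every allocation in between then implicitly loses __GFP_FS.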

Signed-off-by: Michal Hocko <[email protected]>
---
fs/xfs/kmem.h | 2 +-
include/linux/gfp.h | 8 ++++++++
include/linux/sched.h | 34 ++++++++++++++++++++++++++--------
mm/page_alloc.c | 8 +++++---
mm/vmscan.c | 4 ++--
5 files changed, 42 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 0d83f332e5c2..b35688a54c9a 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
lflags = GFP_ATOMIC | __GFP_NOWARN;
} else {
lflags = GFP_KERNEL | __GFP_NOWARN;
- if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+ if (flags & KM_NOFS)
lflags &= ~__GFP_FS;
}

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..3ebdbdff44b4 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -210,8 +210,16 @@ struct vm_area_struct;
*
* GFP_NOIO will use direct reclaim to discard clean pages or slab pages
* that do not require the starting of any physical IO.
+ * Please try to avoid using this flag directly and instead use
+ * memalloc_noio_{save,restore} to mark the whole scope which cannot
+ * perform any IO with a short explanation why. All allocation requests
+ * will inherit GFP_NOIO implicitly.
*
* GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ * Please try to avoid using this flag directly and instead use
+ * memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ * recurse into the FS layer with a short explanation why. All allocation
+ * requests will inherit GFP_NOFS implicitly.
*
* GFP_USER is for userspace allocations that also need to be directly
* accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 820db8f98bfc..e9521dc0475f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2102,9 +2102,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
#define PF_USED_ASYNC 0x00004000 /* used async_schedule*(), used by module init */
#define PF_NOFREEZE 0x00008000 /* this thread should not be frozen */
#define PF_FROZEN 0x00010000 /* frozen for system suspend */
-#define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */
-#define PF_KSWAPD 0x00040000 /* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */
+#define PF_KSWAPD 0x00020000 /* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */
#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */
@@ -2115,8 +2115,6 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK 0x80000000 /* this thread called freeze_processes and should not be frozen */

-#define PF_MEMALLOC_NOFS PF_FSTRANS /* Transition to a more generic GFP_NOFS scope semantic */
-
/*
* Only the _current_ task can read/write to tsk->flags, but other
* tasks can access tsk->flags in readonly mode for example
@@ -2142,13 +2140,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
#define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
#define used_math() tsk_used_math(current)

-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
*/
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
{
+ /*
+ * NOIO implies both NOIO and NOFS and it is a weaker context
+ * so always make sure it takes precedence
+ */
if (unlikely(current->flags & PF_MEMALLOC_NOIO))
flags &= ~(__GFP_IO | __GFP_FS);
+ else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+ flags &= ~__GFP_FS;
return flags;
}

@@ -2164,6 +2170,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
}

+static inline unsigned int memalloc_nofs_save(void)
+{
+ unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+ current->flags |= PF_MEMALLOC_NOFS;
+ return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+ current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
/* Per-process atomic flags. */
#define PFA_NO_NEW_PRIVS 0 /* May not gain new privileges. */
#define PFA_SPREAD_PAGE 1 /* Spread page cache over cpuset */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0e2bb8f4ec0b..86bb5d6ddd7d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3801,10 +3801,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
goto out;

/*
- * Runtime PM, block IO and its error handling path can deadlock
- * because I/O on the device might not complete.
+ * Apply scoped allocation constraints. This is mainly about
+ * GFP_NOFS resp. GFP_NOIO which have to be inherited for all
+ * allocation requests from a particular context which has
+ * been marked by memalloc_no{fs,io}_{save,restore}
*/
- alloc_mask = memalloc_noio_flags(gfp_mask);
+ alloc_mask = current_gfp_context(gfp_mask);
ac.spread_dirty_pages = false;

page = __alloc_pages_slowpath(alloc_mask, order, &ac);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4a2f4512fca..cfb74de1efa3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2808,7 +2808,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
unsigned long nr_reclaimed;
struct scan_control sc = {
.nr_to_reclaim = SWAP_CLUSTER_MAX,
- .gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+ .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
.order = order,
.nodemask = nodemask,
.priority = DEF_PRIORITY,
@@ -3656,7 +3656,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
struct reclaim_state reclaim_state;
struct scan_control sc = {
.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
- .gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+ .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
.order = order,
.priority = ZONE_RECLAIM_PRIORITY,
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
--
2.8.0.rc3


2016-04-27 11:54:49

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 1.1/2] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

From: Michal Hocko <[email protected]>

xfs defined PF_FSTRANS to declare a scoped GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a
more generic name, PF_MEMALLOC_NOFS, which is in line with the existing
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage in the xfs code in the first
step, before we introduce a full API for it, as xfs uses the flag
directly anyway.

This patch doesn't introduce any functional change.

Signed-off-by: Michal Hocko <[email protected]>
---
Hi,
as suggested by Dave, I have split up [1] into two parts. The first one
adds a new PF flag which is just an alias for the existing PF_FSTRANS
and does all the renaming, and the second one introduces the generic
API, which changes only the bare minimum in the xfs proper.

fs/xfs/kmem.c | 4 ++--
fs/xfs/kmem.h | 2 +-
fs/xfs/libxfs/xfs_btree.c | 2 +-
fs/xfs/xfs_aops.c | 6 +++---
fs/xfs/xfs_trans.c | 12 ++++++------
include/linux/sched.h | 2 ++
6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 686ba6fb20dd..73f6ab59c664 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
* context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
* the filesystem here and potentially deadlocking.
*/
- if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+ if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
noio_flag = memalloc_noio_save();

lflags = kmem_flags_convert(flags);
ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);

- if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+ if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
memalloc_noio_restore(noio_flag);

return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index d1c66e465ca5..0d83f332e5c2 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -50,7 +50,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
lflags = GFP_ATOMIC | __GFP_NOWARN;
} else {
lflags = GFP_KERNEL | __GFP_NOWARN;
- if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+ if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
lflags &= ~__GFP_FS;
}

diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index a0eb18ce3ad3..326566f4a131 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2540,7 +2540,7 @@ xfs_btree_split_worker(
struct xfs_btree_split_args *args = container_of(work,
struct xfs_btree_split_args, work);
unsigned long pflags;
- unsigned long new_pflags = PF_FSTRANS;
+ unsigned long new_pflags = PF_MEMALLOC_NOFS;

/*
* we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index d12dfcfd0cc8..6d816ff0b763 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -124,7 +124,7 @@ xfs_setfilesize_trans_alloc(
* We hand off the transaction to the completion thread now, so
* clear the flag here.
*/
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
return 0;
}

@@ -169,7 +169,7 @@ xfs_setfilesize_ioend(
* thus we need to mark ourselves as being in a transaction manually.
* Similarly for freeze protection.
*/
- current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);

/* we abort the update if there was an IO error */
@@ -979,7 +979,7 @@ xfs_vm_writepage(
* Given that we do not allow direct reclaim to call us, we should
* never be called while in a filesystem transaction.
*/
- if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+ if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
goto redirty;

/* Is this page beyond the end of the file? */
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 748b16aff45a..1d247366c733 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -176,7 +176,7 @@ xfs_trans_reserve(
bool rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;

/* Mark this thread as being in a transaction */
- current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);

/*
* Attempt to reserve the needed disk blocks by decrementing
@@ -186,7 +186,7 @@ xfs_trans_reserve(
if (blocks > 0) {
error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
if (error != 0) {
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
return -ENOSPC;
}
tp->t_blk_res += blocks;
@@ -263,7 +263,7 @@ xfs_trans_reserve(
tp->t_blk_res = 0;
}

- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);

return error;
}
@@ -921,7 +921,7 @@ __xfs_trans_commit(

xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);

- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
xfs_trans_free(tp);

/*
@@ -951,7 +951,7 @@ __xfs_trans_commit(
if (commit_lsn == -1 && !error)
error = -EIO;
}
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
xfs_trans_free(tp);

@@ -1005,7 +1005,7 @@ xfs_trans_cancel(
xfs_log_done(mp, tp->t_ticket, NULL, false);

/* mark this thread as no longer being in a transaction */
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);

xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
xfs_trans_free(tp);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index acfc32b30704..820db8f98bfc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2115,6 +2115,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK 0x80000000 /* this thread called freeze_processes and should not be frozen */

+#define PF_MEMALLOC_NOFS PF_FSTRANS /* Transition to a more generic GFP_NOFS scope semantic */
+
/*
* Only the _current_ task can read/write to tsk->flags, but other
* tasks can access tsk->flags in readonly mode for example
--
2.8.0.rc3


2016-04-30 00:11:46

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/2] scop GFP_NOFS api

On Fri, Apr 29, 2016 at 03:35:42PM +1000, NeilBrown wrote:
> On Tue, Apr 26 2016, Michal Hocko wrote:
>
> > Hi,
> > we have discussed this topic at LSF/MM this year. There was a general
> > interest in the scope GFP_NOFS allocation context among some FS
> > developers. For those who are not aware of the discussion or the issue
> > I am trying to sort out (or at least start in that direction) please
> > have a look at patch 1 which adds memalloc_nofs_{save,restore} api
> > which basically copies what we have for the scope GFP_NOIO allocation
> > context. I haven't converted any of the FS myself because that is way
> > beyond my area of expertise but I would be happy to help with further
> > changes on the MM front as well as in some more generic code paths.
> >
> > Dave had an idea on how to further improve the reclaim context to be
> > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
> > and FS specific cookie set in the FS allocation context and consumed
> > by the FS reclaim context to allow doing some provably save actions
> > that would be skipped due to GFP_NOFS normally. I like this idea and
> > I believe we can go that direction regardless of the approach taken here.
> > Many filesystems simply need to cleanup their NOFS usage first before
> > diving into more complex changes.
>
> This strikes me as over-engineering to work around an unnecessarily
> burdensome interface.... but without details it is hard to be certain.
>
> Exactly what things happen in "FS reclaim context" which may, or may
> not, be safe depending on the specific FS allocation context? Do they
> need to happen at all?
>
> My research suggests that for most filesystems the only thing that
> happens in reclaim context that is at all troublesome is the final
> 'evict()' on an inode. This needs to flush out dirty pages and sync the
> inode to storage. Some time ago we moved most dirty-page writeout out
> of the reclaim context and into kswapd. I think this was an excellent
> advance in simplicity.

No, we didn't move dirty page writeout to kswapd - we moved it back
to the background writeback threads where it can be done
efficiently. kswapd should almost never do single page writeback
because of how inefficient it is from an IO perspective, even though
it can. i.e. if we are doing any significant amount of dirty page
writeback from memory reclaim (direct, kswapd or otherwise) then
we've screwed something up.

> If we could similarly move evict() into kswapd (and I believe we can)
> then most file systems would do nothing in reclaim context that
> interferes with allocation context.

When lots of GFP_NOFS allocation is being done, this already
happens. The shrinkers that can't run due to context accumulate the
work on the shrinker structure, and when the shrinker can next run
(e.g. run from kswapd) it runs all the deferred work from GFP_NOFS
reclaim contexts.

IOWs, we already move shrinker work from direct reclaim to kswapd
when appropriate.
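The deferral Dave describes can be modeled roughly like this (a toy, not the actual shrinker infrastructure; in the real kernel the accounting lives in shrink_slab()'s nr_deferred handling):

```c
#include <assert.h>

/* Toy model of deferred shrinker work: when a shrinker cannot run in
 * the current (e.g. GFP_NOFS) context, the requested scan count is
 * accumulated on the shrinker and executed on the next permitted run,
 * such as one from kswapd. */
struct shrinker_stub {
	long nr_deferred;	/* work accumulated from skipped contexts */
};

/* returns the number of objects "reclaimed" on this invocation */
static long shrink_stub(struct shrinker_stub *s, long nr_to_scan,
			int gfp_allows_fs)
{
	long total = s->nr_deferred + nr_to_scan;

	if (!gfp_allows_fs) {
		s->nr_deferred = total;	/* defer everything */
		return 0;
	}
	s->nr_deferred = 0;
	return total;	/* do the current plus all deferred work */
}
```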

> The exceptions include:
> - nfs and any filesystem using fscache can block for up to 1 second
> in ->releasepage(). They used to block waiting for some IO, but that
> caused deadlocks and wasn't really needed. I left the timeout because
> it seemed likely that some throttling would help. I suspect that a
> careful analysis will show that there is sufficient throttling
> elsewhere.
>
> - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
> for IO so it can free some quotainfo things.

No it's not. evict() can block on IO - waiting for data or inode
writeback to complete, or even for filesystems to run transactions
on the inode. Hence the superblock shrinker can and does block in
inode cache reclaim.

Indeed, blocking the superblock shrinker in reclaim is a key part of
balancing inode cache pressure in XFS. If the shrinker starts
hitting dirty inodes, it blocks on cleaning them, thereby slowing
the rate of allocation to that which inodes can be cleaned and
reclaimed. There are also background threads that walk ahead freeing
clean inodes, but we have to throttle direct reclaim in this manner
otherwise the allocation pressure vastly outweighs the ability to
reclaim inodes. If we don't balance this, inode allocation triggers
the OOM killer because reclaim keeps reporting "no progress being
made" because dirty inodes are skipped. By blocking on such inodes,
the shrinker makes progress (slowly) and reclaim sees that memory is
being freed and so continues without invoking the OOM killer...

> If it could be changed
> to just schedule the IO without waiting for it then I think this
> would be safe to be called in any FS allocation context. It already
> uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
> if the lock is held.

We could, but then we have the same problem as the inode cache -
there's no indication of progress going back to the memory reclaim
subsystem, nor is reclaim able to throttle memory allocation back to
the rate at which reclaim is making progress.

There's feedback loops all throughout the XFS reclaim code - it's
designed specifically that way - I made changes to the shrinker
infrastructure years ago to enable this. It's no different to the
dirty page throttling that was done at roughly the same time -
that's also one big feedback loop controlled by the rate at which
pages can be cleaned. Indeed, it was designed based on the same
premise as all the XFS shrinker code: in steady state conditions
we can't allocate a resource faster than we can reclaim it, so we
need to make reclaim as efficient as possible...

> I think you/we would end up with a much simpler system if instead of
> focussing on the places where GFP_NOFS is used, we focus on places where
> __GFP_FS is tested, and try to remove them. If we get rid of enough of
> them the remainder could just use __GFP_IO.

The problem with this is that a single kswapd thread can't keep up
with all of the allocation pressure that occurs. e.g. a 20-core
intel CPU with local memory will be seen as a single node and so
will have a single kswapd thread to do reclaim. There's a massive
imbalance between maximum reclaim rate and maximum allocation rate
in situations like this. If we want memory reclaim to run faster,
we need to be able to do more work *now*, not defer it to a context with
limited execution resources.

i.e. IMO deferring more work to a single reclaim thread per node is
going to limit memory reclaim scalability and performance, not
improve it.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-04-30 00:24:30

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/2] scop GFP_NOFS api

On Fri, Apr 29, 2016 at 02:04:18PM +0200, Michal Hocko wrote:
> I would also like to revisit generic inode/dentry shrinker and see
> whether it could be more __GFP_FS friendly. As you say many FS might
> even not depend on some FS internal locks so pushing GFP_FS check down
> the layers might make a lot of sense and allow cleaning some [id]cache
> even for __GFP_FS context.

That's precisely my point about passing a context to the shrinker.
It's recursion within a single superblock context that makes up the
majority of cases GFP_NOFS is used for, so passing the superblock
immediately allows for reclaim to run the superblock shrinker on
every other superblock.

We can refine it further from there, but I strongly suspect that
further refinement is going to require filesystems to specifically
configure the superblock shrinker.

e.g. in XFS, we can't allow evict() even on clean VFS inodes in a
PF_FSTRANS context, because we may run a transaction on a clean
VFS inode to prepare it for reclaim. We can, however,
allow the fs-specific shrinker callouts to run (i.e. call into
.free_cached_objects) so that it can reclaim clean XFS inodes
because that doesn't require transactions....

i.e. the infrastructure I suggested we use is aimed directly at
providing the mechanism required for finer-grained inode/dentry
cache reclaim in contexts that it is currently disallowed
completely. I was also implying that once the infrastructure to pass
contexts is in place, I'd then make the changes to the generic
superblock shrinker code to enable finer grained reclaim and
optimise the XFS shrinkers to make use of it...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-04-30 21:18:25

by NeilBrown

[permalink] [raw]
Subject: Re: [Cluster-devel] [PATCH 0/2] scop GFP_NOFS api

On Fri, Apr 29 2016, Steven Whitehouse wrote:

> Hi,
>
> On 29/04/16 06:35, NeilBrown wrote:
>> If we could similarly move evict() into kswapd (and I believe we can)
>> then most file systems would do nothing in reclaim context that
>> interferes with allocation context.
> evict() is an issue, but moving it into kswapd would be a potential
> problem for GFS2. We already have a memory allocation issue when
> recovery is taking place and memory is short. The code path is as follows:
>
> 1. Inode is scheduled for eviction (which requires deallocation)
> 2. The glock is required in order to perform the deallocation, which
> implies getting a DLM lock
> 3. Another node in the cluster fails, so needs recovery
> 4. When the DLM lock is requested, it gets blocked until recovery is
> complete (for the failed node)
> 5. Recovery is performed using a userland fencing utility
> 6. Fencing requires memory and then blocks on the eviction
> 7. Deadlock (Fencing waiting on memory alloc, memory alloc waiting on
> DLM lock, DLM lock waiting on fencing)

You even have user-space in the loop there - impressive! You can't
really pass GFP_NOFS to a user-space thread, can you :-?

>
> It doesn't happen often, but we've been looking at the best place to
> break that cycle, and one of the things we've been wondering is whether
> we could avoid deallocation evictions from memory related contexts, or
> at least make it async somehow.

I think "async" is definitely the answer and I think
evict()/evict_inode() is the best place to focus attention.

I can see now (thanks) that just moving the evict() call to kswapd isn't
really a solution as it will just serve to block kswapd and so lots of
other freeing of memory won't happen.

I'm now imagining giving ->evict_inode() a "don't sleep" flag and
allowing it to return -EAGAIN. In that case evict would queue the inode
to kswapd (or maybe another thread) for periodic retry.

The flag would only get set when prune_icache_sb() calls dispose_list()
to call evict(). Other uses (e.g. unmount, iput) would still be
synchronous.

How difficult would it be to change gfs's evict_inode() to optionally
never block?

For this to work we would need to add a way for
deactivate_locked_super() to wait for all the async evictions to
complete. Currently prune_icache_sb() is called under s_umount. If we
moved part of the eviction out of that lock some other synchronization
would be needed. Possibly a per-superblock list of "inodes being
evicted" would suffice.

Thanks,
NeilBrown


>
> The issue is that it is not possible to know in advance whether an
> eviction will result in merely writing things back to disk (because the
> inode is being dropped from cache, but still resides on disk) which is
> easy to do, or whether it requires a full deallocation (n_link==0) which
> may require significant resources and time,
>
> Steve.



2016-04-30 21:55:55

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 0/2] scop GFP_NOFS api

On Fri, Apr 29 2016, Michal Hocko wrote:

>
> One thing I have learned is that shrinkers can be really complex and
> getting rid of GFP_NOFS will be really hard so I would really like to
> start the easiest way possible and remove the direct usage and replace
> it by scope one which would at least _explain_ why it is needed. I think
> this is a reasonable _first_ step and a large step ahead because we have
> a good chance to get rid of a large number of those which were used
> "just because I wasn't sure and this should be safe, right?". I wouldn't
> be surprised if we end up with a very small number of both scope and
> direct usage in the end.

Yes, shrinkers can be complex. About two of them are. We could fix
lots and lots of call sites, or fix two shrinkers.
OK, that's a bit unfair as fixing one of the shrinkers involves changing
many ->evict_inode() functions. But that would be a very focused
change.

I think your proposal is little more than re-arranging deck chairs on
the titanic. Yes, it might give everybody a better view of the iceberg
but the iceberg is still there and in reality we can already see it.

The main iceberg is evict_inode. It appears in both examples given so
far: xfs and gfs. There are other little icebergs but they won't last
long after evict_inode is dealt with.

One particular problem with your process-context idea is that it isn't
inherited across threads.
Steve Whitehouse's example in gfs shows how allocation dependencies can
even cross into user space.

A more localized one that I have seen is that NFSv4 sometimes needs to
start up a state-management thread (particularly if the server
restarted).
It uses kthread_run(), which doesn't actually create the thread but asks
kthreadd to do it. If NFS writeout is waiting for state management it
would need to make sure that kthreadd runs in allocation context to
avoid deadlock.
I feel that I've forgotten some important detail here and this might
have been fixed somehow, but the point still stands that the allocation
context can cross from thread to thread and can effectively become
anything and everything.

It is OK to wait for memory to be freed. It is not OK to wait for any
particular piece of memory to be freed because you don't always know who
is waiting for you, or who you really are waiting on to free that
memory.

Whenever trying to free memory I think you need to do best-effort
without blocking.

>
> I would also like to revisit generic inode/dentry shrinker and see
> whether it could be more __GFP_FS friendly. As you say many FS might
> even not depend on some FS internal locks so pushing GFP_FS check down
> the layers might make a lot of sense and allow cleaning some [id]cache
> even for __GFP_FS context.

I think the only part of prune_dcache_sb() that might need care is
iput() which boils down to evict(). The final unlink for NFS
silly-rename might happen in there too (in d_iput).
Shrinking the dcache seems rather late to be performing that unlink,
though, so I've probably missed some key detail.

If we find a way for evict(), when called from the shrinker, to be
non-blocking, and generally require all shrinkers to be non-blocking,
then many of these allocation problems evaporate.

Thanks,
NeilBrown



2016-04-30 22:20:07

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 0/2] scop GFP_NOFS api

On Sat, Apr 30 2016, Dave Chinner wrote:

> On Fri, Apr 29, 2016 at 03:35:42PM +1000, NeilBrown wrote:
>> On Tue, Apr 26 2016, Michal Hocko wrote:
>>
>> > Hi,
>> > we have discussed this topic at LSF/MM this year. There was a general
>> > interest in the scope GFP_NOFS allocation context among some FS
>> > developers. For those who are not aware of the discussion or the issue
>> > I am trying to sort out (or at least start in that direction) please
>> > have a look at patch 1 which adds memalloc_nofs_{save,restore} api
>> > which basically copies what we have for the scope GFP_NOIO allocation
>> > context. I haven't converted any of the FS myself because that is way
>> > beyond my area of expertise but I would be happy to help with further
>> > changes on the MM front as well as in some more generic code paths.
>> >
>> > Dave had an idea on how to further improve the reclaim context to be
>> > less all-or-nothing wrt. GFP_NOFS. In short he was suggesting an opaque
>> > and FS specific cookie set in the FS allocation context and consumed
>> > by the FS reclaim context to allow doing some provably save actions
>> > that would be skipped due to GFP_NOFS normally. I like this idea and
>> > I believe we can go that direction regardless of the approach taken here.
>> > Many filesystems simply need to cleanup their NOFS usage first before
>> > diving into more complex changes.
>>
>> This strikes me as over-engineering to work around an unnecessarily
>> burdensome interface.... but without details it is hard to be certain.
>>
>> Exactly what things happen in "FS reclaim context" which may, or may
>> not, be safe depending on the specific FS allocation context? Do they
>> need to happen at all?
>>
>> My research suggests that for most filesystems the only thing that
>> happens in reclaim context that is at all troublesome is the final
>> 'evict()' on an inode. This needs to flush out dirty pages and sync the
>> inode to storage. Some time ago we moved most dirty-page writeout out
>> of the reclaim context and into kswapd. I think this was an excellent
>> advance in simplicity.
>
> No, we didn't move dirty page writeout to kswapd - we moved it back
> to the background writeback threads where it can be done
> efficiently. kswapd should almost never do single page writeback
> because of how inefficient it is from an IO perspective, even though
> it can. i.e. if we are doing any significant amount of dirty page
> writeback from memory reclaim (direct, kswapd or otherwise) then
> we've screwed something up.
>
>> If we could similarly move evict() into kswapd (and I believe we can)
>> then most file systems would do nothing in reclaim context that
>> interferes with allocation context.
>
> When lots of GFP_NOFS allocation is being done, this already
> happens. The shrinkers that can't run due to context accumulate the
> work on the shrinker structure, and when the shrinker can next run
> (e.g. run from kswapd) it runs all the deferred work from GFP_NOFS
> reclaim contexts.
>
> IOWs, we already move shrinker work from direct reclaim to kswapd
> when appropriate.
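
The deferral mechanism Dave describes can be sketched roughly as below. This is a userspace model with invented names (`shrinker_model`, `do_shrink`), not the actual kernel shrinker code:

```c
#include <stddef.h>

/* Rough model of shrinker work deferral: when the calling context
 * cannot run the shrinker (e.g. GFP_NOFS), the requested scan count
 * is accumulated on the shrinker instead of being dropped, and the
 * next safe context (e.g. kswapd) performs the deferred work too.
 * All names here are invented for illustration. */
struct shrinker_model {
	long nr_deferred;		/* work skipped by unsafe contexts */
	long (*scan)(long nr_to_scan);	/* frees objects, returns count */
};

long do_shrink(struct shrinker_model *s, long nr_to_scan, int fs_allowed)
{
	long total;

	if (!fs_allowed) {
		s->nr_deferred += nr_to_scan;	/* defer, don't drop */
		return 0;
	}
	total = nr_to_scan + s->nr_deferred;	/* our work + deferred */
	s->nr_deferred = 0;
	return s->scan(total);
}
```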
>
>> The exceptions include:
>> - nfs and any filesystem using fscache can block for up to 1 second
>> in ->releasepage(). They used to block waiting for some IO, but that
>> caused deadlocks and wasn't really needed. I left the timeout because
>> it seemed likely that some throttling would help. I suspect that a
>> careful analysis will show that there is sufficient throttling
>> elsewhere.
>>
>> - xfs_qm_shrink_scan is nearly unique among shrinkers in that it waits
>> for IO so it can free some quotainfo things.
>
> No it's not. evict() can block on IO - waiting for data or inode
> writeback to complete, or even for filesystems to run transactions
> on the inode. Hence the superblock shrinker can and does block in
> inode cache reclaim.

That is why I said "nearly" :-)

>
> Indeed, blocking the superblock shrinker in reclaim is a key part of
> balancing inode cache pressure in XFS. If the shrinker starts
> hitting dirty inodes, it blocks on cleaning them, thereby slowing
> the rate of allocation to that which inodes can be cleaned and
> reclaimed. There are also background threads that walk ahead freeing
> clean inodes, but we have to throttle direct reclaim in this manner
> otherwise the allocation pressure vastly outweighs the ability to
> reclaim inodes. If we don't balance this, inode allocation triggers
> the OOM killer because reclaim keeps reporting "no progress being
> made" because dirty inodes are skipped. By blocking on such inodes,
> the shrinker makes progress (slowly) and reclaim sees that memory is
> being freed and so continues without invoking the OOM killer...

I'm very aware of the need to throttle allocation based on IO. I
remember when NFS didn't quite get this right and filled up memory :-)

balance_dirty_pages() used to force threads to wait on the write-out of
one page for every page that they dirtied (or wait on 128 pages for every 128
dirtied or whatever). This was exactly to provide the sort of
throttling you are talking about.

We don't do that any more. It was problematic. I don't recall all the
reasons but I think that different backing devices having different
clearance rates was part of the problem.
So now we monitor clearance rates and wait for some number of blocks to
be written, rather than waiting for some specific blocks to be written.

We should be able to do the same thing to balance dirty inodes as we do
to balance dirty pages.
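
The shift described here, from waiting on specific blocks to waiting on a clearance count, can be illustrated with a toy model (invented names, not the kernel's balance_dirty_pages implementation):

```c
/* Toy model of rate-based throttling: a dirtying task does not wait
 * for its own pages to be written; it records the global "cleaned"
 * counter, computes a target, and waits until writeback has cleaned
 * that many more pages - whichever pages those happen to be.
 * Names are invented for illustration. */
struct wb_model {
	long cleaned;		/* total pages cleaned, system-wide */
};

/* Called when a task dirties 'dirtied' pages: pick a target based on
 * how much total cleaning should happen before it may continue. */
long throttle_target(const struct wb_model *wb, long dirtied)
{
	return wb->cleaned + dirtied;
}

/* The task may proceed once the global counter passes its target;
 * there is no dependency on any specific page or backing device,
 * so no deadlock on a page that can never be cleaned. */
int may_proceed(const struct wb_model *wb, long target)
{
	return wb->cleaned >= target;
}
```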


>
>> If it could be changed
>> to just schedule the IO without waiting for it then I think this
>> would be safe to be called in any FS allocation context. It already
>> uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
>> if the lock is held.
>
> We could, but then we have the same problem as the inode cache -
> there's no indication of progress going back to the memory reclaim
> subsystem, nor is reclaim able to throttle memory allocation back to
> the rate at which reclaim is making progress.
>
> There's feedback loops all throughout the XFS reclaim code - it's
> designed specifically that way - I made changes to the shrinker
> infrastructure years ago to enable this. It's no different to the
> dirty page throttling that was done at roughly the same time -
> that's also one big feedback loop controlled by the rate at which
> pages can be cleaned. Indeed, it was based on the same
> premise as all the XFS shrinker code: in steady state conditions
> we can't allocate a resource faster than we can reclaim it, so we
> need to make reclaim as efficient as possible...

You seem to be referring here to the same change that I referred to
above, but seem to be seeing it from a different perspective.

Waiting for inodes to be freed is important. Waiting for any one
specific inode to be freed is dangerous.

>
>> I think you/we would end up with a much simpler system if instead of
>> focussing on the places where GFP_NOFS is used, we focus on places where
>> __GFP_FS is tested, and try to remove them. If we get rid of enough of
>> them the remainder could just use __GFP_IO.
>
> The problem with this is that a single kswapd thread can't keep up
> with all of the allocation pressure that occurs. e.g. a 20-core
> intel CPU with local memory will be seen as a single node and so
> will have a single kswapd thread to do reclaim. There's a massive
> imbalance between maximum reclaim rate and maximum allocation rate
> in situations like this. If we want memory reclaim to run faster,
> we need to be able to do more work *now*, not defer it to a context with
> limited execution resources.

I agree it makes sense to do *work* locally on the core that sees the
need. What I want to avoid is *sleeping* locally.
How would it be to make evict_inode non-blocking? It would do as much work
as it can, which in many cases would presumably complete the whole task.
But if it cannot progress for some reason, it returns -EAGAIN and then
the rest gets queued for later.
Might that work, or is there often still lots of CPU work to be done
after things have been waited for?
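
A minimal sketch of what I have in mind (hypothetical names throughout; the real evict_inode has no such contract today):

```c
#include <errno.h>

/* Hypothetical non-blocking eviction: do as much work as possible
 * now; if forward progress would require sleeping, return -EAGAIN
 * and let the caller queue the inode for a later, sleepable context. */
struct inode_model {
	int would_block;	/* e.g. dirty data or a held lock */
	int evicted;
};

int try_evict(struct inode_model *inode)
{
	if (inode->would_block)
		return -EAGAIN;		/* caller defers, never sleeps */
	inode->evicted = 1;		/* the common, uncontended case */
	return 0;
}

/* Reclaim-side caller: never sleeps, just defers the hard cases
 * to a retry list processed later (e.g. by kswapd or a worker). */
int reclaim_one(struct inode_model *inode,
		struct inode_model **retry, int *nretry)
{
	int err = try_evict(inode);

	if (err == -EAGAIN)
		retry[(*nretry)++] = inode;
	return err;
}
```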

Thanks,
NeilBrown


>
> i.e. IMO deferring more work to a single reclaim thread per node is
> going to limit memory reclaim scalability and performance, not
> improve it.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]



2016-05-03 15:13:16

by Michal Hocko

Subject: Re: [PATCH 0/2] scop GFP_NOFS api

Hi,

On Sun 01-05-16 07:55:31, NeilBrown wrote:
[...]
> One particular problem with your process-context idea is that it isn't
> inherited across threads.
> Steve Whitehouse's example in gfs shows how allocation dependencies can
> even cross into user space.

Hmm, I am still not sure I understand that example completely but making
a dependency between direct reclaim and userspace can hardly work.
Especially when the direct reclaim might be sitting on top of a
hard-to-guess pile of locks. So unless I've missed anything, what Steve has
described is a clear NOFS context.

> A more localized one that I have seen is that NFSv4 sometimes needs to
> start up a state-management thread (particularly if the server
> restarted).
> It uses kthread_run(), which doesn't actually create the thread but asks
> kthreadd to do it. If NFS writeout is waiting for state management it
> would need to make sure that kthreadd runs in allocation context to
> avoid deadlock.
> I feel that I've forgotten some important detail here and this might
> have been fixed somehow, but the point still stands that the allocation
> context can cross from thread to thread and can effectively become
> anything and everything.

Not sure I understand your point here, but relying on kthread_run
from GFP_NOFS context has always been deadlock prone, with or without
the scope GFP_NOFS semantic. Similarly, relying on a work item which
doesn't have a dedicated WQ_MEM_RECLAIM WQ is deadlock prone. You
simply shouldn't do that.

> It is OK to wait for memory to be freed. It is not OK to wait for any
> particular piece of memory to be freed because you don't always know who
> is waiting for you, or who you really are waiting on to free that
> memory.
>
> Whenever trying to free memory I think you need to do best-effort
> without blocking.

I agree with that. Or at least you have to wait on something that is
_guaranteed_ to make a forward progress. I am not really that sure this
is easy to achieve with the current code base.

--
Michal Hocko
SUSE Labs

2016-05-03 23:26:48

by NeilBrown

Subject: Re: [PATCH 0/2] scop GFP_NOFS api

On Wed, May 04 2016, Michal Hocko wrote:

> Hi,
>
> On Sun 01-05-16 07:55:31, NeilBrown wrote:
> [...]
>> One particular problem with your process-context idea is that it isn't
>> inherited across threads.
>> Steve Whitehouse's example in gfs shows how allocation dependencies can
>> even cross into user space.
>
> Hmm, I am still not sure I understand that example completely but making
> a dependency between direct reclaim and userspace can hardly work.

No it can't. Specifically: if direct reclaim blocks on user-space that
must be a problem.
I think the point of this example is that some filesystem things can
block on user-space in ways that are very hard to encode with flags,
as they are multi-level and indirect.
So the conclusion (my conclusion) is that direct reclaim mustn't block.

When I was working on deadlock avoidance in loop-back NFS I went down
the path of adding GFP flags and extended the PF_FSTRANS flag and got it
working (I think) but no-one liked it. It was way too intrusive.

Somehow I landed on the idea of making nfs_release_page non-blocking
and everything suddenly became much much simpler. Problems evaporated.

NFS has a distinct advantage here. The "Close-to-open" cache semantic
means that all dirty pages must be flushed on last close. So when
->evict_inode is finally called there is nothing much to do - just free
everything up. So I could fix NFS without worrying about (or even
noticing) evict_inode.


> Especially when the direct reclaim might be sitting on top of a
> hard-to-guess pile of locks. So unless I've missed anything, what Steve has
> described is a clear NOFS context.
>
>> A more localized one that I have seen is that NFSv4 sometimes needs to
>> start up a state-management thread (particularly if the server
>> restarted).
>> It uses kthread_run(), which doesn't actually create the thread but asks
>> kthreadd to do it. If NFS writeout is waiting for state management it
>> would need to make sure that kthreadd runs in allocation context to
>> avoid deadlock.
>> I feel that I've forgotten some important detail here and this might
>> have been fixed somehow, but the point still stands that the allocation
>> context can cross from thread to thread and can effectively become
>> anything and everything.
>
> Not sure I understand your point here, but relying on kthread_run
> from GFP_NOFS context has always been deadlock prone, with or without
> the scope GFP_NOFS semantic. Similarly, relying on a work item which
> doesn't have a dedicated WQ_MEM_RECLAIM WQ is deadlock prone. You
> simply shouldn't do that.

The point is really that saying "You shouldn't do that" isn't much good
when "that" is exactly what the fs developer wants to do and it seems to
work and never triggers a warning.

You can create lots of rules about what is or is not allowed, or
you can permit everything that is not explicitly forbidden (ideally
at compile time, but possibly at runtime with might_sleep or
lockdep).

I prefer the latter.

>
>> It is OK to wait for memory to be freed. It is not OK to wait for any
>> particular piece of memory to be freed because you don't always know who
>> is waiting for you, or who you really are waiting on to free that
>> memory.
>>
>> Whenever trying to free memory I think you need to do best-effort
>> without blocking.
>
> I agree with that. Or at least you have to wait on something that is
> _guaranteed_ to make a forward progress. I am not really that sure this
> is easy to achieve with the current code base.

I accept that it isn't "easy". But I claim that it isn't particularly
difficult either.

NeilBrown



2016-05-04 01:00:12

by Dave Chinner

Subject: Re: [PATCH 0/2] scop GFP_NOFS api

On Sun, May 01, 2016 at 08:19:44AM +1000, NeilBrown wrote:
> On Sat, Apr 30 2016, Dave Chinner wrote:
> > Indeed, blocking the superblock shrinker in reclaim is a key part of
> > balancing inode cache pressure in XFS. If the shrinker starts
> > hitting dirty inodes, it blocks on cleaning them, thereby slowing
> > the rate of allocation to that which inodes can be cleaned and
> > reclaimed. There are also background threads that walk ahead freeing
> > clean inodes, but we have to throttle direct reclaim in this manner
> > otherwise the allocation pressure vastly outweighs the ability to
> > reclaim inodes. If we don't balance this, inode allocation triggers
> > the OOM killer because reclaim keeps reporting "no progress being
> > made" because dirty inodes are skipped. By blocking on such inodes,
> > the shrinker makes progress (slowly) and reclaim sees that memory is
> > being freed and so continues without invoking the OOM killer...
>
> I'm very aware of the need to throttle allocation based on IO. I
> remember when NFS didn't quite get this right and filled up memory :-)
>
> balance_dirty_pages() used to force threads to wait on the write-out of
> one page for every page that they dirtied (or wait on 128 pages for every 128
> dirtied or whatever). This was exactly to provide the sort of
> throttling you are talking about.
>
> We don't do that any more. It was problematic. I don't recall all the
> reasons but I think that different backing devices having different
> clearance rates was part of the problem.

As the original architect of those changes (the IO-less dirty page
throttling was an evolution of the algorithm I developed for Irix
years before it was done in linux) I remember the reasons very well.

Mostly it was to prevent what we termed "IO breakdown" where so many
different foreground threads were dispatching writeback that it
turned dirty page writeback into random IO instead of being nice,
large sequential IOs.

> So now we monitor clearance rates and wait for some number of blocks to
> be written, rather than waiting for some specific blocks to be written.

Right - we have a limited pool of flusher threads dispatching IO
as efficiently as possible, and foreground dirtying processes wait for
that to do the work of cleaning pages.

> We should be able to do the same thing to balance dirty inodes as
> we do to balance dirty pages.

No, we can't. Dirty inode state is deeply tied into filesystem
implementations - it's intertwined with the journal operation in
many filesystems and can't be separated easily. Indeed, some
filesystems don't even use the VFS for tracking dirty inode state,
nor do they implement the ->write_inode method. Hence neither the VFS
nor the inode cache shrinkers can determine whether an inode is dirty
or, if it is, trigger writeback of it.

IOWs, inode caches are not unified in allocation, behaviour, design
or implementation like the page cache is, and so balancing dirty
inodes is likely not to be possible....

> >> If it could be changed
> >> to just schedule the IO without waiting for it then I think this
> >> would be safe to be called in any FS allocation context. It already
> >> uses a 'trylock' in xfs_dqlock_nowait() to avoid deadlocking
> >> if the lock is held.
> >
> > We could, but then we have the same problem as the inode cache -
> > there's no indication of progress going back to the memory reclaim
> > subsystem, nor is reclaim able to throttle memory allocation back to
> > the rate at which reclaim is making progress.
> >
> > There's feedback loops all throughout the XFS reclaim code - it's
> > designed specifically that way - I made changes to the shrinker
> > infrastructure years ago to enable this. It's no different to the
> > dirty page throttling that was done at roughly the same time -
> > that's also one big feedback loop controlled by the rate at which
> > pages can be cleaned. Indeed, it was based on the same
> > premise as all the XFS shrinker code: in steady state conditions
> > we can't allocate a resource faster than we can reclaim it, so we
> > need to make reclaim as efficient as possible...
>
> You seem to be referring here to the same change that I referred to
> above, but seem to be seeing it from a different perspective.

Sure, not many people have the same viewpoint as the person who had
to convince everyone else that it was a sound idea even before
prototypes were written...

> Waiting for inodes to be freed is important. Waiting for any one
> specific inode to be freed is dangerous.

Sure. But there's a difference between waiting on an inode with no
idea of when you'll be able to reclaim it, and waiting on IO
completion for an inode you've already guaranteed can complete
reclaim and to which you are the only context with access....

Not to mention that XFS also triggers an async inode reclaim worker
thread to do non-blocking reclaim of inodes in the background, so
while direct reclaim throttles (and hence prevents excessive
concurrent direct reclaim from trashing the working set of clean
inodes whilst ignoring dirty inodes) we still get appropriate
amounts of inode cache reclaim occurring...

> > The problem with this is that a single kswapd thread can't keep
> > up with all of the allocation pressure that occurs. e.g. a
> > 20-core intel CPU with local memory will be seen as a single
> > node and so will have a single kswapd thread to do reclaim.
> > There's a massive imbalance between maximum reclaim rate and
> > maximum allocation rate in situations like this. If we want
> > memory reclaim to run faster, we need to be able to do more work
> > *now*, not defer it to a context with limited execution
> > resources.
>
> I agree it makes sense to do *work* locally on the core that sees
> the need. What I want to avoid is *sleeping* locally.

With direct reclaim, sleeping locally is the only way we can
throttle it. If we want to avoid reclaim from doing this, then we
need to move to a reclaim model similar to the IO-less dirty page
throttling mechanism which directly limits reclaim concurrency and
so individual subsystem shrinkers don't need to do this to prevent
reclaim from trashing the working set.

Indeed, I've proposed this in the past, and been told by various
developers that it can't possibly work and, besides, fs developers
know nothing about mm/ design.... I've got better things to do than
bash my head against brick walls like this, even though I know such
algorithms both perform and scale for memory reclaim because that's
exactly where I got the original inspiration for IO-less dirty page
throttling from: the Irix memory reclaim algorithms....

> How would
> it be to make evict_inode non-blocking?

We can't without completely redesigning how iput_final() and evict()
work and order operations. Indeed, by the time we call
->evict_inode(), we've already marked the inode as I_FREEING,
removed the inode from the superblock inode list, dirty writeback
lists, the LRU lists and blocked waiting for any inode writeback in
progress to complete.

We'd also have to redesign the shrinker code, because it does bulk
operations to isolate inodes from the LRU onto temporary lists after
marking them I_FREEING, which then does evict() calls from a
separate context. We'd then have to change all that code to handle
inodes that can't be reclaimed and re-insert them into the LRUs
appropriately, and that would have to be managed in a way that we
don't end up live-locking the shrinker lru scanning on unreferenced
inodes that cannot be evicted. I'm also not sure whether
the dispose_list() function can add inodes back onto the LRU safely
because contexts calling evict() make the assumption the inode is no
longer on the LRU and don't need to access LRU locks at all...

Also, evict() is called when unmounting the filesystem, and so in
those cases we actually want it to block and do everything that is
necessary to free the inode, so this would introduce more complex
interactions depending on caller context....

> It would do as much work
> as it can, which in many cases would presumably complete the whole
> task. But if it cannot progress for some reason, it returns
> -EAGAIN and then the rest gets queued for later. Might that work,
> or is there often still lots of CPU work to be done after things
> have been waited for?

Freeing the inode occurs after it's been waited for, usually. i.e.
no wait, no inode being freed and no feedback for throttling of the
inode reclaim rate...

FWIW, I don't think making evict() non-blocking is going to be worth
the effort here. Making memory reclaim wait on a priority ordered
queue while asynchronous reclaim threads run reclaim as efficiently
as possible and wakes waiters as it frees the memory the waiters
require is a model that has been proven to work in the past, and
this appears to me to be the model you are advocating for. I agree
that direct reclaim needs to die and be replaced with something
entirely more predictable, controllable and less prone to deadlock
contexts - you just need to convince the mm developers that it will
perform and scale better than what we have now.

In the mean time, having a slightly more fine grained GFP_NOFS
equivalent context will allow us to avoid the worst of the current
GFP_NOFS problems with very little extra code.
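
For reference, the scoped context from patch 1 follows the save/restore pattern sketched below. This is a userspace model, not the patch itself: `current_flags` stands in for current->flags, and the flag and mask values are invented for illustration:

```c
/* Userspace model of a scoped NOFS API: a task-local flag set on
 * entry to an FS-critical section; every allocation made inside the
 * section implicitly behaves as GFP_NOFS, however deep the call
 * chain. Save/restore semantics make nesting safe. */
#define PF_MEMALLOC_NOFS	0x1u
#define GFP_FS			0x2u	/* stand-in for __GFP_FS */

unsigned int current_flags;		/* stand-in for current->flags */

unsigned int memalloc_nofs_save(void)
{
	unsigned int old = current_flags & PF_MEMALLOC_NOFS;

	current_flags |= PF_MEMALLOC_NOFS;
	return old;
}

void memalloc_nofs_restore(unsigned int old)
{
	current_flags = (current_flags & ~PF_MEMALLOC_NOFS) | old;
}

/* The allocator masks __GFP_FS while the scope is active, so callers
 * never need to pass GFP_NOFS explicitly. */
unsigned int effective_gfp(unsigned int gfp)
{
	if (current_flags & PF_MEMALLOC_NOFS)
		return gfp & ~GFP_FS;
	return gfp;
}
```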

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-05-06 03:21:06

by NeilBrown

Subject: Re: [PATCH 0/2] scop GFP_NOFS api

On Wed, May 04 2016, Dave Chinner wrote:

> FWIW, I don't think making evict() non-blocking is going to be worth
> the effort here. Making memory reclaim wait on a priority ordered
> queue while asynchronous reclaim threads run reclaim as efficiently
> as possible and wakes waiters as it frees the memory the waiters
> require is a model that has been proven to work in the past, and
> this appears to me to be the model you are advocating for. I agree
> that direct reclaim needs to die and be replaced with something
> entirely more predictable, controllable and less prone to deadlock
> contexts - you just need to convince the mm developers that it will
> perform and scale better than what we have now.
>
> In the mean time, having a slightly more fine grained GFP_NOFS
> equivalent context will allow us to avoid the worst of the current
> GFP_NOFS problems with very little extra code.

You have painted two pictures here. The first is an ideal which does
look a lot like the sort of outcome I was aiming for, but is more than a
small step away.
The second is a band-aid which would take us in exactly the wrong
direction. It takes an interface which people apparently find hard to
use (or easy to misuse) - the setting of __GFP_FS - and makes it more
complex. Certainly it would be more powerful, but I think it would also
be more misused.

So I ask myself: can we take some small steps towards 'A' and thereby
enable at least the functionality enabled by 'B'?

A core design principle for me is to enable filesystems to take control
of their own destiny. They should have the information available to
make the decisions they need to make, and the opportunity to carry them
out.

All the places where direct reclaim currently calls into filesystems
carry the 'gfp' flags so the file system can decide what to do, with one
exception: evict_inode. So my first proposal would be to rectify that.

- redefine .nr_cached_objects and .free_cached_objects so that, if they
are defined, they are responsible for s_dentry_lru and s_inode_lru.
e.g. super_cache_count *either* calls ->nr_cached_objects *or* makes
two calls to list_lru_shrink_count. This would require exporting
prune_dcache_sb and prune_icache_sb but otherwise should be a fairly
straightforward change.
If nr_cached_objects were defined, super_cache_scan would no longer
abort without __GFP_FS - that test would be left to the filesystem.

- Now any filesystem that wants to can stash its super_block pointer
in current->journal_info while doing memory allocations, and abort
any reclaim attempts (release_page, shrinker, nr_cached_objects) if
and only if current->journal_info == "my superblock". This can be
done without the core mm code knowing any more than it already does.

- A more sophisticated filesystem might import much of the code for
prune_icache_sb() - either by copy/paste or by exporting some vfs
internals - and then store an inode pointer in current->journal_info
and only abort reclaim which touches that inode.

- if a filesystem happens to know that it will never block in any of
these reclaim calls, it can always allow prune_dcache_sb to run, and
never needs to use GFP_NOFS. I think NFS might be close to being
able to do this as it flushes everything on last-close. But that is
something that NFS developers can care about (or not) quite
independently from mm people.

- Maybe some fs developer will try to enable free_cached_objects to do
as much work as possible for every inode, but never deadlock. It
could do its own fs-specific deadlock detection, or could queue work
to a work queue and wait a limited time for it. Or something.
If some filesystem developer comes up with something that works
really well, developers of other filesystems might copy it - or not
as they choose.
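
The superblock-cookie idea above might look roughly like this (a userspace model; the `my_fs_*` names are invented, and `current_journal_info` stands in for current->journal_info):

```c
/* Model of the ->journal_info cookie idea: the filesystem stashes
 * its superblock in a task-local slot around allocations, and its
 * own reclaim hooks bail out when they find that cookie, avoiding
 * recursion into the filesystem that is mid-allocation. */
struct sb_model {
	int reclaim_runs;
};

void *current_journal_info;	/* stand-in for current->journal_info */

/* Reclaim hook (releasepage/shrinker/...): only compare the pointer,
 * never dereference it - it may belong to another filesystem. */
int my_fs_shrink(struct sb_model *sb)
{
	if (current_journal_info == sb)
		return 0;	/* reentry from our own allocation: skip */
	sb->reclaim_runs++;
	return 1;		/* safe context: actually reclaim */
}

/* Allocation path: mark the scope, allocate (which may enter direct
 * reclaim and call my_fs_shrink), then restore the previous value. */
int my_fs_alloc(struct sb_model *sb)
{
	void *saved = current_journal_info;
	int did_reclaim;

	current_journal_info = sb;
	did_reclaim = my_fs_shrink(sb);	/* simulate reclaim recursion */
	current_journal_info = saved;
	return did_reclaim;
}
```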

Maybe ->journal_info isn't perfect for this. It is currently only safe
for reclaim code to compare it against a known value. It is not safe to
dereference it to see if it points to a known value. That could possibly be
cleaned up, or another task_struct field could be provided for
filesystems to track their state. Or do you find a task_struct field
unacceptable, and is there some reason that an explicitly passed cookie
is superior?

My key point is that we shouldn't try to plumb some new abstraction
through the MM code so there is a new pattern for all filesystems to
follow. Rather the mm/vfs should get out of the filesystems' way as much
as possible and let them innovate independently.

Thanks for your time,
NeilBrown

