2016-12-15 14:07:51

by Michal Hocko

Subject: [PATCH 0/9 v2] scope GFP_NOFS api

Hi,
I have posted the previous version here [1]. Since then I have added
support to suppress reclaim lockdep warnings (__GFP_NOLOCKDEP) to allow
removing GFP_NOFS usage motivated by lockdep false positives. On top
of that I've tried to convert a few KM_NOFS usages to the new flag in
the xfs code base. This would need a review from somebody familiar with
xfs of course.

Then I've added the new scope API to the jbd2/ext4 transaction code and
reverted some explicit GFP_NOFS usages which are now covered by the scope.
This also needs a deep review from the ext4 developers. I have some more
patches which remove further explicit GFP_NOFS users but those are not
ready yet. I would really appreciate it if developers of other filesystems
joined in here as well. Maybe the ext4 parts can help show how to start.
Especially btrfs, which uses GFP_NOFS a lot (and without a good reason
in many cases, I suspect).
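
For anybody who wants to join: the intended usage pattern is really simple.
A minimal sketch (the helpers live in linux/sched.h; everything else below
is a made-up illustration, not code from this series):

        unsigned int nofs_flags;

        /*
         * From here on every allocation from this task implicitly drops
         * __GFP_FS, so plain GFP_KERNEL can be used even while holding
         * a lock which is also taken in the reclaim path.
         */
        nofs_flags = memalloc_nofs_save();

        ptr = kmalloc(size, GFP_KERNEL);        /* behaves like GFP_NOFS */

        /* the reclaim recursion unsafe scope ends here */
        memalloc_nofs_restore(nofs_flags);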

The patchset is based on linux-next (next-20161214).

I think GFP_NOIO should see the same cleanup eventually, but that is not
part of this patchset.

Any feedback is highly appreciated of course.

Diffstat says
fs/ext4/acl.c | 6 +++---
fs/ext4/extents.c | 8 ++++----
fs/ext4/resize.c | 4 ++--
fs/ext4/xattr.c | 4 ++--
fs/jbd2/journal.c | 7 +++++++
fs/jbd2/transaction.c | 11 +++++++++++
fs/xfs/kmem.c | 10 +++++-----
fs/xfs/kmem.h | 6 +++++-
fs/xfs/libxfs/xfs_btree.c | 2 +-
fs/xfs/libxfs/xfs_da_btree.c | 4 ++--
fs/xfs/xfs_aops.c | 6 +++---
fs/xfs/xfs_buf.c | 10 +++++-----
fs/xfs/xfs_dir2_readdir.c | 2 +-
fs/xfs/xfs_trans.c | 12 ++++++------
include/linux/gfp.h | 18 +++++++++++++++++-
include/linux/jbd2.h | 2 ++
include/linux/sched.h | 32 ++++++++++++++++++++++++++------
kernel/locking/lockdep.c | 6 +++++-
lib/radix-tree.c | 2 ++
mm/page_alloc.c | 8 +++++---
mm/vmscan.c | 6 +++---
21 files changed, 117 insertions(+), 49 deletions(-)

Shortlog:
Michal Hocko (9):
lockdep: allow to disable reclaim lockup detection
xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives
xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS
mm: introduce memalloc_nofs_{save,restore} API
xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*
jbd2: mark the transaction context with the scope GFP_NOFS context
jbd2: make the whole kjournald2 kthread NOFS safe
Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"
Revert "ext4: fix wrong gfp type under transaction"


[1] http://lkml.kernel.org/r/[email protected]




2016-12-15 14:07:58

by Michal Hocko

Subject: [PATCH 6/9] jbd2: mark the transaction context with the scope GFP_NOFS context

From: Michal Hocko <[email protected]>

Now that we have the memalloc_nofs_{save,restore} API we can mark the whole
transaction context as implicitly GFP_NOFS. All allocations will
automatically inherit GFP_NOFS this way. This means that we do not have
to mark any of those requests with GFP_NOFS, and moreover all the
ext4_kv[mz]alloc(GFP_NOFS) calls are also safe now, because even the
hardcoded GFP_KERNEL allocations deep inside vmalloc will be NOFS.
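
To illustrate the effect (a sketch only, not part of this patch; the
allocation site and variables are made up), a caller with an open handle
does not need any special gfp flags anymore:

        handle = ext4_journal_start(inode, EXT4_HT_MISC, credits);
        /* PF_MEMALLOC_NOFS is set for the task from now on */
        ptr = kmalloc(size, GFP_KERNEL);        /* __GFP_FS is dropped implicitly */
        ...
        ext4_journal_stop(handle);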

Signed-off-by: Michal Hocko <[email protected]>
---
fs/jbd2/transaction.c | 11 +++++++++++
include/linux/jbd2.h | 2 ++
2 files changed, 13 insertions(+)

diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index e1652665bd93..35a5d3d76182 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -388,6 +388,11 @@ static int start_this_handle(journal_t *journal, handle_t *handle,

rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_);
jbd2_journal_free_transaction(new_transaction);
+ /*
+ * Make sure that no allocations done while the transaction is
+ * open are going to recurse back to the fs layer.
+ */
+ handle->saved_alloc_context = memalloc_nofs_save();
return 0;
}

@@ -466,6 +471,7 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
trace_jbd2_handle_start(journal->j_fs_dev->bd_dev,
handle->h_transaction->t_tid, type,
line_no, nblocks);
+
return handle;
}
EXPORT_SYMBOL(jbd2__journal_start);
@@ -1760,6 +1766,11 @@ int jbd2_journal_stop(handle_t *handle)
if (handle->h_rsv_handle)
jbd2_journal_free_reserved(handle->h_rsv_handle);
free_and_exit:
+ /*
+ * The scope of the GFP_NOFS context is over here and so we can
+ * restore the original alloc context.
+ */
+ memalloc_nofs_restore(handle->saved_alloc_context);
jbd2_free_handle(handle);
return err;
}
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index dfaa1f4dcb0c..606b6bce3a5b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -491,6 +491,8 @@ struct jbd2_journal_handle

unsigned long h_start_jiffies;
unsigned int h_requested_credits;
+
+ unsigned int saved_alloc_context;
};


--
2.10.2


2016-12-15 14:07:56

by Michal Hocko

Subject: [PATCH 5/9] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

From: Michal Hocko <[email protected]>

kmem_zalloc_large and _xfs_buf_map_pages use the memalloc_noio_{save,restore}
API to prevent reclaim recursion into the fs, because vmalloc can
invoke unconditional GFP_KERNEL allocations and these functions might be
called from NOFS contexts. memalloc_noio_save, however, enforces a
GFP_NOIO context, which is even more restrictive than GFP_NOFS, and that
seems to be unnecessary. Let's use memalloc_nofs_{save,restore} instead as it
provides exactly what we need here - an implicit GFP_NOFS context.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/xfs/kmem.c | 10 +++++-----
fs/xfs/xfs_buf.c | 8 ++++----
2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index a76a05dae96b..d69ed5e76621 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
void *
kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
{
- unsigned noio_flag = 0;
+ unsigned nofs_flag = 0;
void *ptr;
gfp_t lflags;

@@ -80,14 +80,14 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
* context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
* the filesystem here and potentially deadlocking.
*/
- if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
- noio_flag = memalloc_noio_save();
+ if (flags & KM_NOFS)
+ nofs_flag = memalloc_nofs_save();

lflags = kmem_flags_convert(flags);
ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);

- if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
- memalloc_noio_restore(noio_flag);
+ if (flags & KM_NOFS)
+ memalloc_nofs_restore(nofs_flag);

return ptr;
}
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index f31ae592dcae..5c6f9bd4d8be 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -441,17 +441,17 @@ _xfs_buf_map_pages(
bp->b_addr = NULL;
} else {
int retried = 0;
- unsigned noio_flag;
+ unsigned nofs_flag;

/*
* vm_map_ram() will allocate auxillary structures (e.g.
* pagetables) with GFP_KERNEL, yet we are likely to be under
* GFP_NOFS context here. Hence we need to tell memory reclaim
- * that we are in such a context via PF_MEMALLOC_NOIO to prevent
+ * that we are in such a context via PF_MEMALLOC_NOFS to prevent
* memory reclaim re-entering the filesystem here and
* potentially deadlocking.
*/
- noio_flag = memalloc_noio_save();
+ nofs_flag = memalloc_nofs_save();
do {
bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
-1, PAGE_KERNEL);
@@ -459,7 +459,7 @@ _xfs_buf_map_pages(
break;
vm_unmap_aliases();
} while (retried++ <= 1);
- memalloc_noio_restore(noio_flag);
+ memalloc_nofs_restore(nofs_flag);

if (!bp->b_addr)
return -ENOMEM;
--
2.10.2


2016-12-15 14:08:01

by Michal Hocko

Subject: [PATCH 9/9] Revert "ext4: fix wrong gfp type under transaction"

From: Michal Hocko <[email protected]>

This reverts commit 216553c4b7f3e3e2beb4981cddca9b2027523928. Now that
the transaction context uses memalloc_nofs_save and all allocations
within this context inherit GFP_NOFS automatically, there is no
reason to mark specific allocations explicitly.

This patch should not introduce any functional change. The main point
of this change is to reduce explicit GFP_NOFS usage inside ext4 code
to make the review of the remaining usage easier.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/ext4/acl.c | 6 +++---
fs/ext4/extents.c | 2 +-
fs/ext4/resize.c | 4 ++--
fs/ext4/xattr.c | 4 ++--
4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
index fd389935ecd1..9e98092c2a4b 100644
--- a/fs/ext4/acl.c
+++ b/fs/ext4/acl.c
@@ -32,7 +32,7 @@ ext4_acl_from_disk(const void *value, size_t size)
return ERR_PTR(-EINVAL);
if (count == 0)
return NULL;
- acl = posix_acl_alloc(count, GFP_NOFS);
+ acl = posix_acl_alloc(count, GFP_KERNEL);
if (!acl)
return ERR_PTR(-ENOMEM);
for (n = 0; n < count; n++) {
@@ -94,7 +94,7 @@ ext4_acl_to_disk(const struct posix_acl *acl, size_t *size)

*size = ext4_acl_size(acl->a_count);
ext_acl = kmalloc(sizeof(ext4_acl_header) + acl->a_count *
- sizeof(ext4_acl_entry), GFP_NOFS);
+ sizeof(ext4_acl_entry), GFP_KERNEL);
if (!ext_acl)
return ERR_PTR(-ENOMEM);
ext_acl->a_version = cpu_to_le32(EXT4_ACL_VERSION);
@@ -159,7 +159,7 @@ ext4_get_acl(struct inode *inode, int type)
}
retval = ext4_xattr_get(inode, name_index, "", NULL, 0);
if (retval > 0) {
- value = kmalloc(retval, GFP_NOFS);
+ value = kmalloc(retval, GFP_KERNEL);
if (!value)
return ERR_PTR(-ENOMEM);
retval = ext4_xattr_get(inode, name_index, "", value, retval);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index ef815eb72389..c901d89f0038 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2933,7 +2933,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
le16_to_cpu(path[k].p_hdr->eh_entries)+1;
} else {
path = kzalloc(sizeof(struct ext4_ext_path) * (depth + 1),
- GFP_NOFS);
+ GFP_KERNEL);
if (path == NULL) {
ext4_journal_stop(handle);
return -ENOMEM;
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index cf681004b196..e121f4e048b8 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -816,7 +816,7 @@ static int add_new_gdb(handle_t *handle, struct inode *inode,

n_group_desc = ext4_kvmalloc((gdb_num + 1) *
sizeof(struct buffer_head *),
- GFP_NOFS);
+ GFP_KERNEL);
if (!n_group_desc) {
err = -ENOMEM;
ext4_warning(sb, "not enough memory for %lu groups",
@@ -943,7 +943,7 @@ static int reserve_backup_gdb(handle_t *handle, struct inode *inode,
int res, i;
int err;

- primary = kmalloc(reserved_gdb * sizeof(*primary), GFP_NOFS);
+ primary = kmalloc(reserved_gdb * sizeof(*primary), GFP_KERNEL);
if (!primary)
return -ENOMEM;

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 5a94fa52b74f..172317462238 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -875,7 +875,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,

unlock_buffer(bs->bh);
ea_bdebug(bs->bh, "cloning");
- s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
+ s->base = kmalloc(bs->bh->b_size, GFP_KERNEL);
error = -ENOMEM;
if (s->base == NULL)
goto cleanup;
@@ -887,7 +887,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
}
} else {
/* Allocate a buffer where we construct the new block. */
- s->base = kzalloc(sb->s_blocksize, GFP_NOFS);
+ s->base = kzalloc(sb->s_blocksize, GFP_KERNEL);
/* assert(header == s->base) */
error = -ENOMEM;
if (s->base == NULL)
--
2.10.2


2016-12-15 14:08:00

by Michal Hocko

Subject: [PATCH 8/9] Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"

From: Michal Hocko <[email protected]>

This reverts commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 because
sb_getblk_gfp is not really needed. The call chain is
sb_getblk
  __getblk_gfp
    __getblk_slow
      grow_buffers
        grow_dev_page
          gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp

so __GFP_FS is cleared unconditionally and therefore the above commit
didn't have any real effect.

This patch should not introduce any functional change. The main point
of this change is to reduce explicit GFP_NOFS usage inside ext4 code to
make the review of the remaining usage easier.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/ext4/extents.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index b1f8416923ab..ef815eb72389 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -518,7 +518,7 @@ __read_extent_tree_block(const char *function, unsigned int line,
struct buffer_head *bh;
int err;

- bh = sb_getblk_gfp(inode->i_sb, pblk, __GFP_MOVABLE | GFP_NOFS);
+ bh = sb_getblk(inode->i_sb, pblk);
if (unlikely(!bh))
return ERR_PTR(-ENOMEM);

@@ -1096,7 +1096,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
err = -EFSCORRUPTED;
goto cleanup;
}
- bh = sb_getblk_gfp(inode->i_sb, newblock, __GFP_MOVABLE | GFP_NOFS);
+ bh = sb_getblk(inode->i_sb, newblock);
if (unlikely(!bh)) {
err = -ENOMEM;
goto cleanup;
@@ -1290,7 +1290,7 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode,
if (newblock == 0)
return err;

- bh = sb_getblk_gfp(inode->i_sb, newblock, __GFP_MOVABLE | GFP_NOFS);
+ bh = sb_getblk(inode->i_sb, newblock);
if (unlikely(!bh))
return -ENOMEM;
lock_buffer(bh);
--
2.10.2


2016-12-15 14:07:59

by Michal Hocko

Subject: [PATCH 7/9] jbd2: make the whole kjournald2 kthread NOFS safe

From: Michal Hocko <[email protected]>

kjournald2 is central to the transaction commit processing. As such, any
potential allocation from this kernel thread has to be GFP_NOFS. Make
sure to mark the whole kernel thread GFP_NOFS via memalloc_nofs_save.

Suggested-by: Jan Kara <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
fs/jbd2/journal.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 8ed971eeab44..6dad8c5d6ddf 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -206,6 +206,13 @@ static int kjournald2(void *arg)
wake_up(&journal->j_wait_done_commit);

/*
+ * Make sure that no allocations from this kernel thread will ever recurse
+ * to the fs layer because we are responsible for the transaction commit
+ * and any fs involvement might get stuck waiting for the transaction commit.
+ */
+ memalloc_nofs_save();
+
+ /*
* And now, wait forever for commit wakeup events.
*/
write_lock(&journal->j_state_lock);
--
2.10.2


2016-12-15 14:07:55

by Michal Hocko

Subject: [PATCH 3/9] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

From: Michal Hocko <[email protected]>

xfs defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
some time ago. We would like to make this concept more generic and use
it for other filesystems as well. Let's start by giving the flag a
more generic name, PF_MEMALLOC_NOFS, which is in line with the existing
PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
contexts. Replace all PF_FSTRANS usage in the xfs code in the first
step, before we introduce a full API for it, as xfs uses the flag
directly anyway.

This patch doesn't introduce any functional change.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/xfs/kmem.c | 4 ++--
fs/xfs/kmem.h | 2 +-
fs/xfs/libxfs/xfs_btree.c | 2 +-
fs/xfs/xfs_aops.c | 6 +++---
fs/xfs/xfs_trans.c | 12 ++++++------
include/linux/sched.h | 2 ++
6 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
index 339c696bbc01..a76a05dae96b 100644
--- a/fs/xfs/kmem.c
+++ b/fs/xfs/kmem.c
@@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
* context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
* the filesystem here and potentially deadlocking.
*/
- if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+ if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
noio_flag = memalloc_noio_save();

lflags = kmem_flags_convert(flags);
ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);

- if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+ if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
memalloc_noio_restore(noio_flag);

return ptr;
diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index ea3984091d58..e40ddd12900b 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -51,7 +51,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
lflags = GFP_ATOMIC | __GFP_NOWARN;
} else {
lflags = GFP_KERNEL | __GFP_NOWARN;
- if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
+ if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
lflags &= ~__GFP_FS;
}

diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index 21e6a6ab6b9a..a2672ba4dc33 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2866,7 +2866,7 @@ xfs_btree_split_worker(
struct xfs_btree_split_args *args = container_of(work,
struct xfs_btree_split_args, work);
unsigned long pflags;
- unsigned long new_pflags = PF_FSTRANS;
+ unsigned long new_pflags = PF_MEMALLOC_NOFS;

/*
* we are in a transaction context here, but may also be doing work
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 0f56fcd3a5d5..61ca9f9c5a12 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
* We hand off the transaction to the completion thread now, so
* clear the flag here.
*/
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
return 0;
}

@@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
* thus we need to mark ourselves as being in a transaction manually.
* Similarly for freeze protection.
*/
- current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);

/* we abort the update if there was an IO error */
@@ -1015,7 +1015,7 @@ xfs_do_writepage(
* Given that we do not allow direct reclaim to call us, we should
* never be called while in a filesystem transaction.
*/
- if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
+ if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
goto redirty;

/*
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 70f42ea86dfb..f5969c8274fc 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -134,7 +134,7 @@ xfs_trans_reserve(
bool rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;

/* Mark this thread as being in a transaction */
- current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);

/*
* Attempt to reserve the needed disk blocks by decrementing
@@ -144,7 +144,7 @@ xfs_trans_reserve(
if (blocks > 0) {
error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
if (error != 0) {
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
return -ENOSPC;
}
tp->t_blk_res += blocks;
@@ -221,7 +221,7 @@ xfs_trans_reserve(
tp->t_blk_res = 0;
}

- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);

return error;
}
@@ -914,7 +914,7 @@ __xfs_trans_commit(

xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);

- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
xfs_trans_free(tp);

/*
@@ -944,7 +944,7 @@ __xfs_trans_commit(
if (commit_lsn == -1 && !error)
error = -EIO;
}
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
xfs_trans_free(tp);

@@ -998,7 +998,7 @@ xfs_trans_cancel(
xfs_log_done(mp, tp->t_ticket, NULL, false);

/* mark this thread as no longer being in a transaction */
- current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
+ current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);

xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
xfs_trans_free(tp);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4d1905245c7a..baffd340ea82 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2320,6 +2320,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK 0x80000000 /* this thread called freeze_processes and should not be frozen */

+#define PF_MEMALLOC_NOFS PF_FSTRANS /* Transition to a more generic GFP_NOFS scope semantic */
+
/*
* Only the _current_ task can read/write to tsk->flags, but other
* tasks can access tsk->flags in readonly mode for example
--
2.10.2


2016-12-15 14:07:56

by Michal Hocko

Subject: [PATCH 4/9] mm: introduce memalloc_nofs_{save,restore} API

From: Michal Hocko <[email protected]>

GFP_NOFS context is currently used for the following 5 reasons
- to prevent deadlocks when a lock held by the allocation
  context would be needed during memory reclaim
- to prevent stack overflows during reclaim because
  the allocation is performed from a deep context already
- to prevent lockups when the allocation context depends on
  other reclaimers to make forward progress indirectly
- just in case, because this would be safe from the fs POV
- to silence lockdep false positives

Unfortunately, overuse of this allocation context brings some problems
for the MM. Memory reclaim is much weaker (especially during heavy FS
metadata workloads), and the OOM killer cannot be invoked because the MM
layer doesn't have enough information about how much memory is freeable
by the FS layer.

In many cases it is far from clear why the weaker context is even used
and so it might be used unnecessarily. We would like to get rid of
those as much as possible. One way to do that is to use the flag in
scopes rather than isolated cases. Such a scope is declared when really
necessary, tracked per task and all the allocation requests from within
the context will simply inherit the GFP_NOFS semantic.

Not only is this easier to understand and maintain, because there are
far fewer problematic contexts than specific allocation requests, it
also helps code paths where the FS layer interacts with other layers (e.g.
crypto, security modules, MM etc...) and there is no easy way to convey
the allocation context between the layers.

Introduce the memalloc_nofs_{save,restore} API to control the scope of
the GFP_NOFS allocation context. This basically copies the
memalloc_noio_{save,restore} API we have for the other restricted
allocation context, GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and
it is just an alias for PF_FSTRANS, which has been xfs specific until recently.
There are no PF_FSTRANS users anymore so let's just drop it.

PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
implicitly, the same way PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
is renamed to current_gfp_context because it now cares about both
PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. xfs code paths preserve
their semantic. kmem_flags_convert() doesn't need to evaluate the flag
anymore.

This patch shouldn't introduce any functional changes.

Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
usage as much as possible and only use properly documented
memalloc_nofs_{save,restore} checkpoints where they are appropriate.
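
Note that the save/restore pair nests safely, because the saved value
carries the state of the outer scope. An illustrative sketch (not part of
the patch):

        unsigned int outer, inner;

        outer = memalloc_nofs_save();   /* outer == 0, PF_MEMALLOC_NOFS set */
        ...
        inner = memalloc_nofs_save();   /* inner == PF_MEMALLOC_NOFS */
        ...
        memalloc_nofs_restore(inner);   /* flag stays set - outer scope lives on */
        ...
        memalloc_nofs_restore(outer);   /* flag finally cleared */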

Signed-off-by: Michal Hocko <[email protected]>
---
fs/xfs/kmem.h | 2 +-
include/linux/gfp.h | 8 ++++++++
include/linux/sched.h | 34 ++++++++++++++++++++++++++--------
kernel/locking/lockdep.c | 2 +-
mm/page_alloc.c | 8 +++++---
mm/vmscan.c | 6 +++---
6 files changed, 44 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index e40ddd12900b..afaa3e059076 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -51,7 +51,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
lflags = GFP_ATOMIC | __GFP_NOWARN;
} else {
lflags = GFP_KERNEL | __GFP_NOWARN;
- if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
+ if (flags & KM_NOFS)
lflags &= ~__GFP_FS;
}

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 1a934383cc20..bfe53d95c25b 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -217,8 +217,16 @@ struct vm_area_struct;
*
* GFP_NOIO will use direct reclaim to discard clean pages or slab pages
* that do not require the starting of any physical IO.
+ * Please try to avoid using this flag directly and instead use
+ * memalloc_noio_{save,restore} to mark the whole scope which cannot
+ * perform any IO with a short explanation why. All allocation requests
+ * will inherit GFP_NOIO implicitly.
*
* GFP_NOFS will use direct reclaim but will not use any filesystem interfaces.
+ * Please try to avoid using this flag directly and instead use
+ * memalloc_nofs_{save,restore} to mark the whole scope which cannot/shouldn't
+ * recurse into the FS layer with a short explanation why. All allocation
+ * requests will inherit GFP_NOFS implicitly.
*
* GFP_USER is for userspace allocations that also need to be directly
* accessibly by the kernel or hardware. It is typically used by hardware
diff --git a/include/linux/sched.h b/include/linux/sched.h
index baffd340ea82..1c9fbcbcfcc8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2307,9 +2307,9 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
#define PF_USED_ASYNC 0x00004000 /* used async_schedule*(), used by module init */
#define PF_NOFREEZE 0x00008000 /* this thread should not be frozen */
#define PF_FROZEN 0x00010000 /* frozen for system suspend */
-#define PF_FSTRANS 0x00020000 /* inside a filesystem transaction */
-#define PF_KSWAPD 0x00040000 /* I am kswapd */
-#define PF_MEMALLOC_NOIO 0x00080000 /* Allocating memory without IO involved */
+#define PF_KSWAPD 0x00020000 /* I am kswapd */
+#define PF_MEMALLOC_NOFS 0x00040000 /* All allocation requests will inherit GFP_NOFS */
+#define PF_MEMALLOC_NOIO 0x00080000 /* All allocation requests will inherit GFP_NOIO */
#define PF_LESS_THROTTLE 0x00100000 /* Throttle me less: I clean memory */
#define PF_KTHREAD 0x00200000 /* I am a kernel thread */
#define PF_RANDOMIZE 0x00400000 /* randomize virtual address space */
@@ -2320,8 +2320,6 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK 0x80000000 /* this thread called freeze_processes and should not be frozen */

-#define PF_MEMALLOC_NOFS PF_FSTRANS /* Transition to a more generic GFP_NOFS scope semantic */
-
/*
* Only the _current_ task can read/write to tsk->flags, but other
* tasks can access tsk->flags in readonly mode for example
@@ -2347,13 +2345,21 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
#define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
#define used_math() tsk_used_math(current)

-/* __GFP_IO isn't allowed if PF_MEMALLOC_NOIO is set in current->flags
- * __GFP_FS is also cleared as it implies __GFP_IO.
+/*
+ * Applies per-task gfp context to the given allocation flags.
+ * PF_MEMALLOC_NOIO implies GFP_NOIO
+ * PF_MEMALLOC_NOFS implies GFP_NOFS
*/
-static inline gfp_t memalloc_noio_flags(gfp_t flags)
+static inline gfp_t current_gfp_context(gfp_t flags)
{
+ /*
+ * NOIO implies both NOIO and NOFS and it is a weaker context
+ * so always make sure it takes precedence
+ */
if (unlikely(current->flags & PF_MEMALLOC_NOIO))
flags &= ~(__GFP_IO | __GFP_FS);
+ else if (unlikely(current->flags & PF_MEMALLOC_NOFS))
+ flags &= ~__GFP_FS;
return flags;
}

@@ -2369,6 +2375,18 @@ static inline void memalloc_noio_restore(unsigned int flags)
current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
}

+static inline unsigned int memalloc_nofs_save(void)
+{
+ unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
+ current->flags |= PF_MEMALLOC_NOFS;
+ return flags;
+}
+
+static inline void memalloc_nofs_restore(unsigned int flags)
+{
+ current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
+}
+
/* Per-process atomic flags. */
#define PFA_NO_NEW_PRIVS 0 /* May not gain new privileges. */
#define PFA_SPREAD_PAGE 1 /* Spread page cache over cpuset */
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 5684682a9ef8..553bceeedec4 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2870,7 +2870,7 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
return;

/* We're only interested __GFP_FS allocations for now */
- if (!(gfp_mask & __GFP_FS))
+ if (!(gfp_mask & __GFP_FS) || (curr->flags & PF_MEMALLOC_NOFS))
return;

/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2c6d5f64feca..e701be6b930a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3803,10 +3803,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
goto out;

/*
- * Runtime PM, block IO and its error handling path can deadlock
- * because I/O on the device might not complete.
+ * Apply scoped allocation constraints. This is mainly about
+ * GFP_NOFS resp. GFP_NOIO which has to be inherited for all
+ * allocation requests from a particular context which has
+ * been marked by memalloc_no{fs,io}_{save,restore}
*/
- alloc_mask = memalloc_noio_flags(gfp_mask);
+ alloc_mask = current_gfp_context(gfp_mask);
ac.spread_dirty_pages = false;

/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6aa5b01d3e75..4ea6b610f20e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2949,7 +2949,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
unsigned long nr_reclaimed;
struct scan_control sc = {
.nr_to_reclaim = SWAP_CLUSTER_MAX,
- .gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+ .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
.reclaim_idx = gfp_zone(gfp_mask),
.order = order,
.nodemask = nodemask,
@@ -3029,7 +3029,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
int nid;
struct scan_control sc = {
.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
- .gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
+ .gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
.reclaim_idx = MAX_NR_ZONES - 1,
.target_mem_cgroup = memcg,
@@ -3723,7 +3723,7 @@ static int __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned in
int classzone_idx = gfp_zone(gfp_mask);
struct scan_control sc = {
.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
- .gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
+ .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
.order = order,
.priority = NODE_RECLAIM_PRIORITY,
.may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
--
2.10.2


2016-12-15 14:07:53

by Michal Hocko

Subject: [PATCH 1/9] lockdep: allow to disable reclaim lockup detection

From: Michal Hocko <[email protected]>

The current implementation of the reclaim lockup detection can lead to
false positives, and those do happen in practice; they usually lead to
tweaking the code to silence lockdep by using GFP_NOFS even though the
context can use __GFP_FS just fine. See
http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.

=================================
[ INFO: inconsistent lock state ]
4.5.0-rc2+ #4 Tainted: G O
---------------------------------
inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

(&xfs_nondir_ilock_class){++++-+}, at: [<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]

{RECLAIM_FS-ON-R} state was registered at:
[<ffffffff8110f369>] mark_held_locks+0x79/0xa0
[<ffffffff81113a43>] lockdep_trace_alloc+0xb3/0x100
[<ffffffff81224623>] kmem_cache_alloc+0x33/0x230
[<ffffffffa008acc1>] kmem_zone_alloc+0x81/0x120 [xfs]
[<ffffffffa005456e>] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
[<ffffffffa0053455>] __xfs_refcount_find_shared+0x75/0x580 [xfs]
[<ffffffffa00539e4>] xfs_refcount_find_shared+0x84/0xb0 [xfs]
[<ffffffffa005dcb8>] xfs_getbmap+0x608/0x8c0 [xfs]
[<ffffffffa007634b>] xfs_vn_fiemap+0xab/0xc0 [xfs]
[<ffffffff81244208>] do_vfs_ioctl+0x498/0x670
[<ffffffff81244459>] SyS_ioctl+0x79/0x90
[<ffffffff81847cd7>] entry_SYSCALL_64_fastpath+0x12/0x6f

CPU0
----
lock(&xfs_nondir_ilock_class);
<Interrupt>
lock(&xfs_nondir_ilock_class);

*** DEADLOCK ***

3 locks held by kswapd0/543:

stack backtrace:
CPU: 0 PID: 543 Comm: kswapd0 Tainted: G O 4.5.0-rc2+ #4

Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006

ffffffff82a34f10 ffff88003aa078d0 ffffffff813a14f9 ffff88003d8551c0
ffff88003aa07920 ffffffff8110ec65 0000000000000000 0000000000000001
ffff880000000001 000000000000000b 0000000000000008 ffff88003d855aa0
Call Trace:
[<ffffffff813a14f9>] dump_stack+0x4b/0x72
[<ffffffff8110ec65>] print_usage_bug+0x215/0x240
[<ffffffff8110ee85>] mark_lock+0x1f5/0x660
[<ffffffff8110e100>] ? print_shortest_lock_dependencies+0x1a0/0x1a0
[<ffffffff811102e0>] __lock_acquire+0xa80/0x1e50
[<ffffffff8122474e>] ? kmem_cache_alloc+0x15e/0x230
[<ffffffffa008acc1>] ? kmem_zone_alloc+0x81/0x120 [xfs]
[<ffffffff811122e8>] lock_acquire+0xd8/0x1e0
[<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
[<ffffffffa0083a70>] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
[<ffffffff8110aace>] down_write_nested+0x5e/0xc0
[<ffffffffa00781f7>] ? xfs_ilock+0x177/0x200 [xfs]
[<ffffffffa00781f7>] xfs_ilock+0x177/0x200 [xfs]
[<ffffffffa0083a70>] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
[<ffffffffa0085bdc>] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
[<ffffffff8124d7d5>] evict+0xc5/0x190
[<ffffffff8124d8d9>] dispose_list+0x39/0x60
[<ffffffff8124eb2b>] prune_icache_sb+0x4b/0x60
[<ffffffff8123317f>] super_cache_scan+0x14f/0x1a0
[<ffffffff811e0d19>] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
[<ffffffff811e50ee>] shrink_zone+0x15e/0x170
[<ffffffff811e5ef1>] kswapd+0x4f1/0xa80
[<ffffffff811e5a00>] ? zone_reclaim+0x230/0x230
[<ffffffff810e6882>] kthread+0xf2/0x110
[<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220
[<ffffffff8184803f>] ret_from_fork+0x3f/0x70
[<ffffffff810e6790>] ? kthread_create_on_node+0x220/0x220

To quote Dave:
"
Ignoring whether reflink should be doing anything or not, that's a
"xfs_refcountbt_init_cursor() gets called both outside and inside
transactions" lockdep false positive case. The problem here is
lockdep has seen this allocation from within a transaction, hence a
GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
Also note that we have an active reference to this inode.

So, because the reclaim annotations overload the interrupt level
detections and it's seen the inode ilock been taken in reclaim
("interrupt") context, this triggers a reclaim context warning where
it thinks it is unsafe to do this allocation in GFP_KERNEL context
holding the inode ilock...
"

This sounds like a fundamental problem of the reclaim lock detection.
It is really impossible to annotate such a special usecase IMHO unless
the reclaim lockup detection is reworked completely. Until then it
is much better to provide a way to add an "I know what I am doing" flag
and mark the problematic places. This would prevent abusing the GFP_NOFS
flag, which has a runtime effect even on configurations which have
lockdep disabled.

Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
skip the current allocation request.
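
A caller which knows that the reclaim recursion cannot happen - even though
lockdep believes otherwise - would then do something like the following
(purely illustrative, the allocation site is made up):

        /*
         * lockdep has seen this allocation inside a transaction elsewhere
         * and would report a false positive; this particular path is safe.
         */
        ptr = kmalloc(size, GFP_KERNEL | __GFP_NOLOCKDEP);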

While we are at it, also make sure that the radix tree doesn't
accidentally override tags stored in the upper part of the gfp_mask.

Suggested-by: Peter Zijlstra <[email protected]>
Acked-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/gfp.h | 10 +++++++++-
kernel/locking/lockdep.c | 4 ++++
lib/radix-tree.c | 2 ++
3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4175dca4ac39..1a934383cc20 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -41,6 +41,11 @@ struct vm_area_struct;
#define ___GFP_OTHER_NODE 0x800000u
#define ___GFP_WRITE 0x1000000u
#define ___GFP_KSWAPD_RECLAIM 0x2000000u
+#ifdef CONFIG_LOCKDEP
+#define ___GFP_NOLOCKDEP 0x4000000u
+#else
+#define ___GFP_NOLOCKDEP 0
+#endif
/* If the above are modified, __GFP_BITS_SHIFT may need updating */

/*
@@ -186,8 +191,11 @@ struct vm_area_struct;
#define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE)

+/* Disable lockdep for GFP context tracking */
+#define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
+
/* Room for N __GFP_FOO bits */
-#define __GFP_BITS_SHIFT 26
+#define __GFP_BITS_SHIFT (26 + IS_ENABLED(CONFIG_LOCKDEP))
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

/*
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 7c38f8f3d97b..5684682a9ef8 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2879,6 +2879,10 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
return;

+ /* Disable lockdep if explicitly requested */
+ if (gfp_mask & __GFP_NOLOCKDEP)
+ return;
+
mark_held_locks(curr, RECLAIM_FS);
}

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 5e8fc32697b1..e528bd426f8e 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -2498,6 +2498,8 @@ static int radix_tree_cpu_dead(unsigned int cpu)
void __init radix_tree_init(void)
{
int ret;
+
+ BUILD_BUG_ON(RADIX_TREE_MAX_TAGS + __GFP_BITS_SHIFT > 32);
radix_tree_node_cachep = kmem_cache_create("radix_tree_node",
sizeof(struct radix_tree_node), 0,
SLAB_PANIC | SLAB_RECLAIM_ACCOUNT,
--
2.10.2


2016-12-15 14:07:53

by Michal Hocko

Subject: [PATCH 2/9] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives

From: Michal Hocko <[email protected]>

Now that the page allocator offers __GFP_NOLOCKDEP, let's introduce a
KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it,
also change the KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing
KM_NOFS tags to keep lockdep happy") to use the new flag
instead. There is really no reason to make these allocation contexts
weaker just because of lockdep, which might not even be enabled
in most cases.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/xfs/kmem.h | 4 ++++
fs/xfs/libxfs/xfs_da_btree.c | 4 ++--
fs/xfs/xfs_buf.c | 2 +-
fs/xfs/xfs_dir2_readdir.c | 2 +-
4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
index 689f746224e7..ea3984091d58 100644
--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -33,6 +33,7 @@ typedef unsigned __bitwise xfs_km_flags_t;
#define KM_NOFS ((__force xfs_km_flags_t)0x0004u)
#define KM_MAYFAIL ((__force xfs_km_flags_t)0x0008u)
#define KM_ZERO ((__force xfs_km_flags_t)0x0010u)
+#define KM_NOLOCKDEP ((__force xfs_km_flags_t)0x0020u)

/*
* We use a special process flag to avoid recursive callbacks into
@@ -57,6 +58,9 @@ kmem_flags_convert(xfs_km_flags_t flags)
if (flags & KM_ZERO)
lflags |= __GFP_ZERO;

+ if (flags & KM_NOLOCKDEP)
+ lflags |= __GFP_NOLOCKDEP;
+
return lflags;
}

diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
index f2dc1a950c85..b8b5f6914863 100644
--- a/fs/xfs/libxfs/xfs_da_btree.c
+++ b/fs/xfs/libxfs/xfs_da_btree.c
@@ -2429,7 +2429,7 @@ xfs_buf_map_from_irec(

if (nirecs > 1) {
map = kmem_zalloc(nirecs * sizeof(struct xfs_buf_map),
- KM_SLEEP | KM_NOFS);
+ KM_SLEEP | KM_NOLOCKDEP);
if (!map)
return -ENOMEM;
*mapp = map;
@@ -2488,7 +2488,7 @@ xfs_dabuf_map(
*/
if (nfsb != 1)
irecs = kmem_zalloc(sizeof(irec) * nfsb,
- KM_SLEEP | KM_NOFS);
+ KM_SLEEP | KM_NOLOCKDEP);

nirecs = nfsb;
error = xfs_bmapi_read(dp, (xfs_fileoff_t)bno, nfsb, irecs,
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 7f0a01f7b592..f31ae592dcae 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1785,7 +1785,7 @@ xfs_alloc_buftarg(
{
xfs_buftarg_t *btp;

- btp = kmem_zalloc(sizeof(*btp), KM_SLEEP | KM_NOFS);
+ btp = kmem_zalloc(sizeof(*btp), KM_SLEEP | KM_NOLOCKDEP);

btp->bt_mount = mp;
btp->bt_dev = bdev->bd_dev;
diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
index 003a99b83bd8..033ed65d7ce6 100644
--- a/fs/xfs/xfs_dir2_readdir.c
+++ b/fs/xfs/xfs_dir2_readdir.c
@@ -503,7 +503,7 @@ xfs_dir2_leaf_getdents(
length = howmany(bufsize + geo->blksize, (1 << geo->fsblog));
map_info = kmem_zalloc(offsetof(struct xfs_dir2_leaf_map_info, map) +
(length * sizeof(struct xfs_bmbt_irec)),
- KM_SLEEP | KM_NOFS);
+ KM_SLEEP | KM_NOLOCKDEP);
map_info->map_size = length;

/*
--
2.10.2


2016-12-16 12:46:43

by Michal Hocko

Subject: [DEBUG PATCH 0/2] debug explicit GFP_NO{FS,IO} usage from the scope context

Hi,
I forgot to add the following two patches, which should help to
identify explicit GFP_NO{FS,IO} usage from within a scope context. Such
usage can be changed to the full GFP_KERNEL because all the calls
from within the NO{FS,IO} scope will drop __GFP_FS resp. __GFP_IO
automatically, and if the function is called outside of the scope then
we do not need to restrict it to NOFS/NOIO as long as all the reclaim
recursion unsafe contexts are marked properly. This means that each
such reported allocation site has to be checked before being converted.

The debugging has to be enabled explicitly by a kernel command line
parameter and then it reports the stack trace of the allocation and
also the function which has started the current scope.
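
For completeness, enabling it is just a matter of appending the parameter
(added by patch 1) to the kernel command line, e.g.

        debug_scope_gfp

and the reports then show up in the kernel log together with the stack trace.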

These two patches are _not_ intended to be merged and they are only
aimed at debugging.


2016-12-16 12:46:45

by Michal Hocko

Subject: [DEBUG PATCH 2/2] silence warnings which we cannot do anything about

From: Michal Hocko <[email protected]>

THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE

There are some code paths used by all the filesystems which we cannot
change to drop the GFP_NOFS, yet they generate a lot of warnings.
Provide {disable,enable}_scope_gfp_check to silence those.
alloc_page_buffers and grow_dev_page are silenced right away.

Signed-off-by: Michal Hocko <[email protected]>
---
fs/buffer.c | 4 ++++
include/linux/sched.h | 11 +++++++++++
mm/page_alloc.c | 3 +++
3 files changed, 18 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index d21771fcf7d3..d27e8f05f736 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -873,7 +873,9 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
head = NULL;
offset = PAGE_SIZE;
while ((offset -= size) >= 0) {
+ disable_scope_gfp_check();
bh = alloc_buffer_head(GFP_NOFS);
+ enable_scope_gfp_check();
if (!bh)
goto no_grow;

@@ -1003,7 +1005,9 @@ grow_dev_page(struct block_device *bdev, sector_t block,
*/
gfp_mask |= __GFP_NOFAIL;

+ disable_scope_gfp_check();
page = find_or_create_page(inode->i_mapping, index, gfp_mask);
+ enable_scope_gfp_check();
if (!page)
return ret;

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 288946bfc326..b379ef9ed464 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1988,6 +1988,7 @@ struct task_struct {
/* A live task holds one reference. */
atomic_t stack_refcount;
#endif
+ bool disable_scope_gfp_warn;
unsigned long nofs_caller;
unsigned long noio_caller;
/* CPU-specific state of this task */
@@ -2390,6 +2391,16 @@ static inline unsigned int __memalloc_nofs_save(unsigned long caller)
return flags;
}

+static inline void disable_scope_gfp_check(void)
+{
+ current->disable_scope_gfp_warn = true;
+}
+
+static inline void enable_scope_gfp_check(void)
+{
+ current->disable_scope_gfp_warn = false;
+}
+
#define memalloc_nofs_save() __memalloc_nofs_save(_RET_IP_)

static inline void memalloc_nofs_restore(unsigned int flags)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9e35fb2a8681..7ecae58abf74 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3758,6 +3758,9 @@ void debug_scope_gfp_context(gfp_t gfp_mask)
if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
return;

+ if (current->disable_scope_gfp_warn)
+ return;
+
if (current->flags & PF_MEMALLOC_NOIO)
restrict_mask = __GFP_IO;
else if ((current->flags & PF_MEMALLOC_NOFS) && (gfp_mask & __GFP_IO))
--
2.10.2


2016-12-16 12:46:44

by Michal Hocko

Subject: [DEBUG PATCH 1/2] mm, debug: report when GFP_NO{FS,IO} is used explicitly from memalloc_no{fs,io}_{save,restore} context

From: Michal Hocko <[email protected]>

THIS PATCH IS FOR TESTING ONLY AND NOT MEANT TO HIT LINUS TREE

It is desirable to reduce direct GFP_NO{FS,IO} usage to a minimum and
prefer scope usage defined by the memalloc_no{fs,io}_{save,restore} API.

Let's help this process and add a debugging tool to catch cases where an
explicit allocation request for GFP_NO{FS,IO} is done from within a scope
context. The printed stacktrace should help to identify the caller
and evaluate whether it can be changed to use a wider context or whether
it is called from another potentially dangerous context which needs
scope protection as well.

The checks have to be enabled explicitly by debug_scope_gfp kernel
command line parameter.

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/sched.h | 14 +++++++++++--
include/linux/slab.h | 3 +++
mm/page_alloc.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c9fbcbcfcc8..288946bfc326 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1988,6 +1988,8 @@ struct task_struct {
/* A live task holds one reference. */
atomic_t stack_refcount;
#endif
+ unsigned long nofs_caller;
+ unsigned long noio_caller;
/* CPU-specific state of this task */
struct thread_struct thread;
/*
@@ -2345,6 +2347,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
#define tsk_used_math(p) ((p)->flags & PF_USED_MATH)
#define used_math() tsk_used_math(current)

+extern void debug_scope_gfp_context(gfp_t gfp_mask);
+
/*
* Applies per-task gfp context to the given allocation flags.
* PF_MEMALLOC_NOIO implies GFP_NOIO
@@ -2363,25 +2367,31 @@ static inline gfp_t current_gfp_context(gfp_t flags)
return flags;
}

-static inline unsigned int memalloc_noio_save(void)
+static inline unsigned int __memalloc_noio_save(unsigned long caller)
{
unsigned int flags = current->flags & PF_MEMALLOC_NOIO;
current->flags |= PF_MEMALLOC_NOIO;
+ current->noio_caller = caller;
return flags;
}

+#define memalloc_noio_save() __memalloc_noio_save(_RET_IP_)
+
static inline void memalloc_noio_restore(unsigned int flags)
{
current->flags = (current->flags & ~PF_MEMALLOC_NOIO) | flags;
}

-static inline unsigned int memalloc_nofs_save(void)
+static inline unsigned int __memalloc_nofs_save(unsigned long caller)
{
unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
current->flags |= PF_MEMALLOC_NOFS;
+ current->nofs_caller = caller;
return flags;
}

+#define memalloc_nofs_save() __memalloc_nofs_save(_RET_IP_)
+
static inline void memalloc_nofs_restore(unsigned int flags)
{
current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 084b12bad198..6559668e29db 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -477,6 +477,7 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
*/
static __always_inline void *kmalloc(size_t size, gfp_t flags)
{
+ debug_scope_gfp_context(flags);
if (__builtin_constant_p(size)) {
if (size > KMALLOC_MAX_CACHE_SIZE)
return kmalloc_large(size, flags);
@@ -517,6 +518,7 @@ static __always_inline int kmalloc_size(int n)

static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
{
+ debug_scope_gfp_context(flags);
#ifndef CONFIG_SLOB
if (__builtin_constant_p(size) &&
size <= KMALLOC_MAX_CACHE_SIZE && !(flags & GFP_DMA)) {
@@ -575,6 +577,7 @@ int memcg_update_all_caches(int num_memcgs);
*/
static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
{
+ debug_scope_gfp_context(flags);
if (size != 0 && n > SIZE_MAX / size)
return NULL;
if (__builtin_constant_p(n) && __builtin_constant_p(size))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e701be6b930a..9e35fb2a8681 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3734,6 +3734,63 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
return page;
}

+static bool debug_scope_gfp;
+
+static int __init enable_debug_scope_gfp(char *unused)
+{
+ debug_scope_gfp = true;
+ return 0;
+}
+
+/*
+ * Spit out the stack trace if the given gfp_mask clears flags which are
+ * cleared context wide. Such a caller can remove the special flags clearing
+ * and rely on the context wide mask.
+ */
+void debug_scope_gfp_context(gfp_t gfp_mask)
+{
+ gfp_t restrict_mask;
+
+ if (likely(!debug_scope_gfp))
+ return;
+
+ /* both NOFS, NOIO are irrelevant when direct reclaim is disabled */
+ if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
+ return;
+
+ if (current->flags & PF_MEMALLOC_NOIO)
+ restrict_mask = __GFP_IO;
+ else if ((current->flags & PF_MEMALLOC_NOFS) && (gfp_mask & __GFP_IO))
+ restrict_mask = __GFP_FS;
+ else
+ return;
+
+ if ((gfp_mask & restrict_mask) != restrict_mask) {
+ /*
+ * If you see this warning then the code does:
+ * memalloc_no{fs,io}_save()
+ * ...
+ * foo()
+ * alloc_page(GFP_NO{FS,IO})
+ * ...
+ * memalloc_no{fs,io}_restore()
+ *
+ * allocation which is unnecessary because the scope gfp
+ * context will do that for all allocation requests already.
+ * If foo() is called from multiple contexts then make sure other
+ * contexts are safe wrt. GFP_NO{FS,IO} semantic and either add
+ * scope protection into particular paths or change the gfp mask
+ * to GFP_KERNEL.
+ */
+ pr_info("Unnecessarily specific gfp mask:%#x(%pGg) for the %s task wide context from %ps\n", gfp_mask, &gfp_mask,
+ (current->flags & PF_MEMALLOC_NOIO)?"NOIO":"NOFS",
+ (void*)((current->flags & PF_MEMALLOC_NOIO)?current->noio_caller:current->nofs_caller));
+ dump_stack();
+ }
+}
+EXPORT_SYMBOL(debug_scope_gfp_context);
+early_param("debug_scope_gfp", enable_debug_scope_gfp);
+
/*
* This is the 'heart' of the zoned buddy allocator.
*/
@@ -3798,6 +3855,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
}

/* First allocation attempt */
+ debug_scope_gfp_context(gfp_mask);
page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
if (likely(page))
goto out;
--
2.10.2


2016-12-16 15:15:14

by Mike Galbraith

Subject: Re: [PATCH 0/9 v2] scope GFP_NOFS api

On Thu, 2016-12-15 at 15:07 +0100, Michal Hocko wrote:
> Hi,
> I have posted the previous version here [1]. Since then I have added
> support to suppress reclaim lockdep warnings (__GFP_NOLOCKDEP) to allow
> removing GFP_NOFS usage motivated by lockdep false positives. On top
> of that I've tried to convert a few KM_NOFS usages to the new flag in
> the xfs code base. This would need a review from somebody familiar with
> xfs of course.

The wild ass guess below prevents the xfs explosion below when running
ltp zram tests.

---
fs/xfs/kmem.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/xfs/kmem.h
+++ b/fs/xfs/kmem.h
@@ -45,7 +45,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
{
gfp_t lflags;

- BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO));
+ BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO|KM_NOLOCKDEP));

if (flags & KM_NOSLEEP) {
lflags = GFP_ATOMIC | __GFP_NOWARN;

[ 108.775501] ------------[ cut here ]------------
[ 108.775503] kernel BUG at fs/xfs/kmem.h:48!
[ 108.775504] invalid opcode: 0000 [#1] SMP
[ 108.775505] Dumping ftrace buffer:
[ 108.775508] (ftrace buffer empty)
[ 108.775508] Modules linked in: xfs(E) libcrc32c(E) btrfs(E) xor(E) raid6_pq(E) zram(E) ebtable_filter(E) ebtables(E) fuse(E) bridge(E) stp(E) llc(E) iscsi_ibft(E) iscsi_boot_sysfs(E) ip6t_REJECT(E) xt_tcpudp(E) nf_conntrack_ipv6(E) nf_defrag_ipv6(E) ip6table_raw(E) ipt_REJECT(E) iptable_raw(E) iptable_filter(E) ip6table_mangle(E) nf_conntrack_netbios_ns(E) nf_conntrack_broadcast(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) ip_tables(E) xt_conntrack(E) nf_conntrack(E) ip6table_filter(E) ip6_tables(E) x_tables(E) nls_iso8859_1(E) nls_cp437(E) x86_pkg_temp_thermal(E) intel_powerclamp(E) coretemp(E) kvm_intel(E) nfsd(E) kvm(E) auth_rpcgss(E) nfs_acl(E) lockd(E) snd_hda_codec_realtek(E) snd_hda_codec_hdmi(E) pl2303(E) grace(E) snd_hda_codec_generic(E) usbserial(E) irqbypass(E) snd_hda_intel(E) snd_hda_codec(E)
[ 108.775523] snd_hwdep(E) sunrpc(E) snd_hda_core(E) crct10dif_pclmul(E) mei_me(E) mei(E) serio_raw(E) snd_pcm(E) crc32_pclmul(E) snd_timer(E) crc32c_intel(E) aesni_intel(E) aes_x86_64(E) crypto_simd(E) iTCO_wdt(E) iTCO_vendor_support(E) lpc_ich(E) mfd_core(E) snd(E) soundcore(E) joydev(E) fan(E) shpchp(E) tpm_infineon(E) cryptd(E) battery(E) thermal(E) pcspkr(E) glue_helper(E) usblp(E) intel_smartconnect(E) i2c_i801(E) efivarfs(E) hid_logitech_hidpp(E) hid_logitech_dj(E) hid_generic(E) usbhid(E) nouveau(E) wmi(E) i2c_algo_bit(E) ahci(E) libahci(E) drm_kms_helper(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ehci_pci(E) xhci_pci(E) ttm(E) ehci_hcd(E) xhci_hcd(E) r8169(E) libata(E) mii(E) drm(E) usbcore(E) fjes(E) video(E) button(E) af_packet(E) sd_mod(E) vfat(E) fat(E) ext4(E) crc16(E)
[ 108.775540] jbd2(E) mbcache(E) dm_mod(E) loop(E) sg(E) scsi_mod(E) autofs4(E)
[ 108.775544] CPU: 5 PID: 4495 Comm: mount Tainted: G E 4.10.0-master #4
[ 108.775545] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
[ 108.775546] task: ffff8803f9e54e00 task.stack: ffffc900018fc000
[ 108.775565] RIP: 0010:kmem_flags_convert.part.0+0x4/0x6 [xfs]
[ 108.775565] RSP: 0018:ffffc900018ffcd8 EFLAGS: 00010202
[ 108.775566] RAX: ffff8803f630a800 RBX: ffff8803f6b20000 RCX: 0000000000001000
[ 108.775567] RDX: 0000000000001000 RSI: 0000000000000031 RDI: 00000000000000b0
[ 108.775568] RBP: ffffc900018ffcd8 R08: 0000000000019fe0 R09: ffff8803f6b20000
[ 108.775568] R10: 0000000000000005 R11: 0000000000010641 R12: ffff88041e21ea00
[ 108.775569] R13: ffff8803f6b20000 R14: 0000000000000000 R15: 00000000fffffff4
[ 108.775570] FS: 00007f1cbee9e840(0000) GS:ffff88041ed40000(0000) knlGS:0000000000000000
[ 108.775571] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 108.775571] CR2: 00007f5d4b6ed000 CR3: 00000003fbf13000 CR4: 00000000001406e0
[ 108.775572] Call Trace:
[ 108.775588] kmem_alloc+0x100/0x100 [xfs]
[ 108.775591] ? kstrndup+0x49/0x60
[ 108.775605] xfs_alloc_buftarg+0x23/0xd0 [xfs]
[ 108.775619] xfs_open_devices+0x8c/0x170 [xfs]
[ 108.775621] ? sb_set_blocksize+0x1d/0x50
[ 108.775633] xfs_fs_fill_super+0x234/0x580 [xfs]
[ 108.775635] mount_bdev+0x184/0x1c0
[ 108.775647] ? xfs_test_remount_options.isra.15+0x60/0x60 [xfs]
[ 108.775658] xfs_fs_mount+0x15/0x20 [xfs]
[ 108.775659] mount_fs+0x15/0x90
[ 108.775661] vfs_kern_mount+0x67/0x130
[ 108.775663] do_mount+0x190/0xbd0
[ 108.775664] ? memdup_user+0x42/0x60
[ 108.775665] SyS_mount+0x83/0xd0
[ 108.775668] entry_SYSCALL_64_fastpath+0x1a/0xa9
[ 108.775669] RIP: 0033:0x7f1cbe7bf78a
[ 108.775669] RSP: 002b:00007ffc99215198 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
[ 108.775670] RAX: ffffffffffffffda RBX: 00007f1cbeabb3b8 RCX: 00007f1cbe7bf78a
[ 108.775671] RDX: 000055715a611690 RSI: 000055715a60d270 RDI: 000055715a60d2d0
[ 108.775671] RBP: 000055715a60d120 R08: 0000000000000000 R09: 00007f1cbea7c678
[ 108.775672] R10: 00000000c0ed0000 R11: 0000000000000202 R12: 00007f1cbecc8e78
[ 108.775672] R13: 00000000ffffffff R14: 0000000000000000 R15: 000055715a60d060
[ 108.775673] Code: ff 74 05 e8 c2 17 64 e0 48 8b 3d 6b ec 03 00 48 85 ff 74 05 e8 b1 17 64 e0 48 8b 3d 32 ec 03 00 e8 25 a6 74 e0 5d c3 55 48 89 e5 <0f> 0b 55 48 89 e5 e8 f4 53 ff ff 48 c7 c7 40 78 b7 a0 e8 18 01
[ 108.775700] RIP: kmem_flags_convert.part.0+0x4/0x6 [xfs] RSP: ffffc900018ffcd8

2016-12-16 15:35:06

by Michal Hocko

Subject: Re: [PATCH 0/9 v2] scope GFP_NOFS api

On Fri 16-12-16 16:05:58, Mike Galbraith wrote:
> On Thu, 2016-12-15 at 15:07 +0100, Michal Hocko wrote:
> > Hi,
> > I have posted the previous version here [1]. Since then I have added
> > support to suppress reclaim lockdep warnings (__GFP_NOLOCKDEP) to allow
> > removing GFP_NOFS usage motivated by lockdep false positives. On top
> > of that I've tried to convert a few KM_NOFS usages to the new flag in
> > the xfs code base. This would need a review from somebody familiar with
> > xfs of course.
>
> The wild ass guess below prevents the xfs explosion below when running
> ltp zram tests.

Yes, this looks correct. Thanks for noticing. I will fold it into
patch 2. Thanks for testing, Mike!
>
> ---
> fs/xfs/kmem.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -45,7 +45,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
> {
> gfp_t lflags;
>
> - BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO));
> + BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO|KM_NOLOCKDEP));
>
> if (flags & KM_NOSLEEP) {
> lflags = GFP_ATOMIC | __GFP_NOWARN;
--
Michal Hocko
SUSE Labs

2016-12-16 15:41:16

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 2/9 v2] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives

Updated patch after Mike noticed a BUG_ON when KM_NOLOCKDEP is used.
---

2016-12-16 16:28:43

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH 0/9 v2] scope GFP_NOFS api

On Fri, 2016-12-16 at 16:35 +0100, Michal Hocko wrote:
> On Fri 16-12-16 16:05:58, Mike Galbraith wrote:
> > On Thu, 2016-12-15 at 15:07 +0100, Michal Hocko wrote:
> > > Hi,
> > > I have posted the previous version here [1]. Since then I have added a
> > > support to suppress reclaim lockdep warnings (__GFP_NOLOCKDEP) to allow
> > > removing GFP_NOFS usage motivated by the lockdep false positives. On top
> > > of that I've tried to convert few KM_NOFS usages to use the new flag in
> > > the xfs code base. This would need a review from somebody familiar with
> > > xfs of course.
> >
> > The wild ass guess below prevents the xfs explosion below when running
> > ltp zram tests.
>
> Yes, this looks correct. Thanks for noticing. I will fold it into
> patch 2. Thanks for testing, Mike!

I had ulterior motives: I was hoping you might have made the irksome RT
gripe below just _go away_, as staring at it ain't working out ;-)

[ 1441.309006] =========================================================
[ 1441.309006] [ INFO: possible irq lock inversion dependency detected ]
[ 1441.309007] 4.10.0-rt9-rt #11 Tainted: G E
[ 1441.309007] ---------------------------------------------------------
[ 1441.309008] kswapd0/165 just changed the state of lock:
[ 1441.309009] (&journal->j_state_lock){+.+.-.}, at: [<ffffffffa00a6d60>] jbd2_complete_transaction+0x20/0x90 [jbd2]
[ 1441.309017] but this lock took another, RECLAIM_FS-unsafe lock in the past:
[ 1441.309017] (&tb->tb6_lock){+.+.+.}
[ 1441.309018] and interrupts could create inverse lock ordering between them.
[ 1441.309018] other info that might help us debug this:
[ 1441.309018] Chain exists of: &journal->j_state_lock --> &journal->j_list_lock --> &tb->tb6_lock
[ 1441.309019] Possible interrupt unsafe locking scenario:
[ 1441.309019] CPU0 CPU1
[ 1441.309019] ---- ----
[ 1441.309019] lock(&tb->tb6_lock);
[ 1441.309020] local_irq_disable();
[ 1441.309020] lock(&journal->j_state_lock);
[ 1441.309020] lock(&journal->j_list_lock);
[ 1441.309021] <Interrupt>
[ 1441.309021] lock(&journal->j_state_lock);
[ 1441.309021] *** DEADLOCK ***
[ 1441.309022] 2 locks held by kswapd0/165:
[ 1441.309022] #0: (shrinker_rwsem){+.+...}, at: [<ffffffff811efa2a>] shrink_slab+0x7a/0x6c0
[ 1441.309027] #1: (&type->s_umount_key#29){+.+.+.}, at: [<ffffffff8126f20b>] trylock_super+0x1b/0x50
[ 1441.309030] the shortest dependencies between 2nd lock and 1st lock:
[ 1441.309031] -> (&tb->tb6_lock){+.+.+.} ops: 271 {
[ 1441.309032] HARDIRQ-ON-W at:
[ 1441.309035] [<ffffffff810e11b8>] __lock_acquire+0x938/0x1770
[ 1441.309036] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309039] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
[ 1441.309041] [<ffffffff816e66e3>] __ip6_ins_rt+0x33/0x70
[ 1441.309043] [<ffffffff816eccd1>] ip6_route_add+0x81/0xd0
[ 1441.309044] [<ffffffff816dbf33>] addrconf_prefix_route+0x133/0x1d0
[ 1441.309046] [<ffffffff816e16eb>] inet6_addr_add+0x1eb/0x250
[ 1441.309047] [<ffffffff816e294b>] inet6_rtm_newaddr+0x33b/0x410
[ 1441.309049] [<ffffffff81613c35>] rtnetlink_rcv_msg+0x95/0x220
[ 1441.309051] [<ffffffff8163a477>] netlink_rcv_skb+0xa7/0xc0
[ 1441.309053] [<ffffffff8160de88>] rtnetlink_rcv+0x28/0x30
[ 1441.309054] [<ffffffff81639e53>] netlink_unicast+0x143/0x1f0
[ 1441.309055] [<ffffffff8163a222>] netlink_sendmsg+0x322/0x3a0
[ 1441.309057] [<ffffffff815d5c48>] sock_sendmsg+0x38/0x50
[ 1441.309058] [<ffffffff815d60a6>] SYSC_sendto+0xf6/0x170
[ 1441.309060] [<ffffffff815d6f6e>] SyS_sendto+0xe/0x10
[ 1441.309061] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1441.309061] SOFTIRQ-ON-W at:
[ 1441.309063] [<ffffffff810e0b03>] __lock_acquire+0x283/0x1770
[ 1441.309064] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309064] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
[ 1441.309065] [<ffffffff816e66e3>] __ip6_ins_rt+0x33/0x70
[ 1441.309067] [<ffffffff816eccd1>] ip6_route_add+0x81/0xd0
[ 1441.309067] [<ffffffff816dbf33>] addrconf_prefix_route+0x133/0x1d0
[ 1441.309068] [<ffffffff816e16eb>] inet6_addr_add+0x1eb/0x250
[ 1441.309069] [<ffffffff816e294b>] inet6_rtm_newaddr+0x33b/0x410
[ 1441.309071] [<ffffffff81613c35>] rtnetlink_rcv_msg+0x95/0x220
[ 1441.309073] [<ffffffff8163a477>] netlink_rcv_skb+0xa7/0xc0
[ 1441.309074] [<ffffffff8160de88>] rtnetlink_rcv+0x28/0x30
[ 1441.309075] [<ffffffff81639e53>] netlink_unicast+0x143/0x1f0
[ 1441.309077] [<ffffffff8163a222>] netlink_sendmsg+0x322/0x3a0
[ 1441.309078] [<ffffffff815d5c48>] sock_sendmsg+0x38/0x50
[ 1441.309079] [<ffffffff815d60a6>] SYSC_sendto+0xf6/0x170
[ 1441.309080] [<ffffffff815d6f6e>] SyS_sendto+0xe/0x10
[ 1441.309081] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1441.309081] RECLAIM_FS-ON-W at:
[ 1441.309082] [<ffffffff810e0316>] mark_held_locks+0x66/0x90
[ 1441.309084] [<ffffffff810e34a8>] lockdep_trace_alloc+0xd8/0x120
[ 1441.309085] [<ffffffff81246df6>] kmem_cache_alloc_node+0x36/0x310
[ 1441.309086] [<ffffffff815dfd4e>] __alloc_skb+0x4e/0x280
[ 1441.309088] [<ffffffff816ee6ac>] inet6_rt_notify+0x5c/0x130
[ 1441.309089] [<ffffffff816f101b>] fib6_add+0x56b/0xa30
[ 1441.309090] [<ffffffff816e66f8>] __ip6_ins_rt+0x48/0x70
[ 1441.309091] [<ffffffff816eccd1>] ip6_route_add+0x81/0xd0
[ 1441.309092] [<ffffffff816dbf33>] addrconf_prefix_route+0x133/0x1d0
[ 1441.309093] [<ffffffff816e16eb>] inet6_addr_add+0x1eb/0x250
[ 1441.309094] [<ffffffff816e294b>] inet6_rtm_newaddr+0x33b/0x410
[ 1441.309096] [<ffffffff81613c35>] rtnetlink_rcv_msg+0x95/0x220
[ 1441.309097] [<ffffffff8163a477>] netlink_rcv_skb+0xa7/0xc0
[ 1441.309098] [<ffffffff8160de88>] rtnetlink_rcv+0x28/0x30
[ 1441.309099] [<ffffffff81639e53>] netlink_unicast+0x143/0x1f0
[ 1441.309100] [<ffffffff8163a222>] netlink_sendmsg+0x322/0x3a0
[ 1441.309102] [<ffffffff815d5c48>] sock_sendmsg+0x38/0x50
[ 1441.309103] [<ffffffff815d60a6>] SYSC_sendto+0xf6/0x170
[ 1441.309104] [<ffffffff815d6f6e>] SyS_sendto+0xe/0x10
[ 1441.309105] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1441.309105] INITIAL USE at:
[ 1441.309106] [<ffffffff810e0b4e>] __lock_acquire+0x2ce/0x1770
[ 1441.309107] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309108] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
[ 1441.309109] [<ffffffff816e66e3>] __ip6_ins_rt+0x33/0x70
[ 1441.309110] [<ffffffff816eccd1>] ip6_route_add+0x81/0xd0
[ 1441.309111] [<ffffffff816dbf33>] addrconf_prefix_route+0x133/0x1d0
[ 1441.309112] [<ffffffff816e16eb>] inet6_addr_add+0x1eb/0x250
[ 1441.309113] [<ffffffff816e294b>] inet6_rtm_newaddr+0x33b/0x410
[ 1441.309115] [<ffffffff81613c35>] rtnetlink_rcv_msg+0x95/0x220
[ 1441.309116] [<ffffffff8163a477>] netlink_rcv_skb+0xa7/0xc0
[ 1441.309117] [<ffffffff8160de88>] rtnetlink_rcv+0x28/0x30
[ 1441.309118] [<ffffffff81639e53>] netlink_unicast+0x143/0x1f0
[ 1441.309119] [<ffffffff8163a222>] netlink_sendmsg+0x322/0x3a0
[ 1441.309120] [<ffffffff815d5c48>] sock_sendmsg+0x38/0x50
[ 1441.309121] [<ffffffff815d60a6>] SYSC_sendto+0xf6/0x170
[ 1441.309122] [<ffffffff815d6f6e>] SyS_sendto+0xe/0x10
[ 1441.309123] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1441.309123] }
[ 1441.309125] ... key at: [<ffffffff82dd96e0>] __key.59908+0x0/0x8
[ 1441.309125] ... acquired at:
[ 1441.309126] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309127] [<ffffffff8174d307>] rt_read_lock+0x47/0x60
[ 1441.309128] [<ffffffff816ea541>] ip6_pol_route+0x61/0xa60
[ 1441.309130] [<ffffffff816eaf5a>] ip6_pol_route_input+0x1a/0x20
[ 1441.309131] [<ffffffff81718f21>] fib6_rule_action+0xa1/0x1e0
[ 1441.309133] [<ffffffff81621e53>] fib_rules_lookup+0x153/0x2e0
[ 1441.309134] [<ffffffff81719219>] fib6_rule_lookup+0x59/0xc0
[ 1441.309135] [<ffffffff816e6a3e>] ip6_route_input_lookup+0x4e/0x60
[ 1441.309136] [<ffffffff816ec59d>] ip6_route_input+0xdd/0x1a0
[ 1441.309137] [<ffffffff816d87d0>] ip6_rcv_finish+0x60/0x200
[ 1441.309139] [<ffffffffa09e00b0>] ip_sabotage_in+0x30/0x40 [br_netfilter]
[ 1441.309141] [<ffffffff8163c7ac>] nf_hook_slow+0x2c/0xf0
[ 1441.309142] [<ffffffff816d998a>] ipv6_rcv+0x72a/0x980
[ 1441.309143] [<ffffffff815f81ef>] __netif_receive_skb_core+0x38f/0xd20
[ 1441.309144] [<ffffffff815f8b98>] __netif_receive_skb+0x18/0x60
[ 1441.309145] [<ffffffff815fa4c1>] netif_receive_skb_internal+0x61/0x1d0
[ 1441.309147] [<ffffffff815fa668>] netif_receive_skb+0x38/0x180
[ 1441.309151] [<ffffffffa09b77e5>] br_pass_frame_up+0xd5/0x2c0 [bridge]
[ 1441.309154] [<ffffffffa09b7d66>] br_handle_frame_finish+0x256/0x5c0 [bridge]
[ 1441.309156] [<ffffffffa09e159c>] br_nf_hook_thresh+0xac/0x220 [br_netfilter]
[ 1441.309157] [<ffffffffa09e2ee3>] br_nf_pre_routing_finish_ipv6+0x1c3/0x340 [br_netfilter]
[ 1441.309158] [<ffffffffa09e349d>] br_nf_pre_routing_ipv6+0xdd/0x27a [br_netfilter]
[ 1441.309159] [<ffffffffa09e2942>] br_nf_pre_routing+0x1b2/0x540 [br_netfilter]
[ 1441.309160] [<ffffffff8163c7ac>] nf_hook_slow+0x2c/0xf0
[ 1441.309163] [<ffffffffa09b82f7>] br_handle_frame+0x227/0x5b0 [bridge]
[ 1441.309164] [<ffffffff815f8036>] __netif_receive_skb_core+0x1d6/0xd20
[ 1441.309165] [<ffffffff815f8b98>] __netif_receive_skb+0x18/0x60
[ 1441.309166] [<ffffffff815fa4c1>] netif_receive_skb_internal+0x61/0x1d0
[ 1441.309167] [<ffffffff815fbca2>] napi_gro_receive+0x192/0x250
[ 1441.309171] [<ffffffffa03f2163>] rtl8169_poll+0x183/0x6a0 [r8169]
[ 1441.309172] [<ffffffff815faea0>] net_rx_action+0x3b0/0x700
[ 1441.309173] [<ffffffff810841a5>] do_current_softirqs+0x285/0x680
[ 1441.309174] [<ffffffff81084607>] __local_bh_enable+0x67/0x80
[ 1441.309177] [<ffffffff810f7d81>] irq_forced_thread_fn+0x41/0x60
[ 1441.309178] [<ffffffff810f832f>] irq_thread+0x13f/0x1e0
[ 1441.309179] [<ffffffff810a5ecc>] kthread+0x10c/0x140
[ 1441.309180] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
[ 1441.309181] -> (&per_cpu(local_softirq_locks[i], __cpu).lock){+.+...} ops: 3582145 {
[ 1441.309182] HARDIRQ-ON-W at:
[ 1441.309183] [<ffffffff810e11b8>] __lock_acquire+0x938/0x1770
[ 1441.309184] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309185] [<ffffffff8174cdea>] rt_spin_lock__no_mg+0x5a/0x70
[ 1441.309186] [<ffffffff81084094>] do_current_softirqs+0x174/0x680
[ 1441.309187] [<ffffffff81084607>] __local_bh_enable+0x67/0x80
[ 1441.309188] [<ffffffff8113c631>] cgroup_idr_alloc.constprop.41+0x61/0x80
[ 1441.309190] [<ffffffff811d4d5b>] cgroup_setup_root+0x65/0x28f
[ 1441.309191] [<ffffffff81d9ef88>] cgroup_init+0xf7/0x3e5
[ 1441.309193] [<ffffffff81d770d1>] start_kernel+0x43f/0x484
[ 1441.309194] [<ffffffff81d76599>] x86_64_start_reservations+0x2a/0x2c
[ 1441.309195] [<ffffffff81d766d8>] x86_64_start_kernel+0x13d/0x14c
[ 1441.309196] [<ffffffff810001b5>] start_cpu+0x5/0x14
[ 1441.309196] SOFTIRQ-ON-W at:
[ 1441.309197] [<ffffffff810e0b03>] __lock_acquire+0x283/0x1770
[ 1441.309198] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309199] [<ffffffff8174cdea>] rt_spin_lock__no_mg+0x5a/0x70
[ 1441.309200] [<ffffffff81084094>] do_current_softirqs+0x174/0x680
[ 1441.309201] [<ffffffff81084607>] __local_bh_enable+0x67/0x80
[ 1441.309202] [<ffffffff8113c631>] cgroup_idr_alloc.constprop.41+0x61/0x80
[ 1441.309203] [<ffffffff811d4d5b>] cgroup_setup_root+0x65/0x28f
[ 1441.309204] [<ffffffff81d9ef88>] cgroup_init+0xf7/0x3e5
[ 1441.309204] [<ffffffff81d770d1>] start_kernel+0x43f/0x484
[ 1441.309205] [<ffffffff81d76599>] x86_64_start_reservations+0x2a/0x2c
[ 1441.309206] [<ffffffff81d766d8>] x86_64_start_kernel+0x13d/0x14c
[ 1441.309207] [<ffffffff810001b5>] start_cpu+0x5/0x14
[ 1441.309207] INITIAL USE at:
[ 1441.309208] [<ffffffff810e0b4e>] __lock_acquire+0x2ce/0x1770
[ 1441.309209] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309210] [<ffffffff8174cdea>] rt_spin_lock__no_mg+0x5a/0x70
[ 1441.309211] [<ffffffff81084094>] do_current_softirqs+0x174/0x680
[ 1441.309212] [<ffffffff81084607>] __local_bh_enable+0x67/0x80
[ 1441.309213] [<ffffffff8113c631>] cgroup_idr_alloc.constprop.41+0x61/0x80
[ 1441.309213] [<ffffffff811d4d5b>] cgroup_setup_root+0x65/0x28f
[ 1441.309215] [<ffffffff81d9ef88>] cgroup_init+0xf7/0x3e5
[ 1441.309216] [<ffffffff81d770d1>] start_kernel+0x43f/0x484
[ 1441.309216] [<ffffffff81d76599>] x86_64_start_reservations+0x2a/0x2c
[ 1441.309217] [<ffffffff81d766d8>] x86_64_start_kernel+0x13d/0x14c
[ 1441.309218] [<ffffffff810001b5>] start_cpu+0x5/0x14
[ 1441.309218] }
[ 1441.309220] ... key at: [<ffffffff81f6c110>] __key.38555+0x0/0x8
[ 1441.309220] ... acquired at:
[ 1441.309221] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309222] [<ffffffff8174cdea>] rt_spin_lock__no_mg+0x5a/0x70
[ 1441.309222] [<ffffffff81084094>] do_current_softirqs+0x174/0x680
[ 1441.309223] [<ffffffff81084607>] __local_bh_enable+0x67/0x80
[ 1441.309225] [<ffffffff81200bc9>] wb_wakeup_delayed+0x69/0x70
[ 1441.309226] [<ffffffff812a0aab>] __mark_inode_dirty+0x60b/0x7c0
[ 1441.309227] [<ffffffff812ab235>] mark_buffer_dirty+0xb5/0x240
[ 1441.309231] [<ffffffffa009ba1d>] __jbd2_journal_temp_unlink_buffer+0xbd/0xe0 [jbd2]
[ 1441.309233] [<ffffffffa009e40a>] __jbd2_journal_refile_buffer+0xba/0xe0 [jbd2]
[ 1441.309235] [<ffffffffa009fbed>] jbd2_journal_commit_transaction+0x112d/0x2130 [jbd2]
[ 1441.309237] [<ffffffffa00a53dd>] kjournald2+0xcd/0x270 [jbd2]
[ 1441.309239] [<ffffffff810a5ecc>] kthread+0x10c/0x140
[ 1441.309239] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
[ 1441.309240] -> (&journal->j_list_lock){+.+...} ops: 587416 {
[ 1441.309241] HARDIRQ-ON-W at:
[ 1441.309242] [<ffffffff810e11b8>] __lock_acquire+0x938/0x1770
[ 1441.309243] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309244] [<ffffffff8174cd6f>] rt_spin_lock+0x5f/0x80
[ 1441.309246] [<ffffffffa009d529>] do_get_write_access+0x3b9/0x5c0 [jbd2]
[ 1441.309248] [<ffffffffa009d761>] jbd2_journal_get_write_access+0x31/0x60 [jbd2]
[ 1441.309259] [<ffffffffa0105259>] __ext4_journal_get_write_access+0x49/0x90 [ext4]
[ 1441.309264] [<ffffffffa00c5cb2>] ext4_file_open+0x1c2/0x230 [ext4]
[ 1441.309265] [<ffffffff81268281>] do_dentry_open+0x231/0x360
[ 1441.309266] [<ffffffff81269642>] vfs_open+0x52/0x80
[ 1441.309268] [<ffffffff8127b206>] path_openat+0x476/0xdd0
[ 1441.309269] [<ffffffff8127d64e>] do_filp_open+0x7e/0xd0
[ 1441.309270] [<ffffffff812723b7>] do_open_execat+0x67/0x150
[ 1441.309271] [<ffffffff812739ec>] do_execveat_common.isra.34+0x25c/0x9a0
[ 1441.309272] [<ffffffff8127415c>] do_execve+0x2c/0x30
[ 1441.309274] [<ffffffff81098c46>] call_usermodehelper_exec_async+0xf6/0x130
[ 1441.309274] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
[ 1441.309275] SOFTIRQ-ON-W at:
[ 1441.309276] [<ffffffff810e0b03>] __lock_acquire+0x283/0x1770
[ 1441.309277] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309277] [<ffffffff8174cd6f>] rt_spin_lock+0x5f/0x80
[ 1441.309279] [<ffffffffa009d529>] do_get_write_access+0x3b9/0x5c0 [jbd2]
[ 1441.309281] [<ffffffffa009d761>] jbd2_journal_get_write_access+0x31/0x60 [jbd2]
[ 1441.309288] [<ffffffffa0105259>] __ext4_journal_get_write_access+0x49/0x90 [ext4]
[ 1441.309293] [<ffffffffa00c5cb2>] ext4_file_open+0x1c2/0x230 [ext4]
[ 1441.309294] [<ffffffff81268281>] do_dentry_open+0x231/0x360
[ 1441.309295] [<ffffffff81269642>] vfs_open+0x52/0x80
[ 1441.309296] [<ffffffff8127b206>] path_openat+0x476/0xdd0
[ 1441.309297] [<ffffffff8127d64e>] do_filp_open+0x7e/0xd0
[ 1441.309298] [<ffffffff812723b7>] do_open_execat+0x67/0x150
[ 1441.309299] [<ffffffff812739ec>] do_execveat_common.isra.34+0x25c/0x9a0
[ 1441.309300] [<ffffffff8127415c>] do_execve+0x2c/0x30
[ 1441.309301] [<ffffffff81098c46>] call_usermodehelper_exec_async+0xf6/0x130
[ 1441.309302] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
[ 1441.309302] INITIAL USE at:
[ 1441.309303] [<ffffffff810e0b4e>] __lock_acquire+0x2ce/0x1770
[ 1441.309304] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309305] [<ffffffff8174cd6f>] rt_spin_lock+0x5f/0x80
[ 1441.309307] [<ffffffffa009d529>] do_get_write_access+0x3b9/0x5c0 [jbd2]
[ 1441.309308] [<ffffffffa009d761>] jbd2_journal_get_write_access+0x31/0x60 [jbd2]
[ 1441.309314] [<ffffffffa0105259>] __ext4_journal_get_write_access+0x49/0x90 [ext4]
[ 1441.309319] [<ffffffffa00c5cb2>] ext4_file_open+0x1c2/0x230 [ext4]
[ 1441.309319] [<ffffffff81268281>] do_dentry_open+0x231/0x360
[ 1441.309320] [<ffffffff81269642>] vfs_open+0x52/0x80
[ 1441.309321] [<ffffffff8127b206>] path_openat+0x476/0xdd0
[ 1441.309322] [<ffffffff8127d64e>] do_filp_open+0x7e/0xd0
[ 1441.309323] [<ffffffff812723b7>] do_open_execat+0x67/0x150
[ 1441.309324] [<ffffffff812739ec>] do_execveat_common.isra.34+0x25c/0x9a0
[ 1441.309325] [<ffffffff8127415c>] do_execve+0x2c/0x30
[ 1441.309326] [<ffffffff81098c46>] call_usermodehelper_exec_async+0xf6/0x130
[ 1441.309327] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
[ 1441.309327] }
[ 1441.309330] ... key at: [<ffffffffa00af5c0>] __key.47251+0x0/0xffffffffffff9a40 [jbd2]
[ 1441.309330] ... acquired at:
[ 1441.309331] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309331] [<ffffffff8174cd6f>] rt_spin_lock+0x5f/0x80
[ 1441.309333] [<ffffffffa009ee25>] jbd2_journal_commit_transaction+0x365/0x2130 [jbd2]
[ 1441.309335] [<ffffffffa00a53dd>] kjournald2+0xcd/0x270 [jbd2]
[ 1441.309337] [<ffffffff810a5ecc>] kthread+0x10c/0x140
[ 1441.309337] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
[ 1441.309338] -> (&journal->j_state_lock){+.+.-.} ops: 5849939 {
[ 1441.309339] HARDIRQ-ON-W at:
[ 1441.309340] [<ffffffff810e11b8>] __lock_acquire+0x938/0x1770
[ 1441.309341] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309341] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
[ 1441.309348] [<ffffffffa00e9c0d>] ext4_init_journal_params+0x4d/0xc0 [ext4]
[ 1441.309353] [<ffffffffa00f515e>] ext4_fill_super+0x1e4e/0x3770 [ext4]
[ 1441.309355] [<ffffffff8126ee4a>] mount_bdev+0x18a/0x1c0
[ 1441.309360] [<ffffffffa00e9705>] ext4_mount+0x15/0x20 [ext4]
[ 1441.309361] [<ffffffff8126fa79>] mount_fs+0x39/0x170
[ 1441.309362] [<ffffffff81290817>] vfs_kern_mount+0x67/0x130
[ 1441.309363] [<ffffffff812939fb>] do_mount+0x1bb/0xc60
[ 1441.309364] [<ffffffff81294773>] SyS_mount+0x83/0xd0
[ 1441.309365] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1441.309365] SOFTIRQ-ON-W at:
[ 1441.309366] [<ffffffff810e0b03>] __lock_acquire+0x283/0x1770
[ 1441.309367] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309368] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
[ 1441.309373] [<ffffffffa00e9c0d>] ext4_init_journal_params+0x4d/0xc0 [ext4]
[ 1441.309378] [<ffffffffa00f515e>] ext4_fill_super+0x1e4e/0x3770 [ext4]
[ 1441.309379] [<ffffffff8126ee4a>] mount_bdev+0x18a/0x1c0
[ 1441.309383] [<ffffffffa00e9705>] ext4_mount+0x15/0x20 [ext4]
[ 1441.309385] [<ffffffff8126fa79>] mount_fs+0x39/0x170
[ 1441.309385] [<ffffffff81290817>] vfs_kern_mount+0x67/0x130
[ 1441.309386] [<ffffffff812939fb>] do_mount+0x1bb/0xc60
[ 1441.309387] [<ffffffff81294773>] SyS_mount+0x83/0xd0
[ 1441.309388] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1441.309388] IN-RECLAIM_FS-W at:
[ 1441.309390] [<ffffffff810e0b36>] __lock_acquire+0x2b6/0x1770
[ 1441.309391] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309391] [<ffffffff8174d307>] rt_read_lock+0x47/0x60
[ 1441.309394] [<ffffffffa00a6d60>] jbd2_complete_transaction+0x20/0x90 [jbd2]
[ 1441.309398] [<ffffffffa00d472e>] ext4_evict_inode+0x37e/0x700 [ext4]
[ 1441.309400] [<ffffffff8128b221>] evict+0xd1/0x1a0
[ 1441.309401] [<ffffffff8128b33d>] dispose_list+0x4d/0x70
[ 1441.309402] [<ffffffff8128c60b>] prune_icache_sb+0x4b/0x60
[ 1441.309404] [<ffffffff8126f381>] super_cache_scan+0x141/0x190
[ 1441.309405] [<ffffffff811efc27>] shrink_slab+0x277/0x6c0
[ 1441.309406] [<ffffffff811f4523>] shrink_node+0x2e3/0x2f0
[ 1441.309407] [<ffffffff811f5a7f>] kswapd+0x34f/0x980
[ 1441.309409] [<ffffffff810a5ecc>] kthread+0x10c/0x140
[ 1441.309409] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
[ 1441.309410] INITIAL USE at:
[ 1441.309411] [<ffffffff810e0b4e>] __lock_acquire+0x2ce/0x1770
[ 1441.309412] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309412] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
[ 1441.309417] [<ffffffffa00e9c0d>] ext4_init_journal_params+0x4d/0xc0 [ext4]
[ 1441.309422] [<ffffffffa00f515e>] ext4_fill_super+0x1e4e/0x3770 [ext4]
[ 1441.309423] [<ffffffff8126ee4a>] mount_bdev+0x18a/0x1c0
[ 1441.309427] [<ffffffffa00e9705>] ext4_mount+0x15/0x20 [ext4]
[ 1441.309429] [<ffffffff8126fa79>] mount_fs+0x39/0x170
[ 1441.309429] [<ffffffff81290817>] vfs_kern_mount+0x67/0x130
[ 1441.309430] [<ffffffff812939fb>] do_mount+0x1bb/0xc60
[ 1441.309431] [<ffffffff81294773>] SyS_mount+0x83/0xd0
[ 1441.309432] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
[ 1441.309432] }
[ 1441.309435] ... key at: [<ffffffffa00af5b0>] __key.47253+0x0/0xffffffffffff9a50 [jbd2]
[ 1441.309435] ... acquired at:
[ 1441.309436] [<ffffffff810df99e>] check_usage_forwards+0x11e/0x120
[ 1441.309437] [<ffffffff810e01a8>] mark_lock+0x1e8/0x2f0
[ 1441.309437] [<ffffffff810e0b36>] __lock_acquire+0x2b6/0x1770
[ 1441.309438] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
[ 1441.309439] [<ffffffff8174d307>] rt_read_lock+0x47/0x60
[ 1441.309441] [<ffffffffa00a6d60>] jbd2_complete_transaction+0x20/0x90 [jbd2]
[ 1441.309446] [<ffffffffa00d472e>] ext4_evict_inode+0x37e/0x700 [ext4]
[ 1441.309447] [<ffffffff8128b221>] evict+0xd1/0x1a0
[ 1441.309448] [<ffffffff8128b33d>] dispose_list+0x4d/0x70
[ 1441.309449] [<ffffffff8128c60b>] prune_icache_sb+0x4b/0x60
[ 1441.309450] [<ffffffff8126f381>] super_cache_scan+0x141/0x190
[ 1441.309451] [<ffffffff811efc27>] shrink_slab+0x277/0x6c0
[ 1441.309452] [<ffffffff811f4523>] shrink_node+0x2e3/0x2f0
[ 1441.309453] [<ffffffff811f5a7f>] kswapd+0x34f/0x980
[ 1441.309454] [<ffffffff810a5ecc>] kthread+0x10c/0x140
[ 1441.309455] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
[ 1441.309455] stack backtrace:
[ 1441.309457] CPU: 0 PID: 165 Comm: kswapd0 Tainted: G E 4.10.0-rt9-rt #11
[ 1441.309457] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
[ 1441.309457] Call Trace:
[ 1441.309459] dump_stack+0x85/0xc8
[ 1441.309461] print_irq_inversion_bug.part.34+0x1ac/0x1b8
[ 1441.309462] check_usage_forwards+0x11e/0x120
[ 1441.309463] ? check_usage_backwards+0x120/0x120
[ 1441.309463] mark_lock+0x1e8/0x2f0
[ 1441.309464] __lock_acquire+0x2b6/0x1770
[ 1441.309465] ? __lock_acquire+0x420/0x1770
[ 1441.309466] lock_acquire+0xd4/0x270
[ 1441.309468] ? jbd2_complete_transaction+0x20/0x90 [jbd2]
[ 1441.309469] rt_read_lock+0x47/0x60
[ 1441.309471] ? jbd2_complete_transaction+0x20/0x90 [jbd2]
[ 1441.309472] jbd2_complete_transaction+0x20/0x90 [jbd2]
[ 1441.309477] ext4_evict_inode+0x37e/0x700 [ext4]
[ 1441.309478] evict+0xd1/0x1a0
[ 1441.309479] dispose_list+0x4d/0x70
[ 1441.309480] prune_icache_sb+0x4b/0x60
[ 1441.309481] super_cache_scan+0x141/0x190
[ 1441.309482] shrink_slab+0x277/0x6c0
[ 1441.309483] shrink_node+0x2e3/0x2f0
[ 1441.309485] kswapd+0x34f/0x980
[ 1441.309487] kthread+0x10c/0x140
[ 1441.309488] ? mem_cgroup_shrink_node+0x390/0x390
[ 1441.309488] ? kthread_park+0x90/0x90
[ 1441.309489] ret_from_fork+0x2a/0x40

2016-12-16 16:37:53

by Brian Foster

[permalink] [raw]
Subject: Re: [PATCH 2/9 v2] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives

On Fri, Dec 16, 2016 at 04:40:41PM +0100, Michal Hocko wrote:
> Updated patch after Mike noticed a BUG_ON when KM_NOLOCKDEP is used.
> ---
> From 1497e713e11639157aef21cae29052cb3dc7ab44 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Thu, 15 Dec 2016 13:06:43 +0100
> Subject: [PATCH] xfs: introduce and use KM_NOLOCKDEP to silence reclaim
> lockdep false positives
>
> Now that the page allocator offers __GFP_NOLOCKDEP let's introduce
> KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it
> also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing
> KM_NOFS tags to keep lockdep happy") and use the new flag for them
> instead. There is really no reason to make these allocations contexts
> weaker just because of the lockdep which even might not be enabled
> in most cases.
>

Hi Michal,

I haven't gone back to fully grok b17cb364dbbb ("xfs: fix missing
KM_NOFS tags to keep lockdep happy"), so I'm not really familiar with
the original problem. FWIW, there was another KM_NOFS instance added by
that commit in xlog_cil_prepare_log_vecs() that is now in
xlog_cil_alloc_shadow_bufs(). Perhaps Dave can confirm whether the
original issue still applies..?

Brian

> Changes since v1
> - check for KM_NOLOCKDEP in kmem_flags_convert to not hit sanity BUG_ON
> as per Mike Galbraith
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> fs/xfs/kmem.h | 6 +++++-
> fs/xfs/libxfs/xfs_da_btree.c | 4 ++--
> fs/xfs/xfs_buf.c | 2 +-
> fs/xfs/xfs_dir2_readdir.c | 2 +-
> 4 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index 689f746224e7..d5d634ef1f7f 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -33,6 +33,7 @@ typedef unsigned __bitwise xfs_km_flags_t;
> #define KM_NOFS ((__force xfs_km_flags_t)0x0004u)
> #define KM_MAYFAIL ((__force xfs_km_flags_t)0x0008u)
> #define KM_ZERO ((__force xfs_km_flags_t)0x0010u)
> +#define KM_NOLOCKDEP ((__force xfs_km_flags_t)0x0020u)
>
> /*
> * We use a special process flag to avoid recursive callbacks into
> @@ -44,7 +45,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
> {
> gfp_t lflags;
>
> - BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO));
> + BUG_ON(flags & ~(KM_SLEEP|KM_NOSLEEP|KM_NOFS|KM_MAYFAIL|KM_ZERO|KM_NOLOCKDEP));
>
> if (flags & KM_NOSLEEP) {
> lflags = GFP_ATOMIC | __GFP_NOWARN;
> @@ -57,6 +58,9 @@ kmem_flags_convert(xfs_km_flags_t flags)
> if (flags & KM_ZERO)
> lflags |= __GFP_ZERO;
>
> + if (flags & KM_NOLOCKDEP)
> + lflags |= __GFP_NOLOCKDEP;
> +
> return lflags;
> }
>
> diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c
> index f2dc1a950c85..b8b5f6914863 100644
> --- a/fs/xfs/libxfs/xfs_da_btree.c
> +++ b/fs/xfs/libxfs/xfs_da_btree.c
> @@ -2429,7 +2429,7 @@ xfs_buf_map_from_irec(
>
> if (nirecs > 1) {
> map = kmem_zalloc(nirecs * sizeof(struct xfs_buf_map),
> - KM_SLEEP | KM_NOFS);
> + KM_SLEEP | KM_NOLOCKDEP);
> if (!map)
> return -ENOMEM;
> *mapp = map;
> @@ -2488,7 +2488,7 @@ xfs_dabuf_map(
> */
> if (nfsb != 1)
> irecs = kmem_zalloc(sizeof(irec) * nfsb,
> - KM_SLEEP | KM_NOFS);
> + KM_SLEEP | KM_NOLOCKDEP);
>
> nirecs = nfsb;
> error = xfs_bmapi_read(dp, (xfs_fileoff_t)bno, nfsb, irecs,
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 7f0a01f7b592..f31ae592dcae 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1785,7 +1785,7 @@ xfs_alloc_buftarg(
> {
> xfs_buftarg_t *btp;
>
> - btp = kmem_zalloc(sizeof(*btp), KM_SLEEP | KM_NOFS);
> + btp = kmem_zalloc(sizeof(*btp), KM_SLEEP | KM_NOLOCKDEP);
>
> btp->bt_mount = mp;
> btp->bt_dev = bdev->bd_dev;
> diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c
> index 003a99b83bd8..033ed65d7ce6 100644
> --- a/fs/xfs/xfs_dir2_readdir.c
> +++ b/fs/xfs/xfs_dir2_readdir.c
> @@ -503,7 +503,7 @@ xfs_dir2_leaf_getdents(
> length = howmany(bufsize + geo->blksize, (1 << geo->fsblog));
> map_info = kmem_zalloc(offsetof(struct xfs_dir2_leaf_map_info, map) +
> (length * sizeof(struct xfs_bmbt_irec)),
> - KM_SLEEP | KM_NOFS);
> + KM_SLEEP | KM_NOLOCKDEP);
> map_info->map_size = length;
>
> /*
> --
> 2.10.2
>
> --
> Michal Hocko
> SUSE Labs

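A condensed sketch of what the two flags boil down to after this patch, based on the kmem_flags_convert() hunk quoted above (the helper name below is made up for illustration, and the real conversion also handles KM_NOSLEEP, KM_MAYFAIL and KM_ZERO; only the KM_NOFS vs. KM_NOLOCKDEP distinction the changelog argues about is shown):

	static gfp_t
	kmem_flags_sketch(xfs_km_flags_t flags)
	{
		gfp_t lflags = GFP_KERNEL | __GFP_NOWARN;

		/* KM_NOFS: the allocation itself must not recurse into
		 * filesystem reclaim, so __GFP_FS is dropped and the
		 * allocation context becomes genuinely weaker. */
		if (flags & KM_NOFS)
			lflags &= ~__GFP_FS;

		/* KM_NOLOCKDEP: full GFP_KERNEL reclaim stays allowed; only
		 * lockdep's reclaim dependency tracking is silenced for this
		 * call site via __GFP_NOLOCKDEP. */
		if (flags & KM_NOLOCKDEP)
			lflags |= __GFP_NOLOCKDEP;

		return lflags;
	}
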
2016-12-16 16:38:03

by Brian Foster

[permalink] [raw]
Subject: Re: [PATCH 3/9] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

On Thu, Dec 15, 2016 at 03:07:09PM +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
> some time ago. We would like to make this concept more generic and use
> it for other filesystems as well. Let's start by giving the flag a
> more genric name PF_MEMALLOC_NOFS which is in line with an exiting

Typos: generic existing

> PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
> contexts. Replace all PF_FSTRANS usage from the xfs code in the first
> step before we introduce a full API for it as xfs uses the flag directly
> anyway.
>
> This patch doesn't introduce any functional change.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---

Otherwise seems fine to me:

Reviewed-by: Brian Foster <[email protected]>

> fs/xfs/kmem.c | 4 ++--
> fs/xfs/kmem.h | 2 +-
> fs/xfs/libxfs/xfs_btree.c | 2 +-
> fs/xfs/xfs_aops.c | 6 +++---
> fs/xfs/xfs_trans.c | 12 ++++++------
> include/linux/sched.h | 2 ++
> 6 files changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index 339c696bbc01..a76a05dae96b 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -80,13 +80,13 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
> * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
> * the filesystem here and potentially deadlocking.
> */
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> noio_flag = memalloc_noio_save();
>
> lflags = kmem_flags_convert(flags);
> ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
>
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> memalloc_noio_restore(noio_flag);
>
> return ptr;
> diff --git a/fs/xfs/kmem.h b/fs/xfs/kmem.h
> index ea3984091d58..e40ddd12900b 100644
> --- a/fs/xfs/kmem.h
> +++ b/fs/xfs/kmem.h
> @@ -51,7 +51,7 @@ kmem_flags_convert(xfs_km_flags_t flags)
> lflags = GFP_ATOMIC | __GFP_NOWARN;
> } else {
> lflags = GFP_KERNEL | __GFP_NOWARN;
> - if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
> + if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> lflags &= ~__GFP_FS;
> }
>
> diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
> index 21e6a6ab6b9a..a2672ba4dc33 100644
> --- a/fs/xfs/libxfs/xfs_btree.c
> +++ b/fs/xfs/libxfs/xfs_btree.c
> @@ -2866,7 +2866,7 @@ xfs_btree_split_worker(
> struct xfs_btree_split_args *args = container_of(work,
> struct xfs_btree_split_args, work);
> unsigned long pflags;
> - unsigned long new_pflags = PF_FSTRANS;
> + unsigned long new_pflags = PF_MEMALLOC_NOFS;
>
> /*
> * we are in a transaction context here, but may also be doing work
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 0f56fcd3a5d5..61ca9f9c5a12 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -189,7 +189,7 @@ xfs_setfilesize_trans_alloc(
> * We hand off the transaction to the completion thread now, so
> * clear the flag here.
> */
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> return 0;
> }
>
> @@ -252,7 +252,7 @@ xfs_setfilesize_ioend(
> * thus we need to mark ourselves as being in a transaction manually.
> * Similarly for freeze protection.
> */
> - current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> __sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
>
> /* we abort the update if there was an IO error */
> @@ -1015,7 +1015,7 @@ xfs_do_writepage(
> * Given that we do not allow direct reclaim to call us, we should
> * never be called while in a filesystem transaction.
> */
> - if (WARN_ON_ONCE(current->flags & PF_FSTRANS))
> + if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
> goto redirty;
>
> /*
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index 70f42ea86dfb..f5969c8274fc 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -134,7 +134,7 @@ xfs_trans_reserve(
> bool rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
>
> /* Mark this thread as being in a transaction */
> - current_set_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>
> /*
> * Attempt to reserve the needed disk blocks by decrementing
> @@ -144,7 +144,7 @@ xfs_trans_reserve(
> if (blocks > 0) {
> error = xfs_mod_fdblocks(tp->t_mountp, -((int64_t)blocks), rsvd);
> if (error != 0) {
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> return -ENOSPC;
> }
> tp->t_blk_res += blocks;
> @@ -221,7 +221,7 @@ xfs_trans_reserve(
> tp->t_blk_res = 0;
> }
>
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>
> return error;
> }
> @@ -914,7 +914,7 @@ __xfs_trans_commit(
>
> xfs_log_commit_cil(mp, tp, &commit_lsn, regrant);
>
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> xfs_trans_free(tp);
>
> /*
> @@ -944,7 +944,7 @@ __xfs_trans_commit(
> if (commit_lsn == -1 && !error)
> error = -EIO;
> }
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
> xfs_trans_free_items(tp, NULLCOMMITLSN, !!error);
> xfs_trans_free(tp);
>
> @@ -998,7 +998,7 @@ xfs_trans_cancel(
> xfs_log_done(mp, tp->t_ticket, NULL, false);
>
> /* mark this thread as no longer being in a transaction */
> - current_restore_flags_nested(&tp->t_pflags, PF_FSTRANS);
> + current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
>
> xfs_trans_free_items(tp, NULLCOMMITLSN, dirty);
> xfs_trans_free(tp);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4d1905245c7a..baffd340ea82 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2320,6 +2320,8 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
> #define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezable */
> #define PF_SUSPEND_TASK 0x80000000 /* this thread called freeze_processes and should not be frozen */
>
> +#define PF_MEMALLOC_NOFS PF_FSTRANS /* Transition to a more generic GFP_NOFS scope semantic */
> +
> /*
> * Only the _current_ task can read/write to tsk->flags, but other
> * tasks can access tsk->flags in readonly mode for example
> --
> 2.10.2
>

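As a side note for anyone skimming the series: the scope semantic this rename prepares for reduces to the pattern below, pieced together from the xfs_trans_reserve() and __xfs_trans_commit() hunks above (a simplified sketch; once the later patches teach the allocator about PF_MEMALLOC_NOFS, every allocation inside such a scope is implicitly GFP_NOFS):

	/* Entering the transaction: mark this task so that memory reclaim
	 * must not recurse into the filesystem from here on. */
	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);

	/* ... reserve blocks and log space, run the transaction ... */

	/* Leaving the transaction: restore the previous flag state so that
	 * nested transactions and other NOFS scopes keep working. */
	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
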
2016-12-16 16:38:14

by Brian Foster

[permalink] [raw]
Subject: Re: [PATCH 5/9] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

On Thu, Dec 15, 2016 at 03:07:11PM +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
> API to prevent from reclaim recursion into the fs because vmalloc can
> invoke unconditional GFP_KERNEL allocations and these functions might be
> called from the NOFS contexts. The memalloc_noio_save will enforce
> GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
> unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
> provide exactly what we need here - implicit GFP_NOFS context.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> fs/xfs/kmem.c | 10 +++++-----
> fs/xfs/xfs_buf.c | 8 ++++----
> 2 files changed, 9 insertions(+), 9 deletions(-)
>
...
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index f31ae592dcae..5c6f9bd4d8be 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -441,17 +441,17 @@ _xfs_buf_map_pages(
> bp->b_addr = NULL;
> } else {
> int retried = 0;
> - unsigned noio_flag;
> + unsigned nofs_flag;
>
> /*
> * vm_map_ram() will allocate auxillary structures (e.g.
> * pagetables) with GFP_KERNEL, yet we are likely to be under
> * GFP_NOFS context here. Hence we need to tell memory reclaim
> - * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> + * that we are in such a context via PF_MEMALLOC_NOFS to prevent
> * memory reclaim re-entering the filesystem here and
> * potentially deadlocking.
> */
> - noio_flag = memalloc_noio_save();
> + nofs_flag = memalloc_nofs_save();
> do {
> bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
> -1, PAGE_KERNEL);
> @@ -459,7 +459,7 @@ _xfs_buf_map_pages(
> break;
> vm_unmap_aliases();
> } while (retried++ <= 1);
> - memalloc_noio_restore(noio_flag);
> + memalloc_noio_restore(nofs_flag);

memalloc_nofs_restore() ?

Brian

>
> if (!bp->b_addr)
> return -ENOMEM;
> --
> 2.10.2
>

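For context, the converted path in _xfs_buf_map_pages() then reads roughly as below (a sketch only: the retry loop from the quoted hunk is dropped, and the restore call uses the memalloc_nofs_restore() spelling Brian points out above):

	unsigned nofs_flag;

	/* vm_map_ram() allocates its auxiliary structures with GFP_KERNEL;
	 * the NOFS scope keeps those allocations from re-entering the
	 * filesystem while we may already hold fs locks. */
	nofs_flag = memalloc_nofs_save();
	bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count, -1, PAGE_KERNEL);
	memalloc_nofs_restore(nofs_flag);

	if (!bp->b_addr)
		return -ENOMEM;
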
2016-12-16 22:00:32

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 5/9 v2] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

On Fri 16-12-16 11:38:11, Brian Foster wrote:
> On Thu, Dec 15, 2016 at 03:07:11PM +0100, Michal Hocko wrote:
[...]
> > @@ -459,7 +459,7 @@ _xfs_buf_map_pages(
> > break;
> > vm_unmap_aliases();
> > } while (retried++ <= 1);
> > - memalloc_noio_restore(noio_flag);
> > + memalloc_noio_restore(nofs_flag);
>
> memalloc_nofs_restore() ?

Oops, you are right of course. Fixed.
---

2016-12-16 22:01:30

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/9 v2] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives

On Fri 16-12-16 11:37:50, Brian Foster wrote:
> On Fri, Dec 16, 2016 at 04:40:41PM +0100, Michal Hocko wrote:
> > Updated patch after Mike noticed a BUG_ON when KM_NOLOCKDEP is used.
> > ---
> > From 1497e713e11639157aef21cae29052cb3dc7ab44 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Thu, 15 Dec 2016 13:06:43 +0100
> > Subject: [PATCH] xfs: introduce and use KM_NOLOCKDEP to silence reclaim
> > lockdep false positives
> >
> > Now that the page allocator offers __GFP_NOLOCKDEP let's introduce
> > KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it
> > also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing
> > KM_NOFS tags to keep lockdep happy") and use the new flag for them
> > instead. There is really no reason to make these allocations contexts
> > weaker just because of the lockdep which even might not be enabled
> > in most cases.
> >
>
> Hi Michal,
>
> I haven't gone back to fully grok b17cb364dbbb ("xfs: fix missing
> KM_NOFS tags to keep lockdep happy"), so I'm not really familiar with
> the original problem. FWIW, there was another KM_NOFS instance added by
> that commit in xlog_cil_prepare_log_vecs() that is now in
> xlog_cil_alloc_shadow_bufs(). Perhaps Dave can confirm whether the
> original issue still applies..?

Yes, I had noticed that, but the reworked code looked sufficiently
different that I didn't dare to simply convert it.
--
Michal Hocko
SUSE Labs

2016-12-19 09:25:14

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/9 v2] scope GFP_NOFS api

On Fri 16-12-16 17:27:28, Mike Galbraith wrote:
> On Fri, 2016-12-16 at 16:35 +0100, Michal Hocko wrote:
> > On Fri 16-12-16 16:05:58, Mike Galbraith wrote:
> > > On Thu, 2016-12-15 at 15:07 +0100, Michal Hocko wrote:
> > > > Hi,
> > > > I have posted the previous version here [1]. Since then I have added a
> > > > support to suppress reclaim lockdep warnings (__GFP_NOLOCKDEP) to allow
> > > > removing GFP_NOFS usage motivated by the lockdep false positives. On top
> > > > of that I've tried to convert few KM_NOFS usages to use the new flag in
> > > > the xfs code base. This would need a review from somebody familiar with
> > > > xfs of course.
> > >
> > > The wild ass guess below prevents the xfs explosion below when running
> > > ltp zram tests.
> >
> > Yes, this looks correct. Thanks for noticing. I will fold it into
> > patch 2. Thanks for testing, Mike!
>
> I had ulterior motives: I was hoping you might have made the irksome RT
> gripe below just _go away_, as staring at it ain't working out ;-)
>
> [ 1441.309006] =========================================================
> [ 1441.309006] [ INFO: possible irq lock inversion dependency detected ]
> [ 1441.309007] 4.10.0-rt9-rt #11 Tainted: G E
> [ 1441.309007] ---------------------------------------------------------
> [ 1441.309008] kswapd0/165 just changed the state of lock:
> [ 1441.309009] (&journal->j_state_lock){+.+.-.}, at: [<ffffffffa00a6d60>] jbd2_complete_transaction+0x20/0x90 [jbd2]
> [ 1441.309017] but this lock took another, RECLAIM_FS-unsafe lock in the past:
> [ 1441.309017] (&tb->tb6_lock){+.+.+.}
> [ 1441.309018] and interrupts could create inverse lock ordering between them.
> [ 1441.309018] other info that might help us debug this:
> [ 1441.309018] Chain exists of: &journal->j_state_lock --> &journal->j_list_lock --> &tb->tb6_lock
> [ 1441.309019] Possible interrupt unsafe locking scenario:
> [ 1441.309019] CPU0 CPU1
> [ 1441.309019] ---- ----
> [ 1441.309019] lock(&tb->tb6_lock);
> [ 1441.309020] local_irq_disable();
> [ 1441.309020] lock(&journal->j_state_lock);
> [ 1441.309020] lock(&journal->j_list_lock);
> [ 1441.309021] <Interrupt>
> [ 1441.309021] lock(&journal->j_state_lock);

Hum, so AFAICT this is just a false positive resulting from lockdep
allocation context tracking being implemented via interrupt context
tracking. At least I don't see how you could possibly enter FS reclaim
under &tb->tb6_lock...

Honza

> [ 1441.309021] *** DEADLOCK ***
> [ 1441.309022] 2 locks held by kswapd0/165:
> [ 1441.309022] #0: (shrinker_rwsem){+.+...}, at: [<ffffffff811efa2a>] shrink_slab+0x7a/0x6c0
> [ 1441.309027] #1: (&type->s_umount_key#29){+.+.+.}, at: [<ffffffff8126f20b>] trylock_super+0x1b/0x50
> [ 1441.309030] the shortest dependencies between 2nd lock and 1st lock:
> [ 1441.309031] -> (&tb->tb6_lock){+.+.+.} ops: 271 {
> [ 1441.309032] HARDIRQ-ON-W at:
> [ 1441.309035] [<ffffffff810e11b8>] __lock_acquire+0x938/0x1770
> [ 1441.309036] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
> [ 1441.309039] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
> [ 1441.309041] [<ffffffff816e66e3>] __ip6_ins_rt+0x33/0x70
> [ 1441.309043] [<ffffffff816eccd1>] ip6_route_add+0x81/0xd0
> [ 1441.309044] [<ffffffff816dbf33>] addrconf_prefix_route+0x133/0x1d0
> [ 1441.309046] [<ffffffff816e16eb>] inet6_addr_add+0x1eb/0x250
> [ 1441.309047] [<ffffffff816e294b>] inet6_rtm_newaddr+0x33b/0x410
> [ 1441.309049] [<ffffffff81613c35>] rtnetlink_rcv_msg+0x95/0x220
> [ 1441.309051] [<ffffffff8163a477>] netlink_rcv_skb+0xa7/0xc0
> [ 1441.309053] [<ffffffff8160de88>] rtnetlink_rcv+0x28/0x30
> [ 1441.309054] [<ffffffff81639e53>] netlink_unicast+0x143/0x1f0
> [ 1441.309055] [<ffffffff8163a222>] netlink_sendmsg+0x322/0x3a0
> [ 1441.309057] [<ffffffff815d5c48>] sock_sendmsg+0x38/0x50
> [ 1441.309058] [<ffffffff815d60a6>] SYSC_sendto+0xf6/0x170
> [ 1441.309060] [<ffffffff815d6f6e>] SyS_sendto+0xe/0x10
> [ 1441.309061] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
> [ 1441.309061] SOFTIRQ-ON-W at:
> [ 1441.309063] [<ffffffff810e0b03>] __lock_acquire+0x283/0x1770
> [ 1441.309064] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
> [ 1441.309064] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
> [ 1441.309065] [<ffffffff816e66e3>] __ip6_ins_rt+0x33/0x70
> [ 1441.309067] [<ffffffff816eccd1>] ip6_route_add+0x81/0xd0
> [ 1441.309067] [<ffffffff816dbf33>] addrconf_prefix_route+0x133/0x1d0
> [ 1441.309068] [<ffffffff816e16eb>] inet6_addr_add+0x1eb/0x250
> [ 1441.309069] [<ffffffff816e294b>] inet6_rtm_newaddr+0x33b/0x410
> [ 1441.309071] [<ffffffff81613c35>] rtnetlink_rcv_msg+0x95/0x220
> [ 1441.309073] [<ffffffff8163a477>] netlink_rcv_skb+0xa7/0xc0
> [ 1441.309074] [<ffffffff8160de88>] rtnetlink_rcv+0x28/0x30
> [ 1441.309075] [<ffffffff81639e53>] netlink_unicast+0x143/0x1f0
> [ 1441.309077] [<ffffffff8163a222>] netlink_sendmsg+0x322/0x3a0
> [ 1441.309078] [<ffffffff815d5c48>] sock_sendmsg+0x38/0x50
> [ 1441.309079] [<ffffffff815d60a6>] SYSC_sendto+0xf6/0x170
> [ 1441.309080] [<ffffffff815d6f6e>] SyS_sendto+0xe/0x10
> [ 1441.309081] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
> [ 1441.309081] RECLAIM_FS-ON-W at:
> [ 1441.309082] [<ffffffff810e0316>] mark_held_locks+0x66/0x90
> [ 1441.309084] [<ffffffff810e34a8>] lockdep_trace_alloc+0xd8/0x120
> [ 1441.309085] [<ffffffff81246df6>] kmem_cache_alloc_node+0x36/0x310
> [ 1441.309086] [<ffffffff815dfd4e>] __alloc_skb+0x4e/0x280
> [ 1441.309088] [<ffffffff816ee6ac>] inet6_rt_notify+0x5c/0x130
> [ 1441.309089] [<ffffffff816f101b>] fib6_add+0x56b/0xa30
> [ 1441.309090] [<ffffffff816e66f8>] __ip6_ins_rt+0x48/0x70
> [ 1441.309091] [<ffffffff816eccd1>] ip6_route_add+0x81/0xd0
> [ 1441.309092] [<ffffffff816dbf33>] addrconf_prefix_route+0x133/0x1d0
> [ 1441.309093] [<ffffffff816e16eb>] inet6_addr_add+0x1eb/0x250
> [ 1441.309094] [<ffffffff816e294b>] inet6_rtm_newaddr+0x33b/0x410
> [ 1441.309096] [<ffffffff81613c35>] rtnetlink_rcv_msg+0x95/0x220
> [ 1441.309097] [<ffffffff8163a477>] netlink_rcv_skb+0xa7/0xc0
> [ 1441.309098] [<ffffffff8160de88>] rtnetlink_rcv+0x28/0x30
> [ 1441.309099] [<ffffffff81639e53>] netlink_unicast+0x143/0x1f0
> [ 1441.309100] [<ffffffff8163a222>] netlink_sendmsg+0x322/0x3a0
> [ 1441.309102] [<ffffffff815d5c48>] sock_sendmsg+0x38/0x50
> [ 1441.309103] [<ffffffff815d60a6>] SYSC_sendto+0xf6/0x170
> [ 1441.309104] [<ffffffff815d6f6e>] SyS_sendto+0xe/0x10
> [ 1441.309105] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
> [ 1441.309105] INITIAL USE at:
> [ 1441.309106] [<ffffffff810e0b4e>] __lock_acquire+0x2ce/0x1770
> [ 1441.309107] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
> [ 1441.309108] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
> [ 1441.309109] [<ffffffff816e66e3>] __ip6_ins_rt+0x33/0x70
> [ 1441.309110] [<ffffffff816eccd1>] ip6_route_add+0x81/0xd0
> [ 1441.309111] [<ffffffff816dbf33>] addrconf_prefix_route+0x133/0x1d0
> [ 1441.309112] [<ffffffff816e16eb>] inet6_addr_add+0x1eb/0x250
> [ 1441.309113] [<ffffffff816e294b>] inet6_rtm_newaddr+0x33b/0x410
> [ 1441.309115] [<ffffffff81613c35>] rtnetlink_rcv_msg+0x95/0x220
> [ 1441.309116] [<ffffffff8163a477>] netlink_rcv_skb+0xa7/0xc0
> [ 1441.309117] [<ffffffff8160de88>] rtnetlink_rcv+0x28/0x30
> [ 1441.309118] [<ffffffff81639e53>] netlink_unicast+0x143/0x1f0
> [ 1441.309119] [<ffffffff8163a222>] netlink_sendmsg+0x322/0x3a0
> [ 1441.309120] [<ffffffff815d5c48>] sock_sendmsg+0x38/0x50
> [ 1441.309121] [<ffffffff815d60a6>] SYSC_sendto+0xf6/0x170
> [ 1441.309122] [<ffffffff815d6f6e>] SyS_sendto+0xe/0x10
> [ 1441.309123] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
> [ 1441.309123] }
> [ 1441.309125] ... key at: [<ffffffff82dd96e0>] __key.59908+0x0/0x8
> [ 1441.309125] ... acquired at:
> [ 1441.309126] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
> [ 1441.309127] [<ffffffff8174d307>] rt_read_lock+0x47/0x60
> [ 1441.309128] [<ffffffff816ea541>] ip6_pol_route+0x61/0xa60
> [ 1441.309130] [<ffffffff816eaf5a>] ip6_pol_route_input+0x1a/0x20
> [ 1441.309131] [<ffffffff81718f21>] fib6_rule_action+0xa1/0x1e0
> [ 1441.309133] [<ffffffff81621e53>] fib_rules_lookup+0x153/0x2e0
> [ 1441.309134] [<ffffffff81719219>] fib6_rule_lookup+0x59/0xc0
> [ 1441.309135] [<ffffffff816e6a3e>] ip6_route_input_lookup+0x4e/0x60
> [ 1441.309136] [<ffffffff816ec59d>] ip6_route_input+0xdd/0x1a0
> [ 1441.309137] [<ffffffff816d87d0>] ip6_rcv_finish+0x60/0x200
> [ 1441.309139] [<ffffffffa09e00b0>] ip_sabotage_in+0x30/0x40 [br_netfilter]
> [ 1441.309141] [<ffffffff8163c7ac>] nf_hook_slow+0x2c/0xf0
> [ 1441.309142] [<ffffffff816d998a>] ipv6_rcv+0x72a/0x980
> [ 1441.309143] [<ffffffff815f81ef>] __netif_receive_skb_core+0x38f/0xd20
> [ 1441.309144] [<ffffffff815f8b98>] __netif_receive_skb+0x18/0x60
> [ 1441.309145] [<ffffffff815fa4c1>] netif_receive_skb_internal+0x61/0x1d0
> [ 1441.309147] [<ffffffff815fa668>] netif_receive_skb+0x38/0x180
> [ 1441.309151] [<ffffffffa09b77e5>] br_pass_frame_up+0xd5/0x2c0 [bridge]
> [ 1441.309154] [<ffffffffa09b7d66>] br_handle_frame_finish+0x256/0x5c0 [bridge]
> [ 1441.309156] [<ffffffffa09e159c>] br_nf_hook_thresh+0xac/0x220 [br_netfilter]
> [ 1441.309157] [<ffffffffa09e2ee3>] br_nf_pre_routing_finish_ipv6+0x1c3/0x340 [br_netfilter]
> [ 1441.309158] [<ffffffffa09e349d>] br_nf_pre_routing_ipv6+0xdd/0x27a [br_netfilter]
> [ 1441.309159] [<ffffffffa09e2942>] br_nf_pre_routing+0x1b2/0x540 [br_netfilter]
> [ 1441.309160] [<ffffffff8163c7ac>] nf_hook_slow+0x2c/0xf0
> [ 1441.309163] [<ffffffffa09b82f7>] br_handle_frame+0x227/0x5b0 [bridge]
> [ 1441.309164] [<ffffffff815f8036>] __netif_receive_skb_core+0x1d6/0xd20
> [ 1441.309165] [<ffffffff815f8b98>] __netif_receive_skb+0x18/0x60
> [ 1441.309166] [<ffffffff815fa4c1>] netif_receive_skb_internal+0x61/0x1d0
> [ 1441.309167] [<ffffffff815fbca2>] napi_gro_receive+0x192/0x250
> [ 1441.309171] [<ffffffffa03f2163>] rtl8169_poll+0x183/0x6a0 [r8169]
> [ 1441.309172] [<ffffffff815faea0>] net_rx_action+0x3b0/0x700
> [ 1441.309173] [<ffffffff810841a5>] do_current_softirqs+0x285/0x680
> [ 1441.309174] [<ffffffff81084607>] __local_bh_enable+0x67/0x80
> [ 1441.309177] [<ffffffff810f7d81>] irq_forced_thread_fn+0x41/0x60
> [ 1441.309178] [<ffffffff810f832f>] irq_thread+0x13f/0x1e0
> [ 1441.309179] [<ffffffff810a5ecc>] kthread+0x10c/0x140
> [ 1441.309180] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
> [ 1441.309181] -> (&per_cpu(local_softirq_locks[i], __cpu).lock){+.+...} ops: 3582145 {
> [ 1441.309182] HARDIRQ-ON-W at:
> [ 1441.309183] [<ffffffff810e11b8>] __lock_acquire+0x938/0x1770
> [ 1441.309184] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
> [ 1441.309185] [<ffffffff8174cdea>] rt_spin_lock__no_mg+0x5a/0x70
> [ 1441.309186] [<ffffffff81084094>] do_current_softirqs+0x174/0x680
> [ 1441.309187] [<ffffffff81084607>] __local_bh_enable+0x67/0x80
> [ 1441.309188] [<ffffffff8113c631>] cgroup_idr_alloc.constprop.41+0x61/0x80
> [ 1441.309190] [<ffffffff811d4d5b>] cgroup_setup_root+0x65/0x28f
> [ 1441.309191] [<ffffffff81d9ef88>] cgroup_init+0xf7/0x3e5
> [ 1441.309193] [<ffffffff81d770d1>] start_kernel+0x43f/0x484
> [ 1441.309194] [<ffffffff81d76599>] x86_64_start_reservations+0x2a/0x2c
> [ 1441.309195] [<ffffffff81d766d8>] x86_64_start_kernel+0x13d/0x14c
> [ 1441.309196] [<ffffffff810001b5>] start_cpu+0x5/0x14
> [ 1441.309196] SOFTIRQ-ON-W at:
> [ 1441.309197] [<ffffffff810e0b03>] __lock_acquire+0x283/0x1770
> [ 1441.309198] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
> [ 1441.309199] [<ffffffff8174cdea>] rt_spin_lock__no_mg+0x5a/0x70
> [ 1441.309200] [<ffffffff81084094>] do_current_softirqs+0x174/0x680
> [ 1441.309201] [<ffffffff81084607>] __local_bh_enable+0x67/0x80
> [ 1441.309202] [<ffffffff8113c631>] cgroup_idr_alloc.constprop.41+0x61/0x80
> [ 1441.309203] [<ffffffff811d4d5b>] cgroup_setup_root+0x65/0x28f
> [ 1441.309204] [<ffffffff81d9ef88>] cgroup_init+0xf7/0x3e5
> [ 1441.309204] [<ffffffff81d770d1>] start_kernel+0x43f/0x484
> [ 1441.309205] [<ffffffff81d76599>] x86_64_start_reservations+0x2a/0x2c
> [ 1441.309206] [<ffffffff81d766d8>] x86_64_start_kernel+0x13d/0x14c
> [ 1441.309207] [<ffffffff810001b5>] start_cpu+0x5/0x14
> [ 1441.309207] INITIAL USE at:
> [ 1441.309208] [<ffffffff810e0b4e>] __lock_acquire+0x2ce/0x1770
> [ 1441.309209] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
> [ 1441.309210] [<ffffffff8174cdea>] rt_spin_lock__no_mg+0x5a/0x70
> [ 1441.309211] [<ffffffff81084094>] do_current_softirqs+0x174/0x680
> [ 1441.309212] [<ffffffff81084607>] __local_bh_enable+0x67/0x80
> [ 1441.309213] [<ffffffff8113c631>] cgroup_idr_alloc.constprop.41+0x61/0x80
> [ 1441.309213] [<ffffffff811d4d5b>] cgroup_setup_root+0x65/0x28f
> [ 1441.309215] [<ffffffff81d9ef88>] cgroup_init+0xf7/0x3e5
> [ 1441.309216] [<ffffffff81d770d1>] start_kernel+0x43f/0x484
> [ 1441.309216] [<ffffffff81d76599>] x86_64_start_reservations+0x2a/0x2c
> [ 1441.309217] [<ffffffff81d766d8>] x86_64_start_kernel+0x13d/0x14c
> [ 1441.309218] [<ffffffff810001b5>] start_cpu+0x5/0x14
> [ 1441.309218] }
> [ 1441.309220] ... key at: [<ffffffff81f6c110>] __key.38555+0x0/0x8
> [ 1441.309220] ... acquired at:
> [ 1441.309221] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
> [ 1441.309222] [<ffffffff8174cdea>] rt_spin_lock__no_mg+0x5a/0x70
> [ 1441.309222] [<ffffffff81084094>] do_current_softirqs+0x174/0x680
> [ 1441.309223] [<ffffffff81084607>] __local_bh_enable+0x67/0x80
> [ 1441.309225] [<ffffffff81200bc9>] wb_wakeup_delayed+0x69/0x70
> [ 1441.309226] [<ffffffff812a0aab>] __mark_inode_dirty+0x60b/0x7c0
> [ 1441.309227] [<ffffffff812ab235>] mark_buffer_dirty+0xb5/0x240
> [ 1441.309231] [<ffffffffa009ba1d>] __jbd2_journal_temp_unlink_buffer+0xbd/0xe0 [jbd2]
> [ 1441.309233] [<ffffffffa009e40a>] __jbd2_journal_refile_buffer+0xba/0xe0 [jbd2]
> [ 1441.309235] [<ffffffffa009fbed>] jbd2_journal_commit_transaction+0x112d/0x2130 [jbd2]
> [ 1441.309237] [<ffffffffa00a53dd>] kjournald2+0xcd/0x270 [jbd2]
> [ 1441.309239] [<ffffffff810a5ecc>] kthread+0x10c/0x140
> [ 1441.309239] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
> [ 1441.309240] -> (&journal->j_list_lock){+.+...} ops: 587416 {
> [ 1441.309241] HARDIRQ-ON-W at:
> [ 1441.309242] [<ffffffff810e11b8>] __lock_acquire+0x938/0x1770
> [ 1441.309243] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
> [ 1441.309244] [<ffffffff8174cd6f>] rt_spin_lock+0x5f/0x80
> [ 1441.309246] [<ffffffffa009d529>] do_get_write_access+0x3b9/0x5c0 [jbd2]
> [ 1441.309248] [<ffffffffa009d761>] jbd2_journal_get_write_access+0x31/0x60 [jbd2]
> [ 1441.309259] [<ffffffffa0105259>] __ext4_journal_get_write_access+0x49/0x90 [ext4]
> [ 1441.309264] [<ffffffffa00c5cb2>] ext4_file_open+0x1c2/0x230 [ext4]
> [ 1441.309265] [<ffffffff81268281>] do_dentry_open+0x231/0x360
> [ 1441.309266] [<ffffffff81269642>] vfs_open+0x52/0x80
> [ 1441.309268] [<ffffffff8127b206>] path_openat+0x476/0xdd0
> [ 1441.309269] [<ffffffff8127d64e>] do_filp_open+0x7e/0xd0
> [ 1441.309270] [<ffffffff812723b7>] do_open_execat+0x67/0x150
> [ 1441.309271] [<ffffffff812739ec>] do_execveat_common.isra.34+0x25c/0x9a0
> [ 1441.309272] [<ffffffff8127415c>] do_execve+0x2c/0x30
> [ 1441.309274] [<ffffffff81098c46>] call_usermodehelper_exec_async+0xf6/0x130
> [ 1441.309274] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
> [ 1441.309275] SOFTIRQ-ON-W at:
> [ 1441.309276] [<ffffffff810e0b03>] __lock_acquire+0x283/0x1770
> [ 1441.309277] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
> [ 1441.309277] [<ffffffff8174cd6f>] rt_spin_lock+0x5f/0x80
> [ 1441.309279] [<ffffffffa009d529>] do_get_write_access+0x3b9/0x5c0 [jbd2]
> [ 1441.309281] [<ffffffffa009d761>] jbd2_journal_get_write_access+0x31/0x60 [jbd2]
> [ 1441.309288] [<ffffffffa0105259>] __ext4_journal_get_write_access+0x49/0x90 [ext4]
> [ 1441.309293] [<ffffffffa00c5cb2>] ext4_file_open+0x1c2/0x230 [ext4]
> [ 1441.309294] [<ffffffff81268281>] do_dentry_open+0x231/0x360
> [ 1441.309295] [<ffffffff81269642>] vfs_open+0x52/0x80
> [ 1441.309296] [<ffffffff8127b206>] path_openat+0x476/0xdd0
> [ 1441.309297] [<ffffffff8127d64e>] do_filp_open+0x7e/0xd0
> [ 1441.309298] [<ffffffff812723b7>] do_open_execat+0x67/0x150
> [ 1441.309299] [<ffffffff812739ec>] do_execveat_common.isra.34+0x25c/0x9a0
> [ 1441.309300] [<ffffffff8127415c>] do_execve+0x2c/0x30
> [ 1441.309301] [<ffffffff81098c46>] call_usermodehelper_exec_async+0xf6/0x130
> [ 1441.309302] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
> [ 1441.309302] INITIAL USE at:
> [ 1441.309303] [<ffffffff810e0b4e>] __lock_acquire+0x2ce/0x1770
> [ 1441.309304] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
> [ 1441.309305] [<ffffffff8174cd6f>] rt_spin_lock+0x5f/0x80
> [ 1441.309307] [<ffffffffa009d529>] do_get_write_access+0x3b9/0x5c0 [jbd2]
> [ 1441.309308] [<ffffffffa009d761>] jbd2_journal_get_write_access+0x31/0x60 [jbd2]
> [ 1441.309314] [<ffffffffa0105259>] __ext4_journal_get_write_access+0x49/0x90 [ext4]
> [ 1441.309319] [<ffffffffa00c5cb2>] ext4_file_open+0x1c2/0x230 [ext4]
> [ 1441.309319] [<ffffffff81268281>] do_dentry_open+0x231/0x360
> [ 1441.309320] [<ffffffff81269642>] vfs_open+0x52/0x80
> [ 1441.309321] [<ffffffff8127b206>] path_openat+0x476/0xdd0
> [ 1441.309322] [<ffffffff8127d64e>] do_filp_open+0x7e/0xd0
> [ 1441.309323] [<ffffffff812723b7>] do_open_execat+0x67/0x150
> [ 1441.309324] [<ffffffff812739ec>] do_execveat_common.isra.34+0x25c/0x9a0
> [ 1441.309325] [<ffffffff8127415c>] do_execve+0x2c/0x30
> [ 1441.309326] [<ffffffff81098c46>] call_usermodehelper_exec_async+0xf6/0x130
> [ 1441.309327] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
> [ 1441.309327] }
> [ 1441.309330] ... key at: [<ffffffffa00af5c0>] __key.47251+0x0/0xffffffffffff9a40 [jbd2]
> [ 1441.309330] ... acquired at:
> [ 1441.309331] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
> [ 1441.309331] [<ffffffff8174cd6f>] rt_spin_lock+0x5f/0x80
> [ 1441.309333] [<ffffffffa009ee25>] jbd2_journal_commit_transaction+0x365/0x2130 [jbd2]
> [ 1441.309335] [<ffffffffa00a53dd>] kjournald2+0xcd/0x270 [jbd2]
> [ 1441.309337] [<ffffffff810a5ecc>] kthread+0x10c/0x140
> [ 1441.309337] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
> [ 1441.309338] -> (&journal->j_state_lock){+.+.-.} ops: 5849939 {
> [ 1441.309339] HARDIRQ-ON-W at:
> [ 1441.309340] [<ffffffff810e11b8>] __lock_acquire+0x938/0x1770
> [ 1441.309341] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
> [ 1441.309341] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
> [ 1441.309348] [<ffffffffa00e9c0d>] ext4_init_journal_params+0x4d/0xc0 [ext4]
> [ 1441.309353] [<ffffffffa00f515e>] ext4_fill_super+0x1e4e/0x3770 [ext4]
> [ 1441.309355] [<ffffffff8126ee4a>] mount_bdev+0x18a/0x1c0
> [ 1441.309360] [<ffffffffa00e9705>] ext4_mount+0x15/0x20 [ext4]
> [ 1441.309361] [<ffffffff8126fa79>] mount_fs+0x39/0x170
> [ 1441.309362] [<ffffffff81290817>] vfs_kern_mount+0x67/0x130
> [ 1441.309363] [<ffffffff812939fb>] do_mount+0x1bb/0xc60
> [ 1441.309364] [<ffffffff81294773>] SyS_mount+0x83/0xd0
> [ 1441.309365] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
> [ 1441.309365] SOFTIRQ-ON-W at:
> [ 1441.309366] [<ffffffff810e0b03>] __lock_acquire+0x283/0x1770
> [ 1441.309367] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
> [ 1441.309368] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
> [ 1441.309373] [<ffffffffa00e9c0d>] ext4_init_journal_params+0x4d/0xc0 [ext4]
> [ 1441.309378] [<ffffffffa00f515e>] ext4_fill_super+0x1e4e/0x3770 [ext4]
> [ 1441.309379] [<ffffffff8126ee4a>] mount_bdev+0x18a/0x1c0
> [ 1441.309383] [<ffffffffa00e9705>] ext4_mount+0x15/0x20 [ext4]
> [ 1441.309385] [<ffffffff8126fa79>] mount_fs+0x39/0x170
> [ 1441.309385] [<ffffffff81290817>] vfs_kern_mount+0x67/0x130
> [ 1441.309386] [<ffffffff812939fb>] do_mount+0x1bb/0xc60
> [ 1441.309387] [<ffffffff81294773>] SyS_mount+0x83/0xd0
> [ 1441.309388] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
> [ 1441.309388] IN-RECLAIM_FS-W at:
> [ 1441.309390] [<ffffffff810e0b36>] __lock_acquire+0x2b6/0x1770
> [ 1441.309391] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
> [ 1441.309391] [<ffffffff8174d307>] rt_read_lock+0x47/0x60
> [ 1441.309394] [<ffffffffa00a6d60>] jbd2_complete_transaction+0x20/0x90 [jbd2]
> [ 1441.309398] [<ffffffffa00d472e>] ext4_evict_inode+0x37e/0x700 [ext4]
> [ 1441.309400] [<ffffffff8128b221>] evict+0xd1/0x1a0
> [ 1441.309401] [<ffffffff8128b33d>] dispose_list+0x4d/0x70
> [ 1441.309402] [<ffffffff8128c60b>] prune_icache_sb+0x4b/0x60
> [ 1441.309404] [<ffffffff8126f381>] super_cache_scan+0x141/0x190
> [ 1441.309405] [<ffffffff811efc27>] shrink_slab+0x277/0x6c0
> [ 1441.309406] [<ffffffff811f4523>] shrink_node+0x2e3/0x2f0
> [ 1441.309407] [<ffffffff811f5a7f>] kswapd+0x34f/0x980
> [ 1441.309409] [<ffffffff810a5ecc>] kthread+0x10c/0x140
> [ 1441.309409] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
> [ 1441.309410] INITIAL USE at:
> [ 1441.309411] [<ffffffff810e0b4e>] __lock_acquire+0x2ce/0x1770
> [ 1441.309412] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
> [ 1441.309412] [<ffffffff8174d291>] rt_write_lock+0x31/0x40
> [ 1441.309417] [<ffffffffa00e9c0d>] ext4_init_journal_params+0x4d/0xc0 [ext4]
> [ 1441.309422] [<ffffffffa00f515e>] ext4_fill_super+0x1e4e/0x3770 [ext4]
> [ 1441.309423] [<ffffffff8126ee4a>] mount_bdev+0x18a/0x1c0
> [ 1441.309427] [<ffffffffa00e9705>] ext4_mount+0x15/0x20 [ext4]
> [ 1441.309429] [<ffffffff8126fa79>] mount_fs+0x39/0x170
> [ 1441.309429] [<ffffffff81290817>] vfs_kern_mount+0x67/0x130
> [ 1441.309430] [<ffffffff812939fb>] do_mount+0x1bb/0xc60
> [ 1441.309431] [<ffffffff81294773>] SyS_mount+0x83/0xd0
> [ 1441.309432] [<ffffffff8174d545>] entry_SYSCALL_64_fastpath+0x23/0xc6
> [ 1441.309432] }
> [ 1441.309435] ... key at: [<ffffffffa00af5b0>] __key.47253+0x0/0xffffffffffff9a50 [jbd2]
> [ 1441.309435] ... acquired at:
> [ 1441.309436] [<ffffffff810df99e>] check_usage_forwards+0x11e/0x120
> [ 1441.309437] [<ffffffff810e01a8>] mark_lock+0x1e8/0x2f0
> [ 1441.309437] [<ffffffff810e0b36>] __lock_acquire+0x2b6/0x1770
> [ 1441.309438] [<ffffffff810e2564>] lock_acquire+0xd4/0x270
> [ 1441.309439] [<ffffffff8174d307>] rt_read_lock+0x47/0x60
> [ 1441.309441] [<ffffffffa00a6d60>] jbd2_complete_transaction+0x20/0x90 [jbd2]
> [ 1441.309446] [<ffffffffa00d472e>] ext4_evict_inode+0x37e/0x700 [ext4]
> [ 1441.309447] [<ffffffff8128b221>] evict+0xd1/0x1a0
> [ 1441.309448] [<ffffffff8128b33d>] dispose_list+0x4d/0x70
> [ 1441.309449] [<ffffffff8128c60b>] prune_icache_sb+0x4b/0x60
> [ 1441.309450] [<ffffffff8126f381>] super_cache_scan+0x141/0x190
> [ 1441.309451] [<ffffffff811efc27>] shrink_slab+0x277/0x6c0
> [ 1441.309452] [<ffffffff811f4523>] shrink_node+0x2e3/0x2f0
> [ 1441.309453] [<ffffffff811f5a7f>] kswapd+0x34f/0x980
> [ 1441.309454] [<ffffffff810a5ecc>] kthread+0x10c/0x140
> [ 1441.309455] [<ffffffff8174d7da>] ret_from_fork+0x2a/0x40
> [ 1441.309455] stack backtrace:
> [ 1441.309457] CPU: 0 PID: 165 Comm: kswapd0 Tainted: G E 4.10.0-rt9-rt #11
> [ 1441.309457] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
> [ 1441.309457] Call Trace:
> [ 1441.309459] dump_stack+0x85/0xc8
> [ 1441.309461] print_irq_inversion_bug.part.34+0x1ac/0x1b8
> [ 1441.309462] check_usage_forwards+0x11e/0x120
> [ 1441.309463] ? check_usage_backwards+0x120/0x120
> [ 1441.309463] mark_lock+0x1e8/0x2f0
> [ 1441.309464] __lock_acquire+0x2b6/0x1770
> [ 1441.309465] ? __lock_acquire+0x420/0x1770
> [ 1441.309466] lock_acquire+0xd4/0x270
> [ 1441.309468] ? jbd2_complete_transaction+0x20/0x90 [jbd2]
> [ 1441.309469] rt_read_lock+0x47/0x60
> [ 1441.309471] ? jbd2_complete_transaction+0x20/0x90 [jbd2]
> [ 1441.309472] jbd2_complete_transaction+0x20/0x90 [jbd2]
> [ 1441.309477] ext4_evict_inode+0x37e/0x700 [ext4]
> [ 1441.309478] evict+0xd1/0x1a0
> [ 1441.309479] dispose_list+0x4d/0x70
> [ 1441.309480] prune_icache_sb+0x4b/0x60
> [ 1441.309481] super_cache_scan+0x141/0x190
> [ 1441.309482] shrink_slab+0x277/0x6c0
> [ 1441.309483] shrink_node+0x2e3/0x2f0
> [ 1441.309485] kswapd+0x34f/0x980
> [ 1441.309487] kthread+0x10c/0x140
> [ 1441.309488] ? mem_cgroup_shrink_node+0x390/0x390
> [ 1441.309488] ? kthread_park+0x90/0x90
> [ 1441.309489] ret_from_fork+0x2a/0x40
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-12-19 09:36:58

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 6/9] jbd2: mark the transaction context with the scope GFP_NOFS context

On Thu 15-12-16 15:07:12, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> now that we have memalloc_nofs_{save,restore} api we can mark the whole
> transaction context as implicitly GFP_NOFS. All allocations will
> automatically inherit GFP_NOFS this way. This means that we do not have
> to mark any of those requests with GFP_NOFS and moreover all the
> ext4_kv[mz]alloc(GFP_NOFS) are also safe now because even the hardcoded
> GFP_KERNEL allocations deep inside the vmalloc will be NOFS now.
>
> Signed-off-by: Michal Hocko <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/jbd2/transaction.c | 11 +++++++++++
> include/linux/jbd2.h | 2 ++
> 2 files changed, 13 insertions(+)
>
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index e1652665bd93..35a5d3d76182 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -388,6 +388,11 @@ static int start_this_handle(journal_t *journal, handle_t *handle,
>
> rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_);
> jbd2_journal_free_transaction(new_transaction);
> + /*
> + * Make sure that no allocations done while the transaction is
> + * open are going to recurse back to the fs layer.
> + */
> + handle->saved_alloc_context = memalloc_nofs_save();
> return 0;
> }
>
> @@ -466,6 +471,7 @@ handle_t *jbd2__journal_start(journal_t *journal, int nblocks, int rsv_blocks,
> trace_jbd2_handle_start(journal->j_fs_dev->bd_dev,
> handle->h_transaction->t_tid, type,
> line_no, nblocks);
> +
> return handle;
> }
> EXPORT_SYMBOL(jbd2__journal_start);
> @@ -1760,6 +1766,11 @@ int jbd2_journal_stop(handle_t *handle)
> if (handle->h_rsv_handle)
> jbd2_journal_free_reserved(handle->h_rsv_handle);
> free_and_exit:
> + /*
> + * scope of the GFP_NOFS context is over here and so we can
> + * restore the original alloc context.
> + */
> + memalloc_nofs_restore(handle->saved_alloc_context);
> jbd2_free_handle(handle);
> return err;
> }
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index dfaa1f4dcb0c..606b6bce3a5b 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -491,6 +491,8 @@ struct jbd2_journal_handle
>
> unsigned long h_start_jiffies;
> unsigned int h_requested_credits;
> +
> + unsigned int saved_alloc_context;
> };
>
>
> --
> 2.10.2
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
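For readers who have not followed the earlier patches in the series, here is a
minimal, illustrative sketch of the scope API that the jbd2 change relies on.
Only memalloc_nofs_save()/memalloc_nofs_restore() and vmalloc() are real
interfaces in this snippet; the variables are placeholders:

    unsigned int nofs_flags;
    void *buf;

    /*
     * Every allocation issued between save and restore behaves as if it
     * were GFP_NOFS, even when a callee hardcodes GFP_KERNEL (e.g. deep
     * inside vmalloc), because the allocator masks __GFP_FS for tasks
     * running inside the scope.
     */
    nofs_flags = memalloc_nofs_save();
    buf = vmalloc(size);            /* effectively a NOFS allocation */
    memalloc_nofs_restore(nofs_flags);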

2016-12-19 09:36:57

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 8/9] Revert "ext4: avoid deadlocks in the writeback path by using sb_getblk_gfp"

On Thu 15-12-16 15:07:14, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> This reverts commit c45653c341f5c8a0ce19c8f0ad4678640849cb86 because
> sb_getblk_gfp is not really needed as
> sb_getblk
> __getblk_gfp
> __getblk_slow
> grow_buffers
> grow_dev_page
> gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp
>
> so __GFP_FS is cleared unconditionally and therefore the above commit
> didn't have any real effect in fact.
>
> This patch should not introduce any functional change. The main point
> of this change is to reduce explicit GFP_NOFS usage inside ext4 code to
> make the review of the remaining usage easier.
>
> Signed-off-by: Michal Hocko <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/extents.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index b1f8416923ab..ef815eb72389 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -518,7 +518,7 @@ __read_extent_tree_block(const char *function, unsigned int line,
> struct buffer_head *bh;
> int err;
>
> - bh = sb_getblk_gfp(inode->i_sb, pblk, __GFP_MOVABLE | GFP_NOFS);
> + bh = sb_getblk(inode->i_sb, pblk);
> if (unlikely(!bh))
> return ERR_PTR(-ENOMEM);
>
> @@ -1096,7 +1096,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
> err = -EFSCORRUPTED;
> goto cleanup;
> }
> - bh = sb_getblk_gfp(inode->i_sb, newblock, __GFP_MOVABLE | GFP_NOFS);
> + bh = sb_getblk(inode->i_sb, newblock);
> if (unlikely(!bh)) {
> err = -ENOMEM;
> goto cleanup;
> @@ -1290,7 +1290,7 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode,
> if (newblock == 0)
> return err;
>
> - bh = sb_getblk_gfp(inode->i_sb, newblock, __GFP_MOVABLE | GFP_NOFS);
> + bh = sb_getblk(inode->i_sb, newblock);
> if (unlikely(!bh))
> return -ENOMEM;
> lock_buffer(bh);
> --
> 2.10.2
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
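As a side note on the call chain quoted in the commit message above, a condensed
sketch of the masking step in grow_dev_page() (paraphrased for illustration, not
verbatim kernel code) shows why the gfp argument made no practical difference:

    /*
     * gfp is what the caller handed to __getblk_gfp():
     *   sb_getblk()      passes __GFP_MOVABLE
     *   sb_getblk_gfp()  passed __GFP_MOVABLE | GFP_NOFS before the revert
     * Neither contains __GFP_FS, and the mapping mask has __GFP_FS cleared
     * before the OR, so the page allocation ends up NOFS either way.
     */
    gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp;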

2016-12-19 09:36:57

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 7/9] jbd2: make the whole kjournald2 kthread NOFS safe

On Thu 15-12-16 15:07:13, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> kjournald2 is central to the transaction commit processing. As such any
> potential allocation from this kernel thread has to be GFP_NOFS. Make
> sure to mark the whole kernel thread GFP_NOFS by the memalloc_nofs_save.
>
> Suggested-by: Jan Kara <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/jbd2/journal.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 8ed971eeab44..6dad8c5d6ddf 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -206,6 +206,13 @@ static int kjournald2(void *arg)
> wake_up(&journal->j_wait_done_commit);
>
> /*
> + * Make sure that no allocations from this kernel thread will ever recurse
> + * to the fs layer because we are responsible for the transaction commit
> + * and any fs involvement might get stuck waiting for the transaction commit.
> + */
> + memalloc_nofs_save();
> +
> + /*
> * And now, wait forever for commit wakeup events.
> */
> write_lock(&journal->j_state_lock);
> --
> 2.10.2
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-12-19 09:37:02

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 9/9] Revert "ext4: fix wrong gfp type under transaction"

On Thu 15-12-16 15:07:15, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> This reverts commit 216553c4b7f3e3e2beb4981cddca9b2027523928. Now that
> the transaction context uses memalloc_nofs_save and all allocations
> within this context inherit GFP_NOFS automatically, there is no
> reason to mark specific allocations explicitly.
>
> This patch should not introduce any functional change. The main point
> of this change is to reduce explicit GFP_NOFS usage inside ext4 code
> to make the review of the remaining usage easier.
>
> Signed-off-by: Michal Hocko <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/acl.c | 6 +++---
> fs/ext4/extents.c | 2 +-
> fs/ext4/resize.c | 4 ++--
> fs/ext4/xattr.c | 4 ++--
> 4 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
> index fd389935ecd1..9e98092c2a4b 100644
> --- a/fs/ext4/acl.c
> +++ b/fs/ext4/acl.c
> @@ -32,7 +32,7 @@ ext4_acl_from_disk(const void *value, size_t size)
> return ERR_PTR(-EINVAL);
> if (count == 0)
> return NULL;
> - acl = posix_acl_alloc(count, GFP_NOFS);
> + acl = posix_acl_alloc(count, GFP_KERNEL);
> if (!acl)
> return ERR_PTR(-ENOMEM);
> for (n = 0; n < count; n++) {
> @@ -94,7 +94,7 @@ ext4_acl_to_disk(const struct posix_acl *acl, size_t *size)
>
> *size = ext4_acl_size(acl->a_count);
> ext_acl = kmalloc(sizeof(ext4_acl_header) + acl->a_count *
> - sizeof(ext4_acl_entry), GFP_NOFS);
> + sizeof(ext4_acl_entry), GFP_KERNEL);
> if (!ext_acl)
> return ERR_PTR(-ENOMEM);
> ext_acl->a_version = cpu_to_le32(EXT4_ACL_VERSION);
> @@ -159,7 +159,7 @@ ext4_get_acl(struct inode *inode, int type)
> }
> retval = ext4_xattr_get(inode, name_index, "", NULL, 0);
> if (retval > 0) {
> - value = kmalloc(retval, GFP_NOFS);
> + value = kmalloc(retval, GFP_KERNEL);
> if (!value)
> return ERR_PTR(-ENOMEM);
> retval = ext4_xattr_get(inode, name_index, "", value, retval);
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index ef815eb72389..c901d89f0038 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -2933,7 +2933,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
> le16_to_cpu(path[k].p_hdr->eh_entries)+1;
> } else {
> path = kzalloc(sizeof(struct ext4_ext_path) * (depth + 1),
> - GFP_NOFS);
> + GFP_KERNEL);
> if (path == NULL) {
> ext4_journal_stop(handle);
> return -ENOMEM;
> diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
> index cf681004b196..e121f4e048b8 100644
> --- a/fs/ext4/resize.c
> +++ b/fs/ext4/resize.c
> @@ -816,7 +816,7 @@ static int add_new_gdb(handle_t *handle, struct inode *inode,
>
> n_group_desc = ext4_kvmalloc((gdb_num + 1) *
> sizeof(struct buffer_head *),
> - GFP_NOFS);
> + GFP_KERNEL);
> if (!n_group_desc) {
> err = -ENOMEM;
> ext4_warning(sb, "not enough memory for %lu groups",
> @@ -943,7 +943,7 @@ static int reserve_backup_gdb(handle_t *handle, struct inode *inode,
> int res, i;
> int err;
>
> - primary = kmalloc(reserved_gdb * sizeof(*primary), GFP_NOFS);
> + primary = kmalloc(reserved_gdb * sizeof(*primary), GFP_KERNEL);
> if (!primary)
> return -ENOMEM;
>
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index 5a94fa52b74f..172317462238 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -875,7 +875,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
>
> unlock_buffer(bs->bh);
> ea_bdebug(bs->bh, "cloning");
> - s->base = kmalloc(bs->bh->b_size, GFP_NOFS);
> + s->base = kmalloc(bs->bh->b_size, GFP_KERNEL);
> error = -ENOMEM;
> if (s->base == NULL)
> goto cleanup;
> @@ -887,7 +887,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
> }
> } else {
> /* Allocate a buffer where we construct the new block. */
> - s->base = kzalloc(sb->s_blocksize, GFP_NOFS);
> + s->base = kzalloc(sb->s_blocksize, GFP_KERNEL);
> /* assert(header == s->base) */
> error = -ENOMEM;
> if (s->base == NULL)
> --
> 2.10.2
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-12-19 21:24:19

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 2/9] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives

On Thu, Dec 15, 2016 at 03:07:08PM +0100, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Now that the page allocator offers __GFP_NOLOCKDEP let's introduce
> KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it
> also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing
> KM_NOFS tags to keep lockdep happy") and use the new flag for them
> instead. There is really no reason to make these allocations contexts
> weaker just because of the lockdep which even might not be enabled
> in most cases.
>
> Signed-off-by: Michal Hocko <[email protected]>

I'd suggest that it might be better to drop this patch for now -
it's not necessary for the context flag changeover but does
introduce a risk of regressions if the conversion is wrong.

Hence I think this is better as a completely separate series
which audits and changes all the unnecessary KM_NOFS allocations
in one go. I've never liked whack-a-mole style changes like this -
do it once, do it properly....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-12-19 22:07:16

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH 2/9] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives

On Tue, Dec 20, 2016 at 08:24:13AM +1100, Dave Chinner wrote:
> On Thu, Dec 15, 2016 at 03:07:08PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > Now that the page allocator offers __GFP_NOLOCKDEP let's introduce
> > KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it
> > also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing
> > KM_NOFS tags to keep lockdep happy") and use the new flag for them
> > instead. There is really no reason to make these allocations contexts
> > weaker just because of the lockdep which even might not be enabled
> > in most cases.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
>
> I'd suggest that it might be better to drop this patch for now -
> it's not necessary for the context flag changeover but does
> introduce a risk of regressions if the conversion is wrong.

I was just about to write in that while I didn't see anything obviously
wrong with the NOFS removals, I also don't know for sure that we can't
end up recursively in those code paths (specifically the directory
traversal thing).

--D

> Hence I think this is better as a completely separate series
> which audits and changes all the unnecessary KM_NOFS allocations
> in one go. I've never liked whack-a-mole style changes like this -
> do it once, do it properly....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2016-12-20 08:38:22

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 2/9] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives

On Tue 20-12-16 08:24:13, Dave Chinner wrote:
> On Thu, Dec 15, 2016 at 03:07:08PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > Now that the page allocator offers __GFP_NOLOCKDEP let's introduce
> > KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it
> > also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing
> > KM_NOFS tags to keep lockdep happy") and use the new flag for them
> > instead. There is really no reason to make these allocations contexts
> > weaker just because of the lockdep which even might not be enabled
> > in most cases.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
>
> I'd suggest that it might be better to drop this patch for now -
> it's not necessary for the context flag changeover but does
> introduce a risk of regressions if the conversion is wrong.
>
> Hence I think this is better as a completely separate series
> which audits and changes all the unnecessary KM_NOFS allocations
> in one go. I've never liked whack-a-mole style changes like this -
> do it once, do it properly....

OK, fair enough. I thought it might be better to have an example user that
others can follow, but as you say, the risk of regression is real and these
kinds of changes definitely need a thorough review.

I am not sure I will be able to post more of those changes because that
requires intimate knowledge of the fs, so I hope somebody can take over
and follow up.

Thanks!
--
Michal Hocko
SUSE Labs

2016-12-20 21:39:09

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 2/9] xfs: introduce and use KM_NOLOCKDEP to silence reclaim lockdep false positives

On Mon, Dec 19, 2016 at 02:06:19PM -0800, Darrick J. Wong wrote:
> On Tue, Dec 20, 2016 at 08:24:13AM +1100, Dave Chinner wrote:
> > On Thu, Dec 15, 2016 at 03:07:08PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <[email protected]>
> > >
> > > Now that the page allocator offers __GFP_NOLOCKDEP let's introduce
> > > KM_NOLOCKDEP alias for the xfs allocation APIs. While we are at it
> > > also change KM_NOFS users introduced by b17cb364dbbb ("xfs: fix missing
> > > KM_NOFS tags to keep lockdep happy") and use the new flag for them
> > > instead. There is really no reason to make these allocations contexts
> > > weaker just because of the lockdep which even might not be enabled
> > > in most cases.
> > >
> > > Signed-off-by: Michal Hocko <[email protected]>
> >
> > I'd suggest that it might be better to drop this patch for now -
> > it's not necessary for the context flag changeover but does
> > introduce a risk of regressions if the conversion is wrong.
>
> I was just about to write in that while I didn't see anything obviously
> wrong with the NOFS removals, I also don't know for sure that we can't
> > end up recursing into those code paths (specifically the directory
> traversal thing).

The issue is with code paths that can be called from both inside and
outside transaction context - lockdep complains when it sees an
allocation path that is used with both GFP_NOFS and GFP_KERNEL
> context, as it doesn't know whether the GFP_KERNEL usage is safe or
> not.

So things like the directory buffer path, which can be called from
readdir without a transaction context, have various KM_NOFS flags
scattered through it so that lockdep doesn't get all upset every
time readdir is called...

There are other cases like this - btree manipulation via bunmapi()
can be called without transaction context to remove delayed alloc
extents, and that puts all of the btree cursor and incore extent
list handling in the same boat (all those allocations are KM_NOFS),
etc.

So it's not really recursion that is the problem here - it's
different allocation contexts that lockdep can't know about unless
it's told about them. We've done that with KM_NOFS in the past; in
future we should use this KM_NOLOCKDEP flag, though I'd prefer a
better name for it, e.g. KM_NOTRANS to indicate that the allocation
can occur both inside and outside of transaction context....

Cheers,

Dave.
--
Dave Chinner
[email protected]
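To make the dual-context pattern described above concrete, here is a small
sketch. The helper and structure names are invented for the example; only
kmem_zalloc() and KM_NOFS are existing XFS interfaces, and KM_NOLOCKDEP is
the flag proposed in patch 2:

    /*
     * Hypothetical helper that can run both under a transaction (tp set)
     * and without one (tp == NULL, e.g. from readdir).  KM_NOFS is used
     * today only so lockdep does not see the same allocation site with
     * both GFP_KERNEL and GFP_NOFS; with the proposed flag it could be
     * KM_NOLOCKDEP (or KM_NOTRANS) and keep the stronger GFP_KERNEL
     * context for the non-transaction callers.
     */
    static struct xfs_foo *
    xfs_foo_alloc(struct xfs_trans *tp)
    {
            return kmem_zalloc(sizeof(struct xfs_foo), KM_NOFS);
    }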

2016-12-22 09:38:35

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/9 v2] scope GFP_NOFS api

Are there any objections to the approach, and can we have this merged into
the mm tree?

Dave has suggested that patch 2 should be dropped for now. I will do that
in the next submission, but I do not want to resubmit until there is a
consensus on this.

What do developers other than the xfs/ext4 folks think about this API? Can
we find a way to use it?
--
Michal Hocko
SUSE Labs