2018-04-09 03:09:55

by Minchan Kim

[permalink] [raw]
Subject: [PATCH] mm: workingset: fix NULL ptr dereference

Recently, I got a report like below.

[ 7858.792946] [<ffffff80086f4de0>] __list_del_entry+0x30/0xd0
[ 7858.792951] [<ffffff8008362018>] list_lru_del+0xac/0x1ac
[ 7858.792957] [<ffffff800830f04c>] page_cache_tree_insert+0xd8/0x110
[ 7858.792962] [<ffffff8008310188>] __add_to_page_cache_locked+0xf8/0x4e0
[ 7858.792967] [<ffffff800830ff34>] add_to_page_cache_lru+0x50/0x1ac
[ 7858.792972] [<ffffff800830fdd0>] pagecache_get_page+0x468/0x57c
[ 7858.792979] [<ffffff80085d081c>] __get_node_page+0x84/0x764
[ 7858.792986] [<ffffff800859cd94>] f2fs_iget+0x264/0xdc8
[ 7858.792991] [<ffffff800859ee00>] f2fs_lookup+0x3b4/0x660
[ 7858.792998] [<ffffff80083d2540>] lookup_slow+0x1e4/0x348
[ 7858.793003] [<ffffff80083d0eb8>] walk_component+0x21c/0x320
[ 7858.793008] [<ffffff80083d0010>] path_lookupat+0x90/0x1bc
[ 7858.793013] [<ffffff80083cfe6c>] filename_lookup+0x8c/0x1a0
[ 7858.793018] [<ffffff80083c52d0>] vfs_fstatat+0x84/0x10c
[ 7858.793023] [<ffffff80083c5b00>] SyS_newfstatat+0x28/0x64

v4.9 kenrel already has the d3798ae8c6f3,("mm: filemap: don't
plant shadow entries without radix tree node") so I thought
it should be okay. When I was googling, I found others report
such problem and I think current kernel still has the problem.

https://bugzilla.redhat.com/show_bug.cgi?id=1431567
https://bugzilla.redhat.com/show_bug.cgi?id=1420335

It assumes shadow entry of radix tree relies on the init state
that node->private_list allocated should be list_empty state.
Currently, it's initailized in SLAB constructor which means
node of radix tree would be initialized only when *slub allocates
new page*, not *new object*. So, if some FS or subsystem pass
gfp_mask to __GFP_ZERO, slub allocator will do memset blindly.
That means allocated node can have !list_empty(node->private_list).
It ends up calling NULL deference at workingset_update_node by
failing list_empty check.

This patch should fix it.

Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
Reported-by: Chris Fries <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Jan Kara <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
If it is reviewed and proved with testing, I will resend the patch to
Ccing [email protected].

Thanks.

lib/radix-tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 8e00138d593f..afcbdb6c495f 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -428,6 +428,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent,
ret->exceptional = exceptional;
ret->parent = parent;
ret->root = root;
+ INIT_LIST_HEAD(&ret->private_list);
}
return ret;
}
@@ -2234,7 +2235,6 @@ radix_tree_node_ctor(void *arg)
struct radix_tree_node *node = arg;

memset(node, 0, sizeof(*node));
- INIT_LIST_HEAD(&node->private_list);
}

static __init unsigned long __maxindex(unsigned int height)
--
2.17.0.484.g0c8726318c-goog



2018-04-09 03:10:09

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Mon, Apr 09, 2018 at 10:58:15AM +0900, Minchan Kim wrote:
> It assumes shadow entry of radix tree relies on the init state
> that node->private_list allocated should be list_empty state.
> Currently, it's initailized in SLAB constructor which means
> node of radix tree would be initialized only when *slub allocates
> new page*, not *new object*. So, if some FS or subsystem pass
> gfp_mask to __GFP_ZERO, slub allocator will do memset blindly.

Wait, what? Who's declaring their radix tree with GFP_ZERO flags?
I don't see anyone using INIT_RADIX_TREE or RADIX_TREE or RADIX_TREE_INIT
with GFP_ZERO.

Although, even if nobody's doing that intentionally, if somebody has
a bitflip with the __GFP_ZERO bit, it's going to propagate widely.
I think something like this might be appropriate:

diff --git a/mm/slub.c b/mm/slub.c
index 9e1100f9298f..0f55f0a0dcaa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2714,8 +2714,10 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
stat(s, ALLOC_FASTPATH);
}

- if (unlikely(gfpflags & __GFP_ZERO) && object)
- memset(object, 0, s->object_size);
+ if (unlikely(gfpflags & __GFP_ZERO) && object) {
+ if (!WARN_ON_ONCE(s->ctor))
+ memset(object, 0, s->object_size);
+ }

slab_post_alloc_hook(s, gfpflags, 1, &object);


Something you could try is checking that the list is empty when the node
is inserted into the radix tree.

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 8e00138d593f..580f52d0c072 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -428,6 +428,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent,
ret->exceptional = exceptional;
ret->parent = parent;
ret->root = root;
+ BUG_ON(!list_empty(&ret->private_list));
}
return ret;
}

2018-04-09 03:17:48

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Sun, Apr 08, 2018 at 07:49:25PM -0700, Matthew Wilcox wrote:
> On Mon, Apr 09, 2018 at 10:58:15AM +0900, Minchan Kim wrote:
> > It assumes shadow entry of radix tree relies on the init state
> > that node->private_list allocated should be list_empty state.
> > Currently, it's initailized in SLAB constructor which means
> > node of radix tree would be initialized only when *slub allocates
> > new page*, not *new object*. So, if some FS or subsystem pass
> > gfp_mask to __GFP_ZERO, slub allocator will do memset blindly.
>
> Wait, what? Who's declaring their radix tree with GFP_ZERO flags?
> I don't see anyone using INIT_RADIX_TREE or RADIX_TREE or RADIX_TREE_INIT
> with GFP_ZERO.

Look at fs/f2fs/inode.c
mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);

__add_to_page_cache_locked
radix_tree_maybe_preload

add_to_page_cache_lru

What's the wrong with setting __GFP_ZERO with mapping->gfp_mask?

>
> Although, even if nobody's doing that intentionally, if somebody has
> a bitflip with the __GFP_ZERO bit, it's going to propagate widely.
> I think something like this might be appropriate:
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 9e1100f9298f..0f55f0a0dcaa 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2714,8 +2714,10 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
> stat(s, ALLOC_FASTPATH);
> }
>
> - if (unlikely(gfpflags & __GFP_ZERO) && object)
> - memset(object, 0, s->object_size);
> + if (unlikely(gfpflags & __GFP_ZERO) && object) {
> + if (!WARN_ON_ONCE(s->ctor))
> + memset(object, 0, s->object_size);
> + }


>
> slab_post_alloc_hook(s, gfpflags, 1, &object);
>
>
> Something you could try is checking that the list is empty when the node
> is inserted into the radix tree.
>
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 8e00138d593f..580f52d0c072 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -428,6 +428,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent,
> ret->exceptional = exceptional;
> ret->parent = parent;
> ret->root = root;
> + BUG_ON(!list_empty(&ret->private_list));
> }
> return ret;
> }

2018-04-09 11:18:32

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Mon, Apr 09, 2018 at 12:09:30PM +0900, Minchan Kim wrote:
> On Sun, Apr 08, 2018 at 07:49:25PM -0700, Matthew Wilcox wrote:
> > On Mon, Apr 09, 2018 at 10:58:15AM +0900, Minchan Kim wrote:
> > > It assumes shadow entry of radix tree relies on the init state
> > > that node->private_list allocated should be list_empty state.
> > > Currently, it's initailized in SLAB constructor which means
> > > node of radix tree would be initialized only when *slub allocates
> > > new page*, not *new object*. So, if some FS or subsystem pass
> > > gfp_mask to __GFP_ZERO, slub allocator will do memset blindly.
> >
> > Wait, what? Who's declaring their radix tree with GFP_ZERO flags?
> > I don't see anyone using INIT_RADIX_TREE or RADIX_TREE or RADIX_TREE_INIT
> > with GFP_ZERO.
>
> Look at fs/f2fs/inode.c
> mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
>
> __add_to_page_cache_locked
> radix_tree_maybe_preload
>
> add_to_page_cache_lru
>
> What's the wrong with setting __GFP_ZERO with mapping->gfp_mask?

Because it's a stupid thing to do. Pages are allocated and then filled
from disk. Zeroing them before DMAing to them is just a waste of time.


2018-04-09 11:28:45

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Mon, Apr 09, 2018 at 04:14:03AM -0700, Matthew Wilcox wrote:
> On Mon, Apr 09, 2018 at 12:09:30PM +0900, Minchan Kim wrote:
> > On Sun, Apr 08, 2018 at 07:49:25PM -0700, Matthew Wilcox wrote:
> > > On Mon, Apr 09, 2018 at 10:58:15AM +0900, Minchan Kim wrote:
> > > > It assumes shadow entry of radix tree relies on the init state
> > > > that node->private_list allocated should be list_empty state.
> > > > Currently, it's initailized in SLAB constructor which means
> > > > node of radix tree would be initialized only when *slub allocates
> > > > new page*, not *new object*. So, if some FS or subsystem pass
> > > > gfp_mask to __GFP_ZERO, slub allocator will do memset blindly.
> > >
> > > Wait, what? Who's declaring their radix tree with GFP_ZERO flags?
> > > I don't see anyone using INIT_RADIX_TREE or RADIX_TREE or RADIX_TREE_INIT
> > > with GFP_ZERO.
> >
> > Look at fs/f2fs/inode.c
> > mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
> >
> > __add_to_page_cache_locked
> > radix_tree_maybe_preload
> >
> > add_to_page_cache_lru
> >
> > What's the wrong with setting __GFP_ZERO with mapping->gfp_mask?
>
> Because it's a stupid thing to do. Pages are allocated and then filled
> from disk. Zeroing them before DMAing to them is just a waste of time.

Every FSes do address_space to read pages from storage? I'm not sure.

If you're right, we need to insert WARN_ON to catch up __GFP_ZERO
on mapping_set_gfp_mask at the beginning and remove all of those
stupid thins.

Jaegeuk, why do you need __GFP_ZERO? Could you explain?

2018-04-09 12:29:38

by Chao Yu

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On 2018/4/9 19:25, Minchan Kim wrote:
> On Mon, Apr 09, 2018 at 04:14:03AM -0700, Matthew Wilcox wrote:
>> On Mon, Apr 09, 2018 at 12:09:30PM +0900, Minchan Kim wrote:
>>> On Sun, Apr 08, 2018 at 07:49:25PM -0700, Matthew Wilcox wrote:
>>>> On Mon, Apr 09, 2018 at 10:58:15AM +0900, Minchan Kim wrote:
>>>>> It assumes shadow entry of radix tree relies on the init state
>>>>> that node->private_list allocated should be list_empty state.
>>>>> Currently, it's initailized in SLAB constructor which means
>>>>> node of radix tree would be initialized only when *slub allocates
>>>>> new page*, not *new object*. So, if some FS or subsystem pass
>>>>> gfp_mask to __GFP_ZERO, slub allocator will do memset blindly.
>>>>
>>>> Wait, what? Who's declaring their radix tree with GFP_ZERO flags?
>>>> I don't see anyone using INIT_RADIX_TREE or RADIX_TREE or RADIX_TREE_INIT
>>>> with GFP_ZERO.
>>>
>>> Look at fs/f2fs/inode.c
>>> mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
>>>
>>> __add_to_page_cache_locked
>>> radix_tree_maybe_preload
>>>
>>> add_to_page_cache_lru
>>>
>>> What's the wrong with setting __GFP_ZERO with mapping->gfp_mask?
>>
>> Because it's a stupid thing to do. Pages are allocated and then filled
>> from disk. Zeroing them before DMAing to them is just a waste of time.
>
> Every FSes do address_space to read pages from storage? I'm not sure.

No, sometimes, we need to write meta data to new allocated block address,
then we will allocate a zeroed page in inner inode's address space, and
fill partial data in it, and leave other place with zero value which means
some fields are initial status.

There are two inner inodes (meta inode and node inode) setting __GFP_ZERO,
I have just checked them, for both of them, we can avoid using __GFP_ZERO,
and do initialization by ourselves to avoid unneeded/redundant zeroing
from mm.

To Jaegeuk, if I missed something, please let me know.

---
fs/f2fs/inode.c | 4 ++--
fs/f2fs/node.c | 2 ++
2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
index c85cccc2e800..cc63f8c448f0 100644
--- a/fs/f2fs/inode.c
+++ b/fs/f2fs/inode.c
@@ -339,10 +339,10 @@ struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
make_now:
if (ino == F2FS_NODE_INO(sbi)) {
inode->i_mapping->a_ops = &f2fs_node_aops;
- mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
+ mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
} else if (ino == F2FS_META_INO(sbi)) {
inode->i_mapping->a_ops = &f2fs_meta_aops;
- mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
+ mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
} else if (S_ISREG(inode->i_mode)) {
inode->i_op = &f2fs_file_inode_operations;
inode->i_fop = &f2fs_file_operations;
diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
index 9dedd4b5e077..31e5ecf98ffd 100644
--- a/fs/f2fs/node.c
+++ b/fs/f2fs/node.c
@@ -1078,6 +1078,7 @@ struct page *new_node_page(struct dnode_of_data *dn, unsigned int ofs)
set_node_addr(sbi, &new_ni, NEW_ADDR, false);

f2fs_wait_on_page_writeback(page, NODE, true);
+ memset(F2FS_NODE(page), 0, PAGE_SIZE);
fill_node_footer(page, dn->nid, dn->inode->i_ino, ofs, true);
set_cold_node(page, S_ISDIR(dn->inode->i_mode));
if (!PageUptodate(page))
@@ -2321,6 +2322,7 @@ int recover_inode_page(struct f2fs_sb_info *sbi, struct page *page)

if (!PageUptodate(ipage))
SetPageUptodate(ipage);
+ memset(F2FS_NODE(page), 0, PAGE_SIZE);
fill_node_footer(ipage, ino, ino, 0, true);
set_cold_node(page, false);

--

>
> If you're right, we need to insert WARN_ON to catch up __GFP_ZERO
> on mapping_set_gfp_mask at the beginning and remove all of those
> stupid thins.
>
> Jaegeuk, why do you need __GFP_ZERO? Could you explain?
>
> .
>


2018-04-09 12:52:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Mon 09-04-18 20:25:06, Chao Yu wrote:
[...]
> diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
> index c85cccc2e800..cc63f8c448f0 100644
> --- a/fs/f2fs/inode.c
> +++ b/fs/f2fs/inode.c
> @@ -339,10 +339,10 @@ struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
> make_now:
> if (ino == F2FS_NODE_INO(sbi)) {
> inode->i_mapping->a_ops = &f2fs_node_aops;
> - mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
> + mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);

An unrelated question. Why do you make all allocations for the mapping
NOFS automatically? What kind of reclaim recursion problems are you
trying to prevent?
--
Michal Hocko
SUSE Labs

2018-04-09 13:45:06

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Mon, Apr 09, 2018 at 02:48:52PM +0200, Michal Hocko wrote:
> On Mon 09-04-18 20:25:06, Chao Yu wrote:
> [...]
> > diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
> > index c85cccc2e800..cc63f8c448f0 100644
> > --- a/fs/f2fs/inode.c
> > +++ b/fs/f2fs/inode.c
> > @@ -339,10 +339,10 @@ struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
> > make_now:
> > if (ino == F2FS_NODE_INO(sbi)) {
> > inode->i_mapping->a_ops = &f2fs_node_aops;
> > - mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
> > + mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
>
> An unrelated question. Why do you make all allocations for the mapping
> NOFS automatically? What kind of reclaim recursion problems are you
> trying to prevent?

It's worth noting that this is endemic in filesystems.

$ git grep mapping_set_gfp_mask.*FS
drivers/block/loop.c: mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
fs/btrfs/disk-io.c: mapping_set_gfp_mask(fs_info->btree_inode->i_mapping, GFP_NOFS);
fs/f2fs/inode.c: mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
fs/f2fs/inode.c: mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
fs/gfs2/glock.c: mapping_set_gfp_mask(mapping, GFP_NOFS);
fs/gfs2/ops_fstype.c: mapping_set_gfp_mask(mapping, GFP_NOFS);
fs/jfs/jfs_imap.c: mapping_set_gfp_mask(ip->i_mapping, GFP_NOFS);
fs/jfs/super.c: mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
fs/nilfs2/gcinode.c: mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
fs/nilfs2/page.c: mapping_set_gfp_mask(mapping, GFP_NOFS);
fs/reiserfs/xattr.c: mapping_set_gfp_mask(mapping, GFP_NOFS);
fs/xfs/xfs_iops.c: mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));


2018-04-09 13:55:57

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Mon, Apr 09, 2018 at 06:41:14AM -0700, Matthew Wilcox wrote:
> It's worth noting that this is endemic in filesystems.

For the rationale in XFS take a look at commit ad22c7a043c2cc6792820e6c5da699935933e87d

2018-04-09 13:56:23

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Mon 09-04-18 06:41:14, Matthew Wilcox wrote:
> On Mon, Apr 09, 2018 at 02:48:52PM +0200, Michal Hocko wrote:
> > On Mon 09-04-18 20:25:06, Chao Yu wrote:
> > [...]
> > > diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
> > > index c85cccc2e800..cc63f8c448f0 100644
> > > --- a/fs/f2fs/inode.c
> > > +++ b/fs/f2fs/inode.c
> > > @@ -339,10 +339,10 @@ struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
> > > make_now:
> > > if (ino == F2FS_NODE_INO(sbi)) {
> > > inode->i_mapping->a_ops = &f2fs_node_aops;
> > > - mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
> > > + mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
> >
> > An unrelated question. Why do you make all allocations for the mapping
> > NOFS automatically? What kind of reclaim recursion problems are you
> > trying to prevent?
>
> It's worth noting that this is endemic in filesystems.

Yes, and I have strong suspicion that this is a mindless copy&pasting...
Well, xfs had a good reason for it in the past - mostly to handle deep
call stacks on complicated storage setups in the past when we used to
trigger IO from the direct reclaim. I am not sure whether there are
other reasons to keep the status quo except for finding somebody brave
enough to post the patch, do all the due testing.

> $ git grep mapping_set_gfp_mask.*FS
> drivers/block/loop.c: mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
> fs/btrfs/disk-io.c: mapping_set_gfp_mask(fs_info->btree_inode->i_mapping, GFP_NOFS);
> fs/f2fs/inode.c: mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
> fs/f2fs/inode.c: mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
> fs/gfs2/glock.c: mapping_set_gfp_mask(mapping, GFP_NOFS);
> fs/gfs2/ops_fstype.c: mapping_set_gfp_mask(mapping, GFP_NOFS);
> fs/jfs/jfs_imap.c: mapping_set_gfp_mask(ip->i_mapping, GFP_NOFS);
> fs/jfs/super.c: mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
> fs/nilfs2/gcinode.c: mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
> fs/nilfs2/page.c: mapping_set_gfp_mask(mapping, GFP_NOFS);
> fs/reiserfs/xattr.c: mapping_set_gfp_mask(mapping, GFP_NOFS);
> fs/xfs/xfs_iops.c: mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));

--
Michal Hocko
SUSE Labs

2018-04-09 14:58:01

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Mon, Apr 09, 2018 at 08:25:06PM +0800, Chao Yu wrote:
> On 2018/4/9 19:25, Minchan Kim wrote:
> > On Mon, Apr 09, 2018 at 04:14:03AM -0700, Matthew Wilcox wrote:
> >> On Mon, Apr 09, 2018 at 12:09:30PM +0900, Minchan Kim wrote:
> >>> On Sun, Apr 08, 2018 at 07:49:25PM -0700, Matthew Wilcox wrote:
> >>>> On Mon, Apr 09, 2018 at 10:58:15AM +0900, Minchan Kim wrote:
> >>>>> It assumes shadow entry of radix tree relies on the init state
> >>>>> that node->private_list allocated should be list_empty state.
> >>>>> Currently, it's initailized in SLAB constructor which means
> >>>>> node of radix tree would be initialized only when *slub allocates
> >>>>> new page*, not *new object*. So, if some FS or subsystem pass
> >>>>> gfp_mask to __GFP_ZERO, slub allocator will do memset blindly.
> >>>>
> >>>> Wait, what? Who's declaring their radix tree with GFP_ZERO flags?
> >>>> I don't see anyone using INIT_RADIX_TREE or RADIX_TREE or RADIX_TREE_INIT
> >>>> with GFP_ZERO.
> >>>
> >>> Look at fs/f2fs/inode.c
> >>> mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
> >>>
> >>> __add_to_page_cache_locked
> >>> radix_tree_maybe_preload
> >>>
> >>> add_to_page_cache_lru
> >>>
> >>> What's the wrong with setting __GFP_ZERO with mapping->gfp_mask?
> >>
> >> Because it's a stupid thing to do. Pages are allocated and then filled
> >> from disk. Zeroing them before DMAing to them is just a waste of time.
> >
> > Every FSes do address_space to read pages from storage? I'm not sure.
>
> No, sometimes, we need to write meta data to new allocated block address,
> then we will allocate a zeroed page in inner inode's address space, and
> fill partial data in it, and leave other place with zero value which means
> some fields are initial status.

Thanks for the explaining.

>
> There are two inner inodes (meta inode and node inode) setting __GFP_ZERO,
> I have just checked them, for both of them, we can avoid using __GFP_ZERO,
> and do initialization by ourselves to avoid unneeded/redundant zeroing
> from mm.

Yub, it would be desirable for f2fs. Please go ahead for f2fs side.
However, I think current problem is orthgonal. Now, the problem is
radix_tree_node allocation is bind to page cache allocation.
Why does FS cannot allocate page cache with __GFP_ZERO?
I agree if the concern is only performance matter as Matthew mentioned.
But it is beyond that because it shouldn't do due to limitation
of workingset shadow entry implementation. I think such coupling is
not a good idea.

I think right approach to abstract shadow entry in radix_tree is
to mask off __GFP_ZERO in radix_tree's allocation APIs.

>
> To Jaegeuk, if I missed something, please let me know.
>
> ---
> fs/f2fs/inode.c | 4 ++--
> fs/f2fs/node.c | 2 ++
> 2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
> index c85cccc2e800..cc63f8c448f0 100644
> --- a/fs/f2fs/inode.c
> +++ b/fs/f2fs/inode.c
> @@ -339,10 +339,10 @@ struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
> make_now:
> if (ino == F2FS_NODE_INO(sbi)) {
> inode->i_mapping->a_ops = &f2fs_node_aops;
> - mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
> + mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
> } else if (ino == F2FS_META_INO(sbi)) {
> inode->i_mapping->a_ops = &f2fs_meta_aops;
> - mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
> + mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
> } else if (S_ISREG(inode->i_mode)) {
> inode->i_op = &f2fs_file_inode_operations;
> inode->i_fop = &f2fs_file_operations;
> diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
> index 9dedd4b5e077..31e5ecf98ffd 100644
> --- a/fs/f2fs/node.c
> +++ b/fs/f2fs/node.c
> @@ -1078,6 +1078,7 @@ struct page *new_node_page(struct dnode_of_data *dn, unsigned int ofs)
> set_node_addr(sbi, &new_ni, NEW_ADDR, false);
>
> f2fs_wait_on_page_writeback(page, NODE, true);
> + memset(F2FS_NODE(page), 0, PAGE_SIZE);
> fill_node_footer(page, dn->nid, dn->inode->i_ino, ofs, true);
> set_cold_node(page, S_ISDIR(dn->inode->i_mode));
> if (!PageUptodate(page))
> @@ -2321,6 +2322,7 @@ int recover_inode_page(struct f2fs_sb_info *sbi, struct page *page)
>
> if (!PageUptodate(ipage))
> SetPageUptodate(ipage);
> + memset(F2FS_NODE(page), 0, PAGE_SIZE);
> fill_node_footer(ipage, ino, ino, 0, true);
> set_cold_node(page, false);
>
> --
>
> >
> > If you're right, we need to insert WARN_ON to catch up __GFP_ZERO
> > on mapping_set_gfp_mask at the beginning and remove all of those
> > stupid thins.
> >
> > Jaegeuk, why do you need __GFP_ZERO? Could you explain?
> >
> > .
> >
>

2018-04-09 15:24:42

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Mon, Apr 09, 2018 at 11:49:58PM +0900, Minchan Kim wrote:
> On Mon, Apr 09, 2018 at 08:25:06PM +0800, Chao Yu wrote:
> > On 2018/4/9 19:25, Minchan Kim wrote:
> > > On Mon, Apr 09, 2018 at 04:14:03AM -0700, Matthew Wilcox wrote:
> > >> On Mon, Apr 09, 2018 at 12:09:30PM +0900, Minchan Kim wrote:
> > >>> Look at fs/f2fs/inode.c
> > >>> mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
> > >>>
> > >>> __add_to_page_cache_locked
> > >>> radix_tree_maybe_preload
> > >>>
> > >>> add_to_page_cache_lru
> >
> > No, sometimes, we need to write meta data to new allocated block address,
> > then we will allocate a zeroed page in inner inode's address space, and
> > fill partial data in it, and leave other place with zero value which means
> > some fields are initial status.
>
> Thanks for the explaining.
>
> > There are two inner inodes (meta inode and node inode) setting __GFP_ZERO,
> > I have just checked them, for both of them, we can avoid using __GFP_ZERO,
> > and do initialization by ourselves to avoid unneeded/redundant zeroing
> > from mm.
>
> Yub, it would be desirable for f2fs. Please go ahead for f2fs side.
> However, I think current problem is orthgonal. Now, the problem is
> radix_tree_node allocation is bind to page cache allocation.
> Why does FS cannot allocate page cache with __GFP_ZERO?
> I agree if the concern is only performance matter as Matthew mentioned.
> But it is beyond that because it shouldn't do due to limitation
> of workingset shadow entry implementation. I think such coupling is
> not a good idea.
>
> I think right approach to abstract shadow entry in radix_tree is
> to mask off __GFP_ZERO in radix_tree's allocation APIs.

I don't think this is something the radix tree should know about.
SLAB should be checking for it (the patch I posted earlier in this
thread), but the right place to filter this out is in the caller of
radix_tree_maybe_preload -- it's already filtering out HIGHMEM pages,
and should filter out GFP_ZERO too.

diff --git a/mm/filemap.c b/mm/filemap.c
index c2147682f4c3..a87a523eea8e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -785,7 +785,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
VM_BUG_ON_PAGE(!PageLocked(new), new);
VM_BUG_ON_PAGE(new->mapping, new);

- error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+ error = radix_tree_preload(gfp_mask & ~(__GFP_HIGHMEM | __GFP_ZERO));
if (!error) {
struct address_space *mapping = old->mapping;
void (*freepage)(struct page *);
@@ -841,7 +841,8 @@ static int __add_to_page_cache_locked(struct page *page,
return error;
}

- error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
+ error = radix_tree_maybe_preload(gfp_mask &
+ ~(__GFP_HIGHMEM | __GFP_ZERO));
if (error) {
if (!huge)
mem_cgroup_cancel_charge(page, memcg, false);

2018-04-09 15:42:25

by David Sterba

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Mon, Apr 09, 2018 at 03:52:15PM +0200, Michal Hocko wrote:
> On Mon 09-04-18 06:41:14, Matthew Wilcox wrote:
> > On Mon, Apr 09, 2018 at 02:48:52PM +0200, Michal Hocko wrote:
> > > On Mon 09-04-18 20:25:06, Chao Yu wrote:
> > > [...]
> > > > diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
> > > > index c85cccc2e800..cc63f8c448f0 100644
> > > > --- a/fs/f2fs/inode.c
> > > > +++ b/fs/f2fs/inode.c
> > > > @@ -339,10 +339,10 @@ struct inode *f2fs_iget(struct super_block *sb, unsigned long ino)
> > > > make_now:
> > > > if (ino == F2FS_NODE_INO(sbi)) {
> > > > inode->i_mapping->a_ops = &f2fs_node_aops;
> > > > - mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
> > > > + mapping_set_gfp_mask(inode->i_mapping, GFP_NOFS);
> > >
> > > An unrelated question. Why do you make all allocations for the mapping
> > > NOFS automatically? What kind of reclaim recursion problems are you
> > > trying to prevent?
> >
> > It's worth noting that this is endemic in filesystems.
>
> Yes, and I have strong suspicion that this is a mindless copy&pasting...
> Well, xfs had a good reason for it in the past - mostly to handle deep
> call stacks on complicated storage setups in the past when we used to
> trigger IO from the direct reclaim. I am not sure whether there are
> other reasons to keep the status quo except for finding somebody brave
> enough to post the patch, do all the due testing.
>
> > $ git grep mapping_set_gfp_mask.*FS
> > drivers/block/loop.c: mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
> > fs/btrfs/disk-io.c: mapping_set_gfp_mask(fs_info->btree_inode->i_mapping, GFP_NOFS);

Thi was added in 1561deda687eef0
(https://git.kernel.org/linus/1561deda687eef0e9506) and probably after a
deadlock report.

The changelog mentions the potential recursion from fs -> allocation -> fs,
but I'm not sure if this still happens on the MM side today.

For the filesystem part, I think the key functions of the callchain are
still there.

The code was been added in 2011 and the 2nd hunk of the patch added a
code that's not present today AFAICS, so this is worth revisiting.

I still don't understand how it's related to the GFP_HIGHUSER_MOVABLE,
this patch is from time where the metadata pages were possibly allocated
from HIGHMEM but this was removed later in a65917156e345946db
(https://git.kernel.org/linus/a65917156e345946db).

2018-04-09 18:41:47

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On 04/09, Minchan Kim wrote:
> On Mon, Apr 09, 2018 at 04:14:03AM -0700, Matthew Wilcox wrote:
> > On Mon, Apr 09, 2018 at 12:09:30PM +0900, Minchan Kim wrote:
> > > On Sun, Apr 08, 2018 at 07:49:25PM -0700, Matthew Wilcox wrote:
> > > > On Mon, Apr 09, 2018 at 10:58:15AM +0900, Minchan Kim wrote:
> > > > > It assumes shadow entry of radix tree relies on the init state
> > > > > that node->private_list allocated should be list_empty state.
> > > > > Currently, it's initailized in SLAB constructor which means
> > > > > node of radix tree would be initialized only when *slub allocates
> > > > > new page*, not *new object*. So, if some FS or subsystem pass
> > > > > gfp_mask to __GFP_ZERO, slub allocator will do memset blindly.
> > > >
> > > > Wait, what? Who's declaring their radix tree with GFP_ZERO flags?
> > > > I don't see anyone using INIT_RADIX_TREE or RADIX_TREE or RADIX_TREE_INIT
> > > > with GFP_ZERO.
> > >
> > > Look at fs/f2fs/inode.c
> > > mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
> > >
> > > __add_to_page_cache_locked
> > > radix_tree_maybe_preload
> > >
> > > add_to_page_cache_lru
> > >
> > > What's the wrong with setting __GFP_ZERO with mapping->gfp_mask?
> >
> > Because it's a stupid thing to do. Pages are allocated and then filled
> > from disk. Zeroing them before DMAing to them is just a waste of time.
>
> Every FSes do address_space to read pages from storage? I'm not sure.
>
> If you're right, we need to insert WARN_ON to catch up __GFP_ZERO
> on mapping_set_gfp_mask at the beginning and remove all of those
> stupid thins.
>
> Jaegeuk, why do you need __GFP_ZERO? Could you explain?

Comment says "__GFP_ZERO returns a zeroed page on success."

The f2fs maintains two inodes to manage some metadata in the page cache,
which requires zeroed data when introducing a new structure. It's not
a big deal to avoid __GFP_ZERO for whatever performance reasons tho, does
it only matters with f2fs?

Thanks,

2018-04-09 19:45:01

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Mon, Apr 09, 2018 at 11:38:27AM -0700, Jaegeuk Kim wrote:
> On 04/09, Minchan Kim wrote:
> > On Mon, Apr 09, 2018 at 04:14:03AM -0700, Matthew Wilcox wrote:
> > > On Mon, Apr 09, 2018 at 12:09:30PM +0900, Minchan Kim wrote:
> > > > On Sun, Apr 08, 2018 at 07:49:25PM -0700, Matthew Wilcox wrote:
> > > > > On Mon, Apr 09, 2018 at 10:58:15AM +0900, Minchan Kim wrote:
> > > > > > It assumes shadow entry of radix tree relies on the init state
> > > > > > that node->private_list allocated should be list_empty state.
> > > > > > Currently, it's initailized in SLAB constructor which means
> > > > > > node of radix tree would be initialized only when *slub allocates
> > > > > > new page*, not *new object*. So, if some FS or subsystem pass
> > > > > > gfp_mask to __GFP_ZERO, slub allocator will do memset blindly.
> > > > >
> > > > > Wait, what? Who's declaring their radix tree with GFP_ZERO flags?
> > > > > I don't see anyone using INIT_RADIX_TREE or RADIX_TREE or RADIX_TREE_INIT
> > > > > with GFP_ZERO.
> > > >
> > > > Look at fs/f2fs/inode.c
> > > > mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
> > > >
> > > > __add_to_page_cache_locked
> > > > radix_tree_maybe_preload
> > > >
> > > > add_to_page_cache_lru
> > > >
> > > > What's the wrong with setting __GFP_ZERO with mapping->gfp_mask?
> > >
> > > Because it's a stupid thing to do. Pages are allocated and then filled
> > > from disk. Zeroing them before DMAing to them is just a waste of time.
> >
> > Every FSes do address_space to read pages from storage? I'm not sure.
> >
> > If you're right, we need to insert WARN_ON to catch up __GFP_ZERO
> > on mapping_set_gfp_mask at the beginning and remove all of those
> > stupid thins.
> >
> > Jaegeuk, why do you need __GFP_ZERO? Could you explain?
>
> Comment says "__GFP_ZERO returns a zeroed page on success."
>
> The f2fs maintains two inodes to manage some metadata in the page cache,
> which requires zeroed data when introducing a new structure. It's not
> a big deal to avoid __GFP_ZERO for whatever performance reasons tho, does
> it only matters with f2fs?

This isn't a performance issue.

The problem is that the mapping gfp flags are used not only for allocating
pages, but also for allocating the page cache data structures that hold
the pages. F2FS is the only filesystem that set the __GFP_ZERO bit,
so it's the first time anyone's noticed that the page cache passes the
__GFP_ZERO bit through to the radix tree allocation routines, which
causes the radix tree nodes to be zeroed instead of constructed.

I think the right solution to this is:

diff --git a/mm/filemap.c b/mm/filemap.c
index c2147682f4c3..a87a523eea8e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -785,7 +785,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
VM_BUG_ON_PAGE(!PageLocked(new), new);
VM_BUG_ON_PAGE(new->mapping, new);

- error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+ error = radix_tree_preload(gfp_mask & ~(__GFP_HIGHMEM | __GFP_ZERO));
if (!error) {
struct address_space *mapping = old->mapping;
void (*freepage)(struct page *);
@@ -841,7 +841,8 @@ static int __add_to_page_cache_locked(struct page *page,
return error;
}

- error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
+ error = radix_tree_maybe_preload(gfp_mask &
+ ~(__GFP_HIGHMEM | __GFP_ZERO));
if (error) {
if (!huge)
mem_cgroup_cancel_charge(page, memcg, false);

2018-04-09 23:11:27

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Mon, Apr 09, 2018 at 08:20:32AM -0700, Matthew Wilcox wrote:
> On Mon, Apr 09, 2018 at 11:49:58PM +0900, Minchan Kim wrote:
> > On Mon, Apr 09, 2018 at 08:25:06PM +0800, Chao Yu wrote:
> > > On 2018/4/9 19:25, Minchan Kim wrote:
> > > > On Mon, Apr 09, 2018 at 04:14:03AM -0700, Matthew Wilcox wrote:
> > > >> On Mon, Apr 09, 2018 at 12:09:30PM +0900, Minchan Kim wrote:
> > > >>> Look at fs/f2fs/inode.c
> > > >>> mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
> > > >>>
> > > >>> __add_to_page_cache_locked
> > > >>> radix_tree_maybe_preload
> > > >>>
> > > >>> add_to_page_cache_lru
> > >
> > > No, sometimes, we need to write meta data to new allocated block address,
> > > then we will allocate a zeroed page in inner inode's address space, and
> > > fill partial data in it, and leave other place with zero value which means
> > > some fields are initial status.
> >
> > Thanks for the explaining.
> >
> > > There are two inner inodes (meta inode and node inode) setting __GFP_ZERO,
> > > I have just checked them, for both of them, we can avoid using __GFP_ZERO,
> > > and do initialization by ourselves to avoid unneeded/redundant zeroing
> > > from mm.
> >
> > Yub, it would be desirable for f2fs. Please go ahead for f2fs side.
> > However, I think current problem is orthgonal. Now, the problem is
> > radix_tree_node allocation is bind to page cache allocation.
> > Why does FS cannot allocate page cache with __GFP_ZERO?
> > I agree if the concern is only performance matter as Matthew mentioned.
> > But it is beyond that because it shouldn't do due to limitation
> > of workingset shadow entry implementation. I think such coupling is
> > not a good idea.
> >
> > I think right approach to abstract shadow entry in radix_tree is
> > to mask off __GFP_ZERO in radix_tree's allocation APIs.
>
> I don't think this is something the radix tree should know about.

Because shadow entry implementation is hidden by radix tree implemetation.
IOW, radix tree user cannot know how it works.

> SLAB should be checking for it (the patch I posted earlier in this

I don't think it's right approach. SLAB constructor can initialize
some metadata for slab page populated as well as page zeroing.
However, __GFP_ZERO means only clearing pages, not metadata.
So it's different semantic. No need to mix out.

> thread), but the right place to filter this out is in the caller of
> radix_tree_maybe_preload -- it's already filtering out HIGHMEM pages,
> and should filter out GFP_ZERO too.

radix_tree_[maybe]_preload is exported API, which are error-prone
for out of modules or upcoming customers.

More proper place is __radix_tree_preload.

>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c2147682f4c3..a87a523eea8e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -785,7 +785,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> VM_BUG_ON_PAGE(!PageLocked(new), new);
> VM_BUG_ON_PAGE(new->mapping, new);
>
> - error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> + error = radix_tree_preload(gfp_mask & ~(__GFP_HIGHMEM | __GFP_ZERO));
> if (!error) {
> struct address_space *mapping = old->mapping;
> void (*freepage)(struct page *);
> @@ -841,7 +841,8 @@ static int __add_to_page_cache_locked(struct page *page,
> return error;
> }
>
> - error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
> + error = radix_tree_maybe_preload(gfp_mask &
> + ~(__GFP_HIGHMEM | __GFP_ZERO));
> if (error) {
> if (!huge)
> mem_cgroup_cancel_charge(page, memcg, false);

2018-04-10 01:19:43

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Tue, Apr 10, 2018 at 08:04:09AM +0900, Minchan Kim wrote:
> On Mon, Apr 09, 2018 at 08:20:32AM -0700, Matthew Wilcox wrote:
> > I don't think this is something the radix tree should know about.
>
> Because shadow entry implementation is hidden by radix tree implemetation.
> IOW, radix tree user cannot know how it works.

I have no idea what you mean.

> > SLAB should be checking for it (the patch I posted earlier in this
>
> I don't think it's right approach. SLAB constructor can initialize
> some metadata for slab page populated as well as page zeroing.
> However, __GFP_ZERO means only clearing pages, not metadata.
> So it's different semantic. No need to mix out.

No, __GFP_ZERO is specified to clear the allocated memory whether
you're allocating from alloc_pages or from slab. What makes no sense
is allocating an object from slab with a constructor *and* __GFP_ZERO.
They're in conflict, and slab can't fulfill both of those requirements.

> > thread), but the right place to filter this out is in the caller of
> > radix_tree_maybe_preload -- it's already filtering out HIGHMEM pages,
> > and should filter out GFP_ZERO too.
>
> radix_tree_[maybe]_preload is exported API, which are error-prone
> for out of modules or upcoming customers.
>
> More proper place is __radix_tree_preload.

I could not disagree with you more. It is the responsibility of the
callers of radix_tree_preload to avoid calling it with nonsense flags
like __GFP_DMA, __GFP_HIGHMEM or __GFP_ZERO.

2018-04-10 02:37:38

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Mon, Apr 09, 2018 at 06:12:11PM -0700, Matthew Wilcox wrote:
> On Tue, Apr 10, 2018 at 08:04:09AM +0900, Minchan Kim wrote:
> > On Mon, Apr 09, 2018 at 08:20:32AM -0700, Matthew Wilcox wrote:
> > > I don't think this is something the radix tree should know about.
> >
> > Because shadow entry implementation is hidden by radix tree implemetation.
> > IOW, radix tree user cannot know how it works.
>
> I have no idea what you mean.
>
> > > SLAB should be checking for it (the patch I posted earlier in this
> >
> > I don't think it's right approach. SLAB constructor can initialize
> > some metadata for slab page populated as well as page zeroing.
> > However, __GFP_ZERO means only clearing pages, not metadata.
> > So it's different semantic. No need to mix out.
>
> No, __GFP_ZERO is specified to clear the allocated memory whether
> you're allocating from alloc_pages or from slab. What makes no sense
> is allocating an object from slab with a constructor *and* __GFP_ZERO.
> They're in conflict, and slab can't fulfill both of those requirements.

It's a stable material. If you really think it does make sense,
please submit patch separately.

>
> > > thread), but the right place to filter this out is in the caller of
> > > radix_tree_maybe_preload -- it's already filtering out HIGHMEM pages,
> > > and should filter out GFP_ZERO too.
> >
> > radix_tree_[maybe]_preload is exported API, which are error-prone
> > for out of modules or upcoming customers.
> >
> > More proper place is __radix_tree_preload.
>
> I could not disagree with you more. It is the responsibility of the
> callers of radix_tree_preload to avoid calling it with nonsense flags
> like __GFP_DMA, __GFP_HIGHMEM or __GFP_ZERO.

How about this?

It would fix current problem and warn potential bugs as well.
radix_tree_preload already has done such warning and
radix_tree_maybe_preload has skipping for misbehaivor gfp.

From 27ecf7a009d3570d1155c528c7f08040ede68ed3 Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Tue, 10 Apr 2018 11:20:11 +0900
Subject: [PATCH v2] mm: workingset: fix NULL ptr dereference

It assumes shadow entries of radix tree rely on the init state
that node->private_list allocated newly is list_empty state
for the working. Currently, it's initailized in SLAB constructor
which means node of radix tree would be initialized only when
*slub allocates new page*, not *slub alloctes new object*.

If some FS or subsystem pass gfp_mask to __GFP_ZERO, that means
newly allocated node can have !list_empty(node->private_list)
by memset of slab allocator. It ends up calling NULL deference
at workingset_update_node by failing list_empty check.

This patch fixes it.

Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
Cc: Johannes Weiner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: [email protected]
Reported-by: Chris Fries <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
lib/radix-tree.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index da9e10c827df..9d68f2a7888e 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -511,6 +511,16 @@ int radix_tree_preload(gfp_t gfp_mask)
{
/* Warn on non-sensical use... */
WARN_ON_ONCE(!gfpflags_allow_blocking(gfp_mask));
+ /*
+ * New allocate node must have node->private_list as INIT_LIST_HEAD
+ * state by workingset shadow memory implementation.
+ * If user pass __GFP_ZERO by mistake, slab allocator will clear
+ * node->private_list, which makes a BUG. Rather than going Oops,
+ * just fix and warn about it.
+ */
+ if (WARN_ON(gfp_mask & __GFP_ZERO))
+ gfp_mask &= ~GFP_ZERO
+
return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
}
EXPORT_SYMBOL(radix_tree_preload);
@@ -522,7 +532,7 @@ EXPORT_SYMBOL(radix_tree_preload);
*/
int radix_tree_maybe_preload(gfp_t gfp_mask)
{
- if (gfpflags_allow_blocking(gfp_mask))
+ if (gfpflags_allow_blocking(gfp_mask) && !(gfp_mask & __GFP_ZERO))
return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
/* Preloading doesn't help anything with this gfp mask, skip it */
preempt_disable();
--
2.17.0.484.g0c8726318c-goog



2018-04-10 02:43:27

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Tue, Apr 10, 2018 at 11:33:39AM +0900, Minchan Kim wrote:
> On Mon, Apr 09, 2018 at 06:12:11PM -0700, Matthew Wilcox wrote:
> > On Tue, Apr 10, 2018 at 08:04:09AM +0900, Minchan Kim wrote:
> > > On Mon, Apr 09, 2018 at 08:20:32AM -0700, Matthew Wilcox wrote:
> > > > I don't think this is something the radix tree should know about.
> > >
> > > Because shadow entry implementation is hidden by radix tree implemetation.
> > > IOW, radix tree user cannot know how it works.
> >
> > I have no idea what you mean.
> >
> > > > SLAB should be checking for it (the patch I posted earlier in this
> > >
> > > I don't think it's right approach. SLAB constructor can initialize
> > > some metadata for slab page populated as well as page zeroing.
> > > However, __GFP_ZERO means only clearing pages, not metadata.
> > > So it's different semantic. No need to mix out.
> >
> > No, __GFP_ZERO is specified to clear the allocated memory whether
> > you're allocating from alloc_pages or from slab. What makes no sense
> > is allocating an object from slab with a constructor *and* __GFP_ZERO.
> > They're in conflict, and slab can't fulfill both of those requirements.
>
> It's a stable material. If you really think it does make sense,
> please submit patch separately.
>
> >
> > > > thread), but the right place to filter this out is in the caller of
> > > > radix_tree_maybe_preload -- it's already filtering out HIGHMEM pages,
> > > > and should filter out GFP_ZERO too.
> > >
> > > radix_tree_[maybe]_preload is exported API, which are error-prone
> > > for out of modules or upcoming customers.
> > >
> > > More proper place is __radix_tree_preload.
> >
> > I could not disagree with you more. It is the responsibility of the
> > callers of radix_tree_preload to avoid calling it with nonsense flags
> > like __GFP_DMA, __GFP_HIGHMEM or __GFP_ZERO.
>
> How about this?
>
> It would fix current problem and warn potential bugs as well.
> radix_tree_preload already has done such warning and
> radix_tree_maybe_preload has skipping for misbehaivor gfp.
>
> From 27ecf7a009d3570d1155c528c7f08040ede68ed3 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <[email protected]>
> Date: Tue, 10 Apr 2018 11:20:11 +0900
> Subject: [PATCH v2] mm: workingset: fix NULL ptr dereference
>
> It assumes shadow entries of radix tree rely on the init state
> that node->private_list allocated newly is list_empty state
> for the working. Currently, it's initailized in SLAB constructor
> which means node of radix tree would be initialized only when
> *slub allocates new page*, not *slub alloctes new object*.
>
> If some FS or subsystem pass gfp_mask to __GFP_ZERO, that means
> newly allocated node can have !list_empty(node->private_list)
> by memset of slab allocator. It ends up calling NULL deference
> at workingset_update_node by failing list_empty check.
>
> This patch fixes it.
>
> Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
> Cc: Johannes Weiner <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Jaegeuk Kim <[email protected]>
> Cc: Chao Yu <[email protected]>
> Cc: Christopher Lameter <[email protected]>
> Cc: [email protected]
> Reported-by: Chris Fries <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> lib/radix-tree.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index da9e10c827df..9d68f2a7888e 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -511,6 +511,16 @@ int radix_tree_preload(gfp_t gfp_mask)
> {
> /* Warn on non-sensical use... */
> WARN_ON_ONCE(!gfpflags_allow_blocking(gfp_mask));
> + /*
> + * New allocate node must have node->private_list as INIT_LIST_HEAD
> + * state by workingset shadow memory implementation.
> + * If user pass __GFP_ZERO by mistake, slab allocator will clear
> + * node->private_list, which makes a BUG. Rather than going Oops,
> + * just fix and warn about it.
> + */
> + if (WARN_ON(gfp_mask & __GFP_ZERO))
> + gfp_mask &= ~GFP_ZERO

Build fail.

If others are okay for this patch, I will resend fixed patch with stable mark.
I will wait feedback from others.

Thanks.

2018-04-10 02:45:58

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Tue, Apr 10, 2018 at 11:33:39AM +0900, Minchan Kim wrote:
> @@ -522,7 +532,7 @@ EXPORT_SYMBOL(radix_tree_preload);
> */
> int radix_tree_maybe_preload(gfp_t gfp_mask)
> {
> - if (gfpflags_allow_blocking(gfp_mask))
> + if (gfpflags_allow_blocking(gfp_mask) && !(gfp_mask & __GFP_ZERO))
> return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
> /* Preloading doesn't help anything with this gfp mask, skip it */
> preempt_disable();

No, you've completely misunderstood what's going on in this function.

2018-04-10 03:03:04

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Mon, Apr 09, 2018 at 07:41:52PM -0700, Matthew Wilcox wrote:
> On Tue, Apr 10, 2018 at 11:33:39AM +0900, Minchan Kim wrote:
> > @@ -522,7 +532,7 @@ EXPORT_SYMBOL(radix_tree_preload);
> > */
> > int radix_tree_maybe_preload(gfp_t gfp_mask)
> > {
> > - if (gfpflags_allow_blocking(gfp_mask))
> > + if (gfpflags_allow_blocking(gfp_mask) && !(gfp_mask & __GFP_ZERO))
> > return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
> > /* Preloading doesn't help anything with this gfp mask, skip it */
> > preempt_disable();
>
> No, you've completely misunderstood what's going on in this function.

Okay, I hope this version clear current concerns.

From fb37c41b90f7d3ead1798e5cb7baef76709afd94 Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Tue, 10 Apr 2018 11:54:57 +0900
Subject: [PATCH v3] mm: workingset: fix NULL ptr dereference

It assumes shadow entries of radix tree rely on the init state
that node->private_list allocated newly is list_empty state
for the working. Currently, it's initailized in SLAB constructor
which means node of radix tree would be initialized only when
*slub allocates new page*, not *slub alloctes new object*.

If some FS or subsystem pass gfp_mask to __GFP_ZERO, that means
newly allocated node can have !list_empty(node->private_list)
by memset of slab allocator. It ends up calling NULL deference
at workingset_update_node by failing list_empty check.

This patch fixes it.

Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
Cc: Johannes Weiner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: [email protected]
Cc: [email protected]
Reported-by: Chris Fries <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
lib/radix-tree.c | 9 +++++++++
mm/filemap.c | 5 +++--
2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index da9e10c827df..7569e637dbaa 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -470,6 +470,15 @@ static __must_check int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
struct radix_tree_node *node;
int ret = -ENOMEM;

+ /*
+ * New allocate node must have node->private_list as INIT_LIST_HEAD
+ * state by workingset shadow memory implementation.
+ * If user pass __GFP_ZERO by mistake, slab allocator will clear
+ * node->private_list, which makes a BUG. Rather than going Oops,
+ * just fix and warn about it.
+ */
+ if (WARN_ON(gfp_mask & __GFP_ZERO))
+ gfp_mask &= ~__GFP_ZERO;
/*
* Nodes preloaded by one cgroup can be be used by another cgroup, so
* they should never be accounted to any particular memory cgroup.
diff --git a/mm/filemap.c b/mm/filemap.c
index ab77e19ab09c..b6de9d691c8a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -786,7 +786,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
VM_BUG_ON_PAGE(!PageLocked(new), new);
VM_BUG_ON_PAGE(new->mapping, new);

- error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+ error = radix_tree_preload(gfp_mask & ~(__GFP_HIGHMEM | __GFP_ZERO));
if (!error) {
struct address_space *mapping = old->mapping;
void (*freepage)(struct page *);
@@ -842,7 +842,8 @@ static int __add_to_page_cache_locked(struct page *page,
return error;
}

- error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
+ error = radix_tree_maybe_preload(gfp_mask &
+ ~(__GFP_HIGHMEM | __GFP_ZERO));
if (error) {
if (!huge)
mem_cgroup_cancel_charge(page, memcg, false);
--
2.17.0.484.g0c8726318c-goog



2018-04-10 08:25:54

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Mon 09-04-18 10:58:15, Minchan Kim wrote:
> Recently, I got a report like below.
>
> [ 7858.792946] [<ffffff80086f4de0>] __list_del_entry+0x30/0xd0
> [ 7858.792951] [<ffffff8008362018>] list_lru_del+0xac/0x1ac
> [ 7858.792957] [<ffffff800830f04c>] page_cache_tree_insert+0xd8/0x110
> [ 7858.792962] [<ffffff8008310188>] __add_to_page_cache_locked+0xf8/0x4e0
> [ 7858.792967] [<ffffff800830ff34>] add_to_page_cache_lru+0x50/0x1ac
> [ 7858.792972] [<ffffff800830fdd0>] pagecache_get_page+0x468/0x57c
> [ 7858.792979] [<ffffff80085d081c>] __get_node_page+0x84/0x764
> [ 7858.792986] [<ffffff800859cd94>] f2fs_iget+0x264/0xdc8
> [ 7858.792991] [<ffffff800859ee00>] f2fs_lookup+0x3b4/0x660
> [ 7858.792998] [<ffffff80083d2540>] lookup_slow+0x1e4/0x348
> [ 7858.793003] [<ffffff80083d0eb8>] walk_component+0x21c/0x320
> [ 7858.793008] [<ffffff80083d0010>] path_lookupat+0x90/0x1bc
> [ 7858.793013] [<ffffff80083cfe6c>] filename_lookup+0x8c/0x1a0
> [ 7858.793018] [<ffffff80083c52d0>] vfs_fstatat+0x84/0x10c
> [ 7858.793023] [<ffffff80083c5b00>] SyS_newfstatat+0x28/0x64
>
> v4.9 kenrel already has the d3798ae8c6f3,("mm: filemap: don't
> plant shadow entries without radix tree node") so I thought
> it should be okay. When I was googling, I found others report
> such problem and I think current kernel still has the problem.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1431567
> https://bugzilla.redhat.com/show_bug.cgi?id=1420335
>
> It assumes shadow entry of radix tree relies on the init state
> that node->private_list allocated should be list_empty state.
> Currently, it's initailized in SLAB constructor which means
> node of radix tree would be initialized only when *slub allocates
> new page*, not *new object*. So, if some FS or subsystem pass
> gfp_mask to __GFP_ZERO, slub allocator will do memset blindly.
> That means allocated node can have !list_empty(node->private_list).
> It ends up calling NULL deference at workingset_update_node by
> failing list_empty check.
>
> This patch should fix it.
>
> Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
> Reported-by: Chris Fries <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Jan Kara <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>

Regardless of whether it makes sense to use __GFP_ZERO from the upper
layer or not, it is subtle as hell to rely on the pre-existing state
for a newly allocated object. So yes this makes perfect sense.

Do we want CC: stable?
Acked-by: Michal Hocko <[email protected]>

> ---
> If it is reviewed and proved with testing, I will resend the patch to
> Ccing [email protected].
>
> Thanks.
>
> lib/radix-tree.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index 8e00138d593f..afcbdb6c495f 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -428,6 +428,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent,
> ret->exceptional = exceptional;
> ret->parent = parent;
> ret->root = root;
> + INIT_LIST_HEAD(&ret->private_list);
> }
> return ret;
> }
> @@ -2234,7 +2235,6 @@ radix_tree_node_ctor(void *arg)
> struct radix_tree_node *node = arg;
>
> memset(node, 0, sizeof(*node));
> - INIT_LIST_HEAD(&node->private_list);
> }
>
> static __init unsigned long __maxindex(unsigned int height)
> --
> 2.17.0.484.g0c8726318c-goog

--
Michal Hocko
SUSE Labs

2018-04-10 08:31:06

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Mon 09-04-18 12:40:44, Matthew Wilcox wrote:
> On Mon, Apr 09, 2018 at 11:38:27AM -0700, Jaegeuk Kim wrote:
> > On 04/09, Minchan Kim wrote:
> > > On Mon, Apr 09, 2018 at 04:14:03AM -0700, Matthew Wilcox wrote:
> > > > On Mon, Apr 09, 2018 at 12:09:30PM +0900, Minchan Kim wrote:
> > > > > On Sun, Apr 08, 2018 at 07:49:25PM -0700, Matthew Wilcox wrote:
> > > > > > On Mon, Apr 09, 2018 at 10:58:15AM +0900, Minchan Kim wrote:
> > > > > > > It assumes shadow entry of radix tree relies on the init state
> > > > > > > that node->private_list allocated should be list_empty state.
> > > > > > > Currently, it's initailized in SLAB constructor which means
> > > > > > > node of radix tree would be initialized only when *slub allocates
> > > > > > > new page*, not *new object*. So, if some FS or subsystem pass
> > > > > > > gfp_mask to __GFP_ZERO, slub allocator will do memset blindly.
> > > > > >
> > > > > > Wait, what? Who's declaring their radix tree with GFP_ZERO flags?
> > > > > > I don't see anyone using INIT_RADIX_TREE or RADIX_TREE or RADIX_TREE_INIT
> > > > > > with GFP_ZERO.
> > > > >
> > > > > Look at fs/f2fs/inode.c
> > > > > mapping_set_gfp_mask(inode->i_mapping, GFP_F2FS_ZERO);
> > > > >
> > > > > __add_to_page_cache_locked
> > > > > radix_tree_maybe_preload
> > > > >
> > > > > add_to_page_cache_lru
> > > > >
> > > > > What's the wrong with setting __GFP_ZERO with mapping->gfp_mask?
> > > >
> > > > Because it's a stupid thing to do. Pages are allocated and then filled
> > > > from disk. Zeroing them before DMAing to them is just a waste of time.
> > >
> > > Every FSes do address_space to read pages from storage? I'm not sure.
> > >
> > > If you're right, we need to insert WARN_ON to catch up __GFP_ZERO
> > > on mapping_set_gfp_mask at the beginning and remove all of those
> > > stupid thins.
> > >
> > > Jaegeuk, why do you need __GFP_ZERO? Could you explain?
> >
> > Comment says "__GFP_ZERO returns a zeroed page on success."
> >
> > The f2fs maintains two inodes to manage some metadata in the page cache,
> > which requires zeroed data when introducing a new structure. It's not
> > a big deal to avoid __GFP_ZERO for whatever performance reasons tho, does
> > it only matters with f2fs?
>
> This isn't a performance issue.
>
> The problem is that the mapping gfp flags are used not only for allocating
> pages, but also for allocating the page cache data structures that hold
> the pages. F2FS is the only filesystem that set the __GFP_ZERO bit,
> so it's the first time anyone's noticed that the page cache passes the
> __GFP_ZERO bit through to the radix tree allocation routines, which
> causes the radix tree nodes to be zeroed instead of constructed.
>
> I think the right solution to this is:

This just hides the underlying problem that the node is not fully and
properly initialized. Relying on the previous released state is just too
subtle. Are you going to blacklist all potential gfp flags that come
from the mapping? This is just unmaintainable! If anything this should
be an explicit & with the allowed set of allowed flags.

>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c2147682f4c3..a87a523eea8e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -785,7 +785,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> VM_BUG_ON_PAGE(!PageLocked(new), new);
> VM_BUG_ON_PAGE(new->mapping, new);
>
> - error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> + error = radix_tree_preload(gfp_mask & ~(__GFP_HIGHMEM | __GFP_ZERO));
> if (!error) {
> struct address_space *mapping = old->mapping;
> void (*freepage)(struct page *);
> @@ -841,7 +841,8 @@ static int __add_to_page_cache_locked(struct page *page,
> return error;
> }
>
> - error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
> + error = radix_tree_maybe_preload(gfp_mask &
> + ~(__GFP_HIGHMEM | __GFP_ZERO));
> if (error) {
> if (!huge)
> mem_cgroup_cancel_charge(page, memcg, false);

--
Michal Hocko
SUSE Labs

2018-04-10 08:58:14

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Tue 10-04-18 11:59:03, Minchan Kim wrote:
> On Mon, Apr 09, 2018 at 07:41:52PM -0700, Matthew Wilcox wrote:
> > On Tue, Apr 10, 2018 at 11:33:39AM +0900, Minchan Kim wrote:
> > > @@ -522,7 +532,7 @@ EXPORT_SYMBOL(radix_tree_preload);
> > > */
> > > int radix_tree_maybe_preload(gfp_t gfp_mask)
> > > {
> > > - if (gfpflags_allow_blocking(gfp_mask))
> > > + if (gfpflags_allow_blocking(gfp_mask) && !(gfp_mask & __GFP_ZERO))
> > > return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
> > > /* Preloading doesn't help anything with this gfp mask, skip it */
> > > preempt_disable();
> >
> > No, you've completely misunderstood what's going on in this function.
>
> Okay, I hope this version clear current concerns.
>
> From fb37c41b90f7d3ead1798e5cb7baef76709afd94 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <[email protected]>
> Date: Tue, 10 Apr 2018 11:54:57 +0900
> Subject: [PATCH v3] mm: workingset: fix NULL ptr dereference
>
> It assumes shadow entries of radix tree rely on the init state
> that node->private_list allocated newly is list_empty state
> for the working. Currently, it's initailized in SLAB constructor
> which means node of radix tree would be initialized only when
> *slub allocates new page*, not *slub alloctes new object*.
>
> If some FS or subsystem pass gfp_mask to __GFP_ZERO, that means
> newly allocated node can have !list_empty(node->private_list)
> by memset of slab allocator. It ends up calling NULL deference
> at workingset_update_node by failing list_empty check.
>
> This patch fixes it.

The patch looks good. I'd just rephrase the changelog to be more
understandable. Something like:

GFP mask passed to page cache functions (often coming from
mapping->gfp_mask) is used both for allocation of page cache page and for
allocation of radix tree metadata necessary to add the page to the page
cache. When the mask contains __GFP_ZERO (as is the case for some f2fs
metadata mappings), this breaks radix tree code as that code expects
allocated radix tree nodes to be properly initialized by the slab
constructor and not zeroed. In particular node->private_list is failing
list_empty() check and the following list operation in
workingset_update_node() will dereference NULL.

Fix the problem by removing __GFP_ZERO from the mask for radix tree
allocations. Also warn if __GFP_ZERO gets passed to __radix_tree_preload()
to avoid silent breakage in the future for other radix tree users.

With that fixed you can add:

Reviewed-by: Jan Kara <[email protected]>

Honza
>
> Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
> Cc: Johannes Weiner <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Jaegeuk Kim <[email protected]>
> Cc: Chao Yu <[email protected]>
> Cc: Christopher Lameter <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Reported-by: Chris Fries <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> lib/radix-tree.c | 9 +++++++++
> mm/filemap.c | 5 +++--
> 2 files changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/lib/radix-tree.c b/lib/radix-tree.c
> index da9e10c827df..7569e637dbaa 100644
> --- a/lib/radix-tree.c
> +++ b/lib/radix-tree.c
> @@ -470,6 +470,15 @@ static __must_check int __radix_tree_preload(gfp_t gfp_mask, unsigned nr)
> struct radix_tree_node *node;
> int ret = -ENOMEM;
>
> + /*
> + * New allocate node must have node->private_list as INIT_LIST_HEAD
> + * state by workingset shadow memory implementation.
> + * If user pass __GFP_ZERO by mistake, slab allocator will clear
> + * node->private_list, which makes a BUG. Rather than going Oops,
> + * just fix and warn about it.
> + */
> + if (WARN_ON(gfp_mask & __GFP_ZERO))
> + gfp_mask &= ~__GFP_ZERO;
> /*
> * Nodes preloaded by one cgroup can be be used by another cgroup, so
> * they should never be accounted to any particular memory cgroup.
> diff --git a/mm/filemap.c b/mm/filemap.c
> index ab77e19ab09c..b6de9d691c8a 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -786,7 +786,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> VM_BUG_ON_PAGE(!PageLocked(new), new);
> VM_BUG_ON_PAGE(new->mapping, new);
>
> - error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> + error = radix_tree_preload(gfp_mask & ~(__GFP_HIGHMEM | __GFP_ZERO));
> if (!error) {
> struct address_space *mapping = old->mapping;
> void (*freepage)(struct page *);
> @@ -842,7 +842,8 @@ static int __add_to_page_cache_locked(struct page *page,
> return error;
> }
>
> - error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
> + error = radix_tree_maybe_preload(gfp_mask &
> + ~(__GFP_HIGHMEM | __GFP_ZERO));
> if (error) {
> if (!huge)
> mem_cgroup_cancel_charge(page, memcg, false);
> --
> 2.17.0.484.g0c8726318c-goog
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-04-10 08:59:09

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Tue 10-04-18 10:22:43, Michal Hocko wrote:
> On Mon 09-04-18 10:58:15, Minchan Kim wrote:
> > Recently, I got a report like below.
> >
> > [ 7858.792946] [<ffffff80086f4de0>] __list_del_entry+0x30/0xd0
> > [ 7858.792951] [<ffffff8008362018>] list_lru_del+0xac/0x1ac
> > [ 7858.792957] [<ffffff800830f04c>] page_cache_tree_insert+0xd8/0x110
> > [ 7858.792962] [<ffffff8008310188>] __add_to_page_cache_locked+0xf8/0x4e0
> > [ 7858.792967] [<ffffff800830ff34>] add_to_page_cache_lru+0x50/0x1ac
> > [ 7858.792972] [<ffffff800830fdd0>] pagecache_get_page+0x468/0x57c
> > [ 7858.792979] [<ffffff80085d081c>] __get_node_page+0x84/0x764
> > [ 7858.792986] [<ffffff800859cd94>] f2fs_iget+0x264/0xdc8
> > [ 7858.792991] [<ffffff800859ee00>] f2fs_lookup+0x3b4/0x660
> > [ 7858.792998] [<ffffff80083d2540>] lookup_slow+0x1e4/0x348
> > [ 7858.793003] [<ffffff80083d0eb8>] walk_component+0x21c/0x320
> > [ 7858.793008] [<ffffff80083d0010>] path_lookupat+0x90/0x1bc
> > [ 7858.793013] [<ffffff80083cfe6c>] filename_lookup+0x8c/0x1a0
> > [ 7858.793018] [<ffffff80083c52d0>] vfs_fstatat+0x84/0x10c
> > [ 7858.793023] [<ffffff80083c5b00>] SyS_newfstatat+0x28/0x64
> >
> > v4.9 kenrel already has the d3798ae8c6f3,("mm: filemap: don't
> > plant shadow entries without radix tree node") so I thought
> > it should be okay. When I was googling, I found others report
> > such problem and I think current kernel still has the problem.
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=1431567
> > https://bugzilla.redhat.com/show_bug.cgi?id=1420335
> >
> > It assumes shadow entry of radix tree relies on the init state
> > that node->private_list allocated should be list_empty state.
> > Currently, it's initailized in SLAB constructor which means
> > node of radix tree would be initialized only when *slub allocates
> > new page*, not *new object*. So, if some FS or subsystem pass
> > gfp_mask to __GFP_ZERO, slub allocator will do memset blindly.
> > That means allocated node can have !list_empty(node->private_list).
> > It ends up calling NULL deference at workingset_update_node by
> > failing list_empty check.
> >
> > This patch should fix it.
> >
> > Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
> > Reported-by: Chris Fries <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
> > Cc: Jan Kara <[email protected]>
> > Signed-off-by: Minchan Kim <[email protected]>
>
> Regardless of whether it makes sense to use __GFP_ZERO from the upper
> layer or not, it is subtle as hell to rely on the pre-existing state
> for a newly allocated object. So yes this makes perfect sense.
>
> Do we want CC: stable?
> Acked-by: Michal Hocko <[email protected]>

Well, for hot allocations we do rely on previous state a lot. After all
that's what slab constructor was created for. Whether radix tree node
allocation is such a hot path is a question for debate, I agree.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-04-10 09:37:35

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Tue 10-04-18 10:55:31, Jan Kara wrote:
> On Tue 10-04-18 10:22:43, Michal Hocko wrote:
> > On Mon 09-04-18 10:58:15, Minchan Kim wrote:
> > > Recently, I got a report like below.
> > >
> > > [ 7858.792946] [<ffffff80086f4de0>] __list_del_entry+0x30/0xd0
> > > [ 7858.792951] [<ffffff8008362018>] list_lru_del+0xac/0x1ac
> > > [ 7858.792957] [<ffffff800830f04c>] page_cache_tree_insert+0xd8/0x110
> > > [ 7858.792962] [<ffffff8008310188>] __add_to_page_cache_locked+0xf8/0x4e0
> > > [ 7858.792967] [<ffffff800830ff34>] add_to_page_cache_lru+0x50/0x1ac
> > > [ 7858.792972] [<ffffff800830fdd0>] pagecache_get_page+0x468/0x57c
> > > [ 7858.792979] [<ffffff80085d081c>] __get_node_page+0x84/0x764
> > > [ 7858.792986] [<ffffff800859cd94>] f2fs_iget+0x264/0xdc8
> > > [ 7858.792991] [<ffffff800859ee00>] f2fs_lookup+0x3b4/0x660
> > > [ 7858.792998] [<ffffff80083d2540>] lookup_slow+0x1e4/0x348
> > > [ 7858.793003] [<ffffff80083d0eb8>] walk_component+0x21c/0x320
> > > [ 7858.793008] [<ffffff80083d0010>] path_lookupat+0x90/0x1bc
> > > [ 7858.793013] [<ffffff80083cfe6c>] filename_lookup+0x8c/0x1a0
> > > [ 7858.793018] [<ffffff80083c52d0>] vfs_fstatat+0x84/0x10c
> > > [ 7858.793023] [<ffffff80083c5b00>] SyS_newfstatat+0x28/0x64
> > >
> > > v4.9 kenrel already has the d3798ae8c6f3,("mm: filemap: don't
> > > plant shadow entries without radix tree node") so I thought
> > > it should be okay. When I was googling, I found others report
> > > such problem and I think current kernel still has the problem.
> > >
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1431567
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1420335
> > >
> > > It assumes shadow entry of radix tree relies on the init state
> > > that node->private_list allocated should be list_empty state.
> > > Currently, it's initailized in SLAB constructor which means
> > > node of radix tree would be initialized only when *slub allocates
> > > new page*, not *new object*. So, if some FS or subsystem pass
> > > gfp_mask to __GFP_ZERO, slub allocator will do memset blindly.
> > > That means allocated node can have !list_empty(node->private_list).
> > > It ends up calling NULL deference at workingset_update_node by
> > > failing list_empty check.
> > >
> > > This patch should fix it.
> > >
> > > Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
> > > Reported-by: Chris Fries <[email protected]>
> > > Cc: Johannes Weiner <[email protected]>
> > > Cc: Jan Kara <[email protected]>
> > > Signed-off-by: Minchan Kim <[email protected]>
> >
> > Regardless of whether it makes sense to use __GFP_ZERO from the upper
> > layer or not, it is subtle as hell to rely on the pre-existing state
> > for a newly allocated object. So yes this makes perfect sense.
> >
> > Do we want CC: stable?
> > Acked-by: Michal Hocko <[email protected]>
>
> Well, for hot allocations we do rely on previous state a lot. After all
> that's what slab constructor was created for. Whether radix tree node
> allocation is such a hot path is a question for debate, I agree.

I really doubt that LIST_INIT is something to notice for the radix tree
allocation. So I would rather have safe code than rely on the previous
state which is really subtle.

Btw. I am not a huge fan of ctor semantic as we have it. I am not really
sure all users understand when it is called...
--
Michal Hocko
SUSE Labs

2018-04-10 10:32:38

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Tue 10-04-18 11:32:41, Michal Hocko wrote:
> On Tue 10-04-18 10:55:31, Jan Kara wrote:
> > On Tue 10-04-18 10:22:43, Michal Hocko wrote:
> > > On Mon 09-04-18 10:58:15, Minchan Kim wrote:
> > > > Recently, I got a report like below.
> > > >
> > > > [ 7858.792946] [<ffffff80086f4de0>] __list_del_entry+0x30/0xd0
> > > > [ 7858.792951] [<ffffff8008362018>] list_lru_del+0xac/0x1ac
> > > > [ 7858.792957] [<ffffff800830f04c>] page_cache_tree_insert+0xd8/0x110
> > > > [ 7858.792962] [<ffffff8008310188>] __add_to_page_cache_locked+0xf8/0x4e0
> > > > [ 7858.792967] [<ffffff800830ff34>] add_to_page_cache_lru+0x50/0x1ac
> > > > [ 7858.792972] [<ffffff800830fdd0>] pagecache_get_page+0x468/0x57c
> > > > [ 7858.792979] [<ffffff80085d081c>] __get_node_page+0x84/0x764
> > > > [ 7858.792986] [<ffffff800859cd94>] f2fs_iget+0x264/0xdc8
> > > > [ 7858.792991] [<ffffff800859ee00>] f2fs_lookup+0x3b4/0x660
> > > > [ 7858.792998] [<ffffff80083d2540>] lookup_slow+0x1e4/0x348
> > > > [ 7858.793003] [<ffffff80083d0eb8>] walk_component+0x21c/0x320
> > > > [ 7858.793008] [<ffffff80083d0010>] path_lookupat+0x90/0x1bc
> > > > [ 7858.793013] [<ffffff80083cfe6c>] filename_lookup+0x8c/0x1a0
> > > > [ 7858.793018] [<ffffff80083c52d0>] vfs_fstatat+0x84/0x10c
> > > > [ 7858.793023] [<ffffff80083c5b00>] SyS_newfstatat+0x28/0x64
> > > >
> > > > v4.9 kenrel already has the d3798ae8c6f3,("mm: filemap: don't
> > > > plant shadow entries without radix tree node") so I thought
> > > > it should be okay. When I was googling, I found others report
> > > > such problem and I think current kernel still has the problem.
> > > >
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1431567
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1420335
> > > >
> > > > It assumes shadow entry of radix tree relies on the init state
> > > > that node->private_list allocated should be list_empty state.
> > > > Currently, it's initailized in SLAB constructor which means
> > > > node of radix tree would be initialized only when *slub allocates
> > > > new page*, not *new object*. So, if some FS or subsystem pass
> > > > gfp_mask to __GFP_ZERO, slub allocator will do memset blindly.
> > > > That means allocated node can have !list_empty(node->private_list).
> > > > It ends up calling NULL deference at workingset_update_node by
> > > > failing list_empty check.
> > > >
> > > > This patch should fix it.
> > > >
> > > > Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
> > > > Reported-by: Chris Fries <[email protected]>
> > > > Cc: Johannes Weiner <[email protected]>
> > > > Cc: Jan Kara <[email protected]>
> > > > Signed-off-by: Minchan Kim <[email protected]>
> > >
> > > Regardless of whether it makes sense to use __GFP_ZERO from the upper
> > > layer or not, it is subtle as hell to rely on the pre-existing state
> > > for a newly allocated object. So yes this makes perfect sense.
> > >
> > > Do we want CC: stable?
> > > Acked-by: Michal Hocko <[email protected]>
> >
> > Well, for hot allocations we do rely on previous state a lot. After all
> > that's what slab constructor was created for. Whether radix tree node
> > allocation is such a hot path is a question for debate, I agree.
>
> I really doubt that LIST_INIT is something to notice for the radix tree
> allocation.

I agree with that.

> So I would rather have safe code than rely on the previous state which is
> really subtle.

And I agree on subtlety part here as well. But even with LIST_INIT we'll be
relying on some fields being 0 / NULL so you cannot really say that with
LIST_INIT we won't be relying on previous state. And fully memsetting
radix_tree_node on allocation *would* IMO have effect on the performance.
So I'm not convinced LIST_INIT buys us much. It deals with __GFP_ZERO
problem but not much else.

> Btw. I am not a huge fan of ctor semantic as we have it. I am not really
> sure all users understand when it is called...

Yeah, ctor semantics is subtle and we had some hard to debug bugs in VFS
because someone didn't understand the semantics. But OTOH for large
structures and frequent allocations the gain is worth it.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2018-04-10 11:23:42

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Tue, Apr 10, 2018 at 12:28:45PM +0200, Jan Kara wrote:
> On Tue 10-04-18 11:32:41, Michal Hocko wrote:
> > On Tue 10-04-18 10:55:31, Jan Kara wrote:
> > > On Tue 10-04-18 10:22:43, Michal Hocko wrote:
> > > > On Mon 09-04-18 10:58:15, Minchan Kim wrote:
> > > > > Recently, I got a report like below.
> > > > >
> > > > > [ 7858.792946] [<ffffff80086f4de0>] __list_del_entry+0x30/0xd0
> > > > > [ 7858.792951] [<ffffff8008362018>] list_lru_del+0xac/0x1ac
> > > > > [ 7858.792957] [<ffffff800830f04c>] page_cache_tree_insert+0xd8/0x110
> > > > > [ 7858.792962] [<ffffff8008310188>] __add_to_page_cache_locked+0xf8/0x4e0
> > > > > [ 7858.792967] [<ffffff800830ff34>] add_to_page_cache_lru+0x50/0x1ac
> > > > > [ 7858.792972] [<ffffff800830fdd0>] pagecache_get_page+0x468/0x57c
> > > > > [ 7858.792979] [<ffffff80085d081c>] __get_node_page+0x84/0x764
> > > > > [ 7858.792986] [<ffffff800859cd94>] f2fs_iget+0x264/0xdc8
> > > > > [ 7858.792991] [<ffffff800859ee00>] f2fs_lookup+0x3b4/0x660
> > > > > [ 7858.792998] [<ffffff80083d2540>] lookup_slow+0x1e4/0x348
> > > > > [ 7858.793003] [<ffffff80083d0eb8>] walk_component+0x21c/0x320
> > > > > [ 7858.793008] [<ffffff80083d0010>] path_lookupat+0x90/0x1bc
> > > > > [ 7858.793013] [<ffffff80083cfe6c>] filename_lookup+0x8c/0x1a0
> > > > > [ 7858.793018] [<ffffff80083c52d0>] vfs_fstatat+0x84/0x10c
> > > > > [ 7858.793023] [<ffffff80083c5b00>] SyS_newfstatat+0x28/0x64
> > > > >
> > > > > v4.9 kenrel already has the d3798ae8c6f3,("mm: filemap: don't
> > > > > plant shadow entries without radix tree node") so I thought
> > > > > it should be okay. When I was googling, I found others report
> > > > > such problem and I think current kernel still has the problem.
> > > > >
> > > > > https://bugzilla.redhat.com/show_bug.cgi?id=1431567
> > > > > https://bugzilla.redhat.com/show_bug.cgi?id=1420335
> > > > >
> > > > > It assumes shadow entry of radix tree relies on the init state
> > > > > that node->private_list allocated should be list_empty state.
> > > > > Currently, it's initailized in SLAB constructor which means
> > > > > node of radix tree would be initialized only when *slub allocates
> > > > > new page*, not *new object*. So, if some FS or subsystem pass
> > > > > gfp_mask to __GFP_ZERO, slub allocator will do memset blindly.
> > > > > That means allocated node can have !list_empty(node->private_list).
> > > > > It ends up calling NULL deference at workingset_update_node by
> > > > > failing list_empty check.
> > > > >
> > > > > This patch should fix it.
> > > > >
> > > > > Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
> > > > > Reported-by: Chris Fries <[email protected]>
> > > > > Cc: Johannes Weiner <[email protected]>
> > > > > Cc: Jan Kara <[email protected]>
> > > > > Signed-off-by: Minchan Kim <[email protected]>
> > > >
> > > > Regardless of whether it makes sense to use __GFP_ZERO from the upper
> > > > layer or not, it is subtle as hell to rely on the pre-existing state
> > > > for a newly allocated object. So yes this makes perfect sense.
> > > >
> > > > Do we want CC: stable?
> > > > Acked-by: Michal Hocko <[email protected]>

Thanks, Michal.

> > >
> > > Well, for hot allocations we do rely on previous state a lot. After all
> > > that's what slab constructor was created for. Whether radix tree node
> > > allocation is such a hot path is a question for debate, I agree.
> >
> > I really doubt that LIST_INIT is something to notice for the radix tree
> > allocation.
>
> I agree with that.

I totally agree with Michal's opinion. I don't want to play with semantic
game here atlhough we can make the API work with simple one line without
any performance lose.

As I stated in description, there was other report hitting the bug
and I believe we didn't fixed it for a long time. Maybe, FS out of tree
and ouf of radix tree users could affect by this bug once they use
__GFP_ZERO intentionally or by chance. MM didn't give any guide to them.

I hope let's make it simple unless we lose big thing.

>
> > So I would rather have safe code than rely on the previous state which is
> > really subtle.
>
> And I agree on subtlety part here as well. But even with LIST_INIT we'll be
> relying on some fields being 0 / NULL so you cannot really say that with
> LIST_INIT we won't be relying on previous state. And fully memsetting
> radix_tree_node on allocation *would* IMO have effect on the performance.

It also does memset in radix_tree_node_rcu_free.
I think if it's really want to get benefit from slab constructor,
the object should have init state when the object is freeing time
so next allocation don't need to do anyting.
In this perspecitve, I think radix_tree_node's constructor is pointless.

> So I'm not convinced LIST_INIT buys us much. It deals with __GFP_ZERO
> problem but not much else.

Jan, so, what is your stance for this patch?

If you're okay for that, I really want to go my original patch
Michal already gave Acked-by.

Thanks.

2018-04-10 12:01:10

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2] mm: workingset: fix NULL ptr dereference

Hi Minchan,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.16 next-20180410]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Minchan-Kim/mm-workingset-fix-NULL-ptr-dereference/20180410-163500
config: x86_64-randconfig-x014-201814 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64

All error/warnings (new ones prefixed by >>):

lib/radix-tree.c: In function 'radix_tree_preload':
>> lib/radix-tree.c:522:16: error: 'GFP_ZERO' undeclared (first use in this function); did you mean '__GFP_ZERO'?
gfp_mask &= ~GFP_ZERO
^~~~~~~~
__GFP_ZERO
lib/radix-tree.c:522:16: note: each undeclared identifier is reported only once for each function it appears in
>> lib/radix-tree.c:524:2: error: expected ';' before 'return'
return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
^~~~~~
>> lib/radix-tree.c:525:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^

vim +522 lib/radix-tree.c

500
501 /*
502 * Load up this CPU's radix_tree_node buffer with sufficient objects to
503 * ensure that the addition of a single element in the tree cannot fail. On
504 * success, return zero, with preemption disabled. On error, return -ENOMEM
505 * with preemption not disabled.
506 *
507 * To make use of this facility, the radix tree must be initialised without
508 * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
509 */
510 int radix_tree_preload(gfp_t gfp_mask)
511 {
512 /* Warn on non-sensical use... */
513 WARN_ON_ONCE(!gfpflags_allow_blocking(gfp_mask));
514 /*
515 * New allocate node must have node->private_list as INIT_LIST_HEAD
516 * state by workingset shadow memory implementation.
517 * If user pass __GFP_ZERO by mistake, slab allocator will clear
518 * node->private_list, which makes a BUG. Rather than going Oops,
519 * just fix and warn about it.
520 */
521 if (WARN_ON(gfp_mask & __GFP_ZERO))
> 522 gfp_mask &= ~GFP_ZERO
523
> 524 return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
> 525 }
526 EXPORT_SYMBOL(radix_tree_preload);
527

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (2.60 kB)
.config.gz (22.79 kB)
Download all attachments

2018-04-10 12:01:15

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Tue, Apr 10, 2018 at 11:59:03AM +0900, Minchan Kim wrote:
> Okay, I hope this version clear current concerns.

It doesn't. The right place to warn about GFP_ZERO used with a
constructor is _slab_, like the patch I already sent. We have no idea
what other places might have the same bug, and slab is the only place
to catch that.


2018-04-10 12:11:28

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Tue, Apr 10, 2018 at 08:19:31PM +0900, Minchan Kim wrote:
> If you're okay for that, I really want to go my original patch
> Michal already gave Acked-by.

I NAK this patch. It is completely wrong.

2018-04-10 12:13:04

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Tue, Apr 10, 2018 at 10:26:43AM +0200, Michal Hocko wrote:
> On Mon 09-04-18 12:40:44, Matthew Wilcox wrote:
> > The problem is that the mapping gfp flags are used not only for allocating
> > pages, but also for allocating the page cache data structures that hold
> > the pages. F2FS is the only filesystem that set the __GFP_ZERO bit,
> > so it's the first time anyone's noticed that the page cache passes the
> > __GFP_ZERO bit through to the radix tree allocation routines, which
> > causes the radix tree nodes to be zeroed instead of constructed.
> >
> > I think the right solution to this is:
>
> This just hides the underlying problem that the node is not fully and
> properly initialized. Relying on the previous released state is just too
> subtle.

That's the fundamental design of slab-with-constructors. The user provides
a constructor, so all newly allocagted objects are initialised to a known
state, then the user will restore the object to that state when it frees
the object to slab.

> Are you going to blacklist all potential gfp flags that come
> from the mapping? This is just unmaintainable! If anything this should
> be an explicit & with the allowed set of allowed flags.

Oh, I agree that using the set of flags used to allocate the page
in order to allocate the radix tree nodes is a pretty horrible idea.

Your suggestion, then, is:

- error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+ error = radix_tree_preload(gfp_mask & GFP_RECLAIM_MASK);

correct?


2018-04-10 12:38:17

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Tue 10-04-18 05:05:28, Matthew Wilcox wrote:
> On Tue, Apr 10, 2018 at 10:26:43AM +0200, Michal Hocko wrote:
> > On Mon 09-04-18 12:40:44, Matthew Wilcox wrote:
> > > The problem is that the mapping gfp flags are used not only for allocating
> > > pages, but also for allocating the page cache data structures that hold
> > > the pages. F2FS is the only filesystem that set the __GFP_ZERO bit,
> > > so it's the first time anyone's noticed that the page cache passes the
> > > __GFP_ZERO bit through to the radix tree allocation routines, which
> > > causes the radix tree nodes to be zeroed instead of constructed.
> > >
> > > I think the right solution to this is:
> >
> > This just hides the underlying problem that the node is not fully and
> > properly initialized. Relying on the previous released state is just too
> > subtle.
>
> That's the fundamental design of slab-with-constructors. The user provides
> a constructor, so all newly allocagted objects are initialised to a known
> state, then the user will restore the object to that state when it frees
> the object to slab.

And that is fundamentally subtle semantic and leads to bugs. So we
should reconsider whether that is really worth keeping for the radix
tree.

> > Are you going to blacklist all potential gfp flags that come
> > from the mapping? This is just unmaintainable! If anything this should
> > be an explicit & with the allowed set of allowed flags.
>
> Oh, I agree that using the set of flags used to allocate the page
> in order to allocate the radix tree nodes is a pretty horrible idea.
>
> Your suggestion, then, is:
>
> - error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> + error = radix_tree_preload(gfp_mask & GFP_RECLAIM_MASK);
>
> correct?

Something like that, yes.

--
Michal Hocko
SUSE Labs

2018-04-10 12:42:42

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Tue 10-04-18 04:56:51, Matthew Wilcox wrote:
> On Tue, Apr 10, 2018 at 11:59:03AM +0900, Minchan Kim wrote:
> > Okay, I hope this version clear current concerns.
>
> It doesn't. The right place to warn about GFP_ZERO used with a
> constructor is _slab_, like the patch I already sent. We have no idea
> what other places might have the same bug, and slab is the only place
> to catch that.

I agree with that. Radix tree shouldn't be really that special. I would
rather get rid of the ctor subtle thingy but if we absolutely have to
keep it then the GFP_RECLAIM_MASK filtering and a warning in slab for
__GFP_ZERO looks like a reasonable step forward.
--
Michal Hocko
SUSE Labs

2018-04-10 12:44:42

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Tue, Apr 10, 2018 at 05:05:28AM -0700, Matthew Wilcox wrote:
> On Tue, Apr 10, 2018 at 10:26:43AM +0200, Michal Hocko wrote:
> > On Mon 09-04-18 12:40:44, Matthew Wilcox wrote:
> > > The problem is that the mapping gfp flags are used not only for allocating
> > > pages, but also for allocating the page cache data structures that hold
> > > the pages. F2FS is the only filesystem that set the __GFP_ZERO bit,
> > > so it's the first time anyone's noticed that the page cache passes the
> > > __GFP_ZERO bit through to the radix tree allocation routines, which
> > > causes the radix tree nodes to be zeroed instead of constructed.
> > >
> > > I think the right solution to this is:
> >
> > This just hides the underlying problem that the node is not fully and
> > properly initialized. Relying on the previous released state is just too
> > subtle.
>
> That's the fundamental design of slab-with-constructors. The user provides
> a constructor, so all newly allocagted objects are initialised to a known
> state, then the user will restore the object to that state when it frees
> the object to slab.

Fully agreed. I'm surprised there is so much pushback to what Willy is
saying here, this stuff has been established for years.

> > Are you going to blacklist all potential gfp flags that come
> > from the mapping? This is just unmaintainable! If anything this should
> > be an explicit & with the allowed set of allowed flags.
>
> Oh, I agree that using the set of flags used to allocate the page
> in order to allocate the radix tree nodes is a pretty horrible idea.
>
> Your suggestion, then, is:
>
> - error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> + error = radix_tree_preload(gfp_mask & GFP_RECLAIM_MASK);

That looks reasonable to me.

2018-04-10 12:49:25

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Mon, Apr 09, 2018 at 10:58:15AM +0900, Minchan Kim wrote:
> @@ -428,6 +428,7 @@ radix_tree_node_alloc(gfp_t gfp_mask, struct radix_tree_node *parent,
> ret->exceptional = exceptional;
> ret->parent = parent;
> ret->root = root;
> + INIT_LIST_HEAD(&ret->private_list);
> }
> return ret;
> }
> @@ -2234,7 +2235,6 @@ radix_tree_node_ctor(void *arg)
> struct radix_tree_node *node = arg;
>
> memset(node, 0, sizeof(*node));
> - INIT_LIST_HEAD(&node->private_list);
> }

I have to NAK this.

The slab constructor protocol requires objects to be in their initial
allocation state at the time of being freed. If this isn't the case
here, we need to fix whoever isn't doing this, not the alloc site.

2018-04-10 12:52:18

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Sun, Apr 08, 2018 at 07:49:25PM -0700, Matthew Wilcox wrote:
> @@ -2714,8 +2714,10 @@ static __always_inline void *slab_alloc_node(struct kmem_cache *s,
> stat(s, ALLOC_FASTPATH);
> }
>
> - if (unlikely(gfpflags & __GFP_ZERO) && object)
> - memset(object, 0, s->object_size);
> + if (unlikely(gfpflags & __GFP_ZERO) && object) {
> + if (!WARN_ON_ONCE(s->ctor))
> + memset(object, 0, s->object_size);
> + }
>
> slab_post_alloc_hook(s, gfpflags, 1, &object);

This looks like a useful check to have. But maybe behind DEBUG_VM?

2018-04-10 13:15:32

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2] mm: workingset: fix NULL ptr dereference

Hi Minchan,

I love your patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v4.16 next-20180410]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Minchan-Kim/mm-workingset-fix-NULL-ptr-dereference/20180410-163500
config: i386-randconfig-a0-201814 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
# save the attached .config to linux build tree
make ARCH=i386

All errors (new ones prefixed by >>):

lib/radix-tree.c: In function 'radix_tree_preload':
>> lib/radix-tree.c:522:16: error: 'GFP_ZERO' undeclared (first use in this function)
gfp_mask &= ~GFP_ZERO
^
lib/radix-tree.c:522:16: note: each undeclared identifier is reported only once for each function it appears in
lib/radix-tree.c:524:2: error: expected ';' before 'return'
return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
^
lib/radix-tree.c:525:1: warning: control reaches end of non-void function [-Wreturn-type]
}
^

vim +/GFP_ZERO +522 lib/radix-tree.c

500
501 /*
502 * Load up this CPU's radix_tree_node buffer with sufficient objects to
503 * ensure that the addition of a single element in the tree cannot fail. On
504 * success, return zero, with preemption disabled. On error, return -ENOMEM
505 * with preemption not disabled.
506 *
507 * To make use of this facility, the radix tree must be initialised without
508 * __GFP_DIRECT_RECLAIM being passed to INIT_RADIX_TREE().
509 */
510 int radix_tree_preload(gfp_t gfp_mask)
511 {
512 /* Warn on non-sensical use... */
513 WARN_ON_ONCE(!gfpflags_allow_blocking(gfp_mask));
514 /*
515 * New allocate node must have node->private_list as INIT_LIST_HEAD
516 * state by workingset shadow memory implementation.
517 * If user pass __GFP_ZERO by mistake, slab allocator will clear
518 * node->private_list, which makes a BUG. Rather than going Oops,
519 * just fix and warn about it.
520 */
521 if (WARN_ON(gfp_mask & __GFP_ZERO))
> 522 gfp_mask &= ~GFP_ZERO
523
524 return __radix_tree_preload(gfp_mask, RADIX_TREE_PRELOAD_SIZE);
525 }
526 EXPORT_SYMBOL(radix_tree_preload);
527

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation


Attachments:
(No filename) (2.53 kB)
.config.gz (30.54 kB)
Download all attachments

2018-04-10 13:33:02

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] mm: workingset: fix NULL ptr dereference

On Tue, Apr 10, 2018 at 05:05:28AM -0700, Matthew Wilcox wrote:
> On Tue, Apr 10, 2018 at 10:26:43AM +0200, Michal Hocko wrote:
> > On Mon 09-04-18 12:40:44, Matthew Wilcox wrote:
> > > The problem is that the mapping gfp flags are used not only for allocating
> > > pages, but also for allocating the page cache data structures that hold
> > > the pages. F2FS is the only filesystem that set the __GFP_ZERO bit,
> > > so it's the first time anyone's noticed that the page cache passes the
> > > __GFP_ZERO bit through to the radix tree allocation routines, which
> > > causes the radix tree nodes to be zeroed instead of constructed.
> > >
> > > I think the right solution to this is:
> >
> > This just hides the underlying problem that the node is not fully and
> > properly initialized. Relying on the previous released state is just too
> > subtle.
>
> That's the fundamental design of slab-with-constructors. The user provides
> a constructor, so all newly allocagted objects are initialised to a known
> state, then the user will restore the object to that state when it frees
> the object to slab.
>
> > Are you going to blacklist all potential gfp flags that come
> > from the mapping? This is just unmaintainable! If anything this should
> > be an explicit & with the allowed set of allowed flags.
>
> Oh, I agree that using the set of flags used to allocate the page
> in order to allocate the radix tree nodes is a pretty horrible idea.
>
> Your suggestion, then, is:
>
> - error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> + error = radix_tree_preload(gfp_mask & GFP_RECLAIM_MASK);
>
> correct?
>

Looks much better.

Finally, it seems everyone agree on this. However, I won't include
warning part of slab allocator because I think it's improve stuff
not bug fix so it could be separted.
If anyone really want to include it in this stable patch,
please discuss with slub maintainers before.

Thanks for the reivew, Matthew, Michal, Jan and Johannes.

From 652bb75124896fa040df78b98496a354f54fc524 Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Tue, 10 Apr 2018 22:13:50 +0900
Subject: [PATCH v4] mm: workingset: fix NULL ptr dereference

GFP mask passed to page cache functions (often coming from
mapping->gfp_mask) is used both for allocation of page cache page and for
allocation of radix tree metadata necessary to add the page to the page
cache. When the mask contains __GFP_ZERO (as is the case for some f2fs
metadata mappings), this breaks radix tree code as that code expects
allocated radix tree nodes to be properly initialized by the slab
constructor and not zeroed. In particular node->private_list is failing
list_empty() check and the following list operation in
workingset_update_node() will dereference NULL.

Fix the problem by removing non-reclimable flags by GFP_RECLAIM_MASK
for radix tree allocations.

Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
Cc: Johannes Weiner <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jaegeuk Kim <[email protected]>
Cc: Chao Yu <[email protected]>
Cc: Christopher Lameter <[email protected]>
Cc: [email protected]
Cc: [email protected]
Suggested-by: Matthew Wilcox <[email protected]>
Reported-by: Chris Fries <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/filemap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index ab77e19ab09c..5f3311edfea4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -786,7 +786,7 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
VM_BUG_ON_PAGE(!PageLocked(new), new);
VM_BUG_ON_PAGE(new->mapping, new);

- error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+ error = radix_tree_preload(gfp_mask & GFP_RECLAIM_MASK);
if (!error) {
struct address_space *mapping = old->mapping;
void (*freepage)(struct page *);
@@ -842,7 +842,7 @@ static int __add_to_page_cache_locked(struct page *page,
return error;
}

- error = radix_tree_maybe_preload(gfp_mask & ~__GFP_HIGHMEM);
+ error = radix_tree_maybe_preload(gfp_mask & GFP_RECLAIM_MASK);
if (error) {
if (!huge)
mem_cgroup_cancel_charge(page, memcg, false);
--
2.17.0.484.g0c8726318c-goog