2010-11-21 11:26:19

by Arun Bhanu

[permalink] [raw]
Subject: [BUG?] [Ext4] INFO: suspicious rcu_dereference_check() usage

I saw this in kernel log messages while testing 2.6.37-rc2. I think it
appeared while mounting an external hard-disk. I can't seem to
reproduce it.

Please let me know if you need more info.

[47064.272151] ===================================================
[47064.273474] [ INFO: suspicious rcu_dereference_check() usage. ]
[47064.273956] ---------------------------------------------------
[47064.274431] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
[47064.274905]
[47064.274906] other info that might help us debug this:
[47064.274907]
[47064.276303]
[47064.276303] rcu_scheduler_active = 1, debug_locks = 0
[47064.277202] 2 locks held by mount/21199:
[47064.277635] #0: (&type->s_umount_key#20/1){+.+.+.}, at: [<c05007f0>] sget+0x21f/0x35d
[47064.278078] #1: (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c04f5e67>] migrate_page_move_mapping+0x3a/0x120
[47064.278529]
[47064.278529] stack backtrace:
[47064.279409] Pid: 21199, comm: mount Not tainted 2.6.37-rc2-ab2-589136bfa784a4558b397f017ca2f06f0ca9080e+ #1
[47064.279864] Call Trace:
[47064.280313] [<c0822e61>] ? printk+0x2d/0x34
[47064.280765] [<c04709c8>] lockdep_rcu_dereference+0x97/0x9f
[47064.281220] [<c04f5e28>] radix_tree_deref_slot+0x4a/0x4f
[47064.281680] [<c04f5ea7>] migrate_page_move_mapping+0x7a/0x120
[47064.282129] [<c04f6323>] migrate_page+0x1f/0x35
[47064.282573] [<c04f6464>] move_to_new_page+0x12b/0x164
[47064.283017] [<c04f6773>] migrate_pages+0x1e1/0x2ee
[47064.283463] [<c04eed89>] ? compaction_alloc+0x0/0x1ef
[47064.283918] [<c04eebdc>] compact_zone+0x24f/0x3fc
[47064.284363] [<c04ef0f4>] try_to_compact_pages+0x17c/0x1e6
[47064.284820] [<c04cac33>] __alloc_pages_nodemask+0x397/0x6b7
[47064.285276] [<c04caf75>] __get_free_pages+0x22/0x33
[47064.285729] [<c04f4304>] __kmalloc+0x2f/0x112
[47064.286193] [<c0435adc>] ? should_resched+0xd/0x28
[47064.286655] [<c0578ca9>] kzalloc.clone.58+0x12/0x14
[47064.287118] [<c057c7f1>] ext4_fill_super+0x1090/0x2521
[47064.287573] [<c040ea02>] ? native_sched_clock+0x14/0x52
[47064.288029] [<c054774a>] ? disk_name+0x86/0x90
[47064.288477] [<c0523bfa>] ? set_blocksize+0x33/0x78
[47064.288930] [<c0500aed>] mount_bdev+0x123/0x16d
[47064.289385] [<c057b761>] ? ext4_fill_super+0x0/0x2521
[47064.289842] [<c05776fd>] ? ext4_mount+0x0/0x24
[47064.290300] [<c057771c>] ext4_mount+0x1f/0x24
[47064.290760] [<c057b761>] ? ext4_fill_super+0x0/0x2521
[47064.291222] [<c05003e1>] vfs_kern_mount+0xa1/0x1ad
[47064.291687] [<c0511c36>] ? get_fs_type+0x38/0x91
[47064.292149] [<c0500546>] do_kern_mount+0x3d/0xc8
[47064.292615] [<c05142b2>] do_mount+0x614/0x640
[47064.293056] [<c04d69f0>] ? strndup_user+0x2e/0x3f
[47064.293489] [<c05144a1>] sys_mount+0x6d/0x99
[47064.293925] [<c040971f>] sysenter_do_call+0x12/0x38
[47064.368116] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: (null)

-Arun


2010-11-21 13:30:24

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [BUG?] [Ext4] INFO: suspicious rcu_dereference_check() usage

On Sun, Nov 21, 2010 at 07:26:11PM +0800, Arun Bhanu wrote:
> I saw this in kernel log messages while testing 2.6.37-rc2. I think it
> appeared while mounting an external hard-disk. I can't seem to
> reproduce it.

I could be wrong but this looks like it's a bug in mm/migrate.c in
migrate_page_move_mapping(): it is calling radix_tree_lookup_slot()
without first taking an rcu_read_lock().

It was triggered by a memory allocation out of ext4_fill_super(),
which then triggered a memory compaction/migration, but I don't
believe it's otherwise related to the ext4 code.

Over to the linux-mm folks for confirmation...

- Ted

> Please let me know if you need more info.
>
> [47064.272151] ===================================================
> [47064.273474] [ INFO: suspicious rcu_dereference_check() usage. ]
> [47064.273956] ---------------------------------------------------
> [47064.274431] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
> [47064.274905]
> [47064.274906] other info that might help us debug this:
> [47064.274907]
> [47064.276303]
> [47064.276303] rcu_scheduler_active = 1, debug_locks = 0
> [47064.277202] 2 locks held by mount/21199:
> [47064.277635] #0: (&type->s_umount_key#20/1){+.+.+.}, at: [<c05007f0>] sget+0x21f/0x35d
> [47064.278078] #1: (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c04f5e67>] migrate_page_move_mapping+0x3a/0x120
> [47064.278529]
> [47064.278529] stack backtrace:
> [47064.279409] Pid: 21199, comm: mount Not tainted 2.6.37-rc2-ab2-589136bfa784a4558b397f017ca2f06f0ca9080e+ #1
> [47064.279864] Call Trace:
> [47064.280313] [<c0822e61>] ? printk+0x2d/0x34
> [47064.280765] [<c04709c8>] lockdep_rcu_dereference+0x97/0x9f
> [47064.281220] [<c04f5e28>] radix_tree_deref_slot+0x4a/0x4f
> [47064.281680] [<c04f5ea7>] migrate_page_move_mapping+0x7a/0x120
> [47064.282129] [<c04f6323>] migrate_page+0x1f/0x35
> [47064.282573] [<c04f6464>] move_to_new_page+0x12b/0x164
> [47064.283017] [<c04f6773>] migrate_pages+0x1e1/0x2ee
> [47064.283463] [<c04eed89>] ? compaction_alloc+0x0/0x1ef
> [47064.283918] [<c04eebdc>] compact_zone+0x24f/0x3fc
> [47064.284363] [<c04ef0f4>] try_to_compact_pages+0x17c/0x1e6
> [47064.284820] [<c04cac33>] __alloc_pages_nodemask+0x397/0x6b7
> [47064.285276] [<c04caf75>] __get_free_pages+0x22/0x33
> [47064.285729] [<c04f4304>] __kmalloc+0x2f/0x112
> [47064.286193] [<c0435adc>] ? should_resched+0xd/0x28
> [47064.286655] [<c0578ca9>] kzalloc.clone.58+0x12/0x14
> [47064.287118] [<c057c7f1>] ext4_fill_super+0x1090/0x2521
> [47064.287573] [<c040ea02>] ? native_sched_clock+0x14/0x52
> [47064.288029] [<c054774a>] ? disk_name+0x86/0x90
> [47064.288477] [<c0523bfa>] ? set_blocksize+0x33/0x78
> [47064.288930] [<c0500aed>] mount_bdev+0x123/0x16d
> [47064.289385] [<c057b761>] ? ext4_fill_super+0x0/0x2521
> [47064.289842] [<c05776fd>] ? ext4_mount+0x0/0x24
> [47064.290300] [<c057771c>] ext4_mount+0x1f/0x24
> [47064.290760] [<c057b761>] ? ext4_fill_super+0x0/0x2521
> [47064.291222] [<c05003e1>] vfs_kern_mount+0xa1/0x1ad
> [47064.291687] [<c0511c36>] ? get_fs_type+0x38/0x91
> [47064.292149] [<c0500546>] do_kern_mount+0x3d/0xc8
> [47064.292615] [<c05142b2>] do_mount+0x614/0x640
> [47064.293056] [<c04d69f0>] ? strndup_user+0x2e/0x3f
> [47064.293489] [<c05144a1>] sys_mount+0x6d/0x99
> [47064.293925] [<c040971f>] sysenter_do_call+0x12/0x38
> [47064.368116] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: (null)
>
> -Arun

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2010-11-21 15:39:57

by Minchan Kim

[permalink] [raw]
Subject: Re: [BUG?] [Ext4] INFO: suspicious rcu_dereference_check() usage

On Sun, Nov 21, 2010 at 08:30:24AM -0500, Ted Ts'o wrote:
> On Sun, Nov 21, 2010 at 07:26:11PM +0800, Arun Bhanu wrote:
> > I saw this in kernel log messages while testing 2.6.37-rc2. I think it
> > appeared while mounting an external hard-disk. I can't seem to
> > reproduce it.
>
> I could be wrong but this looks like it's a bug in mm/migrate.c in
> migrate_page_move_mapping(): it is calling radix_tree_lookup_slot()
> without first taking an rcu_read_lock().
>
> It was triggered by a memory allocation out of ext4_fill_super(),
> which then triggered a memory compaction/migration, but I don't
> believe it's otherwise related to the ext4 code.
>
> Over to the linux-mm folks for confirmation...

I think it's no problem.

That's because migration always holds lock_page on the file page.
So the page couldn't remove from radix.


--
Kind regards,
Minchan Kim

2010-11-21 17:37:26

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [BUG?] [Ext4] INFO: suspicious rcu_dereference_check() usage

On Mon, Nov 22, 2010 at 12:39:49AM +0900, Minchan Kim wrote:
>
> I think it's no problem.
>
> That's because migration always holds lock_page on the file page.
> So the page couldn't remove from radix.

It may be "ok" in that it won't cause a race, but it still leaves an
unsightly warning if LOCKDEP is enabled, and LOCKDEP warnings will
cause /proc_lock_stat to be disabled. So I think it still needs to be
fixed by adding rcu_read_lock()/rcu_read_unlock() to
migrate_page_move_mapping().

- Ted

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2010-11-22 00:38:49

by Minchan Kim

[permalink] [raw]
Subject: Re: [BUG?] [Ext4] INFO: suspicious rcu_dereference_check() usage

On Mon, Nov 22, 2010 at 2:37 AM, Ted Ts'o <[email protected]> wrote:
> On Mon, Nov 22, 2010 at 12:39:49AM +0900, Minchan Kim wrote:
>>
>> I think it's no problem.
>>
>> That's because migration always holds lock_page on the file page.
>> So the page couldn't remove from radix.
>
> It may be "ok" in that it won't cause a race, but it still leaves an
> unsightly warning if LOCKDEP is enabled, and LOCKDEP warnings will
> cause /proc_lock_stat to be disabled. ?So I think it still needs to be
> fixed by adding rcu_read_lock()/rcu_read_unlock() to
> migrate_page_move_mapping().
>
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? - Ted
>

Yes. if it is really "ok" about race, we will add rcu_read_lock with
below comment to prevent false positive.
"suppress RCU lockdep false positives".
But I am not sure it's good although rcu_read_lock is little cost.
Whenever we find false positive, should we add rcu_read_lock to
suppress although it's no problem in real product?
Couldn't we provide following function? (or we might have already it
but I missed it. )

/*
* Suppress RCU lockdep false positive.
*/
#ifdef CONFIG_PROVE_RCU
#define rcu_read_lock_suppress rcu_read_lock
#else
#define rcu_read_lock_suppress
#endif


--
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href
ilto:"[email protected]"> [email protected] </a>

2010-11-22 03:31:14

by Milton Miller

[permalink] [raw]
Subject: Re: [BUG?] [Ext4] INFO: suspicious rcu_dereference_check() usage

On 2010-11-22 at around 0:38:49, Minchan Kim wrote:
> On Mon, Nov 22, 2010 at 2:37 AM, Ted Ts'o <[email protected]> wrote:
> > On Mon, Nov 22, 2010 at 12:39:49AM +0900, Minchan Kim wrote:
> > >
> > > I think it's no problem.
> > >
> > > That's because migration always holds lock_page on the file page.
> > > So the page couldn't remove from radix.
> >
> > It may be "ok" in that it won't cause a race, but it still leaves an
> > unsightly warning if LOCKDEP is enabled, and LOCKDEP warnings will
> > cause /proc_lock_stat to be disabled. So I think it still needs to be
> > fixed by adding rcu_read_lock()/rcu_read_unlock() to
> > migrate_page_move_mapping().
> >
> > - Ted
> >
>
> Yes. if it is really "ok" about race, we will add rcu_read_lock with
> below comment to prevent false positive.
> "suppress RCU lockdep false positives".
> But I am not sure it's good although rcu_read_lock is little cost.
> Whenever we find false positive, should we add rcu_read_lock to
> suppress although it's no problem in real product?
> Couldn't we provide following function? (or we might have already it
> but I missed it. )
>
> /*
> * Suppress RCU lockdep false positive.
> */
> #ifdef CONFIG_PROVE_RCU
> #define rcu_read_lock_suppress rcu_read_lock
> #else
> #define rcu_read_lock_suppress
> #endif

No, you don't need anything like this, as rcu_dereference_check already
takes a test for alternate locking.

However, looking more closely at the code, it appears this is the
"the tree is write locked" case as described in radix-tree.h

Looking at rcupdate.h, perhaps we need a version of radix_tree_deref_slot
that uses rcu_dereference_protected?

Copying Paul McKenney for rcu ...

milton

2010-11-22 06:16:23

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [BUG?] [Ext4] INFO: suspicious rcu_dereference_check() usage

On Sun, Nov 21, 2010 at 09:31:14PM -0600, Milton Miller wrote:
> On 2010-11-22 at around 0:38:49, Minchan Kim wrote:
> > On Mon, Nov 22, 2010 at 2:37 AM, Ted Ts'o <[email protected]> wrote:
> > > On Mon, Nov 22, 2010 at 12:39:49AM +0900, Minchan Kim wrote:
> > > >
> > > > I think it's no problem.
> > > >
> > > > That's because migration always holds lock_page on the file page.
> > > > So the page couldn't remove from radix.
> > >
> > > It may be "ok" in that it won't cause a race, but it still leaves an
> > > unsightly warning if LOCKDEP is enabled, and LOCKDEP warnings will
> > > cause /proc_lock_stat to be disabled. So I think it still needs to be
> > > fixed by adding rcu_read_lock()/rcu_read_unlock() to
> > > migrate_page_move_mapping().
> > >
> > > - Ted
> > >
> >
> > Yes. if it is really "ok" about race, we will add rcu_read_lock with
> > below comment to prevent false positive.
> > "suppress RCU lockdep false positives".
> > But I am not sure it's good although rcu_read_lock is little cost.
> > Whenever we find false positive, should we add rcu_read_lock to
> > suppress although it's no problem in real product?
> > Couldn't we provide following function? (or we might have already it
> > but I missed it. )
> >
> > /*
> > * Suppress RCU lockdep false positive.
> > */
> > #ifdef CONFIG_PROVE_RCU
> > #define rcu_read_lock_suppress rcu_read_lock
> > #else
> > #define rcu_read_lock_suppress
> > #endif
>
> No, you don't need anything like this, as rcu_dereference_check already
> takes a test for alternate locking.
>
> However, looking more closely at the code, it appears this is the
> "the tree is write locked" case as described in radix-tree.h
>
> Looking at rcupdate.h, perhaps we need a version of radix_tree_deref_slot
> that uses rcu_dereference_protected?
>
> Copying Paul McKenney for rcu ...

This approach could work. One way of doing it would be to add a second
argument:

static inline void *radix_tree_deref_slot_check(void **pslot, int ldc)
{
void *ret = rcu_dereference_check(*pslot, ldc);
if (unlikely(radix_tree_is_indirect_ptr(ret)))
ret = RADIX_TREE_RETRY;
return ret;
}

static inline void *radix_tree_deref_slot(void **pslot)
{
return radix_tree_deref_slot_check(pslot, rcu_read_lock_held());
}

Another alternative would have radix_tree_deref_slot() pass "1" into
the "ldc" argument, which reduces splats but at the expense of failing
to detect problems. ;-)

Thanx, Paul

2010-11-23 07:16:56

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [BUG?] [Ext4] INFO: suspicious rcu_dereference_check() usage

> On Mon, Nov 22, 2010 at 12:39:49AM +0900, Minchan Kim wrote:
> >
> > I think it's no problem.
> >
> > That's because migration always holds lock_page on the file page.
> > So the page couldn't remove from radix.
>
> It may be "ok" in that it won't cause a race, but it still leaves an
> unsightly warning if LOCKDEP is enabled, and LOCKDEP warnings will
> cause /proc_lock_stat to be disabled. So I think it still needs to be
> fixed by adding rcu_read_lock()/rcu_read_unlock() to
> migrate_page_move_mapping().

Hi Ted,

Current mmotm has following patch and I think it should be fixed your
issue.

Thanks.




From: Zeng Zhaoming <[email protected]>

find_task_by_vpid() should be protected by rcu_read_lock(), to prevent
free_pid() reclaiming pid.

Signed-off-by: Zeng Zhaoming <[email protected]>
Cc: "Paul E. McKenney" <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Christoph Lameter <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

mm/mempolicy.c | 3 +++
1 file changed, 3 insertions(+)

diff -puN mm/mempolicy.c~mm-mempolicyc-add-rcu-read-lock-to-protect-pid-structure mm/mempolicy.c
--- a/mm/mempolicy.c~mm-mempolicyc-add-rcu-read-lock-to-protect-pid-structure
+++ a/mm/mempolicy.c
@@ -1307,15 +1307,18 @@ SYSCALL_DEFINE4(migrate_pages, pid_t, pi
goto out;

/* Find the mm_struct */
+ rcu_read_lock();
read_lock(&tasklist_lock);
task = pid ? find_task_by_vpid(pid) : current;
if (!task) {
read_unlock(&tasklist_lock);
+ rcu_read_unlock();
err = -ESRCH;
goto out;
}
mm = get_task_mm(task);
read_unlock(&tasklist_lock);
+ rcu_read_unlock();


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2010-12-07 19:02:29

by Gerald Schaefer

[permalink] [raw]
Subject: [BUG?] memory hotplug: include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!

I got the same warning as reported by Arun Bhanu in "[BUG?] [Ext4] INFO:
suspicious rcu_dereference_check() usage" during memory hotplug test on
2.6.37-rc5, see below.

migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for anonymous
pages, as introduced by git commit 989f89c57e6361e7d16fbd9572b5da7d313b073d.
This should be ok, as the comment suggests, so the warning is probably a
false positive. Eventually, migrate_page_move_mapping() will call
radix_tree_deref_slot() with tree_lock held, but w/o rcu_read_lock() for
non-anonymous pages.

As suggested by Milton before, a new version of radix_tree_deref_slot that
uses rcu_dereference_protected could help, if this is a false positive, or
maybe the rcu_read_lock() should be called for all pages, not just anonymous.
Any thoughts?


===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 0
6 locks held by sh/960:
#0: (&buffer->mutex){+.+.+.}, at: [<00000000002afa7a>] sysfs_write_file+0x4a/0x1a8
#1: (s_active#52){.+.+.+}, at: [<00000000002afb02>] sysfs_write_file+0xd2/0x1a8
#2: (&mem->state_mutex){+.+.+.}, at: [<0000000000404d78>] memory_block_change_state+0x40/0x1a0
#3: (mem_hotplug_mutex){+.+.+.}, at: [<00000000002242c4>] lock_memory_hotplug+0x2c/0x44
#4: (pm_mutex){+.+.+.}, at: [<000000000022483c>] remove_memory+0x104/0x758
#5: (&(&inode->i_data.tree_lock)->rlock){..-...}, at: [<000000000022624a>] migrate_page_move_mapping+0x4a/0x2d8

stack backtrace:
CPU: 2 Not tainted 2.6.37-rc5 #2
Process sh (pid: 960, task: 0000000018e38640, ksp: 000000000f037818)
000000000f037ad8 000000000f037a58 0000000000000002 0000000000000000
000000000f037af8 000000000f037a70 000000000f037a70 0000000000567c42
0000000000000000 0000000017bd7e78 0000000000000002 0000000017bd2e30
000003e00000000d 000003e00000000c 000000000f037ac0 0000000000000000
0000000000000000 0000000000100bfa 000000000f037a58 000000000f037a98
Call Trace:
([<0000000000100b02>] show_trace+0xee/0x144)
[<00000000002263ea>] migrate_page_move_mapping+0x1ea/0x2d8
[<0000000000226b1c>] migrate_page+0x38/0x68
[<0000000000226c36>] move_to_new_page+0xea/0x2bc
[<00000000002276f6>] migrate_pages+0x496/0x568
[<0000000000224be6>] remove_memory+0x4ae/0x758
[<0000000000404e20>] memory_block_change_state+0xe8/0x1a0
[<0000000000404fda>] store_mem_state+0x102/0x138
[<00000000002afb26>] sysfs_write_file+0xf6/0x1a8
[<000000000023357c>] vfs_write+0xac/0x18c
[<0000000000233758>] SyS_write+0x58/0xa8
[<0000000000113976>] sysc_noemu+0x16/0x1c
[<0000020000162edc>] 0x20000162edc
INFO: lockdep is turned off.



2010-12-08 01:25:42

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [BUG?] memory hotplug: include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!

On Tue, 07 Dec 2010 20:01:49 +0100
Gerald Schaefer <[email protected]> wrote:

> I got the same warning as reported by Arun Bhanu in "[BUG?] [Ext4] INFO:
> suspicious rcu_dereference_check() usage" during memory hotplug test on
> 2.6.37-rc5, see below.
>
> migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for anonymous
> pages, as introduced by git commit 989f89c57e6361e7d16fbd9572b5da7d313b073d.
> This should be ok, as the comment suggests, so the warning is probably a
> false positive. Eventually, migrate_page_move_mapping() will call
> radix_tree_deref_slot() with tree_lock held, but w/o rcu_read_lock() for
> non-anonymous pages.
>
> As suggested by Milton before, a new version of radix_tree_deref_slot that
> uses rcu_dereference_protected could help, if this is a false positive, or
> maybe the rcu_read_lock() should be called for all pages, not just anonymous.
> Any thoughts?
>

Hmm, in radix_tree_deref_slot() comment,

140 * For use with radix_tree_lookup_slot(). Caller must hold tree at least read
141 * locked across slot lookup and dereference. More likely, will be used with
142 * radix_tree_replace_slot(), as well, so caller will hold tree write locked.

racy update must not happen. rcu_dereference_protected() sounds nice.

Thanks,
-Kame


2010-12-16 13:50:29

by Gerald Schaefer

[permalink] [raw]
Subject: Re: [BUG?] memory hotplug: include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!

I got the same warning now after increasing /proc/sys/vm/nr_hugepages, see
below. Both cases are easily reproducible: memory unplug with big page cache,
or adding large pages during run-time.

===================================================
[ INFO: suspicious rcu_dereference_check() usage. ]
---------------------------------------------------
include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!

other info that might help us debug this:


rcu_scheduler_active = 1, debug_locks = 0
1 lock held by bash/761:
#0: (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<00000000002263ae>] migrate_page_move_mapping+0x4a/0x2d8

stack backtrace:
CPU: 1 Not tainted 2.6.37-rc6 #4
Process bash (pid: 761, task: 00000000181b5540, ksp: 00000000181bb7f8)
00000000181bb818 00000000181bb798 0000000000000002 0000000000000000
00000000181bb838 00000000181bb7b0 00000000181bb7b0 000000000056bafa
0000000000000000 000000003f42bdf0 0000000000000002 000000001c43be30
000003e00000000d 000003e00000000c 00000000181bb800 0000000000000000
0000000000000000 0000000000100bfa 00000000181bb798 00000000181bb7d8
Call Trace:
([<0000000000100b02>] show_trace+0xee/0x144)
[<000000000022654e>] migrate_page_move_mapping+0x1ea/0x2d8
[<0000000000226c80>] migrate_page+0x38/0x68
[<0000000000226d9a>] move_to_new_page+0xea/0x2bc
[<000000000022785a>] migrate_pages+0x496/0x568
[<000000000021e24e>] compact_zone+0x432/0x7d8
[<000000000021e772>] compact_zone_order+0x9e/0xbc
[<000000000021ed52>] try_to_compact_pages+0x1ba/0x24c
[<00000000001e1afa>] __alloc_pages_nodemask+0x86a/0xa64
[<000000000021c80c>] alloc_fresh_huge_page.clone.2+0x68/0x18c
[<000000000021cc4c>] set_max_huge_pages.clone.0+0xa4/0x1ac
[<000000000021ce06>] hugetlb_sysctl_handler+0xb2/0xcc
[<00000000002a6572>] proc_sys_call_handler+0xe6/0x10c
[<00000000002a65be>] proc_sys_write+0x26/0x34
[<00000000002336e0>] vfs_write+0xac/0x18c
[<00000000002338bc>] SyS_write+0x58/0xa8
[<0000000000113976>] sysc_noemu+0x16/0x1c
[<0000020000162edc>] 0x20000162edc
INFO: lockdep is turned off.

I honestly do not understand 100% why this is a false positive, seeing that
e.g. find_get_page() will also use radix_tree_deref_slot(), holding only the
rcu_read_lock, while migrate_page_move_mapping() has no rcu_read_lock() but
the &mapping->tree_lock instead. So I'm not quite sure how to fix this
properly, but simply adding rcu_read_lock/unlock() to the affected code paths,
even if it is not necessary for synchronization, would get rid of the warning,
like in the following patch. Any ideas?

---
fs/hugetlbfs/inode.c | 2 ++
mm/migrate.c | 4 ++++
2 files changed, 6 insertions(+)

--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -580,7 +580,9 @@ static int hugetlbfs_migrate_page(struct
{
int rc;

+ rcu_read_lock();
rc = migrate_huge_page_move_mapping(mapping, newpage, page);
+ rcu_read_unlock();
if (rc)
return rc;
migrate_page_copy(newpage, page);
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -417,7 +417,9 @@ int migrate_page(struct address_space *m

BUG_ON(PageWriteback(page)); /* Writeback must be complete */

+ rcu_read_lock();
rc = migrate_page_move_mapping(mapping, newpage, page);
+ rcu_read_unlock();

if (rc)
return rc;
@@ -444,7 +446,9 @@ int buffer_migrate_page(struct address_s

head = page_buffers(page);

+ rcu_read_lock();
rc = migrate_page_move_mapping(mapping, newpage, page);
+ rcu_read_unlock();

if (rc)
return rc;



2010-12-17 00:04:14

by Minchan Kim

[permalink] [raw]
Subject: Re: [BUG?] memory hotplug: include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!

On Thu, Dec 16, 2010 at 10:50 PM, Gerald Schaefer
<[email protected]> wrote:
> I got the same warning now after increasing /proc/sys/vm/nr_hugepages, see
> below. Both cases are easily reproducible: memory unplug with big page cache,
> or adding large pages during run-time.
>
> ===================================================
> [ INFO: suspicious rcu_dereference_check() usage. ]
> ---------------------------------------------------
> include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
>
> other info that might help us debug this:
>
>
> rcu_scheduler_active = 1, debug_locks = 0
> 1 lock held by bash/761:
> ?#0: ?(&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<00000000002263ae>] migrate_page_move_mapping+0x4a/0x2d8
>
> stack backtrace:
> CPU: 1 Not tainted 2.6.37-rc6 #4
> Process bash (pid: 761, task: 00000000181b5540, ksp: 00000000181bb7f8)
> 00000000181bb818 00000000181bb798 0000000000000002 0000000000000000
> ? ? ? 00000000181bb838 00000000181bb7b0 00000000181bb7b0 000000000056bafa
> ? ? ? 0000000000000000 000000003f42bdf0 0000000000000002 000000001c43be30
> ? ? ? 000003e00000000d 000003e00000000c 00000000181bb800 0000000000000000
> ? ? ? 0000000000000000 0000000000100bfa 00000000181bb798 00000000181bb7d8
> Call Trace:
> ([<0000000000100b02>] show_trace+0xee/0x144)
> ?[<000000000022654e>] migrate_page_move_mapping+0x1ea/0x2d8
> ?[<0000000000226c80>] migrate_page+0x38/0x68
> ?[<0000000000226d9a>] move_to_new_page+0xea/0x2bc
> ?[<000000000022785a>] migrate_pages+0x496/0x568
> ?[<000000000021e24e>] compact_zone+0x432/0x7d8
> ?[<000000000021e772>] compact_zone_order+0x9e/0xbc
> ?[<000000000021ed52>] try_to_compact_pages+0x1ba/0x24c
> ?[<00000000001e1afa>] __alloc_pages_nodemask+0x86a/0xa64
> ?[<000000000021c80c>] alloc_fresh_huge_page.clone.2+0x68/0x18c
> ?[<000000000021cc4c>] set_max_huge_pages.clone.0+0xa4/0x1ac
> ?[<000000000021ce06>] hugetlb_sysctl_handler+0xb2/0xcc
> ?[<00000000002a6572>] proc_sys_call_handler+0xe6/0x10c
> ?[<00000000002a65be>] proc_sys_write+0x26/0x34
> ?[<00000000002336e0>] vfs_write+0xac/0x18c
> ?[<00000000002338bc>] SyS_write+0x58/0xa8
> ?[<0000000000113976>] sysc_noemu+0x16/0x1c
> ?[<0000020000162edc>] 0x20000162edc
> INFO: lockdep is turned off.
>
> I honestly do not understand 100% why this is a false positive, seeing that
> e.g. find_get_page() will also use radix_tree_deref_slot(), holding only the
> rcu_read_lock, while migrate_page_move_mapping() has no rcu_read_lock() but
> the &mapping->tree_lock instead. So I'm not quite sure how to fix this
> properly, but simply adding rcu_read_lock/unlock() to the affected code paths,
> even if it is not necessary for synchronization, would get rid of the warning,
> like in the following patch. Any ideas?

In case of anon page, we hold rcu_read_lock in unmap_and_move.
The problem is file-backed page. In case of that, we hold lock_page
and mapping->tree_lock as update-side lock.
So we don't need rcu_read_lock.

>
> ---
> ?fs/hugetlbfs/inode.c | ? ?2 ++
> ?mm/migrate.c ? ? ? ? | ? ?4 ++++
> ?2 files changed, 6 insertions(+)
>
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -580,7 +580,9 @@ static int hugetlbfs_migrate_page(struct
> ?{
> ? ? ? ?int rc;
>
> + ? ? ? rcu_read_lock();
> ? ? ? ?rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> + ? ? ? rcu_read_unlock();
> ? ? ? ?if (rc)
> ? ? ? ? ? ? ? ?return rc;
> ? ? ? ?migrate_page_copy(newpage, page);
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -417,7 +417,9 @@ int migrate_page(struct address_space *m
>
> ? ? ? ?BUG_ON(PageWriteback(page)); ? ?/* Writeback must be complete */
>
> + ? ? ? rcu_read_lock();
> ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> + ? ? ? rcu_read_unlock();
>
> ? ? ? ?if (rc)
> ? ? ? ? ? ? ? ?return rc;
> @@ -444,7 +446,9 @@ int buffer_migrate_page(struct address_s
>
> ? ? ? ?head = page_buffers(page);
>
> + ? ? ? rcu_read_lock();
> ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> + ? ? ? rcu_read_unlock();
>
> ? ? ? ?if (rc)
> ? ? ? ? ? ? ? ?return rc;
>
>
>


How about this?
Maybe Paul have better idea.
(It's apparently be word-wrapped.)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index ab2baa5..135af1e 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -146,6 +146,20 @@ static inline void *radix_tree_deref_slot(void **pslot)
}

/**
+ * radix_tree_deref_slot_nocheck - dereference a slot without RCU check
+ * @pslot: pointer to slot, returned by radix_tree_lookup_slot
+ * Returns: item that was stored in that slot with any direct pointer flag
+ * removed.
+ *
+ * This functions works like radix_tree_deref_slot except it doesn't check
+ * RCU rule. Normally this funcion is used with update-side lock.
+ * You should use this function very carefully.
+ */
+static inline void *radix_tree_deref_slot_nocheck(void **pslot)
+{
+ return rcu_dereference_protected(*pslot, 1);
+}
+/**
* radix_tree_deref_retry - check radix_tree_deref_slot
* @arg: pointer returned by radix_tree_deref_slot
* Returns: 0 if retry is not required, otherwise retry is required
diff --git a/mm/migrate.c b/mm/migrate.c
index 2eb2243..5be2841 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -244,7 +244,8 @@ static int migrate_page_move_mapping(struct
address_space *mappin

expected_count = 2 + page_has_private(page);
if (page_count(page) != expected_count ||
- (struct page *)radix_tree_deref_slot(pslot) != page) {
+ (struct page *)radix_tree_deref_slot_nocheck(pslot)
+ != page) {
spin_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}
@@ -316,7 +317,8 @@ int migrate_huge_page_move_mapping(struct
address_space *mapping,

expected_count = 2 + page_has_private(page);
if (page_count(page) != expected_count ||
- (struct page *)radix_tree_deref_slot(pslot) != page) {
+ (struct page *)radix_tree_deref_slot_nocheck(pslot)
+ != page) {
spin_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}




--
Kind regards,
Minchan Kim

2010-12-17 05:47:26

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [BUG?] memory hotplug: include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!

On Fri, Dec 17, 2010 at 09:04:13AM +0900, Minchan Kim wrote:
> On Thu, Dec 16, 2010 at 10:50 PM, Gerald Schaefer
> <[email protected]> wrote:
> > I got the same warning now after increasing /proc/sys/vm/nr_hugepages, see
> > below. Both cases are easily reproducible: memory unplug with big page cache,
> > or adding large pages during run-time.
> >
> > ===================================================
> > [ INFO: suspicious rcu_dereference_check() usage. ]
> > ---------------------------------------------------
> > include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
> >
> > other info that might help us debug this:
> >
> >
> > rcu_scheduler_active = 1, debug_locks = 0
> > 1 lock held by bash/761:
> > ?#0: ?(&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<00000000002263ae>] migrate_page_move_mapping+0x4a/0x2d8
> >
> > stack backtrace:
> > CPU: 1 Not tainted 2.6.37-rc6 #4
> > Process bash (pid: 761, task: 00000000181b5540, ksp: 00000000181bb7f8)
> > 00000000181bb818 00000000181bb798 0000000000000002 0000000000000000
> > ? ? ? 00000000181bb838 00000000181bb7b0 00000000181bb7b0 000000000056bafa
> > ? ? ? 0000000000000000 000000003f42bdf0 0000000000000002 000000001c43be30
> > ? ? ? 000003e00000000d 000003e00000000c 00000000181bb800 0000000000000000
> > ? ? ? 0000000000000000 0000000000100bfa 00000000181bb798 00000000181bb7d8
> > Call Trace:
> > ([<0000000000100b02>] show_trace+0xee/0x144)
> > ?[<000000000022654e>] migrate_page_move_mapping+0x1ea/0x2d8
> > ?[<0000000000226c80>] migrate_page+0x38/0x68
> > ?[<0000000000226d9a>] move_to_new_page+0xea/0x2bc
> > ?[<000000000022785a>] migrate_pages+0x496/0x568
> > ?[<000000000021e24e>] compact_zone+0x432/0x7d8
> > ?[<000000000021e772>] compact_zone_order+0x9e/0xbc
> > ?[<000000000021ed52>] try_to_compact_pages+0x1ba/0x24c
> > ?[<00000000001e1afa>] __alloc_pages_nodemask+0x86a/0xa64
> > ?[<000000000021c80c>] alloc_fresh_huge_page.clone.2+0x68/0x18c
> > ?[<000000000021cc4c>] set_max_huge_pages.clone.0+0xa4/0x1ac
> > ?[<000000000021ce06>] hugetlb_sysctl_handler+0xb2/0xcc
> > ?[<00000000002a6572>] proc_sys_call_handler+0xe6/0x10c
> > ?[<00000000002a65be>] proc_sys_write+0x26/0x34
> > ?[<00000000002336e0>] vfs_write+0xac/0x18c
> > ?[<00000000002338bc>] SyS_write+0x58/0xa8
> > ?[<0000000000113976>] sysc_noemu+0x16/0x1c
> > ?[<0000020000162edc>] 0x20000162edc
> > INFO: lockdep is turned off.
> >
> > I honestly do not understand 100% why this is a false positive, seeing that
> > e.g. find_get_page() will also use radix_tree_deref_slot(), holding only the
> > rcu_read_lock, while migrate_page_move_mapping() has no rcu_read_lock() but
> > the &mapping->tree_lock instead. So I'm not quite sure how to fix this
> > properly, but simply adding rcu_read_lock/unlock() to the affected code paths,
> > even if it is not necessary for synchronization, would get rid of the warning,
> > like in the following patch. Any ideas?
>
> In case of anon page, we hold rcu_read_lock in unmap_and_move.
> The problem is file-backed page. In case of that, we hold lock_page
> and mapping->tree_lock as update-side lock.
> So we don't need rcu_read_lock.
>
> >
> > ---
> > ?fs/hugetlbfs/inode.c | ? ?2 ++
> > ?mm/migrate.c ? ? ? ? | ? ?4 ++++
> > ?2 files changed, 6 insertions(+)
> >
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -580,7 +580,9 @@ static int hugetlbfs_migrate_page(struct
> > ?{
> > ? ? ? ?int rc;
> >
> > + ? ? ? rcu_read_lock();
> > ? ? ? ?rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> > + ? ? ? rcu_read_unlock();
> > ? ? ? ?if (rc)
> > ? ? ? ? ? ? ? ?return rc;
> > ? ? ? ?migrate_page_copy(newpage, page);
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -417,7 +417,9 @@ int migrate_page(struct address_space *m
> >
> > ? ? ? ?BUG_ON(PageWriteback(page)); ? ?/* Writeback must be complete */
> >
> > + ? ? ? rcu_read_lock();
> > ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> > + ? ? ? rcu_read_unlock();
> >
> > ? ? ? ?if (rc)
> > ? ? ? ? ? ? ? ?return rc;
> > @@ -444,7 +446,9 @@ int buffer_migrate_page(struct address_s
> >
> > ? ? ? ?head = page_buffers(page);
> >
> > + ? ? ? rcu_read_lock();
> > ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> > + ? ? ? rcu_read_unlock();
> >
> > ? ? ? ?if (rc)
> > ? ? ? ? ? ? ? ?return rc;
> >
> >
> >
>
>
> How about this?
> Maybe Paul have better idea.
> (It's apparently be word-wrapped.)
>
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index ab2baa5..135af1e 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -146,6 +146,20 @@ static inline void *radix_tree_deref_slot(void **pslot)
> }
>
> /**
> + * radix_tree_deref_slot_nocheck - dereference a slot without RCU check
> + * @pslot: pointer to slot, returned by radix_tree_lookup_slot
> + * Returns: item that was stored in that slot with any direct pointer flag
> + * removed.
> + *
> + * This functions works like radix_tree_deref_slot except it doesn't check
> + * RCU rule. Normally this funcion is used with update-side lock.
> + * You should use this function very carefully.
> + */
> +static inline void *radix_tree_deref_slot_nocheck(void **pslot)
> +{
> + return rcu_dereference_protected(*pslot, 1);

I suggest replacing the "1" with lockdep expressions for the locks
that you say might be held:

return rcu_dereference_check(*pslot,
lockdep_is_held(&mapping->tree_lock));

This assumes that when you said "and" you meant both lock_page() and
mapping->tree_lock. Also you need to pass in the mapping, which
should not be a problem given likely inlining.

If you meant that either mapping->tree_lock or page_lock() might be
held, I suppose that the page_lock() state could be passed in, but
perhaps better to take a general lockdep expression.

So, either or both? ;-)

Thanx, Paul

> +}
> +/**
> * radix_tree_deref_retry - check radix_tree_deref_slot
> * @arg: pointer returned by radix_tree_deref_slot
> * Returns: 0 if retry is not required, otherwise retry is required
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 2eb2243..5be2841 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -244,7 +244,8 @@ static int migrate_page_move_mapping(struct
> address_space *mappin
>
> expected_count = 2 + page_has_private(page);
> if (page_count(page) != expected_count ||
> - (struct page *)radix_tree_deref_slot(pslot) != page) {
> + (struct page *)radix_tree_deref_slot_nocheck(pslot)
> + != page) {
> spin_unlock_irq(&mapping->tree_lock);
> return -EAGAIN;
> }
> @@ -316,7 +317,8 @@ int migrate_huge_page_move_mapping(struct
> address_space *mapping,
>
> expected_count = 2 + page_has_private(page);
> if (page_count(page) != expected_count ||
> - (struct page *)radix_tree_deref_slot(pslot) != page) {
> + (struct page *)radix_tree_deref_slot_nocheck(pslot)
> + != page) {
> spin_unlock_irq(&mapping->tree_lock);
> return -EAGAIN;
> }
>
>
>
>
> --
> Kind regards,
> Minchan Kim
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2010-12-17 05:59:08

by Eric Dumazet

[permalink] [raw]
Subject: Re: [BUG?] memory hotplug: include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!

Le jeudi 16 décembre 2010 à 21:47 -0800, Paul E. McKenney a écrit :
> On Fri, Dec 17, 2010 at 09:04:13AM +0900, Minchan Kim wrote:
> > How about this?
> > Maybe Paul have better idea.
> > (It's apparently be word-wrapped.)
> >
> > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > index ab2baa5..135af1e 100644
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -146,6 +146,20 @@ static inline void *radix_tree_deref_slot(void **pslot)
> > }
> >
> > /**
> > + * radix_tree_deref_slot_nocheck - dereference a slot without RCU check
> > + * @pslot: pointer to slot, returned by radix_tree_lookup_slot
> > + * Returns: item that was stored in that slot with any direct pointer flag
> > + * removed.
> > + *
> > + * This functions works like radix_tree_deref_slot except it doesn't check
> > + * RCU rule. Normally this funcion is used with update-side lock.
> > + * You should use this function very carefully.
> > + */
> > +static inline void *radix_tree_deref_slot_nocheck(void **pslot)
> > +{
> > + return rcu_dereference_protected(*pslot, 1);
>
> I suggest replacing the "1" with lockdep expressions for the locks
> that you say might be held:
>
> return rcu_dereference_check(*pslot,
> lockdep_is_held(&mapping->tree_lock));
>

Yes, but point was also to use rcu_dereference_protected() here, not
rcu_dereference_check().



> This assumes that when you said "and" you meant both lock_page() and
> mapping->tree_lock. Also you need to pass in the mapping, which
> should not be a problem given likely inlining.
>
> If you meant that either mapping->tree_lock or page_lock() might be
> held, I suppose that the page_lock() state could be passed in, but
> perhaps better to take a general lockdep expression.
>
> So, either or both? ;-)
>

Thanks

2010-12-17 08:39:32

by Mel Gorman

[permalink] [raw]
Subject: Re: [BUG?] memory hotplug: include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!

On Fri, Dec 17, 2010 at 09:04:13AM +0900, Minchan Kim wrote:
> On Thu, Dec 16, 2010 at 10:50 PM, Gerald Schaefer
> <[email protected]> wrote:
> > I got the same warning now after increasing /proc/sys/vm/nr_hugepages, see
> > below. Both cases are easily reproducible: memory unplug with big page cache,
> > or adding large pages during run-time.
> >
> > ===================================================
> > [ INFO: suspicious rcu_dereference_check() usage. ]
> > ---------------------------------------------------
> > include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
> >
> > other info that might help us debug this:
> >
> >
> > rcu_scheduler_active = 1, debug_locks = 0
> > 1 lock held by bash/761:
> > ?#0: ?(&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<00000000002263ae>] migrate_page_move_mapping+0x4a/0x2d8
> >
> > stack backtrace:
> > CPU: 1 Not tainted 2.6.37-rc6 #4
> > Process bash (pid: 761, task: 00000000181b5540, ksp: 00000000181bb7f8)
> > 00000000181bb818 00000000181bb798 0000000000000002 0000000000000000
> > ? ? ? 00000000181bb838 00000000181bb7b0 00000000181bb7b0 000000000056bafa
> > ? ? ? 0000000000000000 000000003f42bdf0 0000000000000002 000000001c43be30
> > ? ? ? 000003e00000000d 000003e00000000c 00000000181bb800 0000000000000000
> > ? ? ? 0000000000000000 0000000000100bfa 00000000181bb798 00000000181bb7d8
> > Call Trace:
> > ([<0000000000100b02>] show_trace+0xee/0x144)
> > ?[<000000000022654e>] migrate_page_move_mapping+0x1ea/0x2d8
> > ?[<0000000000226c80>] migrate_page+0x38/0x68
> > ?[<0000000000226d9a>] move_to_new_page+0xea/0x2bc
> > ?[<000000000022785a>] migrate_pages+0x496/0x568
> > ?[<000000000021e24e>] compact_zone+0x432/0x7d8
> > ?[<000000000021e772>] compact_zone_order+0x9e/0xbc
> > ?[<000000000021ed52>] try_to_compact_pages+0x1ba/0x24c
> > ?[<00000000001e1afa>] __alloc_pages_nodemask+0x86a/0xa64
> > ?[<000000000021c80c>] alloc_fresh_huge_page.clone.2+0x68/0x18c
> > ?[<000000000021cc4c>] set_max_huge_pages.clone.0+0xa4/0x1ac
> > ?[<000000000021ce06>] hugetlb_sysctl_handler+0xb2/0xcc
> > ?[<00000000002a6572>] proc_sys_call_handler+0xe6/0x10c
> > ?[<00000000002a65be>] proc_sys_write+0x26/0x34
> > ?[<00000000002336e0>] vfs_write+0xac/0x18c
> > ?[<00000000002338bc>] SyS_write+0x58/0xa8
> > ?[<0000000000113976>] sysc_noemu+0x16/0x1c
> > ?[<0000020000162edc>] 0x20000162edc
> > INFO: lockdep is turned off.
> >
> > I honestly do not understand 100% why this is a false positive, seeing that
> > e.g. find_get_page() will also use radix_tree_deref_slot(), holding only the
> > rcu_read_lock, while migrate_page_move_mapping() has no rcu_read_lock() but
> > the &mapping->tree_lock instead. So I'm not quite sure how to fix this
> > properly, but simply adding rcu_read_lock/unlock() to the affected code paths,
> > even if it is not necessary for synchronization, would get rid of the warning,
> > like in the following patch. Any ideas?
>
> In case of anon page, we hold rcu_read_lock in unmap_and_move.
> The problem is file-backed page. In case of that, we hold lock_page
> and mapping->tree_lock as update-side lock.
> So we don't need rcu_read_lock.
>
> >
> > ---
> > ?fs/hugetlbfs/inode.c | ? ?2 ++
> > ?mm/migrate.c ? ? ? ? | ? ?4 ++++
> > ?2 files changed, 6 insertions(+)
> >
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -580,7 +580,9 @@ static int hugetlbfs_migrate_page(struct
> > ?{
> > ? ? ? ?int rc;
> >
> > + ? ? ? rcu_read_lock();
> > ? ? ? ?rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> > + ? ? ? rcu_read_unlock();
> > ? ? ? ?if (rc)
> > ? ? ? ? ? ? ? ?return rc;
> > ? ? ? ?migrate_page_copy(newpage, page);
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -417,7 +417,9 @@ int migrate_page(struct address_space *m
> >
> > ? ? ? ?BUG_ON(PageWriteback(page)); ? ?/* Writeback must be complete */
> >
> > + ? ? ? rcu_read_lock();
> > ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> > + ? ? ? rcu_read_unlock();
> >
> > ? ? ? ?if (rc)
> > ? ? ? ? ? ? ? ?return rc;
> > @@ -444,7 +446,9 @@ int buffer_migrate_page(struct address_s
> >
> > ? ? ? ?head = page_buffers(page);
> >
> > + ? ? ? rcu_read_lock();
> > ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> > + ? ? ? rcu_read_unlock();
> >
> > ? ? ? ?if (rc)
> > ? ? ? ? ? ? ? ?return rc;
> >
> >
> >
>
>
> How about this?
> Maybe Paul have better idea.
> (It's apparently be word-wrapped.)
>

heh, I wrote a patch almost identical to this and ran it overnight for testing
(test was a memory consumer running while a parallel process grew and shrunk
the hugepage pool). It passes but that is hardly a surprise. We differed
slightly in a number of respects though.

> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index ab2baa5..135af1e 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -146,6 +146,20 @@ static inline void *radix_tree_deref_slot(void **pslot)
> }
>
> /**
> + * radix_tree_deref_slot_nocheck - dereference a slot without RCU check
> + * @pslot: pointer to slot, returned by radix_tree_lookup_slot
> + * Returns: item that was stored in that slot with any direct pointer flag
> + * removed.
> + *
> + * This functions works like radix_tree_deref_slot except it doesn't check
> + * RCU rule. Normally this funcion is used with update-side lock.
> + * You should use this function very carefully.
> + */
> +static inline void *radix_tree_deref_slot_nocheck(void **pslot)
> +{
> + return rcu_dereference_protected(*pslot, 1);
> +}

For this, I had

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index ab2baa5..252d21c 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -146,6 +146,25 @@ static inline void *radix_tree_deref_slot(void **pslot)
}

/**
+ * radix_tree_deref_slot_protected - dereference a slot without RCU lock but with tree lock held
+ * @pslot: pointer to slot, returned by radix_tree_lookup_slot
+ * Returns: item that was stored in that slot with any direct pointer flag
+ * removed.
+ *
+ * For use with radix_tree_lookup_slot(). Caller must hold tree read
+ * locked across slot lookup and dereference. Not required if write lock is
+ * held (ie. items cannot be concurrently inserted).
+ *
+ * radix_tree_deref_retry must be used to confirm validity of the pointer if
+ * only the read lock is held.
+ */
+static inline void *radix_tree_deref_slot_protected(void **pslot,
+ spinlock_t *treelock)
+{
+ return rcu_dereference_protected(*pslot, lockdep_is_held(treelock));
+}
+
+/**
* radix_tree_deref_retry - check radix_tree_deref_slot
* @arg: pointer returned by radix_tree_deref_slot
* Returns: 0 if retry is not required, otherwise retry is required

In the documentation, I noted that the check might be without RCU but with
the knowledge that it's protected by the tree lock. I'm not a RCU expert
but this is only safe when you know there isn't a parallel updater and the
treelock should be preventing that, right?

Even so, other users of rcu_dereference_protected() check a lock condition
which I used tree lock for. I intended to read through the rest of
documentation properly this morning to determine if this was indeed the
right approach.

I used the name _protected instead of _nocheck because the dereference
is still protected (by the tree lock) just not by RCU. Again, have to
check the documentation to ensure this is correct.

> +/**
> * radix_tree_deref_retry - check radix_tree_deref_slot
> * @arg: pointer returned by radix_tree_deref_slot
> * Returns: 0 if retry is not required, otherwise retry is required
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 2eb2243..5be2841 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -244,7 +244,8 @@ static int migrate_page_move_mapping(struct
> address_space *mappin
>
> expected_count = 2 + page_has_private(page);
> if (page_count(page) != expected_count ||
> - (struct page *)radix_tree_deref_slot(pslot) != page) {
> + (struct page *)radix_tree_deref_slot_nocheck(pslot)
> + != page) {
> spin_unlock_irq(&mapping->tree_lock);
> return -EAGAIN;
> }

We only differed here by my passing in the &mapping->tree_lock



> @@ -316,7 +317,8 @@ int migrate_huge_page_move_mapping(struct
> address_space *mapping,
>
> expected_count = 2 + page_has_private(page);
> if (page_count(page) != expected_count ||
> - (struct page *)radix_tree_deref_slot(pslot) != page) {
> + (struct page *)radix_tree_deref_slot_nocheck(pslot)
> + != page) {
> spin_unlock_irq(&mapping->tree_lock);
> return -EAGAIN;
> }

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-12-17 09:28:28

by Mel Gorman

[permalink] [raw]
Subject: Re: [BUG?] memory hotplug: include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!

On Fri, Dec 17, 2010 at 08:39:12AM +0000, Mel Gorman wrote:
> On Fri, Dec 17, 2010 at 09:04:13AM +0900, Minchan Kim wrote:
> > On Thu, Dec 16, 2010 at 10:50 PM, Gerald Schaefer
> > <[email protected]> wrote:
> > > I got the same warning now after increasing /proc/sys/vm/nr_hugepages, see
> > > below. Both cases are easily reproducible: memory unplug with big page cache,
> > > or adding large pages during run-time.
> > >
> > > ===================================================
> > > [ INFO: suspicious rcu_dereference_check() usage. ]
> > > ---------------------------------------------------
> > > include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
> > >
> > > other info that might help us debug this:
> > >
> > >
> > > rcu_scheduler_active = 1, debug_locks = 0
> > > 1 lock held by bash/761:
> > > ?#0: ?(&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<00000000002263ae>] migrate_page_move_mapping+0x4a/0x2d8
> > >
> > > stack backtrace:
> > > CPU: 1 Not tainted 2.6.37-rc6 #4
> > > Process bash (pid: 761, task: 00000000181b5540, ksp: 00000000181bb7f8)
> > > 00000000181bb818 00000000181bb798 0000000000000002 0000000000000000
> > > ? ? ? 00000000181bb838 00000000181bb7b0 00000000181bb7b0 000000000056bafa
> > > ? ? ? 0000000000000000 000000003f42bdf0 0000000000000002 000000001c43be30
> > > ? ? ? 000003e00000000d 000003e00000000c 00000000181bb800 0000000000000000
> > > ? ? ? 0000000000000000 0000000000100bfa 00000000181bb798 00000000181bb7d8
> > > Call Trace:
> > > ([<0000000000100b02>] show_trace+0xee/0x144)
> > > ?[<000000000022654e>] migrate_page_move_mapping+0x1ea/0x2d8
> > > ?[<0000000000226c80>] migrate_page+0x38/0x68
> > > ?[<0000000000226d9a>] move_to_new_page+0xea/0x2bc
> > > ?[<000000000022785a>] migrate_pages+0x496/0x568
> > > ?[<000000000021e24e>] compact_zone+0x432/0x7d8
> > > ?[<000000000021e772>] compact_zone_order+0x9e/0xbc
> > > ?[<000000000021ed52>] try_to_compact_pages+0x1ba/0x24c
> > > ?[<00000000001e1afa>] __alloc_pages_nodemask+0x86a/0xa64
> > > ?[<000000000021c80c>] alloc_fresh_huge_page.clone.2+0x68/0x18c
> > > ?[<000000000021cc4c>] set_max_huge_pages.clone.0+0xa4/0x1ac
> > > ?[<000000000021ce06>] hugetlb_sysctl_handler+0xb2/0xcc
> > > ?[<00000000002a6572>] proc_sys_call_handler+0xe6/0x10c
> > > ?[<00000000002a65be>] proc_sys_write+0x26/0x34
> > > ?[<00000000002336e0>] vfs_write+0xac/0x18c
> > > ?[<00000000002338bc>] SyS_write+0x58/0xa8
> > > ?[<0000000000113976>] sysc_noemu+0x16/0x1c
> > > ?[<0000020000162edc>] 0x20000162edc
> > > INFO: lockdep is turned off.
> > >
> > > I honestly do not understand 100% why this is a false positive, seeing that
> > > e.g. find_get_page() will also use radix_tree_deref_slot(), holding only the
> > > rcu_read_lock, while migrate_page_move_mapping() has no rcu_read_lock() but
> > > the &mapping->tree_lock instead. So I'm not quite sure how to fix this
> > > properly, but simply adding rcu_read_lock/unlock() to the affected code paths,
> > > even if it is not necessary for synchronization, would get rid of the warning,
> > > like in the following patch. Any ideas?
> >
> > In case of anon page, we hold rcu_read_lock in unmap_and_move.
> > The problem is file-backed page. In case of that, we hold lock_page
> > and mapping->tree_lock as update-side lock.
> > So we don't need rcu_read_lock.
> >
> > >
> > > ---
> > > ?fs/hugetlbfs/inode.c | ? ?2 ++
> > > ?mm/migrate.c ? ? ? ? | ? ?4 ++++
> > > ?2 files changed, 6 insertions(+)
> > >
> > > --- a/fs/hugetlbfs/inode.c
> > > +++ b/fs/hugetlbfs/inode.c
> > > @@ -580,7 +580,9 @@ static int hugetlbfs_migrate_page(struct
> > > ?{
> > > ? ? ? ?int rc;
> > >
> > > + ? ? ? rcu_read_lock();
> > > ? ? ? ?rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> > > + ? ? ? rcu_read_unlock();
> > > ? ? ? ?if (rc)
> > > ? ? ? ? ? ? ? ?return rc;
> > > ? ? ? ?migrate_page_copy(newpage, page);
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -417,7 +417,9 @@ int migrate_page(struct address_space *m
> > >
> > > ? ? ? ?BUG_ON(PageWriteback(page)); ? ?/* Writeback must be complete */
> > >
> > > + ? ? ? rcu_read_lock();
> > > ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> > > + ? ? ? rcu_read_unlock();
> > >
> > > ? ? ? ?if (rc)
> > > ? ? ? ? ? ? ? ?return rc;
> > > @@ -444,7 +446,9 @@ int buffer_migrate_page(struct address_s
> > >
> > > ? ? ? ?head = page_buffers(page);
> > >
> > > + ? ? ? rcu_read_lock();
> > > ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> > > + ? ? ? rcu_read_unlock();
> > >
> > > ? ? ? ?if (rc)
> > > ? ? ? ? ? ? ? ?return rc;
> > >
> > >
> > >
> >
> >
> > How about this?
> > Maybe Paul have better idea.
> > (It's apparently be word-wrapped.)
> >
>
> heh, I wrote a patch almost identical to this and ran it overnight for testing
> (test was a memory consumer running while a parallel process grew and shrunk
> the hugepage pool). It passes but that is hardly a surprise. We differed
> slightly in a number of respects though.
>

For completeness, this is what I tested last night. There are two "confirms"
in the changelog that I intended to work out today but maybe someone can
confirm faster.

==== CUT HERE ====
mm: migration: Use rcu_dereference_protected when dereferencing the radix tree slot during file page migration

migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for anonymous
pages, as introduced by git commit 989f89c57e6361e7d16fbd9572b5da7d313b073d.
The point of the RCU protection there is part of getting a stable reference
to anon_vma and is only held for anon pages as file pages are locked
which is sufficient protection against freeing.

However, while a file page's mapping is being migrated, the radix tree
is double checked to ensure it is the expected page. This uses
radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held
triggering the following warning.

[ 173.674290] ===================================================
[ 173.676016] [ INFO: suspicious rcu_dereference_check() usage. ]
[ 173.676016] ---------------------------------------------------
[ 173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
[ 173.676016]
[ 173.676016] other info that might help us debug this:
[ 173.676016]
[ 173.676016]
[ 173.676016] rcu_scheduler_active = 1, debug_locks = 0
[ 173.676016] 1 lock held by hugeadm/2899:
[ 173.676016] #0: (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c10e3d2b>] migrate_page_move_mapping+0x40/0x1ab
[ 173.676016]
[ 173.676016] stack backtrace:
[ 173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild
[ 173.676016] Call Trace:
[ 173.676016] [<c128cc01>] ? printk+0x14/0x1b
[ 173.676016] [<c1063502>] lockdep_rcu_dereference+0x7d/0x86
[ 173.676016] [<c10e3db5>] migrate_page_move_mapping+0xca/0x1ab
[ 173.676016] [<c10e41ad>] migrate_page+0x23/0x39
[ 173.676016] [<c10e491b>] buffer_migrate_page+0x22/0x107
[ 173.676016] [<c10e48f9>] ? buffer_migrate_page+0x0/0x107
[ 173.676016] [<c10e425d>] move_to_new_page+0x9a/0x1ae
[ 173.676016] [<c10e47e6>] migrate_pages+0x1e7/0x2fa

This patch introduces radix_tree_deref_slot_protected() which calls
rcu_dereference_protected(). Users of it must pass in the mapping->tree_lock
that is protecting this dereference. Based on the locking hierarchy described
in mm/filemap.c, holding the tree lock is protecting the radix tree from
concurrent updaters in all cases (Confirm that no case has been missed).
According to Documentation/RCU/lockdep.txt, if there is a guarantee that
no parallel updaters exist, use of rcu_dereference_protected() is allowed
(Confirm this is accurate?).

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/radix-tree.h | 19 +++++++++++++++++++
mm/migrate.c | 4 ++--
2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index ab2baa5..252d21c 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -146,6 +146,25 @@ static inline void *radix_tree_deref_slot(void **pslot)
}

/**
+ * radix_tree_deref_slot_protected - dereference a slot without RCU lock but with tree lock held
+ * @pslot: pointer to slot, returned by radix_tree_lookup_slot
+ * Returns: item that was stored in that slot with any direct pointer flag
+ * removed.
+ *
+ * For use with radix_tree_lookup_slot(). Caller must hold tree read
+ * locked across slot lookup and dereference. Not required if write lock is
+ * held (ie. items cannot be concurrently inserted).
+ *
+ * radix_tree_deref_retry must be used to confirm validity of the pointer if
+ * only the read lock is held.
+ */
+static inline void *radix_tree_deref_slot_protected(void **pslot,
+ spinlock_t *treelock)
+{
+ return rcu_dereference_protected(*pslot, lockdep_is_held(treelock));
+}
+
+/**
* radix_tree_deref_retry - check radix_tree_deref_slot
* @arg: pointer returned by radix_tree_deref_slot
* Returns: 0 if retry is not required, otherwise retry is required
diff --git a/mm/migrate.c b/mm/migrate.c
index fe5a3c6..7d4686a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -244,7 +244,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,

expected_count = 2 + page_has_private(page);
if (page_count(page) != expected_count ||
- (struct page *)radix_tree_deref_slot(pslot) != page) {
+ (struct page *)radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
spin_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}
@@ -316,7 +316,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,

expected_count = 2 + page_has_private(page);
if (page_count(page) != expected_count ||
- (struct page *)radix_tree_deref_slot(pslot) != page) {
+ (struct page *)radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
spin_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
}

2010-12-17 15:08:27

by Minchan Kim

[permalink] [raw]
Subject: Re: [BUG?] memory hotplug: include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!

On Thu, Dec 16, 2010 at 09:47:22PM -0800, Paul E. McKenney wrote:
> On Fri, Dec 17, 2010 at 09:04:13AM +0900, Minchan Kim wrote:
> > On Thu, Dec 16, 2010 at 10:50 PM, Gerald Schaefer
> > <[email protected]> wrote:
> > > I got the same warning now after increasing /proc/sys/vm/nr_hugepages, see
> > > below. Both cases are easily reproducible: memory unplug with big page cache,
> > > or adding large pages during run-time.
> > >
> > > ===================================================
> > > [ INFO: suspicious rcu_dereference_check() usage. ]
> > > ---------------------------------------------------
> > > include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
> > >
> > > other info that might help us debug this:
> > >
> > >
> > > rcu_scheduler_active = 1, debug_locks = 0
> > > 1 lock held by bash/761:
> > > ?#0: ?(&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<00000000002263ae>] migrate_page_move_mapping+0x4a/0x2d8
> > >
> > > stack backtrace:
> > > CPU: 1 Not tainted 2.6.37-rc6 #4
> > > Process bash (pid: 761, task: 00000000181b5540, ksp: 00000000181bb7f8)
> > > 00000000181bb818 00000000181bb798 0000000000000002 0000000000000000
> > > ? ? ? 00000000181bb838 00000000181bb7b0 00000000181bb7b0 000000000056bafa
> > > ? ? ? 0000000000000000 000000003f42bdf0 0000000000000002 000000001c43be30
> > > ? ? ? 000003e00000000d 000003e00000000c 00000000181bb800 0000000000000000
> > > ? ? ? 0000000000000000 0000000000100bfa 00000000181bb798 00000000181bb7d8
> > > Call Trace:
> > > ([<0000000000100b02>] show_trace+0xee/0x144)
> > > ?[<000000000022654e>] migrate_page_move_mapping+0x1ea/0x2d8
> > > ?[<0000000000226c80>] migrate_page+0x38/0x68
> > > ?[<0000000000226d9a>] move_to_new_page+0xea/0x2bc
> > > ?[<000000000022785a>] migrate_pages+0x496/0x568
> > > ?[<000000000021e24e>] compact_zone+0x432/0x7d8
> > > ?[<000000000021e772>] compact_zone_order+0x9e/0xbc
> > > ?[<000000000021ed52>] try_to_compact_pages+0x1ba/0x24c
> > > ?[<00000000001e1afa>] __alloc_pages_nodemask+0x86a/0xa64
> > > ?[<000000000021c80c>] alloc_fresh_huge_page.clone.2+0x68/0x18c
> > > ?[<000000000021cc4c>] set_max_huge_pages.clone.0+0xa4/0x1ac
> > > ?[<000000000021ce06>] hugetlb_sysctl_handler+0xb2/0xcc
> > > ?[<00000000002a6572>] proc_sys_call_handler+0xe6/0x10c
> > > ?[<00000000002a65be>] proc_sys_write+0x26/0x34
> > > ?[<00000000002336e0>] vfs_write+0xac/0x18c
> > > ?[<00000000002338bc>] SyS_write+0x58/0xa8
> > > ?[<0000000000113976>] sysc_noemu+0x16/0x1c
> > > ?[<0000020000162edc>] 0x20000162edc
> > > INFO: lockdep is turned off.
> > >
> > > I honestly do not understand 100% why this is a false positive, seeing that
> > > e.g. find_get_page() will also use radix_tree_deref_slot(), holding only the
> > > rcu_read_lock, while migrate_page_move_mapping() has no rcu_read_lock() but
> > > the &mapping->tree_lock instead. So I'm not quite sure how to fix this
> > > properly, but simply adding rcu_read_lock/unlock() to the affected code paths,
> > > even if it is not necessary for synchronization, would get rid of the warning,
> > > like in the following patch. Any ideas?
> >
> > In case of anon page, we hold rcu_read_lock in unmap_and_move.
> > The problem is file-backed page. In case of that, we hold lock_page
> > and mapping->tree_lock as update-side lock.
> > So we don't need rcu_read_lock.
> >
> > >
> > > ---
> > > ?fs/hugetlbfs/inode.c | ? ?2 ++
> > > ?mm/migrate.c ? ? ? ? | ? ?4 ++++
> > > ?2 files changed, 6 insertions(+)
> > >
> > > --- a/fs/hugetlbfs/inode.c
> > > +++ b/fs/hugetlbfs/inode.c
> > > @@ -580,7 +580,9 @@ static int hugetlbfs_migrate_page(struct
> > > ?{
> > > ? ? ? ?int rc;
> > >
> > > + ? ? ? rcu_read_lock();
> > > ? ? ? ?rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> > > + ? ? ? rcu_read_unlock();
> > > ? ? ? ?if (rc)
> > > ? ? ? ? ? ? ? ?return rc;
> > > ? ? ? ?migrate_page_copy(newpage, page);
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -417,7 +417,9 @@ int migrate_page(struct address_space *m
> > >
> > > ? ? ? ?BUG_ON(PageWriteback(page)); ? ?/* Writeback must be complete */
> > >
> > > + ? ? ? rcu_read_lock();
> > > ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> > > + ? ? ? rcu_read_unlock();
> > >
> > > ? ? ? ?if (rc)
> > > ? ? ? ? ? ? ? ?return rc;
> > > @@ -444,7 +446,9 @@ int buffer_migrate_page(struct address_s
> > >
> > > ? ? ? ?head = page_buffers(page);
> > >
> > > + ? ? ? rcu_read_lock();
> > > ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> > > + ? ? ? rcu_read_unlock();
> > >
> > > ? ? ? ?if (rc)
> > > ? ? ? ? ? ? ? ?return rc;
> > >
> > >
> > >
> >
> >
> > How about this?
> > Maybe Paul have better idea.
> > (It's apparently be word-wrapped.)
> >
> > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > index ab2baa5..135af1e 100644
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -146,6 +146,20 @@ static inline void *radix_tree_deref_slot(void **pslot)
> > }
> >
> > /**
> > + * radix_tree_deref_slot_nocheck - dereference a slot without RCU check
> > + * @pslot: pointer to slot, returned by radix_tree_lookup_slot
> > + * Returns: item that was stored in that slot with any direct pointer flag
> > + * removed.
> > + *
> > + * This functions works like radix_tree_deref_slot except it doesn't check
> > + * RCU rule. Normally this funcion is used with update-side lock.
> > + * You should use this function very carefully.
> > + */
> > +static inline void *radix_tree_deref_slot_nocheck(void **pslot)
> > +{
> > + return rcu_dereference_protected(*pslot, 1);
>
> I suggest replacing the "1" with lockdep expressions for the locks
> that you say might be held:

It's not hard.

>
> return rcu_dereference_check(*pslot,
> lockdep_is_held(&mapping->tree_lock));

rcu_dereference_check still pass rcu_read_lock_held check we don't want.
I think rcu_dereference_protected is proper.

Why I don't add lockdep expressions is radix_tree_deref_slot is general API.
It might be used anywhere where it doesn't related to mapping->tree_lock.
If we add argument 'mapping', it has a very strong dependency with address_space.
so I decided making the function general and then caller must use it very carefully.
But I am not strong in this point.

>
> This assumes that when you said "and" you meant both lock_page() and
> mapping->tree_lock. Also you need to pass in the mapping, which
> should not be a problem given likely inlining.
>
> If you meant that either mapping->tree_lock or page_lock() might be
> held, I suppose that the page_lock() state could be passed in, but
> perhaps better to take a general lockdep expression.
>
> So, either or both? ;-)
>
> Thanx, Paul

I think either is okay. That's because remove_from_page_cache/__remove_from_page_cache
needs both locks so we can't prevent update if we get a either lock.
In code context, I think mapping->tree_lock is more readable since it is used near by.

Thanks for the comment, Paul.

--
Kind regards,
Minchan Kim

2010-12-17 15:14:09

by Minchan Kim

[permalink] [raw]
Subject: Re: [BUG?] memory hotplug: include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!

On Fri, Dec 17, 2010 at 08:39:12AM +0000, Mel Gorman wrote:
> On Fri, Dec 17, 2010 at 09:04:13AM +0900, Minchan Kim wrote:
> > On Thu, Dec 16, 2010 at 10:50 PM, Gerald Schaefer
> > <[email protected]> wrote:
> > > I got the same warning now after increasing /proc/sys/vm/nr_hugepages, see
> > > below. Both cases are easily reproducible: memory unplug with big page cache,
> > > or adding large pages during run-time.
> > >
> > > ===================================================
> > > [ INFO: suspicious rcu_dereference_check() usage. ]
> > > ---------------------------------------------------
> > > include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
> > >
> > > other info that might help us debug this:
> > >
> > >
> > > rcu_scheduler_active = 1, debug_locks = 0
> > > 1 lock held by bash/761:
> > > ?#0: ?(&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<00000000002263ae>] migrate_page_move_mapping+0x4a/0x2d8
> > >
> > > stack backtrace:
> > > CPU: 1 Not tainted 2.6.37-rc6 #4
> > > Process bash (pid: 761, task: 00000000181b5540, ksp: 00000000181bb7f8)
> > > 00000000181bb818 00000000181bb798 0000000000000002 0000000000000000
> > > ? ? ? 00000000181bb838 00000000181bb7b0 00000000181bb7b0 000000000056bafa
> > > ? ? ? 0000000000000000 000000003f42bdf0 0000000000000002 000000001c43be30
> > > ? ? ? 000003e00000000d 000003e00000000c 00000000181bb800 0000000000000000
> > > ? ? ? 0000000000000000 0000000000100bfa 00000000181bb798 00000000181bb7d8
> > > Call Trace:
> > > ([<0000000000100b02>] show_trace+0xee/0x144)
> > > ?[<000000000022654e>] migrate_page_move_mapping+0x1ea/0x2d8
> > > ?[<0000000000226c80>] migrate_page+0x38/0x68
> > > ?[<0000000000226d9a>] move_to_new_page+0xea/0x2bc
> > > ?[<000000000022785a>] migrate_pages+0x496/0x568
> > > ?[<000000000021e24e>] compact_zone+0x432/0x7d8
> > > ?[<000000000021e772>] compact_zone_order+0x9e/0xbc
> > > ?[<000000000021ed52>] try_to_compact_pages+0x1ba/0x24c
> > > ?[<00000000001e1afa>] __alloc_pages_nodemask+0x86a/0xa64
> > > ?[<000000000021c80c>] alloc_fresh_huge_page.clone.2+0x68/0x18c
> > > ?[<000000000021cc4c>] set_max_huge_pages.clone.0+0xa4/0x1ac
> > > ?[<000000000021ce06>] hugetlb_sysctl_handler+0xb2/0xcc
> > > ?[<00000000002a6572>] proc_sys_call_handler+0xe6/0x10c
> > > ?[<00000000002a65be>] proc_sys_write+0x26/0x34
> > > ?[<00000000002336e0>] vfs_write+0xac/0x18c
> > > ?[<00000000002338bc>] SyS_write+0x58/0xa8
> > > ?[<0000000000113976>] sysc_noemu+0x16/0x1c
> > > ?[<0000020000162edc>] 0x20000162edc
> > > INFO: lockdep is turned off.
> > >
> > > I honestly do not understand 100% why this is a false positive, seeing that
> > > e.g. find_get_page() will also use radix_tree_deref_slot(), holding only the
> > > rcu_read_lock, while migrate_page_move_mapping() has no rcu_read_lock() but
> > > the &mapping->tree_lock instead. So I'm not quite sure how to fix this
> > > properly, but simply adding rcu_read_lock/unlock() to the affected code paths,
> > > even if it is not necessary for synchronization, would get rid of the warning,
> > > like in the following patch. Any ideas?
> >
> > In case of anon page, we hold rcu_read_lock in unmap_and_move.
> > The problem is file-backed page. In case of that, we hold lock_page
> > and mapping->tree_lock as update-side lock.
> > So we don't need rcu_read_lock.
> >
> > >
> > > ---
> > > ?fs/hugetlbfs/inode.c | ? ?2 ++
> > > ?mm/migrate.c ? ? ? ? | ? ?4 ++++
> > > ?2 files changed, 6 insertions(+)
> > >
> > > --- a/fs/hugetlbfs/inode.c
> > > +++ b/fs/hugetlbfs/inode.c
> > > @@ -580,7 +580,9 @@ static int hugetlbfs_migrate_page(struct
> > > ?{
> > > ? ? ? ?int rc;
> > >
> > > + ? ? ? rcu_read_lock();
> > > ? ? ? ?rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> > > + ? ? ? rcu_read_unlock();
> > > ? ? ? ?if (rc)
> > > ? ? ? ? ? ? ? ?return rc;
> > > ? ? ? ?migrate_page_copy(newpage, page);
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -417,7 +417,9 @@ int migrate_page(struct address_space *m
> > >
> > > ? ? ? ?BUG_ON(PageWriteback(page)); ? ?/* Writeback must be complete */
> > >
> > > + ? ? ? rcu_read_lock();
> > > ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> > > + ? ? ? rcu_read_unlock();
> > >
> > > ? ? ? ?if (rc)
> > > ? ? ? ? ? ? ? ?return rc;
> > > @@ -444,7 +446,9 @@ int buffer_migrate_page(struct address_s
> > >
> > > ? ? ? ?head = page_buffers(page);
> > >
> > > + ? ? ? rcu_read_lock();
> > > ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> > > + ? ? ? rcu_read_unlock();
> > >
> > > ? ? ? ?if (rc)
> > > ? ? ? ? ? ? ? ?return rc;
> > >
> > >
> > >
> >
> >
> > How about this?
> > Maybe Paul have better idea.
> > (It's apparently be word-wrapped.)
> >
>
> heh, I wrote a patch almost identical to this and ran it overnight for testing
> (test was a memory consumer running while a parallel process grew and shrunk
> the hugepage pool). It passes but that is hardly a surprise. We differed

Good.

> slightly in a number of respects though.
>
> > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > index ab2baa5..135af1e 100644
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -146,6 +146,20 @@ static inline void *radix_tree_deref_slot(void **pslot)
> > }
> >
> > /**
> > + * radix_tree_deref_slot_nocheck - dereference a slot without RCU check
> > + * @pslot: pointer to slot, returned by radix_tree_lookup_slot
> > + * Returns: item that was stored in that slot with any direct pointer flag
> > + * removed.
> > + *
> > + * This functions works like radix_tree_deref_slot except it doesn't check
> > + * RCU rule. Normally this funcion is used with update-side lock.
> > + * You should use this function very carefully.
> > + */
> > +static inline void *radix_tree_deref_slot_nocheck(void **pslot)
> > +{
> > + return rcu_dereference_protected(*pslot, 1);
> > +}
>
> For this, I had
>
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index ab2baa5..252d21c 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -146,6 +146,25 @@ static inline void *radix_tree_deref_slot(void **pslot)
> }
>
> /**
> + * radix_tree_deref_slot_protected - dereference a slot without RCU lock but with tree lock held
> + * @pslot: pointer to slot, returned by radix_tree_lookup_slot
> + * Returns: item that was stored in that slot with any direct pointer flag
> + * removed.
> + *
> + * For use with radix_tree_lookup_slot(). Caller must hold tree read
> + * locked across slot lookup and dereference. Not required if write lock is
> + * held (ie. items cannot be concurrently inserted).
> + *
> + * radix_tree_deref_retry must be used to confirm validity of the pointer if
> + * only the read lock is held.
> + */
> +static inline void *radix_tree_deref_slot_protected(void **pslot,
> + spinlock_t *treelock)
> +{
> + return rcu_dereference_protected(*pslot, lockdep_is_held(treelock));
> +}
> +
> +/**
> * radix_tree_deref_retry - check radix_tree_deref_slot
> * @arg: pointer returned by radix_tree_deref_slot
> * Returns: 0 if retry is not required, otherwise retry is required
>
> In the documentation, I noted that the check might be without RCU but with
> the knowledge that it's protected by the tree lock. I'm not a RCU expert
> but this is only safe when you know there isn't a parallel updater and the
> treelock should be preventing that, right?

I think so.

>
> Even so, other users of rcu_dereference_protected() check a lock condition
> which I used tree lock for. I intended to read through the rest of
> documentation properly this morning to determine if this was indeed the
> right approach.
>
> I used the name _protected instead of _nocheck because the dereference
> is still protected (by the tree lock) just not by RCU. Again, have to
> check the documentation to ensure this is correct.

I agree _proected naming is good.

>
> > +/**
> > * radix_tree_deref_retry - check radix_tree_deref_slot
> > * @arg: pointer returned by radix_tree_deref_slot
> > * Returns: 0 if retry is not required, otherwise retry is required
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 2eb2243..5be2841 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -244,7 +244,8 @@ static int migrate_page_move_mapping(struct
> > address_space *mappin
> >
> > expected_count = 2 + page_has_private(page);
> > if (page_count(page) != expected_count ||
> > - (struct page *)radix_tree_deref_slot(pslot) != page) {
> > + (struct page *)radix_tree_deref_slot_nocheck(pslot)
> > + != page) {
> > spin_unlock_irq(&mapping->tree_lock);
> > return -EAGAIN;
> > }
>
> We only differed here by my passing in the &mapping->tree_lock

I will add the comment in your formal patch.

>
>
>
> > @@ -316,7 +317,8 @@ int migrate_huge_page_move_mapping(struct
> > address_space *mapping,
> >
> > expected_count = 2 + page_has_private(page);
> > if (page_count(page) != expected_count ||
> > - (struct page *)radix_tree_deref_slot(pslot) != page) {
> > + (struct page *)radix_tree_deref_slot_nocheck(pslot)
> > + != page) {
> > spin_unlock_irq(&mapping->tree_lock);
> > return -EAGAIN;
> > }
>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab

--
Kind regards,
Minchan Kim

2010-12-17 15:22:42

by Minchan Kim

[permalink] [raw]
Subject: Re: [BUG?] memory hotplug: include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!

On Fri, Dec 17, 2010 at 09:28:28AM +0000, Mel Gorman wrote:
> On Fri, Dec 17, 2010 at 08:39:12AM +0000, Mel Gorman wrote:
> > On Fri, Dec 17, 2010 at 09:04:13AM +0900, Minchan Kim wrote:
> > > On Thu, Dec 16, 2010 at 10:50 PM, Gerald Schaefer
> > > <[email protected]> wrote:
> > > > I got the same warning now after increasing /proc/sys/vm/nr_hugepages, see
> > > > below. Both cases are easily reproducible: memory unplug with big page cache,
> > > > or adding large pages during run-time.
> > > >
> > > > ===================================================
> > > > [ INFO: suspicious rcu_dereference_check() usage. ]
> > > > ---------------------------------------------------
> > > > include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
> > > >
> > > > other info that might help us debug this:
> > > >
> > > >
> > > > rcu_scheduler_active = 1, debug_locks = 0
> > > > 1 lock held by bash/761:
> > > > ?#0: ?(&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<00000000002263ae>] migrate_page_move_mapping+0x4a/0x2d8
> > > >
> > > > stack backtrace:
> > > > CPU: 1 Not tainted 2.6.37-rc6 #4
> > > > Process bash (pid: 761, task: 00000000181b5540, ksp: 00000000181bb7f8)
> > > > 00000000181bb818 00000000181bb798 0000000000000002 0000000000000000
> > > > ? ? ? 00000000181bb838 00000000181bb7b0 00000000181bb7b0 000000000056bafa
> > > > ? ? ? 0000000000000000 000000003f42bdf0 0000000000000002 000000001c43be30
> > > > ? ? ? 000003e00000000d 000003e00000000c 00000000181bb800 0000000000000000
> > > > ? ? ? 0000000000000000 0000000000100bfa 00000000181bb798 00000000181bb7d8
> > > > Call Trace:
> > > > ([<0000000000100b02>] show_trace+0xee/0x144)
> > > > ?[<000000000022654e>] migrate_page_move_mapping+0x1ea/0x2d8
> > > > ?[<0000000000226c80>] migrate_page+0x38/0x68
> > > > ?[<0000000000226d9a>] move_to_new_page+0xea/0x2bc
> > > > ?[<000000000022785a>] migrate_pages+0x496/0x568
> > > > ?[<000000000021e24e>] compact_zone+0x432/0x7d8
> > > > ?[<000000000021e772>] compact_zone_order+0x9e/0xbc
> > > > ?[<000000000021ed52>] try_to_compact_pages+0x1ba/0x24c
> > > > ?[<00000000001e1afa>] __alloc_pages_nodemask+0x86a/0xa64
> > > > ?[<000000000021c80c>] alloc_fresh_huge_page.clone.2+0x68/0x18c
> > > > ?[<000000000021cc4c>] set_max_huge_pages.clone.0+0xa4/0x1ac
> > > > ?[<000000000021ce06>] hugetlb_sysctl_handler+0xb2/0xcc
> > > > ?[<00000000002a6572>] proc_sys_call_handler+0xe6/0x10c
> > > > ?[<00000000002a65be>] proc_sys_write+0x26/0x34
> > > > ?[<00000000002336e0>] vfs_write+0xac/0x18c
> > > > ?[<00000000002338bc>] SyS_write+0x58/0xa8
> > > > ?[<0000000000113976>] sysc_noemu+0x16/0x1c
> > > > ?[<0000020000162edc>] 0x20000162edc
> > > > INFO: lockdep is turned off.
> > > >
> > > > I honestly do not understand 100% why this is a false positive, seeing that
> > > > e.g. find_get_page() will also use radix_tree_deref_slot(), holding only the
> > > > rcu_read_lock, while migrate_page_move_mapping() has no rcu_read_lock() but
> > > > the &mapping->tree_lock instead. So I'm not quite sure how to fix this
> > > > properly, but simply adding rcu_read_lock/unlock() to the affected code paths,
> > > > even if it is not necessary for synchronization, would get rid of the warning,
> > > > like in the following patch. Any ideas?
> > >
> > > In case of anon page, we hold rcu_read_lock in unmap_and_move.
> > > The problem is file-backed page. In case of that, we hold lock_page
> > > and mapping->tree_lock as update-side lock.
> > > So we don't need rcu_read_lock.
> > >
> > > >
> > > > ---
> > > > ?fs/hugetlbfs/inode.c | ? ?2 ++
> > > > ?mm/migrate.c ? ? ? ? | ? ?4 ++++
> > > > ?2 files changed, 6 insertions(+)
> > > >
> > > > --- a/fs/hugetlbfs/inode.c
> > > > +++ b/fs/hugetlbfs/inode.c
> > > > @@ -580,7 +580,9 @@ static int hugetlbfs_migrate_page(struct
> > > > ?{
> > > > ? ? ? ?int rc;
> > > >
> > > > + ? ? ? rcu_read_lock();
> > > > ? ? ? ?rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> > > > + ? ? ? rcu_read_unlock();
> > > > ? ? ? ?if (rc)
> > > > ? ? ? ? ? ? ? ?return rc;
> > > > ? ? ? ?migrate_page_copy(newpage, page);
> > > > --- a/mm/migrate.c
> > > > +++ b/mm/migrate.c
> > > > @@ -417,7 +417,9 @@ int migrate_page(struct address_space *m
> > > >
> > > > ? ? ? ?BUG_ON(PageWriteback(page)); ? ?/* Writeback must be complete */
> > > >
> > > > + ? ? ? rcu_read_lock();
> > > > ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> > > > + ? ? ? rcu_read_unlock();
> > > >
> > > > ? ? ? ?if (rc)
> > > > ? ? ? ? ? ? ? ?return rc;
> > > > @@ -444,7 +446,9 @@ int buffer_migrate_page(struct address_s
> > > >
> > > > ? ? ? ?head = page_buffers(page);
> > > >
> > > > + ? ? ? rcu_read_lock();
> > > > ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> > > > + ? ? ? rcu_read_unlock();
> > > >
> > > > ? ? ? ?if (rc)
> > > > ? ? ? ? ? ? ? ?return rc;
> > > >
> > > >
> > > >
> > >
> > >
> > > How about this?
> > > Maybe Paul have better idea.
> > > (It's apparently be word-wrapped.)
> > >
> >
> > heh, I wrote a patch almost identical to this and ran it overnight for testing
> > (test was a memory consumer running while a parallel process grew and shrunk
> > the hugepage pool). It passes but that is hardly a surprise. We differed
> > slightly in a number of respects though.
> >
>
> For completeness, this is what I tested last night. There are two "confirms"
> in the changelog that I intended to work out today but maybe someone can
> confirm faster.
>
> ==== CUT HERE ====
> mm: migration: Use rcu_dereference_protected when dereferencing the radix tree slot during file page migration
>
> migrate_pages() -> unmap_and_move() only calls rcu_read_lock() for anonymous
> pages, as introduced by git commit 989f89c57e6361e7d16fbd9572b5da7d313b073d.
> The point of the RCU protection there is part of getting a stable reference
> to anon_vma and is only held for anon pages as file pages are locked
> which is sufficient protection against freeing.
>
> However, while a file page's mapping is being migrated, the radix tree
> is double checked to ensure it is the expected page. This uses
> radix_tree_deref_slot() -> rcu_dereference() without the RCU lock held
> triggering the following warning.
>
> [ 173.674290] ===================================================
> [ 173.676016] [ INFO: suspicious rcu_dereference_check() usage. ]
> [ 173.676016] ---------------------------------------------------
> [ 173.676016] include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
> [ 173.676016]
> [ 173.676016] other info that might help us debug this:
> [ 173.676016]
> [ 173.676016]
> [ 173.676016] rcu_scheduler_active = 1, debug_locks = 0
> [ 173.676016] 1 lock held by hugeadm/2899:
> [ 173.676016] #0: (&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<c10e3d2b>] migrate_page_move_mapping+0x40/0x1ab
> [ 173.676016]
> [ 173.676016] stack backtrace:
> [ 173.676016] Pid: 2899, comm: hugeadm Not tainted 2.6.37-rc5-autobuild
> [ 173.676016] Call Trace:
> [ 173.676016] [<c128cc01>] ? printk+0x14/0x1b
> [ 173.676016] [<c1063502>] lockdep_rcu_dereference+0x7d/0x86
> [ 173.676016] [<c10e3db5>] migrate_page_move_mapping+0xca/0x1ab
> [ 173.676016] [<c10e41ad>] migrate_page+0x23/0x39
> [ 173.676016] [<c10e491b>] buffer_migrate_page+0x22/0x107
> [ 173.676016] [<c10e48f9>] ? buffer_migrate_page+0x0/0x107
> [ 173.676016] [<c10e425d>] move_to_new_page+0x9a/0x1ae
> [ 173.676016] [<c10e47e6>] migrate_pages+0x1e7/0x2fa
>
> This patch introduces radix_tree_deref_slot_protected() which calls
> rcu_dereference_protected(). Users of it must pass in the mapping->tree_lock
> that is protecting this dereference. Based on the locking hierarchy described
> in mm/filemap.c, holding the tree lock is protecting the radix tree from
> concurrent updaters in all cases (Confirm that no case has been missed).
> According to Documentation/RCU/lockdep.txt, if there is a guarantee that
> no parallel updaters exist, use of rcu_dereference_protected() is allowed
> (Confirm this is accurate?).
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/radix-tree.h | 19 +++++++++++++++++++
> mm/migrate.c | 4 ++--
> 2 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index ab2baa5..252d21c 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -146,6 +146,25 @@ static inline void *radix_tree_deref_slot(void **pslot)
> }
>
> /**
> + * radix_tree_deref_slot_protected - dereference a slot without RCU lock but with tree lock held
> + * @pslot: pointer to slot, returned by radix_tree_lookup_slot
> + * Returns: item that was stored in that slot with any direct pointer flag
> + * removed.
> + *
> + * For use with radix_tree_lookup_slot(). Caller must hold tree read
> + * locked across slot lookup and dereference. Not required if write lock is
> + * held (ie. items cannot be concurrently inserted).
> + *
> + * radix_tree_deref_retry must be used to confirm validity of the pointer if
> + * only the read lock is held.
> + */
> +static inline void *radix_tree_deref_slot_protected(void **pslot,
> + spinlock_t *treelock)
> +{
> + return rcu_dereference_protected(*pslot, lockdep_is_held(treelock));
> +}

It seems to be good than mine. Just a nitpick.
Can't we get the mutex lock as update-side lock in future?

--
Kind regards,
Minchan Kim

2010-12-17 16:01:28

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [BUG?] memory hotplug: include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!

On Fri, Dec 17, 2010 at 08:39:12AM +0000, Mel Gorman wrote:
> On Fri, Dec 17, 2010 at 09:04:13AM +0900, Minchan Kim wrote:
> > On Thu, Dec 16, 2010 at 10:50 PM, Gerald Schaefer
> > <[email protected]> wrote:
> > > I got the same warning now after increasing /proc/sys/vm/nr_hugepages, see
> > > below. Both cases are easily reproducible: memory unplug with big page cache,
> > > or adding large pages during run-time.
> > >
> > > ===================================================
> > > [ INFO: suspicious rcu_dereference_check() usage. ]
> > > ---------------------------------------------------
> > > include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
> > >
> > > other info that might help us debug this:
> > >
> > >
> > > rcu_scheduler_active = 1, debug_locks = 0
> > > 1 lock held by bash/761:
> > > ?#0: ?(&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<00000000002263ae>] migrate_page_move_mapping+0x4a/0x2d8
> > >
> > > stack backtrace:
> > > CPU: 1 Not tainted 2.6.37-rc6 #4
> > > Process bash (pid: 761, task: 00000000181b5540, ksp: 00000000181bb7f8)
> > > 00000000181bb818 00000000181bb798 0000000000000002 0000000000000000
> > > ? ? ? 00000000181bb838 00000000181bb7b0 00000000181bb7b0 000000000056bafa
> > > ? ? ? 0000000000000000 000000003f42bdf0 0000000000000002 000000001c43be30
> > > ? ? ? 000003e00000000d 000003e00000000c 00000000181bb800 0000000000000000
> > > ? ? ? 0000000000000000 0000000000100bfa 00000000181bb798 00000000181bb7d8
> > > Call Trace:
> > > ([<0000000000100b02>] show_trace+0xee/0x144)
> > > ?[<000000000022654e>] migrate_page_move_mapping+0x1ea/0x2d8
> > > ?[<0000000000226c80>] migrate_page+0x38/0x68
> > > ?[<0000000000226d9a>] move_to_new_page+0xea/0x2bc
> > > ?[<000000000022785a>] migrate_pages+0x496/0x568
> > > ?[<000000000021e24e>] compact_zone+0x432/0x7d8
> > > ?[<000000000021e772>] compact_zone_order+0x9e/0xbc
> > > ?[<000000000021ed52>] try_to_compact_pages+0x1ba/0x24c
> > > ?[<00000000001e1afa>] __alloc_pages_nodemask+0x86a/0xa64
> > > ?[<000000000021c80c>] alloc_fresh_huge_page.clone.2+0x68/0x18c
> > > ?[<000000000021cc4c>] set_max_huge_pages.clone.0+0xa4/0x1ac
> > > ?[<000000000021ce06>] hugetlb_sysctl_handler+0xb2/0xcc
> > > ?[<00000000002a6572>] proc_sys_call_handler+0xe6/0x10c
> > > ?[<00000000002a65be>] proc_sys_write+0x26/0x34
> > > ?[<00000000002336e0>] vfs_write+0xac/0x18c
> > > ?[<00000000002338bc>] SyS_write+0x58/0xa8
> > > ?[<0000000000113976>] sysc_noemu+0x16/0x1c
> > > ?[<0000020000162edc>] 0x20000162edc
> > > INFO: lockdep is turned off.
> > >
> > > I honestly do not understand 100% why this is a false positive, seeing that
> > > e.g. find_get_page() will also use radix_tree_deref_slot(), holding only the
> > > rcu_read_lock, while migrate_page_move_mapping() has no rcu_read_lock() but
> > > the &mapping->tree_lock instead. So I'm not quite sure how to fix this
> > > properly, but simply adding rcu_read_lock/unlock() to the affected code paths,
> > > even if it is not necessary for synchronization, would get rid of the warning,
> > > like in the following patch. Any ideas?
> >
> > In case of anon page, we hold rcu_read_lock in unmap_and_move.
> > The problem is file-backed page. In case of that, we hold lock_page
> > and mapping->tree_lock as update-side lock.
> > So we don't need rcu_read_lock.
> >
> > >
> > > ---
> > > ?fs/hugetlbfs/inode.c | ? ?2 ++
> > > ?mm/migrate.c ? ? ? ? | ? ?4 ++++
> > > ?2 files changed, 6 insertions(+)
> > >
> > > --- a/fs/hugetlbfs/inode.c
> > > +++ b/fs/hugetlbfs/inode.c
> > > @@ -580,7 +580,9 @@ static int hugetlbfs_migrate_page(struct
> > > ?{
> > > ? ? ? ?int rc;
> > >
> > > + ? ? ? rcu_read_lock();
> > > ? ? ? ?rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> > > + ? ? ? rcu_read_unlock();
> > > ? ? ? ?if (rc)
> > > ? ? ? ? ? ? ? ?return rc;
> > > ? ? ? ?migrate_page_copy(newpage, page);
> > > --- a/mm/migrate.c
> > > +++ b/mm/migrate.c
> > > @@ -417,7 +417,9 @@ int migrate_page(struct address_space *m
> > >
> > > ? ? ? ?BUG_ON(PageWriteback(page)); ? ?/* Writeback must be complete */
> > >
> > > + ? ? ? rcu_read_lock();
> > > ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> > > + ? ? ? rcu_read_unlock();
> > >
> > > ? ? ? ?if (rc)
> > > ? ? ? ? ? ? ? ?return rc;
> > > @@ -444,7 +446,9 @@ int buffer_migrate_page(struct address_s
> > >
> > > ? ? ? ?head = page_buffers(page);
> > >
> > > + ? ? ? rcu_read_lock();
> > > ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> > > + ? ? ? rcu_read_unlock();
> > >
> > > ? ? ? ?if (rc)
> > > ? ? ? ? ? ? ? ?return rc;
> > >
> > >
> > >
> >
> >
> > How about this?
> > Maybe Paul have better idea.
> > (It's apparently be word-wrapped.)
> >
>
> heh, I wrote a patch almost identical to this and ran it overnight for testing
> (test was a memory consumer running while a parallel process grew and shrunk
> the hugepage pool). It passes but that is hardly a surprise. We differed
> slightly in a number of respects though.
>
> > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > index ab2baa5..135af1e 100644
> > --- a/include/linux/radix-tree.h
> > +++ b/include/linux/radix-tree.h
> > @@ -146,6 +146,20 @@ static inline void *radix_tree_deref_slot(void **pslot)
> > }
> >
> > /**
> > + * radix_tree_deref_slot_nocheck - dereference a slot without RCU check
> > + * @pslot: pointer to slot, returned by radix_tree_lookup_slot
> > + * Returns: item that was stored in that slot with any direct pointer flag
> > + * removed.
> > + *
> > + * This functions works like radix_tree_deref_slot except it doesn't check
> > + * RCU rule. Normally this funcion is used with update-side lock.
> > + * You should use this function very carefully.
> > + */
> > +static inline void *radix_tree_deref_slot_nocheck(void **pslot)
> > +{
> > + return rcu_dereference_protected(*pslot, 1);
> > +}
>
> For this, I had
>
> diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> index ab2baa5..252d21c 100644
> --- a/include/linux/radix-tree.h
> +++ b/include/linux/radix-tree.h
> @@ -146,6 +146,25 @@ static inline void *radix_tree_deref_slot(void **pslot)
> }
>
> /**
> + * radix_tree_deref_slot_protected - dereference a slot without RCU lock but with tree lock held
> + * @pslot: pointer to slot, returned by radix_tree_lookup_slot
> + * Returns: item that was stored in that slot with any direct pointer flag
> + * removed.
> + *
> + * For use with radix_tree_lookup_slot(). Caller must hold tree read
> + * locked across slot lookup and dereference. Not required if write lock is
> + * held (ie. items cannot be concurrently inserted).
> + *
> + * radix_tree_deref_retry must be used to confirm validity of the pointer if
> + * only the read lock is held.
> + */
> +static inline void *radix_tree_deref_slot_protected(void **pslot,
> + spinlock_t *treelock)
> +{
> + return rcu_dereference_protected(*pslot, lockdep_is_held(treelock));
> +}
> +
> +/**
> * radix_tree_deref_retry - check radix_tree_deref_slot
> * @arg: pointer returned by radix_tree_deref_slot
> * Returns: 0 if retry is not required, otherwise retry is required
>
> In the documentation, I noted that the check might be without RCU but with
> the knowledge that it's protected by the tree lock. I'm not a RCU expert
> but this is only safe when you know there isn't a parallel updater and the
> treelock should be preventing that, right?

Yes, if you have prevented updates from happening, then it is OK to
use rcu_dereference_protected() instead of rcu_dereference_check().

However, if RCU readers can invoke this code, things will break.

By the way, this means that something like:

rcu_dereference_protected(p, rcu_read_lock_held()) /* BUGGY!!! */

is just plain wrong.

> Even so, other users of rcu_dereference_protected() check a lock condition
> which I used tree lock for. I intended to read through the rest of
> documentation properly this morning to determine if this was indeed the
> right approach.
>
> I used the name _protected instead of _nocheck because the dereference
> is still protected (by the tree lock) just not by RCU. Again, have to
> check the documentation to ensure this is correct.

Yep, this is indeed the case that rcu_dereference_protected() is for.

> > +/**
> > * radix_tree_deref_retry - check radix_tree_deref_slot
> > * @arg: pointer returned by radix_tree_deref_slot
> > * Returns: 0 if retry is not required, otherwise retry is required
> > diff --git a/mm/migrate.c b/mm/migrate.c
> > index 2eb2243..5be2841 100644
> > --- a/mm/migrate.c
> > +++ b/mm/migrate.c
> > @@ -244,7 +244,8 @@ static int migrate_page_move_mapping(struct
> > address_space *mappin
> >
> > expected_count = 2 + page_has_private(page);
> > if (page_count(page) != expected_count ||
> > - (struct page *)radix_tree_deref_slot(pslot) != page) {
> > + (struct page *)radix_tree_deref_slot_nocheck(pslot)
> > + != page) {
> > spin_unlock_irq(&mapping->tree_lock);
> > return -EAGAIN;
> > }
>
> We only differed here by my passing in the &mapping->tree_lock

Which should be optimized away during inlining, so no performance
penalty. ;-)

Thanx, Paul

> > @@ -316,7 +317,8 @@ int migrate_huge_page_move_mapping(struct
> > address_space *mapping,
> >
> > expected_count = 2 + page_has_private(page);
> > if (page_count(page) != expected_count ||
> > - (struct page *)radix_tree_deref_slot(pslot) != page) {
> > + (struct page *)radix_tree_deref_slot_nocheck(pslot)
> > + != page) {
> > spin_unlock_irq(&mapping->tree_lock);
> > return -EAGAIN;
> > }
>
> --
> Mel Gorman
> Part-time Phd Student Linux Technology Center
> University of Limerick IBM Dublin Software Lab

2010-12-17 16:04:02

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [BUG?] memory hotplug: include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!

On Sat, Dec 18, 2010 at 12:08:27AM +0900, Minchan Kim wrote:
> On Thu, Dec 16, 2010 at 09:47:22PM -0800, Paul E. McKenney wrote:
> > On Fri, Dec 17, 2010 at 09:04:13AM +0900, Minchan Kim wrote:
> > > On Thu, Dec 16, 2010 at 10:50 PM, Gerald Schaefer
> > > <[email protected]> wrote:
> > > > I got the same warning now after increasing /proc/sys/vm/nr_hugepages, see
> > > > below. Both cases are easily reproducible: memory unplug with big page cache,
> > > > or adding large pages during run-time.
> > > >
> > > > ===================================================
> > > > [ INFO: suspicious rcu_dereference_check() usage. ]
> > > > ---------------------------------------------------
> > > > include/linux/radix-tree.h:145 invoked rcu_dereference_check() without protection!
> > > >
> > > > other info that might help us debug this:
> > > >
> > > >
> > > > rcu_scheduler_active = 1, debug_locks = 0
> > > > 1 lock held by bash/761:
> > > > ?#0: ?(&(&inode->i_data.tree_lock)->rlock){..-.-.}, at: [<00000000002263ae>] migrate_page_move_mapping+0x4a/0x2d8
> > > >
> > > > stack backtrace:
> > > > CPU: 1 Not tainted 2.6.37-rc6 #4
> > > > Process bash (pid: 761, task: 00000000181b5540, ksp: 00000000181bb7f8)
> > > > 00000000181bb818 00000000181bb798 0000000000000002 0000000000000000
> > > > ? ? ? 00000000181bb838 00000000181bb7b0 00000000181bb7b0 000000000056bafa
> > > > ? ? ? 0000000000000000 000000003f42bdf0 0000000000000002 000000001c43be30
> > > > ? ? ? 000003e00000000d 000003e00000000c 00000000181bb800 0000000000000000
> > > > ? ? ? 0000000000000000 0000000000100bfa 00000000181bb798 00000000181bb7d8
> > > > Call Trace:
> > > > ([<0000000000100b02>] show_trace+0xee/0x144)
> > > > ?[<000000000022654e>] migrate_page_move_mapping+0x1ea/0x2d8
> > > > ?[<0000000000226c80>] migrate_page+0x38/0x68
> > > > ?[<0000000000226d9a>] move_to_new_page+0xea/0x2bc
> > > > ?[<000000000022785a>] migrate_pages+0x496/0x568
> > > > ?[<000000000021e24e>] compact_zone+0x432/0x7d8
> > > > ?[<000000000021e772>] compact_zone_order+0x9e/0xbc
> > > > ?[<000000000021ed52>] try_to_compact_pages+0x1ba/0x24c
> > > > ?[<00000000001e1afa>] __alloc_pages_nodemask+0x86a/0xa64
> > > > ?[<000000000021c80c>] alloc_fresh_huge_page.clone.2+0x68/0x18c
> > > > ?[<000000000021cc4c>] set_max_huge_pages.clone.0+0xa4/0x1ac
> > > > ?[<000000000021ce06>] hugetlb_sysctl_handler+0xb2/0xcc
> > > > ?[<00000000002a6572>] proc_sys_call_handler+0xe6/0x10c
> > > > ?[<00000000002a65be>] proc_sys_write+0x26/0x34
> > > > ?[<00000000002336e0>] vfs_write+0xac/0x18c
> > > > ?[<00000000002338bc>] SyS_write+0x58/0xa8
> > > > ?[<0000000000113976>] sysc_noemu+0x16/0x1c
> > > > ?[<0000020000162edc>] 0x20000162edc
> > > > INFO: lockdep is turned off.
> > > >
> > > > I honestly do not understand 100% why this is a false positive, seeing that
> > > > e.g. find_get_page() will also use radix_tree_deref_slot(), holding only the
> > > > rcu_read_lock, while migrate_page_move_mapping() has no rcu_read_lock() but
> > > > the &mapping->tree_lock instead. So I'm not quite sure how to fix this
> > > > properly, but simply adding rcu_read_lock/unlock() to the affected code paths,
> > > > even if it is not necessary for synchronization, would get rid of the warning,
> > > > like in the following patch. Any ideas?
> > >
> > > In case of anon page, we hold rcu_read_lock in unmap_and_move.
> > > The problem is file-backed page. In case of that, we hold lock_page
> > > and mapping->tree_lock as update-side lock.
> > > So we don't need rcu_read_lock.
> > >
> > > >
> > > > ---
> > > > ?fs/hugetlbfs/inode.c | ? ?2 ++
> > > > ?mm/migrate.c ? ? ? ? | ? ?4 ++++
> > > > ?2 files changed, 6 insertions(+)
> > > >
> > > > --- a/fs/hugetlbfs/inode.c
> > > > +++ b/fs/hugetlbfs/inode.c
> > > > @@ -580,7 +580,9 @@ static int hugetlbfs_migrate_page(struct
> > > > ?{
> > > > ? ? ? ?int rc;
> > > >
> > > > + ? ? ? rcu_read_lock();
> > > > ? ? ? ?rc = migrate_huge_page_move_mapping(mapping, newpage, page);
> > > > + ? ? ? rcu_read_unlock();
> > > > ? ? ? ?if (rc)
> > > > ? ? ? ? ? ? ? ?return rc;
> > > > ? ? ? ?migrate_page_copy(newpage, page);
> > > > --- a/mm/migrate.c
> > > > +++ b/mm/migrate.c
> > > > @@ -417,7 +417,9 @@ int migrate_page(struct address_space *m
> > > >
> > > > ? ? ? ?BUG_ON(PageWriteback(page)); ? ?/* Writeback must be complete */
> > > >
> > > > + ? ? ? rcu_read_lock();
> > > > ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> > > > + ? ? ? rcu_read_unlock();
> > > >
> > > > ? ? ? ?if (rc)
> > > > ? ? ? ? ? ? ? ?return rc;
> > > > @@ -444,7 +446,9 @@ int buffer_migrate_page(struct address_s
> > > >
> > > > ? ? ? ?head = page_buffers(page);
> > > >
> > > > + ? ? ? rcu_read_lock();
> > > > ? ? ? ?rc = migrate_page_move_mapping(mapping, newpage, page);
> > > > + ? ? ? rcu_read_unlock();
> > > >
> > > > ? ? ? ?if (rc)
> > > > ? ? ? ? ? ? ? ?return rc;
> > > >
> > > >
> > > >
> > >
> > >
> > > How about this?
> > > Maybe Paul have better idea.
> > > (It's apparently be word-wrapped.)
> > >
> > > diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
> > > index ab2baa5..135af1e 100644
> > > --- a/include/linux/radix-tree.h
> > > +++ b/include/linux/radix-tree.h
> > > @@ -146,6 +146,20 @@ static inline void *radix_tree_deref_slot(void **pslot)
> > > }
> > >
> > > /**
> > > + * radix_tree_deref_slot_nocheck - dereference a slot without RCU check
> > > + * @pslot: pointer to slot, returned by radix_tree_lookup_slot
> > > + * Returns: item that was stored in that slot with any direct pointer flag
> > > + * removed.
> > > + *
> > > + * This functions works like radix_tree_deref_slot except it doesn't check
> > > + * RCU rule. Normally this funcion is used with update-side lock.
> > > + * You should use this function very carefully.
> > > + */
> > > +static inline void *radix_tree_deref_slot_nocheck(void **pslot)
> > > +{
> > > + return rcu_dereference_protected(*pslot, 1);
> >
> > I suggest replacing the "1" with lockdep expressions for the locks
> > that you say might be held:
>
> It's not hard.
>
> >
> > return rcu_dereference_check(*pslot,
> > lockdep_is_held(&mapping->tree_lock));
>
> rcu_dereference_check still pass rcu_read_lock_held check we don't want.
> I think rcu_dereference_protected is proper.

You are exactly right. The only reason that I used rcu_dereference_check()
instead of rcu_dereference_protected() is because I didn't realize that
RCU readers never called this function.

> Why I don't add lockdep expressions is radix_tree_deref_slot is general API.
> It might be used anywhere where it doesn't related to mapping->tree_lock.
> If we add argument 'mapping', it has a very strong dependency with address_space.
> so I decided making the function general and then caller must use it very carefully.
> But I am not strong in this point.

I believe that this would be a good thing.

> > This assumes that when you said "and" you meant both lock_page() and
> > mapping->tree_lock. Also you need to pass in the mapping, which
> > should not be a problem given likely inlining.
> >
> > If you meant that either mapping->tree_lock or page_lock() might be
> > held, I suppose that the page_lock() state could be passed in, but
> > perhaps better to take a general lockdep expression.
> >
> > So, either or both? ;-)
> >
> > Thanx, Paul
>
> I think either is okay. That's because remove_from_page_cache/__remove_from_page_cache
> needs both locks so we can't prevent update if we get a either lock.
> In code context, I think mapping->tree_lock is more readable since it is used near by.

Good!!! So we only really need to check for one or the other.

> Thanks for the comment, Paul.

No problem!

Thanx, Paul