2015-07-22 08:10:47

by Jerome Marchand

[permalink] [raw]
Subject: [RFC PATCH] nfs: avoid swap-over-NFS deadlock

Lockdep warns about a inconsistent {RECLAIM_FS-ON-W} ->
{IN-RECLAIM_FS-W} usage. The culpritt is the inode->i_mutex taken in
nfs_file_direct_write(). This code was introduced by commit a9ab5e840669
("nfs: page cache invalidation for dio").
This naive test patch avoid to take the mutex on a swapfile and makes
lockdep happy again. However I don't know much about NFS code and I
assume it's probably not the proper solution. Any thought?

Signed-off-by: Jerome Marchand <[email protected]>
---
fs/nfs/direct.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/nfs/direct.c b/fs/nfs/direct.c
index 38678d9..42324d4 100644
--- a/fs/nfs/direct.c
+++ b/fs/nfs/direct.c
@@ -974,7 +974,9 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
pos = iocb->ki_pos;
end = (pos + iov_iter_count(iter) - 1) >> PAGE_CACHE_SHIFT;

- mutex_lock(&inode->i_mutex);
+ /* Don't take the mutex while in reclaim_FS */
+ if (!IS_SWAPFILE(inode))
+ mutex_lock(&inode->i_mutex);

result = nfs_sync_mapping(mapping);
if (result)
@@ -1014,7 +1016,8 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
pos >> PAGE_CACHE_SHIFT, end);
}

- mutex_unlock(&inode->i_mutex);
+ if (!IS_SWAPFILE(inode))
+ mutex_unlock(&inode->i_mutex);

if (!result) {
result = nfs_direct_wait(dreq);
@@ -1035,7 +1038,8 @@ ssize_t nfs_file_direct_write(struct kiocb *iocb, struct iov_iter *iter)
out_release:
nfs_direct_req_release(dreq);
out_unlock:
- mutex_unlock(&inode->i_mutex);
+ if (!IS_SWAPFILE(inode))
+ mutex_unlock(&inode->i_mutex);
return result;
}

--
1.9.3



2015-07-22 12:23:46

by Trond Myklebust

[permalink] [raw]
Subject: Re: [RFC PATCH] nfs: avoid swap-over-NFS deadlock

On Wed, Jul 22, 2015 at 4:10 AM, Jerome Marchand <[email protected]> wrote:
>
> Lockdep warns about a inconsistent {RECLAIM_FS-ON-W} ->
> {IN-RECLAIM_FS-W} usage. The culpritt is the inode->i_mutex taken in
> nfs_file_direct_write(). This code was introduced by commit a9ab5e840669
> ("nfs: page cache invalidation for dio").
> This naive test patch avoid to take the mutex on a swapfile and makes
> lockdep happy again. However I don't know much about NFS code and I
> assume it's probably not the proper solution. Any thought?
>
> Signed-off-by: Jerome Marchand <[email protected]>

NFS is not the only O_DIRECT implementation to set the inode->i_mutex.
Why can't this be fixed in the generic swap code instead of adding
yet-another-exception-for-IS_SWAPFILE?

Cheers
Trond

2015-07-22 13:46:25

by Jerome Marchand

[permalink] [raw]
Subject: Re: [RFC PATCH] nfs: avoid swap-over-NFS deadlock

On 07/22/2015 02:23 PM, Trond Myklebust wrote:
> On Wed, Jul 22, 2015 at 4:10 AM, Jerome Marchand <[email protected]> wrote:
>>
>> Lockdep warns about a inconsistent {RECLAIM_FS-ON-W} ->
>> {IN-RECLAIM_FS-W} usage. The culpritt is the inode->i_mutex taken in
>> nfs_file_direct_write(). This code was introduced by commit a9ab5e840669
>> ("nfs: page cache invalidation for dio").
>> This naive test patch avoid to take the mutex on a swapfile and makes
>> lockdep happy again. However I don't know much about NFS code and I
>> assume it's probably not the proper solution. Any thought?
>>
>> Signed-off-by: Jerome Marchand <[email protected]>
>
> NFS is not the only O_DIRECT implementation to set the inode->i_mutex.
> Why can't this be fixed in the generic swap code instead of adding
> yet-another-exception-for-IS_SWAPFILE?

I meant to cc Mel. Just added him.

AFAIK NFS is the only filesystem that uses swap_activate. Other
swapfiles are handled more or less like block device (divided in a set
of contiguous ranges of disk block called swap extents), so there are
not affected by this possible deadlock.

Also nfs_direct_IO() is special in that it is called only from swap,
nfs_file_direct_write() however has other users.

Jerome
>
> Cheers
> Trond
>



Attachments:
signature.asc (473.00 B)
OpenPGP digital signature

2015-07-27 11:00:11

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH] nfs: avoid swap-over-NFS deadlock

On Wed, Jul 22, 2015 at 03:46:16PM +0200, Jerome Marchand wrote:
> On 07/22/2015 02:23 PM, Trond Myklebust wrote:
> > On Wed, Jul 22, 2015 at 4:10 AM, Jerome Marchand <[email protected]> wrote:
> >>
> >> Lockdep warns about a inconsistent {RECLAIM_FS-ON-W} ->
> >> {IN-RECLAIM_FS-W} usage. The culpritt is the inode->i_mutex taken in
> >> nfs_file_direct_write(). This code was introduced by commit a9ab5e840669
> >> ("nfs: page cache invalidation for dio").
> >> This naive test patch avoid to take the mutex on a swapfile and makes
> >> lockdep happy again. However I don't know much about NFS code and I
> >> assume it's probably not the proper solution. Any thought?
> >>
> >> Signed-off-by: Jerome Marchand <[email protected]>
> >
> > NFS is not the only O_DIRECT implementation to set the inode->i_mutex.
> > Why can't this be fixed in the generic swap code instead of adding
> > yet-another-exception-for-IS_SWAPFILE?
>
> I meant to cc Mel. Just added him.
>

Can the full lockdep warning be included as it'll be easier to see then if
the generic swap code can somehow special case this? Currently, generic
swapping does not not need to care about how the filesystem locked.
For most filesystems, it's writing directly to the blocks on disk and
bypassing the FS. In the NFS case it'd be surprising to find that there
also are dirty pages in page cache that belong to the swap file as it's
going to cause corruption. If there is any special casing it would to only
attempt the invalidation in the !swap case and warn if mapping->nrpages. It
still would look a bit weird but safer than just not acquiring the mutex
and then potentially attempting an invalidation.

--
Mel Gorman
SUSE Labs

2015-07-27 11:25:58

by Jerome Marchand

[permalink] [raw]
Subject: Re: [RFC PATCH] nfs: avoid swap-over-NFS deadlock

On 07/27/2015 12:52 PM, Mel Gorman wrote:
> On Wed, Jul 22, 2015 at 03:46:16PM +0200, Jerome Marchand wrote:
>> On 07/22/2015 02:23 PM, Trond Myklebust wrote:
>>> On Wed, Jul 22, 2015 at 4:10 AM, Jerome Marchand <[email protected]> wrote:
>>>>
>>>> Lockdep warns about a inconsistent {RECLAIM_FS-ON-W} ->
>>>> {IN-RECLAIM_FS-W} usage. The culpritt is the inode->i_mutex taken in
>>>> nfs_file_direct_write(). This code was introduced by commit a9ab5e840669
>>>> ("nfs: page cache invalidation for dio").
>>>> This naive test patch avoid to take the mutex on a swapfile and makes
>>>> lockdep happy again. However I don't know much about NFS code and I
>>>> assume it's probably not the proper solution. Any thought?
>>>>
>>>> Signed-off-by: Jerome Marchand <[email protected]>
>>>
>>> NFS is not the only O_DIRECT implementation to set the inode->i_mutex.
>>> Why can't this be fixed in the generic swap code instead of adding
>>> yet-another-exception-for-IS_SWAPFILE?
>>
>> I meant to cc Mel. Just added him.
>>
>
> Can the full lockdep warning be included as it'll be easier to see then if
> the generic swap code can somehow special case this? Currently, generic
> swapping does not not need to care about how the filesystem locked.
> For most filesystems, it's writing directly to the blocks on disk and
> bypassing the FS. In the NFS case it'd be surprising to find that there
> also are dirty pages in page cache that belong to the swap file as it's
> going to cause corruption. If there is any special casing it would to only
> attempt the invalidation in the !swap case and warn if mapping->nrpages. It
> still would look a bit weird but safer than just not acquiring the mutex
> and then potentially attempting an invalidation.
>

[ 6819.501009] =================================
[ 6819.501009] [ INFO: inconsistent lock state ]
[ 6819.501009] 4.2.0-rc1-shmacct-babka-v2-next-20150709+ #255 Not tainted
[ 6819.501009] ---------------------------------
[ 6819.501009] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[ 6819.501009] kswapd0/38 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 6819.501009] (&sb->s_type->i_mutex_key#17){+.+.?.}, at: [<ffffffffa03772a5>] nfs_file_direct_write+0x85/0x3f0 [nfs]
[ 6819.501009] {RECLAIM_FS-ON-W} state was registered at:
[ 6819.501009] [<ffffffff81107f51>] mark_held_locks+0x71/0x90
[ 6819.501009] [<ffffffff8110b775>] lockdep_trace_alloc+0x75/0xe0
[ 6819.501009] [<ffffffff81245529>] kmem_cache_alloc_node_trace+0x39/0x440
[ 6819.501009] [<ffffffff81225b8f>] __get_vm_area_node+0x7f/0x160
[ 6819.501009] [<ffffffff81226eb2>] __vmalloc_node_range+0x72/0x2c0
[ 6819.501009] [<ffffffff81227424>] vzalloc+0x54/0x60
[ 6819.501009] [<ffffffff8122c7c8>] SyS_swapon+0x628/0xfc0
[ 6819.501009] [<ffffffff81867772>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 6819.501009] irq event stamp: 163459
[ 6819.501009] hardirqs last enabled at (163459): [<ffffffff81866c66>] _raw_spin_unlock_irqrestore+0x36/0x60
[ 6819.501009] hardirqs last disabled at (163458): [<ffffffff8186747b>] _raw_spin_lock_irqsave+0x2b/0x90
[ 6819.501009] softirqs last enabled at (162966): [<ffffffff810b13d3>] __do_softirq+0x363/0x630
[ 6819.501009] softirqs last disabled at (162961): [<ffffffff810b1a03>] irq_exit+0xf3/0x100
[ 6819.501009]
other info that might help us debug this:
[ 6819.501009] Possible unsafe locking scenario:

[ 6819.501009] CPU0
[ 6819.501009] ----
[ 6819.501009] lock(&sb->s_type->i_mutex_key#17);
[ 6819.501009] <Interrupt>
[ 6819.501009] lock(&sb->s_type->i_mutex_key#17);
[ 6819.501009]
*** DEADLOCK ***

[ 6819.501009] no locks held by kswapd0/38.
[ 6819.501009]
stack backtrace:
[ 6819.501009] CPU: 1 PID: 38 Comm: kswapd0 Not tainted 4.2.0-rc1-shmacct-babka-v2-next-20150709+ #255
[ 6819.501009] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 6819.501009] 0000000000000000 00000000cca71737 ffff880033f374d8 ffffffff8185ce5b
[ 6819.501009] 0000000000000000 ffff880033f30000 ffff880033f37538 ffffffff8185732d
[ 6819.501009] 0000000000000000 ffff880000000001 ffff880000000001 ffffffff8102f49f
[ 6819.501009] Call Trace:
[ 6819.501009] [<ffffffff8185ce5b>] dump_stack+0x4c/0x65
[ 6819.501009] [<ffffffff8185732d>] print_usage_bug+0x1f2/0x203
[ 6819.501009] [<ffffffff8102f49f>] ? save_stack_trace+0x2f/0x50
[ 6819.501009] [<ffffffff81107430>] ? check_usage_backwards+0x150/0x150
[ 6819.501009] [<ffffffff81107e52>] mark_lock+0x212/0x2a0
[ 6819.501009] [<ffffffff81108d73>] __lock_acquire+0x8d3/0x1f40
[ 6819.501009] [<ffffffff8110953e>] ? __lock_acquire+0x109e/0x1f40
[ 6819.501009] [<ffffffff8110ac92>] lock_acquire+0xc2/0x280
[ 6819.501009] [<ffffffffa03772a5>] ? nfs_file_direct_write+0x85/0x3f0 [nfs]
[ 6819.501009] [<ffffffff818641bf>] mutex_lock_nested+0x7f/0x3f0
[ 6819.501009] [<ffffffffa03772a5>] ? nfs_file_direct_write+0x85/0x3f0 [nfs]
[ 6819.501009] [<ffffffff81105328>] ? __lock_is_held+0x58/0x80
[ 6819.501009] [<ffffffffa03772a5>] ? nfs_file_direct_write+0x85/0x3f0 [nfs]
[ 6819.501009] [<ffffffff8122a500>] ? get_swap_bio+0x90/0x90
[ 6819.501009] [<ffffffffa03772a5>] nfs_file_direct_write+0x85/0x3f0 [nfs]
[ 6819.501009] [<ffffffff8122a500>] ? get_swap_bio+0x90/0x90
[ 6819.501009] [<ffffffffa0377640>] nfs_direct_IO+0x30/0x50 [nfs]
[ 6819.501009] [<ffffffff8122a9b5>] __swap_writepage+0x105/0x270
[ 6819.501009] [<ffffffff8122ab59>] swap_writepage+0x39/0x70
[ 6819.501009] [<ffffffff811fbef2>] shmem_writepage+0x1f2/0x330
[ 6819.501009] [<ffffffff811f3319>] pageout.isra.48+0x189/0x4a0
[ 6819.501009] [<ffffffff811f5497>] shrink_page_list+0x9b7/0xc80
[ 6819.501009] [<ffffffff811f60a8>] shrink_inactive_list+0x3a8/0x800
[ 6819.501009] [<ffffffff810e72f5>] ? local_clock+0x15/0x30
[ 6819.501009] [<ffffffff811f6f10>] shrink_lruvec+0x610/0x800
[ 6819.501009] [<ffffffff811f71e7>] shrink_zone+0xe7/0x2d0
[ 6819.501009] [<ffffffff811f8ddd>] kswapd+0x55d/0xd30
[ 6819.501009] [<ffffffff811f8880>] ? mem_cgroup_shrink_node_zone+0x490/0x490
[ 6819.501009] [<ffffffff810d1a74>] kthread+0x104/0x120
[ 6819.501009] [<ffffffff810d1970>] ? kthread_create_on_node+0x250/0x250
[ 6819.501009] [<ffffffff81867aef>] ret_from_fork+0x3f/0x70
[ 6819.501009] [<ffffffff810d1970>] ? kthread_create_on_node+0x250/0x250



Attachments:
signature.asc (473.00 B)
OpenPGP digital signature

2015-08-20 12:24:09

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH] nfs: avoid swap-over-NFS deadlock

On Mon, Jul 27, 2015 at 01:25:47PM +0200, Jerome Marchand wrote:
> On 07/27/2015 12:52 PM, Mel Gorman wrote:
> > On Wed, Jul 22, 2015 at 03:46:16PM +0200, Jerome Marchand wrote:
> >> On 07/22/2015 02:23 PM, Trond Myklebust wrote:
> >>> On Wed, Jul 22, 2015 at 4:10 AM, Jerome Marchand <[email protected]> wrote:
> >>>>
> >>>> Lockdep warns about a inconsistent {RECLAIM_FS-ON-W} ->
> >>>> {IN-RECLAIM_FS-W} usage. The culpritt is the inode->i_mutex taken in
> >>>> nfs_file_direct_write(). This code was introduced by commit a9ab5e840669
> >>>> ("nfs: page cache invalidation for dio").
> >>>> This naive test patch avoid to take the mutex on a swapfile and makes
> >>>> lockdep happy again. However I don't know much about NFS code and I
> >>>> assume it's probably not the proper solution. Any thought?
> >>>>
> >>>> Signed-off-by: Jerome Marchand <[email protected]>
> >>>
> >>> NFS is not the only O_DIRECT implementation to set the inode->i_mutex.
> >>> Why can't this be fixed in the generic swap code instead of adding
> >>> yet-another-exception-for-IS_SWAPFILE?
> >>
> >> I meant to cc Mel. Just added him.
> >>
> >
> > Can the full lockdep warning be included as it'll be easier to see then if
> > the generic swap code can somehow special case this? Currently, generic
> > swapping does not not need to care about how the filesystem locked.
> > For most filesystems, it's writing directly to the blocks on disk and
> > bypassing the FS. In the NFS case it'd be surprising to find that there
> > also are dirty pages in page cache that belong to the swap file as it's
> > going to cause corruption. If there is any special casing it would to only
> > attempt the invalidation in the !swap case and warn if mapping->nrpages. It
> > still would look a bit weird but safer than just not acquiring the mutex
> > and then potentially attempting an invalidation.
> >
>
> [ 6819.501009] =================================
> [ 6819.501009] [ INFO: inconsistent lock state ]
> [ 6819.501009] 4.2.0-rc1-shmacct-babka-v2-next-20150709+ #255 Not tainted
> [ 6819.501009] ---------------------------------

Thanks. Sorry for the long delay but I finally got back to the bug this
week. NFS can be modified to special case the swapfile but I was not happy
with the result for multiple reasons. It took me a while to see a way for
the core VM to deal with it. What do you think of the following
approach? More importantly, does it work for you?

---8<---
nfs: Use swap_lock to prevent parallel swapon activations

Jerome Marchand reported a lockdep warning as follows

[ 6819.501009] =================================
[ 6819.501009] [ INFO: inconsistent lock state ]
[ 6819.501009] 4.2.0-rc1-shmacct-babka-v2-next-20150709+ #255 Not tainted
[ 6819.501009] ---------------------------------
[ 6819.501009] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[ 6819.501009] kswapd0/38 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 6819.501009] (&sb->s_type->i_mutex_key#17){+.+.?.}, at: [<ffffffffa03772a5>] nfs_file_direct_write+0x85/0x3f0 [nfs]
[ 6819.501009] {RECLAIM_FS-ON-W} state was registered at:
[ 6819.501009] [<ffffffff81107f51>] mark_held_locks+0x71/0x90
[ 6819.501009] [<ffffffff8110b775>] lockdep_trace_alloc+0x75/0xe0
[ 6819.501009] [<ffffffff81245529>] kmem_cache_alloc_node_trace+0x39/0x440
[ 6819.501009] [<ffffffff81225b8f>] __get_vm_area_node+0x7f/0x160
[ 6819.501009] [<ffffffff81226eb2>] __vmalloc_node_range+0x72/0x2c0
[ 6819.501009] [<ffffffff81227424>] vzalloc+0x54/0x60
[ 6819.501009] [<ffffffff8122c7c8>] SyS_swapon+0x628/0xfc0
[ 6819.501009] [<ffffffff81867772>] entry_SYSCALL_64_fastpath+0x12/0x76

It's due to NFS acquiring i_mutex since a9ab5e840669 ("nfs: page
cache invalidation for dio") to invalidate page cache before direct I/O.
Filesystems may safely acquire i_mutex during direct writes but NFS is unique
in its treatment of swap files. Ordinarily swap files are supported by the
core VM looking up the physical block for a given offset in advance. There
is no physical block for NFS and the direct write paths are used after
calling mapping->swap_activate.

The lockdep warning is triggered by swapon(), which is not in reclaim
context, acquiring the i_mutex to ensure a swapfile is not activated twice.

swapon does not need the i_mutex for this purpose. There is a requirement
that fallocate not be used on swapfiles but this is protected by the inode
flag S_SWAPFILE and nothing to do with i_mutex. In fact, the current
protection does nothing for block devices. This patch expands the role
of swap_lock to protect against parallel activations of block devices and
swapfiles and removes the use of i_mutex. This both improves the protection
for swapon and avoids the lockdep warning.

Reported-by: Jerome Marchand <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/swapfile.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 41e4581af7c5..d58ed6833fa3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1928,9 +1928,9 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
set_blocksize(bdev, old_block_size);
blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
} else {
- mutex_lock(&inode->i_mutex);
+ spin_lock(&swap_lock);
inode->i_flags &= ~S_SWAPFILE;
- mutex_unlock(&inode->i_mutex);
+ spin_unlock(&swap_lock);
}
filp_close(swap_file, NULL);

@@ -2156,7 +2156,6 @@ static int claim_swapfile(struct swap_info_struct *p, struct inode *inode)
p->flags |= SWP_BLKDEV;
} else if (S_ISREG(inode->i_mode)) {
p->bdev = inode->i_sb->s_bdev;
- mutex_lock(&inode->i_mutex);
if (IS_SWAPFILE(inode))
return -EBUSY;
} else
@@ -2386,6 +2385,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
goto bad_swap;
}

+ /* prevent parallel swapons */
+ spin_lock(&swap_lock);
p->swap_file = swap_file;
mapping = swap_file->f_mapping;

@@ -2396,13 +2397,14 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
continue;
if (mapping == q->swap_file->f_mapping) {
error = -EBUSY;
+ spin_unlock(&swap_lock);
goto bad_swap;
}
}

inode = mapping->host;
- /* If S_ISREG(inode->i_mode) will do mutex_lock(&inode->i_mutex); */
error = claim_swapfile(p, inode);
+ spin_unlock(&swap_lock);
if (unlikely(error))
goto bad_swap;

@@ -2543,10 +2545,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
vfree(swap_map);
vfree(cluster_info);
if (swap_file) {
- if (inode && S_ISREG(inode->i_mode)) {
- mutex_unlock(&inode->i_mutex);
+ if (inode && S_ISREG(inode->i_mode))
inode = NULL;
- }
filp_close(swap_file, NULL);
}
out:
@@ -2556,8 +2556,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
}
if (name)
putname(name);
- if (inode && S_ISREG(inode->i_mode))
- mutex_unlock(&inode->i_mutex);
return error;
}


2015-09-01 16:22:27

by Jerome Marchand

[permalink] [raw]
Subject: Re: [RFC PATCH] nfs: avoid swap-over-NFS deadlock

On 08/20/2015 02:23 PM, Mel Gorman wrote:
> On Mon, Jul 27, 2015 at 01:25:47PM +0200, Jerome Marchand wrote:
>> On 07/27/2015 12:52 PM, Mel Gorman wrote:
>>> On Wed, Jul 22, 2015 at 03:46:16PM +0200, Jerome Marchand wrote:
>>>> On 07/22/2015 02:23 PM, Trond Myklebust wrote:
>>>>> On Wed, Jul 22, 2015 at 4:10 AM, Jerome Marchand <[email protected]> wrote:
>>>>>>
>>>>>> Lockdep warns about a inconsistent {RECLAIM_FS-ON-W} ->
>>>>>> {IN-RECLAIM_FS-W} usage. The culpritt is the inode->i_mutex taken in
>>>>>> nfs_file_direct_write(). This code was introduced by commit a9ab5e840669
>>>>>> ("nfs: page cache invalidation for dio").
>>>>>> This naive test patch avoid to take the mutex on a swapfile and makes
>>>>>> lockdep happy again. However I don't know much about NFS code and I
>>>>>> assume it's probably not the proper solution. Any thought?
>>>>>>
>>>>>> Signed-off-by: Jerome Marchand <[email protected]>
>>>>>
>>>>> NFS is not the only O_DIRECT implementation to set the inode->i_mutex.
>>>>> Why can't this be fixed in the generic swap code instead of adding
>>>>> yet-another-exception-for-IS_SWAPFILE?
>>>>
>>>> I meant to cc Mel. Just added him.
>>>>
>>>
>>> Can the full lockdep warning be included as it'll be easier to see then if
>>> the generic swap code can somehow special case this? Currently, generic
>>> swapping does not not need to care about how the filesystem locked.
>>> For most filesystems, it's writing directly to the blocks on disk and
>>> bypassing the FS. In the NFS case it'd be surprising to find that there
>>> also are dirty pages in page cache that belong to the swap file as it's
>>> going to cause corruption. If there is any special casing it would to only
>>> attempt the invalidation in the !swap case and warn if mapping->nrpages. It
>>> still would look a bit weird but safer than just not acquiring the mutex
>>> and then potentially attempting an invalidation.
>>>
>>
>> [ 6819.501009] =================================
>> [ 6819.501009] [ INFO: inconsistent lock state ]
>> [ 6819.501009] 4.2.0-rc1-shmacct-babka-v2-next-20150709+ #255 Not tainted
>> [ 6819.501009] ---------------------------------
>
> Thanks. Sorry for the long delay but I finally got back to the bug this
> week. NFS can be modified to special case the swapfile but I was not happy
> with the result for multiple reasons. It took me a while to see a way for
> the core VM to deal with it. What do you think of the following
> approach?

Seems sound to me.

> More importantly, does it work for you?

Yes.

>
> ---8<---
> nfs: Use swap_lock to prevent parallel swapon activations
>
> Jerome Marchand reported a lockdep warning as follows
>
> [ 6819.501009] =================================
> [ 6819.501009] [ INFO: inconsistent lock state ]
> [ 6819.501009] 4.2.0-rc1-shmacct-babka-v2-next-20150709+ #255 Not tainted
> [ 6819.501009] ---------------------------------
> [ 6819.501009] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> [ 6819.501009] kswapd0/38 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [ 6819.501009] (&sb->s_type->i_mutex_key#17){+.+.?.}, at: [<ffffffffa03772a5>] nfs_file_direct_write+0x85/0x3f0 [nfs]
> [ 6819.501009] {RECLAIM_FS-ON-W} state was registered at:
> [ 6819.501009] [<ffffffff81107f51>] mark_held_locks+0x71/0x90
> [ 6819.501009] [<ffffffff8110b775>] lockdep_trace_alloc+0x75/0xe0
> [ 6819.501009] [<ffffffff81245529>] kmem_cache_alloc_node_trace+0x39/0x440
> [ 6819.501009] [<ffffffff81225b8f>] __get_vm_area_node+0x7f/0x160
> [ 6819.501009] [<ffffffff81226eb2>] __vmalloc_node_range+0x72/0x2c0
> [ 6819.501009] [<ffffffff81227424>] vzalloc+0x54/0x60
> [ 6819.501009] [<ffffffff8122c7c8>] SyS_swapon+0x628/0xfc0
> [ 6819.501009] [<ffffffff81867772>] entry_SYSCALL_64_fastpath+0x12/0x76
>
> It's due to NFS acquiring i_mutex since a9ab5e840669 ("nfs: page
> cache invalidation for dio") to invalidate page cache before direct I/O.
> Filesystems may safely acquire i_mutex during direct writes but NFS is unique
> in its treatment of swap files. Ordinarily swap files are supported by the
> core VM looking up the physical block for a given offset in advance. There
> is no physical block for NFS and the direct write paths are used after
> calling mapping->swap_activate.
>
> The lockdep warning is triggered by swapon(), which is not in reclaim
> context, acquiring the i_mutex to ensure a swapfile is not activated twice.
>
> swapon does not need the i_mutex for this purpose. There is a requirement
> that fallocate not be used on swapfiles but this is protected by the inode
> flag S_SWAPFILE and nothing to do with i_mutex. In fact, the current
> protection does nothing for block devices. This patch expands the role
> of swap_lock to protect against parallel activations of block devices and
> swapfiles and removes the use of i_mutex. This both improves the protection
> for swapon and avoids the lockdep warning.
>
> Reported-by: Jerome Marchand <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>

Tested-by: Jerome Marchand <[email protected]>

Thanks,
Jerome

> ---
> mm/swapfile.c | 16 +++++++---------
> 1 file changed, 7 insertions(+), 9 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 41e4581af7c5..d58ed6833fa3 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1928,9 +1928,9 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> set_blocksize(bdev, old_block_size);
> blkdev_put(bdev, FMODE_READ | FMODE_WRITE | FMODE_EXCL);
> } else {
> - mutex_lock(&inode->i_mutex);
> + spin_lock(&swap_lock);
> inode->i_flags &= ~S_SWAPFILE;
> - mutex_unlock(&inode->i_mutex);
> + spin_unlock(&swap_lock);
> }
> filp_close(swap_file, NULL);
>
> @@ -2156,7 +2156,6 @@ static int claim_swapfile(struct swap_info_struct *p, struct inode *inode)
> p->flags |= SWP_BLKDEV;
> } else if (S_ISREG(inode->i_mode)) {
> p->bdev = inode->i_sb->s_bdev;
> - mutex_lock(&inode->i_mutex);
> if (IS_SWAPFILE(inode))
> return -EBUSY;
> } else
> @@ -2386,6 +2385,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> goto bad_swap;
> }
>
> + /* prevent parallel swapons */
> + spin_lock(&swap_lock);
> p->swap_file = swap_file;
> mapping = swap_file->f_mapping;
>
> @@ -2396,13 +2397,14 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> continue;
> if (mapping == q->swap_file->f_mapping) {
> error = -EBUSY;
> + spin_unlock(&swap_lock);
> goto bad_swap;
> }
> }
>
> inode = mapping->host;
> - /* If S_ISREG(inode->i_mode) will do mutex_lock(&inode->i_mutex); */
> error = claim_swapfile(p, inode);
> + spin_unlock(&swap_lock);
> if (unlikely(error))
> goto bad_swap;
>
> @@ -2543,10 +2545,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> vfree(swap_map);
> vfree(cluster_info);
> if (swap_file) {
> - if (inode && S_ISREG(inode->i_mode)) {
> - mutex_unlock(&inode->i_mutex);
> + if (inode && S_ISREG(inode->i_mode))
> inode = NULL;
> - }
> filp_close(swap_file, NULL);
> }
> out:
> @@ -2556,8 +2556,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> }
> if (name)
> putname(name);
> - if (inode && S_ISREG(inode->i_mode))
> - mutex_unlock(&inode->i_mutex);
> return error;
> }
>
>



Attachments:
signature.asc (473.00 B)
OpenPGP digital signature

2015-09-03 14:01:23

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH] nfs: avoid swap-over-NFS deadlock

On Tue, Sep 01, 2015 at 06:22:14PM +0200, Jerome Marchand wrote:
> > Thanks. Sorry for the long delay but I finally got back to the bug this
> > week. NFS can be modified to special case the swapfile but I was not happy
> > with the result for multiple reasons. It took me a while to see a way for
> > the core VM to deal with it. What do you think of the following
> > approach?
>
> Seems sound to me.
>
> > More importantly, does it work for you?
>
> Yes.
>

Sweet, thanks. It's the merge window now so it won't get fixed now and I
know there is already a collision with the mm merge. I'll rebase and
repost after 4.3-rc1 comes out.

--
Mel Gorman
SUSE Labs