Hi,
This lockdep warning appears when doing stress memory tests over NFS.
page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim
Any ideas?
Thanks,
Fengguang
---
[ 1630.751276] NFS: Server wrote zero bytes, expected 4096.
[ 1637.984875]
[ 1637.984878] =================================
[ 1637.987429] [ INFO: inconsistent lock state ]
[ 1637.987429] 2.6.30-rc8-mm1 #299
[ 1637.987429] ---------------------------------
[ 1637.987429] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[ 1637.987429] kswapd0/387 [HC0[0]:SC0[1]:HE1:SE0] takes:
[ 1637.987429] (sk_lock-AF_INET-RPC){+.+.?.}, at: [<ffffffff81458972>] tcp_sendmsg+0x22/0xbc0
[ 1637.987429] {RECLAIM_FS-ON-W} state was registered at:
[ 1637.987429] [<ffffffff81079b18>] mark_held_locks+0x68/0x90
[ 1637.987429] [<ffffffff81079c35>] lockdep_trace_alloc+0xf5/0x100
[ 1637.987429] [<ffffffff810c7f55>] __alloc_pages_nodemask+0x95/0x6c0
[ 1637.987429] [<ffffffff810f71f9>] __slab_alloc_page+0xb9/0x3b0
[ 1637.987429] [<ffffffff810f8596>] kmem_cache_alloc_node+0x166/0x200
[ 1637.987429] [<ffffffff81423cba>] __alloc_skb+0x4a/0x160
[ 1637.987429] [<ffffffff81466da6>] tcp_send_fin+0x86/0x1a0
[ 1637.987429] [<ffffffff814573c0>] tcp_close+0x3f0/0x4b0
[ 1637.987429] [<ffffffff81478dd2>] inet_release+0x42/0x70
[ 1637.987429] [<ffffffff8141b2d4>] sock_release+0x24/0x90
[ 1637.987429] [<ffffffff814d8fd8>] xs_reset_transport+0xb8/0xd0
[ 1637.987429] [<ffffffff814d900d>] xs_close+0x1d/0x60
[ 1637.987429] [<ffffffff814d9082>] xs_destroy+0x32/0xa0
[ 1637.987429] [<ffffffff814d69be>] xprt_destroy+0x6e/0x90
[ 1637.987429] [<ffffffff8126ed77>] kref_put+0x37/0x70
[ 1637.987429] [<ffffffff814d6940>] xprt_put+0x10/0x20
[ 1637.987429] [<ffffffff814d5e2b>] rpc_free_client+0x8b/0x100
[ 1637.987429] [<ffffffff8126ed77>] kref_put+0x37/0x70
[ 1637.987429] [<ffffffff814d5ee1>] rpc_free_auth+0x41/0x70
[ 1637.987429] [<ffffffff8126ed77>] kref_put+0x37/0x70
[ 1637.987429] [<ffffffff814d5d5e>] rpc_release_client+0x2e/0x70
[ 1637.987429] [<ffffffff814d5f5c>] rpc_shutdown_client+0x4c/0xf0
[ 1637.987429] [<ffffffff814e52b1>] rpcb_getport_sync+0xa1/0xf0
[ 1637.987429] [<ffffffff81814f5f>] nfs_root_data+0x3a9/0x40a
[ 1637.987429] [<ffffffff817f63ae>] mount_root+0x1f/0x141
[ 1637.987429] [<ffffffff817f65c8>] prepare_namespace+0xf8/0x190
[ 1637.987429] [<ffffffff817f5728>] kernel_init+0x1b5/0x1d2
[ 1637.987429] [<ffffffff8100d0ca>] child_rip+0xa/0x20
[ 1637.987429] [<ffffffffffffffff>] 0xffffffffffffffff
[ 1637.987429] irq event stamp: 285158
[ 1637.987429] hardirqs last enabled at (285157): [<ffffffff81544cff>] _spin_unlock_irqrestore+0x3f/0x70
[ 1637.987429] hardirqs last disabled at (285156): [<ffffffff8154506d>] _spin_lock_irqsave+0x2d/0x90
[ 1637.987429] softirqs last enabled at (285152): [<ffffffff814d7f8f>] xprt_transmit+0x1bf/0x2d0
[ 1637.987429] softirqs last disabled at (285158): [<ffffffff81544f67>] _spin_lock_bh+0x17/0x70
[ 1637.987429]
[ 1637.987429] other info that might help us debug this:
[ 1637.987429] no locks held by kswapd0/387.
[ 1637.987429]
[ 1637.987429] stack backtrace:
[ 1637.987429] Pid: 387, comm: kswapd0 Not tainted 2.6.30-rc8-mm1 #299
[ 1637.987429] Call Trace:
[ 1637.987429] [<ffffffff810793bc>] print_usage_bug+0x18c/0x1f0
[ 1638.251441] [<ffffffff810798bf>] mark_lock+0x49f/0x690
[ 1638.259418] [<ffffffff8107a310>] ? check_usage_forwards+0x0/0xc0
[ 1638.267420] [<ffffffff8107b039>] __lock_acquire+0x289/0x1b40
[ 1638.267420] [<ffffffff810127f0>] ? native_sched_clock+0x20/0x80
[ 1638.277673] [<ffffffff8107c9d1>] lock_acquire+0xe1/0x120
[ 1638.277673] [<ffffffff81458972>] ? tcp_sendmsg+0x22/0xbc0
[ 1638.287418] [<ffffffff8141d9b5>] lock_sock_nested+0x105/0x120
[ 1638.287418] [<ffffffff81458972>] ? tcp_sendmsg+0x22/0xbc0
[ 1638.287418] [<ffffffff810127f0>] ? native_sched_clock+0x20/0x80
[ 1638.287418] [<ffffffff81458972>] tcp_sendmsg+0x22/0xbc0
[ 1638.287418] [<ffffffff81077944>] ? find_usage_forwards+0x94/0xd0
[ 1638.287418] [<ffffffff81077944>] ? find_usage_forwards+0x94/0xd0
[ 1638.287418] [<ffffffff8141ad2f>] sock_sendmsg+0xdf/0x110
[ 1638.287418] [<ffffffff81077944>] ? find_usage_forwards+0x94/0xd0
[ 1638.287418] [<ffffffff81066ae0>] ? autoremove_wake_function+0x0/0x40
[ 1638.287418] [<ffffffff8107a36e>] ? check_usage_forwards+0x5e/0xc0
[ 1638.287418] [<ffffffff8107966e>] ? mark_lock+0x24e/0x690
[ 1638.287418] [<ffffffff8141b0b4>] kernel_sendmsg+0x34/0x50
[ 1638.287418] [<ffffffff814d9234>] xs_send_kvec+0x94/0xa0
[ 1638.287418] [<ffffffff81079e55>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1638.287418] [<ffffffff814d92bd>] xs_sendpages+0x7d/0x220
[ 1638.287418] [<ffffffff814d95b9>] xs_tcp_send_request+0x59/0x190
[ 1638.287418] [<ffffffff814d7e4e>] xprt_transmit+0x7e/0x2d0
[ 1638.287418] [<ffffffff814d4eb8>] call_transmit+0x1c8/0x2a0
[ 1638.287418] [<ffffffff814dccb2>] __rpc_execute+0xb2/0x2b0
[ 1638.287418] [<ffffffff814dced8>] rpc_execute+0x28/0x30
[ 1638.403414] [<ffffffff814d5b5b>] rpc_run_task+0x3b/0x80
[ 1638.403414] [<ffffffff811c7b6d>] nfs_write_rpcsetup+0x1ad/0x250
[ 1638.403414] [<ffffffff811c9b69>] nfs_flush_one+0xb9/0x100
[ 1638.419417] [<ffffffff811c3f82>] nfs_pageio_doio+0x32/0x70
[ 1638.419417] [<ffffffff811c3fc9>] nfs_pageio_complete+0x9/0x10
[ 1638.427413] [<ffffffff811c7ee5>] nfs_writepage_locked+0x85/0xc0
[ 1638.435414] [<ffffffff811c9ab0>] ? nfs_flush_one+0x0/0x100
[ 1638.435414] [<ffffffff81079e55>] ? trace_hardirqs_on_caller+0x155/0x1a0
[ 1638.435414] [<ffffffff811c8509>] nfs_writepage+0x19/0x40
[ 1638.435414] [<ffffffff810ce005>] shrink_page_list+0x675/0x810
[ 1638.435414] [<ffffffff810127f0>] ? native_sched_clock+0x20/0x80
[ 1638.435414] [<ffffffff810ce761>] shrink_list+0x301/0x650
[ 1638.435414] [<ffffffff810ced23>] shrink_zone+0x273/0x370
[ 1638.435414] [<ffffffff810cf9f9>] kswapd+0x729/0x7a0
[ 1638.435414] [<ffffffff810cca80>] ? isolate_pages_global+0x0/0x250
[ 1638.435414] [<ffffffff81066ae0>] ? autoremove_wake_function+0x0/0x40
[ 1638.435414] [<ffffffff810cf2d0>] ? kswapd+0x0/0x7a0
[ 1638.435414] [<ffffffff810666de>] kthread+0x9e/0xb0
[ 1638.435414] [<ffffffff8100d0ca>] child_rip+0xa/0x20
[ 1638.435414] [<ffffffff8100ca90>] ? restore_args+0x0/0x30
[ 1638.435414] [<ffffffff81066640>] ? kthread+0x0/0xb0
[ 1638.435414] [<ffffffff8100d0c0>] ? child_rip+0x0/0x20
Hi
> Hi,
>
> This lockdep warning appears when doing stress memory tests over NFS.
>
> page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
>
> tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim
>
> Any ideas?
AFAIK, btrfs has a re-dirty hack.
------------------------------------------------------------------
static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
{
	struct extent_io_tree *tree;

	if (current->flags & PF_MEMALLOC) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}
	tree = &BTRFS_I(page->mapping->host)->io_tree;
	return extent_write_full_page(tree, page, btrfs_get_extent, wbc);
}
---------------------------------------------------------------
PF_MEMALLOC means the caller is try_to_free_pages() (not a normal write, nor kswapd).
Can't NFS do a similar hack, something like the sketch below?
I'm neither a net nor an NFS expert, so perhaps I'm wrong :-)
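A completely untested sketch of what I mean (whether NFS can safely just
redirty here is exactly the question; nfs_writepage_locked() is taken from
the trace above):
------------------------------------------------------------------
/* untested sketch: reuse the btrfs PF_MEMALLOC check in NFS, so that
 * writepage bails out when it is entered from the allocator itself */
int nfs_writepage(struct page *page, struct writeback_control *wbc)
{
	int ret;

	if (current->flags & PF_MEMALLOC) {
		/* entered from page reclaim: redirty and retry later */
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}
	ret = nfs_writepage_locked(page, wbc);
	unlock_page(page);
	return ret;
}
---------------------------------------------------------------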
Thanks.
On Mon, Jun 08, 2009 at 12:55:18PM +0800, KOSAKI Motohiro wrote:
> Hi
>
> > Hi,
> >
> > This lockdep warning appears when doing stress memory tests over NFS.
> >
> > page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
> >
> > tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim
> >
> > Any ideas?
>
> AFAIK, btrfs has a re-dirty hack.
>
> ------------------------------------------------------------------
> static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
> {
> 	struct extent_io_tree *tree;
>
> 	if (current->flags & PF_MEMALLOC) {
> 		redirty_page_for_writepage(wbc, page);
> 		unlock_page(page);
> 		return 0;
> 	}
> 	tree = &BTRFS_I(page->mapping->host)->io_tree;
> 	return extent_write_full_page(tree, page, btrfs_get_extent, wbc);
> }
> ---------------------------------------------------------------
>
> PF_MEMALLOC means the caller is try_to_free_pages() (not a normal write, nor kswapd).
> Can't NFS do a similar hack?
But the trace shows that current is kswapd:
[ 1638.403414] [<ffffffff811c9b69>] nfs_flush_one+0xb9/0x100
[ 1638.419417] [<ffffffff811c3f82>] nfs_pageio_doio+0x32/0x70
[ 1638.419417] [<ffffffff811c3fc9>] nfs_pageio_complete+0x9/0x10
[ 1638.427413] [<ffffffff811c7ee5>] nfs_writepage_locked+0x85/0xc0
[ 1638.435414] [<ffffffff811c8509>] nfs_writepage+0x19/0x40
[ 1638.435414] [<ffffffff810ce005>] shrink_page_list+0x675/0x810
[ 1638.435414] [<ffffffff810ce761>] shrink_list+0x301/0x650
[ 1638.435414] [<ffffffff810ced23>] shrink_zone+0x273/0x370
[ 1638.435414] [<ffffffff810cf9f9>] kswapd+0x729/0x7a0
[ 1638.435414] [<ffffffff810666de>] kthread+0x9e/0xb0
[ 1638.435414] [<ffffffff8100d0ca>] child_rip+0xa/0x20
> On Mon, Jun 08, 2009 at 12:55:18PM +0800, KOSAKI Motohiro wrote:
> > Hi
> >
> > > Hi,
> > >
> > > This lockdep warning appears when doing stress memory tests over NFS.
> > >
> > > page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
> > >
> > > tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim
> > >
> > > Any ideas?
> >
> > AFAIK, btrfs has a re-dirty hack.
> >
> > ------------------------------------------------------------------
> > static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
> > {
> > 	struct extent_io_tree *tree;
> >
> > 	if (current->flags & PF_MEMALLOC) {
> > 		redirty_page_for_writepage(wbc, page);
> > 		unlock_page(page);
> > 		return 0;
> > 	}
> > 	tree = &BTRFS_I(page->mapping->host)->io_tree;
> > 	return extent_write_full_page(tree, page, btrfs_get_extent, wbc);
> > }
> > ---------------------------------------------------------------
> >
> > PF_MEMALLOC means the caller is try_to_free_pages() (not a normal write, nor kswapd).
> > Can't NFS do a similar hack?
>
> But the trace shows that current is kswapd:
>
> [ 1638.403414] [<ffffffff811c9b69>] nfs_flush_one+0xb9/0x100
> [ 1638.419417] [<ffffffff811c3f82>] nfs_pageio_doio+0x32/0x70
> [ 1638.419417] [<ffffffff811c3fc9>] nfs_pageio_complete+0x9/0x10
> [ 1638.427413] [<ffffffff811c7ee5>] nfs_writepage_locked+0x85/0xc0
> [ 1638.435414] [<ffffffff811c8509>] nfs_writepage+0x19/0x40
> [ 1638.435414] [<ffffffff810ce005>] shrink_page_list+0x675/0x810
> [ 1638.435414] [<ffffffff810ce761>] shrink_list+0x301/0x650
> [ 1638.435414] [<ffffffff810ced23>] shrink_zone+0x273/0x370
> [ 1638.435414] [<ffffffff810cf9f9>] kswapd+0x729/0x7a0
> [ 1638.435414] [<ffffffff810666de>] kthread+0x9e/0xb0
> [ 1638.435414] [<ffffffff8100d0ca>] child_rip+0xa/0x20
kswapd can't hold the sk_lock before calling reclaim. Thus, we don't
need to care about this bogus warning, I think.
On Mon, Jun 08, 2009 at 01:07:26PM +0800, KOSAKI Motohiro wrote:
> > On Mon, Jun 08, 2009 at 12:55:18PM +0800, KOSAKI Motohiro wrote:
> > > Hi
> > >
> > > > Hi,
> > > >
> > > > This lockdep warning appears when doing stress memory tests over NFS.
> > > >
> > > > page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
> > > >
> > > > tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim
> > > >
> > > > Any ideas?
> > >
> > > AFAIK, btrfs has a re-dirty hack.
> > >
> > > ------------------------------------------------------------------
> > > static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
> > > {
> > > 	struct extent_io_tree *tree;
> > >
> > > 	if (current->flags & PF_MEMALLOC) {
> > > 		redirty_page_for_writepage(wbc, page);
> > > 		unlock_page(page);
> > > 		return 0;
> > > 	}
> > > 	tree = &BTRFS_I(page->mapping->host)->io_tree;
> > > 	return extent_write_full_page(tree, page, btrfs_get_extent, wbc);
> > > }
> > > ---------------------------------------------------------------
> > >
> > > PF_MEMALLOC means the caller is try_to_free_pages() (not a normal write, nor kswapd).
> > > Can't NFS do a similar hack?
> >
> > But the trace shows that current is kswapd:
> >
> > [ 1638.403414] [<ffffffff811c9b69>] nfs_flush_one+0xb9/0x100
> > [ 1638.419417] [<ffffffff811c3f82>] nfs_pageio_doio+0x32/0x70
> > [ 1638.419417] [<ffffffff811c3fc9>] nfs_pageio_complete+0x9/0x10
> > [ 1638.427413] [<ffffffff811c7ee5>] nfs_writepage_locked+0x85/0xc0
> > [ 1638.435414] [<ffffffff811c8509>] nfs_writepage+0x19/0x40
> > [ 1638.435414] [<ffffffff810ce005>] shrink_page_list+0x675/0x810
> > [ 1638.435414] [<ffffffff810ce761>] shrink_list+0x301/0x650
> > [ 1638.435414] [<ffffffff810ced23>] shrink_zone+0x273/0x370
> > [ 1638.435414] [<ffffffff810cf9f9>] kswapd+0x729/0x7a0
> > [ 1638.435414] [<ffffffff810666de>] kthread+0x9e/0xb0
> > [ 1638.435414] [<ffffffff8100d0ca>] child_rip+0xa/0x20
>
> kswapd can't hold the sk_lock before calling reclaim. Thus, we don't
> need to care about this bogus warning, I think.
Right. Although this path is possible:
tcp_sendmsg() => page reclaim => tcp_send_fin()
it won't happen for the same socket, so one sk_lock won't be
grabbed twice to deadlock.
So it's a harmless warning for both direct and background page reclaim?
Thanks,
Fengguang
On Mon, Jun 08, 2009 at 01:53:26PM +0800, Wu Fengguang wrote:
> On Mon, Jun 08, 2009 at 01:07:26PM +0800, KOSAKI Motohiro wrote:
> > > On Mon, Jun 08, 2009 at 12:55:18PM +0800, KOSAKI Motohiro wrote:
> > > > Hi
> > > >
> > > > > Hi,
> > > > >
> > > > > This lockdep warning appears when doing stress memory tests over NFS.
> > > > >
> > > > > page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
> > > > >
> > > > > tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim
> > > > >
> > > > > Any ideas?
> > > >
> > > > AFAIK, btrfs has a re-dirty hack.
> > > >
> > > > ------------------------------------------------------------------
> > > > static int btrfs_writepage(struct page *page, struct writeback_control *wbc)
> > > > {
> > > > 	struct extent_io_tree *tree;
> > > >
> > > > 	if (current->flags & PF_MEMALLOC) {
> > > > 		redirty_page_for_writepage(wbc, page);
> > > > 		unlock_page(page);
> > > > 		return 0;
> > > > 	}
> > > > 	tree = &BTRFS_I(page->mapping->host)->io_tree;
> > > > 	return extent_write_full_page(tree, page, btrfs_get_extent, wbc);
> > > > }
> > > > ---------------------------------------------------------------
> > > >
> > > > PF_MEMALLOC means the caller is try_to_free_pages() (not a normal write, nor kswapd).
> > > > Can't NFS do a similar hack?
> > >
> > > But the trace shows that current is kswapd:
> > >
> > > [ 1638.403414] [<ffffffff811c9b69>] nfs_flush_one+0xb9/0x100
> > > [ 1638.419417] [<ffffffff811c3f82>] nfs_pageio_doio+0x32/0x70
> > > [ 1638.419417] [<ffffffff811c3fc9>] nfs_pageio_complete+0x9/0x10
> > > [ 1638.427413] [<ffffffff811c7ee5>] nfs_writepage_locked+0x85/0xc0
> > > [ 1638.435414] [<ffffffff811c8509>] nfs_writepage+0x19/0x40
> > > [ 1638.435414] [<ffffffff810ce005>] shrink_page_list+0x675/0x810
> > > [ 1638.435414] [<ffffffff810ce761>] shrink_list+0x301/0x650
> > > [ 1638.435414] [<ffffffff810ced23>] shrink_zone+0x273/0x370
> > > [ 1638.435414] [<ffffffff810cf9f9>] kswapd+0x729/0x7a0
> > > [ 1638.435414] [<ffffffff810666de>] kthread+0x9e/0xb0
> > > [ 1638.435414] [<ffffffff8100d0ca>] child_rip+0xa/0x20
> >
> > kswapd can't hold the sk_lock before calling reclaim. Thus, we don't
> > need to care about this bogus warning, I think.
>
> Right. Although this path is possible:
> tcp_sendmsg() => page reclaim => tcp_send_fin()
> it won't happen for the same socket, so one sk_lock won't be
> grabbed twice to deadlock.
>
> So it's a harmless warning for both direct and background page reclaim?
btw, can anyone explain these NFS warnings? They happen in a very
memory-tight and busy nfsroot system.
[ 113.267340] NFS: Server wrote zero bytes, expected 3671.
[ 423.202607] NFS: Server wrote zero bytes, expected 108.
[ 723.588411] NFS: Server wrote zero bytes, expected 560.
[ 1060.246747] NFS: Server wrote zero bytes, expected 54.
[ 1397.841183] NFS: Server wrote zero bytes, expected 402.
[ 1779.545035] NFS: Server wrote zero bytes, expected 319.
Thanks,
Fengguang
> btw, can anyone explain these NFS warnings? They happen in a very
> memory-tight and busy nfsroot system.
>
> [ 113.267340] NFS: Server wrote zero bytes, expected 3671.
> [ 423.202607] NFS: Server wrote zero bytes, expected 108.
> [ 723.588411] NFS: Server wrote zero bytes, expected 560.
> [ 1060.246747] NFS: Server wrote zero bytes, expected 54.
> [ 1397.841183] NFS: Server wrote zero bytes, expected 402.
> [ 1779.545035] NFS: Server wrote zero bytes, expected 319.
The server-side write function is below:
-----------------------------------------------------------
static __be32
nfsd_vfs_write(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
	       loff_t offset, struct kvec *vec, int vlen,
	       unsigned long *cnt, int *stablep)
{
	(snip)
	host_err = vfs_writev(file, (struct iovec __user *)vec, vlen, &offset); /* (1) */
	(snip)
	/*
	 * Gathered writes: If another process is currently
	 * writing to the file, there's a high chance
	 * this is another nfsd (triggered by a bulk write
	 * from a client's biod). Rather than syncing the
	 * file with each write request, we sleep for 10 msec.
	 *
	 * I don't know if this roughly approximates
	 * C. Juszak's idea of gathered writes, but it's a
	 * nice and simple solution (IMHO), and it seems to
	 * work:-)
	 */
	if (EX_WGATHER(exp)) {
		(snip)
		if (inode->i_state & I_DIRTY) {
			dprintk("nfsd: write sync %d\n", task_pid_nr(current));
			host_err = nfsd_sync(file); /* (2) */
		}
	}
	(snip)
	dprintk("nfsd: write complete host_err=%d\n", host_err);
	if (host_err >= 0) {
		err = 0;
		*cnt = host_err;
	} else
		err = nfserrno(host_err);
out:
	return err;
}
---------------------------------------------------------------------------
If (1) or (2) ends up with host_err == 0, that produces your warning messages.
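For reference, the client side prints that message from nfs_writeback_done()
in fs/nfs/write.c. Roughly like below (paraphrased from memory, please check
the real source):
-----------------------------------------------------------
/* paraphrase of the client-side check: resp->count is how much the
 * server claims it wrote, argp->count is how much the client sent */
if (task->tk_status >= 0 && resp->count < argp->count) {
	if (resp->count == 0) {
		printk(KERN_WARNING
		       "NFS: Server wrote zero bytes, expected %u.\n",
		       argp->count);
		task->tk_status = -EIO;	/* nothing else we can do */
	}
	/* a non-zero short write is instead resent from where
	 * the server left off */
}
---------------------------------------------------------------------------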
Thanks.
From: Herbert Xu <[email protected]>
Date: Fri, 10 Jul 2009 16:02:47 +0800
> On Fri, Jul 10, 2009 at 04:00:17PM +0800, Wu Fengguang wrote:
>>
>> The (sk_allocation & ~__GFP_WAIT) cases should be rare, but I guess
>> the networking code will have to do it anyway, because sk_allocation
>> defaults to GFP_KERNEL. It seems that currently the networking code
>> simply uses a lot of GFP_ATOMIC; do they really mean "I cannot sleep"?
>
> Yep because they're done from softirq context.
Yes, this is the core issue.
All of Wu's talk about how "GFP_ATOMIC will wake up kswapd and
therefore can succeed just as well as GFP_KERNEL" is not relevant,
because GFP_ATOMIC means sleeping is not allowed.
On Wed, Jul 15, 2009 at 12:04:32AM +0800, David Miller wrote:
> From: Herbert Xu <[email protected]>
> Date: Fri, 10 Jul 2009 16:02:47 +0800
>
> > On Fri, Jul 10, 2009 at 04:00:17PM +0800, Wu Fengguang wrote:
> >>
> >> The (sk_allocation & ~__GFP_WAIT) cases should be rare, but I guess
> >> the networking code will have to do it anyway, because sk_allocation
> >> defaults to GFP_KERNEL. It seems that currently the networking code
> >> simply uses a lot of GFP_ATOMIC; do they really mean "I cannot sleep"?
> >
> > Yep because they're done from softirq context.
>
> Yes, this is the core issue.
Yes, that's generally true. But...
> All of Wu's talk about how "GFP_ATOMIC will wake up kswapd and
> therefore can succeed just as well as GFP_KERNEL" is not relevant,
> because GFP_ATOMIC means sleeping is not allowed.
We are talking about tcp_send_fin() here, which can sleep.
Thanks,
Fengguang
Wu Fengguang <[email protected]> wrote:
>
> This lockdep warning appears when doing stress memory tests over NFS.
>
> page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
>
> tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim
Well perhaps not this particular path, but it is certainly possible
if an existing NFS socket dies and NFS tries to reestablish it.
I suggest that NFS should utilise the sk_allocation field and
set an appropriate value. Note that you may have to patch TCP
so that it uses sk_allocation everywhere necessary, e.g., in
tcp_send_fin.
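For example (illustrative only; picking the right mode is the open question):
------------------------------------------------------------------
	/* illustrative only: in the sunrpc socket setup, e.g.
	 * xs_tcp_finish_connecting(), pick an allocation mode that
	 * cannot recurse into FS writeback for everything TCP
	 * allocates on this socket */
	sk->sk_allocation = GFP_ATOMIC;	/* or GFP_NOIO */
---------------------------------------------------------------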
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Mon, Jul 06, 2009 at 06:52:16PM +0800, Herbert Xu wrote:
> Wu Fengguang <[email protected]> wrote:
> >
> > This lockdep warning appears when doing stress memory tests over NFS.
> >
> > page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
> >
> > tcp_close => lock sk_lock => tcp_send_fin => alloc_skb_fclone => page reclaim
>
> Well perhaps not this particular path, but it is certainly possible
> if an existing NFS socket dies and NFS tries to reestablish it.
>
> I suggest that NFS should utilise the sk_allocation field and
> set an appropriate value. Note that you may have to patch TCP
> so that it uses sk_allocation everywhere necessary, e.g., in
> tcp_send_fin.
Good suggestion! NFS already sets sk_allocation to GFP_ATOMIC in
linux/net/sunrpc/xprtsock.c <<xs_tcp_finish_connecting>>.
To fix this warning and possible recursion, I converted some
GFP_KERNEL cases to sk_allocation in the tcp/ipv4 code:
---
tcp: replace hard coded GFP_KERNEL with sk_allocation
This fixed a lockdep warning which appeared when doing stress
memory tests over NFS:
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
mount_root => nfs_root_data => tcp_close => lock sk_lock =>
tcp_send_fin => alloc_skb_fclone => page reclaim
page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
CC: Arnaldo Carvalho de Melo <[email protected]>
CC: David S. Miller <[email protected]>
CC: Herbert Xu <[email protected]>
Signed-off-by: Wu Fengguang <[email protected]>
---
net/ipv4/tcp.c | 4 ++--
net/ipv4/tcp_ipv4.c | 5 +++--
net/ipv4/tcp_output.c | 5 +++--
3 files changed, 8 insertions(+), 6 deletions(-)
--- linux.orig/net/ipv4/tcp_ipv4.c
+++ linux/net/ipv4/tcp_ipv4.c
@@ -970,8 +970,9 @@ static int tcp_v4_parse_md5_keys(struct
 	if (!tcp_sk(sk)->md5sig_info) {
 		struct tcp_sock *tp = tcp_sk(sk);
-		struct tcp_md5sig_info *p = kzalloc(sizeof(*p), GFP_KERNEL);
+		struct tcp_md5sig_info *p;
 
+		p = kzalloc(sizeof(*p), sk->sk_allocation);
 		if (!p)
 			return -EINVAL;
@@ -979,7 +980,7 @@ static int tcp_v4_parse_md5_keys(struct
 		sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
 	}
 
-	newkey = kmemdup(cmd.tcpm_key, cmd.tcpm_keylen, GFP_KERNEL);
+	newkey = kmemdup(cmd.tcpm_key, cmd.tcpm_keylen, sk->sk_allocation);
 	if (!newkey)
 		return -ENOMEM;
 	return tcp_v4_md5_do_add(sk, sin->sin_addr.s_addr,
--- linux.orig/net/ipv4/tcp_output.c
+++ linux/net/ipv4/tcp_output.c
@@ -2100,7 +2100,8 @@ void tcp_send_fin(struct sock *sk)
 	} else {
 		/* Socket is locked, keep trying until memory is available. */
 		for (;;) {
-			skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL);
+			skb = alloc_skb_fclone(MAX_TCP_HEADER,
+					       sk->sk_allocation);
 			if (skb)
 				break;
 			yield();
@@ -2358,7 +2359,7 @@ int tcp_connect(struct sock *sk)
 	sk->sk_wmem_queued += buff->truesize;
 	sk_mem_charge(sk, buff->truesize);
 	tp->packets_out += tcp_skb_pcount(buff);
-	tcp_transmit_skb(sk, buff, 1, GFP_KERNEL);
+	tcp_transmit_skb(sk, buff, 1, sk->sk_allocation);
 
 	/* We change tp->snd_nxt after the tcp_transmit_skb() call
 	 * in order to make this packet get counted in tcpOutSegs.
--- linux.orig/net/ipv4/tcp.c
+++ linux/net/ipv4/tcp.c
@@ -1834,7 +1834,7 @@ void tcp_close(struct sock *sk, long tim
 		/* Unread data was tossed, zap the connection. */
 		NET_INC_STATS_USER(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
 		tcp_set_state(sk, TCP_CLOSE);
-		tcp_send_active_reset(sk, GFP_KERNEL);
+		tcp_send_active_reset(sk, sk->sk_allocation);
 	} else if (sock_flag(sk, SOCK_LINGER) && !sk->sk_lingertime) {
 		/* Check zero linger _after_ checking for unread data. */
 		sk->sk_prot->disconnect(sk, 0);
@@ -2666,7 +2666,7 @@ static struct tcp_md5sig_pool **__tcp_al
 		struct tcp_md5sig_pool *p;
 		struct crypto_hash *hash;
 
-		p = kzalloc(sizeof(*p), GFP_KERNEL);
+		p = kzalloc(sizeof(*p), sk->sk_allocation);
 		if (!p)
 			goto out_free;
 		*per_cpu_ptr(pool, cpu) = p;
From: Wu Fengguang <[email protected]>
Date: Thu, 9 Jul 2009 21:17:46 +0800
> @@ -2100,7 +2100,8 @@ void tcp_send_fin(struct sock *sk)
>  	} else {
>  		/* Socket is locked, keep trying until memory is available. */
>  		for (;;) {
> -			skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL);
> +			skb = alloc_skb_fclone(MAX_TCP_HEADER,
> +					       sk->sk_allocation);
>  			if (skb)
>  				break;
>  			yield();
I think this specific case needs more thinking.
If the allocation fails, and it's GFP_ATOMIC, we are going to yield()
(which sleeps) and loop endlessly waiting for the allocation to
succeed.
On Thu, Jul 09, 2009 at 05:13:55PM -0700, David Miller wrote:
>
> I think this specific case needs more thinking.
>
> If the allocation fails, and it's GFP_ATOMIC, we are going to yield()
> (which sleeps) and loop endlessly waiting for the allocation to
> succeed.
Indeed. We could do one of the following:
1) Preallocate, either universally or conditionally on sk_allocation.
2) Set a bit somewhere to indicate FIN and retry the allocation as
part of retransmit.
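For 1), a rough sketch (tp->fin_skb is a made-up field, purely for
illustration):
------------------------------------------------------------------
/* rough sketch of option 1: reserve the FIN skb early, while a
 * sleeping allocation is still safe, so that tcp_send_fin() never
 * has to allocate under the socket lock */
static int tcp_reserve_fin_skb(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);

	if (!tp->fin_skb)
		tp->fin_skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL);
	return tp->fin_skb ? 0 : -ENOMEM;
}
---------------------------------------------------------------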
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Fri, Jul 10, 2009 at 08:13:55AM +0800, David Miller wrote:
> From: Wu Fengguang <[email protected]>
> Date: Thu, 9 Jul 2009 21:17:46 +0800
>
> > @@ -2100,7 +2100,8 @@ void tcp_send_fin(struct sock *sk)
> >  	} else {
> >  		/* Socket is locked, keep trying until memory is available. */
> >  		for (;;) {
> > -			skb = alloc_skb_fclone(MAX_TCP_HEADER, GFP_KERNEL);
> > +			skb = alloc_skb_fclone(MAX_TCP_HEADER,
> > +					       sk->sk_allocation);
> >  			if (skb)
> >  				break;
> >  			yield();
>
> I think this specific case needs more thinking.
>
> If the allocation fails, and it's GFP_ATOMIC, we are going to yield()
> (which sleeps) and loop endlessly waiting for the allocation to
> succeed.
A _retried_ GFP_ATOMIC won't be much worse than GFP_KERNEL.
GFP_KERNEL can directly reclaim FS pages; GFP_ATOMIC will wake up
kswapd to do that. So after yield(), GFP_ATOMIC has a good opportunity
to succeed whenever GFP_KERNEL could succeed.
The original GFP_KERNEL does have _a bit_ better chance to succeed,
but there is no guarantee. It could loop endlessly whether it is
GFP_KERNEL or GFP_ATOMIC.
btw, generally speaking, it would be more robust for NFS to set
sk_allocation to GFP_NOIO, and to let the networking code choose
between plain sk_allocation and (sk_allocation & ~__GFP_WAIT),
as in the sketch below.
The (sk_allocation & ~__GFP_WAIT) cases should be rare, but I guess
the networking code will have to do it anyway, because sk_allocation
defaults to GFP_KERNEL. It seems that currently the networking code
simply uses a lot of GFP_ATOMIC; do they really mean "I cannot sleep"?
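I.e. something like this sketch:
------------------------------------------------------------------
/* sketch only: honour the per-socket mode, stripping the sleeping
 * bits when we may be in softirq context, instead of hard coding
 * GFP_ATOMIC at every call site */
static inline gfp_t sk_gfp(struct sock *sk)
{
	return in_softirq() ? sk->sk_allocation & ~__GFP_WAIT
			    : sk->sk_allocation;
}
---------------------------------------------------------------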
Thanks,
Fengguang
On Fri, Jul 10, 2009 at 04:00:17PM +0800, Wu Fengguang wrote:
>
> The (sk_allocation & ~__GFP_WAIT) cases should be rare, but I guess
> the networking code will have to do it anyway, because sk_allocation
> defaults to GFP_KERNEL. It seems that currently the networking code
> simply uses a lot of GFP_ATOMIC; do they really mean "I cannot sleep"?
Yep because they're done from softirq context.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt