Return-Path: Received: from mail-io0-f194.google.com ([209.85.223.194]:34418 "EHLO mail-io0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933361AbcE3UH6 convert rfc822-to-8bit (ORCPT ); Mon, 30 May 2016 16:07:58 -0400 Received: by mail-io0-f194.google.com with SMTP id l9so1510448ioe.1 for ; Mon, 30 May 2016 13:07:57 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: [PATCH v2] nfs: avoid race that crashes nfs_init_commit From: Weston Andros Adamson In-Reply-To: <3FB15BEF-F7A8-4643-90C8-D3292FA0BE61@monkey.org> Date: Mon, 30 May 2016 16:07:54 -0400 Cc: linux-nfs list Message-Id: References: <1464185243-15556-1-git-send-email-dros@primarydata.com> <26144774-9f6a-4a05-8d87-e3eb201adc5e@Netapp.com> <3FB15BEF-F7A8-4643-90C8-D3292FA0BE61@monkey.org> To: Trond Myklebust , Anna Schumaker Sender: linux-nfs-owner@vger.kernel.org List-ID: So, some bad news: this patch isn’t quite right. When cancelling, I called nfs_commitdata_release which puts the open context before freeing the commit data, but the open context isn’t grabbed until nfs_init_commit - the call that we avoid with this path! The fix is to call nfs_commit_free instead. I have a patch that I’m about to hand off to PD's QA to test out… This can be pretty difficult to hit. Anna & Trond - would you like the fix now, or after QA has pushed it through testing? Sorry for the churn, -dros > On May 25, 2016, at 2:31 PM, Weston Andros Adamson wrote: > > > >> On May 25, 2016, at 12:53 PM, Trond Myklebust wrote: >> >> >> >> On 5/25/16, 12:40, "Anna Schumaker" wrote: >> >>> Thanks, Dros! >>> >>> How often are you hitting this? > > Not that often, the only way I’ve seen it hit is if a normal commit races with a commit > from kswapd evicting pages, ie under heavy memory load. > >> Do you want me to send it to stable? >>> >> >> Agreed that it should go to stable. >> >> Dros, do you know how far back? I’d assume 4.0.x, if it still applies to those kernels. > > This was introduced by the patch: > > commit af7cf057933f01dc7f33ddfb5e436ad598ed17ad > Author: Trond Myklebust > Date: Tue Sep 29 20:34:05 2015 -0400 > > NFS: Allow multiple commit requests in flight per file > > > git branch -r --contains=af7cf057933f01dc7f33ddfb5e436ad598ed17ad | grep stable > stable/linux-4.5.y > stable/linux-4.6.y > stable/master > > So yeah, we should also send to stable. > > -dros > >> >>> Anna >>> >>> On 05/25/2016 10:07 AM, Weston Andros Adamson wrote: >>>> Since the patch "NFS: Allow multiple commit requests in flight per file" >>>> we can run multiple simultaneous commits on the same inode. This >>>> introduced a race over collecting pages to commit that made it possible >>>> to call nfs_init_commit() with an empty list - which causes crashes like >>>> the one below. >>>> >>>> The fix is to catch this race and avoid calling nfs_init_commit and >>>> initiate_commit when there is no work to do. >>>> >>>> Here is the crash: >>>> >>>> [600522.076832] BUG: unable to handle kernel NULL pointer dereference at 0000000000000040 >>>> [600522.078475] IP: [] nfs_init_commit+0x22/0x130 [nfs] >>>> [600522.078745] PGD 4272b1067 PUD 4272cb067 PMD 0 >>>> [600522.078972] Oops: 0000 [#1] SMP >>>> [600522.079204] Modules linked in: nfsv3 nfs_layout_flexfiles rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache dcdbas ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw vmw_vsock_vmci_transport vsock bonding ipmi_devintf ipmi_msghandler coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ppdev vmw_balloon parport_pc parport acpi_cpufreq vmw_vmci i2c_piix4 shpchp nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c vmwgfx drm_kms_helper ttm drm crc32c_intel serio_raw vmxnet3 >>>> [600522.081380] vmw_pvscsi ata_generic pata_acpi >>>> [600522.081809] CPU: 3 PID: 15667 Comm: /usr/bin/python Not tainted 4.1.9-100.pd.88.el7.x86_64 #1 >>>> [600522.082281] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/30/2014 >>>> [600522.082814] task: ffff8800bbbfa780 ti: ffff88042ae84000 task.ti: ffff88042ae84000 >>>> [600522.083378] RIP: 0010:[] [] nfs_init_commit+0x22/0x130 [nfs] >>>> [600522.083973] RSP: 0018:ffff88042ae87438 EFLAGS: 00010246 >>>> [600522.084571] RAX: 0000000000000000 RBX: ffff880003485e40 RCX: ffff88042ae87588 >>>> [600522.085188] RDX: 0000000000000000 RSI: ffff88042ae874b0 RDI: ffff880003485e40 >>>> [600522.085756] RBP: ffff88042ae87448 R08: ffff880003486010 R09: ffff88042ae874b0 >>>> [600522.086332] R10: 0000000000000000 R11: 0000000000000005 R12: ffff88042ae872d0 >>>> [600522.086905] R13: ffff88042ae874b0 R14: ffff880003485e40 R15: ffff88042704c840 >>>> [600522.087484] FS: 00007f4728ff2740(0000) GS:ffff88043fd80000(0000) knlGS:0000000000000000 >>>> [600522.088070] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b >>>> [600522.088663] CR2: 0000000000000040 CR3: 000000042b6aa000 CR4: 00000000001406e0 >>>> [600522.089327] Stack: >>>> [600522.089926] 0000000000000001 ffff88042ae87588 ffff88042ae874f8 ffffffffa04f09fa >>>> [600522.090549] 0000000000017840 0000000000017840 ffff88042ae87588 ffff8803258d9930 >>>> [600522.091169] ffff88042ae87578 ffffffffa0563d80 0000000000000000 ffff88042704c840 >>>> [600522.091789] Call Trace: >>>> [600522.092420] [] pnfs_generic_commit_pagelist+0x1da/0x320 [nfsv4] >>>> [600522.093052] [] ? ff_layout_commit_prepare_v3+0x30/0x30 [nfs_layout_flexfiles] >>>> [600522.093696] [] ff_layout_commit_pagelist+0x15/0x20 [nfs_layout_flexfiles] >>>> [600522.094359] [] nfs_generic_commit_list+0xe8/0x120 [nfs] >>>> [600522.095032] [] nfs_commit_inode+0xba/0x110 [nfs] >>>> [600522.095719] [] nfs_release_page+0x44/0xd0 [nfs] >>>> [600522.096410] [] try_to_release_page+0x32/0x50 >>>> [600522.097109] [] shrink_page_list+0x961/0xb30 >>>> [600522.097812] [] shrink_inactive_list+0x1cd/0x550 >>>> [600522.098530] [] shrink_lruvec+0x635/0x840 >>>> [600522.099250] [] shrink_zone+0xf0/0x2f0 >>>> [600522.099974] [] do_try_to_free_pages+0x192/0x470 >>>> [600522.100709] [] try_to_free_pages+0xda/0x170 >>>> [600522.101464] [] __alloc_pages_nodemask+0x588/0x970 >>>> [600522.102235] [] alloc_pages_vma+0xb5/0x230 >>>> [600522.103000] [] ? cpumask_any_but+0x39/0x50 >>>> [600522.103774] [] wp_page_copy.isra.55+0x95/0x490 >>>> [600522.104558] [] ? __wake_up+0x48/0x60 >>>> [600522.105357] [] do_wp_page+0xab/0x4f0 >>>> [600522.106137] [] ? release_task+0x36b/0x470 >>>> [600522.106902] [] ? eventfd_ctx_read+0x67/0x1c0 >>>> [600522.107659] [] handle_mm_fault+0xc78/0x1900 >>>> [600522.108431] [] __do_page_fault+0x181/0x420 >>>> [600522.109173] [] ? __audit_syscall_exit+0x1e6/0x280 >>>> [600522.109893] [] do_page_fault+0x30/0x80 >>>> [600522.110594] [] ? syscall_trace_leave+0xc6/0x120 >>>> [600522.111288] [] page_fault+0x28/0x30 >>>> [600522.111947] Code: 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 4c 8d 87 d0 01 00 00 48 89 e5 53 48 89 fb 48 83 ec 08 4c 8b 0e 49 8b 41 18 4c 39 ce <48> 8b 40 40 4c 8b 50 30 74 24 48 8b 87 d0 01 00 00 48 8b 7e 08 >>>> [600522.113343] RIP [] nfs_init_commit+0x22/0x130 [nfs] >>>> [600522.114003] RSP >>>> [600522.114636] CR2: 0000000000000040 >>>> >>>> Signed-off-by: Weston Andros Adamson >>>> --- >>>> >>>> V2 has some cleanup requested by Anna. >>>> >>>> -dros >>>> >>>> fs/nfs/pnfs_nfs.c | 28 ++++++++++++++++++++++++++++ >>>> fs/nfs/write.c | 4 ++++ >>>> 2 files changed, 32 insertions(+) >>>> >>>> diff --git a/fs/nfs/pnfs_nfs.c b/fs/nfs/pnfs_nfs.c >>>> index d2a7c9f7aa94..0dfc476da3e1 100644 >>>> --- a/fs/nfs/pnfs_nfs.c >>>> +++ b/fs/nfs/pnfs_nfs.c >>>> @@ -246,6 +246,23 @@ void pnfs_fetch_commit_bucket_list(struct list_head *pages, >>>> >>>> } >>>> >>>> +/* Helper function for pnfs_generic_commit_pagelist to catch an empty >>>> + * page list. This can happen when two commits race. */ >>>> +static bool >>>> +pnfs_generic_commit_cancel_empty_pagelist(struct list_head *pages, >>>> + struct nfs_commit_data *data, >>>> + struct nfs_commit_info *cinfo) >>>> +{ >>>> + if (list_empty(pages)) { >>>> + if (atomic_dec_and_test(&cinfo->mds->rpcs_out)) >>>> + wake_up_atomic_t(&cinfo->mds->rpcs_out); >>>> + nfs_commitdata_release(data); >>>> + return true; >>>> + } >>>> + >>>> + return false; >>>> +} >>>> + >>>> /* This follows nfs_commit_list pretty closely */ >>>> int >>>> pnfs_generic_commit_pagelist(struct inode *inode, struct list_head *mds_pages, >>>> @@ -280,6 +297,11 @@ pnfs_generic_commit_pagelist(struct inode *inode, struct list_head *mds_pages, >>>> list_for_each_entry_safe(data, tmp, &list, pages) { >>>> list_del_init(&data->pages); >>>> if (data->ds_commit_index < 0) { >>>> + /* another commit raced with us */ >>>> + if (pnfs_generic_commit_cancel_empty_pagelist(mds_pages, >>>> + data, cinfo)) >>>> + continue; >>>> + >>>> nfs_init_commit(data, mds_pages, NULL, cinfo); >>>> nfs_initiate_commit(NFS_CLIENT(inode), data, >>>> NFS_PROTO(data->inode), >>>> @@ -288,6 +310,12 @@ pnfs_generic_commit_pagelist(struct inode *inode, struct list_head *mds_pages, >>>> LIST_HEAD(pages); >>>> >>>> pnfs_fetch_commit_bucket_list(&pages, data, cinfo); >>>> + >>>> + /* another commit raced with us */ >>>> + if (pnfs_generic_commit_cancel_empty_pagelist(&pages, >>>> + data, cinfo)) >>>> + continue; >>>> + >>>> nfs_init_commit(data, &pages, data->lseg, cinfo); >>>> initiate_commit(data, how); >>>> } >>>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c >>>> index 4dac51ba1f7e..438c00110fc9 100644 >>>> --- a/fs/nfs/write.c >>>> +++ b/fs/nfs/write.c >>>> @@ -1710,6 +1710,10 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how, >>>> { >>>> struct nfs_commit_data *data; >>>> >>>> + /* another commit raced with us */ >>>> + if (list_empty(head)) >>>> + return 0; >>>> + >>>> data = nfs_commitdata_alloc(); >>>> >>>> if (!data) >>>> >>> >> >> >> Disclaimer >> >> The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. >> >