2023-01-04 19:56:42

by Jeff Layton

[permalink] [raw]
Subject: Re: weird smp memory-barrier issue with nfsd filecache code

On Wed, 2023-01-04 at 19:40 +0000, Trond Myklebust wrote:
>
> > On Jan 4, 2023, at 14:20, Jeff Layton <[email protected]> wrote:
> >
> > I had a report of a crash today. The kernel in question is older but has
> > a backport of the most recent nfsd filecache patches. In particular
> > nfsd_file_do_acquire is pretty much identical to the current mainline
> > code:
> >
> > [1803883.048506] BUG: kernel NULL pointer dereference, address: 0000000000000020
> > [1803883.048972] #PF: supervisor read access in kernel mode
> > [1803883.049378] #PF: error_code(0x0000) - not-present page
> > [1803883.049798] PGD 0 P4D 0
> > [1803883.050171] Oops: 0000 [#1] PREEMPT SMP NOPTI
> > [1803883.050563] CPU: 16 PID: 3591 Comm: nfsd Kdump: loaded Tainted: G OE --------- --- 5.14.0-210.jlayton.nfsd92.2.el9.x86_64 #1
> > [1803883.051386] Hardware name: Supermicro Super Server/H11SSL-NC, BIOS 1.0b 04/27/2018
> > [1803883.051820] RIP: 0010:nfsd_file_do_acquire+0x7fb/0x8b0 [nfsd]
> > [1803883.052286] Code: 00 00 00 41 bc 00 00 27 18 e9 46 fc ff ff 89 c2 e9 df fb ff ff 48 8b 6c 24 20 65 48 ff 05 d5 57 cc 3e 49 8b 45 28 8b 74 24 08 <48> 8b 78 20 e8 9c 5d ff ff 89 c7 e8 95 27 ff ff 41 89 c4 e9 30 fc
> > [1803883.053180] RSP: 0018:ffffb4ca5269fc40 EFLAGS: 00010206
> > [1803883.053639] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000002
> > [1803883.054106] RDX: 0000000080000000 RSI: 0000000000000002 RDI: ffff9c9722b6af40
> > [1803883.054576] RBP: ffff9c90265f0000 R08: 0000000000000006 R09: 000000000000062b
> > [1803883.055056] R10: 0000000000000866 R11: 000000000000062b R12: ffffb4ca55e7d000
> > [1803883.055544] R13: ffff9c936c971ea0 R14: ffff9c936c971ea0 R15: ffff9c936c971ee0
> > [1803883.056067] FS: 0000000000000000(0000) GS:ffff9c9b0f700000(0000) knlGS:0000000000000000
> > [1803883.056577] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [1803883.057093] CR2: 0000000000000020 CR3: 0000000c3e450000 CR4: 00000000003506e0
> > [1803883.057621] Call Trace:
> > [1803883.058155] <TASK>
> > [1803883.058700] nfs4_get_vfs_file+0x3dd/0x410 [nfsd]
> > [1803883.059283] nfsd4_process_open2+0x412/0x9f0 [nfsd]
> > [1803883.059854] nfsd4_open+0x282/0x4b0 [nfsd]
> > [1803883.060398] nfsd4_proc_compound+0x44b/0x6f0 [nfsd]
> > [1803883.060966] nfsd_dispatch+0x149/0x270 [nfsd]
> > [1803883.061520] svc_process_common+0x3bc/0x5e0 [sunrpc]
> > [1803883.062099] ? nfsd_svc+0x190/0x190 [nfsd]
> > [1803883.062667] ? nfsd_shutdown_threads+0xa0/0xa0 [nfsd]
> > [1803883.063238] svc_process+0xb7/0xf0 [sunrpc]
> > [1803883.063826] nfsd+0xd5/0x190 [nfsd]
> > [1803883.064408] kthread+0xd9/0x100
> > [1803883.064959] ? kthread_complete_and_exit+0x20/0x20
> > [1803883.065501] ret_from_fork+0x22/0x30
> > [1803883.066052] </TASK>
> > [1803883.066592] Modules linked in: mst_pciconf(OE) mst_pci(OE) overlay nfsd nfs_acl binfmt_misc dm_cache_smq dm_cache dm_persistent_data dm_bio_prison dm_bufio rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs rbd libceph dns_resolver 8021q garp mrp stp llc bonding nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ipmi_ssif ib_iser libiscsi scsi_transport_iscsi intel_rapl_msr ib_umad intel_rapl_common rdma_cm ib_ipoib iw_cm amd64_edac edac_mce_amd ib_cm kvm_amd kvm mlx5_ib mlx4_ib ib_uverbs irqbypass rapl pcspkr ib_core joydev acpi_ipmi i2c_piix4 k10temp ipmi_si ipmi_devintf ipmi_msghandler acpi_cpufreq xfs libcrc32c raid1 sd_mod sg mlx5_core ast drm_vram_helper drm_kms_helper syscopyarea sysfillrect crct10dif_pclmul sysimgblt crc32_pclmul nvme
> > [1803883.066656] fb_sys_fops crc32c_intel ahci mpt3sas drm_ttm_helper nvme_core ttm libahci mlxfw ghash_clmulni_intel igb mlx4_core tls nvme_common drm libata raid_class psample scsi_transport_sas t10_pi ccp pci_hyperv_intf sp5100_tco dca i2c_algo_bit dm_mirror dm_region_hash dm_log dm_mod fuse
> > [1803883.074377] CR2: 0000000000000020
> >
> > We got a coredump from this one, and I did a bit of analysis:
> >
> > wait_for_construction:
> > wait_on_bit(&nf->nf_flags, NFSD_FILE_PENDING, TASK_UNINTERRUPTIBLE);
> >
> > /* Did construction of this file fail? */
> > if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> > trace_nfsd_file_cons_err(rqstp, key.inode, may_flags, nf);
> > if (!open_retry) {
> > status = nfserr_jukebox;
> > goto out;
> > }
> > open_retry = false;
> > if (refcount_dec_and_test(&nf->nf_ref))
> > nfsd_file_free(nf);
> > goto retry;
> > }
> >
> > this_cpu_inc(nfsd_file_cache_hits);
> >
> > status = nfserrno(nfsd_open_break_lease(file_inode(nf->nf_file), may_flags)); <<<< CRASH HERE
> > out:
> > if (status == nfs_ok) {
> > if (open)
> > this_cpu_inc(nfsd_file_acquisitions);
> > *pnf = nf;
> > } else {
> > if (refcount_dec_and_test(&nf->nf_ref))
> > nfsd_file_free(nf);
> > nf = NULL;
> > }
> >
> > out_status:
> > put_cred(key.cred);
> > if (open)
> > trace_nfsd_file_acquire(rqstp, key.inode, may_flags, nf, status);
> > return status;
> >
> > open_file:
> > trace_nfsd_file_alloc(nf);
> > nf->nf_mark = nfsd_file_mark_find_or_create(nf, key.inode);
> > if (nf->nf_mark) {
> > if (open) {
> > status = nfsd_open_verified(rqstp, fhp, may_flags,
> > &nf->nf_file);
> > trace_nfsd_file_open(nf, status);
> > } else
> > status = nfs_ok;
> > } else
> > status = nfserr_jukebox;
> > /*
> > * If construction failed, or we raced with a call to unlink()
> > * then unhash.
> > */
> > if (status == nfs_ok && key.inode->i_nlink == 0)
> > status = nfserr_jukebox;
> > if (status != nfs_ok)
> > nfsd_file_unhash(nf);
> > clear_bit_unlock(NFSD_FILE_PENDING, &nf->nf_flags);
> > smp_mb__after_atomic();
> > wake_up_bit(&nf->nf_flags, NFSD_FILE_PENDING);
> > goto out;
> >
> > Relevant disassembly:
> >
> > /usr/src/debug/kernel-5.14.0-210.jlayton.nfsd92.2.el9/linux-5.14.0-210.jlayton.nfsd92.2.el9.x86_64/./include/linux/fs.h: 1348
> > 0xffffffffc1368883 <nfsd_file_do_acquire+0x7f3>: mov 0x28(%r13),%rax
> > /usr/src/debug/kernel-5.14.0-210.jlayton.nfsd92.2.el9/linux-5.14.0-210.jlayton.nfsd92.2.el9.x86_64/fs/nfsd/filecache.c: 1145
> > 0xffffffffc1368887 <nfsd_file_do_acquire+0x7f7>: mov 0x8(%rsp),%esi
> > 0xffffffffc136888b <nfsd_file_do_acquire+0x7fb>: mov 0x20(%rax),%rdi <<< CRASH HERE
> >
> > crash> struct -o nfsd_file
> > struct nfsd_file {
> > [0x0] struct rhash_head nf_rhash;
> > [0x8] struct list_head nf_lru;
> > [0x18] struct callback_head nf_rcu;
> > [0x28] struct file *nf_file;
> > [0x30] const struct cred *nf_cred;
> > [0x38] struct net *nf_net;
> > [0x40] unsigned long nf_flags;
> > [0x48] struct inode *nf_inode;
> > [0x50] refcount_t nf_ref;
> > [0x54] unsigned char nf_may;
> > [0x58] struct nfsd_file_mark *nf_mark;
> > [0x60] ktime_t nf_birthtime;
> > }
> >
> > nf_file is 0x28 bytes into that struct, so the mov at +0x7f3 is probably
> > the dereferencing of that. That means that the addr of the nfsd_file is
> > (probably) in %r13. Based on that, nf_file looks like a legit address.
> >
> > crash> struct nfsd_file ffff9c936c971ea0
> > struct nfsd_file {
> > nf_rhash = {
> > next = 0xffffb4ca55ef9209
> > },
> > nf_lru = {
> > next = 0xffff9c936c971ea8,
> > prev = 0xffff9c936c971ea8
>
> The above means nf_lru is not hashed (next==prev==&nf_lru)
>

It's hashed, but it is not on the LRU. We only put them on the LRU once
there are no more active users as that makes the LRU scanning more
efficient. No need to look at nfsd_files that have active users.

> > },
> > nf_rcu = {
> > next = 0xffff9c93995c3218,
> > func = 0x0
> > },
> > nf_file = 0xffff9c9b3b080600,
> > nf_cred = 0xffff9c9423a55a40,
> > nf_net = 0xffffffffa200cd40 <init_net>,
> > nf_flags = 0x1,
>
> The above value would normally imply it is hashed (nf_flags==(1<<NFSD_FILE_HASHED))
>
> > nf_inode = 0xffff9c9c204c5500,
> > nf_ref = {
> > refs = {
> > counter = 0x2
> > }
> > },
> > nf_may = 0x2,
> > nf_mark = 0xffff9c9633a85b40,
> > nf_birthtime = 0x668a348ce435a
> > }
> >
> > In this code, one task is opening a file for the cache and another is
> > waiting on the construction to complete. After we open the file, we
> > populate the nf_file field, clear NFSD_FILE_PENDING and wake up anyone
> > waiting on it.
> >
> > Here though, the waiter had finished waiting but when it went to fetch
> > nf->nf_file in the register, it still appeared to be NULL. In the core,
> > it is populated with a valid file pointer however.
> >
> > clear_bit_unlock does a full barrier (on x86) before clearing the bit.
> > According to memory-barriers.txt, wait_on_bit should imply a read memory
> > barrier.
> >
> > Why was %rax zeroed out?
> > --
>
> Are you sure the back ported values for NFSD_FILE_HASHED and NFSD_FILE_PENDING are correct?
>

Yes. They are the same.
--
Jeff Layton <[email protected]>