There's a fun problem - for all the complaints about evil, crude
VFS exclusion not letting the smart filesystem developers Do It Right(tm),
NFS has a homegrown kinda-sorta rwsem, with delayed unlinks being readers
and lookups - writers.
IOW, nfs_block_sillyrename() still yields lookup/lookup exclusion,
even with ->i_mutex replaced with rwsem and ->lookup() calls happening in
parallel. What's more, the thing is very much writer(==lookup)-starving.
What kind of ordering do we really want there? Existing variant
is very crude - lookups (along with readdir and atomic_open) are writers,
delayed unlinks - readers, and there's no fairness whatsoever; if delayed
unlink comes during lookup, it is put on a list and once lookup is done,
everything on that list is executed. Moreover, any unlinks coming during
the execution of those are executed immediately. And no lookup (in that
directory) is allowed until there's no unlinks in progress.
Creating a storm of delayed unlinks isn't hard - open-and-unlink
a lot, then exit and you've got it...
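	To make the ordering above concrete, here is a toy userspace model of
that scheme (pthreads, illustrative only - struct and function names are made
up, this is not the fs/nfs code):

#include <pthread.h>
#include <stdbool.h>

struct silly_dir {
	pthread_mutex_t lock;
	pthread_cond_t wq;
	int unlinks_in_progress;	/* "readers": delayed unlinks running */
	int unlinks_deferred;		/* parked on a list during a lookup */
	bool lookup_in_progress;	/* "writer": lookup/readdir/atomic_open */
};

static void lookup_enter(struct silly_dir *d)
{
	pthread_mutex_lock(&d->lock);
	/* no lookup in this directory until no unlinks are in progress */
	while (d->lookup_in_progress || d->unlinks_in_progress ||
	       d->unlinks_deferred)
		pthread_cond_wait(&d->wq, &d->lock);
	d->lookup_in_progress = true;
	pthread_mutex_unlock(&d->lock);
}

static void lookup_exit(struct silly_dir *d)
{
	pthread_mutex_lock(&d->lock);
	d->lookup_in_progress = false;
	/* everything parked during the lookup gets to run now */
	pthread_cond_broadcast(&d->wq);
	pthread_mutex_unlock(&d->lock);
}

static void unlink_enter(struct silly_dir *d)
{
	pthread_mutex_lock(&d->lock);
	if (d->lookup_in_progress) {
		/* arrived during a lookup: put on the list until it's done */
		d->unlinks_deferred++;
		while (d->lookup_in_progress)
			pthread_cond_wait(&d->wq, &d->lock);
		d->unlinks_deferred--;
	}
	/* arrived while other unlinks run: goes ahead immediately, which
	 * is what starves the lookup (writer) side */
	d->unlinks_in_progress++;
	pthread_mutex_unlock(&d->lock);
}

static void unlink_exit(struct silly_dir *d)
{
	pthread_mutex_lock(&d->lock);
	if (--d->unlinks_in_progress == 0)
		pthread_cond_broadcast(&d->wq);
	pthread_mutex_unlock(&d->lock);
}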
Suggestions? Right now my local tree has nfs_lookup() and
nfs_readdir() run with directory locked shared. And they are still
serialized by the damn ->silly_count ;-/
Incidentally, why does nfs_complete_unlink() recheck ->d_flags?
The caller of ->d_iput() is holding the only reference to dentry; who and
what could possibly clear DCACHE_NFSFS_RENAMED between the checks in
nfs_dentry_iput() and nfs_complete_unlink()?
On Sun, 2016-04-24 at 03:34 +0100, Al Viro wrote:
> There's a fun problem - for all the complaints about evil, crude
> VFS exclusion not letting the smart filesystem developers Do It Right(tm),
> NFS has a homegrown kinda-sorta rwsem, with delayed unlinks being readers
> and lookups - writers.
>
> IOW, nfs_block_sillyrename() still yields lookup/lookup exclusion,
> even with ->i_mutex replaced with rwsem and ->lookup() calls happening in
> parallel. What's more, the thing is very much writer(==lookup)-starving.
>
> What kind of ordering do we really want there? Existing variant
> is very crude - lookups (along with readdir and atomic_open) are writers,
> delayed unlinks - readers, and there's no fairness whatsoever; if delayed
> unlink comes during lookup, it is put on a list and once lookup is done,
> everything on that list is executed. Moreover, any unlinks coming during
> the execution of those are executed immediately. And no lookup (in that
> directory) is allowed until there's no unlinks in progress.
>
> Creating a storm of delayed unlinks isn't hard - open-and-unlink
> a lot, then exit and you've got it...
>
> Suggestions? Right now my local tree has nfs_lookup() and
> nfs_readdir() run with directory locked shared. And they are still
> serialized by the damn ->silly_count ;-/
>
Hmm...well, most of that was added in commit 565277f63c61. Looking at
the bug referenced in that commit log, I think that the main thing we
want to ensure is that rmdir waits until all of the sillydeletes for
files in its directory have finished.
But...there's also some code to ensure that if a lookup races in while
we're doing the sillydelete that we transfer all of the dentry info to
the new alias. That's the messy part.
The new infrastructure for parallel lookups might make it simpler
actually. When we go to do the sillydelete, could we add the dentry to
the "lookup in progress" hash that you're adding as part of the
parallel lookup infrastructure? Then the tasks doing lookups could find
it and just wait on the sillydelete to complete. After the sillydelete,
we'd turn the thing into a negative dentry and then wake up the waiters
(if any). That does make the whole dentry teardown a bit more complex
though.
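To make that flow concrete, here's a toy userspace model of the waiting side
(illustrative only; the in-lookup hash API itself is not shown, and the names
below are made up):

#include <pthread.h>
#include <stdbool.h>

/* stands in for the dentry parked in the in-lookup hash */
struct pending_sillydelete {
	pthread_mutex_t lock;
	pthread_cond_t done_wq;
	bool done;		/* "turned into a negative dentry" */
};

/* completion side: the unlink reply came back; mark it done and wake
 * any lookups that stumbled on the name in the meantime */
static void sillydelete_finish(struct pending_sillydelete *p)
{
	pthread_mutex_lock(&p->lock);
	p->done = true;
	pthread_cond_broadcast(&p->done_wq);
	pthread_mutex_unlock(&p->lock);
}

/* lookup side: found the parked entry for this name; just wait the
 * sillydelete out instead of taking a directory-wide lock */
static void lookup_found_pending(struct pending_sillydelete *p)
{
	pthread_mutex_lock(&p->lock);
	while (!p->done)
		pthread_cond_wait(&p->done_wq, &p->lock);
	pthread_mutex_unlock(&p->lock);
}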
OTOH...Al, I think you mentioned at one time that you thought the whole
sillydelete mechanism was overly-complicated, and that it might be
cleaner to do this somehow in f_op->release? I don't recall the details
of what you had in mind at the time, but it might be good to rethink
the whole mess.
> Incidentally, why does nfs_complete_unlink() recheck ->d_flags?
> The caller of ->d_iput() is holding the only reference to dentry; who and
> what could possibly clear DCACHE_NFSFS_RENAMED between the checks in
> nfs_dentry_iput() and nfs_complete_unlink()?
Yeah, that looks superfluous. I imagine that can be removed.
--
Jeff Layton <[email protected]>
On Sun, Apr 24, 2016 at 08:46:15AM -0400, Jeff Layton wrote:
> > Suggestions? Right now my local tree has nfs_lookup() and
> > nfs_readdir() run with directory locked shared. And they are still
> > serialized by the damn ->silly_count ;-/
> >
>
> Hmm...well, most of that was added in commit 565277f63c61. Looking at
> the bug referenced in that commit log, I think that the main thing we
> want to ensure is that rmdir waits until all of the sillydeletes for
> files in its directory have finished.
>
> But...there's also some code to ensure that if a lookup races in while
> we're doing the sillydelete that we transfer all of the dentry info to
> the new alias. That's the messy part.
It's a bit more - AFAICS, it also wants to prevent a lookup that comes in
after we'd started building an unlink request from getting serviced before
that unlink.
> The new infrastructure for parallel lookups might make it simpler
> actually. When we go to do the sillydelete, could we add the dentry to
> the "lookup in progress" hash that you're adding as part of the
> parallel lookup infrastructure? Then the tasks doing lookups could find
> it and just wait on the sillydelete to complete. After the sillydelete,
> we'd turn the thing into a negative dentry and then wake up the waiters
> (if any). That does make the whole dentry teardown a bit more complex
> though.
Umm... A lot more complex, unfortunately - if anything, I would allocate
a _new_ dentry at sillyrename time and use it pretty much instead of
your nfs_unlinkdata. Unhashed, negative and pinned down. And insert it
into in-lookup hash only when we get around to actual delayed unlink.
First of all, dentries in in-lookup hash *must* be negative before they can
be inserted there. That wouldn't be so much PITA per se, but we also have
a problem with the sucker possibly being on a shrink list and only the owner
of the shrink list can remove it from there. So this kind of reuse would be
hard to arrange.
Do we want to end up with a negative hashed after that unlink, though?
Sure, we know it's negative, but will anyone really try to look it up
afterwards? IOW, is it anything but a hash pollution? What's more,
unlike in-lookup hash, there are assumptions about the primary one; namely,
directory-modifying operation can be certain that nobody else will be adding
entries to hash behind its back, positive or negative.
I'm not at all sure that NFS doesn't rely upon this. The in-lookup hash
has no such warranties (and still very few users, of course), so we can
afford adding stuff to it without bothering with locking the parent directory.
_IF_ we don't try to add anything to primary hash, we can bloody well use it
without ->i_rwsem on the parent.
AFAICS, ->args is redundant if you have the sucker with the right name/parent.
So's ->dir; the rest is probably still needed, so a shrunk nfs_unlinkdata
would still be needed, more's the pity... Well, we can point ->d_fsdata of
the replacement dentry to it.
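Roughly this shape - a sketch only, with invented struct/function names and
locking elided; not the actual code:

#include <linux/dcache.h>
#include <linux/slab.h>

struct nfs_sillyunlink {
	/* cred, completion, etc. - whatever survives the diet; no ->args
	 * and no ->dir, both follow from the dentry itself */
	void *leftovers;
};

static struct dentry *demo_prepare_sillyunlink(struct dentry *victim)
{
	struct nfs_sillyunlink *u;
	struct dentry *spare;

	u = kzalloc(sizeof(*u), GFP_KERNEL);
	if (!u)
		return NULL;

	/* negative, unhashed and pinned replacement for the victim's name */
	spare = d_alloc(victim->d_parent, &victim->d_name);
	if (!spare) {
		kfree(u);
		return NULL;
	}
	spare->d_fsdata = u;
	/* it goes into the in-lookup hash only when the delayed unlink
	 * actually fires, not here */
	return spare;
}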
And yes, it means allocation in two pieces instead of just one when we hit
sillyrename. Seeing that it's doing at least two RPC roundtrips right in
nfs_sillyrename(), I think that overhead isn't a serious worry.
What we get out of that is fully parallel lookup/readdir/sillyunlink - all
exclusion is on per-name basis (nfs_prime_dcache() vs. nfs_lookup() vs.
nfs_do_call_unlink()). It will require a bit of care in atomic_open(),
though...
I'll play with that a bit and see what can be done...
On Sun, 2016-04-24 at 20:18 +0100, Al Viro wrote:
> On Sun, Apr 24, 2016 at 08:46:15AM -0400, Jeff Layton wrote:
> >
> > >
> > > Suggestions? Right now my local tree has nfs_lookup() and
> > > nfs_readdir() run with directory locked shared. And they are still
> > > serialized by the damn ->silly_count ;-/
> > >
> > Hmm...well, most of that was added in commit 565277f63c61. Looking at
> > the bug referenced in that commit log, I think that the main thing we
> > want to ensure is that rmdir waits until all of the sillydeletes for
> > files in its directory have finished.
> >
> > But...there's also some code to ensure that if a lookup races in while
> > we're doing the sillydelete that we transfer all of the dentry info to
> > the new alias. That's the messy part.
> It's a bit more - AFAICS, it also wants to prevent a lookup that comes in
> after we'd started building an unlink request from getting serviced before
> that unlink.
>
> >
> > The new infrastructure for parallel lookups might make it simpler
> > actually. When we go to do the sillydelete, could we add the dentry to
> > the "lookup in progress" hash that you're adding as part of the
> > parallel lookup infrastructure? Then the tasks doing lookups could find
> > it and just wait on the sillydelete to complete. After the sillydelete,
> > we'd turn the thing into a negative dentry and then wake up the waiters
> > (if any). That does make the whole dentry teardown a bit more complex
> > though.
> Umm... A lot more complex, unfortunately - if anything, I would allocate
> a _new_ dentry at sillyrename time and use it pretty much instead of
> your nfs_unlinkdata. Unhashed, negative and pinned down. And insert it
> into in-lookup hash only when we get around to actual delayed unlink.
>
> First of all, dentries in in-lookup hash *must* be negative before they can
> be inserted there. That wouldn't be so much PITA per se, but we also have
> a problem with the sucker possibly being on a shrink list and only the owner
> of the shrink list can remove it from there. So this kind of reuse would be
> hard to arrange.
>
> Do we want to end up with a negative hashed after that unlink, though?
> Sure, we know it's negative, but will anyone really try to look it up
> afterwards? IOW, is it anything but a hash pollution? What's more,
> unlike in-lookup hash, there are assumptions about the primary one; namely,
> directory-modifying operation can be certain that nobody else will be adding
> entries to hash behind its back, positive or negative.
>
You may be right. One problematic scenario is something like this:
A READDIR is issued that ends up racing with a sillydelete. The READDIR response comes in first and we start processing entries. Eventually it hits the dentry that's being sillydeleted and blocks on it. The dentry is deleted on the server so we create a negative dentry and wake up the READDIR. It then reinstantiates the dentry as positive.
Hmm...is that really a problem though? Maybe not since we'll just issue the RMDIR to the server at that point and the directory _would_ be empty there. Alternately we'd probably just end up flushing the dentry when we notice that the dir has changed.
In any case, there probably is little benefit to keeping the dentry around. If d_drop'ing it makes things simpler, then I've no objection.
> I'm not at all sure that NFS doesn't rely upon this. The in-lookup hash
> has no such warranties (and still very few users, of course), so we can
> afford adding stuff to it without bothering with locking the parent directory.
> _IF_ we don't try to add anything to primary hash, we can bloody well use it
> without ->i_rwsem on the parent.
>
> AFAICS, ->args is redundant if you have the sucker with the right name/parent.
> So's ->dir; the rest is probably still needed, so a shrunk nfs_unlinkdata
> would still be needed, more's the pity... Well, we can point ->d_fsdata of
> the replacement dentry to it.
>
> And yes, it means allocation in two pieces instead of just one when we hit
> sillyrename. Seeing that it's doing at least two RPC roundtrips right in
> nfs_sillyrename(), I think that overhead isn't a serious worry.
>
Agreed.
> What we get out of that is fully parallel lookup/readdir/sillyunlink - all
> exclusion is on per-name basis (nfs_prime_dcache() vs. nfs_lookup() vs.
> nfs_do_call_unlink()). It will require a bit of care in atomic_open(),
> though...
>
> I'll play with that a bit and see what can be done...
--
Jeff Layton <[email protected]>
On Sun, Apr 24, 2016 at 08:18:35PM +0100, Al Viro wrote:
> What we get out of that is fully parallel lookup/readdir/sillyunlink - all
> exclusion is on per-name basis (nfs_prime_dcache() vs. nfs_lookup() vs.
> nfs_do_call_unlink()). It will require a bit of care in atomic_open(),
> though...
>
> I'll play with that a bit and see what can be done...
OK, a bunch of atomic_open cleanups (moderately tested) +
almost untested sillyunlink patch are in vfs.git#untested.nfs.
It ought to make lookups (and readdir, and !O_CREAT case of atomic_open)
on NFS really execute in parallel. Folks, please hit that sucker with
NFS torture tests. In particular, the stuff mentioned in commit
565277f6 would be interesting to try.
On Fri, 2016-04-29 at 08:58 +0100, Al Viro wrote:
> On Sun, Apr 24, 2016 at 08:18:35PM +0100, Al Viro wrote:
>
> >
> > What we get out of that is fully parallel
> > lookup/readdir/sillyunlink - all
> > exclusion is on per-name basis (nfs_prime_dcache() vs. nfs_lookup()
> > vs.
> > nfs_do_call_unlink()). It will require a bit of care in
> > atomic_open(),
> > though...
> >
> > I'll play with that a bit and see what can be done...
> OK, a bunch of atomic_open cleanups (moderately tested) +
> almost untested sillyunlink patch are in vfs.git#untested.nfs.
>
> It ought to make lookups (and readdir, and !O_CREAT case of
> atomic_open)
> on NFS really execute in parallel. Folks, please hit that sucker
> with
> NFS torture tests. In particular, the stuff mentioned in commit
> 565277f6 would be interesting to try.
I pulled down the branch and built it, and then ran the cthon special
tests 100 times in a loop, and ran "ls -l" on the test directory in a
loop at the same time. On pass 42, I hit this:
[ 1168.630763] general protection fault: 0000 [#1] SMP
[ 1168.631617] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xfs snd_hda_codec_generic snd_hda_intel snd_hda_codec libcrc32c snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd nfsd joydev ppdev soundcore acpi_cpufreq virtio_net pcspkr i2c_piix4 tpm_tis tpm parport_pc parport virtio_balloon floppy pvpanic nfs_acl lockd auth_rpcgss grace sunrpc qxl drm_kms_helper ttm drm virtio_console virtio_blk virtio_pci virtio_ring virtio serio_raw ata_generic pata_acpi
[ 1168.638448] CPU: 3 PID: 1850 Comm: op_ren Not tainted 4.6.0-rc1+ #25
[ 1168.639413] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 1168.640146] task: ffff880035cf5400 ti: ffff8800d064c000 task.ti: ffff8800d064c000
[ 1168.641107] RIP: 0010:[<ffffffff811f6488>] [<ffffffff811f6488>] kmem_cache_alloc+0x78/0x160
[ 1168.642292] RSP: 0018:ffff8800d064fa90 EFLAGS: 00010246
[ 1168.642978] RAX: 73747365746e7572 RBX: 0000000000000894 RCX: 0000000000000020
[ 1168.643920] RDX: 0000000000318271 RSI: 00000000024080c0 RDI: 000000000001a440
[ 1168.644862] RBP: ffff8800d064fac0 R08: ffff88021fd9a440 R09: ffff880035b82400
[ 1168.645794] R10: 0000000000000000 R11: ffff8800d064fb70 R12: 00000000024080c0
[ 1168.646762] R13: ffffffff81317667 R14: ffff880217001d00 R15: 73747365746e7572
[ 1168.647650] FS: 00007f0cb8295700(0000) GS:ffff88021fd80000(0000) knlGS:0000000000000000
[ 1168.648639] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1168.649330] CR2: 0000000000401090 CR3: 0000000035f9a000 CR4: 00000000000006e0
[ 1168.650239] Stack:
[ 1168.650498] 00ff8800026084c0 0000000000000894 ffff880035b82400 ffff8800d064fd14
[ 1168.651509] ffff8800d0650000 ffff880201ca5e38 ffff8800d064fae0 ffffffff81317667
[ 1168.652506] ffffffff81c9b140 ffff880035b82400 ffff8800d064fb00 ffffffff8130ef03
[ 1168.653494] Call Trace:
[ 1168.653889] [<ffffffff81317667>] selinux_file_alloc_security+0x37/0x60
[ 1168.654728] [<ffffffff8130ef03>] security_file_alloc+0x33/0x50
[ 1168.655447] [<ffffffff812117da>] get_empty_filp+0x9a/0x1c0
[ 1168.656231] [<ffffffff81399d96>] ? copy_to_iter+0x1b6/0x260
[ 1168.656999] [<ffffffff8121d75e>] path_openat+0x2e/0x1660
[ 1168.657645] [<ffffffff81103133>] ? current_fs_time+0x23/0x30
[ 1168.658311] [<ffffffff81399d96>] ? copy_to_iter+0x1b6/0x260
[ 1168.658999] [<ffffffff81103133>] ? current_fs_time+0x23/0x30
[ 1168.659742] [<ffffffff8122bea3>] ? touch_atime+0x23/0xa0
[ 1168.660435] [<ffffffff8121fe3e>] do_filp_open+0x7e/0xe0
[ 1168.661072] [<ffffffff8120e8d7>] ? __vfs_read+0xa7/0xd0
[ 1168.661792] [<ffffffff8120e8d7>] ? __vfs_read+0xa7/0xd0
[ 1168.662410] [<ffffffff811f6444>] ? kmem_cache_alloc+0x34/0x160
[ 1168.663130] [<ffffffff81214d94>] do_open_execat+0x64/0x150
[ 1168.664100] [<ffffffff8121524b>] open_exec+0x2b/0x50
[ 1168.664949] [<ffffffff8126302a>] load_elf_binary+0x29a/0x1670
[ 1168.665880] [<ffffffff811c43d4>] ? get_user_pages_remote+0x54/0x60
[ 1168.666843] [<ffffffff81215fac>] ? copy_strings.isra.30+0x25c/0x370
[ 1168.667812] [<ffffffff8121595e>] search_binary_handler+0x9e/0x1d0
[ 1168.668753] [<ffffffff8121714c>] do_execveat_common.isra.41+0x4fc/0x6d0
[ 1168.669753] [<ffffffff812175ba>] SyS_execve+0x3a/0x50
[ 1168.670560] [<ffffffff81003cb2>] do_syscall_64+0x62/0x110
[ 1168.671384] [<ffffffff8174ae21>] entry_SYSCALL64_slow_path+0x25/0x25
[ 1168.672305] Code: 49 83 78 10 00 4d 8b 38 0f 84 bd 00 00 00 4d 85 ff 0f 84 b4 00 00 00 49 63 46 20 49 8b 3e 4c 01 f8 40 f6 c7 0f 0f 85 cf 00 00 00 <48> 8b 18 48 8d 4a 01 4c 89 f8 65 48 0f c7 0f 0f 94 c0 84 c0 74
[ 1168.676071] RIP [<ffffffff811f6488>] kmem_cache_alloc+0x78/0x160
[ 1168.677008] RSP <ffff8800d064fa90>
[ 1168.679699] general protection fault: 0000 [#2]
kmem_cache corruption maybe?
(gdb) list *(kmem_cache_alloc+0x78)
0xffffffff811f6488 is in kmem_cache_alloc (mm/slub.c:245).
240 * Core slab cache functions
241 *******************************************************************/
242
243 static inline void *get_freepointer(struct kmem_cache *s, void *object)
244 {
245 return *(void **)(object + s->offset);
246 }
247
248 static void prefetch_freepointer(const struct kmem_cache *s, void *object)
249 {
On Sat, 2016-04-30 at 09:15 -0400, Jeff Layton wrote:
> On Fri, 2016-04-29 at 08:58 +0100, Al Viro wrote:
> >
> > On Sun, Apr 24, 2016 at 08:18:35PM +0100, Al Viro wrote:
> >
> > >
> > >
> > > What we get out of that is fully parallel
> > > lookup/readdir/sillyunlink - all
> > > exclusion is on per-name basis (nfs_prime_dcache() vs.
> > > nfs_lookup()
> > > vs.
> > > nfs_do_call_unlink()). It will require a bit of care in
> > > atomic_open(),
> > > though...
> > >
> > > I'll play with that a bit and see what can be done...
> > OK, a bunch of atomic_open cleanups (moderately tested) +
> > almost untested sillyunlink patch are in vfs.git#untested.nfs.
> >
> > It ought to make lookups (and readdir, and !O_CREAT case of
> > atomic_open)
> > on NFS really execute in parallel. Folks, please hit that sucker
> > with
> > NFS torture tests. In particular, the stuff mentioned in commit
> > 565277f6 would be interesting to try.
>
> I pulled down the branch and built it, and then ran the cthon special
> tests 100 times in a loop, and ran "ls -l" on the test directory in a
> loop at the same time. On pass 42, I hit this:
>
> [ 1168.630763] general protection fault: 0000 [#1] SMP
> [ 1168.631617] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xfs snd_hda_codec_generic snd_hda_intel snd_hda_codec libcrc32c snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd nfsd joydev ppdev soundcore acpi_cpufreq virtio_net pcspkr i2c_piix4 tpm_tis tpm parport_pc parport virtio_balloon floppy pvpanic nfs_acl lockd auth_rpcgss grace sunrpc qxl drm_kms_helper ttm drm virtio_console virtio_blk virtio_pci virtio_ring virtio serio_raw ata_generic pata_acpi
> [ 1168.638448] CPU: 3 PID: 1850 Comm: op_ren Not tainted 4.6.0-rc1+ #25
> [ 1168.639413] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 1168.640146] task: ffff880035cf5400 ti: ffff8800d064c000 task.ti: ffff8800d064c000
> [ 1168.641107] RIP: 0010:[] [] kmem_cache_alloc+0x78/0x160
> [ 1168.642292] RSP: 0018:ffff8800d064fa90 EFLAGS: 00010246
> [ 1168.642978] RAX: 73747365746e7572 RBX: 0000000000000894 RCX: 0000000000000020
> [ 1168.643920] RDX: 0000000000318271 RSI: 00000000024080c0 RDI: 000000000001a440
> [ 1168.644862] RBP: ffff8800d064fac0 R08: ffff88021fd9a440 R09: ffff880035b82400
> [ 1168.645794] R10: 0000000000000000 R11: ffff8800d064fb70 R12: 00000000024080c0
> [ 1168.646762] R13: ffffffff81317667 R14: ffff880217001d00 R15: 73747365746e7572
> [ 1168.647650] FS: 00007f0cb8295700(0000) GS:ffff88021fd80000(0000) knlGS:0000000000000000
> [ 1168.648639] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1168.649330] CR2: 0000000000401090 CR3: 0000000035f9a000 CR4: 00000000000006e0
> [ 1168.650239] Stack:
> [ 1168.650498] 00ff8800026084c0 0000000000000894 ffff880035b82400 ffff8800d064fd14
> [ 1168.651509] ffff8800d0650000 ffff880201ca5e38 ffff8800d064fae0 ffffffff81317667
> [ 1168.652506] ffffffff81c9b140 ffff880035b82400 ffff8800d064fb00 ffffffff8130ef03
> [ 1168.653494] Call Trace:
> [ 1168.653889] [] selinux_file_alloc_security+0x37/0x60
> [ 1168.654728] [] security_file_alloc+0x33/0x50
> [ 1168.655447] [] get_empty_filp+0x9a/0x1c0
> [ 1168.656231] [] ? copy_to_iter+0x1b6/0x260
> [ 1168.656999] [] path_openat+0x2e/0x1660
> [ 1168.657645] [] ? current_fs_time+0x23/0x30
> [ 1168.658311] [] ? copy_to_iter+0x1b6/0x260
> [ 1168.658999] [] ? current_fs_time+0x23/0x30
> [ 1168.659742] [] ? touch_atime+0x23/0xa0
> [ 1168.660435] [] do_filp_open+0x7e/0xe0
> [ 1168.661072] [] ? __vfs_read+0xa7/0xd0
> [ 1168.661792] [] ? __vfs_read+0xa7/0xd0
> [ 1168.662410] [] ? kmem_cache_alloc+0x34/0x160
> [ 1168.663130] [] do_open_execat+0x64/0x150
> [ 1168.664100] [] open_exec+0x2b/0x50
> [ 1168.664949] [] load_elf_binary+0x29a/0x1670
> [ 1168.665880] [] ? get_user_pages_remote+0x54/0x60
> [ 1168.666843] [] ? copy_strings.isra.30+0x25c/0x370
> [ 1168.667812] [] search_binary_handler+0x9e/0x1d0
> [ 1168.668753] [] do_execveat_common.isra.41+0x4fc/0x6d0
> [ 1168.669753] [] SyS_execve+0x3a/0x50
> [ 1168.670560] [] do_syscall_64+0x62/0x110
> [ 1168.671384] [] entry_SYSCALL64_slow_path+0x25/0x25
> [ 1168.672305] Code: 49 83 78 10 00 4d 8b 38 0f 84 bd 00 00 00 4d 85 ff 0f 84 b4 00 00 00 49 63 46 20 49 8b 3e 4c 01 f8 40 f6 c7 0f 0f 85 cf 00 00 00 <48> 8b 18 48 8d 4a 01 4c 89 f8 65 48 0f c7 0f 0f 94 c0 84 c0 74
> [ 1168.676071] RIP [] kmem_cache_alloc+0x78/0x160
> [ 1168.677008] RSP
> [ 1168.679699] general protection fault: 0000 [#2]
>
>
> kmem_cache corruption maybe?
>
> (gdb) list *(kmem_cache_alloc+0x78)
> 0xffffffff811f6488 is in kmem_cache_alloc (mm/slub.c:245).
> 240 * Core slab cache functions
> 241 *******************************************************************/
> 242
> 243 static inline void *get_freepointer(struct kmem_cache *s, void *object)
> 244 {
> 245 return *(void **)(object + s->offset);
> 246 }
> 247
> 248 static void prefetch_freepointer(const struct kmem_cache *s, void *object)
> 249 {
>
>
> From /sys/kernel/slab (after rebooting and rerunning the cthon tests once):
>
> lrwxrwxrwx. 1 root root 0 Apr 30 09:10 selinux_file_security -> :t-0000016
>
> [root@rawhide slab]# ls -l | grep :t-0000016
> lrwxrwxrwx. 1 root root 0 Apr 30 09:10 kmalloc-16 -> :t-0000016
> lrwxrwxrwx. 1 root root 0 Apr 30 09:10 selinux_file_security -> :t-0000016
> drwxr-xr-x. 2 root root 0 Apr 30 09:11 :t-0000016
>
>
> Hard to tell if this is related to your changes but it certainly did
> happen in the open codepath. I'll see if I can reproduce it again.
>
Second oops on second attempt. Different codepath this time:
[ 548.093261] general protection fault: 0000 [#1] SMP
[ 548.094023] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xfs snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device libcrc32c snd_pcm snd_timer snd acpi_cpufreq ppdev tpm_tis soundcore tpm nfsd parport_pc parport virtio_net floppy virtio_balloon joydev pvpanic i2c_piix4 pcspkr nfs_acl lockd auth_rpcgss grace sunrpc qxl drm_kms_helper ttm virtio_console virtio_blk drm serio_raw virtio_pci virtio_ring ata_generic virtio pata_acpi
[ 548.100968] CPU: 2 PID: 18173 Comm: ls Not tainted 4.6.0-rc1+ #25
[ 548.101799] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 548.102586] task: ffff8800d0275400 ti: ffff880202934000 task.ti: ffff880202934000
[ 548.103610] RIP: 0010:[<ffffffff811f9a06>] [<ffffffff811f9a06>] __kmalloc_track_caller+0x96/0x1b0
[ 548.104793] RSP: 0018:ffff880202937ad0 EFLAGS: 00010246
[ 548.105488] RAX: ff656c6966676962 RBX: 0000000000000006 RCX: 0000000000000000
[ 548.106412] RDX: 0000000000009f9f RSI: 0000000000000000 RDI: 000000000001a420
[ 548.107335] RBP: ffff880202937b00 R08: ffff88021fd1a420 R09: ff656c6966676962
[ 548.108236] R10: 0000000000000001 R11: ffff880212756240 R12: 00000000024000c0
[ 548.109375] R13: 0000000000000006 R14: ffffffffa03d9697 R15: ffff880217001e00
[ 548.110308] FS: 00007fd7839e2800(0000) GS:ffff88021fd00000(0000) knlGS:0000000000000000
[ 548.111378] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 548.112136] CR2: 00007ffe35047f3c CR3: 0000000202816000 CR4: 00000000000006e0
[ 548.113064] Stack:
[ 548.113333] ffff880201065548 0000000000000006 ffff88020222e314 0000000000000000
[ 548.114338] ffff880202f563e8 ffff880202937cb0 ffff880202937b20 ffffffff811b5e50
[ 548.115335] ffff880202f56000 ffff880202937df0 ffff880202937c38 ffffffffa03d9697
[ 548.116342] Call Trace:
[ 548.116665] [<ffffffff811b5e50>] kmemdup+0x20/0x50
[ 548.117295] [<ffffffffa03d9697>] nfs_readdir_page_filler+0x277/0x5f0 [nfs]
[ 548.118173] [<ffffffffa03d9c0f>] nfs_readdir_xdr_to_array+0x1ff/0x330 [nfs]
[ 548.119058] [<ffffffffa03d9d64>] nfs_readdir_filler+0x24/0xb0 [nfs]
[ 548.119856] [<ffffffff8119772e>] ? add_to_page_cache_lru+0x6e/0xc0
[ 548.120625] [<ffffffff81197b4f>] do_read_cache_page+0x13f/0x2c0
[ 548.121397] [<ffffffffa03d9d40>] ? nfs_readdir_xdr_to_array+0x330/0x330 [nfs]
[ 548.122296] [<ffffffff81197ce9>] read_cache_page+0x19/0x20
[ 548.122995] [<ffffffffa03d9f63>] nfs_readdir+0x173/0x5f0 [nfs]
[ 548.123711] [<ffffffff81457a5a>] ? tty_write+0x1ca/0x2f0
[ 548.124568] [<ffffffffa0427cb0>] ? nfs4_xdr_dec_fsinfo+0x70/0x70 [nfsv4]
[ 548.125372] [<ffffffff812232bb>] iterate_dir+0x16b/0x1a0
[ 548.126039] [<ffffffff812236e8>] SyS_getdents+0x88/0x100
[ 548.126686] [<ffffffff812232f0>] ? iterate_dir+0x1a0/0x1a0
[ 548.127651] [<ffffffff81003cb2>] do_syscall_64+0x62/0x110
[ 548.128614] [<ffffffff8174ae21>] entry_SYSCALL64_slow_path+0x25/0x25
[ 548.129667] Code: 49 83 78 10 00 4d 8b 08 0f 84 bf 00 00 00 4d 85 c9 0f 84 b6 00 00 00 49 63 47 20 49 8b 3f 4c 01 c8 40 f6 c7 0f 0f 85 fa 00 00 00 <48> 8b 18 48 8d 4a 01 4c 89 c8 65 48 0f c7 0f 0f 94 c0 84 c0 74
[ 548.133600] RIP [<ffffffff811f9a06>] __kmalloc_track_caller+0x96/0x1b0
[ 548.134659] RSP <ffff880202937ad0>
[ 548.135353] ---[ end trace 77ac2ecac8d76afe ]---
...but looks like same problem:
(gdb) list *(__kmalloc_track_caller+0x96)
0xffffffff811f9a06 is in __kmalloc_track_caller (mm/slub.c:245).
240 * Core slab cache functions
241 *******************************************************************/
242
243 static inline void *get_freepointer(struct kmem_cache *s, void *object)
244 {
245 return *(void **)(object + s->offset);
246 }
247
248 static void prefetch_freepointer(const struct kmem_cache *s, void *object)
249 {
--
Jeff Layton <[email protected]>
On Sat, Apr 30, 2016 at 09:22:56AM -0400, Jeff Layton wrote:
> ...but looks like same problem:
>
> (gdb) list *(__kmalloc_track_caller+0x96)
> 0xffffffff811f9a06 is in __kmalloc_track_caller (mm/slub.c:245).
> 240 * Core slab cache functions
> 241 *******************************************************************/
> 242
> 243 static inline void *get_freepointer(struct kmem_cache *s, void *object)
> 244 {
> 245 return *(void **)(object + s->offset);
> 246 }
> 247
> 248 static void prefetch_freepointer(const struct kmem_cache *s, void *object)
> 249 {
Joy... Does that happen without the last commit as well? I realize that
memory corruption could well have been introduced earlier and changes in
the last commit had only increased the odds, but...
On Sat, 2016-04-30 at 15:22 +0100, Al Viro wrote:
> On Sat, Apr 30, 2016 at 09:22:56AM -0400, Jeff Layton wrote:
> >
> > ...but looks like same problem:
> >
> > (gdb) list *(__kmalloc_track_caller+0x96)
> > 0xffffffff811f9a06 is in __kmalloc_track_caller (mm/slub.c:245).
> > 240 * Core slab cache functions
> > 241 *******************************************************************/
> > 242
> > 243 static inline void *get_freepointer(struct kmem_cache *s, void *object)
> > 244 {
> > 245 return *(void **)(object + s->offset);
> > 246 }
> > 247
> > 248 static void prefetch_freepointer(const struct kmem_cache *s, void *object)
> > 249 {
> Joy... Does that happen without the last commit as well? I realize that
> memory corruption could well have been introduced earlier and changes in
> the last commit had only increased the odds, but...
Not exactly, but the test seems to have deadlocked without the last
patch in play. Here's the ls command:
[jlayton@rawhide ~]$ cat /proc/1425/stack
[<ffffffffa03d6eec>] nfs_block_sillyrename+0x5c/0xa0 [nfs]
[<ffffffffa03c8ef8>] nfs_readdir+0xf8/0x620 [nfs]
[<ffffffff812232bb>] iterate_dir+0x16b/0x1a0
[<ffffffff812236e8>] SyS_getdents+0x88/0x100
[<ffffffff81003cb2>] do_syscall_64+0x62/0x110
[<ffffffff8174ae21>] return_from_SYSCALL_64+0x0/0x6a
[<ffffffffffffffff>] 0xffffffffffffffff
...and here is the nfsidem command:
[jlayton@rawhide ~]$ cat /proc/1295/stack
[<ffffffff813953b7>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffff8121f65b>] filename_create+0x6b/0x150
[<ffffffff812204e4>] SyS_mkdir+0x44/0xe0
[<ffffffff81003cb2>] do_syscall_64+0x62/0x110
[<ffffffff8174ae21>] return_from_SYSCALL_64+0x0/0x6a
[<ffffffffffffffff>] 0xffffffffffffffff
I'll have to take off here in a bit so I won't be able to help much
until later, but all I was doing was running the cthon special tests
like so:
$ ./server -p /export -s -N 100 tlielax
That makes a directory called "rawhide.test" (since the client's
hostname is "rawhide") and runs its tests in there. Then I ran this in
a different shell:
$ while true; do ls -l /mnt/tlielax/rawhide.test ; done
Probably I should run this on a stock kernel just to see if there are
preexisting problems...
--
Jeff Layton <[email protected]>
On Sat, Apr 30, 2016 at 10:43:34AM -0400, Jeff Layton wrote:
> Not exactly, but the test seems to have deadlocked without the last
> patch in play. Here's the ls command:
>
> [jlayton@rawhide ~]$ cat /proc/1425/stack
> [<ffffffffa03d6eec>] nfs_block_sillyrename+0x5c/0xa0 [nfs]
> [<ffffffffa03c8ef8>] nfs_readdir+0xf8/0x620 [nfs]
> [<ffffffff812232bb>] iterate_dir+0x16b/0x1a0
> [<ffffffff812236e8>] SyS_getdents+0x88/0x100
> [<ffffffff81003cb2>] do_syscall_64+0x62/0x110
> [<ffffffff8174ae21>] return_from_SYSCALL_64+0x0/0x6a
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> ...and here is the nfsidem command:
>
> [jlayton@rawhide ~]$ cat /proc/1295/stack
> [<ffffffff813953b7>] call_rwsem_down_write_failed+0x17/0x30
> [<ffffffff8121f65b>] filename_create+0x6b/0x150
> [<ffffffff812204e4>] SyS_mkdir+0x44/0xe0
> [<ffffffff81003cb2>] do_syscall_64+0x62/0x110
> [<ffffffff8174ae21>] return_from_SYSCALL_64+0x0/0x6a
> [<ffffffffffffffff>] 0xffffffffffffffff
>
>
> I'll have to take off here in a bit so I won't be able to help much
> until later, but all I was doing was running the cthon special tests
> like so:
>
>     $ ./server -p /export -s -N 100 tlielax
>
> That makes a directory called "rawhide.test" (since the client's
> hostname is "rawhide") and runs its tests in there. Then I ran this in
> a different shell:
>
> $ while true; do ls -l /mnt/tlielax/rawhide.test ; done
>
> Probably I should run this on a stock kernel just to see if there are
> preexisting problems...
FWIW, I could reproduce that (and I really wonder WTF is going on - looks
like nfs_async_unlink_release() getting lost somehow), but not the memory
corruption with the last commit... What .config are you using?
On Sat, Apr 30, 2016 at 07:58:36PM +0100, Al Viro wrote:
> FWIW, I could reproduce that (and I really wonder WTF is going on - looks
> like nfs_async_unlink_release() getting lost somehow), but not the memory
> corruption with the last commit... What .config are you using?
OK, I do understand the deadlock. nfs_unblock_sillyrename() doesn't issue
a wakeup. If you have lookup/lookup colliding on nfs_block_sillyrename()
*and* no pending delayed unlinks in that directory, you are stuck. Grr...
I'll see if the obvious fix helps and push it into the queue if it does.
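In terms of the toy model sketched earlier in the thread (illustrative only,
reusing that struct silly_dir; this is not the real nfs_unblock_sillyrename()),
the broken shape is an exit path that clears the state without waking the
queue:

static void broken_lookup_exit(struct silly_dir *d)
{
	pthread_mutex_lock(&d->lock);
	d->lookup_in_progress = false;
	/* missing: pthread_cond_broadcast(&d->wq);  with no delayed
	 * unlinks around to issue a wakeup of their own, a second lookup
	 * asleep in lookup_enter() stays there forever */
	pthread_mutex_unlock(&d->lock);
}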
It doesn't explain the memory corruptor, though. Could you post your .config?
On Sat, Apr 30, 2016 at 04:39:25PM -0400, Jeff Layton wrote:
> Attached. Also I ran the same test on a 4.6.0-rc1 kernel and it worked
> fine, so it does seem to be a regression introduced somewhere in your
> patch pile. Also, FWIW the machine is a KVM guest.
OK... Could you try #work.lookups on your setup? That doesn't contain the
last commit; the deadlock ought to be dealt with, though. I'm trying to
reproduce the corruption with KASAN enabled, will try your config once that's
finished...
On Sat, 2016-04-30 at 21:57 +0100, Al Viro wrote:
> On Sat, Apr 30, 2016 at 04:39:25PM -0400, Jeff Layton wrote:
>
> >
> > Attached. Also I ran the same test on a 4.6.0-rc1 kernel and it worked
> > fine, so it does seem to be a regression introduced somewhere in your
> > patch pile. Also, FWIW the machine is a KVM guest.
> OK... Could you try #work.lookups on your setup? That doesn't contain the
> last commit; the deadlock ought to be dealt with, though. I'm trying to
> reproduce the corruption with KASAN enabled, will try your config once that's
> finished...
That branch seems to work fine. The same test didn't hang and didn't crash.
Cheers,
--
Jeff Layton <[email protected]>
On Sat, 2016-04-30 at 21:57 +0100, Al Viro wrote:
> On Sat, Apr 30, 2016 at 04:39:25PM -0400, Jeff Layton wrote:
>
> >
> > Attached. Also I ran the same test on a 4.6.0-rc1 kernel and it
> > worked
> > fine, so it does seem to be a regression introduced somewhere in
> > your
> > patch pile. Also, FWIW the machine is a KVM guest.
> OK... Could you try #work.lookups on your setup? That doesn't
> contain the
> last commit; the deadlock ought to be dealt with, though. I'm trying
> to
> reproduce the corruption with KASAN enabled, will try your config
> once that's
> finished...
I'll do the same (re: KASAN).
Also FWIW, a few months ago I hit some oopses in the same inline
function (get_freepointer). It turned out to be a double-free due to my
own misuse of the fsnotify API. I wonder though if this might also be a
double free somewhere?
--
Jeff Layton <[email protected]>
On Sat, 2016-04-30 at 21:57 +0100, Al Viro wrote:
> On Sat, Apr 30, 2016 at 04:39:25PM -0400, Jeff Layton wrote:
>
> >
> > Attached. Also I ran the same test on a 4.6.0-rc1 kernel and it
> > worked
> > fine, so it does seem to be a regression introduced somewhere in
> > your
> > patch pile. Also, FWIW the machine is a KVM guest.
> OK... Could you try #work.lookups on your setup? That doesn't
> contain the
> last commit; the deadlock ought to be dealt with, though. I'm trying
> to
> reproduce the corruption with KASAN enabled, will try your config
> once that's
> finished...
Here's what I got with your untested.nfs branch + KASAN:
[ 192.453252] BUG: unable to handle kernel paging request at ffff814bf270ffff
[ 192.454102] IP: [<ffffffff8156622b>] deactivate_slab+0xab/0x3e0
[ 192.454795] PGD 0
[ 192.455040] Oops: 0000 [#1] SMP KASAN
[ 192.455469] Modules linked in: nfsv3 nfs fscache xfs snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core libcrc32c snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer nfsd joydev snd acpi_cpufreq ppdev parport_pc tpm_tis virtio_net soundcore parport tpm pcspkr virtio_balloon pvpanic i2c_piix4 nfs_acl lockd auth_rpcgss grace sunrpc qxl drm_kms_helper virtio_console virtio_blk ttm drm serio_raw virtio_pci ata_generic virtio_ring pata_acpi virtio floppy
[ 192.460993] CPU: 0 PID: 8628 Comm: rm Not tainted 4.6.0-rc1+ #29
[ 192.461679] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 192.462359] task: ffff8801bf990000 ti: ffff8800c7e88000 task.ti: ffff8800c7e88000
[ 192.463372] RIP: 0010:[<ffffffff8156622b>] [<ffffffff8156622b>] deactivate_slab+0xab/0x3e0
[ 192.464266] RSP: 0018:ffff8800c7e8f550 EFLAGS: 00010002
[ 192.464839] RAX: 0000000000000000 RBX: ffff8801d6c02200 RCX: 00000001801a0019
[ 192.465646] RDX: ffff8800c5949c2e RSI: ffffea0003165200 RDI: ffffea0003165200
[ 192.466443] RBP: ffff8800c7e8f650 R08: ffff8800c5948750 R09: 000000000000801a
[ 192.467319] R10: ffff8800c5948750 R11: ffffed0037f53c7f R12: ffff8801d6c02208
[ 192.468282] R13: ffff814bf270ffff R14: ffff814bf270ffff R15: ffffea0003165200
[ 192.469064] FS: 00007f8ae54d7700(0000) GS:ffff8801d7000000(0000) knlGS:0000000000000000
[ 192.469919] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 192.470528] CR2: ffff814bf270ffff CR3: 00000000c5950000 CR4: 00000000000006f0
[ 192.471337] Stack:
[ 192.471567] ffff8800c7e8f58c 00000000801a001a ffff8801d7007fc0 ffff8801bf990000
[ 192.472453] 0000000000000000 ffff8800c7e88000 ffff8800c7e8f5c0 0000000003165200
[ 192.473388] 0000000fc7e8f650 ffff8801d6c01e00 ffffffffa098b3df ffff8801bf990000
[ 192.474349] Call Trace:
[ 192.474641] [<ffffffffa098b3df>] ? nfs_readdir_page_filler+0x65f/0x1430 [nfs]
[ 192.475456] [<ffffffffa098b3df>] ? nfs_readdir_page_filler+0x65f/0x1430 [nfs]
[ 192.476246] [<ffffffff810a6eeb>] ? save_stack_trace+0x2b/0x50
[ 192.476872] [<ffffffff81563614>] ? set_track+0x74/0x120
[ 192.477453] [<ffffffff81565953>] ? alloc_debug_processing+0x73/0x1b0
[ 192.478143] [<ffffffff81566ecf>] ___slab_alloc+0x4ff/0x520
[ 192.478742] [<ffffffff8156b5f6>] ? memcpy+0x36/0x40
[ 192.479309] [<ffffffffa098b3df>] ? nfs_readdir_page_filler+0x65f/0x1430 [nfs]
[ 192.480084] [<ffffffffa0a3cb73>] ? nfs3_decode_dirent+0x773/0xd90 [nfsv3]
[ 192.480846] [<ffffffffa098b3df>] ? nfs_readdir_page_filler+0x65f/0x1430 [nfs]
[ 192.481642] [<ffffffff81566f10>] __slab_alloc+0x20/0x40
[ 192.482209] [<ffffffff8156a837>] __kmalloc_track_caller+0x167/0x1d0
[ 192.482884] [<ffffffff814bf270>] kmemdup+0x20/0x50
[ 192.483478] [<ffffffffa098b3df>] nfs_readdir_page_filler+0x65f/0x1430 [nfs]
[ 192.484226] [<ffffffff81567f7a>] ? __slab_free+0x9a/0x250
[ 192.484851] [<ffffffffa098ad80>] ? nfs_do_filldir+0x730/0x730 [nfs]
[ 192.485916] [<ffffffffa0a32d47>] ? nfs3_proc_readdir+0x2a7/0x4e0 [nfsv3]
[ 192.486990] [<ffffffff81566ecf>] ? ___slab_alloc+0x4ff/0x520
[ 192.487965] [<ffffffffa0a32aa0>] ? nfs3_proc_rmdir+0x2f0/0x2f0 [nfsv3]
[ 192.489009] [<ffffffff8156b3e6>] ? kasan_unpoison_shadow+0x36/0x50
[ 192.490075] [<ffffffff81550693>] ? alloc_pages_current+0xd3/0x290
[ 192.491144] [<ffffffffa098c74d>] nfs_readdir_xdr_to_array+0x59d/0xaa0 [nfs]
[ 192.492346] [<ffffffffa098c1b0>] ? nfs_readdir_page_filler+0x1430/0x1430 [nfs]
[ 192.493506] [<ffffffff81a809c8>] ? radix_tree_lookup_slot+0x58/0x90
[ 192.494564] [<ffffffff81550604>] ? alloc_pages_current+0x44/0x290
[ 192.495558] [<ffffffff81467d60>] ? add_to_page_cache_locked+0x20/0x20
[ 192.496593] [<ffffffffa098ccba>] nfs_readdir_filler+0x6a/0x1c0 [nfs]
[ 192.497595] [<ffffffff8146879e>] do_read_cache_page+0x22e/0x6a0
[ 192.498568] [<ffffffffa098cc50>] ? nfs_readdir_xdr_to_array+0xaa0/0xaa0 [nfs]
[ 192.499646] [<ffffffff81468c50>] read_cache_page+0x40/0x70
[ 192.500547] [<ffffffffa098d266>] nfs_readdir+0x456/0x1680 [nfs]
[ 192.501501] [<ffffffff81495dfb>] ? lru_cache_add_active_or_unevictable+0xcb/0x260
[ 192.502616] [<ffffffffa098ce10>] ? nfs_readdir_filler+0x1c0/0x1c0 [nfs]
[ 192.503634] [<ffffffffa0a3c400>] ? nfs3_xdr_dec_getattr3res+0xb0/0xb0 [nfsv3]
[ 192.504714] [<ffffffff815dfb53>] iterate_dir+0x3f3/0x5b0
[ 192.505591] [<ffffffff815e0708>] SyS_getdents+0xf8/0x1a0
[ 192.506473] [<ffffffff81006460>] ? enter_from_user_mode+0x50/0x50
[ 192.507454] [<ffffffff815e0610>] ? SyS_old_readdir+0x100/0x100
[ 192.508412] [<ffffffff815dfd10>] ? iterate_dir+0x5b0/0x5b0
[ 192.509332] [<ffffffff81006a61>] ? syscall_trace_enter_phase2+0xf1/0x510
[ 192.510377] [<ffffffff815e0610>] ? SyS_old_readdir+0x100/0x100
[ 192.511351] [<ffffffff810072fa>] do_syscall_64+0x19a/0x410
[ 192.512258] [<ffffffff81460bfd>] ? context_tracking_enter+0x1d/0x20
[ 192.513275] [<ffffffff8271e321>] entry_SYSCALL64_slow_path+0x25/0x25
[ 192.514271] Code: 8b 57 10 49 8b 4f 18 49 89 54 05 00 48 89 4c 24 68 66 83 6c 24 68 01 80 7c 24 6b 00 78 c6 0f 0b 48 63 43 20 49 8b 57 10 4d 89 f5 <49> 8b 0c 06 48 85 c9 74 05 49 89 ce eb ca 48 8b 7c 24 48 45 31
[ 192.518193] RIP [<ffffffff8156622b>] deactivate_slab+0xab/0x3e0
[ 192.519099] RSP <ffff8800c7e8f550>
[ 192.519738] CR2: ffff814bf270ffff
[ 192.520410] ---[ end trace a406a29677c40c04 ]---
Same spot, apparently:
(gdb) list *(deactivate_slab+0xab)
0xffffffff8156622b is in deactivate_slab (mm/slub.c:245).
240 * Core slab cache functions
241 *******************************************************************/
242
243 static inline void *get_freepointer(struct kmem_cache *s, void *object)
244 {
245 return *(void **)(object + s->offset);
246 }
247
248 static void prefetch_freepointer(const struct kmem_cache *s, void *object)
249 {
IIRC, when I looked at this code last time, my take was that it wasn't
so much that the slab objects were getting corrupted, but rather the
cache metadata.
--
Jeff Layton <[email protected]>
On Sat, 2016-04-30 at 19:23 -0400, Jeff Layton wrote:
> On Sat, 2016-04-30 at 21:57 +0100, Al Viro wrote:
> >
> > On Sat, Apr 30, 2016 at 04:39:25PM -0400, Jeff Layton wrote:
> >
> > >
> > >
> > > Attached. Also I ran the same test on a 4.6.0-rc1 kernel and it
> > > worked
> > > fine, so it does seem to be a regression introduced somewhere in
> > > your
> > > patch pile. Also, FWIW the machine is a KVM guest.
> > OK... Could you try #work.lookups on your setup? That doesn't
> > contain the
> > last commit; the deadlock ought to be dealt with, though. I'm trying
> > to
> > reproduce the corruption with KASAN enabled, will try your config
> > once that's
> > finished...
>
> Here's what I got with your untested.nfs branch + KASAN:
>
> [ 192.453252] BUG: unable to handle kernel paging request at ffff814bf270ffff
> [ 192.454102] IP: [] deactivate_slab+0xab/0x3e0
> [ 192.454795] PGD 0
> [ 192.455040] Oops: 0000 [#1] SMP KASAN
> [ 192.455469] Modules linked in: nfsv3 nfs fscache xfs snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core libcrc32c snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer nfsd joydev snd acpi_cpufreq ppdev parport_pc tpm_tis virtio_net soundcore parport tpm pcspkr virtio_balloon pvpanic i2c_piix4 nfs_acl lockd auth_rpcgss grace sunrpc qxl drm_kms_helper virtio_console virtio_blk ttm drm serio_raw virtio_pci ata_generic virtio_ring pata_acpi virtio floppy
> [ 192.460993] CPU: 0 PID: 8628 Comm: rm Not tainted 4.6.0-rc1+ #29
> [ 192.461679] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [ 192.462359] task: ffff8801bf990000 ti: ffff8800c7e88000 task.ti: ffff8800c7e88000
> [ 192.463372] RIP: 0010:[] [] deactivate_slab+0xab/0x3e0
> [ 192.464266] RSP: 0018:ffff8800c7e8f550 EFLAGS: 00010002
> [ 192.464839] RAX: 0000000000000000 RBX: ffff8801d6c02200 RCX: 00000001801a0019
> [ 192.465646] RDX: ffff8800c5949c2e RSI: ffffea0003165200 RDI: ffffea0003165200
> [ 192.466443] RBP: ffff8800c7e8f650 R08: ffff8800c5948750 R09: 000000000000801a
> [ 192.467319] R10: ffff8800c5948750 R11: ffffed0037f53c7f R12: ffff8801d6c02208
> [ 192.468282] R13: ffff814bf270ffff R14: ffff814bf270ffff R15: ffffea0003165200
> [ 192.469064] FS: 00007f8ae54d7700(0000) GS:ffff8801d7000000(0000) knlGS:0000000000000000
> [ 192.469919] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 192.470528] CR2: ffff814bf270ffff CR3: 00000000c5950000 CR4: 00000000000006f0
> [ 192.471337] Stack:
> [ 192.471567] ffff8800c7e8f58c 00000000801a001a ffff8801d7007fc0 ffff8801bf990000
> [ 192.472453] 0000000000000000 ffff8800c7e88000 ffff8800c7e8f5c0 0000000003165200
> [ 192.473388] 0000000fc7e8f650 ffff8801d6c01e00 ffffffffa098b3df ffff8801bf990000
> [ 192.474349] Call Trace:
> [ 192.474641] [] ? nfs_readdir_page_filler+0x65f/0x1430 [nfs]
> [ 192.475456] [] ? nfs_readdir_page_filler+0x65f/0x1430 [nfs]
> [ 192.476246] [] ? save_stack_trace+0x2b/0x50
> [ 192.476872] [] ? set_track+0x74/0x120
> [ 192.477453] [] ? alloc_debug_processing+0x73/0x1b0
> [ 192.478143] [] ___slab_alloc+0x4ff/0x520
> [ 192.478742] [] ? memcpy+0x36/0x40
> [ 192.479309] [] ? nfs_readdir_page_filler+0x65f/0x1430 [nfs]
> [ 192.480084] [] ? nfs3_decode_dirent+0x773/0xd90 [nfsv3]
> [ 192.480846] [] ? nfs_readdir_page_filler+0x65f/0x1430 [nfs]
> [ 192.481642] [] __slab_alloc+0x20/0x40
> [ 192.482209] [] __kmalloc_track_caller+0x167/0x1d0
> [ 192.482884] [] kmemdup+0x20/0x50
> [ 192.483478] [] nfs_readdir_page_filler+0x65f/0x1430 [nfs]
> [ 192.484226] [] ? __slab_free+0x9a/0x250
> [ 192.484851] [] ? nfs_do_filldir+0x730/0x730 [nfs]
> [ 192.485916] [] ? nfs3_proc_readdir+0x2a7/0x4e0 [nfsv3]
> [ 192.486990] [] ? ___slab_alloc+0x4ff/0x520
> [ 192.487965] [] ? nfs3_proc_rmdir+0x2f0/0x2f0 [nfsv3]
> [ 192.489009] [] ? kasan_unpoison_shadow+0x36/0x50
> [ 192.490075] [] ? alloc_pages_current+0xd3/0x290
> [ 192.491144] [] nfs_readdir_xdr_to_array+0x59d/0xaa0 [nfs]
> [ 192.492346] [] ? nfs_readdir_page_filler+0x1430/0x1430 [nfs]
> [ 192.493506] [] ? radix_tree_lookup_slot+0x58/0x90
> [ 192.494564] [] ? alloc_pages_current+0x44/0x290
> [ 192.495558] [] ? add_to_page_cache_locked+0x20/0x20
> [ 192.496593] [] nfs_readdir_filler+0x6a/0x1c0 [nfs]
> [ 192.497595] [] do_read_cache_page+0x22e/0x6a0
> [ 192.498568] [] ? nfs_readdir_xdr_to_array+0xaa0/0xaa0 [nfs]
> [ 192.499646] [] read_cache_page+0x40/0x70
> [ 192.500547] [] nfs_readdir+0x456/0x1680 [nfs]
> [ 192.501501] [] ? lru_cache_add_active_or_unevictable+0xcb/0x260
> [ 192.502616] [] ? nfs_readdir_filler+0x1c0/0x1c0 [nfs]
> [ 192.503634] [] ? nfs3_xdr_dec_getattr3res+0xb0/0xb0 [nfsv3]
> [ 192.504714] [] iterate_dir+0x3f3/0x5b0
> [ 192.505591] [] SyS_getdents+0xf8/0x1a0
> [ 192.506473] [] ? enter_from_user_mode+0x50/0x50
> [ 192.507454] [] ? SyS_old_readdir+0x100/0x100
> [ 192.508412] [] ? iterate_dir+0x5b0/0x5b0
> [ 192.509332] [] ? syscall_trace_enter_phase2+0xf1/0x510
> [ 192.510377] [] ? SyS_old_readdir+0x100/0x100
> [ 192.511351] [] do_syscall_64+0x19a/0x410
> [ 192.512258] [] ? context_tracking_enter+0x1d/0x20
> [ 192.513275] [] entry_SYSCALL64_slow_path+0x25/0x25
> [ 192.514271] Code: 8b 57 10 49 8b 4f 18 49 89 54 05 00 48 89 4c 24 68 66 83 6c 24 68 01 80 7c 24 6b 00 78 c6 0f 0b 48 63 43 20 49 8b 57 10 4d 89 f5 <49> 8b 0c 06 48 85 c9 74 05 49 89 ce eb ca 48 8b 7c 24 48 45 31
> [ 192.518193] RIP [] deactivate_slab+0xab/0x3e0
> [ 192.519099] RSP
> [ 192.519738] CR2: ffff814bf270ffff
> [ 192.520410] ---[ end trace a406a29677c40c04 ]---
>
> Same spot, apparently:
>
> (gdb) list *(deactivate_slab+0xab)
> 0xffffffff8156622b is in deactivate_slab (mm/slub.c:245).
> 240 * Core slab cache functions
> 241 *******************************************************************/
> 242
> 243 static inline void *get_freepointer(struct kmem_cache *s, void *object)
> 244 {
> 245 return *(void **)(object + s->offset);
> 246 }
> 247
> 248 static void prefetch_freepointer(const struct kmem_cache *s, void *object)
> 249 {
>
> IIRC, when I looked at this code last time, my take was that it wasn't
> so much that the slab objects were getting corrupted, but rather the
> cache metadata.
>
...and then a little while later, this pops:
[ 650.375768] ==================================================================
[ 650.378114] BUG: KASAN: slab-out-of-bounds in memcpy+0x28/0x40 at addr ffff8800c5949c2e
[ 650.380379] Write of size 6 by task systemd/1
[ 650.381756] =============================================================================
[ 650.383894] BUG kmalloc-8 (Tainted: G D ): kasan: bad access detected
[ 650.386032] -----------------------------------------------------------------------------
[ 650.386032]
[ 650.389053] INFO: Allocated in nfs_readdir_page_filler+0x65f/0x1430 [nfs] age=457944 cpu=0 pid=8626
[ 650.391395] ___slab_alloc+0x4de/0x520
[ 650.392730] __slab_alloc+0x20/0x40
[ 650.393986] 0x8750ffff8156a837
[ 650.395171] 0xffffffff8800c594
[ 650.396387] nfs_readdir_page_filler+0x65f/0x1430 [nfs]
[ 650.397871] nfs_readdir_xdr_to_array+0x59d/0xaa0 [nfs]
[ 650.399455] nfs_readdir_filler+0x6a/0x1c0 [nfs]
[ 650.400856] do_read_cache_page+0x22e/0x6a0
[ 650.402183] read_cache_page+0x40/0x70
[ 650.403539] nfs_readdir+0x456/0x1680 [nfs]
[ 650.404971] iterate_dir+0x3f3/0x5b0
[ 650.406172] SyS_getdents+0xf8/0x1a0
[ 650.407323] do_syscall_64+0x19a/0x410
[ 650.408453] return_from_SYSCALL_64+0x0/0x6a
[ 650.409699] INFO: Freed in nfs_readdir_clear_array+0xdf/0x1d0 [nfs] age=457957 cpu=0 pid=8626
[ 650.411805] __slab_free+0x17f/0x250
[ 650.412935] kfree+0x173/0x190
[ 650.413983] nfs_readdir_clear_array+0xdf/0x1d0 [nfs]
[ 650.415364] cache_page_release.isra.27+0xf2/0x160 [nfs]
[ 650.416794] nfs_do_filldir+0x3af/0x730 [nfs]
[ 650.418011] nfs_readdir+0x764/0x1680 [nfs]
[ 650.419165] iterate_dir+0x3f3/0x5b0
[ 650.420260] SyS_getdents+0xf8/0x1a0
[ 650.421320] do_syscall_64+0x19a/0x410
[ 650.422401] return_from_SYSCALL_64+0x0/0x6a
[ 650.423564] INFO: Slab 0xffffea0003165200 objects=26 used=26 fp=0x (null) flags=0x3fff8000004080
[ 650.425673] INFO: Object 0xffff8800c5949c08 @offset=7176 fp=0xffff8800c5948fd8
[ 650.425673]
[ 650.427984] Bytes b4 ffff8800c5949bf8: 00 00 00 00 b2 21 00 00 10 59 fe ff 00 00 00 00 .....!...Y......
[ 650.430030] Object ffff8800c5949c08: d8 8f 94 c5 00 88 ff ff ........
[ 650.431896] CPU: 0 PID: 1 Comm: systemd Tainted: G B D 4.6.0-rc1+ #29
[ 650.433537] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 650.434982] ffffea0003165200 ffff8801d5e97978 ffffffff81a72334 ffff8801d6c02200
[ 650.436722] ffff8800c5949c08 ffff8801d5e979a8 ffffffff815641a2 ffff8801d6c02200
[ 650.438416] ffffea0003165200 ffff8800c5949c08 0000000000000002 ffff8801d5e979d0
[ 650.440271] Call Trace:
[ 650.441108] [<ffffffff81a72334>] dump_stack+0x63/0x8f
[ 650.442345] [<ffffffff815641a2>] print_trailer+0x112/0x1a0
[ 650.443669] [<ffffffff81569a04>] object_err+0x34/0x40
[ 650.444908] [<ffffffff8156bd54>] kasan_report_error+0x224/0x530
[ 650.446297] [<ffffffff8271dd0e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[ 650.447975] [<ffffffff8156c459>] kasan_report+0x39/0x40
[ 650.449195] [<ffffffff8156b701>] ? kasan_poison_slab+0x61/0x70
[ 650.450687] [<ffffffff8156b5e8>] ? memcpy+0x28/0x40
[ 650.451815] [<ffffffff8156b1dd>] __asan_storeN+0x12d/0x180
[ 650.453059] [<ffffffff8156b5e8>] memcpy+0x28/0x40
[ 650.454194] [<ffffffff814bf209>] kstrdup+0x49/0x60
[ 650.455333] [<ffffffff814bf243>] kstrdup_const+0x23/0x30
[ 650.456513] [<ffffffff817059dc>] __kernfs_new_node+0x2c/0x280
[ 650.457690] [<ffffffff81708b03>] kernfs_new_node+0x63/0xf0
[ 650.458822] [<ffffffff8170d42c>] __kernfs_create_file+0x2c/0x1e0
[ 650.460062] [<ffffffff81317657>] cgroup_addrm_files+0x2a7/0x850
[ 650.461303] [<ffffffff813173b0>] ? cgroup_idr_alloc.constprop.34+0x70/0x70
[ 650.462672] [<ffffffff81312ef0>] ? cgroup_file_name+0x1f0/0x1f0
[ 650.463853] [<ffffffff81318082>] css_populate_dir+0x2f2/0x3d0
[ 650.464918] [<ffffffff81325a74>] cgroup_mkdir+0x5a4/0x7b0
[ 650.466003] [<ffffffff813254d0>] ? cgroup_rmdir+0x40/0x40
[ 650.467107] [<ffffffff8170891d>] kernfs_iop_mkdir+0x14d/0x240
[ 650.468261] [<ffffffff815c8bbf>] vfs_mkdir+0x25f/0x470
[ 650.469331] [<ffffffff815d9366>] SyS_mkdir+0x196/0x200
[ 650.470407] [<ffffffff815d91d0>] ? SyS_mkdirat+0x200/0x200
[ 650.471526] [<ffffffff815d91d0>] ? SyS_mkdirat+0x200/0x200
[ 650.472635] [<ffffffff810072fa>] do_syscall_64+0x19a/0x410
[ 650.473699] [<ffffffff81460bfd>] ? context_tracking_enter+0x1d/0x20
[ 650.474872] [<ffffffff8271e321>] entry_SYSCALL64_slow_path+0x25/0x25
[ 650.476083] Memory state around the buggy address:
[ 650.477101] ffff8800c5949b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 650.478412] ffff8800c5949b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 650.479687] >ffff8800c5949c00: fc fb fc fc fc 00 fc fc fc fc fc fc fc fc fc fc
[ 650.480929] ^
[ 650.481896] ffff8800c5949c80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[ 650.483227] ffff8800c5949d00: fc fc fc fc fc fc fc fc fb fc fc fc fc fc fc fc
[ 650.484541] ==================================================================
--
Jeff Layton <[email protected]>
On Sat, Apr 30, 2016 at 06:33:36PM -0400, Jeff Layton wrote:
> I'll do the same (re: KASAN).
>
> Also FWIW, a few months ago I hit some oopses in the same inline
> function (get_freepointer). It turned out to be a double-free due to my
> own misuse of the fsnotify API. I wonder though if this might also be a
> double free somewhere?
It is a double-free somewhere, all right... What happens there is that
nfs_readdir really relies upon being the only thread to manipulate the
page cache of that directory. We get nfs_revalidate_mapping() called
and if it ends up evicting a page currently in use by nfs_do_filldir(),
you get nfs_readdir_clear_array() called _twice_ - once on kicking it
out of page cache (and those kfree of the names are obviously Not Good(tm)
for nfs_do_filldir() copying those names to userland) and then when
nfs_do_filldir() gets to cache_page_release().
Sigh...
On Sun, May 01, 2016 at 12:31:38AM +0100, Al Viro wrote:
> On Sat, Apr 30, 2016 at 06:33:36PM -0400, Jeff Layton wrote:
> > I'll do the same (re: KASAN).
> >
> > Also FWIW, a few months ago I hit some oopses in the same inline
> > function (get_freepointer). It turned out to be a double-free due to my
> > own misuse of the fsnotify API. I wonder though if this might also be a
> > double free somewhere?
>
> It is a double-free somewhere, all right... What happens there is that
> nfs_readdir really relies upon being the only thread to manipulate the
> page cache of that directory. We get nfs_revalidate_mapping() called
> and if it ends up evicting a page currently in use by nfs_do_filldir(),
> you get nfs_readdir_clear_array() called _twice_ - once on kicking it
> out of page cache (and those kfree of the names are obviously Not Good(tm)
> for nfs_do_filldir() copying those names to userland) and then when
> nfs_do_filldir() gets to cache_page_release().
>
> Sigh...
AFAICS, we have desc->page coming either from get_cache_page() or from
direct assignment in uncached_readdir(). The latter is not a problem;
it won't be hit with pagecache eviction anyway. The former, OTOH, is.
I wonder if we ought to put a counter into nfs_cache_array, initialized to 1
(in nfs_readdir_xdr_to_array()), bumped in get_cache_page() and decremented
both in cache_page_release() and in ->freepage(). With actual freeing
of names happening only when the sucker reaches 0, and get_cache_page()
treating "oops, it's already 0, someone has just evicted it from page cache"
as "page_cache_release() and retry". Objections?
On Sun, May 01, 2016 at 01:02:55AM +0100, Al Viro wrote:
> I wonder if we ought to put a counter into nfs_cache_array, initialized to 1
> (in nfs_readdir_xdr_to_array()), bumped in get_cache_page() and decremented
> both in cache_page_release() and in ->freepage(). With actual freeing
> of names happening only when the sucker reaches 0, and get_cache_page()
> treating "oops, it's already 0, someone has just evicted it from page cache"
> as "page_cache_release() and retry". Objections?
Something like (completely untested)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index da109e6..037bfb4 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -145,6 +145,7 @@ struct nfs_cache_array_entry {
};
struct nfs_cache_array {
+ atomic_t refcount;
int size;
int eof_index;
u64 last_cookie;
@@ -200,11 +201,20 @@ void nfs_readdir_clear_array(struct page *page)
int i;
array = kmap_atomic(page);
- for (i = 0; i < array->size; i++)
- kfree(array->array[i].string.name);
+ if (atomic_dec_and_test(&array->refcount))
+ for (i = 0; i < array->size; i++)
+ kfree(array->array[i].string.name);
kunmap_atomic(array);
}
+static bool grab_page(struct page *page)
+{
+ struct nfs_cache_array *array = kmap_atomic(page);
+ bool res = atomic_inc_not_zero(&array->refcount);
+ kunmap_atomic(array);
+ return res;
+}
+
/*
* the caller is responsible for freeing qstr.name
* when called by nfs_readdir_add_to_array, the strings will be freed in
@@ -645,6 +655,7 @@ int nfs_readdir_xdr_to_array(nfs_readdir_descriptor_t *desc, struct page *page,
goto out_label_free;
}
memset(array, 0, sizeof(struct nfs_cache_array));
+ atomic_set(&array->refcount, 1);
array->eof_index = -1;
status = nfs_readdir_alloc_pages(pages, array_size);
@@ -707,8 +718,7 @@ int nfs_readdir_filler(nfs_readdir_descriptor_t *desc, struct page* page)
static
void cache_page_release(nfs_readdir_descriptor_t *desc)
{
- if (!desc->page->mapping)
- nfs_readdir_clear_array(desc->page);
+ nfs_readdir_clear_array(desc->page);
page_cache_release(desc->page);
desc->page = NULL;
}
@@ -716,8 +726,16 @@ void cache_page_release(nfs_readdir_descriptor_t *desc)
static
struct page *get_cache_page(nfs_readdir_descriptor_t *desc)
{
- return read_cache_page(file_inode(desc->file)->i_mapping,
+ struct page *page;
+
+ for (;;) {
+ page = read_cache_page(file_inode(desc->file)->i_mapping,
desc->page_index, (filler_t *)nfs_readdir_filler, desc);
+ if (IS_ERR(page) || grab_page(page))
+ break;
+ page_cache_release(page);
+ }
+ return page;
}
/*
On Sun, May 01, 2016 at 01:18:16AM +0100, Al Viro wrote:
> On Sun, May 01, 2016 at 01:02:55AM +0100, Al Viro wrote:
> > I wonder if we ought to put a counter into nfs_cache_array, initialized to 1
> > (in nfs_readdir_xdr_to_array()), bumped in get_cache_page() and decremented
> > both in cache_page_release() and in ->freepage(). With actual freeing
> > of names happening only when the sucker reaches 0, and get_cache_page()
> > treating "oops, it's already 0, someone has just evicted it from page cache"
> > as "page_cache_release() and retry". Objections?
>
> Something like (completely untested)
No problems after 100 iterations... Folded and pushed.
On Sun, 2016-05-01 at 02:08 +0100, Al Viro wrote:
> On Sun, May 01, 2016 at 01:18:16AM +0100, Al Viro wrote:
> >
> > On Sun, May 01, 2016 at 01:02:55AM +0100, Al Viro wrote:
> > >
> > > I wonder if we ought to put a counter into nfs_cache_array,
> > > initialized to 1
> > > (in nfs_readdir_xdr_to_array()), bumped in get_cache_page() and
> > > decremented
> > > both in cache_page_release() and in ->freepage(). With actual
> > > freeing
> > > of names happening only when the sucker reaches 0, and
> > > get_cache_page()
> > > treating "oops, it's already 0, someone has just evicted it from
> > > page cache"
> > > as "page_cache_release() and retry". Objections?
> > Something like (completely untested)
> No problems after 100 iterations... Folded and pushed.
Same here. Ran same test vs. your work.lookups branch and it seems fine
now.
Cheers,
--
Jeff Layton <[email protected]>