LinuxLists.cc - dcache_readdir NULL inode oops

2018-11-09 14:39:23

Subject: dcache_readdir NULL inode oops

Hi Al,

I'm seeing the following oops reproducible with upstream kernel on arm64 (ThunderX2):

[ 5428.795719] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040
[ 5428.813838] Mem abort info:
[ 5428.820721] ESR = 0x96000006
[ 5428.828476] Exception class = DABT (current EL), IL = 32 bits
[ 5428.841590] SET = 0, FnV = 0
[ 5428.848939] EA = 0, S1PTW = 0
[ 5428.855941] Data abort info:
[ 5428.862422] ISV = 0, ISS = 0x00000006
[ 5428.870787] CM = 0, WnR = 0
[ 5428.877359] user pgtable: 4k pages, 48-bit VAs, pgdp = 0000000052f9e034
[ 5428.891098] [0000000000000040] pgd=0000007ebb0d6003, pud=0000007ed3073003, pmd=0000000000000000
[ 5428.909251] Internal error: Oops: 96000006 [#1] SMP
[ 5428.919122] Modules linked in: xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ip6table_filter ip6_tables iptable_filter ipmi_ssif ip_tables x_tables ipv6 crc32_ce bnx2x crct10dif_ce igb nvme nvme_core i2c_algo_bit mdio gpio_xlp i2c_xlp9xx
[ 5428.972724] CPU: 45 PID: 220018 Comm: stress-ng-dev Not tainted 4.19.0-jang+ #45
[ 5428.987664] Hardware name: To be filled by O.E.M. Saber/To be filled by O.E.M., BIOS 0ACKL018 03/30/2018
[ 5429.006819] pstate: 60400009 (nZCv daif +PAN -UAO)
[ 5429.016567] pc : dcache_readdir+0xfc/0x1a8
[ 5429.024903] lr : dcache_readdir+0x134/0x1a8
[ 5429.033376] sp : ffff00002d553d70
[ 5429.040101] x29: ffff00002d553d70 x28: ffff807db4988000
[ 5429.050892] x27: 0000000000000000 x26: 0000000000000000
[ 5429.061679] x25: 0000000056000000 x24: ffff8024577106c0
[ 5429.072457] x23: 0000000000000000 x22: ffff80267b92a480
[ 5429.083248] x21: ffff80267b92a520 x20: ffff8024575e5e00
[ 5429.094029] x19: ffff00002d553e40 x18: 0000000000000000
[ 5429.104805] x17: 0000000000000000 x16: 0000000000000000
[ 5429.115553] x15: 0000000000000000 x14: 0000000000000000
[ 5429.126332] x13: 0000000000000000 x12: 0000000000000000
[ 5429.137096] x11: 0000000000000000 x10: ffff80266b398228
[ 5429.147849] x9 : ffff80266b398000 x8 : 0000000000007e4e
[ 5429.158580] x7 : 0000000000000000 x6 : ffff00000830d190
[ 5429.169362] x5 : 0000000000000000 x4 : ffff00000d7506a8
[ 5429.180123] x3 : 0000000000000002 x2 : 0000000000000002
[ 5429.190890] x1 : ffff8024575e5e38 x0 : ffff00002d553e40
[ 5429.201715] Process stress-ng-dev (pid: 220018, stack limit = 0x000000009437ac28)
[ 5429.216828] Call trace:
[ 5429.221855] dcache_readdir+0xfc/0x1a8
[ 5429.229459] iterate_dir+0x8c/0x1a0
[ 5429.236561] ksys_getdents64+0xa4/0x188
[ 5429.244357] __arm64_sys_getdents64+0x28/0x38
[ 5429.253201] el0_svc_handler+0x7c/0x100
[ 5429.260989] el0_svc+0x8/0xc
[ 5429.266878] Code: a9429681 aa1303e0 b9402682 a9400e66 (f94020a4)
[ 5429.279192] ---[ end trace 5c1e28c07cf016c5 ]---

It happens after 1-3 hours of running 'stress-ng --dev 128'. This testcase does a scandir of /dev
and then calls random stuff like ioctl, lseek, open/close etc. on the entries. I assume no files are
deleted under /dev during the testcase.

The NULL pointer is the inode pointer of next. The next dentry->d_flags is DCACHE_RCUACCESS
when this happens.

Any hints on how to further debug this?

--Jan

2018-11-09 16:02:59

by Will Deacon

[permalink] [raw]

Subject: Re: dcache_readdir NULL inode oops

On Fri, Nov 09, 2018 at 02:37:51PM +0000, Jan Glauber wrote:
> I'm seeing the following oops reproducible with upstream kernel on arm64
> (ThunderX2):

[...]

> It happens after 1-3 hours of running 'stress-ng --dev 128'. This testcase
> does a scandir of /dev and then calls random stuff like ioctl, lseek,
> open/close etc. on the entries. I assume no files are deleted under /dev
> during the testcase.
>
> The NULL pointer is the inode pointer of next. The next dentry->d_flags is
> DCACHE_RCUACCESS when this happens.
>
> Any hints on how to further debug this?

Can you reproduce the issue with vanilla -rc1 and do you have a "known good"
kernel?

Will

2018-11-10 11:18:21

by Jan Glauber

[permalink] [raw]

Subject: Re: dcache_readdir NULL inode oops

On Fri, Nov 09, 2018 at 03:58:56PM +0000, Will Deacon wrote:
> On Fri, Nov 09, 2018 at 02:37:51PM +0000, Jan Glauber wrote:
> > I'm seeing the following oops reproducible with upstream kernel on arm64
> > (ThunderX2):
>
> [...]
>
> > It happens after 1-3 hours of running 'stress-ng --dev 128'. This testcase
> > does a scandir of /dev and then calls random stuff like ioctl, lseek,
> > open/close etc. on the entries. I assume no files are deleted under /dev
> > during the testcase.
> >
> > The NULL pointer is the inode pointer of next. The next dentry->d_flags is
> > DCACHE_RCUACCESS when this happens.
> >
> > Any hints on how to further debug this?
>
> Can you reproduce the issue with vanilla -rc1 and do you have a "known good"
> kernel?

I can try out -rc1, but IIRC this wasn't bisectible as the bug was present at
least back to 4.14. I need to double check that as there were other issues
that are resolved now so I may confuse things here. I've defintely seen
the same bug with 4.18.

Unfortunately I lost access to the machine as our data center seems to be
moving currently so it might take some days until I can try -rc1.

thanks,
Jan

2018-11-20 18:29:38

by Will Deacon

[permalink] [raw]

Subject: Re: dcache_readdir NULL inode oops

On Sat, Nov 10, 2018 at 11:17:03AM +0000, Jan Glauber wrote:
> On Fri, Nov 09, 2018 at 03:58:56PM +0000, Will Deacon wrote:
> > On Fri, Nov 09, 2018 at 02:37:51PM +0000, Jan Glauber wrote:
> > > I'm seeing the following oops reproducible with upstream kernel on arm64
> > > (ThunderX2):
> >
> > [...]
> >
> > > It happens after 1-3 hours of running 'stress-ng --dev 128'. This testcase
> > > does a scandir of /dev and then calls random stuff like ioctl, lseek,
> > > open/close etc. on the entries. I assume no files are deleted under /dev
> > > during the testcase.
> > >
> > > The NULL pointer is the inode pointer of next. The next dentry->d_flags is
> > > DCACHE_RCUACCESS when this happens.
> > >
> > > Any hints on how to further debug this?
> >
> > Can you reproduce the issue with vanilla -rc1 and do you have a "known good"
> > kernel?
>
> I can try out -rc1, but IIRC this wasn't bisectible as the bug was present at
> least back to 4.14. I need to double check that as there were other issues
> that are resolved now so I may confuse things here. I've defintely seen
> the same bug with 4.18.
>
> Unfortunately I lost access to the machine as our data center seems to be
> moving currently so it might take some days until I can try -rc1.

Ok, I've just managed to reproduce this in a KVM guest running v4.20-rc3 on
both the host and the guest, so if anybody has any ideas of things to try then
I'm happy to give them a shot. In the meantime, I'll try again with a bunch of
debug checks enabled.

Interestingly, I see many CPUs crashing one after the other in the same place
with *0x40, which indicates that the underlying data structure is corrupted
somehow. The final crash was in a different place with *0x10, which I've also
included below.

Will

--->8

[ 353.086276] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000040
[ 353.088334] Mem abort info:
[ 353.088501] ESR = 0x96000004
[ 353.123277] Exception class = DABT (current EL), IL = 32 bits
[ 353.126126] SET = 0, FnV = 0
[ 353.127064] EA = 0, S1PTW = 0
[ 353.127917] Data abort info:
[ 353.130869] ISV = 0, ISS = 0x00000004
[ 353.131793] CM = 0, WnR = 0
[ 353.133998] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000344077db
[ 353.135410] [0000000000000040] pgd=0000000000000000
[ 353.137903] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 353.139146] Modules linked in:
[ 353.140232] CPU: 41 PID: 2514 Comm: stress-ng-dev Not tainted 4.20.0-rc3-00012-g40b114779944 #1
[ 353.140367] Hardware name: linux,dummy-virt (DT)
[ 353.190775] pstate: 40400005 (nZcv daif +PAN -UAO)
[ 353.191833] pc : dcache_readdir+0xd0/0x170
[ 353.193058] lr : dcache_readdir+0x108/0x170
[ 353.194075] sp : ffff00000e17bd70
[ 353.195027] x29: ffff00000e17bd70 x28: ffff8003cbe60000
[ 353.196232] x27: 0000000000000000 x26: 0000000000000000
[ 353.196334] x25: 0000000056000000 x24: ffff80037e3a9200
[ 353.255951] x23: 0000000000000000 x22: ffff8003d692ae40
[ 353.257708] x21: ffff8003d692aee0 x20: ffff00000e17be40
[ 353.259044] x19: ffff80037d875b00 x18: 0000000000000000
[ 353.259210] x17: 0000000000000000 x16: 0000000000000000
[ 353.259354] x15: 0000000000000000 x14: 0000000000000000
[ 353.259469] x13: 0000000000000000 x12: 0000000000000000
[ 353.259610] x11: 0000000000000000 x10: 0000000000000000
[ 353.259746] x9 : 0000ffffffffffff x8 : 0000ffffffffffff
[ 353.422637] x7 : 0000000000000005 x6 : ffff000008245768
[ 353.422639] x5 : 0000000000000000 x4 : 0000000000002000
[ 353.422640] x3 : 0000000000000002 x2 : 0000000000000001
[ 353.422642] x1 : ffff80037d875b38 x0 : ffff00000e17be40
[ 353.422646] Process stress-ng-dev (pid: 2514, stack limit = 0x000000006721788f)
[ 353.422647] Call trace:
[ 353.422654] dcache_readdir+0xd0/0x170
[ 353.422664] iterate_dir+0x13c/0x190
[ 353.429254] ksys_getdents64+0x88/0x168
[ 353.429256] __arm64_sys_getdents64+0x1c/0x28
[ 353.429260] el0_svc_common+0x84/0xd8
[ 353.429261] el0_svc_handler+0x2c/0x80
[ 353.429264] el0_svc+0x8/0xc
[ 353.429267] Code: a9429661 aa1403e0 a9400e86 b9402662 (f94020a4)
[ 353.429272] ---[ end trace 7bc53f0d6caaf0d1 ]---

[ 1770.346163] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
[ 1770.364229] Mem abort info:
[ 1770.364411] ESR = 0x96000004
[ 1770.364419] Exception class = DABT (current EL), IL = 32 bits
[ 1770.364434] SET = 0, FnV = 0
[ 1770.364441] EA = 0, S1PTW = 0
[ 1770.364442] Data abort info:
[ 1770.364443] ISV = 0, ISS = 0x00000004
[ 1770.364444] CM = 0, WnR = 0
[ 1770.364480] user pgtable: 4k pages, 48-bit VAs, pgdp = 00000000d05dfa48
[ 1770.364491] [0000000000000010] pgd=0000000000000000
[ 1770.364537] Internal error: Oops: 96000004 [#34] PREEMPT SMP
[ 1770.364586] Modules linked in:
[ 1770.364592] CPU: 2 PID: 2491 Comm: stress-ng-dev Tainted: G D 4.20.0-rc3-00012-g40b114779944 #1
[ 1770.364594] Hardware name: linux,dummy-virt (DT)
[ 1770.364596] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 1770.364665] pc : n_tty_ioctl+0x128/0x1a0
[ 1770.364668] lr : n_tty_ioctl+0xac/0x1a0
[ 1770.364669] sp : ffff00000e723ca0
[ 1770.364691] x29: ffff00000e723ca0 x28: ffff8003d2a94f80
[ 1770.485270] x27: 0000000000000000 x26: 0000000000000000
[ 1770.485343] x25: ffff8003955a9780 x24: 0000fffff3c025f0
[ 1770.485346] x23: ffff80038ad46100 x22: ffff800394c1c0c0
[ 1770.496821] x21: 0000000000000000 x20: ffff800394c1c000
[ 1770.496824] x19: 0000fffff3c025f0 x18: 0000000000000000
[ 1770.496825] x17: 0000000000000000 x16: 0000000000000000
[ 1770.496827] x15: 0000000000000000 x14: 0000000000000000
[ 1770.496828] x13: 0000000000000000 x12: 0000000000000000
[ 1770.496829] x11: 0000000000000000 x10: 0000000000000000
[ 1770.496830] x9 : 0000000000000000 x8 : 0000000000000000
[ 1770.496831] x7 : 0000000000000000 x6 : 0000000000000000
[ 1770.496833] x5 : 000000000000541b x4 : ffff0000085b4780
[ 1770.496834] x3 : 0000fffff3c025f0 x2 : 000000000000541b
[ 1770.496835] x1 : ffffffff00000001 x0 : 0000000000000002
[ 1770.496839] Process stress-ng-dev (pid: 2491, stack limit = 0x000000001177919b)
[ 1770.496840] Call trace:
[ 1770.496845] n_tty_ioctl+0x128/0x1a0
[ 1770.496847] tty_ioctl+0x2fc/0xb70
[ 1770.496851] do_vfs_ioctl+0xb8/0x890
[ 1770.496853] ksys_ioctl+0x78/0xa8
[ 1770.496854] __arm64_sys_ioctl+0x1c/0x28
[ 1770.496858] el0_svc_common+0x84/0xd8
[ 1770.496860] el0_svc_handler+0x2c/0x80
[ 1770.496863] el0_svc+0x8/0xc
[ 1770.496865] Code: a94153f3 a9425bf5 a8c37bfd d65f03c0 (f9400aa4)
[ 1770.496869] ---[ end trace 7bc53f0d6caaf0f2 ]---

2018-11-20 20:20:04

by Will Deacon

[permalink] [raw]

Subject: Re: dcache_readdir NULL inode oops

On Tue, Nov 20, 2018 at 06:28:54PM +0000, Will Deacon wrote:
> On Sat, Nov 10, 2018 at 11:17:03AM +0000, Jan Glauber wrote:
> > On Fri, Nov 09, 2018 at 03:58:56PM +0000, Will Deacon wrote:
> > > On Fri, Nov 09, 2018 at 02:37:51PM +0000, Jan Glauber wrote:
> > > > I'm seeing the following oops reproducible with upstream kernel on arm64
> > > > (ThunderX2):
> > >
> > > [...]
> > >
> > > > It happens after 1-3 hours of running 'stress-ng --dev 128'. This testcase
> > > > does a scandir of /dev and then calls random stuff like ioctl, lseek,
> > > > open/close etc. on the entries. I assume no files are deleted under /dev
> > > > during the testcase.
> > > >
> > > > The NULL pointer is the inode pointer of next. The next dentry->d_flags is
> > > > DCACHE_RCUACCESS when this happens.
> > > >
> > > > Any hints on how to further debug this?
> > >
> > > Can you reproduce the issue with vanilla -rc1 and do you have a "known good"
> > > kernel?
> >
> > I can try out -rc1, but IIRC this wasn't bisectible as the bug was present at
> > least back to 4.14. I need to double check that as there were other issues
> > that are resolved now so I may confuse things here. I've defintely seen
> > the same bug with 4.18.
> >
> > Unfortunately I lost access to the machine as our data center seems to be
> > moving currently so it might take some days until I can try -rc1.
>
> Ok, I've just managed to reproduce this in a KVM guest running v4.20-rc3 on
> both the host and the guest, so if anybody has any ideas of things to try then
> I'm happy to give them a shot. In the meantime, I'll try again with a bunch of
> debug checks enabled.

Weee, I eventually hit a use-after-free from KASAN. See below.

Will

--->8

[ 615.973367] ==================================================================
[ 615.974675] BUG: KASAN: use-after-free in next_positive.isra.2+0x188/0x1a0
[ 615.975574] Read of size 8 at addr ffff8002fb33c190 by task stress-ng-dev/3145
[ 615.977348]
[ 615.977692] CPU: 16 PID: 3145 Comm: stress-ng-dev Tainted: G D 4.20.0-rc3-00012-g40b114779944 #2
[ 615.980171] Hardware name: linux,dummy-virt (DT)
[ 615.981325] Call trace:
[ 615.981765] dump_backtrace+0x0/0x280
[ 615.982386] show_stack+0x14/0x20
[ 615.983125] dump_stack+0xc4/0xec
[ 615.983141] print_address_description+0x60/0x25c
[ 615.985226] kasan_report+0x1a8/0x358
[ 615.986161] __asan_report_load8_noabort+0x18/0x20
[ 615.986978] next_positive.isra.2+0x188/0x1a0
[ 615.987767] dcache_readdir+0x2cc/0x488
[ 615.988428] iterate_dir+0x168/0x448
[ 615.989342] ksys_getdents64+0xe8/0x248
[ 615.990334] __arm64_sys_getdents64+0x68/0x98
[ 615.990341] el0_svc_common+0x104/0x210
[ 615.990345] el0_svc_handler+0x48/0xb0
[ 615.990349] el0_svc+0x8/0xc
[ 615.990356]
[ 615.994175] Allocated by task 2720:
[ 615.994184] kasan_kmalloc.part.1+0x40/0x108
[ 615.994188] kasan_kmalloc+0xb4/0xc8
[ 615.994192] kasan_slab_alloc+0x14/0x20
[ 615.994195] kmem_cache_alloc+0x130/0x1f8
[ 615.994203] __d_alloc+0x30/0x848
[ 615.994215] d_alloc+0x30/0x1d0
[ 616.000554] d_alloc_name+0x84/0xb0
[ 616.000562] devpts_pty_new+0x2e0/0x5e8
[ 616.000568] ptmx_open+0x14c/0x288
[ 616.000576] chrdev_open+0x194/0x408
[ 616.000586] do_dentry_open+0x2e8/0xac8
[ 616.004282] vfs_open+0x8c/0xc0
[ 616.004286] path_openat+0x694/0x33e8
[ 616.004288] do_filp_open+0x13c/0x200
[ 616.004296] do_sys_open+0x1dc/0x2e0
[ 616.006865] __arm64_sys_openat+0x88/0xc8
[ 616.006872] el0_svc_common+0x104/0x210
[ 616.006876] el0_svc_handler+0x48/0xb0
[ 616.006880] el0_svc+0x8/0xc
[ 616.006881]
[ 616.006883] Freed by task 0:
[ 616.006889] __kasan_slab_free+0x114/0x228
[ 616.006897] kasan_slab_free+0x10/0x18
[ 616.012068] kmem_cache_free+0x60/0x1e8
[ 616.012071] __d_free+0x18/0x20
[ 616.012081] rcu_process_callbacks+0x46c/0x940
[ 616.012086] __do_softirq+0x28c/0x6cc
[ 616.012087]
[ 616.012100] The buggy address belongs to the object at ffff8002fb33c100
[ 616.012100] which belongs to the cache dentry of size 192
[ 616.017462] The buggy address is located 144 bytes inside of
[ 616.017462] 192-byte region [ffff8002fb33c100, ffff8002fb33c1c0)
[ 616.017465] The buggy address belongs to the page:
[ 616.017470] page:ffff7e000beccf00 count:1 mapcount:0 mapping:ffff800358c13400 index:0x0 compound_mapcount: 0
[ 616.017477] flags: 0x1ffff00000010200(slab|head)
[ 616.017488] raw: 1ffff00000010200 dead000000000100 dead000000000200 ffff800358c13400
[ 616.024873] raw: 0000000000000000 0000000080400040 00000001ffffffff 0000000000000000
[ 616.024875] page dumped because: kasan: bad access detected
[ 616.024876]
[ 616.024877] Memory state around the buggy address:
[ 616.024882] ffff8002fb33c080: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
[ 616.024885] ffff8002fb33c100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 616.024887] >ffff8002fb33c180: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
[ 616.024889] ^
[ 616.024891] ffff8002fb33c200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 616.024893] ffff8002fb33c280: 00 00 00 00 00 00 00 00 fc fc fc fc fc fc fc fc
[ 616.024894] ==================================================================

2018-11-21 13:33:38

by Jan Glauber

[permalink] [raw]

Subject: Re: dcache_readdir NULL inode oops

On Tue, Nov 20, 2018 at 07:03:17PM +0000, Will Deacon wrote:
> On Tue, Nov 20, 2018 at 06:28:54PM +0000, Will Deacon wrote:
> > On Sat, Nov 10, 2018 at 11:17:03AM +0000, Jan Glauber wrote:
> > > On Fri, Nov 09, 2018 at 03:58:56PM +0000, Will Deacon wrote:
> > > > On Fri, Nov 09, 2018 at 02:37:51PM +0000, Jan Glauber wrote:
> > > > > I'm seeing the following oops reproducible with upstream kernel on arm64
> > > > > (ThunderX2):
> > > >
> > > > [...]
> > > >
> > > > > It happens after 1-3 hours of running 'stress-ng --dev 128'. This testcase
> > > > > does a scandir of /dev and then calls random stuff like ioctl, lseek,
> > > > > open/close etc. on the entries. I assume no files are deleted under /dev
> > > > > during the testcase.
> > > > >
> > > > > The NULL pointer is the inode pointer of next. The next dentry->d_flags is
> > > > > DCACHE_RCUACCESS when this happens.
> > > > >
> > > > > Any hints on how to further debug this?
> > > >
> > > > Can you reproduce the issue with vanilla -rc1 and do you have a "known good"
> > > > kernel?
> > >
> > > I can try out -rc1, but IIRC this wasn't bisectible as the bug was present at
> > > least back to 4.14. I need to double check that as there were other issues
> > > that are resolved now so I may confuse things here. I've defintely seen
> > > the same bug with 4.18.
> > >
> > > Unfortunately I lost access to the machine as our data center seems to be
> > > moving currently so it might take some days until I can try -rc1.
> >
> > Ok, I've just managed to reproduce this in a KVM guest running v4.20-rc3 on
> > both the host and the guest, so if anybody has any ideas of things to try then
> > I'm happy to give them a shot. In the meantime, I'll try again with a bunch of
> > debug checks enabled.

Hi Will,

good that you can reproduce the issue. I've verified that the issue is
indeed reproducible with 4.14.

>
> Weee, I eventually hit a use-after-free from KASAN. See below.

I ran KASAN (and all the other debug stuff) but didn't trigger anything
in the host.

--Jan

2018-11-24 08:46:46

by Will Deacon

[permalink] [raw]

Subject: Re: dcache_readdir NULL inode oops

Hi all,

I've now managed to reproduce this on x86 (log below) but I'm out of my
depth with this one. Looping in Greg and Jiri because I fear this is
specific to the pty code. Rest of the thread is here:

http://lkml.kernel.org/r/20181109143744.GA12128@hc

On Wed, Nov 21, 2018 at 01:19:06PM +0000, Jan Glauber wrote:
> On Tue, Nov 20, 2018 at 07:03:17PM +0000, Will Deacon wrote:
> > On Tue, Nov 20, 2018 at 06:28:54PM +0000, Will Deacon wrote:
> > > On Sat, Nov 10, 2018 at 11:17:03AM +0000, Jan Glauber wrote:
> > > > On Fri, Nov 09, 2018 at 03:58:56PM +0000, Will Deacon wrote:
> > > > > On Fri, Nov 09, 2018 at 02:37:51PM +0000, Jan Glauber wrote:
> > > > > > I'm seeing the following oops reproducible with upstream kernel on arm64
> > > > > > (ThunderX2):
> > > > >
> > > > > [...]
> > > > >
> > > > > > It happens after 1-3 hours of running 'stress-ng --dev 128'. This testcase
> > > > > > does a scandir of /dev and then calls random stuff like ioctl, lseek,
> > > > > > open/close etc. on the entries. I assume no files are deleted under /dev
> > > > > > during the testcase.
> > > > > >
> > > > > > The NULL pointer is the inode pointer of next. The next dentry->d_flags is
> > > > > > DCACHE_RCUACCESS when this happens.
> > > > > >
> > > > > > Any hints on how to further debug this?
> > > > >
> > > > > Can you reproduce the issue with vanilla -rc1 and do you have a "known good"
> > > > > kernel?
> > > >
> > > > I can try out -rc1, but IIRC this wasn't bisectible as the bug was present at
> > > > least back to 4.14. I need to double check that as there were other issues
> > > > that are resolved now so I may confuse things here. I've defintely seen
> > > > the same bug with 4.18.
> > > >
> > > > Unfortunately I lost access to the machine as our data center seems to be
> > > > moving currently so it might take some days until I can try -rc1.
> > >
> > > Ok, I've just managed to reproduce this in a KVM guest running v4.20-rc3 on
> > > both the host and the guest, so if anybody has any ideas of things to try then
> > > I'm happy to give them a shot. In the meantime, I'll try again with a bunch of
> > > debug checks enabled.
>
> good that you can reproduce the issue. I've verified that the issue is
> indeed reproducible with 4.14.
>
> >
> > Weee, I eventually hit a use-after-free from KASAN. See below.
>
> I ran KASAN (and all the other debug stuff) but didn't trigger anything
> in the host.

Doing some more debugging, it looks like the usual failure case is where
one CPU clears the inode field in the dentry via:

devpts_pty_kill()
-> d_delete() // dentry->d_lockref.count == 1
-> dentry_unlink_inode()

whilst another CPU gets a pointer to the dentry via:

sys_getdents64()
-> iterate_dir()
-> dcache_readdir()
-> next_positive()

and explodes on the subsequent inode dereference when trying to pass the
inode number to dir_emit():

if (!dir_emit(..., d_inode(next)->i_ino, ...))

Indeed, the hack below triggers a warning, indicating that the inode
is being cleared concurrently.

I can't work out whether the getdents64() path should hold a refcount
to stop d_delete() in its tracks, or whether devpts_pty_kill() shouldn't
be calling d_delete() like this at all.

Any help debugging this would be really appreciated.

Will

--->8

diff --git a/fs/libfs.c b/fs/libfs.c
index 0fb590d79f30..fe3f72c5cf72 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -117,6 +117,7 @@ static struct dentry *next_positive(struct dentry *parent,
if (unlikely(*seq != n))
goto retry;
}
+ WARN_ON(res && !d_inode(res));
return res;
}

--->8

[ 410.560885] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 410.561458] PGD 0 P4D 0
[ 410.561458] Oops: 0000 [#1] PREEMPT SMP PTI
[ 410.561458] CPU: 1 PID: 983 Comm: stress-ng-dev Not tainted 4.20.0-rc3 #1
[ 410.561458] RIP: 0010:dcache_readdir+0xd6/0x160
[ 410.561458] Code: 7d 48 c7 43 08 02 00 00 00 4d 8d a5 a0 00 00 00 45 31 f6 eb 41 48 8b 45 30 48 8b 4b 08 48 89 df 8b 55 24 48 8b 75 28 4c 8b 13 <44> 0f b7 08 4c 8b 40 40 66 41 c1 e9 0c 41 83 e1 0f e8 24 5e c1 00
[ 410.565432] RSP: 0018:ffffb0a2004e7e48 EFLAGS: 00010286
[ 410.565432] RAX: 0000000000000000 RBX: ffffb0a2004e7ec0 RCX: 0000000000000002
[ 410.565432] RDX: 0000000000000001 RSI: ffff9ccef8396938 RDI: ffffb0a2004e7ec0
[ 410.565432] RBP: ffff9ccef8396900 R08: 0000000000000001 R09: 0000000000000000
[ 410.565432] R10: ffffffff947d7070 R11: 0000000000000000 R12: ffff9ccefcc23e20
[ 410.565432] R13: ffff9ccefcc23d80 R14: 0000000000000000 R15: ffff9ccef7c1c0c0
[ 410.565432] FS: 00007f1883901740(0000) GS:ffff9ccefd700000(0000) knlGS:0000000000000000
[ 410.565432] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 410.565432] CR2: 0000000000000000 CR3: 000000007b840004 CR4: 0000000000360ea0
[ 410.565432] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 410.573446] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 410.573446] Call Trace:
[ 410.573446] iterate_dir+0x90/0x180
[ 410.573446] __x64_sys_getdents+0xa6/0x130
[ 410.573446] ? __ia32_compat_sys_getdents+0x130/0x130
[ 410.573446] ? do_syscall_64+0x43/0xf0
[ 410.573446] ? __ia32_sys_getdents+0x130/0x130
[ 410.573446] do_syscall_64+0x43/0xf0
[ 410.573446] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 410.577435] RIP: 0033:0x7f1881d8a0a8
[ 410.577435] Code: e9 0d d8 fb ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 41 57 41 56 b8 4e 00 00 00 41 55 41 54 55 53 48 89 f3 48 83 ec 08 0f 05 <48> 3d 00 f0 ff ff 77 58 4c 8d 2c 06 49 89 c4 4c 39 ee 73 36 0f 1f
[ 410.577435] RSP: 002b:00007ffd64877580 EFLAGS: 00000202 ORIG_RAX: 000000000000004e
[ 410.577435] RAX: ffffffffffffffda RBX: 00005586c709eb30 RCX: 00007f1881d8a0a8
[ 410.577435] RDX: 0000000000008000 RSI: 00005586c709eb30 RDI: 0000000000000006
[ 410.577435] RBP: 00005586c709eb30 R08: 0000000000000078 R09: 0000000000000000
[ 410.577435] R10: 00005586c709c010 R11: 0000000000000202 R12: ffffffffffffff78
[ 410.585529] R13: 0000000000000000 R14: ffffffffffffff78 R15: 00005586c709eb00
[ 410.585529] Modules linked in:
[ 410.585529] CR2: 0000000000000000
[ 410.585529] ---[ end trace aece97fc2fce9ace ]---

2018-11-28 20:10:51

by Will Deacon

[permalink] [raw]

Subject: Re: dcache_readdir NULL inode oops

I spent some more time looking at this today...

On Fri, Nov 23, 2018 at 06:05:25PM +0000, Will Deacon wrote:
> Doing some more debugging, it looks like the usual failure case is where
> one CPU clears the inode field in the dentry via:
>
> devpts_pty_kill()
> -> d_delete() // dentry->d_lockref.count == 1
> -> dentry_unlink_inode()
>
> whilst another CPU gets a pointer to the dentry via:
>
> sys_getdents64()
> -> iterate_dir()
> -> dcache_readdir()
> -> next_positive()
>
> and explodes on the subsequent inode dereference when trying to pass the
> inode number to dir_emit():
>
> if (!dir_emit(..., d_inode(next)->i_ino, ...))
>
> Indeed, the hack below triggers a warning, indicating that the inode
> is being cleared concurrently.
>
> I can't work out whether the getdents64() path should hold a refcount
> to stop d_delete() in its tracks, or whether devpts_pty_kill() shouldn't
> be calling d_delete() like this at all.

So the issue is that opening /dev/pts/ptmx creates a new pty in /dev/pts,
which disappears when you close /dev/pts/ptmx. Consequently, when we tear
down the dentry for the magic new file, we have to take the i_node rwsem of
the *parent* so that concurrent path walkers don't trip over it whilst its
being freed. I wrote a simple concurrent program to getdents(/dev/pts/) in
one thread, whilst another opens and closes /dev/pts/ptmx: it crashes the
kernel in seconds.

Patch below, but I'd still like somebody else to look at this, please.

Will

--->8

diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
index c53814539070..50ddb95ff84c 100644
--- a/fs/devpts/inode.c
+++ b/fs/devpts/inode.c
@@ -619,11 +619,17 @@ void *devpts_get_priv(struct dentry *dentry)
*/
void devpts_pty_kill(struct dentry *dentry)
{
- WARN_ON_ONCE(dentry->d_sb->s_magic != DEVPTS_SUPER_MAGIC);
+ struct super_block *sb = dentry->d_sb;
+ struct dentry *parent = sb->s_root;

+ WARN_ON_ONCE(sb->s_magic != DEVPTS_SUPER_MAGIC);
+
+ inode_lock(parent->d_inode);
dentry->d_fsdata = NULL;
drop_nlink(dentry->d_inode);
d_delete(dentry);
+ inode_unlock(parent->d_inode);
+
dput(dentry); /* d_alloc_name() in devpts_pty_new() */
}

2018-11-29 19:27:32

by Jan Glauber

[permalink] [raw]

Subject: Re: dcache_readdir NULL inode oops

On Wed, Nov 28, 2018 at 08:08:06PM +0000, Will Deacon wrote:
> I spent some more time looking at this today...
>
> On Fri, Nov 23, 2018 at 06:05:25PM +0000, Will Deacon wrote:
> > Doing some more debugging, it looks like the usual failure case is where
> > one CPU clears the inode field in the dentry via:
> >
> > devpts_pty_kill()
> > -> d_delete() // dentry->d_lockref.count == 1
> > -> dentry_unlink_inode()
> >
> > whilst another CPU gets a pointer to the dentry via:
> >
> > sys_getdents64()
> > -> iterate_dir()
> > -> dcache_readdir()
> > -> next_positive()
> >
> > and explodes on the subsequent inode dereference when trying to pass the
> > inode number to dir_emit():
> >
> > if (!dir_emit(..., d_inode(next)->i_ino, ...))
> >
> > Indeed, the hack below triggers a warning, indicating that the inode
> > is being cleared concurrently.
> >
> > I can't work out whether the getdents64() path should hold a refcount
> > to stop d_delete() in its tracks, or whether devpts_pty_kill() shouldn't
> > be calling d_delete() like this at all.
>
> So the issue is that opening /dev/pts/ptmx creates a new pty in /dev/pts,
> which disappears when you close /dev/pts/ptmx. Consequently, when we tear
> down the dentry for the magic new file, we have to take the i_node rwsem of
> the *parent* so that concurrent path walkers don't trip over it whilst its
> being freed. I wrote a simple concurrent program to getdents(/dev/pts/) in
> one thread, whilst another opens and closes /dev/pts/ptmx: it crashes the
> kernel in seconds.

I also made a testcase and verified that your fix is fine. I also tried
replacing open-close on /dev/ptmx with mkdir-rmdir but that does not
trigger the error.

> Patch below, but I'd still like somebody else to look at this, please.

I wonder why no inode_lock on parent is needed for devpts_pty_new(), but
I'm obviously not a VFS expert... So your patch looks good to me and
clearly solves the issue.

thanks,
Jan

> Will
>
> --->8
>
> diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
> index c53814539070..50ddb95ff84c 100644
> --- a/fs/devpts/inode.c
> +++ b/fs/devpts/inode.c
> @@ -619,11 +619,17 @@ void *devpts_get_priv(struct dentry *dentry)
> */
> void devpts_pty_kill(struct dentry *dentry)
> {
> - WARN_ON_ONCE(dentry->d_sb->s_magic != DEVPTS_SUPER_MAGIC);
> + struct super_block *sb = dentry->d_sb;
> + struct dentry *parent = sb->s_root;
>
> + WARN_ON_ONCE(sb->s_magic != DEVPTS_SUPER_MAGIC);
> +
> + inode_lock(parent->d_inode);
> dentry->d_fsdata = NULL;
> drop_nlink(dentry->d_inode);
> d_delete(dentry);
> + inode_unlock(parent->d_inode);
> +
> dput(dentry); /* d_alloc_name() in devpts_pty_new() */
> }

2018-11-30 10:44:07

by Greg Kroah-Hartman

[permalink] [raw]

Subject: Re: dcache_readdir NULL inode oops

Adding Eric as he touched this code last :)

On Thu, Nov 29, 2018 at 07:25:48PM +0000, Jan Glauber wrote:
> On Wed, Nov 28, 2018 at 08:08:06PM +0000, Will Deacon wrote:
> > I spent some more time looking at this today...
> >
> > On Fri, Nov 23, 2018 at 06:05:25PM +0000, Will Deacon wrote:
> > > Doing some more debugging, it looks like the usual failure case is where
> > > one CPU clears the inode field in the dentry via:
> > >
> > > devpts_pty_kill()
> > > -> d_delete() // dentry->d_lockref.count == 1
> > > -> dentry_unlink_inode()
> > >
> > > whilst another CPU gets a pointer to the dentry via:
> > >
> > > sys_getdents64()
> > > -> iterate_dir()
> > > -> dcache_readdir()
> > > -> next_positive()
> > >
> > > and explodes on the subsequent inode dereference when trying to pass the
> > > inode number to dir_emit():
> > >
> > > if (!dir_emit(..., d_inode(next)->i_ino, ...))
> > >
> > > Indeed, the hack below triggers a warning, indicating that the inode
> > > is being cleared concurrently.
> > >
> > > I can't work out whether the getdents64() path should hold a refcount
> > > to stop d_delete() in its tracks, or whether devpts_pty_kill() shouldn't
> > > be calling d_delete() like this at all.
> >
> > So the issue is that opening /dev/pts/ptmx creates a new pty in /dev/pts,
> > which disappears when you close /dev/pts/ptmx. Consequently, when we tear
> > down the dentry for the magic new file, we have to take the i_node rwsem of
> > the *parent* so that concurrent path walkers don't trip over it whilst its
> > being freed. I wrote a simple concurrent program to getdents(/dev/pts/) in
> > one thread, whilst another opens and closes /dev/pts/ptmx: it crashes the
> > kernel in seconds.
>
> I also made a testcase and verified that your fix is fine. I also tried
> replacing open-close on /dev/ptmx with mkdir-rmdir but that does not
> trigger the error.
>
> > Patch below, but I'd still like somebody else to look at this, please.
>
> I wonder why no inode_lock on parent is needed for devpts_pty_new(), but
> I'm obviously not a VFS expert... So your patch looks good to me and
> clearly solves the issue.
>
> thanks,
> Jan
>
> > Will
> >
> > --->8
> >
> > diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
> > index c53814539070..50ddb95ff84c 100644
> > --- a/fs/devpts/inode.c
> > +++ b/fs/devpts/inode.c
> > @@ -619,11 +619,17 @@ void *devpts_get_priv(struct dentry *dentry)
> > */
> > void devpts_pty_kill(struct dentry *dentry)
> > {
> > - WARN_ON_ONCE(dentry->d_sb->s_magic != DEVPTS_SUPER_MAGIC);
> > + struct super_block *sb = dentry->d_sb;
> > + struct dentry *parent = sb->s_root;
> >
> > + WARN_ON_ONCE(sb->s_magic != DEVPTS_SUPER_MAGIC);

Side note, I wonder if this is even needed anymore...

> > +
> > + inode_lock(parent->d_inode);
> > dentry->d_fsdata = NULL;
> > drop_nlink(dentry->d_inode);
> > d_delete(dentry);
> > + inode_unlock(parent->d_inode);
> > +
> > dput(dentry); /* d_alloc_name() in devpts_pty_new() */
> > }

This feels right but getting some feedback from others would be good.

thanks,

greg k-h

2018-11-30 15:19:07

by Eric W. Biederman

[permalink] [raw]

Subject: Re: dcache_readdir NULL inode oops

"[email protected]" <[email protected]> writes:

> Adding Eric as he touched this code last :)
>
> On Thu, Nov 29, 2018 at 07:25:48PM +0000, Jan Glauber wrote:
>> On Wed, Nov 28, 2018 at 08:08:06PM +0000, Will Deacon wrote:
>> > I spent some more time looking at this today...
>> >
>> > On Fri, Nov 23, 2018 at 06:05:25PM +0000, Will Deacon wrote:
>> > > Doing some more debugging, it looks like the usual failure case is where
>> > > one CPU clears the inode field in the dentry via:
>> > >
>> > > devpts_pty_kill()
>> > > -> d_delete() // dentry->d_lockref.count == 1
>> > > -> dentry_unlink_inode()
>> > >
>> > > whilst another CPU gets a pointer to the dentry via:
>> > >
>> > > sys_getdents64()
>> > > -> iterate_dir()
>> > > -> dcache_readdir()
>> > > -> next_positive()
>> > >
>> > > and explodes on the subsequent inode dereference when trying to pass the
>> > > inode number to dir_emit():
>> > >
>> > > if (!dir_emit(..., d_inode(next)->i_ino, ...))
>> > >
>> > > Indeed, the hack below triggers a warning, indicating that the inode
>> > > is being cleared concurrently.
>> > >
>> > > I can't work out whether the getdents64() path should hold a refcount
>> > > to stop d_delete() in its tracks, or whether devpts_pty_kill() shouldn't
>> > > be calling d_delete() like this at all.
>> >
>> > So the issue is that opening /dev/pts/ptmx creates a new pty in /dev/pts,
>> > which disappears when you close /dev/pts/ptmx. Consequently, when we tear
>> > down the dentry for the magic new file, we have to take the i_node rwsem of
>> > the *parent* so that concurrent path walkers don't trip over it whilst its
>> > being freed. I wrote a simple concurrent program to getdents(/dev/pts/) in
>> > one thread, whilst another opens and closes /dev/pts/ptmx: it crashes the
>> > kernel in seconds.
>>
>> I also made a testcase and verified that your fix is fine. I also tried
>> replacing open-close on /dev/ptmx with mkdir-rmdir but that does not
>> trigger the error.
>>
>> > Patch below, but I'd still like somebody else to look at this, please.
>>
>> I wonder why no inode_lock on parent is needed for devpts_pty_new(), but
>> I'm obviously not a VFS expert... So your patch looks good to me and
>> clearly solves the issue.
>>
>> thanks,
>> Jan
>>
>> > Will
>> >
>> > --->8
>> >
>> > diff --git a/fs/devpts/inode.c b/fs/devpts/inode.c
>> > index c53814539070..50ddb95ff84c 100644
>> > --- a/fs/devpts/inode.c
>> > +++ b/fs/devpts/inode.c
>> > @@ -619,11 +619,17 @@ void *devpts_get_priv(struct dentry *dentry)
>> > */
>> > void devpts_pty_kill(struct dentry *dentry)
>> > {
>> > - WARN_ON_ONCE(dentry->d_sb->s_magic != DEVPTS_SUPER_MAGIC);
>> > + struct super_block *sb = dentry->d_sb;
>> > + struct dentry *parent = sb->s_root;
>> >
>> > + WARN_ON_ONCE(sb->s_magic != DEVPTS_SUPER_MAGIC);
>
> Side note, I wonder if this is even needed anymore...
>
>> > +
>> > + inode_lock(parent->d_inode);
>> > dentry->d_fsdata = NULL;
>> > drop_nlink(dentry->d_inode);
>> > d_delete(dentry);
>> > + inode_unlock(parent->d_inode);
>> > +
>> > dput(dentry); /* d_alloc_name() in devpts_pty_new() */
>> > }
>
> This feels right but getting some feedback from others would be good.

This is going to be special at least because we are not coming through
the normal unlink path and we are manipulating the dcache.

This looks plausible. If this is whats going on then we have had this
bug for a very long time. I will see if I can make some time.

It looks like in the general case everything is serialized by the
devpts_mutex. I wonder if just changing the order of operations
here would be enough.

AKA: drop_nlink d_delete then dentry->d_fsdata. Ugh d_fsdata is not
implicated so that won't help here.

I will look more shortly. I am in the middle in the move so I don't
have time to complete the analysis today.

Eric

2018-11-30 16:10:33

by Al Viro

[permalink] [raw]

Subject: Re: dcache_readdir NULL inode oops

On Fri, Nov 30, 2018 at 09:16:49AM -0600, Eric W. Biederman wrote:
> >> > + inode_lock(parent->d_inode);
> >> > dentry->d_fsdata = NULL;
> >> > drop_nlink(dentry->d_inode);
> >> > d_delete(dentry);
> >> > + inode_unlock(parent->d_inode);
> >> > +
> >> > dput(dentry); /* d_alloc_name() in devpts_pty_new() */
> >> > }
> >
> > This feels right but getting some feedback from others would be good.
>
> This is going to be special at least because we are not coming through
> the normal unlink path and we are manipulating the dcache.
>
> This looks plausible. If this is whats going on then we have had this
> bug for a very long time. I will see if I can make some time.
>
> It looks like in the general case everything is serialized by the
> devpts_mutex. I wonder if just changing the order of operations
> here would be enough.
>
> AKA: drop_nlink d_delete then dentry->d_fsdata. Ugh d_fsdata is not
> implicated so that won't help here.

It certainly won't. The thing is, this
if (!dir_emit(ctx, next->d_name.name, next->d_name.len,
d_inode(next)->i_ino, dt_type(d_inode(next))))
in dcache_readdir() obviously can block, so all we can hold over it is
blocking locks. Which we do - specifically, ->i_rwsem on our directory.

It's actually worse than missing inode_lock() - consider the effects
of mount --bind /mnt/foo /dev/pts/42. What happens when that thing
goes away? Right, a lost mount...

I'll resurrect the "kernel-internal rm -rf done right" series and
post it; devpts is not the only place suffering such problem (binfmt_misc,
etc.)

2018-11-30 16:33:44

by Will Deacon

[permalink] [raw]

Subject: Re: dcache_readdir NULL inode oops

On Fri, Nov 30, 2018 at 04:08:52PM +0000, Al Viro wrote:
> On Fri, Nov 30, 2018 at 09:16:49AM -0600, Eric W. Biederman wrote:
> > >> > + inode_lock(parent->d_inode);
> > >> > dentry->d_fsdata = NULL;
> > >> > drop_nlink(dentry->d_inode);
> > >> > d_delete(dentry);
> > >> > + inode_unlock(parent->d_inode);
> > >> > +
> > >> > dput(dentry); /* d_alloc_name() in devpts_pty_new() */
> > >> > }
> > >
> > > This feels right but getting some feedback from others would be good.
> >
> > This is going to be special at least because we are not coming through
> > the normal unlink path and we are manipulating the dcache.
> >
> > This looks plausible. If this is whats going on then we have had this
> > bug for a very long time. I will see if I can make some time.
> >
> > It looks like in the general case everything is serialized by the
> > devpts_mutex. I wonder if just changing the order of operations
> > here would be enough.
> >
> > AKA: drop_nlink d_delete then dentry->d_fsdata. Ugh d_fsdata is not
> > implicated so that won't help here.
>
> It certainly won't. The thing is, this
> if (!dir_emit(ctx, next->d_name.name, next->d_name.len,
> d_inode(next)->i_ino, dt_type(d_inode(next))))
> in dcache_readdir() obviously can block, so all we can hold over it is
> blocking locks. Which we do - specifically, ->i_rwsem on our directory.
>
> It's actually worse than missing inode_lock() - consider the effects
> of mount --bind /mnt/foo /dev/pts/42. What happens when that thing
> goes away? Right, a lost mount...

Ha, I hadn't even considered that scenario. Urgh!

> I'll resurrect the "kernel-internal rm -rf done right" series and
> post it; devpts is not the only place suffering such problem (binfmt_misc,
> etc.)

Thanks. I'm happy to test that it solves this issue if you throw me on cc.

Will

2019-04-30 09:36:38

by Jan Glauber

[permalink] [raw]

Subject: Re: dcache_readdir NULL inode oops

Hi Al,

On Fri, Nov 30, 2018 at 04:32:28PM +0000, Will Deacon wrote:
> On Fri, Nov 30, 2018 at 04:08:52PM +0000, Al Viro wrote:
> > On Fri, Nov 30, 2018 at 09:16:49AM -0600, Eric W. Biederman wrote:
> > > >> > + inode_lock(parent->d_inode);
> > > >> > dentry->d_fsdata = NULL;
> > > >> > drop_nlink(dentry->d_inode);
> > > >> > d_delete(dentry);
> > > >> > + inode_unlock(parent->d_inode);
> > > >> > +
> > > >> > dput(dentry); /* d_alloc_name() in devpts_pty_new() */
> > > >> > }
> > > >
> > > > This feels right but getting some feedback from others would be good.
> > >
> > > This is going to be special at least because we are not coming through
> > > the normal unlink path and we are manipulating the dcache.
> > >
> > > This looks plausible. If this is whats going on then we have had this
> > > bug for a very long time. I will see if I can make some time.
> > >
> > > It looks like in the general case everything is serialized by the
> > > devpts_mutex. I wonder if just changing the order of operations
> > > here would be enough.
> > >
> > > AKA: drop_nlink d_delete then dentry->d_fsdata. Ugh d_fsdata is not
> > > implicated so that won't help here.
> >
> > It certainly won't. The thing is, this
> > if (!dir_emit(ctx, next->d_name.name, next->d_name.len,
> > d_inode(next)->i_ino, dt_type(d_inode(next))))
> > in dcache_readdir() obviously can block, so all we can hold over it is
> > blocking locks. Which we do - specifically, ->i_rwsem on our directory.
> >
> > It's actually worse than missing inode_lock() - consider the effects
> > of mount --bind /mnt/foo /dev/pts/42. What happens when that thing
> > goes away? Right, a lost mount...
>
> Ha, I hadn't even considered that scenario. Urgh!
>
> > I'll resurrect the "kernel-internal rm -rf done right" series and
> > post it; devpts is not the only place suffering such problem (binfmt_misc,
> > etc.)

I've not seen anything merged regarding this issue so I guess this is
still open? We see a similar crash (dcache_readdir hitting a NULL inode
ptr) but this time not with devpts.

Debugging is ongoing and we're not even sure which filesystem is having
the issue. Is my assumption correct that we should only see this when
d_delete(dentry) is called?

thanks,
Jan