2022-07-09 06:05:02

by Helge Deller

[permalink] [raw]
Subject: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

On the debian buildd servers for the parisc architecture I'm facing
the following warning, after which it gets stuck so that I need to
reboot the server.
This happens rarely, but I can reproduce it after some time.
Filesystem is tmpfs.

I'm happy to add debug code if necessary...

Helge

[128321.224739] ------------[ cut here ]------------
[128321.283492] WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128
[128321.371369] Modules linked in: dm_mod(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_multiport(E) nft_compat(E) nf_tables(E) nfnetlink(E) ipmi_si(E) ipmi_devintf(E) sg(E) ipmi_msghandler(E) fuse(E) configfs(E) sunrpc(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) zstd_compress(E) libcrc32c(E) crc32c_generic(E) sd_mod(E) t10_pi(E) crc64_rocksoft(E) crc64(E) crc_t10dif(E) crct10dif_generic(E) ohci_pci(E) crct10dif_common(E) ata_generic(E) sata_sil(E) ehci_pci(E) pata_sil680(E) mptspi(E) mptscsih(E) ohci_hcd(E) mptbase(E) scsi_transport_spi(E) ehci_hcd(E) libata(E) scsi_mod(E) usbcore(E) e1000(E) usb_common(E) scsi_common(E)
[128322.103374] CPU: 1 PID: 14735 Comm: cc1plus Tainted: G E 5.18.9+ #27
[128322.195315] Hardware name: 9000/785/C8000
[128322.247318]
[128322.263314] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[128322.323316] PSW: 00001000000001101111000000001111 Tainted: G E
[128322.407315] r00-03 000000000806f00f 0000000239aa8700 000000004073be90 0000000239aa87a0
[128322.507313] r04-07 00000000410e1a00 000000411c8e3b40 0000000211715ad8 000000411c8e3b98
[128322.603314] r08-11 00000000410b8e80 0000000041147200 00000000000800e0 0000000041147200
[128322.699311] r12-15 0000000000000001 0000000010000000 0000000041147200 0000000041147200
[128322.795312] r16-19 000000411c8e3b98 0000000000000000 00000000415084d0 0000000002c55000
[128322.895311] r20-23 000000000800000f 000000004fb33600 000000000800000f 0000000211715b20
[128322.991311] r24-27 000000411c8e3bd0 000000411c8e3bd0 000000411c8e3b40 00000000410e1a00
[128323.087309] r28-31 00000000416bf9d8 0000000239aa8770 0000000239aa8820 000000000000cf14
[128323.187310] sr00-03 000000000500a800 0000000000000000 0000000000000000 000000000500a800
[128323.283324] sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[128323.383313]
[128323.399309] IASQ: 0000000000000000 0000000000000000 IAOQ: 000000004073b3d4 000000004073b3d8
[128323.503310] IIR: 03ffe01f ISR: 0000000010340400 IOR: 000003e6aa2a87a8
[128323.587306] CPU: 1 CR30: 0000000212ff0ce0 CR31: ffffffffffffffff
[128323.667328] ORIG_R28: 0000000239aa8920
[128323.715307] IAOQ[0]: dentry_free+0x100/0x128
[128323.771306] IAOQ[1]: dentry_free+0x104/0x128
[128323.823305] RP(r2): __dentry_kill+0x284/0x2e8
[128323.875347] Backtrace:
[128323.907319] [<000000004073be90>] __dentry_kill+0x284/0x2e8
[128323.975309] [<000000004073d6d8>] dput+0x334/0x5a8
[128324.031311] [<0000000040726d94>] step_into+0x790/0xa88
[128324.095309] [<0000000040728910>] path_openat+0x21c/0x1ba8
[128324.163310] [<000000004072c018>] do_filp_open+0x9c/0x198
[128324.231310] [<0000000040702320>] do_sys_openat2+0x14c/0x2a8
[128324.299309] [<0000000040702c54>] compat_sys_openat+0x58/0xb8
[128324.367308] [<0000000040303e30>] syscall_exit+0x0/0x10

after that:

[128324.451303] ---[ end trace 0000000000000000 ]---
[128345.511159] rcu: INFO: rcu_sched self-detected stall on CPU
[128345.511159] rcu: 0-...!: (5249 ticks this GP) idle=3c3/1/0x4000000000000002 softirq=35954078/35954078 fqs=54
[128345.511159] (t=5250 jiffies g=33838821 q=175418)
[128345.511159] rcu: rcu_sched kthread starved for 5146 jiffies! g33838821 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=1
[128345.511159] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[128345.511159] ...
[128345.511159]
[128345.511159] rcu: Stack dump where RCU GP kthread last ran:
[128345.511159] Task dump for CPU 1:
[128345.511159] task:cc1plus state:R running task stack: 0 pid:14735 ppid: 14734 flags:0x00000014
[128345.511159] Backtrace:
[128345.511159] [<000000004071f5c8>] __legitimize_path+0x7c/0x108
[128345.511159] [<00000000407210d4>] lookup_fast+0x1c8/0x290
[128345.511159] [<00000000407277b4>] walk_component+0x1e8/0x330
[128345.511159] [<0000000040727be8>] link_path_walk.part.0.constprop.0+0x2ec/0x4d0
[128345.511159] [<0000000040728844>] path_openat+0x150/0x1ba8
[128345.511159] [<000000004072c018>] do_filp_open+0x9c/0x198
[128345.511159] [<0000000040702320>] do_sys_openat2+0x14c/0x2a8
[128345.511159] [<0000000040702c54>] compat_sys_openat+0x58/0xb8
[128345.511159] [<0000000040303e30>] syscall_exit+0x0/0x10
[128345.511159]


2022-07-15 09:21:21

by Helge Deller

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

On 7/9/22 11:07, Hillf Danton wrote:
> On Sat, 9 Jul 2022 07:33:51 +0200 Helge Deller <[email protected]> wrote:
>> On the debian buildd servers for the parisc architecture I'm facing
>> the following warning, after which it gets stuck so that I need to
>> reboot the server.
>> This happens rarely, but I can reproduce it after some time.
>> Filesystem is tmpfs.
>>
>> I'm happy to add debug code if necessary...
>
> Spin with the diff below applied if it makes two-cent sense to you.
>[...]
> +++ b/fs/namei.c
> @@ -3349,8 +3349,8 @@ static const char *open_last_lookups(str
> struct dentry *dir = nd->path.dentry;
> int open_flag = op->open_flag;
> bool got_write = false;
> - unsigned seq;
> - struct inode *inode;
> + unsigned seq = 0;
> + struct inode *inode = NULL;
> struct dentry *dentry;
> const char *res;

That patch didn't helped.

I've faced another crash a few days back (see below), but right now I can't
trigger it any longer. I'll added some more debug info and switched to v5.18.11,
so when it happens again I can probably provide more input...

[71492.491336] ------------[ cut here ]------------
[71492.544917] WARNING: CPU: 1 PID: 16312 at fs/dcache.c:365 dentry_free+0x100/0x128
[71492.632868] Modules linked in: ipt_REJECT(E) nf_reject_ipv4(E) xt_multiport(E) nft_compat(E) nf_tables(E) nfnetlink(E) dm_mod(E) sunrpc(E) ipmi_si(E) ipmi_devintf(E) ipmi_msghandler(E) sg(E) fuse(E) configfs(E) ip_tables(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) btrfs(E) blake2b_generic(E) xor(E) raid6_pq(E) zstd_compress(E) libcrc32c(E) crc32c_generic(E) sd_mod(E) t10_pi(E) crc64_rocksoft(E) crc64(E) crc_t10dif(E) crct10dif_generic(E) crct10dif_common(E) ata_generic(E) ohci_pci(E) sata_sil(E) mptspi(E) mptscsih(E) ohci_hcd(E) pata_sil680(E) ehci_pci(E) mptbase(E) ehci_hcd(E) scsi_transport_spi(E) libata(E) e1000(E) scsi_mod(E) usbcore(E) usb_common(E) scsi_common(E)
[71493.366569] CPU: 1 PID: 16312 Comm: rm Tainted: G E 5.18.10+ #28
[71493.454274] Hardware name: 9000/785/C8000
[71493.500913]
[71493.516909] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[71493.572891] PSW: 00001000000001001111111100001111 Tainted: G E
[71493.662405] r00-03 000000ff0804ff0f 000000005b2c0490 000000004073bea0 000000005b2c0530
[71493.758449] r04-07 00000000410e1a00 00000003050366c0 000000016b97b718 0000000305036718
[71493.854493] r08-11 000000005b2c0460 0000000305036718 00000000410dae60 0000000000000001
[71493.950539] r12-15 0000000000800000 0000000041147200 0000000000000001 0000000000000000
[71494.046582] r16-19 0000000041146200 0000000000000000 0000000041146200 0000000002c55000
[71494.142628] r20-23 000000000800000f 000000004f9ca578 000000000800000f 000000016b97b760
[71494.238670] r24-27 0000000305036750 0000000305036750 00000003050366c0 00000000410e1a00
[71494.334714] r28-31 00000000416bf380 000000005b2c0500 000000005b2c05b0 fffffffffffff89e
[71494.430759] sr00-03 0000000001df3800 0000000000000000 0000000000000000 0000000001df3800
[71494.527845] sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[71494.624929]
[71494.642842] IASQ: 0000000000000000 0000000000000000 IAOQ: 000000004073b3e4 000000004073b3e8
[71494.743059] IIR: 03ffe01f ISR: 0000000010340000 IOR: 0000006cb02c0538
[71494.825560] CPU: 1 CR30: 00000000584d0ce0 CR31: ffffffffffffffff
[71494.908063] ORIG_R28: 0000000042e5e388
[71494.954102] IAOQ[0]: dentry_free+0x100/0x128
[71495.006400] IAOQ[1]: dentry_free+0x104/0x128
[71495.058695] RP(r2): __dentry_kill+0x284/0x2e8
[71495.112026] Backtrace:
[71495.140360] [<000000004073bea0>] __dentry_kill+0x284/0x2e8
[71495.207240] [<000000004073ec50>] shrink_dentry_list+0xe4/0x1e8
[71495.278284] [<000000004073f044>] shrink_dcache_parent+0x1a0/0x250
[71495.352452] [<000000004072489c>] vfs_rmdir+0x268/0x468
[71495.412868] [<000000004072ce08>] do_rmdir+0x39c/0x418
[71495.476826] [<000000004072d644>] sys_unlinkat+0xd0/0x110
[71495.540869] [<0000000040303e30>] syscall_exit+0x0/0x10
[71495.604330]
[71495.622248] ---[ end trace 0000000000000000 ]---
[71583.296513] Backtrace:
[71583.300330] [<0000000040834460>] proc_fill_cache+0x194/0x2b8
[71583.300330] [<000000004083d678>] proc_readfd_common+0x16c/0x3c0
[71583.300330] [<000000004083d904>] proc_readfd+0x38/0x68
[71583.300330] [<0000000040733cf0>] iterate_dir+0xec/0x2f0
[71583.300330] [<0000000040734ae4>] sys_getdents64+0xd8/0x1e0
[71583.300330] [<0000000040303e30>] syscall_exit+0x0/0x10
[71583.300330]
[71583.300330]
[71583.300330] Page fault: bad address: Code=15 (Data TLB miss fault) at addr 800002e76660997f
[71583.300330] CPU: 1 PID: 16466 Comm: ssh Tainted: G W E 5.18.10+ #28
[71583.300330] Hardware name: 9000/785/C8000
[71583.300330]
[71583.300330] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
[71583.300330] PSW: 00001000000001101111111100001111 Tainted: G W E
[71583.300330] r00-03 000000ff0806ff0f 0000000041146a00 000000004073fd50 0000000057d7c490
[71583.300330] r04-07 00000000410e1a00 8d00dae766609a0b 000000005f0ea600 0000000057d7c3b0
[71583.300330] r08-11 0000000000000001 0000000057d7c5e0 0000000000000002 00000000416bf380
[71583.300330] r12-15 0000000057d7c4c0 0000000000000000 0000000000000000 0000000041412ad0
[71583.300330] r16-19 000000005f0ea900 00000000a0c15ff4 00000000416483e0 000000000800000f
[71583.300330] r20-23 000000005f0eaa50 000000000800000f 0000000055555556 000000005f0ea9a0
[71583.300330] r24-27 0000000057d7c4c0 0000000057d7c3b0 000000005f0ea900 00000000410e1a00
[71583.300330] r28-31 0000000000000000 0000000057d7c5e0 0000000057d7c610 0000000305036771
[71583.300330] sr00-03 0000000001e35000 0000000000000000 0000000000000000 0000000001e35000
[71583.300330] sr04-07 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[71583.300330]
[71583.300330] IASQ: 0000000000000000 0000000000000000 IAOQ: 000000004073fdec 000000004073fdf0
[71583.300330] IIR: 48bc3ee9 ISR: 000000000d00d800 IOR: 800002e76660997f
[71583.300330] CPU: 1 CR30: 000000007fed0ce0 CR31: ffffffffffffffff
[71583.300330] ORIG_R28: 0000000000000cc0
[71583.300330] IAOQ[0]: d_alloc_parallel+0x220/0x748
[71583.300330] IAOQ[1]: d_alloc_parallel+0x224/0x748
[71583.300330] RP(r2): d_alloc_parallel+0x184/0x748
[71583.300330] Backtrace:
[71585.616335] [<0000000040834460>] proc_fill_cache+0x194/0x2b8
[71585.616335] [<000000004083d678>] proc_readfd_common+0x16c/0x3c0
[71585.616335] [<000000004083d904>] proc_readfd+0x38/0x68
[71585.616335] [<0000000040733cf0>] iterate_dir+0xec/0x2f0
[71585.616335] [<0000000040734ae4>] sys_getdents64+0xd8/0x1e0
[71585.616335] [<0000000040303e30>] syscall_exit+0x0/0x10
[71585.616335]
<Cpu1> 0300109101e00000 0000000000000000 CC_PROCS_ENTRY_OUT
[71586.384319] Kernel panic - not syncing: Page fault: bad address

2022-07-16 06:18:01

by Helge Deller

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

On 7/15/22 15:33, Hillf Danton wrote:
> On Fri, 15 Jul 2022 10:18:23 +0200 Helge Deller wrote:
>>
>> I've faced another crash a few days back (see below), but right now I can'=
>> t
>> trigger it any longer. I'll added some more debug info and switched to v5.=
>> 18.11,
>> so when it happens again I can probably provide more input...
>>
>> [71492.491336] ------------[ cut here ]------------
>> [71492.544917] WARNING: CPU: 1 PID: 16312 at fs/dcache.c:365 dentry_free+0=
>> x100/0x128
>> [71492.632868] Modules linked in: ipt_REJECT(E) nf_reject_ipv4(E) xt_multi=
>> port(E) nft_compat(E) nf_tables(E) nfnetlink(E) dm_mod(E) sunrpc(E) ipmi_s=
>> i(E) ipmi_devintf(E) ipmi_msghandler(E) sg(E) fuse(E) configfs(E) ip_table=
>> s(E) x_tables(E) autofs4(E) ext4(E) crc16(E) mbcache(E) jbd2(E) btrfs(E) b=
>> lake2b_generic(E) xor(E) raid6_pq(E) zstd_compress(E) libcrc32c(E) crc32c_=
>> generic(E) sd_mod(E) t10_pi(E) crc64_rocksoft(E) crc64(E) crc_t10dif(E) cr=
>> ct10dif_generic(E) crct10dif_common(E) ata_generic(E) ohci_pci(E) sata_sil=
>> (E) mptspi(E) mptscsih(E) ohci_hcd(E) pata_sil680(E) ehci_pci(E) mptbase(E=
>> ) ehci_hcd(E) scsi_transport_spi(E) libata(E) e1000(E) scsi_mod(E) usbcore=
>> (E) usb_common(E) scsi_common(E)
>> [71493.366569] CPU: 1 PID: 16312 Comm: rm Tainted: G E 5.18=
>> .10+ #28
>> [71493.454274] Hardware name: 9000/785/C8000
>> [71493.500913]
>> [71493.516909] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
>> [71493.572891] PSW: 00001000000001001111111100001111 Tainted: G =
>> E
>> [71493.662405] r00-03 000000ff0804ff0f 000000005b2c0490 000000004073bea0 =
>> 000000005b2c0530
>> [71493.758449] r04-07 00000000410e1a00 00000003050366c0 000000016b97b718 =
>> 0000000305036718
>> [71493.854493] r08-11 000000005b2c0460 0000000305036718 00000000410dae60 =
>> 0000000000000001
>> [71493.950539] r12-15 0000000000800000 0000000041147200 0000000000000001 =
>> 0000000000000000
>> [71494.046582] r16-19 0000000041146200 0000000000000000 0000000041146200 =
>> 0000000002c55000
>> [71494.142628] r20-23 000000000800000f 000000004f9ca578 000000000800000f =
>> 000000016b97b760
>> [71494.238670] r24-27 0000000305036750 0000000305036750 00000003050366c0 =
>> 00000000410e1a00
>> [71494.334714] r28-31 00000000416bf380 000000005b2c0500 000000005b2c05b0 =
>> fffffffffffff89e
>> [71494.430759] sr00-03 0000000001df3800 0000000000000000 0000000000000000=
>> 0000000001df3800
>> [71494.527845] sr04-07 0000000000000000 0000000000000000 0000000000000000=
>> 0000000000000000
>> [71494.624929]
>> [71494.642842] IASQ: 0000000000000000 0000000000000000 IAOQ: 000000004073b=
>> 3e4 000000004073b3e8
>> [71494.743059] IIR: 03ffe01f ISR: 0000000010340000 IOR: 0000006cb02c0=
>> 538
>> [71494.825560] CPU: 1 CR30: 00000000584d0ce0 CR31: fffffffffffff=
>> fff
>> [71494.908063] ORIG_R28: 0000000042e5e388
>> [71494.954102] IAOQ[0]: dentry_free+0x100/0x128
>> [71495.006400] IAOQ[1]: dentry_free+0x104/0x128
>> [71495.058695] RP(r2): __dentry_kill+0x284/0x2e8
>> [71495.112026] Backtrace:
>> [71495.140360] [<000000004073bea0>] __dentry_kill+0x284/0x2e8
>> [71495.207240] [<000000004073ec50>] shrink_dentry_list+0xe4/0x1e8
>> [71495.278284] [<000000004073f044>] shrink_dcache_parent+0x1a0/0x250
>> [71495.352452] [<000000004072489c>] vfs_rmdir+0x268/0x468
>> [71495.412868] [<000000004072ce08>] do_rmdir+0x39c/0x418
>> [71495.476826] [<000000004072d644>] sys_unlinkat+0xd0/0x110
>> [71495.540869] [<0000000040303e30>] syscall_exit+0x0/0x10
>> [71495.604330]
>> [71495.622248] ---[ end trace 0000000000000000 ]---
>> [71583.296513] Backtrace:
>> [71583.300330] [<0000000040834460>] proc_fill_cache+0x194/0x2b8
>> [71583.300330] [<000000004083d678>] proc_readfd_common+0x16c/0x3c0
>> [71583.300330] [<000000004083d904>] proc_readfd+0x38/0x68
>> [71583.300330] [<0000000040733cf0>] iterate_dir+0xec/0x2f0
>> [71583.300330] [<0000000040734ae4>] sys_getdents64+0xd8/0x1e0
>> [71583.300330] [<0000000040303e30>] syscall_exit+0x0/0x10
>> [71583.300330]
>> [71583.300330]
>> [71583.300330] Page fault: bad address: Code=3D15 (Data TLB miss fault) at=
>> addr 800002e76660997f
>> [71583.300330] CPU: 1 PID: 16466 Comm: ssh Tainted: G W E 5.1=
>> 8.10+ #28
>> [71583.300330] Hardware name: 9000/785/C8000
>> [71583.300330]
>> [71583.300330] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
>> [71583.300330] PSW: 00001000000001101111111100001111 Tainted: G W =
>> E
>> [71583.300330] r00-03 000000ff0806ff0f 0000000041146a00 000000004073fd50 =
>> 0000000057d7c490
>> [71583.300330] r04-07 00000000410e1a00 8d00dae766609a0b 000000005f0ea600 =
>> 0000000057d7c3b0
>> [71583.300330] r08-11 0000000000000001 0000000057d7c5e0 0000000000000002 =
>> 00000000416bf380
>> [71583.300330] r12-15 0000000057d7c4c0 0000000000000000 0000000000000000 =
>> 0000000041412ad0
>> [71583.300330] r16-19 000000005f0ea900 00000000a0c15ff4 00000000416483e0 =
>> 000000000800000f
>> [71583.300330] r20-23 000000005f0eaa50 000000000800000f 0000000055555556 =
>> 000000005f0ea9a0
>> [71583.300330] r24-27 0000000057d7c4c0 0000000057d7c3b0 000000005f0ea900 =
>> 00000000410e1a00
>> [71583.300330] r28-31 0000000000000000 0000000057d7c5e0 0000000057d7c610 =
>> 0000000305036771
>> [71583.300330] sr00-03 0000000001e35000 0000000000000000 0000000000000000=
>> 0000000001e35000
>> [71583.300330] sr04-07 0000000000000000 0000000000000000 0000000000000000=
>> 0000000000000000
>> [71583.300330]
>> [71583.300330] IASQ: 0000000000000000 0000000000000000 IAOQ: 000000004073f=
>> dec 000000004073fdf0
>> [71583.300330] IIR: 48bc3ee9 ISR: 000000000d00d800 IOR: 800002e766609=
>> 97f
>> [71583.300330] CPU: 1 CR30: 000000007fed0ce0 CR31: fffffffffffff=
>> fff
>> [71583.300330] ORIG_R28: 0000000000000cc0
>> [71583.300330] IAOQ[0]: d_alloc_parallel+0x220/0x748
>> [71583.300330] IAOQ[1]: d_alloc_parallel+0x224/0x748
>> [71583.300330] RP(r2): d_alloc_parallel+0x184/0x748
>> [71583.300330] Backtrace:
>> [71585.616335] [<0000000040834460>] proc_fill_cache+0x194/0x2b8
>> [71585.616335] [<000000004083d678>] proc_readfd_common+0x16c/0x3c0
>> [71585.616335] [<000000004083d904>] proc_readfd+0x38/0x68
>> [71585.616335] [<0000000040733cf0>] iterate_dir+0xec/0x2f0
>> [71585.616335] [<0000000040734ae4>] sys_getdents64+0xd8/0x1e0
>> [71585.616335] [<0000000040303e30>] syscall_exit+0x0/0x10
>> [71585.616335]
>> <Cpu1> 0300109101e00000 0000000000000000 CC_PROCS_ENTRY_OUT
>> [71586.384319] Kernel panic - not syncing: Page fault: bad address
>
> Looks like the same issue.

It crashed again, this time I had following patch applied:

diff --git a/fs/dcache.c b/fs/dcache.c
index 93f4f5ee07bf..a3c8dba97e85 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -362,9 +362,25 @@ static inline void __d_clear_type_and_inode(struct dentry *dentry)

static void dentry_free(struct dentry *dentry)
{
- WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
+ int unhashed = hlist_unhashed(&dentry->d_u.d_alias);
+ if (WARN_ON(!unhashed)) {
+ pr_err("dentry = %px\n", dentry);
+ pr_err("spin_is_locked(&dentry->d_lock) = %d\n", spin_is_locked(&dentry->d_lock));
+ pr_err("dname_external(dentry) = %d\n", dname_external(dentry));
+ pr_err("dentry->d_flags & DCACHE_NORCU = %d\n", dentry->d_flags & DCACHE_NORCU);
+ // pr_err("ERROR on file %pd\n", &dentry); HANGS
+ pr_err("dentry->d_name.len = %d\n", dentry->d_name.len);
+ pr_err("dentry->d_name.hash = %x\n", dentry->d_name.hash);
+ pr_err("dentry->d_lockref.count = %d\n", dentry->d_lockref.count);
+ pr_err("dentry->d_flags = %d\n", dentry->d_flags);
+ pr_err("dentry->d_inode = %px\n", dentry->d_inode);
+ pr_err("dentry->d_parent = %px\n", dentry->d_parent);
+ pr_err("dentry->d_u.d_rcu = %pS\n", dentry->d_u.d_rcu.func);
+ }
if (unlikely(dname_external(dentry))) {
struct external_name *p = external_name(dentry);
+ if (!unhashed)
+ pr_err("value of &p->u.count = %d\n", p->u.count.counter);
if (likely(atomic_dec_and_test(&p->u.count))) {
call_rcu(&dentry->d_u.d_rcu, __d_free_external);
return;

and this was the output:

[108563.953441] IAOQ[1]: dentry_free+0xc4/0x338
[108564.005441] RP(r2): __dentry_kill+0x284/0x2e8
[108564.057443] Backtrace:
[108564.089454] [<000000004073bab0>] __dentry_kill+0x284/0x2e8
[108564.157445] [<000000004073d2f8>] dput+0x334/0x5a8
[108564.213444] [<00000000407267a4>] step_into+0x790/0xa88
[108564.277444] [<0000000040727084>] walk_component+0xa8/0x330
[108564.345443] [<00000000407275f8>] link_path_walk.part.0.constprop.0+0x2ec/0x4d0
[108564.437443] [<0000000040728254>] path_openat+0x150/0x1ba8
[108564.501441] [<000000004072ba28>] do_filp_open+0x9c/0x198
[108564.569441] [<0000000040701d20>] do_sys_openat2+0x14c/0x2a8
[108564.637441] [<0000000040702654>] compat_sys_openat+0x58/0xb8
[108564.705440] [<0000000040303e30>] syscall_exit+0x0/0x10
[108564.769444]
[108564.789435] ---[ end trace 0000000000000000 ]---
[108564.845444] dentry = 000000031624e6c0
[108564.889437] spin_is_locked(&dentry->d_lock) = 0
[108564.945436] dname_external(dentry) = 0
[108564.993436] dentry->d_flags & DCACHE_NORCU = 0
[108565.045446] dentry->d_name.len = 3
[108565.089435] dentry->d_name.hash = 89116695
[108565.137435] dentry->d_lockref.count = -128
[108565.189434] dentry->d_flags = 32776
[108565.233435] dentry->d_inode = 0000000000000000
[108565.285429] dentry->d_parent = 000000022ef756c0
[108565.341434] dentry->d_u.d_rcu = 0x416be770


and

[108568.877430] IAOQ[0]: dentry_free+0xc0/0x338
[108568.929412] IAOQ[1]: dentry_free+0xc4/0x338
[108568.981411] RP(r2): __dentry_kill+0x284/0x2e8
[108569.037428] Backtrace:
[108569.065421] [<000000004073bab0>] __dentry_kill+0x284/0x2e8
[108569.133416] [<000000004073d2f8>] dput+0x334/0x5a8
[108569.193435] [<00000000407267a4>] step_into+0x790/0xa88
[108569.257414] [<0000000040727084>] walk_component+0xa8/0x330
[108569.325432] [<0000000040727ee8>] path_lookupat+0xdc/0x2f8
[108569.389412] [<000000004072b100>] filename_lookup+0xb4/0x238
[108569.461434] [<000000004072b4d0>] user_path_at_empty+0x8c/0xf0
[108569.529412] [<0000000040712af0>] do_readlinkat+0xdc/0x280
[108569.597430] [<00000000407140a4>] sys_readlink+0x40/0x70
[108569.661409] [<0000000040303e30>] syscall_exit+0x0/0x10
[108569.725428]
[108569.745405] ---[ end trace 0000000000000000 ]---
[108569.801407] dentry = 000000016d7d0000
[108569.845407] spin_is_locked(&dentry->d_lock) = 0
[108569.901422] dname_external(dentry) = 0
[108569.949405] dentry->d_flags & DCACHE_NORCU = 0
[108570.001405] dentry->d_name.len = 10
[108570.045421] dentry->d_name.hash = e6582e53
[108570.093476] dentry->d_lockref.count = -128
[108570.145404] dentry->d_flags = 32776
[108570.189420] dentry->d_inode = 0000000000000000
[108570.241404] dentry->d_parent = 00000002332c2780
[108570.297403] dentry->d_u.d_rcu = 0x416be770


> Add debug info to the diff below. Cut the first BUG_ON off if it
> triggers and see if the second one can be hit.

I will add that code now and try.
Helge


> +++ y/fs/dcache.c
> @@ -605,8 +605,10 @@ static void __dentry_kill(struct dentry
> spin_unlock(&parent->d_lock);
> if (dentry->d_inode)
> dentry_unlink_inode(dentry);
> - else
> + else {
> + BUG_ON(!hlist_unhashed(&dentry->d_u.d_alias));
> spin_unlock(&dentry->d_lock);
> + }
> this_cpu_dec(nr_dentry);
> if (dentry->d_op && dentry->d_op->d_release)
> dentry->d_op->d_release(dentry);
> @@ -616,6 +618,7 @@ static void __dentry_kill(struct dentry
> dentry->d_flags |= DCACHE_MAY_FREE;
> can_free = false;
> }
> + BUG_ON(!hlist_unhashed(&dentry->d_u.d_alias));
> spin_unlock(&dentry->d_lock);
> if (likely(can_free))
> dentry_free(dentry);

2022-07-17 09:44:05

by Helge Deller

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

Hi Hillf,

On 7/16/22 07:27, Helge Deller wrote:
> On 7/15/22 15:33, Hillf Danton wrote:
>> On Fri, 15 Jul 2022 10:18:23 +0200 Helge Deller wrote:
> and this was the output:
>
> [108563.953441] IAOQ[1]: dentry_free+0xc4/0x338
> [108564.005441] RP(r2): __dentry_kill+0x284/0x2e8
> [108564.057443] Backtrace:
> [108564.089454] [<000000004073bab0>] __dentry_kill+0x284/0x2e8
> [108564.157445] [<000000004073d2f8>] dput+0x334/0x5a8
> [108564.213444] [<00000000407267a4>] step_into+0x790/0xa88
> [108564.277444] [<0000000040727084>] walk_component+0xa8/0x330
> [108564.345443] [<00000000407275f8>] link_path_walk.part.0.constprop.0+0x2ec/0x4d0
> [108564.437443] [<0000000040728254>] path_openat+0x150/0x1ba8
> [108564.501441] [<000000004072ba28>] do_filp_open+0x9c/0x198
> [108564.569441] [<0000000040701d20>] do_sys_openat2+0x14c/0x2a8
> [108564.637441] [<0000000040702654>] compat_sys_openat+0x58/0xb8
> [108564.705440] [<0000000040303e30>] syscall_exit+0x0/0x10
> [108564.769444]
> [108564.789435] ---[ end trace 0000000000000000 ]---
> [108564.845444] dentry = 000000031624e6c0
> [108564.889437] spin_is_locked(&dentry->d_lock) = 0
> [108564.945436] dname_external(dentry) = 0
> [108564.993436] dentry->d_flags & DCACHE_NORCU = 0
> [108565.045446] dentry->d_name.len = 3
> [108565.089435] dentry->d_name.hash = 89116695
> [108565.137435] dentry->d_lockref.count = -128
> [108565.189434] dentry->d_flags = 32776
> [108565.233435] dentry->d_inode = 0000000000000000
> [108565.285429] dentry->d_parent = 000000022ef756c0
> [108565.341434] dentry->d_u.d_rcu = 0x416be770
>
>> Add debug info to the diff below. Cut the first BUG_ON off if it>> triggers and see if the second one can be hit.

I used WARN_ON() instead of BUG_ON().
With that, both triggered, first the first one, then the second one.
Full log is here:
http://dellerweb.de/testcases/minicom.dcache.crash.6-warn

Helge

>> +++ y/fs/dcache.>> @@ -605,8 +605,10 @@ static void __dentry_kill(struct dentry
>> spin_unlock(&parent->d_lock);
>> if (dentry->d_inode)
>> dentry_unlink_inode(dentry);
>> - else
>> + else {
>> + BUG_ON(!hlist_unhashed(&dentry->d_u.d_alias));
>> spin_unlock(&dentry->d_lock);
>> + }
>> this_cpu_dec(nr_dentry);
>> if (dentry->d_op && dentry->d_op->d_release)
>> dentry->d_op->d_release(dentry);
>> @@ -616,6 +618,7 @@ static void __dentry_kill(struct dentry
>> dentry->d_flags |= DCACHE_MAY_FREE;
>> can_free = false;
>> }
>> + BUG_ON(!hlist_unhashed(&dentry->d_u.d_alias));
>> spin_unlock(&dentry->d_lock);
>> if (likely(can_free))
>> dentry_free(dentry);

2022-07-19 17:00:49

by Helge Deller

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

Hello Hillf,

On 7/17/22 13:36, Hillf Danton wrote:
> On Sun, 17 Jul 2022 11:42:48 +0200
>> I used WARN_ON() instead of BUG_ON().
>> With that, both triggered, first the first one, then the second one.
>> Full log is here:
>> http://dellerweb.de/testcases/minicom.dcache.crash.6-warn
>
> Given the first BUG_ON triggered, and dentry at the moment is supposed to
> not be alias, see if it is still in lookup with d_lock held. That is the
> step before de-unioning d_alias with d_in_lookup_hash.
>
> On the other hand if only the second one triggered, we should track
> DCACHE_DENTRY_KILLED instead in assumption that killed dentry was
> used again after releasing d_lock surrounding the firt one.

The machine has now been up for 2 days without any issues, while it had pretty
much the same load as when it was crashing earlier.
So, in summary I'd assume that your patch below fixes the issue.

I'm now rebooting the machine with a new kernel, where I just changed
if (unlikely(d_in_lookup(dentry)))
to
if (WARN_ON_ONCE(d_in_lookup(dentry)))
in order to see if this really triggered.

Anyway, I think your patch is good so far.
Would that be the final patch, or should I test some others?

Thanks!
Helge

>
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -605,8 +605,12 @@ static void __dentry_kill(struct dentry
> spin_unlock(&parent->d_lock);
> if (dentry->d_inode)
> dentry_unlink_inode(dentry);
> - else
> + else {
> + if (unlikely(d_in_lookup(dentry))) {
> + __d_lookup_done(dentry);
> + }
> spin_unlock(&dentry->d_lock);
> + }
> this_cpu_dec(nr_dentry);
> if (dentry->d_op && dentry->d_op->d_release)
> dentry->d_op->d_release(dentry);

2022-07-19 21:19:49

by John David Anglin

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

Hi Helge,

I hit this warning with the patch below building ghc on mx3210:

mx3210 login: ------------[ cut here ]------------
WARNING: CPU: 2 PID: 32654 at fs/dcache.c:365 dentry_free+0xfc/0x108
Modules linked in: binfmt_misc ext2 ext4 crc16 mbcache jbd2 ipmi_watchdog sg ipmi_si ipmi_poweroff ipmi_devintf ipmi_msghandler fuse nfsd
ip_tables x_tables ipv6 autofs4 xfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c
crc32c_generic raid1 raid0 multipath linear md_mod sd_mod t10_pi ses enclosure scsi_transport_sas crc64_rocksoft crc64 uas usb_storage sr_mod
cdrom ohci_pci sym53c8xx pata_cmd64x ehci_pci ohci_hcd libata scsi_transport_spi ehci_hcd tg3 scsi_mod usbcore scsi_common usb_common
CPU: 2 PID: 32654 Comm: cc1 Not tainted 5.18.12+ #2
Hardware name: 9000/800/rp3440

     YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00001000000001000110100000001111 Not tainted
r00-03  000000000804680f 00000040ce7fc880 00000000404f2b74 00000040ce7fc920
r04-07  0000000040be4940 000000410f6cd630 00000001413e4068 000000410f6cd688
r08-11  0000000040fd2e60 0000000040bc5020 0000000040c2c940 00000000000800e0
r12-15  0000000040c2c940 0000000000000001 0000000040c2c940 000000410f6cd688
r16-19  00000001f9fe105d 00000040ce7fc1f8 000000000000002f 000000000a0c1000
r20-23  000000000800000f 000000000800000f 000000410f6cd639 000000000800000f
r24-27  0000000000000000 0000000000000385 000000410f6cd630 0000000040be4940
r28-31  0000000041104530 00000040ce7fc8f0 00000040ce7fc9a0 0000000000000000
sr00-03  0000000000a03800 0000000000000000 0000000000000000 0000000000a03800
sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000

IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000404f18bc 00000000404f18c0
 IIR: 03ffe01f    ISR: 0000000010350000  IOR: 00000239ff3fc928
 CPU:        2   CR30: 00000040cadd1380 CR31: ffffffffffffffff
 ORIG_R28: 00000040ce7fcb70
 IAOQ[0]: dentry_free+0xfc/0x108
 IAOQ[1]: dentry_free+0x100/0x108
 RP(r2): __dentry_kill+0x2bc/0x338
Backtrace:
 [<00000000404f2b74>] __dentry_kill+0x2bc/0x338
 [<00000000404f37b8>] dentry_kill+0xb0/0x318
 [<00000000404f3d08>] dput+0x2e8/0x328
 [<00000000404dd7dc>] step_into+0x344/0x390
 [<00000000404dda4c>] walk_component+0xa4/0x310
 [<00000000404df234>] link_path_walk.part.0+0x2ec/0x4b0
 [<00000000404e0000>] path_openat+0xe8/0x348
 [<00000000404e2c58>] do_filp_open+0x98/0x178
 [<00000000404babe8>] do_sys_openat2+0x148/0x288
 [<00000000404bb41c>] compat_sys_openat+0x54/0x98
 [<0000000040203e30>] syscall_exit+0x0/0x10

---[ end trace 0000000000000000 ]---
watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [cc1:32657]

Regards,
Dave

On 2022-07-19 12:32 p.m., Helge Deller wrote:
> Hello Hillf,
>
> On 7/17/22 13:36, Hillf Danton wrote:
>> On Sun, 17 Jul 2022 11:42:48 +0200
>>> I used WARN_ON() instead of BUG_ON().
>>> With that, both triggered, first the first one, then the second one.
>>> Full log is here:
>>> http://dellerweb.de/testcases/minicom.dcache.crash.6-warn
>> Given the first BUG_ON triggered, and dentry at the moment is supposed to
>> not be alias, see if it is still in lookup with d_lock held. That is the
>> step before de-unioning d_alias with d_in_lookup_hash.
>>
>> On the other hand if only the second one triggered, we should track
>> DCACHE_DENTRY_KILLED instead in assumption that killed dentry was
>> used again after releasing d_lock surrounding the firt one.
> The machine has now been up for 2 days without any issues, while it had pretty
> much the same load as when it was crashing earlier.
> So, in summary I'd assume that your patch below fixes the issue.
>
> I'm now rebooting the machine with a new kernel, where I just changed
> if (unlikely(d_in_lookup(dentry)))
> to
> if (WARN_ON_ONCE(d_in_lookup(dentry)))
> in order to see if this really triggered.
>
> Anyway, I think your patch is good so far.
> Would that be the final patch, or should I test some others?
>
> Thanks!
> Helge
>
>> --- a/fs/dcache.c
>> +++ b/fs/dcache.c
>> @@ -605,8 +605,12 @@ static void __dentry_kill(struct dentry
>> spin_unlock(&parent->d_lock);
>> if (dentry->d_inode)
>> dentry_unlink_inode(dentry);
>> - else
>> + else {
>> + if (unlikely(d_in_lookup(dentry))) {
>> + __d_lookup_done(dentry);
>> + }
>> spin_unlock(&dentry->d_lock);
>> + }
>> this_cpu_dec(nr_dentry);
>> if (dentry->d_op && dentry->d_op->d_release)
>> dentry->d_op->d_release(dentry);


--
John David Anglin [email protected]

2022-07-19 21:28:41

by Helge Deller

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

On 7/19/22 22:59, John David Anglin wrote:
> Hi Helge,
>
> I hit this warning with the patch below building ghc on mx3210:

As I wrote, I didn't faced it yet on my buildd server, but that could
just have been luck...
Hillf, should we try if this second hunk triggers?

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -616,6 +618,7 @@ static void __dentry_kill(struct dentry
dentry->d_flags |= DCACHE_MAY_FREE;
can_free = false;
}
+ BUG_ON(!hlist_unhashed(&dentry->d_u.d_alias));
spin_unlock(&dentry->d_lock);
if (likely(can_free))
dentry_free(dentry);

Helge


> mx3210 login: ------------[ cut here ]------------
> WARNING: CPU: 2 PID: 32654 at fs/dcache.c:365 dentry_free+0xfc/0x108
> Modules linked in: binfmt_misc ext2 ext4 crc16 mbcache jbd2 ipmi_watchdog sg ipmi_si ipmi_poweroff ipmi_devintf ipmi_msghandler fuse nfsd ip_tables x_tables ipv6 autofs4 xfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sd_mod t10_pi ses enclosure scsi_transport_sas crc64_rocksoft crc64 uas usb_storage sr_mod cdrom ohci_pci sym53c8xx pata_cmd64x ehci_pci ohci_hcd libata scsi_transport_spi ehci_hcd tg3 scsi_mod usbcore scsi_common usb_common
> CPU: 2 PID: 32654 Comm: cc1 Not tainted 5.18.12+ #2
> Hardware name: 9000/800/rp3440
>
>      YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
> PSW: 00001000000001000110100000001111 Not tainted
> r00-03  000000000804680f 00000040ce7fc880 00000000404f2b74 00000040ce7fc920
> r04-07  0000000040be4940 000000410f6cd630 00000001413e4068 000000410f6cd688
> r08-11  0000000040fd2e60 0000000040bc5020 0000000040c2c940 00000000000800e0
> r12-15  0000000040c2c940 0000000000000001 0000000040c2c940 000000410f6cd688
> r16-19  00000001f9fe105d 00000040ce7fc1f8 000000000000002f 000000000a0c1000
> r20-23  000000000800000f 000000000800000f 000000410f6cd639 000000000800000f
> r24-27  0000000000000000 0000000000000385 000000410f6cd630 0000000040be4940
> r28-31  0000000041104530 00000040ce7fc8f0 00000040ce7fc9a0 0000000000000000
> sr00-03  0000000000a03800 0000000000000000 0000000000000000 0000000000a03800
> sr04-07  0000000000000000 0000000000000000 0000000000000000 0000000000000000
>
> IASQ: 0000000000000000 0000000000000000 IAOQ: 00000000404f18bc 00000000404f18c0
>  IIR: 03ffe01f    ISR: 0000000010350000  IOR: 00000239ff3fc928
>  CPU:        2   CR30: 00000040cadd1380 CR31: ffffffffffffffff
>  ORIG_R28: 00000040ce7fcb70
>  IAOQ[0]: dentry_free+0xfc/0x108
>  IAOQ[1]: dentry_free+0x100/0x108
>  RP(r2): __dentry_kill+0x2bc/0x338
> Backtrace:
>  [<00000000404f2b74>] __dentry_kill+0x2bc/0x338
>  [<00000000404f37b8>] dentry_kill+0xb0/0x318
>  [<00000000404f3d08>] dput+0x2e8/0x328
>  [<00000000404dd7dc>] step_into+0x344/0x390
>  [<00000000404dda4c>] walk_component+0xa4/0x310
>  [<00000000404df234>] link_path_walk.part.0+0x2ec/0x4b0
>  [<00000000404e0000>] path_openat+0xe8/0x348
>  [<00000000404e2c58>] do_filp_open+0x98/0x178
>  [<00000000404babe8>] do_sys_openat2+0x148/0x288
>  [<00000000404bb41c>] compat_sys_openat+0x54/0x98
>  [<0000000040203e30>] syscall_exit+0x0/0x10
>
> ---[ end trace 0000000000000000 ]---
> watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [cc1:32657]
>
> Regards,
> Dave
>
> On 2022-07-19 12:32 p.m., Helge Deller wrote:
>> Hello Hillf,
>>
>> On 7/17/22 13:36, Hillf Danton wrote:
>>> On Sun, 17 Jul 2022 11:42:48 +0200
>>>> I used WARN_ON() instead of BUG_ON().
>>>> With that, both triggered, first the first one, then the second one.
>>>> Full log is here:
>>>> http://dellerweb.de/testcases/minicom.dcache.crash.6-warn
>>> Given the first BUG_ON triggered, and dentry at the moment is supposed to
>>> not be alias, see if it is still in lookup with d_lock held. That is the
>>> step before de-unioning d_alias with d_in_lookup_hash.
>>>
>>> On the other hand if only the second one triggered, we should track
>>> DCACHE_DENTRY_KILLED instead in assumption that killed dentry was
>>> used again after releasing d_lock surrounding the firt one.
>> The machine has now been up for 2 days without any issues, while it had pretty
>> much the same load as when it was crashing earlier.
>> So, in summary I'd assume that your patch below fixes the issue.
>>
>> I'm now rebooting the machine with a new kernel, where I just changed
>>     if (unlikely(d_in_lookup(dentry)))
>> to
>>     if (WARN_ON_ONCE(d_in_lookup(dentry)))
>> in order to see if this really triggered.
>>
>> Anyway, I think your patch is good so far.
>> Would that be the final patch, or should I test some others?
>>
>> Thanks!
>> Helge
>>
>>> --- a/fs/dcache.c
>>> +++ b/fs/dcache.c
>>> @@ -605,8 +605,12 @@ static void __dentry_kill(struct dentry
>>>           spin_unlock(&parent->d_lock);
>>>       if (dentry->d_inode)
>>>           dentry_unlink_inode(dentry);
>>> -    else
>>> +    else {
>>> +        if (unlikely(d_in_lookup(dentry))) {
>>> +            __d_lookup_done(dentry);
>>> +        }
>>>           spin_unlock(&dentry->d_lock);
>>> +    }
>>>       this_cpu_dec(nr_dentry);
>>>       if (dentry->d_op && dentry->d_op->d_release)
>>>           dentry->d_op->d_release(dentry);
>
>

2022-07-20 02:06:03

by Al Viro

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

On Tue, Jul 19, 2022 at 11:25:07PM +0200, Helge Deller wrote:
> On 7/19/22 22:59, John David Anglin wrote:
> > Hi Helge,
> >
> > I hit this warning with the patch below building ghc on mx3210:
>
> As I wrote, I didn't faced it yet on my buildd server, but that could
> just have been luck...
> Hillf, should we try if this second hunk triggers?
>
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -616,6 +618,7 @@ static void __dentry_kill(struct dentry
> dentry->d_flags |= DCACHE_MAY_FREE;
> can_free = false;
> }
> + BUG_ON(!hlist_unhashed(&dentry->d_u.d_alias));
> spin_unlock(&dentry->d_lock);
> if (likely(can_free))
> dentry_free(dentry);

Would you gentlemen mind Cc'ing the entire thing to fsdevel? Or at
least forwarding it my way...

2022-07-20 03:26:01

by Al Viro

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

On Sat, Jul 16, 2022 at 07:27:30AM +0200, Helge Deller wrote:
> [108564.845444] dentry = 000000031624e6c0

*blink*
Is that really a plausible address of a kernel object on
parisc? Or is that "mangle pointers for security fetish^W
purposes, lest somebody manages to get a useful information
out of dmesg" in action?

> [108564.889437] spin_is_locked(&dentry->d_lock) = 0
> [108564.945436] dname_external(dentry) = 0
> [108564.993436] dentry->d_flags & DCACHE_NORCU = 0
> [108565.045446] dentry->d_name.len = 3
> [108565.089435] dentry->d_name.hash = 89116695
> [108565.137435] dentry->d_lockref.count = -128
> [108565.189434] dentry->d_flags = 32776

0x8008, i.e. DCACHE_OP_DELETE | DCACHE_DENTRY_KILLED. No
DCACHE_PAR_LOOKUP in sight...

> [108569.801407] dentry = 000000016d7d0000
> [108569.845407] spin_is_locked(&dentry->d_lock) = 0
> [108569.901422] dname_external(dentry) = 0
> [108569.949405] dentry->d_flags & DCACHE_NORCU = 0
> [108570.001405] dentry->d_name.len = 10
> [108570.045421] dentry->d_name.hash = e6582e53
> [108570.093476] dentry->d_lockref.count = -128
> [108570.145404] dentry->d_flags = 32776
> [108570.189420] dentry->d_inode = 0000000000000000
> [108570.241404] dentry->d_parent = 00000002332c2780
> [108570.297403] dentry->d_u.d_rcu = 0x416be770

Ditto. Pointers look really weird...

2022-07-20 03:30:31

by Al Viro

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

On Tue, Jul 19, 2022 at 06:32:04PM +0200, Helge Deller wrote:
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -605,8 +605,12 @@ static void __dentry_kill(struct dentry
> > spin_unlock(&parent->d_lock);
> > if (dentry->d_inode)
> > dentry_unlink_inode(dentry);
> > - else
> > + else {
> > + if (unlikely(d_in_lookup(dentry))) {
> > + __d_lookup_done(dentry);
> > + }
> > spin_unlock(&dentry->d_lock);
> > + }

Ugh... If that triggers, we really have a problem. I don't see any
way for that to happen (well, short of generally fucked refcounting -
stray dput() on wrong address and all bets are off)

2022-07-20 03:30:40

by Al Viro

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

On Wed, Jul 20, 2022 at 03:31:10AM +0100, Al Viro wrote:
> On Sat, Jul 16, 2022 at 07:27:30AM +0200, Helge Deller wrote:
> > [108564.845444] dentry = 000000031624e6c0
>
> *blink*
> Is that really a plausible address of a kernel object on
> parisc? Or is that "mangle pointers for security fetish^W
> purposes, lest somebody manages to get a useful information
> out of dmesg" in action?
>
> > [108564.889437] spin_is_locked(&dentry->d_lock) = 0
> > [108564.945436] dname_external(dentry) = 0
> > [108564.993436] dentry->d_flags & DCACHE_NORCU = 0
> > [108565.045446] dentry->d_name.len = 3
> > [108565.089435] dentry->d_name.hash = 89116695
> > [108565.137435] dentry->d_lockref.count = -128
> > [108565.189434] dentry->d_flags = 32776
>
> 0x8008, i.e. DCACHE_OP_DELETE | DCACHE_DENTRY_KILLED. No
> DCACHE_PAR_LOOKUP in sight...
>
> > [108569.801407] dentry = 000000016d7d0000
> > [108569.845407] spin_is_locked(&dentry->d_lock) = 0
> > [108569.901422] dname_external(dentry) = 0
> > [108569.949405] dentry->d_flags & DCACHE_NORCU = 0
> > [108570.001405] dentry->d_name.len = 10
> > [108570.045421] dentry->d_name.hash = e6582e53
> > [108570.093476] dentry->d_lockref.count = -128
> > [108570.145404] dentry->d_flags = 32776
> > [108570.189420] dentry->d_inode = 0000000000000000
> > [108570.241404] dentry->d_parent = 00000002332c2780
> > [108570.297403] dentry->d_u.d_rcu = 0x416be770
>
> Ditto. Pointers look really weird...

PS: I think I've caught what could be found on l-k; could you
please forward the rest (parisc list? private thread?) to
fsdevel or directly my way.

2022-07-20 03:54:10

by Al Viro

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

On Sat, Jul 16, 2022 at 07:27:30AM +0200, Helge Deller wrote:
> On 7/15/22 15:33, Hillf Danton wrote:

> [108565.341434] dentry->d_u.d_rcu = 0x416be770

Incidentally, does that match the address of __d_free() on your build,
or is it something different?

2022-07-20 07:15:19

by Al Viro

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

On Wed, Jul 20, 2022 at 08:53:53AM +0200, Helge Deller wrote:
> On 7/20/22 05:29, Al Viro wrote:
> > On Sat, Jul 16, 2022 at 07:27:30AM +0200, Helge Deller wrote:
> >> On 7/15/22 15:33, Hillf Danton wrote:
> >
> >> [108565.341434] dentry->d_u.d_rcu = 0x416be770
> >
> > Incidentally, does that match the address of __d_free() on your build,
> > or is it something different?
>
> I don't think it's __d_free().
> The source is:
> pr_err("dentry->d_u.d_rcu = %pS\n", dentry->d_u.d_rcu.func);
> so the "%pS" would probably have resolved the pointer to string "__d_free" (or something else).

????

That should've taken a word from desc->addr and printed it.
If it had managed to get to a string (you'd needed to have
CONFIG_KALLSYMS enabled), it would've printed it as a string,
not as hex address. Seriously, check System.map for that
kernel...

2022-07-20 07:18:20

by Helge Deller

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

On 7/20/22 05:29, Al Viro wrote:
> On Sat, Jul 16, 2022 at 07:27:30AM +0200, Helge Deller wrote:
>> On 7/15/22 15:33, Hillf Danton wrote:
>
>> [108565.341434] dentry->d_u.d_rcu = 0x416be770
>
> Incidentally, does that match the address of __d_free() on your build,
> or is it something different?

I don't think it's __d_free().
The source is:
pr_err("dentry->d_u.d_rcu = %pS\n", dentry->d_u.d_rcu.func);
so the "%pS" would probably have resolved the pointer to string "__d_free" (or something else).

Helge

2022-07-20 09:53:18

by Helge Deller

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

* Al Viro <[email protected]>:
> On Wed, Jul 20, 2022 at 08:53:53AM +0200, Helge Deller wrote:
> > On 7/20/22 05:29, Al Viro wrote:
> > > On Sat, Jul 16, 2022 at 07:27:30AM +0200, Helge Deller wrote:
> > >> On 7/15/22 15:33, Hillf Danton wrote:
> > >
> > >> [108565.341434] dentry->d_u.d_rcu = 0x416be770
> > >
> > > Incidentally, does that match the address of __d_free() on your build,
> > > or is it something different?
> >
> > I don't think it's __d_free().
> > The source is:
> > pr_err("dentry->d_u.d_rcu = %pS\n", dentry->d_u.d_rcu.func);
> > so the "%pS" would probably have resolved the pointer to string "__d_free" (or something else).
>
> ????
>
> That should've taken a word from desc->addr and printed it.
> If it had managed to get to a string (you'd needed to have
> CONFIG_KALLSYMS enabled), it would've printed it as a string,
> not as hex address. Seriously, check System.map for that
> kernel...

CONFIG_KALLSYMS is enabled - otherwise I wouldn't see the backtrace either
(just to be sure I did check the .config again).
But it might be, that when I produced that debug info I had
%p instead of %pS - that might explain why the function name wasn't resolved.

The kernel functions on my build on parisc range from
0000000040100000 T __init_begin
to
0000000041700000 D _end

In my current System.map (which may not fit the address above!) I see:
00000000416bda50 d in_lookup_hashtable
00000000416bfa50 d counter.0
so, 0x416be770 would be in in_lookup_hashtable(). Does that make sense?
Again - this is a somewhat newer build...

Btw, you asked if there were any other private discussions/mails on other mailing lists:
There are none, only those which are in this mail thread.

Below is the current patch with which I will try to reproduce the issue.
I'll disable hashed pointers too.
Any other info/patches I should add?

Reproducing it is time-consuming. It takes me usually a day to trigger.

Helge


diff --git a/fs/dcache.c b/fs/dcache.c
index 93f4f5ee07bf..047729fc1a97 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -360,11 +360,32 @@ static inline void __d_clear_type_and_inode(struct dentry *dentry)
this_cpu_inc(nr_dentry_negative);
}

+static void show_dentry_info(struct dentry *dentry)
+{
+ pr_err("dentry = %px\n", dentry);
+ pr_err("spin_is_locked(&dentry->d_lock) = %d\n", spin_is_locked(&dentry->d_lock));
+ pr_err("dname_external(dentry) = %d\n", dname_external(dentry));
+ pr_err("dentry->d_flags = 0x%x\n", dentry->d_flags);
+ // pr_err("ERROR on file %pd\n", &dentry); HANGS
+ pr_err("dentry->d_name.len = %d\n", dentry->d_name.len);
+ pr_err("dentry->d_name.hash = 0x%x\n", dentry->d_name.hash);
+ pr_err("dentry->d_lockref.count = %d\n", dentry->d_lockref.count);
+ pr_err("dentry->d_flags = 0x%x\n", dentry->d_flags);
+ pr_err("dentry->d_inode = %px\n", dentry->d_inode);
+ pr_err("dentry->d_parent = %px\n", dentry->d_parent);
+ pr_err("dentry->d_u.d_rcu = %pS\n", dentry->d_u.d_rcu.func);
+}
+
static void dentry_free(struct dentry *dentry)
{
- WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
+ int unhashed = hlist_unhashed(&dentry->d_u.d_alias);
+ if (WARN_ON(!unhashed)) {
+ show_dentry_info(dentry);
+ }
if (unlikely(dname_external(dentry))) {
struct external_name *p = external_name(dentry);
+ if (!unhashed)
+ pr_err("value of &p->u.count = %d\n", p->u.count.counter);
if (likely(atomic_dec_and_test(&p->u.count))) {
call_rcu(&dentry->d_u.d_rcu, __d_free_external);
return;
@@ -605,8 +626,13 @@ static void __dentry_kill(struct dentry *dentry)
spin_unlock(&parent->d_lock);
if (dentry->d_inode)
dentry_unlink_inode(dentry);
- else
+ else {
+ if (WARN_ON_ONCE(d_in_lookup(dentry))) {
+ show_dentry_info(dentry);
+ __d_lookup_done(dentry);
+ }
spin_unlock(&dentry->d_lock);
+ }
this_cpu_dec(nr_dentry);
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);
@@ -616,6 +642,8 @@ static void __dentry_kill(struct dentry *dentry)
dentry->d_flags |= DCACHE_MAY_FREE;
can_free = false;
}
+ if (WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias)))
+ show_dentry_info(dentry);
spin_unlock(&dentry->d_lock);
if (likely(can_free))
dentry_free(dentry);

2022-07-20 18:08:44

by Al Viro

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

On Wed, Jul 20, 2022 at 07:00:32PM +0800, Hillf Danton wrote:

> To help debug it, de-union d_in_lookup_hash with d_alias and add debug
> info after dentry is killed. If any warning hits, we know where to add
> something like
>
> WARN_ON(dentry->d_flags & DCACHE_DENTRY_KILLED);
>
> before hlist_bl_add or hlist_add.

IDGI. That clearly has nothing to do with in-lookup stuff - no
DCACHE_PAR_LOOKUP in reported flags, so it either never had it set,
or it went through __d_lookup_done() already.

If anything, it might have already been through d_free(), with
d_rcu being confused for d_alias.

I'd do something like
WARN_ON(dentry->d_flags & (1U<<31));
dentry->d_flags |= 1U << 31;
in the begining of d_free() (possibly with dumping dentry state if we
hit that, not that there would be much to report; d_name.name might
be informative, though).

Again, in-lookup looks like a red herring - DCACHE_PAR_LOOKUP is set
only in d_alloc_parallel(), right next to the insertion into the list
and removed only in __d_lookup_free(), right next to the removal from
the same. No DCACHE_PAR_LOOKUP in ->d_flags (it's 0x8008 in reported
cases, i.e. DCACHE_OP_REVALIDATE | DCACHE_DENTRY_KILLED).

What's more, take a look at retain_dentry(); WARN_ON(d_in_lookup(dentry))
right at the top and it had not triggered in any of the reports I've
seen in that thread. Granted, it's not called on each path to
__dentry_kill(), but it is on the call chains I've seen reported...

Another thing that might be interesting to know is ->d_sb, along with
->d_sb->s_type->name and ->d_sb->s_id. That should tell which fs it's
on...

I wonder if anyone had seen anything similar outside of parisc...
I don't know if I have any chance to reproduce it here - the only
parisc box I've got is a 715/100 (assuming the disk is still alive)
and it's 32bit, unlike the reported setups and, er, not fast.
qemu seems to have some parisc support, but it's 32bit-only at the
moment...

PS: please, Cc fsdevel on anything VFS-related. Very few people are
still subscribed to l-k these days - I am, but it's impossible to read
through and postings can easily get missed.

2022-07-20 23:23:18

by Sam James

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128



> On 20 Jul 2022, at 18:06, Al Viro <[email protected]> wrote:
>
> On Wed, Jul 20, 2022 at 07:00:32PM +0800, Hillf Danton wrote:
>
>> To help debug it, de-union d_in_lookup_hash with d_alias and add debug
>> info after dentry is killed. If any warning hits, we know where to add
>> something like
>>
>> WARN_ON(dentry->d_flags & DCACHE_DENTRY_KILLED);
>>
>> before hlist_bl_add or hlist_add.

> [snip]
> I wonder if anyone had seen anything similar outside of parisc...
> I don't know if I have any chance to reproduce it here - the only
> parisc box I've got is a 715/100 (assuming the disk is still alive)
> and it's 32bit, unlike the reported setups and, er, not fast.
> qemu seems to have some parisc support, but it's 32bit-only at the
> moment...

I don't think I've seen this on parisc either, but I don't think
I've used tmpfs that heavily. I'll try it in case it's somehow more
likely to trigger it.

Helge, were there any particular steps to reproduce this? Or just
start doing your normal Debian builds on a tmpfs and it happens
soon enough?


Attachments:
signature.asc (367.00 B)
Message signed with OpenPGP

2022-07-21 04:00:46

by Helge Deller

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

On 7/21/22 01:15, Sam James wrote:
>> On 20 Jul 2022, at 18:06, Al Viro <[email protected]> wrote:
>>
>> On Wed, Jul 20, 2022 at 07:00:32PM +0800, Hillf Danton wrote:
>>
>>> To help debug it, de-union d_in_lookup_hash with d_alias and add debug
>>> info after dentry is killed. If any warning hits, we know where to add
>>> something like
>>>
>>> WARN_ON(dentry->d_flags & DCACHE_DENTRY_KILLED);
>>>
>>> before hlist_bl_add or hlist_add.
>
>> [snip]
>> I wonder if anyone had seen anything similar outside of parisc...

Me too.
Of course it could be caused by the platform code, as we have had
issues with caches, spinlocks and so on.
On older kernels we also have seen RCU stalls in d_alloc_parallel().

>> I don't know if I have any chance to reproduce it here - the only
>> parisc box I've got is a 715/100 (assuming the disk is still alive)
>> and it's 32bit, unlike the reported setups and, er, not fast.

It's fun to boot it, but it will be too slow for actual testing.

>> qemu seems to have some parisc support, but it's 32bit-only at the
>> moment...

Yes. I think it will be hard to reproduce it in the VM.

> I don't think I've seen this on parisc either, but I don't think
> I've used tmpfs that heavily. I'll try it in case it's somehow more
> likely to trigger it.

It happened on the debian buildd server with tmpfs. To rule out tmpfs
I switched to ext4 (on SATA SSD) and it happened there as well.
I assume Dave's report is on ext3/ext4 with SCSI discs.

> Helge, were there any particular steps to reproduce this? Or just
> start doing your normal Debian builds on a tmpfs and it happens
> soon enough?

Currently it's not easy to reproduce for me either.
It happens on the debian buildd server (4-way c8000 machine) while building
the webkit2gtk package. I think it happens at the end when sbuild
cleans the build directories by deleting all files.
Maybe there is a filesystem test toolkit which you could try which hammers
the fs by deleting lots of files in parallel?

Helge

2022-07-30 20:32:21

by Helge Deller

[permalink] [raw]
Subject: Re: WARNING: CPU: 1 PID: 14735 at fs/dcache.c:365 dentry_free+0x100/0x128

On 7/21/22 05:54, Helge Deller wrote:
> On 7/21/22 01:15, Sam James wrote:
>>> On 20 Jul 2022, at 18:06, Al Viro <[email protected]> wrote:
>>>
>>> On Wed, Jul 20, 2022 at 07:00:32PM +0800, Hillf Danton wrote:
>>>
>>>> To help debug it, de-union d_in_lookup_hash with d_alias and add debug
>>>> info after dentry is killed. If any warning hits, we know where to add
>>>> something like
>>>>
>>>> WARN_ON(dentry->d_flags & DCACHE_DENTRY_KILLED);
>>>>
>>>> before hlist_bl_add or hlist_add.
>>
>>> [snip]
>>> I wonder if anyone had seen anything similar outside of parisc...
>
> Me too.
> Of course it could be caused by the platform code, as we have had
> issues with caches, spinlocks and so on.
> On older kernels we also have seen RCU stalls in d_alloc_parallel().
>
>>> I don't know if I have any chance to reproduce it here - the only
>>> parisc box I've got is a 715/100 (assuming the disk is still alive)
>>> and it's 32bit, unlike the reported setups and, er, not fast.
>
> It's fun to boot it, but it will be too slow for actual testing.
>
>>> qemu seems to have some parisc support, but it's 32bit-only at the
>>> moment...
>
> Yes. I think it will be hard to reproduce it in the VM.
>
>> I don't think I've seen this on parisc either, but I don't think
>> I've used tmpfs that heavily. I'll try it in case it's somehow more
>> likely to trigger it.
>
> It happened on the debian buildd server with tmpfs. To rule out tmpfs
> I switched to ext4 (on SATA SSD) and it happened there as well.
> I assume Dave's report is on ext3/ext4 with SCSI discs.
>
>> Helge, were there any particular steps to reproduce this? Or just
>> start doing your normal Debian builds on a tmpfs and it happens
>> soon enough?
>
> Currently it's not easy to reproduce for me either.
> It happens on the debian buildd server (4-way c8000 machine) while building
> the webkit2gtk package. I think it happens at the end when sbuild
> cleans the build directories by deleting all files.
> Maybe there is a filesystem test toolkit which you could try which hammers
> the fs by deleting lots of files in parallel?

I currently can't reproduce the issue any longer.
In case it pops up again, I'll follow up here again.

Helge