Hi,
On one of our production servers, running kernel version 4.14.32, I noticed
the following:
----
[138630.417246] cache_from_obj: Wrong slab cache. fsnotify_mark_connector but
object is from kmalloc-2048(361:anycast-healthchecker.service)
[138630.477075] ------------[ cut here ]------------
[138630.500028] WARNING: CPU: 14 PID: 26002 at mm/slab.h:377
kmem_cache_free+0x133/0x1c0
[138630.538130] Modules linked in: netconsole binfmt_misc sctp_diag sctp dccp_diag
dccp tcp_diag udp_diag inet_diag unix_diag cfg80211 rfkill 8021q garp mrp xfs loop
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd
glue_helper cryptd iTCO_wdt intel_cstate hpwdt hpilo iTCO_vendor_support sg
intel_rapl_perf ipmi_si pcspkr ioatdma shpchp lpc_ich i2c_i801 ipmi_devintf dca
mfd_core wmi ipmi_msghandler nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables
ext4 mbcache jbd2 i2c_algo_bit sd_mod drm_kms_helper syscopyarea sysfillrect
sysimgblt fb_sys_fops ttm bnx2x serio_raw mdio libcrc32c crc32c_intel hpsa ptp drm
scsi_transport_sas pps_core dm_mirror dm_region_hash dm_log dm_mod dax
[138630.871763] CPU: 14 PID: 26002 Comm: kworker/u66:3 Not tainted
4.14.32-1.el7.x86_64 #1
[138630.910864] Hardware name: HP ProLiant BL460c Gen9, BIOS I36 10/25/2017
[138630.943498] Workqueue: events_unbound fsnotify_connector_destroy_workfn
[138630.976569] task: ffff88203b62ae80 task.stack: ffffc90008a20000
[138631.005895] RIP: 0010:kmem_cache_free+0x133/0x1c0
[138631.029230] RSP: 0018:ffffc90008a23e20 EFLAGS: 00010246
[138631.055217] RAX: 000000000000007c RBX: ffff882000000000 RCX: 0000000000000000
[138631.091346] RDX: 0000000000000000 RSI: ffff88203f3969d8 RDI: ffff88203f3969d8
[138631.127139] RBP: ffffc90008a23e38 R08: 0000000000000001 R09: 000000000000057d
[138631.162575] R10: ffff88102b906cf0 R11: ffff88102b913ea0 R12: ffff88103f579980
[138631.198018] R13: ffff88203bd57c00 R14: ffff88203bd63f00 R15: ffffffff82128388
[138631.233371] FS: 0000000000000000(0000) GS:ffff88203f380000(0000)
knlGS:0000000000000000
[138631.273426] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[138631.301980] CR2: 0000000003600fd8 CR3: 000000000200a006 CR4: 00000000003606e0
[138631.338783] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[138631.374525] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[138631.409797] Call Trace:
[138631.422387] fsnotify_connector_destroy_workfn+0x68/0x80
[138631.448516] process_one_work+0x15c/0x380
[138631.468352] worker_thread+0x4d/0x3e0
[138631.486705] kthread+0x10c/0x140
[138631.503091] ? max_active_store+0x80/0x80
[138631.523217] ? __kthread_parkme+0x70/0x70
[138631.543338] ? do_syscall_64+0x79/0x1b0
[138631.562518] ret_from_fork+0x35/0x40
[138631.580199] Code: 4c 3b a7 d8 00 00 00 0f 84 11 ff ff ff 48 8b 4f 60 49 8b 54
24 60 48 c7 c6 c0 19 c3 81 48 c7 c7 c0 9c e7 81 31 c0 e8 90 a5 eb ff <0f> 0b 4c 89
e7 e9 e9 fe ff ff 65 8b 05 2c 3c de 7e 89 c0 48 0f
[138631.672489] ---[ end trace 6748d86d682915c2 ]---
[138631.695691] cache_from_obj: Wrong slab cache. fsnotify_mark_connector but
object is from kmalloc-2048(361:anycast-healthchecker.service)
---
The kernel kept running and the server remained responsive.
A coworker suggested that it may come from
https://github.com/torvalds/linux/blob/master/fs/notify/mark.c#L160
and that commit
08991e83b728 ("fsnotify: Free fsnotify_mark_connector when there is no mark
attached"), which landed in 4.12-rc1, may be the cause.
I am wondering whether it is related to the issue I reported at
https://lkml.org/lkml/2018/4/16/506, but that issue causes the whole server to
crash, which is not the case for the kernel dump above.
Cheers,
Pavlos
Hi,
On Tue 17-04-18 12:39:32, Pavlos Parissis wrote:
> On one of our production servers, running kernel version 4.14.32, I noticed
> the following:
Thanks for the report! I believe this is the same underlying problem as
the softlockups you were observing - there the linked list just gets
corrupted, which causes the list iteration to never finish. This report,
however, points to a place where there could be some use-after-free issue
(either in fsnotify or somewhere else). I'll look into the code.
Honza
> [full report and kernel dump quoted above snipped]
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue 17-04-18 12:39:32, Pavlos Parissis wrote:
> On one of our production servers, running kernel version 4.14.32, I noticed
> the following:
OK, I was looking into this for some time and couldn't find a problem in
4.14.32 code. Can you try running a kernel with CONFIG_DEBUG_SLAB and
CONFIG_DEBUG_PAGEALLOC enabled to hopefully catch the problem earlier?
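Something like this in the config should do it (if I remember the 4.14 option
names correctly; without the ENABLE_DEFAULT option you would also need
debug_pagealloc=on on the kernel command line):

```
CONFIG_DEBUG_SLAB=y
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y
```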
Thanks!
Honza
> [full report and kernel dump quoted above snipped]
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 17/04/2018 04:02 PM, Jan Kara wrote:
> On Tue 17-04-18 12:39:32, Pavlos Parissis wrote:
>> On one of our production servers, running kernel version 4.14.32, I noticed
>> the following:
>
> OK, I was looking into this for some time and couldn't find a problem in
> 4.14.32 code. Can you try running a kernel with CONFIG_DEBUG_SLAB and
> CONFIG_DEBUG_PAGEALLOC enabled to hopefully catch the problem earlier?
> Thanks!
>
I can certainly do that, but I can't reproduce that specific crash, so it may
take days before we hit a similar one. Having said that, the soft lockup issue
I mentioned in another thread happens once per day, so I hope running a
kernel with those debug settings will help you.
Cheers,
Pavlos
On Tue 17-04-18 16:15:07, Pavlos Parissis wrote:
> On 17/04/2018 04:02 μμ, Jan Kara wrote:
> > On Tue 17-04-18 12:39:32, Pavlos Parissis wrote:
> >> On one of our production servers, running kernel version 4.14.32, I noticed
> >> the following:
> >
> > OK, I was looking into this for some time and couldn't find a problem in
> > 4.14.32 code. Can you try running a kernel with CONFIG_DEBUG_SLAB and
> > CONFIG_DEBUG_PAGEALLOC enabled to hopefully catch the problem earlier?
> > Thanks!
> >
>
> I can certainly do that, but I can't reproduce that specific crash. So,
> it may take days before we get a similar crash. Having said that, the
> soft lockup issue, which I mentioned in another thread, happens once per
> day, so I hope running a kernel with those debug settings will help you.
Yes, the hope is that the debug checks will trigger before the actual data
corruption (or the softlockup) you've been hitting.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR