Hi guys,
I'm experimenting by trying to boot an allmodconfig arm64 kernel, as
mentioned here:
https://lore.kernel.org/linux-arm-kernel/[email protected]/
One thing that I noticed - it's hard to miss actually - is the amount of
complaining from KASAN about the EDAC/ghes code. Maybe this is something
I should not care about/red herring, or maybe something genuine. Let me
know what you think.
The kernel is v5.4-rc3, and I raised the EDAC mc debug level to get
extra debug prints.
Log below, Thanks,
John
Log snippet (I cut off after the first KASAN warning):
[ 70.471011][ T1] random: get_random_u32 called from
new_slab+0x360/0x698 with crng_init=0
[ 70.478671][ T1] [Firmware Bug]: APEI: Invalid bit width + offset
in GAR [0x94110034/64/0/3/0]
[ 70.526585][ T1] EDAC DEBUG: edac_mc_alloc: allocating 3524 bytes
for mci data (32 dimms, 32 csrows/channels)
[ 70.542013][ T1] EDAC DEBUG: ghes_edac_dmidecode: DIMM2:
Registered-DDR4 size = 16384 MB(ECC)
[ 70.551044][ T1] EDAC DEBUG: ghes_edac_dmidecode: type 26,
detail 0x2080, width 72(total 64)
[ 70.559986][ T1] EDAC DEBUG: edac_mc_add_mc_with_groups:
[ 70.567082][ T1] EDAC DEBUG: edac_create_sysfs_mci_device: device
mc0 created
[ 70.575608][ T1] EDAC DEBUG: edac_create_dimm_object: device dimm2
created at location memory 2
[ 70.585818][ T1] EDAC DEBUG: edac_create_csrow_object: device
csrow2 created
[ 70.594110][ T1] EDAC MC0: Giving out device to module ghes_edac.c
controller ghes_edac: DEV ghes (INTERRUPT)
[ 70.605936][ T1] EDAC DEBUG: edac_mc_del_mc:
[ 70.611188][ T1] EDAC DEBUG: edac_remove_sysfs_mci_device:
[ 70.619443][ T1] random: get_random_u32 called from
kobject_put+0x8c/0x190 with crng_init=0
[ 70.628163][ T1] kobject: 'csrow2' ((____ptrval____)):
kobject_release, parent (____ptrval____) (delayed 750)
[ 70.638477][ T1] EDAC DEBUG: edac_remove_sysfs_mci_device:
unregistering device dimm2
[ 70.647903][ T1] kobject: 'dimm2' ((____ptrval____)):
kobject_release, parent (____ptrval____) (delayed 250)
[ 70.658105][ T1] EDAC MC: Removed device 0 for ghes_edac.c
ghes_edac: DEV ghes
[ 70.665673][ T1] EDAC DEBUG: edac_mc_free:
[ 70.670211][ T1] EDAC DEBUG: edac_unregister_sysfs: unregistering
device mc0
[ 70.679027][ T1] kobject: 'mc0' ((____ptrval____)):
kobject_release, parent (____ptrval____) (delayed 500)
[ 70.690987][ T1] EDAC DEBUG: edac_mc_del_mc:
[ 70.695769][ T1] EDAC DEBUG: edac_mc_free:
[ 70.700412][ T1] ------------[ cut here ]------------
[ 70.705832][ T1] ODEBUG: free active (active state 0) object type:
timer_list hint: delayed_work_timer_fn+0x0/0x48
[ 70.716663][ T1] WARNING: CPU: 50 PID: 1 at lib/debugobjects.c:484
debug_print_object+0xec/0x130
[ 70.725721][ T1] Modules linked in:
[ 70.729491][ T1] CPU: 50 PID: 1 Comm: swapper/0 Not tainted
5.4.0-rc3+ #1146
[ 70.736811][ T1] Hardware name: Huawei D06 /D06, BIOS Hisilicon
D06 UEFI RC0 - V1.16.01 03/15/2019
[ 70.746039][ T1] pstate: 80800009 (Nzcv daif -PAN +UAO)
[ 70.746056][ T1] pc : debug_print_object+0xec/0x130
[ 70.756681][ T1] lr : debug_print_object+0xec/0x130
[ 70.756691][ T1] sp : ffff0020bf2c7740
[ 70.756699][ T1] x29: ffff0020bf2c7740 x28: ffff0023242c5000
[ 70.756715][ T1] x27: ffff0023242c5090 x26: ffffa00017543de0
[ 70.756730][ T1] x25: ffffa000101cd558 x24: ffffa00012051fc0
[ 70.756750][ T1] x23: ffffa000150d2200 x22: ffffa000120523a0
[ 70.765894][ T1] x21: ffffa00012051640 x20: 0000000000000000
[ 70.765910][ T1] x19: ffffa00015019000 x18: 00000000000025a8
[ 70.765924][ T1] x17: 00000000000025a0 x16: 00000000000026b0
[ 70.765939][ T1] x15: 0000000000001470 x14: 64203a746e696820
[ 70.765954][ T1] x13: 7473696c5f72656d x12: 1fffe00417e58e5a
[ 70.777974][ T1] x11: ffff800417e58e5a x10: dfffa00000000000
[ 70.789995][ T1] x9 : ffff800417e58e5b x8 : 0000000000000001
[ 70.790011][ T1] x7 : ffff0020bf2c72d7 x6 : ffff800417e58e5b
[ 70.790026][ T1] x5 : 1fffe00417e57936 x4 : ffff0020bf2bc058
[ 70.790041][ T1] x3 : ffffa00010000000 x2 : ffff800417e58eb0
[ 70.790055][ T1] x1 : f8aafc30f531b000 x0 : 0000000000000000
[ 70.802080][ T1] Call trace:
[ 70.802093][ T1] debug_print_object+0xec/0x130
[ 70.802106][ T1] __debug_check_no_obj_freed+0x114/0x290
[ 70.802119][ T1] debug_check_no_obj_freed+0x18/0x28
[ 70.802130][ T1] slab_free_freelist_hook+0x18c/0x228
[ 70.802140][ T1] kfree+0x264/0x420
[ 70.802157][ T1] _edac_mc_free+0x6c/0x210
[ 70.814163][ T1] edac_mc_free+0x68/0x88
[ 70.814177][ T1] ghes_edac_unregister+0x44/0x70
[ 70.814193][ T1] ghes_remove+0x274/0x2a0
[ 70.814207][ T1] platform_drv_remove+0x44/0x78
[ 70.814217][ T1] really_probe+0x404/0x840
[ 70.814228][ T1] driver_probe_device+0x190/0x1f0
[ 70.814239][ T1] device_driver_attach+0x7c/0xb0
[ 70.814249][ T1] __driver_attach+0x1b8/0x1d0
[ 70.814261][ T1] bus_for_each_dev+0xf8/0x190
[ 70.814277][ T1] driver_attach+0x34/0x40
[ 70.826289][ T1] bus_add_driver+0x1d8/0x340
[ 70.826301][ T1] driver_register+0x168/0x1e8
[ 70.826312][ T1] __platform_driver_register+0x80/0x90
[ 70.826326][ T1] ghes_init+0xc4/0x174
[ 70.826338][ T1] do_one_initcall+0x328/0x788
[ 70.826356][ T1] kernel_init_freeable+0x2fc/0x3d4
[ 70.838361][ T1] kernel_init+0x18/0x178
[ 70.838373][ T1] ret_from_fork+0x10/0x18
[ 70.838381][ T1] irq event stamp: 4398006
[ 70.838394][ T1] hardirqs last enabled at (4398005):
[<ffffa000100c0e78>] el1_irq+0x138/0x200
[ 70.838409][ T1] hardirqs last disabled at (4398006):
[<ffffa000100fd884>] debug_exception_enter+0x8c/0x190
[ 70.838422][ T1] softirqs last enabled at (4398004):
[<ffffa000100bf4a4>] __do_softirq+0x894/0x920
[ 70.838439][ T1] softirqs last disabled at (4397997):
[<ffffa000101965e4>] irq_exit+0x114/0x1a0
[ 70.875171][ T1] ---[ end trace a9b7b2cbbb0f7263 ]---
[ 70.885805][ T1] ------------[ cut here ]------------
[ 70.892929][ T1] ODEBUG: free active (active state 0) object type:
timer_list hint: delayed_work_timer_fn+0x0/0x48
[ 70.907197][ T1] WARNING: CPU: 50 PID: 1 at lib/debugobjects.c:484
debug_print_object+0xec/0x130
[ 70.916349][ T1] Modules linked in:
[ 70.916368][ T1] CPU: 50 PID: 1 Comm: swapper/0 Tainted: G
W 5.4.0-rc3+ #1146
[ 70.916378][ T1] Hardware name: Huawei D06 /D06, BIOS Hisilicon
D06 UEFI RC0 - V1.16.01 03/15/2019
[ 70.916388][ T1] pstate: 80800009 (Nzcv daif -PAN +UAO)
[ 70.916400][ T1] pc : debug_print_object+0xec/0x130
[ 70.916412][ T1] lr : debug_print_object+0xec/0x130
[ 70.916424][ T1] sp : ffff0020bf2c7740
[ 70.925916][ T1] x29: ffff0020bf2c7740 x28: ffff00232427a000
[ 70.925933][ T1] x27: ffff00232427a090 x26: ffffa00017543de0
[ 70.925948][ T1] x25: ffffa000101cd558 x24: ffffa00012051fc0
[ 70.925963][ T1] x23: ffffa000150d2200 x22: ffffa000120523a0
[ 70.971505][ T1] x21: ffffa00012051640 x20: 0000000000000000
[ 70.984654][ T1] x19: ffffa00015019000 x18: 00000000000025a8
[ 70.984671][ T1] x17: 00000000000025a0 x16: 00000000000026b0
[ 70.984685][ T1] x15: 0000000000001470 x14: 726f775f64657961
[ 70.984701][ T1] x13: 6c6564203a746e69 x12: 1fffe00417e58e5a
[ 71.004012][ T1] x11: ffff800417e58e5a x10: dfffa00000000000
[ 71.004028][ T1] x9 : ffff800417e58e5b x8 : 0000000000000001
[ 71.004043][ T1] x7 : ffff0020bf2c72d7 x6 : ffff800417e58e5b
[ 71.004058][ T1] x5 : 1fffe00417e57936 x4 : ffff0020bf2bc058
[ 71.034246][ T1] x3 : ffffa00010000000 x2 : ffff800417e58eb0
[ 71.047049][ T1] x1 : f8aafc30f531b000 x0 : 0000000000000000
[ 71.047065][ T1] Call trace:
[ 71.047078][ T1] debug_print_object+0xec/0x130
[ 71.047090][ T1] __debug_check_no_obj_freed+0x114/0x290
[ 71.047103][ T1] debug_check_no_obj_freed+0x18/0x28
[ 71.047114][ T1] slab_free_freelist_h T1] edac_mc_free+0x68/0x88
[ 71.065065][ T1] ghes_edac_unregister+0x44/0x70
[ 71.065077][ T1] ghes_remove+0x274/0x2a0
[ 71.065088][ T1] platform_drv_remove+0x44/0x78
[ 71.065099][ T1] really_probe+0x404/0x840
[ 71.065112][ T1] driver_probe_device+0x190/0x1f0
[ 71.132887][ T1] device_driver_attach+0x7c/0xb0
[ 71.132898][ T1] __driver_attach+0x1b8/0x1d0
[ 71.132911][ T1] bus_for_each_dev+0xf8/0x190
[ 71.132921][ T1] driver_attach+0x34/0x40
[ 71.132931][ T1] bus_add_driver+0x1d8/0x340
[ 71.132942][ T1] driver_register+0x168/0x1e8
[ 71.132953][ T1] __platform_driver_register+0x80/0x90
[ 71.132964][ T1] ghes_init+0xc4/0x174
[ 71.132975][ T1] do_one_initcall+0x328/0x788
[ 71.132989][ T1] kernel_init_freeable+0x2fc/0x3d4
[ 71.144995][ T1] kernel_init+0x18/0x178
[ 71.145006][ T1] ret_from_fork+0x10/0x18
[ 71.145015][ T1] irq event stamp: 4398362
[ 71.145027][ T1] hardirqs last enabled at (4398361):
[<ffffa000100c0e78>] el1_irq+0x138/0x200
[ 71.145042][ T1] hardirqs last disabled at (4398362):
[<ffffa000100fd884>] debug_exception_enter+0x8c/0x190
[ 71.145056][ T1] softirqs last enabled at (4398360):
[<irq_exit+0x114/0x1a0
[ 71.157069][ T1] ---[ end trace a9b7b2cbbb0f7264 ]---
[ 71.158439][ T1] ------------[ cut here ]------------
[ 71.194319][ T1] ODEBUG: free active (active state 0) object type:
timer_list hint: delayed_work_timer_fn+0x0/0x48
[ 71.203588][ T1] WARNING: CPU: 50 PID: 1 at lib/debugobjects.c:484
debug_print_object+0xec/0x130
[ 71.212094][ T1] Modules linked in:
[ 71.212112][ T1] CPU: 50 PID: 1 Comm: swapper/0 Tainted: G
W 5.4.0-rc3+ #1146
[ 71.212121][ T1] Hardware name: Huawei D06 /D06, BIOS Hisilicon
D06 UEFI RC0 - V1.16.01 03/15/2019
[ 71.212131][ T1] pstate: 80800009 (Nzcv daif -PAN +UAO)
[ 71.212144][ T1] pc : debug_print_object+0xec/0x130
[ 71.212158][ T1] lr : debug_print_object+0xec/0x130
[ 71.224447][ T830] kobject: 'brcm-gisb-arb' ((____ptrval____)):
kobject_cleanup, parent (____ptrval____)
[ 71.226086][ T1] sp : ffff0020bf2c7740
[ 71.226099][ T1] x29: ffff0020bf2c7740 x28: ffff002324274000
[ 71.230557][ T830] kobject: 'brcm-gisb-arb' ((____ptrval____)): auto
cleanup 'remove' event
[ 71.235419][ T1] x27: ffff002324274090 x26: ffffa00017543de0
[ 71.235435][ T1] x25: ffffa000101cd558 x24: ffffa00012051fc0
[ 71.235450][ T1] x23: ffffa000150d2200 x22: ffffa000120523a0
[ 71.235465][ T1] x21: ffffa00012051640 x20: 0000000000000000
[ 71.240402][ T830] kobject: 'brcm-gisb-arb' ((____ptrval____)):
kobject_uevent_env
[ 71.244968][ T1] x19: ffffa00015019000 x18: 00000000000025a8
[ 71.244984][ T1] x17: 00000000000025a0 x16: 00000000000026b0
[ 71.244999][ T1] x15: 0000000000001470 x14: 726f775f64657961
[ 71.245014][ T1] x13: 6c6564203a746e69 x12: 1fffe00417e58e5a
[ 71.249837][ T830] kobject: 'brcm-gisb-arb' ((____ptrval____)):
fill_kobj_path: path = '/bus/platform/drivers/brcm-gisb-arb'
[ 71.253908][ T1] x11: ffff800417e58e5a x10: dfffa00000000000
[ 71.253925][ T1] x9 : ffff800417e58e5b x8 : 0000000000000001
[ 71.253939][ T1] x7 : ffff0020bf2c72d7 x6 : ffff800417e58e5b
[ 71.253954][ T1] x5 : 1fffe00417e57936 x4 : ffff0020bf2bc058
[ 71.256447][ T832] kobject: 'wakeup40' ((____ptrval____)):
kobject_cleanup, parent (____ptrval____)
[ 71.256466][ T832] kobject: 'wakeup40' ((____ptrval____)): calling
ktype release
[ 71.256516][ T832] kobject: 'wakeup40': free name
[ 71.258600][ T830] kobject: 'brcm-gisb-arb' ((____ptrval____)): auto
cleanup kobject_del
[ 71.263109][ T1] x3 : ffffa00010000000 x2 : ffff800417e58eb0
[ 71.263125][ T1] x1 : f8aafc30f531b000 x0 : 0000000000000000
[ 71.263139][ T1] Call trace:
[ 71.263152][ T1] debug_print_object+0xec/0x130
[ 71.263169][ T1] __debug_check_no_obj_freed+0x114/0x290
[ 71.268667][ T830] kobject: 'brcm-gisb-arb' ((____ptrval____)):
calling ktype release
[ 71.272574][ T1] debug_check_no_obj_freed+0x18/0x28
[ 71.272586][ T1] slab_free_freelist_hook+0x18c/0x228
[ 71.272596][ T1] kfree+0x264/0x420
[ 71.272608][ T1] _edac_mc_free+0x1f8/0x210
[ 71.272619][ T1] edac_mc_free+0x68/0x88
[ 71.272632][ T1] ghes_edac_unregister+0x44/0x70
[ 71.277292][ T830] driver: 'brcm-gisb-arb': driver_release
[ 71.282298][ T1] ghes_remove+0x274/0x2a0
[ 71.282310][ T1] platform_drv_remove+0x44/0x78
[ 71.282321][ T1] really_probe+0x404/0x840
[ 71.282331][ T1] driver_probe_device+0x190/0x1f0
[ 71.282342][ T1] device_driver_attach+0x7c/0xb0
[ 71.282352][ T1] __driver_attach+0x1b8/0x1d0
[ 71.282368][ T1] bus_for_each_dev+0xf8/0x190
[ 71.286608][ T830] kobject: 'brcm-gisb-arb': free name
[ 71.290816][ T1] driver_attach+0x34/0x40
[ 71.290826][ T1] bus_add_driver+0x1d8/0x340
[ 71.290838][ T1] driver_register+0x168/0x1e8
[ 71.290849][ T1] __platform_driver_register+0x80/0x90
[ 71.290859][ T1] ghes_init+0xc4/0x174
[ 71.290872][ T1] do_one_initcall+0x328/0x788
[ 71.320457][ T833] kobject: 'wakeup' ((____ptrval____)):
kobject_cleanup, parent (____ptrval____)
[ 71.323307][ T1] kernel_init_freeable+0x2fc/0x3d4
[ 71.323324][ T1] kernel_init+0x18/0x178
[ 71.332431][ T833] kobject: 'wakeup' ((____ptrval____)): calling
ktype release
[ 71.337592][ T1] ret_from_fork+0x10/0x18
[ 71.337601][ T1] irq event stamp: 4399038
[ 71.337613][ T1] hardirqs last enabled at (4399037):
[<ffffa000100c0e78>] el1_irq+0x138/0x200
[ 71.337627][ T1] hardirqs last disabled at (4399038):
[<ffffa000100fd884>] debug_exception_enter+0x8c/0x190
[ 71.337640][ T1] softirqs last enabled at (4399036):
[<ffffa000100bf4a4>] __do_softirq+0x894/0x920
[ 71.337655][ T1] softirqs last disabled at (4399029):
[<ffffa000101965e4>] irq_exit+0x114/0x1a0
[ 71.343025][ T833] kobject: 'wakeup': free name
[ 71.352445][ T834] kobject: 'stmpe-pwm' ((____ptrval____)):
kobject_cleanup, parent (____ptrval____)
[ 71.352463][ T834] kobject: 'stmpe-pwm' ((____ptrval____)): auto
cleanup 'remove' event
[ 71.352481][ T834] kobject: 'stmpe-pwm' ((____ptrval____)):
kobject_uevent_env
[ 71.352587][ T834] kobject: 'stmpe-pwm' ((____ptrval____)):
fill_kobj_path: path = '/bus/platform/drivers/stmpe-pwm'
[ 71.352645][ T834] kobject: 'stmpe-pwm' ((____ptrval____)): auto
cleanup kobject_del
[ 71.352713][ T834] kobject: 'stmpe-pwm' ((____ptrval____)): calling
ktype release
[ 71.352730][ T834] driver: 'stmpe-pwm': driver_release
[ 71.352763][ T834] kobject: 'stmpe-pwm': free name
[ 71.353566][ T1] ---[ end trace a9b7b2cbbb0f7265 ]---
[ 71.353899][ T1] GHES GHES.1: no default pinctrl state
[ 71.384529][ T851] kobject: 'wakeup15' ((____ptrval____)):
kobject_cleanup, parent (____ptrval____)
[ 71.384654][ T848] kobject: 'wakeup' ((____ptrval____)):
kobject_cleanup, parent (____ptrval____)
[ 71.386131][ T1] driver: 'GHES': driver_bound: bound to device
'GHES.1'
[ 71.386163][ T1] kobject: 'GHES.1' ((____ptrval____)):
kobject_uevent_env
[ 71.386272][ T1] kobject: 'GHES.1' ((____ptrval____)):
fill_kobj_path: path = '/devices/platform/GHES.1'
[ 71.386334][ T1] bus: 'platform': really_probe: bound device
GHES.1 to driver GHES
[ 71.386378][ T1] bus: 'platform': driver_probe_device: matched
device GHES.2 with driver GHES
[ 71.386410][ T1] bus: 'platform': really_probe: probing driver
GHES with device GHES.2
[ 71.386512][ T1] GHES GHES.2: no default pinctrl state
[ 71.390169][ T851] kobject: 'wakeup15' ((____ptrval____)): calling
ktype release
[ 71.395406][ T848] kobject: 'wakeup' ((____ptrval____)): calling
ktype release
[ 71.395681][ T1]
==================================================================
[ 71.395716][ T1] BUG: KASAN: use-after-free in
ghes_edac_unregister+0x28/0x70
[ 71.395728][ T1] Read of size 8 at addr ffff002324274bdc by task
swapper/0/1
[ 71.395735][ T1]
[ 71.395749][ T1] CPU: 48 PID: 1 Comm: swapper/0 Tainted: G
W 5.4.0-rc3+ #1146
[ 71.395759][ T1] Hardware name: Huawei D06 /D06, BIOS Hisilicon
D06 UEFI RC0 - V1.16.01 03/15/2019
[ 71.395768][ T1] Call trace:
[ 71.395780][ T1] dump_backtrace+0x0/0x298
[ 71.395790][ T1] show_stack+0x20/0x30
[ 71.395802][ T1] dump_stack+0x190/0x21c
[ 71.395815][ T1] print_address_description.isra.6+0x80/0x3d0
[ 71.395827][ T1] __kasan_report+0x174/0x23c
[ 71.395838][ T1] kasan_report+0xc/0x18
[ 71.395849][ T1] __asan_load8+0xa4/0xb0
[ 71.395861][ T1] ghes_edac_unregister+0x28/0x70
[ 71.395873][ T1] ghes_remove+0x274/0x2a0
[ 71.395884][ T1] platform_drv_remove+0x44/0x78
[ 71.395895][ T1] really_probe+0x404/0x840
[ 71.395905][ T1] driver_probe_device+0x190/0x1f0
[ 71.395916][ T1] device_driver_attach+0x7c/0xb0
[ 71.395927][ T1] __driver_attach+0x1b8/0x1d0
[ 71.395939][ T1] bus_for_each_dev+0xf8/0x190
[ 71.395949][ T1] driver_attach+0x34/0x40
[ 71.395960][ T1] bus_add_driver+0x1d8/0x340
[ 71.395970][ T1] driver_register+0x168/0x1e8
[ 71.395982][ T1] __platform_driver_register+0x80/0x90
[ 71.395993][ T1] ghes_init+0xc4/0x174
[ 71.396004][ T1] do_one_initcall+0x328/0x788
[ 71.396017][ T1] kernel_init_freeable+0x2fc/0x3d4
[ 71.396028][ T1] kernel_init+0x18/0x178
[ 71.396039][ T1] ret_from_fork+0x10/0x18
[ 71.396047][ T1]
[ 71.396056][ T1] Allocated by task 1:
[ 71.396068][ T1] save_stack+0x28/0xb0
[ 71.396080][ T1] __kasan_kmalloc.isra.9+0xa0/0xc8
[ 71.396091][ T1] kasan_kmalloc+0xc/0x18
[ 71.396102][ T1] __kmalloc+0x2d0/0x338
[ 71.396114][ T1] edac_mc_alloc+0xaa8/0xb18
[ 71.396125][ T1] ghes_edac_register+0x164/0x398
[ 71.396137][ T1] ghes_probe+0x648/0x6d8
[ 71.396148][ T1] platform_drv_probe+0x8c/0x110
[ 71.396159][ T1] really_probe+0x32c/0x840
[ 71.396170][ T1] driver_probe_device+0x190/0x1f0
[ 71.396181][ T1] device_driver_attach+0x7c/0xb0
[ 71.396192][ T1] __driver_attach+0x1b8/0x1d0
[ 71.396203][ T1] bus_for_each_dev+0xf8/0x190
[ 71.396214][ T1] driver_attach+0x34/0x40
[ 71.396224][ T1] bus_add_driver+0x1d8/0x340
[ 71.396235][ T1] driver_register+0x168/0x1e8
[ 71.396247][ T1] __platform_driver_register+0x80/0x90
[ 71.396257][ T1] ghes_init+0xc4/0x174
[ 71.396268][ T1] do_one_initcall+0x328/0x788
[ 71.396281][ T1] kernel_init_freeable+0x2fc/0x3d4
[ 71.396292][ T1] kernel_init+0x18/0x178
[ 71.396303][ T1] ret_from_fork+0x10/0x18
[ 71.396310][ T1]
[ 71.396318][ T1] Freed by task 1:
[ 71.396330][ T1] save_stack+0x28/0xb0
[ 71.396341][ T1] __kasan_slab_free+0x140/0x170
[ 71.396353][ T1] kasan_slab_free+0x10/0x18
[ 71.396364][ T1] slab_free_freelist_hook+0x19c/0x228
[ 71.396375][ T1] kfree+0x264/0x420
[ 71.396386][ T1] _edac_mc_free+0x1f8/0x210
[ 71.396398][ T1] edac_mc_free+0x68/0x88
[ 71.396409][ T1] ghes_edac_unregister+0x44/0x70
[ 71.396420][ T1] ghes_remove+0x274/0x2a0
[ 71.396432][ T1] platform_drv_remove+0x44/0x78
[ 71.396442][ T1] really_probe+0x404/0x840
[ 71.396453][ T1] driver_probe_device+0x190/0x1f0
[ 71.396464][ T1] device_driver_attach+0x7c/0xb0
[ 71.396475][ T1] __driver_attach+0x1b8/0x1d0
[ 71.396487][ T1] bus_for_each_dev+0xf8/0x190
[ 71.396497][ T1] driver_attach+0x34/0x40
[ 71.396508][ T1] bus_add_driver+0x1d8/0x340
[ 71.396519][ T1] driver_register+0x168/0x1e8
[ 71.396530][ T1] __platform_driver_register+0x80/0x90
[ 71.396541][ T1] ghes_init+0xc4/0x174
[ 71.396552][ T1] do_one_initcall+0x328/0x788
[ 71.396564][ T1] kernel_init_freeable+0x2fc/0x3d4
[ 71.396575][ T1] kernel_init+0x18/0x178
[ 71.396586][ T1] ret_from_fork+0x10/0x18
[ 71.396593][ T1]
[ 71.396604][ T1] The buggy address belongs to the object at
ffff002324274000
[ 71.396604][ T1] which belongs to the cache kmalloc-4k of size 4096
[ 71.396615][ T1] The buggy address is located 3036 bytes inside of
[ 71.396615][ T1] 4096-byte region [ffff002324274000,
ffff002324275000)
[ 71.396624][ T1] The buggy address belongs to the page:
[ 71.396637][ T1] page:fffffe008c709c00 refcount:1 mapcount:0
mapping:ffff0020bfc16980 index:0x0 compound_mapcount: 0
[ 71.396655][ T1] flags: 0x1ffff00000010200(slab|head)
[ 71.396671][ T1] raw: 1ffff00000010200 fffffe008c709a08
fffffe008c70c408 ffff0020bfc16980
[ 71.396685][ T1] raw: 0000000000000000 0000000000020002
00000001ffffffff 0000000000000000
[ 71.396693][ T1] page dumped because: kasan: bad access detected
[ 71.396701][ T1]
[ 71.396709][ T1] Memory state around the buggy address:
[ 71.396721][ T1] ffff002324274a80: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 71.396732][ T1] ffff002324274b00: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 71.396743][ T1] >ffff002324274b80: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 71.396751][ T1] ^
[ 71.396762][ T1] ffff002324274c00: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 71.396773][ T1] ffff002324274c80: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 71.396781][ T1]
==================================================================
[ 71.396789][ T1] Disabling lock debugging due to kernel taint
[ 71.396834][ T1] EDAC DEBUG: edac_mc_del_mc:
[ 71.396846][ T1] EDAC DEBUG: edac_mc_free:
[ 71.396866][ T1]
==================================================================
[ 71.396874][ T1] BUG: KASAN: double-free or invalid-free in
kfree+0x264/0x420
[ 71.396877][ T1]
[ 71.396886][ T1] CPU: 48 PID: 1 Comm: swapper/0 Tainted: G B
W 5.4.0-rc3+ #1146
[ 71.396891][ T1] Hardware name: Huawei D06 /D06, BIOS Hisilicon
D06 UEFI RC0 - V1.16.01 03/15/2019
[ 71.396895][ T1] Call trace:
[ 71.396902][ T1] dump_backtrace+0x0/0x298
[ 71.396909][ T1] show_stack+0x20/0x30
[ 71.396915][ T1] dump_stack+0x190/0x21c
[ 71.396923][ T1] print_address_description.isra.6+0x80/0x3d0
[ 71.396931][ T1] kasan_report_invalid_free+0x78/0xa0
[ 71.396939][ T1] __kasan_slab_free+0xbc/0x170
[ 71.396946][ T1] kasan_slab_free+0x10/0x18
[ 71.396953][ T1] slab_free_freelist_hook+0x19c/0x228
[ 71.396959][ T1] kfree+0x264/0x420
[ 71.396967][ T1] _edac_mc_free+0x6c/0x210
[ 71.396974][ T1] edac_mc_free+0x68/0x88
[ 71.396981][ T1] ghes_edac_unregister+0x44/0x70
[ 71.396989][ T1] ghes_remove+0x274/0x2a0
[ 71.396996][ T1] platform_drv_remove+0x44/0x78
[ 71.397002][ T1] really_probe+0x404/0x840
[ 71.397009][ T1] driver_probe_device+0x190/0x1f0
[ 71.397016][ T1] device_driver_attach+0x7c/0xb0
[ 71.397022][ T1] __driver_attach+0x1b8/0x1d0
[ 71.397030][ T1] bus_for_each_dev+0xf8/0x190
[ 71.397037][ T1] driver_attach+0x34/0x40
[ 71.397043][ T1] bus_add_driver+0x1d8/0x340
[ 71.397049][ T1] driver_register+0x168/0x1e8
[ 71.397057][ T1] __platform_driver_register+0x80/0x90
[ 71.397063][ T1] ghes_init+0xc4/0x174
[ 71.397070][ T1] do_one_initcall+0x328/0x788
[ 71.397078][ T1] kernel_init_freeable+0x2fc/0x3d4
[ 71.397085][ T1] kernel_init+0x18/0x178
[ 71.397092][ T1] ret_from_fork+0x10/0x18
[ 71.397096][ T1]
[ 71.397100][ T1] Allocated by task 1:
[ 71.397108][ T1] save_stack+0x28/0xb0
[ 71.397116][ T1] __kasan_kmalloc.isra.9+0xa0/0xc8
[ 71.397123][ T1] kasan_kmalloc+0xc/0x18
[ 71.397130][ T1] kmem_cache_alloc_trace+0x2a0/0x2e8
[ 71.397138][ T1] edac_mc_alloc+0x5d4/0xb18
[ 71.397145][ T1] ghes_edac_register+0x164/0x398
[ 71.397152][ T1] ghes_probe+0x648/0x6d8
[ 71.397160][ T1] platform_drv_probe+0x8c/0x110
[ 71.397166][ T1] really_probe+0x32c/0x840
[ 71.397173][ T1] driver_probe_device+0x190/0x1f0
[ 71.397180][ T1] device_driver_attach+0x7c/0xb0
[ 71.397186][ T1] __driver_attach+0x1b8/0x1d0
[ 71.397194][ T1] bus_for_each_dev+0xf8/0x190
[ 71.397201][ T1] driver_attach+0x34/0x40
[ 71.397207][ T1] bus_add_driver+0x1d8/0x340
[ 71.397213][ T1] driver_register+0x168/0x1e8
[ 71.397221][ T1] __platform_driver_register+0x80/0x90
[ 71.397227][ T1] ghes_init+0xc4/0x174
[ 71.397235][ T1] do_one_initcall+0x328/0x788
[ 71.397243][ T1] kernel_init_freeable+0x2fc/0x3d4
[ 71.397250][ T1] kernel_init+0x18/0x178
[ 71.397257][ T1] ret_from_fork+0x10/0x18
[ 71.397260][ T1]
[ 71.397264][ T1] Freed by task 1:
[ 71.397272][ T1] save_stack+0x28/0xb0
[ 71.397279][ T1] __kasan_slab_free+0x140/0x170
[ 71.397286][ T1] kasan_slab_free+0x10/0x18
[ 71.397294][ T1] slab_free_freelist_hook+0x19c/0x228
[ 71.397300][ T1] kfree+0x264/0x420
[ 71.397307][ T1] _edac_mc_free+0x15c/0x210
[ 71.397315][ T1] edac_mc_free+0x68/0x88
[ 71.397322][ T1] ghes_edac_unregister+0x44/0x70
[ 71.397329][ T1] ghes_remove+0x274/0x2a0
[ 71.397337][ T1] platform_drv_remove+0x44/0x78
[ 71.397343][ T1] really_probe+0x404/0x840
[ 71.397350][ T1] driver_probe_device+0x190/0x1f0
[ 71.397357][ T1] device_driver_attach+0x7c/0xb0
[ 71.397363][ T1] __driver_attach+0x1b8/0x1d0
[ 71.397371][ T1] bus_for_each_dev+0xf8/0x190
[ 71.397377][ T1] driver_attach+0x34/0x40
[ 71.397384][ T1] bus_add_driver+0x1d8/0x340
[ 71.397391][ T1] driver_register+0x168/0x1e8
[ 71.397398][ T1] __platform_driver_register+0x80/0x90
[ 71.397404][ T1] ghes_init+0xc4/0x174
[ 71.397411][ T1] do_one_initcall+0x328/0x788
[ 71.397419][ T1] kernel_init_freeable+0x2fc/0x3d4
[ 71.397427][ T1] kernel_init+0x18/0x178
[ 71.397433][ T1] ret_from_fork+0x10/0x18
[ 71.397437][ T1]
[ 71.397443][ T1] The buggy address belongs to the object at
ffff0023245a9200
[ 71.397443][ T1] which belongs to the cache kmalloc-128 of size 128
[ 71.397451][ T1] The buggy address is located 0 bytes inside of
[ 71.397451][ T1] 128-byte region [ffff0023245a9200, ffff0023245a9280)
[ 71.397455][ T1] The buggy address belongs to the page:
[ 71.397462][ T1] page:fffffe008c716a00 refcount:1 mapcount:0
mapping:ffff0020bfc10580 index:0xffff0023245ada80 compound_mapcount: 0
[ 71.397471][ T1] flags: 0x1ffff00000010200(slab|head)
[ 71.397482][ T1] raw: 1ffff00000010200 fffffe008c716808
fffffe008c70a008 ffff0020bfc10580
[ 71.397492][ T1] raw: ffff0023245ada80 0000000000330016
00000001ffffffff 0000000000000000
[ 71.397496][ T1] page dumped because: kasan: bad access detected
[ 71.397499][ T1]
[ 71.397503][ T1] Memory state around the buggy address:
[ 71.397510][ T1] ffff0023245a9100: fc fc fc fc fc fc fc fc fc fc
fc fc fc fc fc fc
[ 71.397517][ T1] ffff0023245a9180: fc fc fc fc fc fc fc fc fc fc
fc fc fc fc fc fc
[ 71.397523][ T1] >ffff0023245a9200: fb fb fb fb fb fb fb fb fb fb
fb fb fb fb fb fb
[ 71.397527][ T1] ^
[ 71.397534][ T1] ffff0023245a9280: fc fc fc fc fc fc fc fc fc fc
fc fc fc fc fc fc
[ 71.397541][ T1] ffff0023245a9300: fc fc fc fc fc fc fc fc fc fc
fc fc fc fc fc fc
[ 71.397545][ T1]
==================================================================
On Mon, Oct 14, 2019 at 04:18:49PM +0100, John Garry wrote:
> Hi guys,
>
> I'm experimenting by trying to boot an allmodconfig arm64 kernel, as
> mentioned here:
> https://lore.kernel.org/linux-arm-kernel/[email protected]/
>
> One thing that I noticed - it's hard to miss actually - is the amount of
> complaining from KASAN about the EDAC/ghes code. Maybe this is something I
> should not care about/red herring, or maybe something genuine. Let me know
> what you think.
>
> The kernel is v5.4-rc3, and I raised the EDAC mc debug level to get extra
> debug prints.
>
> Log below, Thanks,
> John
> Log snippet (I cut off after the first KASAN warning):
>
> [ 70.471011][ T1] random: get_random_u32 called from new_slab+0x360/0x698 with crng_init=0
> [ 70.478671][ T1] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0x94110034/64/0/3/0]
> [ 70.526585][ T1] EDAC DEBUG: edac_mc_alloc: allocating 3524 bytes for mci data (32 dimms, 32 csrows/channels)
> [ 70.542013][ T1] EDAC DEBUG: ghes_edac_dmidecode: DIMM2: Registered-DDR4 size = 16384 MB(ECC)
> [ 70.551044][ T1] EDAC DEBUG: ghes_edac_dmidecode: type 26, detail 0x2080, width 72(total 64)
> [ 70.559986][ T1] EDAC DEBUG: edac_mc_add_mc_with_groups:
> [ 70.567082][ T1] EDAC DEBUG: edac_create_sysfs_mci_device: device mc0 created
> [ 70.575608][ T1] EDAC DEBUG: edac_create_dimm_object: device dimm2 created at location memory 2
> [ 70.585818][ T1] EDAC DEBUG: edac_create_csrow_object: device csrow2 created
> [ 70.594110][ T1] EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
> [ 70.605936][ T1] EDAC DEBUG: edac_mc_del_mc:
> [ 70.611188][ T1] EDAC DEBUG: edac_remove_sysfs_mci_device:
> [ 70.619443][ T1] random: get_random_u32 called from kobject_put+0x8c/0x190 with crng_init=0
> [ 70.628163][ T1] kobject: 'csrow2' ((____ptrval____)): kobject_release, parent (____ptrval____) (delayed 750)
> [ 70.638477][ T1] EDAC DEBUG: edac_remove_sysfs_mci_device: unregistering device dimm2
> [ 70.647903][ T1] kobject: 'dimm2' ((____ptrval____)): kobject_release, parent (____ptrval____) (delayed 250)
> [ 70.658105][ T1] EDAC MC: Removed device 0 for ghes_edac.c ghes_edac: DEV ghes
> [ 70.665673][ T1] EDAC DEBUG: edac_mc_free:
> [ 70.670211][ T1] EDAC DEBUG: edac_unregister_sysfs: unregistering device mc0
> [ 70.679027][ T1] kobject: 'mc0' ((____ptrval____)): kobject_release, parent (____ptrval____) (delayed 500)
> [ 70.690987][ T1] EDAC DEBUG: edac_mc_del_mc:
> [ 70.695769][ T1] EDAC DEBUG: edac_mc_free:
> [ 70.700412][ T1] ------------[ cut here ]------------
> [ 70.705832][ T1] ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x48
> [ 70.716663][ T1] WARNING: CPU: 50 PID: 1 at lib/debugobjects.c:484 debug_print_object+0xec/0x130
If I am parsing these unwrapped messages correctly (btw, pls use another
mail client for pasting log lines - thunderbird is usually ok but I
guess you need to configure it properly), that must be some workqueue
object of sorts.
Now, ghes_edac doesn't init the workqueue:
[ 70.594110][ T1] EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
as it is in interrupt mode.
So the only other workqueue I see is that "delayed XXX" stuff which is in
kobject_release().
AFAICT.
Do you have CONFIG_DEBUG_KOBJECT_RELEASE enabled and if so, does the
warning go away if you disable it?
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
Hi John,
On 14/10/2019 16:18, John Garry wrote:
> I'm experimenting by trying to boot an allmodconfig arm64 kernel, as mentioned here:
Crumbs!
> One thing that I noticed - it's hard to miss actually - is the amount of complaining from
> KASAN about the EDAC/ghes code. Maybe this is something I should not care about/red
> herring, or maybe something genuine. Let me know what you think.
Hmmm, I thought I tested this recently...
> Log snippet (I cut off after the first KASAN warning):
>
> [ 70.471011][ T1] random: get_random_u32 called from new_slab+0x360/0x698 with
> crng_init=0
> [ 70.478671][ T1] [Firmware Bug]: APEI: Invalid bit width + offset in GAR
> [0x94110034/64/0/3/0]
(this one's for you right?)
> [ 70.700412][ T1] ------------[ cut here ]------------
> [ 70.802080][ T1] Call trace:
> [ 70.802093][ T1] debug_print_object+0xec/0x130
> [ 70.802106][ T1] __debug_check_no_obj_freed+0x114/0x290
> [ 70.802119][ T1] debug_check_no_obj_freed+0x18/0x28
> [ 70.802130][ T1] slab_free_freelist_hook+0x18c/0x228
> [ 70.802140][ T1] kfree+0x264/0x420
> [ 70.802157][ T1] _edac_mc_free+0x6c/0x210
> [ 70.814163][ T1] edac_mc_free+0x68/0x88
> [ 70.814177][ T1] ghes_edac_unregister+0x44/0x70
> [ 70.814193][ T1] ghes_remove+0x274/0x2a0
Ugh. This must be the test driver remove thing.
I've reproduced this, but had to remove the parent GHES twice. It looks like it tries to
use the first ghes_edac global variables when freeing the second. ghes_init prevents it
from re-allocating over the top.
The below diff fixes it for me. (I'll post it as a proper patch once I've done the
archaeology)
-----------%<-----------
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index d413a0bdc9ad..955b59b6aade 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -554,6 +554,7 @@ void ghes_edac_unregister(struct ghes *ghes)
return;
mci = ghes_pvt->mci;
+ ghes_pvt = NULL;
edac_mc_del_mc(mci->pdev);
edac_mc_free(mci);
}
-----------%<-----------
Thanks!
James
On 14/10/2019 17:09, Borislav Petkov wrote:
> On Mon, Oct 14, 2019 at 04:18:49PM +0100, John Garry wrote:
>> Hi guys,
>>
>> I'm experimenting by trying to boot an allmodconfig arm64 kernel, as
>> mentioned here:
>> https://lore.kernel.org/linux-arm-kernel/[email protected]/
>>
>> One thing that I noticed - it's hard to miss actually - is the amount of
>> complaining from KASAN about the EDAC/ghes code. Maybe this is something I
>> should not care about/red herring, or maybe something genuine. Let me know
>> what you think.
>>
>> The kernel is v5.4-rc3, and I raised the EDAC mc debug level to get extra
>> debug prints.
>>
>> Log below, Thanks,
>> John
>> Log snippet (I cut off after the first KASAN warning):
>>
>> [ 70.471011][ T1] random: get_random_u32 called from new_slab+0x360/0x698 with crng_init=0
>> [ 70.478671][ T1] [Firmware Bug]: APEI: Invalid bit width + offset in GAR [0x94110034/64/0/3/0]
>> [ 70.526585][ T1] EDAC DEBUG: edac_mc_alloc: allocating 3524 bytes for mci data (32 dimms, 32 csrows/channels)
>> [ 70.542013][ T1] EDAC DEBUG: ghes_edac_dmidecode: DIMM2: Registered-DDR4 size = 16384 MB(ECC)
>> [ 70.551044][ T1] EDAC DEBUG: ghes_edac_dmidecode: type 26, detail 0x2080, width 72(total 64)
>> [ 70.559986][ T1] EDAC DEBUG: edac_mc_add_mc_with_groups:
>> [ 70.567082][ T1] EDAC DEBUG: edac_create_sysfs_mci_device: device mc0 created
>> [ 70.575608][ T1] EDAC DEBUG: edac_create_dimm_object: device dimm2 created at location memory 2
>> [ 70.585818][ T1] EDAC DEBUG: edac_create_csrow_object: device csrow2 created
>> [ 70.594110][ T1] EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
>> [ 70.605936][ T1] EDAC DEBUG: edac_mc_del_mc:
>> [ 70.611188][ T1] EDAC DEBUG: edac_remove_sysfs_mci_device:
>> [ 70.619443][ T1] random: get_random_u32 called from kobject_put+0x8c/0x190 with crng_init=0
>> [ 70.628163][ T1] kobject: 'csrow2' ((____ptrval____)): kobject_release, parent (____ptrval____) (delayed 750)
>> [ 70.638477][ T1] EDAC DEBUG: edac_remove_sysfs_mci_device: unregistering device dimm2
>> [ 70.647903][ T1] kobject: 'dimm2' ((____ptrval____)): kobject_release, parent (____ptrval____) (delayed 250)
>> [ 70.658105][ T1] EDAC MC: Removed device 0 for ghes_edac.c ghes_edac: DEV ghes
>> [ 70.665673][ T1] EDAC DEBUG: edac_mc_free:
>> [ 70.670211][ T1] EDAC DEBUG: edac_unregister_sysfs: unregistering device mc0
>> [ 70.679027][ T1] kobject: 'mc0' ((____ptrval____)): kobject_release, parent (____ptrval____) (delayed 500)
>> [ 70.690987][ T1] EDAC DEBUG: edac_mc_del_mc:
>> [ 70.695769][ T1] EDAC DEBUG: edac_mc_free:
>> [ 70.700412][ T1] ------------[ cut here ]------------
>> [ 70.705832][ T1] ODEBUG: free active (active state 0) object type: timer_list hint: delayed_work_timer_fn+0x0/0x48
>> [ 70.716663][ T1] WARNING: CPU: 50 PID: 1 at lib/debugobjects.c:484 debug_print_object+0xec/0x130
>
> If I am parsing these unwrapped messages correctly (btw, pls use another
> mail client for pasting log lines - thunderbird is usually ok but I
> guess you need to configure it properly
Maybe you can receive the cutdown log attachment while I figure out how
to do that...
), that must be some workqueue
> object of sorts.
>
> Now, ghes_edac doesn't init the workqueue:
>
> [ 70.594110][ T1] EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
>
> as it is in interrupt mode.
>
> So the only other workqueue I see is that "delayed XXX" stuff which is in
> kobject_release().
>
> AFAICT.
>
> Do you have CONFIG_DEBUG_KOBJECT_RELEASE enabled and if so, does the
> warning go away if you disable it?
>
Yes, it's enabled with allmodconfig, but no, it does not go away with
disabling (see log #2).
Cheers,
John
> Thx.
>
On 14/10/2019 17:15, James Morse wrote:
> Hi John,
>
Hi James,
> On 14/10/2019 16:18, John Garry wrote:
>> I'm experimenting by trying to boot an allmodconfig arm64 kernel, as mentioned here:
>
> Crumbs!
>
>
>> One thing that I noticed - it's hard to miss actually - is the amount of complaining from
>> KASAN about the EDAC/ghes code. Maybe this is something I should not care about/red
>> herring, or maybe something genuine. Let me know what you think.
>
> Hmmm, I thought I tested this recently...
>
>> Log snippet (I cut off after the first KASAN warning):
>>
>> [ 70.471011][ T1] random: get_random_u32 called from new_slab+0x360/0x698 with
>> crng_init=0
>
>> [ 70.478671][ T1] [Firmware Bug]: APEI: Invalid bit width + offset in GAR
>> [0x94110034/64/0/3/0]
>
> (this one's for you right?)
Yeah, I'll report it. It might be already fixed.
>
>> [ 70.700412][ T1] ------------[ cut here ]------------
>
>> [ 70.802080][ T1] Call trace:
>> [ 70.802093][ T1] debug_print_object+0xec/0x130
>> [ 70.802106][ T1] __debug_check_no_obj_freed+0x114/0x290
>> [ 70.802119][ T1] debug_check_no_obj_freed+0x18/0x28
>> [ 70.802130][ T1] slab_free_freelist_hook+0x18c/0x228
>> [ 70.802140][ T1] kfree+0x264/0x420
>> [ 70.802157][ T1] _edac_mc_free+0x6c/0x210
>> [ 70.814163][ T1] edac_mc_free+0x68/0x88
>> [ 70.814177][ T1] ghes_edac_unregister+0x44/0x70
>> [ 70.814193][ T1] ghes_remove+0x274/0x2a0
>
> Ugh. This must be the test driver remove thing.
Yeah, the probe, remove, probe again flow from
CONFIG_DEBUG_TEST_DRIVER_REMOVE.
>
> I've reproduced this, but had to remove the parent GHES twice. It looks like it tries to
> use the first ghes_edac global variables when freeing the second. ghes_init prevents it
> from re-allocating over the top.
>
> The below diff fixes it for me.
And for me by the looks of it. That's with CONFIG_DEBUG_KOBJECT_RELEASE
now unset, but I expect the same with it set.
(I'll post it as a proper patch once I've done the
> archaeology)
>
> -----------%<-----------
> diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
> index d413a0bdc9ad..955b59b6aade 100644
> --- a/drivers/edac/ghes_edac.c
> +++ b/drivers/edac/ghes_edac.c
> @@ -554,6 +554,7 @@ void ghes_edac_unregister(struct ghes *ghes)
> return;
>
> mci = ghes_pvt->mci;
> + ghes_pvt = NULL;
> edac_mc_del_mc(mci->pdev);
> edac_mc_free(mci);
> }
>
> -----------%<-----------
>
Thanks,
John
BTW, I am not sure if my response to Boris was rejected due to
attachments, as but it is here:
https://lore.kernel.org/linux-edac/[email protected]/T/#ma0e122ca0eda9d80e869af179352f75037146d3c
>
> Thanks!
>
> James
>
> .
>
On Mon, Oct 14, 2019 at 05:56:02PM +0100, John Garry wrote:
> BTW, I am not sure if my response to Boris was rejected due to attachments,
> as but it is here:
>
> https://lore.kernel.org/linux-edac/[email protected]/T/#ma0e122ca0eda9d80e869af179352f75037146d3c
No, all good. It went through.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 14/10/2019 16:18, John Garry wrote:
Hi guys,
JFYI, I see an issue on linuxnext-2019119, as follows:
21.645388] io scheduler kyber registered
[ 21.734011] input: Power Button as
/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0C:00/input/input0
[ 21.743295] ACPI: Power Button [PWRB]
[ 21.809644] [Firmware Bug]: APEI: Invalid bit width + offset in GAR
[0x94110034/64/0/3/0]
[ 21.821974] EDAC MC0: Giving out device to module ghes_edac.c
controller ghes_edac: DEV ghes (INTERRUPT)
[ 21.831763] ------------[ cut here ]------------
[ 21.836374] refcount_t: increment on 0; use-after-free.
[ 21.841620] WARNING: CPU: 36 PID: 1 at lib/refcount.c:156
refcount_inc_checked+0x44/0x50
[ 21.849697] Modules linked in:
[ 21.852745] CPU: 36 PID: 1 Comm: swapper/0 Not tainted
5.4.0-rc8-next-20191119-00003-g141a9fef5092-dirty #650
[ 21.862645] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI
RC0 - V1.16.01 03/15/2019
[ 21.871157] pstate: 60c00009 (nZCv daif +PAN +UAO)
[ 21.875936] pc : refcount_inc_checked+0x44/0x50
[ 21.880455] lr : refcount_inc_checked+0x44/0x50
[ 21.884972] sp : ffff00236ffbf8a0
[ 21.888274] x29: ffff00236ffbf8a0 x28: 0000000000000002
[ 21.893576] x27: ffff00236cd07900 x26: ffff002369063010
[ 21.898876] x25: 0000000000000000 x24: ffff00233c236824
[ 21.904177] x23: ffffa000137b9000 x22: ffffa00016fbb7c0
[ 21.909477] x21: ffffa00012dfd000 x20: 1fffe0046dff7f24
[ 21.914777] x19: ffff00233c236000 x18: 0000000000000000
[ 21.920077] x17: 0000000000000000 x16: 0000000000000000
[ 21.925377] x15: 0000000000007700 x14: 64655f7365686720
[ 21.930677] x13: 72656c6c6f72746e x12: 1ffff40002719618
[ 21.935977] x11: ffff940002719618 x10: dfffa00000000000
[ 21.941278] x9 : ffff940002719619 x8 : 0000000000000001
[ 21.946578] x7 : 0000000000000000 x6 : 0000000000000001
[ 21.951877] x5 : ffff940002719618 x4 : ffff00236ffb0010
[ 21.957178] x3 : ffffa000112415e4 x2 : ffff80046dff7ede
[ 21.962478] x1 : 5aff78756b1cf400 x0 : 0000000000000000
[ 21.967779] Call trace:
[ 21.970214] refcount_inc_checked+0x44/0x50
[ 21.974389] ghes_edac_register+0x258/0x388
[ 21.978562] ghes_probe+0x28c/0x5f0
[ 21.982041] platform_drv_probe+0x70/0xd8
[ 21.986039] really_probe+0x174/0x468
[ 21.989690] driver_probe_device+0x7c/0x148
[ 21.993862] device_driver_attach+0x94/0xa0
[ 21.998033] __driver_attach+0xa4/0x110
[ 22.001857] bus_for_each_dev+0xe8/0x158
[ 22.005768] driver_attach+0x30/0x40
[ 22.009331] bus_add_driver+0x234/0x2f0
[ 22.013156] driver_register+0xbc/0x1d0
[ 22.016981] __platform_driver_register+0x7c/0x88
[ 22.021675] ghes_init+0xbc/0x14c
[ 22.024979] do_one_initcall+0xb4/0x254
[ 22.028805] kernel_init_freeable+0x248/0x2f4
[ 22.033151] kernel_init+0x10/0x118
[ 22.036628] ret_from_fork+0x10/0x18
[ 22.040194] ---[ end trace 33655bb65a9835fe ]---
[ 22.046666] EDAC MC: bug in low-level driver: attempt to assign
[ 22.046666] duplicate mc_idx 0 in add_mc_to_global_list()
[ 22.058311] ghes_edac: Can't register at EDAC core
[ 22.065402] EDAC MC: bug in low-level driver: attempt to assign
[ 22.065402] duplicate mc_idx 0 in add_mc_to_global_list()
[ 22.077080] ghes_edac: Can't register at EDAC core
[ 22.084140] EDAC MC: bug in low-level driver: attempt to assign
[ 22.084140] duplicate mc_idx 0 in add_mc_to_global_list()
[ 22.095789] ghes_edac: Can't register at EDAC core
[ 22.102873] EDAC MC: bug in low-level driver: attempt to assign
[ 22.102873] duplicate mc_idx 0 in add_mc_to_global_list()
[ 22.115442] ghes_edac: Can't register at EDAC core
[ 22.122536] EDAC MC: bug in low-level driver: attempt to assign
[ 22.122536] duplicate mc_idx 0 in add_mc_to_global_list()
[ 22.134344] ghes_edac: Can't register at EDAC core
[ 22.141441] EDAC MC: bug in low-level driver: attempt to assign
[ 22.141441] duplicate mc_idx 0 in add_mc_to_global_list()
[ 22.153089] ghes_edac: Can't register at EDAC core
[ 22.160161] EDAC MC: bug in low-level driver: attempt to assign
[ 22.160161] duplicate mc_idx 0 in add_mc_to_global_list()
[ 22.171810] ghes_edac: Can't register at EDAC core
[ 22.178933] GHES: APEI firmware first mode is enabled by APEI bit and
WHEA _OSC.
This time I'm using a standard arm64 defconfig, except kasan and
kmemleak is enabled (I need to enable them when developing software -
joke). Maybe it's a known issue, I don't know.
Cheers,
John
> Hi guys,
>
> I'm experimenting by trying to boot an allmodconfig arm64 kernel, as
> mentioned here:
> https://lore.kernel.org/linux-arm-kernel/[email protected]/
>
>
> One thing that I noticed - it's hard to miss actually - is the amount of
> complaining from KASAN about the EDAC/ghes code. Maybe this is something
> I should not care about/red herring, or maybe something genuine. Let me
> know what you think.
>
> The kernel is v5.4-rc3, and I raised the EDAC mc debug level to get
> extra debug prints.
>
[cut]
Hi John,
thanks for testing and reporting this. See inline.
On 21.11.19 12:34:22, John Garry wrote:
> On 14/10/2019 16:18, John Garry wrote:
> JFYI, I see an issue on linuxnext-2019119, as follows:
>
> 21.645388] io scheduler kyber registered
> [ 21.734011] input: Power Button as
> /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0C:00/input/input0
> [ 21.743295] ACPI: Power Button [PWRB]
> [ 21.809644] [Firmware Bug]: APEI: Invalid bit width + offset in GAR
> [0x94110034/64/0/3/0]
> [ 21.821974] EDAC MC0: Giving out device to module ghes_edac.c controller
> ghes_edac: DEV ghes (INTERRUPT)
> [ 21.831763] ------------[ cut here ]------------
> [ 21.836374] refcount_t: increment on 0; use-after-free.
> [ 21.841620] WARNING: CPU: 36 PID: 1 at lib/refcount.c:156
> refcount_inc_checked+0x44/0x50
> [ 21.849697] Modules linked in:
> [ 21.852745] CPU: 36 PID: 1 Comm: swapper/0 Not tainted
> 5.4.0-rc8-next-20191119-00003-g141a9fef5092-dirty #650
> [ 21.862645] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
> V1.16.01 03/15/2019
> [ 21.871157] pstate: 60c00009 (nZCv daif +PAN +UAO)
> [ 21.875936] pc : refcount_inc_checked+0x44/0x50
> [ 21.880455] lr : refcount_inc_checked+0x44/0x50
This is a warning from the refcount framework. It warns if we
increment from zero. This is reasonable as typically a kernel object
is created with a refcount of 1 and thrown away once the refcount is
zero. Afterwards the object is used-after-free.
For ghes the refcount is initialized with zero, and that is why we see
this message. However, we protect the refcount with the ghes_reg_mutex
and thus there is no use after free. The device is allocated and
registered if the refcount is zero. So this works fine.
Enclosed a fix that avoids the warning, please test.
But see below...
> [ 21.884972] sp : ffff00236ffbf8a0
> [ 21.888274] x29: ffff00236ffbf8a0 x28: 0000000000000002
> [ 21.893576] x27: ffff00236cd07900 x26: ffff002369063010
> [ 21.898876] x25: 0000000000000000 x24: ffff00233c236824
> [ 21.904177] x23: ffffa000137b9000 x22: ffffa00016fbb7c0
> [ 21.909477] x21: ffffa00012dfd000 x20: 1fffe0046dff7f24
> [ 21.914777] x19: ffff00233c236000 x18: 0000000000000000
> [ 21.920077] x17: 0000000000000000 x16: 0000000000000000
> [ 21.925377] x15: 0000000000007700 x14: 64655f7365686720
> [ 21.930677] x13: 72656c6c6f72746e x12: 1ffff40002719618
> [ 21.935977] x11: ffff940002719618 x10: dfffa00000000000
> [ 21.941278] x9 : ffff940002719619 x8 : 0000000000000001
> [ 21.946578] x7 : 0000000000000000 x6 : 0000000000000001
> [ 21.951877] x5 : ffff940002719618 x4 : ffff00236ffb0010
> [ 21.957178] x3 : ffffa000112415e4 x2 : ffff80046dff7ede
> [ 21.962478] x1 : 5aff78756b1cf400 x0 : 0000000000000000
> [ 21.967779] Call trace:
> [ 21.970214] refcount_inc_checked+0x44/0x50
> [ 21.974389] ghes_edac_register+0x258/0x388
> [ 21.978562] ghes_probe+0x28c/0x5f0
> [ 21.982041] platform_drv_probe+0x70/0xd8
> [ 21.986039] really_probe+0x174/0x468
> [ 21.989690] driver_probe_device+0x7c/0x148
> [ 21.993862] device_driver_attach+0x94/0xa0
> [ 21.998033] __driver_attach+0xa4/0x110
> [ 22.001857] bus_for_each_dev+0xe8/0x158
> [ 22.005768] driver_attach+0x30/0x40
> [ 22.009331] bus_add_driver+0x234/0x2f0
> [ 22.013156] driver_register+0xbc/0x1d0
> [ 22.016981] __platform_driver_register+0x7c/0x88
> [ 22.021675] ghes_init+0xbc/0x14c
> [ 22.024979] do_one_initcall+0xb4/0x254
> [ 22.028805] kernel_init_freeable+0x248/0x2f4
> [ 22.033151] kernel_init+0x10/0x118
> [ 22.036628] ret_from_fork+0x10/0x18
> [ 22.040194] ---[ end trace 33655bb65a9835fe ]---
> [ 22.046666] EDAC MC: bug in low-level driver: attempt to assign
> [ 22.046666] duplicate mc_idx 0 in add_mc_to_global_list()
> [ 22.058311] ghes_edac: Can't register at EDAC core
> [ 22.065402] EDAC MC: bug in low-level driver: attempt to assign
> [ 22.065402] duplicate mc_idx 0 in add_mc_to_global_list()
> [ 22.077080] ghes_edac: Can't register at EDAC core
> [ 22.084140] EDAC MC: bug in low-level driver: attempt to assign
> [ 22.084140] duplicate mc_idx 0 in add_mc_to_global_list()
> [ 22.095789] ghes_edac: Can't register at EDAC core
> [ 22.102873] EDAC MC: bug in low-level driver: attempt to assign
> [ 22.102873] duplicate mc_idx 0 in add_mc_to_global_list()
> [ 22.115442] ghes_edac: Can't register at EDAC core
> [ 22.122536] EDAC MC: bug in low-level driver: attempt to assign
> [ 22.122536] duplicate mc_idx 0 in add_mc_to_global_list()
> [ 22.134344] ghes_edac: Can't register at EDAC core
> [ 22.141441] EDAC MC: bug in low-level driver: attempt to assign
> [ 22.141441] duplicate mc_idx 0 in add_mc_to_global_list()
> [ 22.153089] ghes_edac: Can't register at EDAC core
> [ 22.160161] EDAC MC: bug in low-level driver: attempt to assign
> [ 22.160161] duplicate mc_idx 0 in add_mc_to_global_list()
> [ 22.171810] ghes_edac: Can't register at EDAC core
What I am more concerned is this here. In total this implies 8 ghes
users that all try to register a (single-instance) ghes mc device. For
non-x86 only one instance is allowed (see ghes_edac_register(), idx =
0).
So on your platform, when parsing the HEST table
(hest_ghes_dev_register()), more than one "GHES" device is parsed,
allocated and registered. Mind sending me your HEST table
(/sys/firmware/acpi/tables/HEST), or explain what happens here? If
this is a valid use case, we need to change ghes_edac_register() to
support more than one instance.
Again, please try the patch below.
Thanks,
-Robert
From 6962f8af4a7c1051c9e87a5ac60571f70d2b6814 Mon Sep 17 00:00:00 2001
From: Robert Richter <[email protected]>
Date: Thu, 21 Nov 2019 15:01:28 +0100
Subject: [PATCH] EDAC/ghes: Do not warn on when increment refcount on 0
Signed-off-by: Robert Richter <[email protected]>
---
drivers/edac/ghes_edac.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 47f4e7f90ef0..b99080d8a10c 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -556,8 +556,8 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
ghes_pvt = pvt;
spin_unlock_irqrestore(&ghes_lock, flags);
- /* only increment on success */
- refcount_inc(&ghes_refcount);
+ /* only set on success */
+ refcount_set(&ghes_refcount, 1);
unlock:
mutex_unlock(&ghes_reg_mutex);
--
2.20.1
On 21/11/2019 14:23, Robert Richter wrote:
> Hi John,
>
> thanks for testing and reporting this. See inline.
>
> On 21.11.19 12:34:22, John Garry wrote:
>> On 14/10/2019 16:18, John Garry wrote:
>> JFYI, I see an issue on linuxnext-2019119, as follows:
>>
>> 21.645388] io scheduler kyber registered
>> [ 21.734011] input: Power Button as
>> /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0C:00/input/input0
>> [ 21.743295] ACPI: Power Button [PWRB]
>> [ 21.809644] [Firmware Bug]: APEI: Invalid bit width + offset in GAR
>> [0x94110034/64/0/3/0]
>> [ 21.821974] EDAC MC0: Giving out device to module ghes_edac.c controller
>> ghes_edac: DEV ghes (INTERRUPT)
>> [ 21.831763] ------------[ cut here ]------------
>> [ 21.836374] refcount_t: increment on 0; use-after-free.
>> [ 21.841620] WARNING: CPU: 36 PID: 1 at lib/refcount.c:156
>> refcount_inc_checked+0x44/0x50
>> [ 21.849697] Modules linked in:
>> [ 21.852745] CPU: 36 PID: 1 Comm: swapper/0 Not tainted
>> 5.4.0-rc8-next-20191119-00003-g141a9fef5092-dirty #650
>> [ 21.862645] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
>> V1.16.01 03/15/2019
>> [ 21.871157] pstate: 60c00009 (nZCv daif +PAN +UAO)
>> [ 21.875936] pc : refcount_inc_checked+0x44/0x50
>> [ 21.880455] lr : refcount_inc_checked+0x44/0x50
>
> This is a warning from the refcount framework. It warns if we
> increment from zero. This is reasonable as typically a kernel object
> is created with a refcount of 1 and thrown away once the refcount is
> zero. Afterwards the object is used-after-free.
>
> For ghes the refcount is initialized with zero, and that is why we see
> this message. However, we protect the refcount with the ghes_reg_mutex
> and thus there is no use after free. The device is allocated and
> registered if the refcount is zero. So this works fine.
>
> Enclosed a fix that avoids the warning, please test.
>
> But see below...
>
>> [ 21.884972] sp : ffff00236ffbf8a0
>> [ 21.888274] x29: ffff00236ffbf8a0 x28: 0000000000000002
>> [ 21.893576] x27: ffff00236cd07900 x26: ffff002369063010
>> [ 21.898876] x25: 0000000000000000 x24: ffff00233c236824
>> [ 21.904177] x23: ffffa000137b9000 x22: ffffa00016fbb7c0
>> [ 21.909477] x21: ffffa00012dfd000 x20: 1fffe0046dff7f24
>> [ 21.914777] x19: ffff00233c236000 x18: 0000000000000000
>> [ 21.920077] x17: 0000000000000000 x16: 0000000000000000
>> [ 21.925377] x15: 0000000000007700 x14: 64655f7365686720
>> [ 21.930677] x13: 72656c6c6f72746e x12: 1ffff40002719618
>> [ 21.935977] x11: ffff940002719618 x10: dfffa00000000000
>> [ 21.941278] x9 : ffff940002719619 x8 : 0000000000000001
>> [ 21.946578] x7 : 0000000000000000 x6 : 0000000000000001
>> [ 21.951877] x5 : ffff940002719618 x4 : ffff00236ffb0010
>> [ 21.957178] x3 : ffffa000112415e4 x2 : ffff80046dff7ede
>> [ 21.962478] x1 : 5aff78756b1cf400 x0 : 0000000000000000
>> [ 21.967779] Call trace:
>> [ 21.970214] refcount_inc_checked+0x44/0x50
>> [ 21.974389] ghes_edac_register+0x258/0x388
>> [ 21.978562] ghes_probe+0x28c/0x5f0
>> [ 21.982041] platform_drv_probe+0x70/0xd8
>> [ 21.986039] really_probe+0x174/0x468
>> [ 21.989690] driver_probe_device+0x7c/0x148
>> [ 21.993862] device_driver_attach+0x94/0xa0
>> [ 21.998033] __driver_attach+0xa4/0x110
>> [ 22.001857] bus_for_each_dev+0xe8/0x158
>> [ 22.005768] driver_attach+0x30/0x40
>> [ 22.009331] bus_add_driver+0x234/0x2f0
>> [ 22.013156] driver_register+0xbc/0x1d0
>> [ 22.016981] __platform_driver_register+0x7c/0x88
>> [ 22.021675] ghes_init+0xbc/0x14c
>> [ 22.024979] do_one_initcall+0xb4/0x254
>> [ 22.028805] kernel_init_freeable+0x248/0x2f4
>> [ 22.033151] kernel_init+0x10/0x118
>> [ 22.036628] ret_from_fork+0x10/0x18
>> [ 22.040194] ---[ end trace 33655bb65a9835fe ]---
>> [ 22.046666] EDAC MC: bug in low-level driver: attempt to assign
>> [ 22.046666] duplicate mc_idx 0 in add_mc_to_global_list()
>> [ 22.058311] ghes_edac: Can't register at EDAC core
>> [ 22.065402] EDAC MC: bug in low-level driver: attempt to assign
>> [ 22.065402] duplicate mc_idx 0 in add_mc_to_global_list()
>> [ 22.077080] ghes_edac: Can't register at EDAC core
>> [ 22.084140] EDAC MC: bug in low-level driver: attempt to assign
>> [ 22.084140] duplicate mc_idx 0 in add_mc_to_global_list()
>> [ 22.095789] ghes_edac: Can't register at EDAC core
>> [ 22.102873] EDAC MC: bug in low-level driver: attempt to assign
>> [ 22.102873] duplicate mc_idx 0 in add_mc_to_global_list()
>> [ 22.115442] ghes_edac: Can't register at EDAC core
>> [ 22.122536] EDAC MC: bug in low-level driver: attempt to assign
>> [ 22.122536] duplicate mc_idx 0 in add_mc_to_global_list()
>> [ 22.134344] ghes_edac: Can't register at EDAC core
>> [ 22.141441] EDAC MC: bug in low-level driver: attempt to assign
>> [ 22.141441] duplicate mc_idx 0 in add_mc_to_global_list()
>> [ 22.153089] ghes_edac: Can't register at EDAC core
>> [ 22.160161] EDAC MC: bug in low-level driver: attempt to assign
>> [ 22.160161] duplicate mc_idx 0 in add_mc_to_global_list()
>> [ 22.171810] ghes_edac: Can't register at EDAC core
>
> What I am more concerned is this here. In total this implies 8 ghes
> users that all try to register a (single-instance) ghes mc device. For
> non-x86 only one instance is allowed (see ghes_edac_register(), idx =
> 0).
>
[cc some guys about HEST]
> So on your platform, when parsing the HEST table
> (hest_ghes_dev_register()), more than one "GHES" device is parsed,
> allocated and registered. Mind sending me your HEST table
> (/sys/firmware/acpi/tables/HEST), or explain what happens here?
I think that this should be the same:
https://github.com/tianocore/edk2-platforms/tree/master/Silicon/Hisilicon/Hi1620/Drivers/Apei/Hest
If
> this is a valid use case, we need to change ghes_edac_register() to
> support more than one instance.
>
> Again, please try the patch below.
>
> Thanks,
>
> -Robert
>
>
>>From 6962f8af4a7c1051c9e87a5ac60571f70d2b6814 Mon Sep 17 00:00:00 2001
> From: Robert Richter <[email protected]>
> Date: Thu, 21 Nov 2019 15:01:28 +0100
> Subject: [PATCH] EDAC/ghes: Do not warn on when increment refcount on 0
>
> Signed-off-by: Robert Richter <[email protected]>
> ---
> drivers/edac/ghes_edac.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
> index 47f4e7f90ef0..b99080d8a10c 100644
> --- a/drivers/edac/ghes_edac.c
> +++ b/drivers/edac/ghes_edac.c
> @@ -556,8 +556,8 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
> ghes_pvt = pvt;
> spin_unlock_irqrestore(&ghes_lock, flags);
>
> - /* only increment on success */
> - refcount_inc(&ghes_refcount);
> + /* only set on success */
> + refcount_set(&ghes_refcount, 1);
Yep, that seems to have silenced it all:
[ 21.739895] input: Power Button as
/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0C0C:00/input/input0
[ 21.749848] ACPI: Power Button [PWRB]
[ 21.816977] [Firmware Bug]: APEI: Invalid bit width + offset in GAR
[0x94110034/64/0/3/0]
[ 21.827880] EDAC MC0: Giving out device to module ghes_edac.c
controller ghes_edac: DEV ghes (INTERRUPT)
[ 21.841313] GHES: APEI firmware first mode is enabled by APEI bit and
WHEA _OSC.
[ 21.849176] EINJ: Error INJection is initialized.
[ 21.855135] ACPI GTDT: found 1 SBSA generic Watchdog(s).
Thanks,
John
>
> unlock:
> mutex_unlock(&ghes_reg_mutex);
>
Cheers,
John
Following warning from the refcount framework is seen during ghes
initialization:
EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
------------[ cut here ]------------
refcount_t: increment on 0; use-after-free.
WARNING: CPU: 36 PID: 1 at lib/refcount.c:156 refcount_inc_checked+0x44/0x50
[...]
Call trace:
refcount_inc_checked+0x44/0x50
ghes_edac_register+0x258/0x388
ghes_probe+0x28c/0x5f0
It warns if the refcount is incremented from zero. This warning is
reasonable as a kernel object is typically created with a refcount of
one and freed once the refcount is zero. Afterwards the object would
be "used-after-free".
For ghes the refcount is initialized with zero, and that is why this
message is seen when initializing the first instance. However,
whenever the refcount is zero, the device will be allocated and
registered. Since the ghes_reg_mutex protects the refcount and
serializes allocation and freeing of ghes devices, a use-after-free
cannot happen here.
Instead of using refcount_inc() for the first instance, use
refcount_set(). This can be used here because the refcount is zero at
this point and can not change due to its protection by the mutex.
Reported-by: John Garry <[email protected]>
Tested-by: John Garry <[email protected]>
Signed-off-by: Robert Richter <[email protected]>
---
drivers/edac/ghes_edac.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 47f4e7f90ef0..b99080d8a10c 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -556,8 +556,8 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)
ghes_pvt = pvt;
spin_unlock_irqrestore(&ghes_lock, flags);
- /* only increment on success */
- refcount_inc(&ghes_refcount);
+ /* only set on success */
+ refcount_set(&ghes_refcount, 1);
unlock:
mutex_unlock(&ghes_reg_mutex);
--
2.20.1
On Thu, Nov 21, 2019 at 09:36:57PM +0000, Robert Richter wrote:
> Following warning from the refcount framework is seen during ghes
> initialization:
>
> EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
> ------------[ cut here ]------------
> refcount_t: increment on 0; use-after-free.
> WARNING: CPU: 36 PID: 1 at lib/refcount.c:156 refcount_inc_checked+0x44/0x50
> [...]
> Call trace:
> refcount_inc_checked+0x44/0x50
> ghes_edac_register+0x258/0x388
> ghes_probe+0x28c/0x5f0
>
> It warns if the refcount is incremented from zero. This warning is
> reasonable as a kernel object is typically created with a refcount of
> one and freed once the refcount is zero. Afterwards the object would
> be "used-after-free".
>
> For ghes the refcount is initialized with zero, and that is why this
> message is seen when initializing the first instance. However,
> whenever the refcount is zero, the device will be allocated and
> registered. Since the ghes_reg_mutex protects the refcount and
> serializes allocation and freeing of ghes devices, a use-after-free
> cannot happen here.
>
> Instead of using refcount_inc() for the first instance, use
> refcount_set(). This can be used here because the refcount is zero at
> this point and can not change due to its protection by the mutex.
>
> Reported-by: John Garry <[email protected]>
> Tested-by: John Garry <[email protected]>
> Signed-off-by: Robert Richter <[email protected]>
> ---
> drivers/edac/ghes_edac.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
Queued, thanks.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 21.11.19 15:23:42, John Garry wrote:
> On 21/11/2019 14:23, Robert Richter wrote:
> > On 21.11.19 12:34:22, John Garry wrote:
> > > [ 22.046666] EDAC MC: bug in low-level driver: attempt to assign
> > > [ 22.046666] duplicate mc_idx 0 in add_mc_to_global_list()
> > > [ 22.058311] ghes_edac: Can't register at EDAC core
> > > [ 22.065402] EDAC MC: bug in low-level driver: attempt to assign
> > > [ 22.065402] duplicate mc_idx 0 in add_mc_to_global_list()
> > > [ 22.077080] ghes_edac: Can't register at EDAC core
> > > [ 22.084140] EDAC MC: bug in low-level driver: attempt to assign
> > > [ 22.084140] duplicate mc_idx 0 in add_mc_to_global_list()
> > > [ 22.095789] ghes_edac: Can't register at EDAC core
> > > [ 22.102873] EDAC MC: bug in low-level driver: attempt to assign
> > > [ 22.102873] duplicate mc_idx 0 in add_mc_to_global_list()
> > > [ 22.115442] ghes_edac: Can't register at EDAC core
> > > [ 22.122536] EDAC MC: bug in low-level driver: attempt to assign
> > > [ 22.122536] duplicate mc_idx 0 in add_mc_to_global_list()
> > > [ 22.134344] ghes_edac: Can't register at EDAC core
> > > [ 22.141441] EDAC MC: bug in low-level driver: attempt to assign
> > > [ 22.141441] duplicate mc_idx 0 in add_mc_to_global_list()
> > > [ 22.153089] ghes_edac: Can't register at EDAC core
> > > [ 22.160161] EDAC MC: bug in low-level driver: attempt to assign
> > > [ 22.160161] duplicate mc_idx 0 in add_mc_to_global_list()
> > > [ 22.171810] ghes_edac: Can't register at EDAC core
> >
> > What I am more concerned is this here. In total this implies 8 ghes
> > users that all try to register a (single-instance) ghes mc device. For
> > non-x86 only one instance is allowed (see ghes_edac_register(), idx =
> > 0).
I also looked into this: With refcount_inc_checked() enabled, the
refcount is *not* increased from 0 to 1. Under the hood only
refcount_inc_not_zero() is called instead of refcount_inc(). So the
refcount is still zero after an edac mc device was registered. Instead
of sharing the edac mc device, the driver tries to allocate another mc
device for each GHESv2 entry in the HEST table. This causes the
'duplicate mc_idx' message. Also, it is ok to have multiple GHESv2
entries (your system seems to have 8 entries), e.g. to serve different
kind of errors in the system.
Thanks,
-Robert
On 22/11/2019 09:01, Borislav Petkov wrote:
> On Thu, Nov 21, 2019 at 09:36:57PM +0000, Robert Richter wrote:
>> Following warning from the refcount framework is seen during ghes
>> initialization:
>>
>> EDAC MC0: Giving out device to module ghes_edac.c controller ghes_edac: DEV ghes (INTERRUPT)
>> ------------[ cut here ]------------
>> refcount_t: increment on 0; use-after-free.
>> WARNING: CPU: 36 PID: 1 at lib/refcount.c:156 refcount_inc_checked+0x44/0x50
>> [...]
>> Call trace:
>> refcount_inc_checked+0x44/0x50
>> ghes_edac_register+0x258/0x388
>> ghes_probe+0x28c/0x5f0
>>
>> It warns if the refcount is incremented from zero. This warning is
>> reasonable as a kernel object is typically created with a refcount of
>> one and freed once the refcount is zero. Afterwards the object would
>> be "used-after-free".
>>
>> For ghes the refcount is initialized with zero, and that is why this
>> message is seen when initializing the first instance. However,
>> whenever the refcount is zero, the device will be allocated and
>> registered. Since the ghes_reg_mutex protects the refcount and
>> serializes allocation and freeing of ghes devices, a use-after-free
>> cannot happen here.
>>
>> Instead of using refcount_inc() for the first instance, use
>> refcount_set(). This can be used here because the refcount is zero at
>> this point and can not change due to its protection by the mutex.
>>
>> Reported-by: John Garry <[email protected]>
>> Tested-by: John Garry <[email protected]>
According to kernel dev process Doc, this should be explicitly granted, so:
Tested-by: John Garry <[email protected]>
Thanks,
John
>> Signed-off-by: Robert Richter <[email protected]>
>> ---
>> drivers/edac/ghes_edac.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> Queued, thanks.
>
On 22/11/2019 11:28, Robert Richter wrote:
> On 21.11.19 15:23:42, John Garry wrote:
>> On 21/11/2019 14:23, Robert Richter wrote:
>>> On 21.11.19 12:34:22, John Garry wrote:
>
>>>> [ 22.046666] EDAC MC: bug in low-level driver: attempt to assign
>>>> [ 22.046666] duplicate mc_idx 0 in add_mc_to_global_list()
>>>> [ 22.058311] ghes_edac: Can't register at EDAC core
>>>> [ 22.065402] EDAC MC: bug in low-level driver: attempt to assign
>>>> [ 22.065402] duplicate mc_idx 0 in add_mc_to_global_list()
>>>> [ 22.077080] ghes_edac: Can't register at EDAC core
>>>> [ 22.084140] EDAC MC: bug in low-level driver: attempt to assign
>>>> [ 22.084140] duplicate mc_idx 0 in add_mc_to_global_list()
>>>> [ 22.095789] ghes_edac: Can't register at EDAC core
>>>> [ 22.102873] EDAC MC: bug in low-level driver: attempt to assign
>>>> [ 22.102873] duplicate mc_idx 0 in add_mc_to_global_list()
>>>> [ 22.115442] ghes_edac: Can't register at EDAC core
>>>> [ 22.122536] EDAC MC: bug in low-level driver: attempt to assign
>>>> [ 22.122536] duplicate mc_idx 0 in add_mc_to_global_list()
>>>> [ 22.134344] ghes_edac: Can't register at EDAC core
>>>> [ 22.141441] EDAC MC: bug in low-level driver: attempt to assign
>>>> [ 22.141441] duplicate mc_idx 0 in add_mc_to_global_list()
>>>> [ 22.153089] ghes_edac: Can't register at EDAC core
>>>> [ 22.160161] EDAC MC: bug in low-level driver: attempt to assign
>>>> [ 22.160161] duplicate mc_idx 0 in add_mc_to_global_list()
>>>> [ 22.171810] ghes_edac: Can't register at EDAC core
>>>
>>> What I am more concerned is this here. In total this implies 8 ghes
>>> users that all try to register a (single-instance) ghes mc device. For
>>> non-x86 only one instance is allowed (see ghes_edac_register(), idx =
>>> 0).
>
> I also looked into this: With refcount_inc_checked() enabled, the
> refcount is *not* increased from 0 to 1.
Yeah, I had quickly checked this back then and I think you're right.
Thanks,
John
Under the hood only
> refcount_inc_not_zero() is called instead of refcount_inc(). So the
> refcount is still zero after an edac mc device was registered. Instead
> of sharing the edac mc device, the driver tries to allocate another mc
> device for each GHESv2 entry in the HEST table. This causes the
> 'duplicate mc_idx' message. Also, it is ok to have multiple GHESv2
> entries (your system seems to have 8 entries), e.g. to serve different
> kind of errors in the system.
>
> Thanks,
>
> -Robert
> .
>
On 26/11/2019 09:59, John Garry wrote:
> On 22/11/2019 11:28, Robert Richter wrote:
>> On 21.11.19 15:23:42, John Garry wrote:
>>> On 21/11/2019 14:23, Robert Richter wrote:
>>>> On 21.11.19 12:34:22, John Garry wrote:
>>
>>>>> [ 22.046666] EDAC MC: bug in low-level driver: attempt to assign
>>>>> [ 22.046666] duplicate mc_idx 0 in add_mc_to_global_list()
>>>>> [ 22.058311] ghes_edac: Can't register at EDAC core
>>>>> [ 22.065402] EDAC MC: bug in low-level driver: attempt to assign
>>>>> [ 22.065402] duplicate mc_idx 0 in add_mc_to_global_list()
>>>>> [ 22.077080] ghes_edac: Can't register at EDAC core
>>>>> [ 22.084140] EDAC MC: bug in low-level driver: attempt to assign
>>>>> [ 22.084140] duplicate mc_idx 0 in add_mc_to_global_list()
>>>>> [ 22.095789] ghes_edac: Can't register at EDAC core
>>>>> [ 22.102873] EDAC MC: bug in low-level driver: attempt to assign
>>>>> [ 22.102873] duplicate mc_idx 0 in add_mc_to_global_list()
>>>>> [ 22.115442] ghes_edac: Can't register at EDAC core
>>>>> [ 22.122536] EDAC MC: bug in low-level driver: attempt to assign
>>>>> [ 22.122536] duplicate mc_idx 0 in add_mc_to_global_list()
>>>>> [ 22.134344] ghes_edac: Can't register at EDAC core
>>>>> [ 22.141441] EDAC MC: bug in low-level driver: attempt to assign
>>>>> [ 22.141441] duplicate mc_idx 0 in add_mc_to_global_list()
>>>>> [ 22.153089] ghes_edac: Can't register at EDAC core
>>>>> [ 22.160161] EDAC MC: bug in low-level driver: attempt to assign
>>>>> [ 22.160161] duplicate mc_idx 0 in add_mc_to_global_list()
>>>>> [ 22.171810] ghes_edac: Can't register at EDAC core
>>>>
>>>> What I am more concerned is this here. In total this implies 8 ghes
>>>> users that all try to register a (single-instance) ghes mc device. For
>>>> non-x86 only one instance is allowed (see ghes_edac_register(), idx =
>>>> 0).
>>
>> I also looked into this: With refcount_inc_checked() enabled, the
>> refcount is *not* increased from 0 to 1.
>
> Yeah, I had quickly checked this back then and I think you're right.
>
> Thanks,
> John
Hi guys,
Me again ... For linux-next 27 Nov, I now see this on my same arm64 system:
[ 21.936616] ACPI: Power Button [PWRB]
[ 22.074582] [Firmware Bug]: APEI: Invalid bit width + offset in GAR
[0x94110034/64/0/3/0]
[ 22.086095] EDAC MC0: Giving out device to module ghes_edac.c
controller ghes_edac: DEV ghes (INTERRUPT)
[ 22.097276]
==================================================================
[ 22.104498] BUG: KASAN: use-after-free in
edac_remove_sysfs_mci_device+0x148/0x180
[ 22.112055] Read of size 4 at addr ffff00233bc69338 by task swapper/0/1
[ 22.118656]
[ 22.120139] CPU: 33 PID: 1 Comm: swapper/0 Not tainted
5.4.0-next-20191127-dirty #667
[ 22.127956] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI
RC0 - V1.16.01 03/15/2019
[ 22.136467] Call trace:
[ 22.138907] dump_backtrace+0x0/0x290
[ 22.142558] show_stack+0x14/0x20
[ 22.145865] dump_stack+0xf0/0x14c
[ 22.149258] print_address_description.isra.11+0x6c/0x3b8
[ 22.154645] __kasan_report+0x12c/0x23c
[ 22.158470] kasan_report+0xc/0x18
[ 22.161860] __asan_load4+0x94/0xb8
[ 22.165338] edac_remove_sysfs_mci_device+0x148/0x180
[ 22.170378] edac_mc_del_mc+0x154/0x1b8
[ 22.174203] ghes_edac_unregister+0xa0/0x188
[ 22.178465] ghes_remove+0x11c/0x1f8
[ 22.182033] platform_drv_remove+0x3c/0x68
[ 22.186119] really_probe+0x174/0x548
[ 22.189770] driver_probe_device+0x7c/0x148
[ 22.193942] device_driver_attach+0x94/0xa0
[ 22.198113] __driver_attach+0xa4/0x110
[ 22.201938] bus_for_each_dev+0xe8/0x158
[ 22.205849] driver_attach+0x30/0x40
[ 22.209413] bus_add_driver+0x234/0x2f0
[ 22.213237] driver_register+0xbc/0x1d0
[ 22.217063] __platform_driver_register+0x7c/0x88
[ 22.221757] ghes_init+0xbc/0x14c
[ 22.225060] do_one_initcall+0xb4/0x254
[ 22.228887] kernel_init_freeable+0x248/0x2f4
[ 22.233233] kernel_init+0x10/0x118
[ 22.236710] ret_from_fork+0x10/0x18
[ 22.240273]
[ 22.241753] Allocated by task 1:
[ 22.244971] save_stack+0x28/0xc8
[ 22.248274] __kasan_kmalloc.isra.9+0xbc/0xd8
[ 22.252619] kasan_kmalloc+0xc/0x18
[ 22.256096] edac_mc_alloc+0x62c/0x888
[ 22.259834] ghes_edac_register+0x1c8/0x3f0
[ 22.264006] ghes_probe+0x28c/0x5f0
[ 22.267484] platform_drv_probe+0x70/0xd8
[ 22.271482] really_probe+0x118/0x548
[ 22.275133] driver_probe_device+0x7c/0x148
[ 22.279305] device_driver_attach+0x94/0xa0
[ 22.283476] __driver_attach+0xa4/0x110
[ 22.287301] bus_for_each_dev+0xe8/0x158
[ 22.291212] driver_attach+0x30/0x40
[ 22.294776] bus_add_driver+0x234/0x2f0
[ 22.298600] driver_register+0xbc/0x1d0
[ 22.302425] __platform_driver_register+0x7c/0x88
[ 22.307118] ghes_init+0xbc/0x14c
[ 22.310421] do_one_initcall+0xb4/0x254
[ 22.314246] kernel_init_freeable+0x248/0x2f4
[ 22.318591] kernel_init+0x10/0x118
[ 22.322068] ret_from_fork+0x10/0x18
[ 22.325630]
[ 22.327109] Freed by task 1:
[ 22.329978] save_stack+0x28/0xc8
[ 22.333282] __kasan_slab_free+0x118/0x180
[ 22.337366] kasan_slab_free+0x10/0x18
[ 22.341106] kfree+0x110/0x2b0
[ 22.344150] dimm_attr_release+0xc/0x18
[ 22.347978] device_release+0x7c/0xe0
[ 22.351629] kobject_put+0xb0/0x180
[ 22.355106] device_unregister+0x20/0x30
[ 22.359018] edac_remove_sysfs_mci_device+0x140/0x180
[ 22.364057] edac_mc_del_mc+0x154/0x1b8
[ 22.367882] ghes_edac_unregister+0xa0/0x188
[ 22.372140] ghes_remove+0x11c/0x1f8
[ 22.375705] platform_drv_remove+0x3c/0x68
[ 22.379789] really_probe+0x174/0x548
[ 22.383440] driver_probe_device+0x7c/0x148
[ 22.387612] device_driver_attach+0x94/0xa0
[ 22.391783] __driver_attach+0xa4/0x110
[ 22.395608] bus_for_each_dev+0xe8/0x158
[ 22.399519] driver_attach+0x30/0x40
[ 22.403083] bus_add_driver+0x234/0x2f0
[ 22.406907] driver_register+0xbc/0x1d0
[ 22.410732] __platform_driver_register+0x7c/0x88
[ 22.415424] ghes_init+0xbc/0x14c
[ 22.418728] do_one_initcall+0xb4/0x254
[ 22.422553] kernel_init_freeable+0x248/0x2f4
[ 22.426898] kernel_init+0x10/0x118
[ 22.430375] ret_from_fork+0x10/0x18
[ 22.433937]
[ 22.435417] The buggy address belongs to the object at ffff00233bc69000
[ 22.435417] which belongs to the cache kmalloc-1k of size 1024
[ 22.447922] The buggy address is located 824 bytes inside of
[ 22.447922] 1024-byte region [ffff00233bc69000, ffff00233bc69400)
[ 22.459731] The buggy address belongs to the page:
[ 22.464512] page:fffffe008ccf1a00 refcount:1 mapcount:0
mapping:ffff00237080f600 index:0x0 compound_mapcount: 0
[ 22.474590] raw: 2ffff00000010200 dead000000000100 dead00000000012ge
dumped because: kasan: bad access detected
[ 22.495608]
[ 22.497087] Memory state around the buggy address:
[ 22.501867] ffff00233bc69200: fb fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb
[ 22.509076] ffff00233bc69280: fb fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb
[ 22.516286] >ffff00233bc69300: fb fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb
[ 22.523494] ^
[ 22.528534] ffff00233bc69380: fb fb fb fb fb fb fb fb fb fb fb fb fb
fb fb fb
[ 22.535744] ffff00233bc69400: fc fc fc fc fc fc fc fc fc fc fc fc fc
fc fc fc
[ 22.542952]
==================================================================
[ 22.550161] Disabling lock debugging due to kernel taint
[ 22.555511] EDAC MC: Removed device 0 for ghes_edac.c ghes_edac: DEV ghes
[ 22.564893] EDAC MC0: Giving out device to module ghes_edac.c
controller ghes_edac: DEV ghes (INTERRUPT)
[ 22.578292] GHES: APEI firmware first mode is enabled by APEI bit and
WHEA _OSC.
[ 22.586264] EINJ: Error INJection is initialized.
root@(none)$ cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff00236c273600 (size 256):
comm "swapper/0", pid 1, jiffies 4294897813 (age 177.596s)
hex dump (first 32 bytes):
00 00 c5 3b 23 00 ff ff 00 08 c5 3b 23 00 ff ff ...;#......;#...
00 10 c5 3b 23 00 ff ff 00 18 c5 3b 23 00 ff ff ...;#......;#...
backtrace:
[<000000007144931a>] __kmalloc+0x1e0/0x2c0
[<00000000ffb454a9>] edac_mc_alloc+0x31c/0x888
[<00000000f71ac8ce>] ghes_edac_register+0x1c8/0x3f0
[<00000000c9708978>] ghes_probe+0x28c/0x5f0
[<0000000082688646>] platform_drv_probe+0x70/0xd8
[<0000000040ba35c7>] really_probe+0x118/0x548
[<00000000603befc1>] driver_probe_device+0x7c/0x148
[<000000002b50a9eb>] device_driver_attach+0x94/0xa0
[<000000000d74ae48>] __driver_attach+0xa4/0x110
[<0000000080f51922>] bus_for_each_dev+0xe8/0x158
[<00000000300e9429>] driver_attach+0x30/0x40
[<00000000721f69ab>] bus_add_driver+0x234/0x2f0
[<00000000bc8fe749>] driver_register+0xbc/0x1d0
[<000000001cc8671e>] __platform_driver_register+0x7c/0x88
[<00000000324890ef>] ghes_init+0xbc/0x14c
[<00000000bbe18b33>] do_one_initcall+0xb4/0x254
unreferenced object 0xffff00233bc50000 (size 1024):
comm "swapper/0", pid 1, jiffies 4294897813 (age 177.596s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<00000000f51a8341>] kmem_cache_alloc+0x188/0x260
[<000000006c66db0a>] edac_mc_alloc+0x38c/0x888
[<00000000f71ac8ce>] ghes_edac_register+0x1c8/0x3f0
[<00000000c9708978>] ghes_probe+0x28c/0x5f0
[<0000000082688646>] platform_drv_probe+0x70/0xd8
[<0000000040ba35c7>] really_probe+0x118/0x548
[<00000000603befc1>] driver_probe_device+0x7c/0x148
[<000000002b50a9eb>] device_driver_attach+0x94/0xa0
[<000000000d74ae48>] __driver_attach+0xa4/0x110
[<0000000080f51922>] bus_for_each_dev+0xe8/0x158
[<00000000300e9429>] driver_attach+0x30/0x40
[<00000000721f69ab>] bus_add_driver+0x234/0x2f0
[<00000000bc8fe749>] driver_register+0xbc/0x1d0
[<000000001cc8671e>] __platform_driver_register+0x7c/0x88
[<00000000324890ef>] ghes_init+0xbc/0x14c
[<00000000bbe18b33>] do_one_initcall+0xb4/0x254
unreferenced object 0xffff00236daa2b00 (size 128):
comm "swapper/0", pid 1, jiffies 4294897813 (age 177.596s)
hex dump (first 32 bytes):
00 2a aa 6d 23 00 ff ff 00 00 00 00 00 00 00 00 .*.m#...........
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<000000007144931a>] __kmalloc+0x1e0/0x2c0
[<000000003b8ce7e7>] edac_mc_alloc+0x400/0x888
[<00000000f71ac8ce>] ghes_edac_register+0x1c8/0x3f0
[<00000000c9708978>] ghes_probe+0x28c/0x5f0
[<0000000082688646>] platform_drv_probe+0x70/0xd8
[<0000000040ba35c7>] really_probe+0x118/0x548
[<00000000603befc1>] driver_probe_device+0x7c/0x148
[<000000002b50a9eb>] device_driver_attach+0x94/0xa0
[<000000000d74ae48>] __driver_attach+0xa4/0x110
[<0000000080f51922>] bus_for_each_dev+0xe8/0x158
[<00000000300e9429>] driver_attach+0x30/0x40
[<00000000721f69ab>] bus_add_driver+0x234/0x2f0
[<00000000bc8fe749>] driver_register+0xbc/0x1d0
[<000000001cc8671e>] __platform_driver_register+0x7c/0x88
[<00000000324890ef>] ghes_init+0xbc/0x14c
[<00000000bbe18b33>] do_one_initcall+0xb4/0x254
unreferenced object 0xffff00236daa2a00 (size 128):
[snip]
I have test enabled:
+CONFIG_DEBUG_TEST_DRIVER_REMOVE=y
+CONFIG_KASAN=y
+CONFIG_DEBUG_KMEMLEAK=y
Cheers,
John
Hi John,
thank you for testing.
On 27.11.19 17:07:33, John Garry wrote:
> [snip]
>
> I have test enabled:
> +CONFIG_DEBUG_TEST_DRIVER_REMOVE=y
> +CONFIG_KASAN=y
> +CONFIG_DEBUG_KMEMLEAK=y
Is this a regression (did it work before?), or a new test that you
newly run?
Thanks,
-Robert
Hi Robert,
> thank you for testing.
I'm just stumbling across these, TBH.
>
> On 27.11.19 17:07:33, John Garry wrote:
>
>> [snip]
>>
>> I have test enabled:
>> +CONFIG_DEBUG_TEST_DRIVER_REMOVE=y
>> +CONFIG_KASAN=y
>> +CONFIG_DEBUG_KMEMLEAK=y
>
> Is this a regression (did it work before?), or a new test that you
> newly run?
linuxnext-20191119 does not look to have the issue - that's when I
cherry-pick your refcount fix - but has lots of memory leaks:
root@(none)$
root@(none)$ echo scan > /sys/kernel/debug/kmemleak
root@(none)$ [ 121.639978] kmemleak: 128 new suspected memory leaks
(see /sys/kernel/debug/kmemleak)
root@(none)$ cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff00236c24ba00 (size 256):
comm "swapper/0", pid 1, jiffies 4294897826 (age 107.824s)
hex dump (first 32 bytes):
00 40 2d 3c 23 00 ff ff 00 48 2d 3c 23 00 ff ff .@-<#....H-<#...
00 50 2d 3c 23 00 ff ff 00 58 2d 3c 23 00 ff ff .P-<#....X-<#...
backtrace:
[<0000000009aed8e3>] __kmalloc+0x1e0/0x2c0
[<00000000bf599427>] edac_mc_alloc+0x31c/0x888
[<00000000c070e314>] ghes_edac_register+0x15c/0x390
[<00000000e4aad1c2>] ghes_probe+0x28c/0x5f0
[<0000000079c357cb>] platform_drv_probe+0x70/0xd8
[<00000000d4ab9188>] really_probe+0x118/0x548
[<00000000763d50f1>] driver_probe_device+0x7c/0x148
[<0000000058e623c3>] device_driver_attach+0x94/0xa0
[<00000000d7cb679d>] __driver_attach+0xa4/0x110
[<000000007d0942a0>] bus_for_each_dev+0xe8/0x158
[<000000004cf734d1>] driver_attach+0x30/0x40
[<000000009aa3536e>] bus_add_driver+0x234/0x2f0
[<00000000d163cfe0>] driver_register+0xbc/0x1d0
[<000000007e4f0ac1>] __platform_driver_register+0x7c/0x88
[<00000000a63c8dd0>] ghes_init+0xbc/0x14c
[<00000000356c8a7f>] do_one_initcall+0xb4/0x254
unreferenced object 0xffff00233c2d4000 (size 1024):
comm "swapper/0", pid 1, jiffies 4294897826 (age 107.824s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<000000004945469f>] kmem_cache_alloc+0x188/0x260
[<0000000032ea779d>] edac_mc_alloc+0x38c/0x888
Unfortunately v5.4 has similar memory leaks.
Thanks,
John
On Thu, Nov 28, 2019 at 11:02:32AM +0000, John Garry wrote:
> linuxnext-20191119 does not look to have the issue - that's when I
> cherry-pick your refcount fix - but has lots of memory leaks:
Can you forget linux-next for a while and test the latest Linus master
branch?
Also, pls send your .config and full dmesg. Privately's fine too.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 27.11.19 17:07:33, John Garry wrote:
> [ 22.104498] BUG: KASAN: use-after-free in
> edac_remove_sysfs_mci_device+0x148/0x180
It is triggered in edac_remove_sysfs_mci_device().
device_unregister(&dimm->dev) not only removes the sysfs entry, it
also frees the dimm struct in dimm_attr_release(). When incrementing
the loop in mci_for_each_dimm(), the dimm struct is accessed again
which causes the use-after-free. But, the dimm struct shouln'd be
released here already.
edac_remove_sysfs_mci_device() should not release the devices at this
point. We need clean release functions for mci and dimm_info and
refcounts to protect pdev/dev mappings. And mci_for_each_dimm() must
be checked how it handles device removals and if it is safe.
Let's see how this can be fixed.
Thanks for reporting the issue.
-Robert
On 28/11/2019 21:12, Robert Richter wrote:
> On 27.11.19 17:07:33, John Garry wrote:
>> [ 22.104498] BUG: KASAN: use-after-free in
>> edac_remove_sysfs_mci_device+0x148/0x180
>
> It is triggered in edac_remove_sysfs_mci_device().
>
> device_unregister(&dimm->dev) not only removes the sysfs entry, it
> also frees the dimm struct in dimm_attr_release(). When incrementing
> the loop in mci_for_each_dimm(), the dimm struct is accessed again
> which causes the use-after-free. But, the dimm struct shouln'd be
> released here already.
>
> edac_remove_sysfs_mci_device() should not release the devices at this
> point. We need clean release functions for mci and dimm_info and
> refcounts to protect pdev/dev mappings. And mci_for_each_dimm() must
> be checked how it handles device removals and if it is safe.
>
> Let's see how this can be fixed.
>
> Thanks for reporting the issue.
Fine, and would any fix also deal with the v5.4 mem leak which I
mentioned also?
Cheers,
John
On 02.12.19 10:23:29, John Garry wrote:
> On 28/11/2019 21:12, Robert Richter wrote:
> > On 27.11.19 17:07:33, John Garry wrote:
> > > [ 22.104498] BUG: KASAN: use-after-free in
> > > edac_remove_sysfs_mci_device+0x148/0x180
> >
> > It is triggered in edac_remove_sysfs_mci_device().
> >
> > device_unregister(&dimm->dev) not only removes the sysfs entry, it
> > also frees the dimm struct in dimm_attr_release(). When incrementing
> > the loop in mci_for_each_dimm(), the dimm struct is accessed again
> > which causes the use-after-free. But, the dimm struct shouln'd be
> > released here already.
> >
> > edac_remove_sysfs_mci_device() should not release the devices at this
> > point. We need clean release functions for mci and dimm_info and
> > refcounts to protect pdev/dev mappings. And mci_for_each_dimm() must
> > be checked how it handles device removals and if it is safe.
> >
> > Let's see how this can be fixed.
> >
> > Thanks for reporting the issue.
>
> Fine, and would any fix also deal with the v5.4 mem leak which I mentioned
> also?
Yes, I have identified the leaks:
# grep edac /sys/kernel/debug/kmemleak | sort | uniq -c
1 [<000000003c0f58f9>] edac_mc_alloc+0x3bc/0x9d0 # mci->csrows
16 [<00000000bb932dc0>] edac_mc_alloc+0x49c/0x9d0 # csr->channels
16 [<00000000e2734dba>] edac_mc_alloc+0x518/0x9d0 # csr->channels[chn]
1 [<00000000eb040168>] edac_mc_alloc+0x5c8/0x9d0 # mci->dimms
34 [<00000000ef737c29>] ghes_edac_register+0x1c8/0x3f8 # see edac_mc_alloc()
Thanks,
-Robert