2019-03-04 13:55:44

by Chen, Rong A

[permalink] [raw]
Subject: [LKP] [workqueue] 4d43d395fe: WARNING:at_kernel/workqueue.c:#__flush_work

FYI, we noticed the following commit (built with gcc-7):

commit: 4d43d395fed124631ca02356c711facb90185175 ("workqueue: Try to catch flush_work() without INIT_WORK().")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

in testcase: blktests
with following parameters:

disk: 1SSD
test: nvme-005



on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 4G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+---------------------------------------------+------------+------------+
| | d73aba1115 | 4d43d395fe |
+---------------------------------------------+------------+------------+
| boot_successes | 11 | 15 |
| boot_failures | 0 | 4 |
| WARNING:at_kernel/workqueue.c:#__flush_work | 0 | 4 |
| RIP:__flush_work | 0 | 4 |
+---------------------------------------------+------------+------------+



[ 62.446979] WARNING: CPU: 0 PID: 169 at kernel/workqueue.c:2911 __flush_work+0x1eb/0x200
[ 62.450153] Modules linked in: nvme_loop nvme_fabrics nvmet nvme_core loop sr_mod cdrom sd_mod sg crct10dif_pclmul crc32_pclmul ppdev crc32c_intel ata_generic pata_acpi ghash_clmulni_intel snd_pcm snd_timer aesni_intel crypto_simd snd cryptd glue_helper soundcore i2c_piix4 pcspkr virtio_scsi ata_piix serio_raw libata parport_pc floppy parport ip_tables
[ 62.461143] CPU: 0 PID: 169 Comm: kworker/u4:2 Not tainted 5.0.0-rc3-00131-g4d43d39 #1
[ 62.463444] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 62.465843] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop]
[ 62.467769] RIP: 0010:__flush_work+0x1eb/0x200
[ 62.469176] Code: 66 66 90 65 ff 0d b5 de f6 7e 31 c0 eb a2 8b 4d 00 48 8b 55 08 83 e1 08 48 0f ba 6d 00 03 80 c9 f0 e9 58 ff ff ff 0f 0b eb 86 <0f> 0b 31 c0 eb 80 e8 0a 0d fe ff 66 2e 0f 1f 84 00 00 00 00 00 66
[ 62.475512] RSP: 0000:ffffc9000093fcf8 EFLAGS: 00010246
[ 62.477040] RAX: 0000000000000000 RBX: ffff888077f9c148 RCX: 0000000000000000
[ 62.478944] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff888077f9c148
[ 62.480945] RBP: ffff888077f9c148 R08: ffff88812ec00000 R09: ffff88812ec00028
[ 62.483203] R10: 0000000000000000 R11: ffffffff8265eef8 R12: 0000000000000000
[ 62.485841] R13: ffffc9000093fd90 R14: ffffffff810a49a0 R15: 0000000000000000
[ 62.488334] FS: 0000000000000000(0000) GS:ffff88813fc00000(0000) knlGS:0000000000000000
[ 62.491077] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 62.492692] CR2: 0000558ad08acff0 CR3: 000000007f398000 CR4: 00000000000006f0
[ 62.494662] Call Trace:
[ 62.495566] ? del_timer+0x53/0x80
[ 62.496668] __cancel_work_timer+0x12b/0x1b0
[ 62.497961] ? wake_up_q+0x80/0x80
[ 62.499361] ? pcpu_free_area+0x270/0x2e0
[ 62.501054] nvmet_ctrl_free+0x10f/0x1b0 [nvmet]
[ 62.502767] ? free_percpu+0x7c/0x1c0
[ 62.504193] nvmet_sq_destroy+0x74/0x110 [nvmet]
[ 62.505566] nvme_loop_destroy_admin_queue+0x20/0x40 [nvme_loop]
[ 62.507223] nvme_loop_reset_ctrl_work+0x2a/0xe0 [nvme_loop]
[ 62.508876] process_one_work+0x19c/0x3b0
[ 62.510167] worker_thread+0x1e5/0x3b0
[ 62.511357] ? process_one_work+0x3b0/0x3b0
[ 62.512890] kthread+0x11e/0x140
[ 62.514273] ? kthread_park+0x90/0x90
[ 62.515735] ret_from_fork+0x35/0x40
[ 62.517157] ---[ end trace d741a57df27dcf91 ]---


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



Thanks,
lkp


Attachments:
(No filename) (3.72 kB)
config-5.0.0-rc3-00131-g4d43d39 (171.79 kB)
job-script (5.40 kB)
dmesg.xz (23.27 kB)
blktests (2.75 kB)
Download all attachments

2019-03-04 13:58:04

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [LKP] [workqueue] 4d43d395fe: WARNING:at_kernel/workqueue.c:#__flush_work

Forwarding to NMVE people:

kernel test robot found that
flush_work(&ctrl->async_event_work) is called from nvmet_ctrl_free()
without INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work)
after ctrl was allocated (probably initialized with 0).
Will you make sure that INIT_WORK() is always called?

On 2019/03/04 22:41, kernel test robot wrote:
> FYI, we noticed the following commit (built with gcc-7):
>
> commit: 4d43d395fed124631ca02356c711facb90185175 ("workqueue: Try to catch flush_work() without INIT_WORK().")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>
> in testcase: blktests
> with following parameters:
>
> disk: 1SSD
> test: nvme-005
>
>
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 4G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
> +---------------------------------------------+------------+------------+
> | | d73aba1115 | 4d43d395fe |
> +---------------------------------------------+------------+------------+
> | boot_successes | 11 | 15 |
> | boot_failures | 0 | 4 |
> | WARNING:at_kernel/workqueue.c:#__flush_work | 0 | 4 |
> | RIP:__flush_work | 0 | 4 |
> +---------------------------------------------+------------+------------+
>
>
>
> [ 62.446979] WARNING: CPU: 0 PID: 169 at kernel/workqueue.c:2911 __flush_work+0x1eb/0x200
> [ 62.450153] Modules linked in: nvme_loop nvme_fabrics nvmet nvme_core loop sr_mod cdrom sd_mod sg crct10dif_pclmul crc32_pclmul ppdev crc32c_intel ata_generic pata_acpi ghash_clmulni_intel snd_pcm snd_timer aesni_intel crypto_simd snd cryptd glue_helper soundcore i2c_piix4 pcspkr virtio_scsi ata_piix serio_raw libata parport_pc floppy parport ip_tables
> [ 62.461143] CPU: 0 PID: 169 Comm: kworker/u4:2 Not tainted 5.0.0-rc3-00131-g4d43d39 #1
> [ 62.463444] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [ 62.465843] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop]
> [ 62.467769] RIP: 0010:__flush_work+0x1eb/0x200
> [ 62.469176] Code: 66 66 90 65 ff 0d b5 de f6 7e 31 c0 eb a2 8b 4d 00 48 8b 55 08 83 e1 08 48 0f ba 6d 00 03 80 c9 f0 e9 58 ff ff ff 0f 0b eb 86 <0f> 0b 31 c0 eb 80 e8 0a 0d fe ff 66 2e 0f 1f 84 00 00 00 00 00 66
> [ 62.475512] RSP: 0000:ffffc9000093fcf8 EFLAGS: 00010246
> [ 62.477040] RAX: 0000000000000000 RBX: ffff888077f9c148 RCX: 0000000000000000
> [ 62.478944] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff888077f9c148
> [ 62.480945] RBP: ffff888077f9c148 R08: ffff88812ec00000 R09: ffff88812ec00028
> [ 62.483203] R10: 0000000000000000 R11: ffffffff8265eef8 R12: 0000000000000000
> [ 62.485841] R13: ffffc9000093fd90 R14: ffffffff810a49a0 R15: 0000000000000000
> [ 62.488334] FS: 0000000000000000(0000) GS:ffff88813fc00000(0000) knlGS:0000000000000000
> [ 62.491077] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 62.492692] CR2: 0000558ad08acff0 CR3: 000000007f398000 CR4: 00000000000006f0
> [ 62.494662] Call Trace:
> [ 62.495566] ? del_timer+0x53/0x80
> [ 62.496668] __cancel_work_timer+0x12b/0x1b0
> [ 62.497961] ? wake_up_q+0x80/0x80
> [ 62.499361] ? pcpu_free_area+0x270/0x2e0
> [ 62.501054] nvmet_ctrl_free+0x10f/0x1b0 [nvmet]
> [ 62.502767] ? free_percpu+0x7c/0x1c0
> [ 62.504193] nvmet_sq_destroy+0x74/0x110 [nvmet]
> [ 62.505566] nvme_loop_destroy_admin_queue+0x20/0x40 [nvme_loop]
> [ 62.507223] nvme_loop_reset_ctrl_work+0x2a/0xe0 [nvme_loop]
> [ 62.508876] process_one_work+0x19c/0x3b0
> [ 62.510167] worker_thread+0x1e5/0x3b0
> [ 62.511357] ? process_one_work+0x3b0/0x3b0
> [ 62.512890] kthread+0x11e/0x140
> [ 62.514273] ? kthread_park+0x90/0x90
> [ 62.515735] ret_from_fork+0x35/0x40
> [ 62.517157] ---[ end trace d741a57df27dcf91 ]---
>
>
> To reproduce:
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
>
>
>
> Thanks,
> lkp
>


2019-03-04 18:50:06

by James Smart

[permalink] [raw]
Subject: Re: [LKP] [workqueue] 4d43d395fe: WARNING:at_kernel/workqueue.c:#__flush_work

On 3/4/2019 10:21 AM, Sagi Grimberg wrote:
>
>> Forwarding to NMVE people:
>>
>> kernel test robot found that
>> flush_work(&ctrl->async_event_work) is called from nvmet_ctrl_free()
>> without INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work)
>> after ctrl was allocated (probably initialized with 0).
>> Will you make sure that INIT_WORK() is always called?
>
> I cannot reproduce this issue. When following the code I don't
> immediately see how this can happen.. Was there something special
> in this specific test run? Is it 100% reproduce-able?

I agree. INIT_WORK is setup as almost one of the first items for a new
controller. Smells more like a double-free or a corrupt ctrl struct
from the transport.

-- james


2019-03-04 19:23:17

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [LKP] [workqueue] 4d43d395fe: WARNING:at_kernel/workqueue.c:#__flush_work


> Forwarding to NMVE people:
>
> kernel test robot found that
> flush_work(&ctrl->async_event_work) is called from nvmet_ctrl_free()
> without INIT_WORK(&ctrl->async_event_work, nvmet_async_event_work)
> after ctrl was allocated (probably initialized with 0).
> Will you make sure that INIT_WORK() is always called?

I cannot reproduce this issue. When following the code I don't
immediately see how this can happen.. Was there something special
in this specific test run? Is it 100% reproduce-able?

>
> On 2019/03/04 22:41, kernel test robot wrote:
>> FYI, we noticed the following commit (built with gcc-7):
>>
>> commit: 4d43d395fed124631ca02356c711facb90185175 ("workqueue: Try to catch flush_work() without INIT_WORK().")
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>>
>> in testcase: blktests
>> with following parameters:
>>
>> disk: 1SSD
>> test: nvme-005
>>
>>
>>
>> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 4G
>>
>> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>>
>>
>> +---------------------------------------------+------------+------------+
>> | | d73aba1115 | 4d43d395fe |
>> +---------------------------------------------+------------+------------+
>> | boot_successes | 11 | 15 |
>> | boot_failures | 0 | 4 |
>> | WARNING:at_kernel/workqueue.c:#__flush_work | 0 | 4 |
>> | RIP:__flush_work | 0 | 4 |
>> +---------------------------------------------+------------+------------+
>>
>>
>>
>> [ 62.446979] WARNING: CPU: 0 PID: 169 at kernel/workqueue.c:2911 __flush_work+0x1eb/0x200
>> [ 62.450153] Modules linked in: nvme_loop nvme_fabrics nvmet nvme_core loop sr_mod cdrom sd_mod sg crct10dif_pclmul crc32_pclmul ppdev crc32c_intel ata_generic pata_acpi ghash_clmulni_intel snd_pcm snd_timer aesni_intel crypto_simd snd cryptd glue_helper soundcore i2c_piix4 pcspkr virtio_scsi ata_piix serio_raw libata parport_pc floppy parport ip_tables
>> [ 62.461143] CPU: 0 PID: 169 Comm: kworker/u4:2 Not tainted 5.0.0-rc3-00131-g4d43d39 #1
>> [ 62.463444] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
>> [ 62.465843] Workqueue: nvme-reset-wq nvme_loop_reset_ctrl_work [nvme_loop]
>> [ 62.467769] RIP: 0010:__flush_work+0x1eb/0x200
>> [ 62.469176] Code: 66 66 90 65 ff 0d b5 de f6 7e 31 c0 eb a2 8b 4d 00 48 8b 55 08 83 e1 08 48 0f ba 6d 00 03 80 c9 f0 e9 58 ff ff ff 0f 0b eb 86 <0f> 0b 31 c0 eb 80 e8 0a 0d fe ff 66 2e 0f 1f 84 00 00 00 00 00 66
>> [ 62.475512] RSP: 0000:ffffc9000093fcf8 EFLAGS: 00010246
>> [ 62.477040] RAX: 0000000000000000 RBX: ffff888077f9c148 RCX: 0000000000000000
>> [ 62.478944] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff888077f9c148
>> [ 62.480945] RBP: ffff888077f9c148 R08: ffff88812ec00000 R09: ffff88812ec00028
>> [ 62.483203] R10: 0000000000000000 R11: ffffffff8265eef8 R12: 0000000000000000
>> [ 62.485841] R13: ffffc9000093fd90 R14: ffffffff810a49a0 R15: 0000000000000000
>> [ 62.488334] FS: 0000000000000000(0000) GS:ffff88813fc00000(0000) knlGS:0000000000000000
>> [ 62.491077] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 62.492692] CR2: 0000558ad08acff0 CR3: 000000007f398000 CR4: 00000000000006f0
>> [ 62.494662] Call Trace:
>> [ 62.495566] ? del_timer+0x53/0x80
>> [ 62.496668] __cancel_work_timer+0x12b/0x1b0
>> [ 62.497961] ? wake_up_q+0x80/0x80
>> [ 62.499361] ? pcpu_free_area+0x270/0x2e0
>> [ 62.501054] nvmet_ctrl_free+0x10f/0x1b0 [nvmet]
>> [ 62.502767] ? free_percpu+0x7c/0x1c0
>> [ 62.504193] nvmet_sq_destroy+0x74/0x110 [nvmet]
>> [ 62.505566] nvme_loop_destroy_admin_queue+0x20/0x40 [nvme_loop]
>> [ 62.507223] nvme_loop_reset_ctrl_work+0x2a/0xe0 [nvme_loop]
>> [ 62.508876] process_one_work+0x19c/0x3b0
>> [ 62.510167] worker_thread+0x1e5/0x3b0
>> [ 62.511357] ? process_one_work+0x3b0/0x3b0
>> [ 62.512890] kthread+0x11e/0x140
>> [ 62.514273] ? kthread_park+0x90/0x90
>> [ 62.515735] ret_from_fork+0x35/0x40
>> [ 62.517157] ---[ end trace d741a57df27dcf91 ]---
>>
>>
>> To reproduce:
>>
>> git clone https://github.com/intel/lkp-tests.git
>> cd lkp-tests
>> bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
>>
>>
>>
>> Thanks,
>> lkp
>>
>