From: Jens Axboe
Date: Sat, 18 Apr 2015 14:30:48 -0600
To: Ming Lei, Dongsu Park
Cc: Linux Kernel Mailing List, Christoph Hellwig
Subject: Re: panic with CPU hotplug + blk-mq + scsi-mq

On 04/17/2015 10:23 PM, Ming Lei wrote:
> Hi Dongsu,
>
> On Fri, Apr 17, 2015 at 5:41 AM, Dongsu Park wrote:
>> Hi,
>>
>> there's a critical bug involving CPU hotplug, blk-mq, and scsi-mq.
>> Every time a CPU is offlined, some arbitrary range of kernel memory
>> seems to get corrupted. Then after a while, the kernel panics at
>> random places when block IOs are issued. (for example, see the call
>> traces below)
>
> Thanks for the report.
>
>> This bug is easily reproducible with a Qemu VM running virtio-scsi,
>> when its guest kernel is 3.19-rc1 or higher and scsi-mq is loaded
>> with blk-mq enabled. And yes, the 4.0 release is still affected, as
>> is Jens' for-4.1/core. How to reproduce:
>>
>>    # echo 0 > /sys/devices/system/cpu/cpu1/online
>>    (and issue some block IOs, that's it.)
>>
>> Bisecting between 3.18 and 3.19-rc1, it looks like this bug had been
>> hidden until commit ccbedf117f01 ("virtio_scsi: support multi hw
>> queue of blk-mq"), which started to allow virtio-scsi to map
>> virtqueues to hardware queues of blk-mq. Reverting that commit makes
>> the bug go away, but I suppose reverting it would not be a correct
>> solution.
>
> I agree, and that patch only enables multiple hw queues.
>
>> More precisely, every time a CPU hotplug event gets triggered, the
>> call graph looks like the following:
>>
>>    blk_mq_queue_reinit_notify()
>>    -> blk_mq_queue_reinit()
>>    -> blk_mq_map_swqueue()
>>    -> blk_mq_free_rq_map()
>>    -> scsi_exit_request()
>>
>> From that point, as soon as any address in the request gets modified,
>> an arbitrary range of memory gets corrupted. My first guess was that
>> the exit routine might try to deallocate tags->rqs[] entries holding
>> invalid addresses. But that does not appear to be the case, and
>> cmd->sense_buffer also looks valid. It's not obvious to me exactly
>> what could go wrong.
>>
>> Does anyone have an idea?
>
> As far as I can see, at least two problems exist:
>
> - a race between timeout handling and CPU hotplug
> - in the shared-tags case, how hctx->tags is set and checked during
>   CPU online handling
>
> So could you please test the two attached patches and see if they fix
> your issue?
>
> I ran them in my VM, and it looks like the oops does disappear.

Hard to comment on your patches directly when they are attached. Both
look good to me. I'd perhaps change the ->tags check in #1 to use
blk_mq_hw_queue_mapped() instead of checking ->tags directly. It might
even be worth changing the normal iterator to skip unmapped queues, but
that can be left for a later change.
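Something like the below is what I have in mind for #1 (an untested
sketch against the timeout path in mainline's blk_mq_rq_timer();
blk_mq_check_expired and data stand in for the existing locals in that
function, and none of this is meant to replace the attached patch):

	queue_for_each_hw_ctx(q, hctx, i) {
		/*
		 * If no software queues are currently mapped to this
		 * hardware queue, there's nothing to check. The helper
		 * tests both hctx->nr_ctx and hctx->tags, so it
		 * subsumes a direct ->tags check and documents why
		 * the hctx is being skipped.
		 */
		if (!blk_mq_hw_queue_mapped(hctx))
			continue;

		blk_mq_tag_busy_iter(hctx, blk_mq_check_expired, &data);
	}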
-- 
Jens Axboe