From: Ming Lei
To: santosh shilimkar
Cc: Jens Axboe, Christoph Hellwig, Linux Kernel Mailing List
Subject: Re: [Regression] Guest fs corruption with 'block: loop: improve performance via blk-mq'
Date: Mon, 18 May 2015 09:26:49 +0800
In-Reply-To: <5557A4EC.6000508@oracle.com>

Hi Santosh,

Thanks for your report!

On Sun, May 17, 2015 at 4:13 AM, santosh shilimkar wrote:
> Hi Ming Lei, Jens,
>
> While doing a few tests with recent kernels on Xen Server, we saw
> guest (DOMU) disk images getting corrupted while booting. Strangely,
> the issue has so far been seen only with a disk image on an ocfs2
> volume; if the same image is kept on an EXT3/4 drive, no corruption
> is observed. The issue is easily reproducible: you see a flurry of
> errors while the guest is mounting its file systems.
>
> After some debugging and bisection, we narrowed the issue down to
> commit "b5dd2f6 block: loop: improve performance via blk-mq". With
> that commit reverted, the corruption goes away.
>
> Some more details on the test setup:
> 1. Upgrade the OVM (Xen) Server kernel (DOM0) to a more recent
>    kernel which includes commit b5dd2f6. Boot the server.
> 2. On the DOM0 file system, create an ocfs2 volume.
> 3. Keep the guest (VM) disk image on the ocfs2 volume.
> 4. Boot the guest image. (xm create vm.cfg)

I am not familiar with Xen, so: is the image accessed via a loop
block device inside the guest VM?
Is the loop block device created in DOM0 or in the guest VM?

> 5. Observe the VM boot console log. The VM itself uses an EXT3 fs.
>    You will see errors like those below, and after this boot the
>    file system/disk image gets corrupted and mostly won't boot next
>    time.

OK, that means the image is corrupted by the VM booting.

> Trimmed guest kernel boot log...
> --->
> EXT3-fs (dm-0): using internal journal
> EXT3-fs: barriers not enabled
> kjournald starting. Commit interval 5 seconds
> EXT3-fs (xvda1): using internal journal
> EXT3-fs (xvda1): mounted filesystem with ordered data mode
> Adding 1048572k swap on /dev/VolGroup00/LogVol01. Priority:-1 extents:1 across:1048572k
>
> [...]
>
> EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 804966: bad block 843250
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620385
> JBD: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620394
>
> [...]
>
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620385
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620394
>
> [...]
>
> EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #777661: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
>
> [...]
> automount[2605]: segfault at 4 ip b7756dd6 sp b6ba8ab0 error 4 in ld-2.5.so[b774c000+1b000]
> EXT3-fs error (device dm-0): ext3_valid_block_bitmap: Invalid block bitmap - block_group = 34, block = 1114112
> EXT3-fs error (device dm-0): ext3_valid_block_bitmap: Invalid block bitmap - block_group = 0, block = 221
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
> EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 709252: bad block 370280
> ntpd[2691]: segfault at 2563352a ip b77e5000 sp bfe27cec error 6 in ntpd[b777d000+74000]
> EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory #618360: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #709178: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 368277: bad block 372184
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620393
> --------------------
>
> From debugging the actual data on disk vs. what is read by the
> guest VM, we suspect the *reads* are not actually going all the way
> to disk and are possibly returning wrong data: the actual data on
> the ocfs2 volume at those locations seems to be non-zero, whereas
> the guest seems to read it as zero.

The two big changes in the patchset are:

1) use blk-mq request-based I/O;
2) submit I/O concurrently (write vs. write is still serialized).

Could you apply the patch at the link below and see if it fixes the
issue? Note that this patch only removes the concurrent submission:

http://marc.info/?t=143093223200004&r=1&w=2

> I tried a few experiments without much success so far.
> One of the things I suspected was that requests are now submitted
> to the backend file/device concurrently, so I tried moving them
> under lo->lo_lock so that they get serialized. I also moved the
> blk_mq_start_request() inside the actual work, as in the patch
> below, but it didn't help. Thought of reporting the issue to get
> more ideas on what could be going wrong. Thanks for the help in
> advance!
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 39a83c2..22713b2 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1480,20 +1480,17 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		const struct blk_mq_queue_data *bd)
>  {
>  	struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
> +	struct loop_device *lo = cmd->rq->q->queuedata;
>
> -	blk_mq_start_request(bd->rq);
> -
> +	spin_lock_irq(&lo->lo_lock);
>  	if (cmd->rq->cmd_flags & REQ_WRITE) {
> -		struct loop_device *lo = cmd->rq->q->queuedata;
>  		bool need_sched = true;
>
> -		spin_lock_irq(&lo->lo_lock);
>  		if (lo->write_started)
>  			need_sched = false;
>  		else
>  			lo->write_started = true;
>  		list_add_tail(&cmd->list, &lo->write_cmd_head);
> -		spin_unlock_irq(&lo->lo_lock);
>
>  		if (need_sched)
>  			queue_work(loop_wq, &lo->write_work);
> @@ -1501,6 +1498,7 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
>  		queue_work(loop_wq, &cmd->read_work);
>  	}
>
> +	spin_unlock_irq(&lo->lo_lock);
>  	return BLK_MQ_RQ_QUEUE_OK;
>  }
>
> @@ -1517,6 +1515,8 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
>  	if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY))
>  		goto failed;
>
> +	blk_mq_start_request(cmd->rq);
> +
>  	ret = 0;
>  	__rq_for_each_bio(bio, cmd->rq)
>  		ret |= loop_handle_bio(lo, bio);
> --

I don't see that the above change is necessary.

Thanks,
Ming