Message-ID: <5557A4EC.6000508@oracle.com>
Date: Sat, 16 May 2015 13:13:32 -0700
From: santosh shilimkar
Organization: Oracle Corporation
To: Ming Lei, Jens Axboe
CC: Christoph Hellwig, linux-kernel@vger.kernel.org
Subject: [Regression] Guest fs corruption with 'block: loop: improve performance via blk-mq'

Hi Ming Lei, Jens,

While doing some tests with recent kernels on a Xen server, we saw guest (DomU) disk images getting corrupted while the guests booted. Strangely, the issue is so far seen only with disk images kept on an ocfs2 volume; if the same image is kept on an EXT3/4 drive, no corruption is observed. The issue is easily reproducible: you see a flurry of errors while the guest is mounting its file systems.

After some debugging and bisecting, we narrowed the issue down to commit b5dd2f6 ("block: loop: improve performance via blk-mq"). With that commit reverted, the corruption goes away.

Some more details on the test setup:

1. Upgrade the OVM (Xen) server kernel (Dom0) to a more recent kernel which
   includes commit b5dd2f6, and boot the server.
2. On the Dom0 file system, create an ocfs2 volume.
3. Keep the guest (VM) disk image on the ocfs2 volume.
4. Boot the guest image (xm create vm.cfg).
5. Observe the VM boot console log. The VM itself uses an EXT3 fs.
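For reference, step 4 is driven by an xm domain config along these lines. This is only a minimal sketch; the guest name, memory size, bridge, and the /ocfs2/guest-disk.img path are illustrative placeholders, not taken from our actual setup:

```python
# Minimal vm.cfg sketch (xm config files use Python syntax).
# All names and paths below are illustrative placeholders.
name = "testvm"                               # hypothetical guest name
memory = 1024                                 # guest memory in MiB
bootloader = "/usr/bin/pygrub"
disk = ["file:/ocfs2/guest-disk.img,xvda,w"]  # image kept on the ocfs2 mount
vif = ["bridge=xenbr0"]
```

The important part for reproducing the issue is only that the file: disk backend points at an image stored on the ocfs2 volume.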
You will see errors like the ones below, and after such a boot that file system/disk image is corrupted and mostly won't boot next time. Trimmed guest kernel boot log:

--->
EXT3-fs (dm-0): using internal journal
EXT3-fs: barriers not enabled
kjournald starting.  Commit interval 5 seconds
EXT3-fs (xvda1): using internal journal
EXT3-fs (xvda1): mounted filesystem with ordered data mode
Adding 1048572k swap on /dev/VolGroup00/LogVol01.  Priority:-1 extents:1 across:1048572k
[...]
EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 804966: bad block 843250
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620385
JBD: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk of filesystem corruption in case of system crash.
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620394
[...]
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620385
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620394
[...]
EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #777661: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
[...]
automount[2605]: segfault at 4 ip b7756dd6 sp b6ba8ab0 error 4 in ld-2.5.so[b774c000+1b000]
EXT3-fs error (device dm-0): ext3_valid_block_bitmap: Invalid block bitmap - block_group = 34, block = 1114112
EXT3-fs error (device dm-0): ext3_valid_block_bitmap: Invalid block bitmap - block_group = 0, block = 221
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 709252: bad block 370280
ntpd[2691]: segfault at 2563352a ip b77e5000 sp bfe27cec error 6 in ntpd[b777d000+74000]
EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory #618360: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #709178: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 368277: bad block 372184
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620393
--------------------

From comparing the actual data on disk vs. what is read by the guest VM, we suspect the *reads* are not actually going all the way to disk and are possibly returning wrong data: the actual data on the ocfs2 volume at those locations is non-zero, whereas the guest reads it back as zero.

I tried a few experiments without much success so far. One thing I suspected was that requests are now submitted to the backend file/device concurrently, so I tried to move them under lo->lo_lock so that they get serialized. I also moved blk_mq_start_request() inside the actual work, as in the patch below. But it didn't help. I thought of reporting the issue to get more ideas on what could be going wrong.
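The on-disk vs. guest-visible comparison above was done by hand; a small helper along these lines captures the check (the function name and usage are hypothetical, and in practice one side would be the backing image on the ocfs2 volume and the other a dump of the same byte range taken from inside the guest):

```python
def region_matches(path_a, path_b, offset, length):
    """Return True if `length` bytes at byte `offset` are identical in
    both files. Used to spot regions the guest reads back as zeros while
    the backing file on the ocfs2 volume actually holds non-zero data."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        fa.seek(offset)
        fb.seek(offset)
        return fa.read(length) == fb.read(length)
```

With corruption present, a region that holds real data in the backing file compares unequal against the zero-filled bytes the guest observed at the same offset.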
Thanks for the help in advance!

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 39a83c2..22713b2 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1480,20 +1480,17 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 		const struct blk_mq_queue_data *bd)
 {
 	struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
+	struct loop_device *lo = cmd->rq->q->queuedata;
 
-	blk_mq_start_request(bd->rq);
-
+	spin_lock_irq(&lo->lo_lock);
 	if (cmd->rq->cmd_flags & REQ_WRITE) {
-		struct loop_device *lo = cmd->rq->q->queuedata;
 		bool need_sched = true;
 
-		spin_lock_irq(&lo->lo_lock);
 		if (lo->write_started)
 			need_sched = false;
 		else
 			lo->write_started = true;
 		list_add_tail(&cmd->list, &lo->write_cmd_head);
-		spin_unlock_irq(&lo->lo_lock);
 
 		if (need_sched)
 			queue_work(loop_wq, &lo->write_work);
@@ -1501,6 +1498,7 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
 		queue_work(loop_wq, &cmd->read_work);
 	}
 
+	spin_unlock_irq(&lo->lo_lock);
 	return BLK_MQ_RQ_QUEUE_OK;
 }
 
@@ -1517,6 +1515,8 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
 	if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY))
 		goto failed;
 
+	blk_mq_start_request(cmd->rq);
+
 	ret = 0;
 	__rq_for_each_bio(bio, cmd->rq)
 		ret |= loop_handle_bio(lo, bio);
--
1.7.1

Regards,
Santosh
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/