2015-05-16 20:13:45

by Santosh Shilimkar

Subject: [Regression] Guest fs corruption with 'block: loop: improve performance via blk-mq'

Hi Ming Lei, Jens,

While doing a few tests with recent kernels on Xen Server,
we saw guest (DOMU) disk images getting corrupted while booting.
Strangely, the issue is so far seen only with a disk image on an
ocfs2 volume. If the same image is kept on an EXT3/4 drive, no
corruption is observed. The issue is easily reproducible: you see
a flurry of errors while the guest is mounting its file systems.

After some debugging and bisecting, we narrowed the issue down to
commit b5dd2f6 ("block: loop: improve performance via blk-mq").
With that commit reverted, the corruption goes away.

Some more details on the test setup:
1. Upgrade the OVM (Xen) server kernel (DOM0) to a more recent
kernel which includes commit b5dd2f6. Boot the server.
2. On the DOM0 file system, create an ocfs2 volume.
3. Keep the guest (VM) disk image on the ocfs2 volume.
4. Boot the guest image (xm create vm.cfg).
5. Observe the VM boot console log. The VM itself uses an EXT3 fs.
You will see errors like those below, and after this boot the file
system/disk image gets corrupted and mostly won't boot next time.

Trimmed Guest kernel boot log...
--->
EXT3-fs (dm-0): using internal journal
EXT3-fs: barriers not enabled
kjournald starting. Commit interval 5 seconds
EXT3-fs (xvda1): using internal journal
EXT3-fs (xvda1): mounted filesystem with ordered data mode
Adding 1048572k swap on /dev/VolGroup00/LogVol01. Priority:-1 extents:1
across:1048572k

[...]

EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 804966: bad
block 843250
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620385
JBD: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a
risk of filesystem corruption in case of system crash.
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620394

[...]

EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620385
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620394

[...]

EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory
#777661: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0,
name_len=0

[...]

automount[2605]: segfault at 4 ip b7756dd6 sp b6ba8ab0 error 4 in
ld-2.5.so[b774c000+1b000]
EXT3-fs error (device dm-0): ext3_valid_block_bitmap: Invalid block
bitmap - block_group = 34, block = 1114112
EXT3-fs error (device dm-0): ext3_valid_block_bitmap: Invalid block
bitmap - block_group = 0, block = 221
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 709252: bad
block 370280
ntpd[2691]: segfault at 2563352a ip b77e5000 sp bfe27cec error 6 in
ntpd[b777d000+74000]
EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in
directory #618360: rec_len is smaller than minimal - offset=0, inode=0,
rec_len=0, name_len=0
EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory
#709178: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0,
name_len=0
EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 368277: bad
block 372184
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620393
--------------------

From comparing the actual data on the disk with what is read by
the guest VM, we suspect the *reads* are not actually going all
the way to disk and are possibly returning wrong data: the actual
data on the ocfs2 volume at those locations is non-zero, whereas
the guest reads it as zero.
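A small userspace helper makes that comparison concrete. This is a hypothetical debugging aid (file names and offsets are whatever you captured, nothing from this report): pread() the same byte range from the backing image and from a dump of the guest's view, and report the first differing byte.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Compare 'len' bytes at offset 'off' in two files.
 * Returns 0 if identical, 1 on a mismatch (first one printed),
 * -1 on an open/read error.
 */
int cmp_range(const char *a, const char *b, off_t off, size_t len)
{
	int fa = open(a, O_RDONLY), fb = open(b, O_RDONLY);
	unsigned char *ba = malloc(len), *bb = malloc(len);
	int ret = 0;

	if (fa < 0 || fb < 0 || !ba || !bb)
		ret = -1;
	else if (pread(fa, ba, len, off) != (ssize_t)len ||
		 pread(fb, bb, len, off) != (ssize_t)len)
		ret = -1;

	for (size_t i = 0; ret == 0 && i < len; i++) {
		if (ba[i] != bb[i]) {
			printf("mismatch at byte %lld: %02x vs %02x\n",
			       (long long)(off + i), ba[i], bb[i]);
			ret = 1;
		}
	}

	if (fa >= 0) close(fa);
	if (fb >= 0) close(fb);
	free(ba);
	free(bb);
	return ret;
}
```

Running this over the suspect blocks of the ocfs2-backed image versus a dump of what the guest read would confirm whether the guest saw zeroes where the backing file is non-zero.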

I tried a few experiments without much success so far. One thing
I suspected was that requests are now submitted to the backing
file/device concurrently, so I tried to move them under lo->lo_lock
so that they get serialized. I also moved blk_mq_start_request()
inside the actual work, as in the patch below. But it didn't help.
I thought I'd report the issue to get more ideas on what could be
going wrong. Thanks for the help in advance!

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 39a83c2..22713b2 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1480,20 +1480,17 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
{
struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
+ struct loop_device *lo = cmd->rq->q->queuedata;

- blk_mq_start_request(bd->rq);
-
+ spin_lock_irq(&lo->lo_lock);
if (cmd->rq->cmd_flags & REQ_WRITE) {
- struct loop_device *lo = cmd->rq->q->queuedata;
bool need_sched = true;

- spin_lock_irq(&lo->lo_lock);
if (lo->write_started)
need_sched = false;
else
lo->write_started = true;
list_add_tail(&cmd->list, &lo->write_cmd_head);
- spin_unlock_irq(&lo->lo_lock);

if (need_sched)
queue_work(loop_wq, &lo->write_work);
@@ -1501,6 +1498,7 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
queue_work(loop_wq, &cmd->read_work);
}

+ spin_unlock_irq(&lo->lo_lock);
return BLK_MQ_RQ_QUEUE_OK;
}

@@ -1517,6 +1515,8 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY))
goto failed;

+ blk_mq_start_request(cmd->rq);
+
ret = 0;
__rq_for_each_bio(bio, cmd->rq)
ret |= loop_handle_bio(lo, bio);
--
1.7.1

Regards,
Santosh


2015-05-18 01:27:01

by Ming Lei

Subject: Re: [Regression] Guest fs corruption with 'block: loop: improve performance via blk-mq'

Hi Santosh,

Thanks for your report!

On Sun, May 17, 2015 at 4:13 AM, santosh shilimkar
<[email protected]> wrote:
> Hi Ming Lei, Jens,
>
> While doing few tests with recent kernels with Xen Server,
> we saw guests(DOMU) disk image getting corrupted while booting it.
> Strangely the issue is seen so far only with disk image over ocfs2
> volume. If the same image kept on the EXT3/4 drive, no corruption
> is observed. The issue is easily reproducible. You see the flurry
> of errors while guest is mounting the file systems.
>
> After doing some debug and bisects, we zeroed down the issue with
> commit "b5dd2f6 block: loop: improve performance via blk-mq". With
> that commit reverted the corruption goes away.
>
> Some more details on the test setup:
> 1. OVM(XEN) Server kernel(DOM0) upgraded to more recent kernel
> which includes commit b5dd2f6. Boot the Server.
> 2. On DOM0 file system create a ocfs2 volume
> 3. Keep the Guest(VM) disk image on ocfs2 volume.
> 4. Boot guest image. (xm create vm.cfg)

I am not familiar with Xen, so is the image accessed via a loop
block device inside the guest VM? Is the loop device created
in DOM0 or in the guest VM?

> 5. Observe the VM boot console log. VM itself use the EXT3 fs.
> You will see errors like below and after this boot, that file
> system/disk-image gets corrupted and mostly won't boot next time.

OK, that means the image is corrupted by VM booting.

>
> Trimmed Guest kernel boot log...
> --->
> EXT3-fs (dm-0): using internal journal
> EXT3-fs: barriers not enabled
> kjournald starting. Commit interval 5 seconds
> EXT3-fs (xvda1): using internal journal
> EXT3-fs (xvda1): mounted filesystem with ordered data mode
> Adding 1048572k swap on /dev/VolGroup00/LogVol01. Priority:-1 extents:1
> across:1048572k
>
> [...]
>
> EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 804966: bad block
> 843250
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620385
> JBD: Spotted dirty metadata buffer (dev = dm-0, blocknr = 0). There's a risk
> of filesystem corruption in case of system crash.
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620394
>
> [...]
>
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620385
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620394
>
> [...]
>
> EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #777661:
> rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
>
> [...]
>
> automount[2605]: segfault at 4 ip b7756dd6 sp b6ba8ab0 error 4 in
> ld-2.5.so[b774c000+1b000]
> EXT3-fs error (device dm-0): ext3_valid_block_bitmap: Invalid block bitmap -
> block_group = 34, block = 1114112
> EXT3-fs error (device dm-0): ext3_valid_block_bitmap: Invalid block bitmap -
> block_group = 0, block = 221
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 589841
> EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 709252: bad block
> 370280
> ntpd[2691]: segfault at 2563352a ip b77e5000 sp bfe27cec error 6 in
> ntpd[b777d000+74000]
> EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory
> #618360: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0,
> name_len=0
> EXT3-fs error (device dm-0): ext3_add_entry: bad entry in directory #709178:
> rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
> EXT3-fs error (device dm-0): ext3_xattr_block_get: inode 368277: bad block
> 372184
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620392
> EXT3-fs error (device dm-0): ext3_lookup: deleted inode referenced: 620393
> --------------------
>
> From the debug of the actual data on the disk vs what is read by
> the guest VM, we suspect the *reads* are actually not going all
> the way to disk and possibly returning the wrong data. Because
> the actual data on ocfs2 volume at those locations seems
> to be non-zero where as the guest seems to be read it as zero.

Two big changes in the patchset are: 1) use blk-mq request-based
I/O; 2) submit I/O concurrently (write vs. write is still serialized).

Could you apply the patch at the link below to see if it fixes the
issue? Note that this patch only removes the concurrent submission.

http://marc.info/?t=143093223200004&r=1&w=2
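The serialization that patch restores can be illustrated in userspace pthreads (a rough analogy with invented names, not the loop driver's actual code): every submitted command lands on one list, and a single worker drains it strictly in order, so no two commands ever run against the backing file at once.

```c
#include <assert.h>
#include <pthread.h>

#define NCMDS 8

static int queue[NCMDS];      /* submitted command ids, in order        */
static int completed[NCMDS];  /* ids in the order the worker ran them   */
static int nqueued;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t more = PTHREAD_COND_INITIALIZER;

/* Single worker thread: the analogue of one per-device worker
 * draining queued loop commands first-in, first-out. */
static void *worker(void *arg)
{
	(void)arg;
	for (int i = 0; i < NCMDS; i++) {
		pthread_mutex_lock(&lock);
		while (nqueued <= i)
			pthread_cond_wait(&more, &lock);
		completed[i] = queue[i];   /* "handle" command i */
		pthread_mutex_unlock(&lock);
	}
	return NULL;
}

/* Submission path: the analogue of loop_queue_rq() queueing work
 * and waking the worker, instead of handling the request itself. */
static void submit(int cmd)
{
	pthread_mutex_lock(&lock);
	queue[nqueued++] = cmd;
	pthread_cond_signal(&more);
	pthread_mutex_unlock(&lock);
}

/* Returns 1 if completion order matched submission order. */
int run_serialized(void)
{
	pthread_t t;

	nqueued = 0;
	if (pthread_create(&t, NULL, worker, NULL) != 0)
		return 0;
	for (int i = 0; i < NCMDS; i++)
		submit(i);
	pthread_join(t, NULL);
	for (int i = 0; i < NCMDS; i++)
		if (completed[i] != i)
			return 0;
	return 1;
}
```

With concurrent submission, by contrast, each command would be handed to its own worker and could reach the backing file in any interleaving.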

>
> I tried few experiment without much success so far. One of the
> thing I suspected was "requests are now submitted to backend
> file/device concurrently so tried to move them under lo->lo_lock
> so that they get serialized. Also moved the blk_mq_start_request()
> inside the actual work like patch below. But it didn't help. Thought
> of reporting the issue to get more ideas on what could be going
> wrong. Thanks for help in advance !!
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 39a83c2..22713b2 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -1480,20 +1480,17 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
> const struct blk_mq_queue_data *bd)
> {
> struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
> + struct loop_device *lo = cmd->rq->q->queuedata;
>
> - blk_mq_start_request(bd->rq);
> -
> + spin_lock_irq(&lo->lo_lock);
> if (cmd->rq->cmd_flags & REQ_WRITE) {
> - struct loop_device *lo = cmd->rq->q->queuedata;
> bool need_sched = true;
>
> - spin_lock_irq(&lo->lo_lock);
> if (lo->write_started)
> need_sched = false;
> else
> lo->write_started = true;
> list_add_tail(&cmd->list, &lo->write_cmd_head);
> - spin_unlock_irq(&lo->lo_lock);
>
> if (need_sched)
> queue_work(loop_wq, &lo->write_work);
> @@ -1501,6 +1498,7 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
> queue_work(loop_wq, &cmd->read_work);
> }
>
> + spin_unlock_irq(&lo->lo_lock);
> return BLK_MQ_RQ_QUEUE_OK;
> }
>
> @@ -1517,6 +1515,8 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
> if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY))
> goto failed;
>
> + blk_mq_start_request(cmd->rq);
> +
> ret = 0;
> __rq_for_each_bio(bio, cmd->rq)
> ret |= loop_handle_bio(lo, bio);
> --

I don't see that the above change is necessary.

Thanks,
Ming

2015-05-18 18:08:01

by Santosh Shilimkar

Subject: Re: [Regression] Guest fs corruption with 'block: loop: improve performance via blk-mq'

On 5/17/2015 6:26 PM, Ming Lei wrote:
> Hi Santosh,
>
> Thanks for your report!
>
> On Sun, May 17, 2015 at 4:13 AM, santosh shilimkar
> <[email protected]> wrote:
>> Hi Ming Lei, Jens,
>>
>> While doing few tests with recent kernels with Xen Server,
>> we saw guests(DOMU) disk image getting corrupted while booting it.
>> Strangely the issue is seen so far only with disk image over ocfs2
>> volume. If the same image kept on the EXT3/4 drive, no corruption
>> is observed. The issue is easily reproducible. You see the flurry
>> of errors while guest is mounting the file systems.
>>
>> After doing some debug and bisects, we zeroed down the issue with
>> commit "b5dd2f6 block: loop: improve performance via blk-mq". With
>> that commit reverted the corruption goes away.
>>
>> Some more details on the test setup:
>> 1. OVM(XEN) Server kernel(DOM0) upgraded to more recent kernel
>> which includes commit b5dd2f6. Boot the Server.
>> 2. On DOM0 file system create a ocfs2 volume
>> 3. Keep the Guest(VM) disk image on ocfs2 volume.
>> 4. Boot guest image. (xm create vm.cfg)
>
> I am not familiar with xen, so is the image accessed via
> loop block inside of guest VM? Is he loop block created
> in DOM0 or guest VM?
>
In the guest. The guest disk image is represented as a file by a
loop device.

>> 5. Observe the VM boot console log. VM itself use the EXT3 fs.
>> You will see errors like below and after this boot, that file
>> system/disk-image gets corrupted and mostly won't boot next time.
>
> OK, that means the image is corrupted by VM booting.
>
Right

[...]

>>
>> From the debug of the actual data on the disk vs what is read by
>> the guest VM, we suspect the *reads* are actually not going all
>> the way to disk and possibly returning the wrong data. Because
>> the actual data on ocfs2 volume at those locations seems
>> to be non-zero where as the guest seems to be read it as zero.
>
> Two big changes in the patchset are: 1) use blk-mq request based IO;
> 2) submit I/O concurrently(write vs. write is still serialized)
>
> Could you apply the patch in below link to see if it can fix the issue?
> BTW, this patch only removes concurrent submission.
>
> http://marc.info/?t=143093223200004&r=1&w=2
>
What kernel is this patch generated against? It doesn't apply
against v4.0. Does it need the AIO/DIO conversion patches as well?
Do you have the dependent patch set? I can't apply it against v4.0.

Regards,
Santosh

2015-05-18 23:14:11

by Santosh Shilimkar

Subject: Re: [Regression] Guest fs corruption with 'block: loop: improve performance via blk-mq'

On 5/18/2015 11:07 AM, santosh shilimkar wrote:
> On 5/17/2015 6:26 PM, Ming Lei wrote:
>> Hi Santosh,
>>
>> Thanks for your report!
>>
>> On Sun, May 17, 2015 at 4:13 AM, santosh shilimkar
>> <[email protected]> wrote:
>>> Hi Ming Lei, Jens,
>>>
>>> While doing few tests with recent kernels with Xen Server,
>>> we saw guests(DOMU) disk image getting corrupted while booting it.
>>> Strangely the issue is seen so far only with disk image over ocfs2
>>> volume. If the same image kept on the EXT3/4 drive, no corruption
>>> is observed. The issue is easily reproducible. You see the flurry
>>> of errors while guest is mounting the file systems.
>>>
>>> After doing some debug and bisects, we zeroed down the issue with
>>> commit "b5dd2f6 block: loop: improve performance via blk-mq". With
>>> that commit reverted the corruption goes away.
>>>
>>> Some more details on the test setup:
>>> 1. OVM(XEN) Server kernel(DOM0) upgraded to more recent kernel
>>> which includes commit b5dd2f6. Boot the Server.
>>> 2. On DOM0 file system create a ocfs2 volume
>>> 3. Keep the Guest(VM) disk image on ocfs2 volume.
>>> 4. Boot guest image. (xm create vm.cfg)
>>
>> I am not familiar with xen, so is the image accessed via
>> loop block inside of guest VM? Is he loop block created
>> in DOM0 or guest VM?
>>
> Guest. The Guest disk image is represented as a file by loop
> device.
>
>>> 5. Observe the VM boot console log. VM itself use the EXT3 fs.
>>> You will see errors like below and after this boot, that file
>>> system/disk-image gets corrupted and mostly won't boot next time.
>>
>> OK, that means the image is corrupted by VM booting.
>>
> Right
>
> [...]
>
>>>
>>> From the debug of the actual data on the disk vs what is read by
>>> the guest VM, we suspect the *reads* are actually not going all
>>> the way to disk and possibly returning the wrong data. Because
>>> the actual data on ocfs2 volume at those locations seems
>>> to be non-zero where as the guest seems to be read it as zero.
>>
>> Two big changes in the patchset are: 1) use blk-mq request based IO;
>> 2) submit I/O concurrently(write vs. write is still serialized)
>>
>> Could you apply the patch in below link to see if it can fix the issue?
>> BTW, this patch only removes concurrent submission.
>>
>> http://marc.info/?t=143093223200004&r=1&w=2
>>
> What kernel is this patch generated against ? It doesn't apply against
> v4.0. Does this need the AIO/DIO conversion patches as well. Do you
> have the dependent patch-set I can't apply it against v4.0.
>
Anyway, I created a patch (at the end of this email) against v4.0,
based on your patch, and tested it. The corruption is no longer
seen, so backing out the concurrent-submission changes from
commit b5dd2f6 does fix the issue. Let me know what your plan
for it is, since Linus' tree as well as v4.0 needs this fix.

Regards,
Santosh

[1]
---
drivers/block/loop.c | 89 +++++++++++++++++--------------------------------
drivers/block/loop.h | 9 ++---
2 files changed, 35 insertions(+), 63 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 773e964..8484b8a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -85,8 +85,6 @@ static DEFINE_MUTEX(loop_index_mutex);
static int max_part;
static int part_shift;

-static struct workqueue_struct *loop_wq;
-
/*
* Transfer functions
*/
@@ -720,6 +718,23 @@ static void loop_config_discard(struct loop_device *lo)
queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
}

+static void loop_unprepare_queue(struct loop_device *lo)
+{
+ flush_kthread_worker(&lo->worker);
+ kthread_stop(lo->worker_task);
+}
+
+static int loop_prepare_queue(struct loop_device *lo)
+{
+ init_kthread_worker(&lo->worker);
+ lo->worker_task = kthread_run(kthread_worker_fn,
+ &lo->worker, "loop%d", lo->lo_number);
+ if (IS_ERR(lo->worker_task))
+ return -ENOMEM;
+ set_user_nice(lo->worker_task, MIN_NICE);
+ return 0;
+}
+
static int loop_set_fd(struct loop_device *lo, fmode_t mode,
struct block_device *bdev, unsigned int arg)
{
@@ -778,6 +793,10 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
if ((loff_t)(sector_t)size != size)
goto out_putf;

+ error = loop_prepare_queue(lo);
+ if (error)
+ goto out_putf;
+
error = 0;

set_device_ro(bdev, (lo_flags & LO_FLAGS_READ_ONLY) != 0);
@@ -924,6 +943,8 @@ static int loop_clr_fd(struct loop_device *lo)
lo->lo_flags = 0;
if (!part_shift)
lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
+
+ loop_unprepare_queue(lo);
mutex_unlock(&lo->lo_ctl_mutex);
/*
* Need not hold lo_ctl_mutex to fput backing file.
@@ -1477,26 +1498,14 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
{
struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
+ struct loop_device *lo = cmd->rq->q->queuedata;

- blk_mq_start_request(bd->rq);
+ if (lo->lo_state != Lo_bound)
+ return -EIO;

- if (cmd->rq->cmd_flags & REQ_WRITE) {
- struct loop_device *lo = cmd->rq->q->queuedata;
- bool need_sched = true;
+ blk_mq_start_request(bd->rq);

- spin_lock_irq(&lo->lo_lock);
- if (lo->write_started)
- need_sched = false;
- else
- lo->write_started = true;
- list_add_tail(&cmd->list, &lo->write_cmd_head);
- spin_unlock_irq(&lo->lo_lock);
-
- if (need_sched)
- queue_work(loop_wq, &lo->write_work);
- } else {
- queue_work(loop_wq, &cmd->read_work);
- }
+ queue_kthread_work(&lo->worker, &cmd->work);

return BLK_MQ_RQ_QUEUE_OK;
}
@@ -1521,35 +1530,11 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
blk_mq_complete_request(cmd->rq);
}

-static void loop_queue_write_work(struct work_struct *work)
-{
- struct loop_device *lo =
- container_of(work, struct loop_device, write_work);
- LIST_HEAD(cmd_list);

- spin_lock_irq(&lo->lo_lock);
- repeat:
- list_splice_init(&lo->write_cmd_head, &cmd_list);
- spin_unlock_irq(&lo->lo_lock);
-
- while (!list_empty(&cmd_list)) {
- struct loop_cmd *cmd = list_first_entry(&cmd_list,
- struct loop_cmd, list);
- list_del_init(&cmd->list);
- loop_handle_cmd(cmd);
- }
-
- spin_lock_irq(&lo->lo_lock);
- if (!list_empty(&lo->write_cmd_head))
- goto repeat;
- lo->write_started = false;
- spin_unlock_irq(&lo->lo_lock);
-}
-
-static void loop_queue_read_work(struct work_struct *work)
+static void loop_queue_work(struct kthread_work *work)
{
struct loop_cmd *cmd =
- container_of(work, struct loop_cmd, read_work);
+ container_of(work, struct loop_cmd, work);

loop_handle_cmd(cmd);
}
@@ -1561,7 +1546,7 @@ static int loop_init_request(void *data, struct request *rq,
struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);

cmd->rq = rq;
- INIT_WORK(&cmd->read_work, loop_queue_read_work);
+ init_kthread_work(&cmd->work, loop_queue_work);

return 0;
}
@@ -1617,9 +1602,6 @@ static int loop_add(struct loop_device **l, int i)
}
lo->lo_queue->queuedata = lo;

- INIT_LIST_HEAD(&lo->write_cmd_head);
- INIT_WORK(&lo->write_work, loop_queue_write_work);
-
disk = lo->lo_disk = alloc_disk(1 << part_shift);
if (!disk)
goto out_free_queue;
@@ -1858,13 +1840,6 @@ static int __init loop_init(void)
goto misc_out;
}

- loop_wq = alloc_workqueue("kloopd",
- WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);
- if (!loop_wq) {
- err = -ENOMEM;
- goto misc_out;
- }
-
blk_register_region(MKDEV(LOOP_MAJOR, 0), range,
THIS_MODULE, loop_probe, NULL, NULL);

@@ -1902,8 +1877,6 @@ static void __exit loop_exit(void)
blk_unregister_region(MKDEV(LOOP_MAJOR, 0), range);
unregister_blkdev(LOOP_MAJOR, "loop");

- destroy_workqueue(loop_wq);
-
misc_deregister(&loop_misc);
}

diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 301c27f..54c6aa5 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -14,7 +14,7 @@
#include <linux/blk-mq.h>
#include <linux/spinlock.h>
#include <linux/mutex.h>
-#include <linux/workqueue.h>
+#include <linux/kthread.h>
#include <uapi/linux/loop.h>

/* Possible states of device */
@@ -54,11 +54,10 @@ struct loop_device {
gfp_t old_gfp_mask;

spinlock_t lo_lock;
- struct list_head write_cmd_head;
- struct work_struct write_work;
- bool write_started;
int lo_state;
struct mutex lo_ctl_mutex;
+ struct kthread_worker worker;
+ struct task_struct *worker_task;

struct request_queue *lo_queue;
struct blk_mq_tag_set tag_set;
@@ -66,7 +65,7 @@ struct loop_device {
};

struct loop_cmd {
- struct work_struct read_work;
+ struct kthread_work work;
struct request *rq;
struct list_head list;
};
--
1.7.1



2015-05-18 23:14:27

by Ming Lei

Subject: Re: [Regression] Guest fs corruption with 'block: loop: improve performance via blk-mq'

On Tue, May 19, 2015 at 2:07 AM, santosh shilimkar
<[email protected]> wrote:
> On 5/17/2015 6:26 PM, Ming Lei wrote:
>>
>> Hi Santosh,
>>
>> Thanks for your report!
>>
>> On Sun, May 17, 2015 at 4:13 AM, santosh shilimkar
>> <[email protected]> wrote:
>>>
>>> Hi Ming Lei, Jens,
>>>
>>> While doing few tests with recent kernels with Xen Server,
>>> we saw guests(DOMU) disk image getting corrupted while booting it.
>>> Strangely the issue is seen so far only with disk image over ocfs2
>>> volume. If the same image kept on the EXT3/4 drive, no corruption
>>> is observed. The issue is easily reproducible. You see the flurry
>>> of errors while guest is mounting the file systems.
>>>
>>> After doing some debug and bisects, we zeroed down the issue with
>>> commit "b5dd2f6 block: loop: improve performance via blk-mq". With
>>> that commit reverted the corruption goes away.
>>>
>>> Some more details on the test setup:
>>> 1. OVM(XEN) Server kernel(DOM0) upgraded to more recent kernel
>>> which includes commit b5dd2f6. Boot the Server.
>>> 2. On DOM0 file system create a ocfs2 volume
>>> 3. Keep the Guest(VM) disk image on ocfs2 volume.
>>> 4. Boot guest image. (xm create vm.cfg)
>>
>>
>> I am not familiar with xen, so is the image accessed via
>> loop block inside of guest VM? Is he loop block created
>> in DOM0 or guest VM?
>>
> Guest. The Guest disk image is represented as a file by loop
> device.
>
>>> 5. Observe the VM boot console log. VM itself use the EXT3 fs.
>>> You will see errors like below and after this boot, that file
>>> system/disk-image gets corrupted and mostly won't boot next time.
>>
>>
>> OK, that means the image is corrupted by VM booting.
>>
> Right
>
> [...]
>
>>>
>>> From the debug of the actual data on the disk vs what is read by
>>> the guest VM, we suspect the *reads* are actually not going all
>>> the way to disk and possibly returning the wrong data. Because
>>> the actual data on ocfs2 volume at those locations seems
>>> to be non-zero where as the guest seems to be read it as zero.
>>
>>
>> Two big changes in the patchset are: 1) use blk-mq request based IO;
>> 2) submit I/O concurrently(write vs. write is still serialized)
>>
>> Could you apply the patch in below link to see if it can fix the issue?
>> BTW, this patch only removes concurrent submission.
>>
>> http://marc.info/?t=143093223200004&r=1&w=2
>>
> What kernel is this patch generated against ? It doesn't apply against
> v4.0. Does this need the AIO/DIO conversion patches as well. Do you
> have the dependent patch-set I can't apply it against v4.0.

My fault, the patch is against the -next tree, but you just need
another two patches to apply this one:

http://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/commit/?h=for-next&id=f4aa4c7bbac6c4afdd4adccf90898c1a3685396d

http://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/commit/?h=for-next&id=4d4e41aef9429872ea3b105e83426941f7185ab6


Thanks,
Ming

2015-05-18 23:18:49

by Santosh Shilimkar

Subject: Re: [Regression] Guest fs corruption with 'block: loop: improve performance via blk-mq'

On 5/18/2015 4:14 PM, Ming Lei wrote:
> On Tue, May 19, 2015 at 2:07 AM, santosh shilimkar
> <[email protected]> wrote:
>> On 5/17/2015 6:26 PM, Ming Lei wrote:
>>>
>>> Hi Santosh,
>>>
>>> Thanks for your report!
>>>
>>> On Sun, May 17, 2015 at 4:13 AM, santosh shilimkar
>>> <[email protected]> wrote:
>>>>
>>>> Hi Ming Lei, Jens,
>>>>
>>>> While doing few tests with recent kernels with Xen Server,
>>>> we saw guests(DOMU) disk image getting corrupted while booting it.
>>>> Strangely the issue is seen so far only with disk image over ocfs2
>>>> volume. If the same image kept on the EXT3/4 drive, no corruption
>>>> is observed. The issue is easily reproducible. You see the flurry
>>>> of errors while guest is mounting the file systems.
>>>>
>>>> After doing some debug and bisects, we zeroed down the issue with
>>>> commit "b5dd2f6 block: loop: improve performance via blk-mq". With
>>>> that commit reverted the corruption goes away.
>>>>
>>>> Some more details on the test setup:
>>>> 1. OVM(XEN) Server kernel(DOM0) upgraded to more recent kernel
>>>> which includes commit b5dd2f6. Boot the Server.
>>>> 2. On DOM0 file system create a ocfs2 volume
>>>> 3. Keep the Guest(VM) disk image on ocfs2 volume.
>>>> 4. Boot guest image. (xm create vm.cfg)
>>>
>>>
>>> I am not familiar with xen, so is the image accessed via
>>> loop block inside of guest VM? Is he loop block created
>>> in DOM0 or guest VM?
>>>
>> Guest. The Guest disk image is represented as a file by loop
>> device.
>>
>>>> 5. Observe the VM boot console log. VM itself use the EXT3 fs.
>>>> You will see errors like below and after this boot, that file
>>>> system/disk-image gets corrupted and mostly won't boot next time.
>>>
>>>
>>> OK, that means the image is corrupted by VM booting.
>>>
>> Right
>>
>> [...]
>>
>>>>
>>>> From the debug of the actual data on the disk vs what is read by
>>>> the guest VM, we suspect the *reads* are actually not going all
>>>> the way to disk and possibly returning the wrong data. Because
>>>> the actual data on ocfs2 volume at those locations seems
>>>> to be non-zero where as the guest seems to be read it as zero.
>>>
>>>
>>> Two big changes in the patchset are: 1) use blk-mq request based IO;
>>> 2) submit I/O concurrently(write vs. write is still serialized)
>>>
>>> Could you apply the patch in below link to see if it can fix the issue?
>>> BTW, this patch only removes concurrent submission.
>>>
>>> http://marc.info/?t=143093223200004&r=1&w=2
>>>
>> What kernel is this patch generated against ? It doesn't apply against
>> v4.0. Does this need the AIO/DIO conversion patches as well. Do you
>> have the dependent patch-set I can't apply it against v4.0.
>
> My fault, the patch is against -next tree, but you just need another two
> patches for applying this one:
>
> http://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/commit/?h=for-next&id=f4aa4c7bbac6c4afdd4adccf90898c1a3685396d
>
> http://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/commit/?h=for-next&id=4d4e41aef9429872ea3b105e83426941f7185ab6
>
Our emails crossed. I ported that one patch alone and confirm
that it fixes the issue. That patch should then also go to v4.0
stable, along with the above two.

Regards,
Santosh

2015-05-18 23:25:36

by Ming Lei

Subject: Re: [Regression] Guest fs corruption with 'block: loop: improve performance via blk-mq'

On Tue, May 19, 2015 at 7:13 AM, santosh shilimkar
<[email protected]> wrote:
> On 5/18/2015 11:07 AM, santosh shilimkar wrote:
>>
>> On 5/17/2015 6:26 PM, Ming Lei wrote:
>>>
>>> Hi Santosh,
>>>
>>> Thanks for your report!
>>>
>>> On Sun, May 17, 2015 at 4:13 AM, santosh shilimkar
>>> <[email protected]> wrote:
>>>>
>>>> Hi Ming Lei, Jens,
>>>>
>>>> While doing few tests with recent kernels with Xen Server,
>>>> we saw guests(DOMU) disk image getting corrupted while booting it.
>>>> Strangely the issue is seen so far only with disk image over ocfs2
>>>> volume. If the same image kept on the EXT3/4 drive, no corruption
>>>> is observed. The issue is easily reproducible. You see the flurry
>>>> of errors while guest is mounting the file systems.
>>>>
>>>> After doing some debug and bisects, we zeroed down the issue with
>>>> commit "b5dd2f6 block: loop: improve performance via blk-mq". With
>>>> that commit reverted the corruption goes away.
>>>>
>>>> Some more details on the test setup:
>>>> 1. OVM(XEN) Server kernel(DOM0) upgraded to more recent kernel
>>>> which includes commit b5dd2f6. Boot the Server.
>>>> 2. On DOM0 file system create a ocfs2 volume
>>>> 3. Keep the Guest(VM) disk image on ocfs2 volume.
>>>> 4. Boot guest image. (xm create vm.cfg)
>>>
>>>
>>> I am not familiar with xen, so is the image accessed via
>>> loop block inside of guest VM? Is he loop block created
>>> in DOM0 or guest VM?
>>>
>> Guest. The Guest disk image is represented as a file by loop
>> device.
>>
>>>> 5. Observe the VM boot console log. VM itself use the EXT3 fs.
>>>> You will see errors like below and after this boot, that file
>>>> system/disk-image gets corrupted and mostly won't boot next time.
>>>
>>>
>>> OK, that means the image is corrupted by VM booting.
>>>
>> Right
>>
>> [...]
>>
>>>>
>>>> From the debug of the actual data on the disk vs what is read by
>>>> the guest VM, we suspect the *reads* are actually not going all
>>>> the way to disk and possibly returning the wrong data. Because
>>>> the actual data on ocfs2 volume at those locations seems
>>>> to be non-zero where as the guest seems to be read it as zero.
>>>
>>>
>>> Two big changes in the patchset are: 1) use blk-mq request-based IO;
>>> 2) submit I/O concurrently (write vs. write is still serialized).
>>>
>>> Could you apply the patch at the link below to see if it fixes the
>>> issue? BTW, this patch only removes the concurrent submission.
>>>
>>> http://marc.info/?t=143093223200004&r=1&w=2
>>>
>> What kernel is this patch generated against? It doesn't apply against
>> v4.0. Does it need the AIO/DIO conversion patches as well? Do you
>> have the dependent patch set? I can't apply it against v4.0.
>>
> Anyway, I created a patch (end of the email) against v4.0, based on
> your patch, and tested it. The corruption is no longer seen, so backing
> out the concurrent submission changes from commit b5dd2f6 does fix the
> issue. Let me know what your plan is with it, since Linus's tip as
> well as v4.0 need this fix.

If your issue is caused by concurrent IO submission, it might be an
issue in ocfs2. As you can see, there is no such problem with ext3/ext4.

And the single-thread patch was introduced for aio/dio support; it was
never meant to be a fix.

Thanks,
Ming

>
> Regards,
> Santosh
>
> [1]
> ---
> drivers/block/loop.c | 89 +++++++++++++++++--------------------------------
> drivers/block/loop.h |  9 ++---
> 2 files changed, 35 insertions(+), 63 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 773e964..8484b8a 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -85,8 +85,6 @@ static DEFINE_MUTEX(loop_index_mutex);
> static int max_part;
> static int part_shift;
>
> -static struct workqueue_struct *loop_wq;
> -
> /*
> * Transfer functions
> */
> @@ -720,6 +718,23 @@ static void loop_config_discard(struct loop_device *lo)
> queue_flag_set_unlocked(QUEUE_FLAG_DISCARD, q);
> }
>
> +static void loop_unprepare_queue(struct loop_device *lo)
> +{
> + flush_kthread_worker(&lo->worker);
> + kthread_stop(lo->worker_task);
> +}
> +
> +static int loop_prepare_queue(struct loop_device *lo)
> +{
> + init_kthread_worker(&lo->worker);
> + lo->worker_task = kthread_run(kthread_worker_fn,
> + &lo->worker, "loop%d", lo->lo_number);
> + if (IS_ERR(lo->worker_task))
> + return -ENOMEM;
> + set_user_nice(lo->worker_task, MIN_NICE);
> + return 0;
> +}
> +
> static int loop_set_fd(struct loop_device *lo, fmode_t mode,
> struct block_device *bdev, unsigned int arg)
> {
> @@ -778,6 +793,10 @@ static int loop_set_fd(struct loop_device *lo, fmode_t
> mode,
> if ((loff_t)(sector_t)size != size)
> goto out_putf;
>
> + error = loop_prepare_queue(lo);
> + if (error)
> + goto out_putf;
> +
> error = 0;
>
> set_device_ro(bdev, (lo_flags & LO_FLAGS_READ_ONLY) != 0);
> @@ -924,6 +943,8 @@ static int loop_clr_fd(struct loop_device *lo)
> lo->lo_flags = 0;
> if (!part_shift)
> lo->lo_disk->flags |= GENHD_FL_NO_PART_SCAN;
> +
> + loop_unprepare_queue(lo);
> mutex_unlock(&lo->lo_ctl_mutex);
> /*
> * Need not hold lo_ctl_mutex to fput backing file.
> @@ -1477,26 +1498,14 @@ static int loop_queue_rq(struct blk_mq_hw_ctx *hctx,
> const struct blk_mq_queue_data *bd)
> {
> struct loop_cmd *cmd = blk_mq_rq_to_pdu(bd->rq);
> + struct loop_device *lo = cmd->rq->q->queuedata;
>
> - blk_mq_start_request(bd->rq);
> + if (lo->lo_state != Lo_bound)
> + return -EIO;
>
> - if (cmd->rq->cmd_flags & REQ_WRITE) {
> - struct loop_device *lo = cmd->rq->q->queuedata;
> - bool need_sched = true;
> + blk_mq_start_request(bd->rq);
>
> - spin_lock_irq(&lo->lo_lock);
> - if (lo->write_started)
> - need_sched = false;
> - else
> - lo->write_started = true;
> - list_add_tail(&cmd->list, &lo->write_cmd_head);
> - spin_unlock_irq(&lo->lo_lock);
> -
> - if (need_sched)
> - queue_work(loop_wq, &lo->write_work);
> - } else {
> - queue_work(loop_wq, &cmd->read_work);
> - }
> + queue_kthread_work(&lo->worker, &cmd->work);
>
> return BLK_MQ_RQ_QUEUE_OK;
> }
> @@ -1521,35 +1530,11 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
> blk_mq_complete_request(cmd->rq);
> }
>
> -static void loop_queue_write_work(struct work_struct *work)
> -{
> - struct loop_device *lo =
> - container_of(work, struct loop_device, write_work);
> - LIST_HEAD(cmd_list);
>
> - spin_lock_irq(&lo->lo_lock);
> - repeat:
> - list_splice_init(&lo->write_cmd_head, &cmd_list);
> - spin_unlock_irq(&lo->lo_lock);
> -
> - while (!list_empty(&cmd_list)) {
> - struct loop_cmd *cmd = list_first_entry(&cmd_list,
> - struct loop_cmd, list);
> - list_del_init(&cmd->list);
> - loop_handle_cmd(cmd);
> - }
> -
> - spin_lock_irq(&lo->lo_lock);
> - if (!list_empty(&lo->write_cmd_head))
> - goto repeat;
> - lo->write_started = false;
> - spin_unlock_irq(&lo->lo_lock);
> -}
> -
> -static void loop_queue_read_work(struct work_struct *work)
> +static void loop_queue_work(struct kthread_work *work)
> {
> struct loop_cmd *cmd =
> - container_of(work, struct loop_cmd, read_work);
> + container_of(work, struct loop_cmd, work);
>
> loop_handle_cmd(cmd);
> }
> @@ -1561,7 +1546,7 @@ static int loop_init_request(void *data, struct
> request *rq,
> struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
>
> cmd->rq = rq;
> - INIT_WORK(&cmd->read_work, loop_queue_read_work);
> + init_kthread_work(&cmd->work, loop_queue_work);
>
> return 0;
> }
> @@ -1617,9 +1602,6 @@ static int loop_add(struct loop_device **l, int i)
> }
> lo->lo_queue->queuedata = lo;
>
> - INIT_LIST_HEAD(&lo->write_cmd_head);
> - INIT_WORK(&lo->write_work, loop_queue_write_work);
> -
> disk = lo->lo_disk = alloc_disk(1 << part_shift);
> if (!disk)
> goto out_free_queue;
> @@ -1858,13 +1840,6 @@ static int __init loop_init(void)
> goto misc_out;
> }
>
> - loop_wq = alloc_workqueue("kloopd",
> - WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);
> - if (!loop_wq) {
> - err = -ENOMEM;
> - goto misc_out;
> - }
> -
> blk_register_region(MKDEV(LOOP_MAJOR, 0), range,
> THIS_MODULE, loop_probe, NULL, NULL);
>
> @@ -1902,8 +1877,6 @@ static void __exit loop_exit(void)
> blk_unregister_region(MKDEV(LOOP_MAJOR, 0), range);
> unregister_blkdev(LOOP_MAJOR, "loop");
>
> - destroy_workqueue(loop_wq);
> -
> misc_deregister(&loop_misc);
> }
>
> diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> index 301c27f..54c6aa5 100644
> --- a/drivers/block/loop.h
> +++ b/drivers/block/loop.h
> @@ -14,7 +14,7 @@
> #include <linux/blk-mq.h>
> #include <linux/spinlock.h>
> #include <linux/mutex.h>
> -#include <linux/workqueue.h>
> +#include <linux/kthread.h>
> #include <uapi/linux/loop.h>
>
> /* Possible states of device */
> @@ -54,11 +54,10 @@ struct loop_device {
> gfp_t old_gfp_mask;
>
> spinlock_t lo_lock;
> - struct list_head write_cmd_head;
> - struct work_struct write_work;
> - bool write_started;
> int lo_state;
> struct mutex lo_ctl_mutex;
> + struct kthread_worker worker;
> + struct task_struct *worker_task;
>
> struct request_queue *lo_queue;
> struct blk_mq_tag_set tag_set;
> @@ -66,7 +65,7 @@ struct loop_device {
> };
>
> struct loop_cmd {
> - struct work_struct read_work;
> + struct kthread_work work;
> struct request *rq;
> struct list_head list;
> };
> --
> 1.7.1
>

2015-05-18 23:39:05

by Santosh Shilimkar

Subject: Re: [Regression] Guest fs corruption with 'block: loop: improve performance via blk-mq'

On 5/18/2015 4:25 PM, Ming Lei wrote:
> On Tue, May 19, 2015 at 7:13 AM, santosh shilimkar
> <[email protected]> wrote:
>> [...]
>
> If your issue is caused by concurrent IO submission, it might be an
> issue in ocfs2. As you can see, there is no such problem with ext3/ext4.
>
As we speak, I have learned of another regression, this time with XFS,
and based on the symptoms I am quite confident it is the same issue.
I will get confirmation by tomorrow on whether the patch fixes it.

> And the single-thread patch was introduced for aio/dio support; it was
> never meant to be a fix.
>

Well, before the loop blk-mq conversion in commit b5dd2f6 the loop
driver was single threaded, and as you can see the issue appeared with
that commit. This experiment also shows that the work-queue split
changes are problematic, so I am not sure why you say it shouldn't be
a fix patch.

I am not denying that the issue could be in OCFS2 or XFS (not proven
yet), but they were happy before that commit ;-)

Regards,
Santosh

2015-05-19 00:47:29

by Ming Lei

Subject: Re: [Regression] Guest fs corruption with 'block: loop: improve performance via blk-mq'

On Tue, May 19, 2015 at 7:38 AM, santosh shilimkar
<[email protected]> wrote:
> On 5/18/2015 4:25 PM, Ming Lei wrote:
>>
>> On Tue, May 19, 2015 at 7:13 AM, santosh shilimkar
>> <[email protected]> wrote:
>>>
>>> [...]
>> And the single-thread patch was introduced for aio/dio support; it was
>> never meant to be a fix.
>>
>
> Well, before the loop blk-mq conversion in commit b5dd2f6 the loop
> driver was single threaded, and as you can see the issue appeared with
> that commit. This experiment also shows that the work-queue split
> changes are problematic, so I am not sure why you say it shouldn't be
> a fix patch.

Because concurrent I/O submission is allowed and often used, I suspect
your issue could also be reproduced with concurrent userspace IO, for
example by simulating it with fio.

Or if you can see what is wrong with the commit, please let me know.
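A test along the lines suggested here might look like the fio job file below. This is only a sketch: the directory, file size, and tunables are assumptions to adapt, not a known reproducer, and it assumes the ocfs2 volume is mounted where the loop-backed image lives.

```ini
[global]
; assumed mount point of the ocfs2 volume backing the guest image
directory=/mnt/ocfs2
size=256m
ioengine=libaio
direct=1
bs=4k
iodepth=32
; fail the run on any read-back mismatch
verify=crc32c
verify_fatal=1

[concurrent-rw]
rw=randrw
numjobs=4
```

If concurrent verified I/O from several jobs corrupts data on ocfs2 but not on ext3/ext4, that would point at the filesystem rather than the loop driver.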

>
> I am not denying that the issue could be in OCFS2 or XFS (not proven
> yet), but they were happy before that commit ;-)

That doesn't mean anyone has run similar tests over OCFS2 before.


Thanks,
Ming

>
> Regards,
> Santosh
>

2015-05-19 20:00:07

by Jens Axboe

Subject: Re: [Regression] Guest fs corruption with 'block: loop: improve performance via blk-mq'

On 05/18/2015 05:38 PM, santosh shilimkar wrote:
> On 5/18/2015 4:25 PM, Ming Lei wrote:
>> On Tue, May 19, 2015 at 7:13 AM, santosh shilimkar
>> <[email protected]> wrote:
>> [...]
>
> Well, before the loop blk-mq conversion in commit b5dd2f6 the loop
> driver was single threaded, and as you can see the issue appeared with
> that commit. This experiment also shows that the work-queue split
> changes are problematic, so I am not sure why you say it shouldn't be
> a fix patch.

There should be no issue with having concurrent submissions. If
something relies on serialization of some sort, then that is broken and
should be fixed up. That's not a problem with the loop driver. That's
why it's not a fix.

--
Jens Axboe

2015-05-19 21:52:43

by Santosh Shilimkar

Subject: Re: [Regression] Guest fs corruption with 'block: loop: improve performance via blk-mq'

On 5/19/2015 12:59 PM, Jens Axboe wrote:
> On 05/18/2015 05:38 PM, santosh shilimkar wrote:
>> On 5/18/2015 4:25 PM, Ming Lei wrote:
>>> On Tue, May 19, 2015 at 7:13 AM, santosh shilimkar
>>> <[email protected]> wrote:
>> [...]
>
> There should be no issue with having concurrent submissions. If
> something relies on serialization of some sort, then that is broken and
> should be fixed up. That's not a problem with the loop driver. That's
> why it's not a fix.
>
Fair enough; I agree with you both in principle.

The bottom line is that we now have at least two file systems (ocfs2 and
xfs) showing corruption when used over a loop device from commit b5dd2f6
onwards. As you can see from the thread, Ming Lei suggested I try out
the patch[1] he has cooked up as part of the dio/aio loop conversion,
and that did help in both the ocfs2 and the xfs case.

I will keep looking into it, but just in case someone else stumbles on
it, at least they now know about the problem.

Regards,
Santosh
[1] http://marc.info/?t=143093223200004&r=1&w=2